Interactive skill creation and eval-driven optimization. Triggers: create a skill, make a skill, new skill, scaffold skill, optimize skill, run evals, improve skill. Uses AskUserQuestion for interview; WebSearch for research; Bash for eval execution. Outputs: complete skill directory with SKILL.md, tile.json, evals, and repo integration.
93
Does it follow best practices? 94%
Impact: 91% (1.26x), average score across 3 eval scenarios
Passed: no known issues
{
  "context": "Tests whether the agent follows the prescribed optimization workflow: prioritizing negative deltas first, then 0% criteria, then low-delta scenarios; producing specific actionable edits rather than vague suggestions; and appending to the benchmark log rather than overwriting it. The existing benchmark-log.md content must be preserved.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Benchmark log preserved",
      "description": "The existing entries in benchmark-log.md are present unchanged in the output file; prior run data is NOT removed, truncated, or overwritten",
      "max_score": 14
    },
    {
      "name": "New entry appended",
      "description": "A new dated entry for this eval run is appended AFTER the existing entries in benchmark-log.md (not inserted before them or replacing them)",
      "max_score": 8
    },
    {
      "name": "Negative delta addressed first",
      "description": "The optimization proposals address the scenario/criterion that has a negative delta (scenario-2 / 'Changelog check') BEFORE addressing lower-priority issues",
      "max_score": 14
    },
    {
      "name": "Zero-percent criteria addressed",
      "description": "The optimization proposals include a fix for the criterion that scored 0% with-skill (the 'Security patterns' criterion in scenario-1)",
      "max_score": 7
    },
    {
      "name": "Proposals are specific edits",
      "description": "Each optimization proposal contains specific wording to add, remove, or restructure in SKILL.md, and does NOT consist only of phrases like 'add more examples', 'clarify this section', or 'improve coverage'",
      "max_score": 14
    },
    {
      "name": "No vague direction",
      "description": "None of the optimization proposals use vague direction phrases without a concrete edit: 'add more examples', 'make it clearer', 'consider adding', 'try to include' are absent from proposals",
      "max_score": 10
    },
    {
      "name": "Priority order followed",
      "description": "The list of proposed edits is ordered: negative-delta issues first, then 0% criteria, then lowest-delta scenarios, not in arbitrary or alphabetical order",
      "max_score": 10
    },
    {
      "name": "Eval read-only respected",
      "description": "The output does NOT include a modified SKILL.md or tile.json; optimization proposals are documented separately (e.g. in an analysis file), not applied directly to skill source files",
      "max_score": 14
    },
    {
      "name": "Result schema present",
      "description": "The analysis output includes a structured summary with at minimum: scenario names, baseline scores, with-skill scores, and deltas",
      "max_score": 4
    },
    {
      "name": "Readout table format",
      "description": "The benchmark-log.md new entry includes a Markdown table with columns for scenario, baseline, with-skill, and delta, matching the format of existing entries in the log",
      "max_score": 5
    }
  ]
}
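
The priority order the checklist describes (negative deltas first, then criteria at 0% with-skill, then the lowest remaining deltas) can be sketched as a sort key. The Python below is a minimal illustration, not the skill's actual implementation: `CriterionResult`, `priority_key`, `prioritize`, and `weighted_total` are hypothetical names, and all scores are made-up sample values. It also assumes a `weighted_checklist` is scored as points earned over the sum of `max_score`, which the source does not state.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One eval criterion (hypothetical shape; not defined by the source)."""
    scenario: str
    criterion: str
    baseline: float    # percent score without the skill
    with_skill: float  # percent score with the skill

    @property
    def delta(self) -> float:
        return self.with_skill - self.baseline


def priority_key(result: CriterionResult) -> tuple:
    """Tier 0: negative delta. Tier 1: 0% with-skill. Tier 2: everything else."""
    if result.delta < 0:
        tier = 0
    elif result.with_skill == 0:
        tier = 1
    else:
        tier = 2
    # Within a tier, the smallest delta comes first.
    return (tier, result.delta)


def prioritize(results: list) -> list:
    """Return results in the order the optimization proposals should address them."""
    return sorted(results, key=priority_key)


def weighted_total(awarded: dict, checklist: list) -> float:
    """Assumed scoring rule: points earned (capped at max_score) as a percent of the maximum."""
    earned = sum(min(awarded.get(item["name"], 0), item["max_score"]) for item in checklist)
    possible = sum(item["max_score"] for item in checklist)
    return 100 * earned / possible


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    sample = [
        CriterionResult("scenario-3", "Hypothetical criterion A", 40, 90),
        CriterionResult("scenario-1", "Security patterns", 0, 0),
        CriterionResult("scenario-3", "Hypothetical criterion B", 50, 55),
        CriterionResult("scenario-2", "Changelog check", 80, 60),
    ]
    for r in prioritize(sample):
        print(f"{r.scenario} / {r.criterion}: delta {r.delta:+.0f}")
```

Encoding the tier as the first element of the sort key keeps the three rules in one place, so the proposal list comes out in checklist order regardless of the order in which results were collected.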