Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
94%
Does it follow best practices?
Impact
88%
1.07x
Average score across 24 eval scenarios
Passed
No known issues
{
  "context": "Tests whether the agent correctly identifies a regression as caused by ambiguous or contradictory content in the tile (rather than missing content), and proposes a removal or clarification fix rather than adding more emphasis on the failing behavior.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Contradicting clause identified",
      "description": "The analysis identifies the 'documentation-only or clearly trivial... you may skip the test run' sentence (or equivalent 'at your discretion' exception) as the specific cause of the regression",
      "max_score": 22
    },
    {
      "name": "Contradiction mechanism explained",
      "description": "The analysis explains that this clause contradicts 'Always run the full test suite' and gives agents a justification to skip tests, rather than only saying that 'it's confusing'",
      "max_score": 18
    },
    {
      "name": "Remove/clarify approach taken",
      "description": "The proposed fix removes or substantially rewrites the exception clause, NOT just adding another 'always run tests' statement elsewhere to compensate",
      "max_score": 22
    },
    {
      "name": "Specific text targeted",
      "description": "The fix targets the specific problematic sentence(s), not a broad rewrite of the entire Pre-Review Checklist section",
      "max_score": 18
    },
    {
      "name": "No compensating additions",
      "description": "The updated SKILL.md does NOT add a new reinforcing instruction (e.g., 'Note: never skip tests even for trivial changes') alongside or instead of removing the contradictory clause",
      "max_score": 10
    },
    {
      "name": "Other sections preserved",
      "description": "The Submitting for Review and Responding to Feedback sections are not modified in the updated SKILL.md",
      "max_score": 7
    },
    {
      "name": "Pre-review list intact",
      "description": "The numbered list structure of the Pre-Review Checklist is maintained in the updated file; the fix does not replace the entire checklist with prose",
      "max_score": 3
    }
  ]
}
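The criteria file above has type `weighted_checklist`, and its `max_score` values add up to 100. A minimal sketch of how such a rubric might be turned into a percentage follows. The source does not show the scoring rule, so the formula (points earned divided by points available) is an assumption, as are the function name `score_checklist`, the clamping behavior, and the sample `awarded` values.

```python
import json

# Excerpt of the rubric above, reduced to the two fields this sketch uses.
RUBRIC_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Contradicting clause identified", "max_score": 22},
    {"name": "Contradiction mechanism explained", "max_score": 18},
    {"name": "Remove/clarify approach taken", "max_score": 22},
    {"name": "Specific text targeted", "max_score": 18},
    {"name": "No compensating additions", "max_score": 10},
    {"name": "Other sections preserved", "max_score": 7},
    {"name": "Pre-review list intact", "max_score": 3}
  ]
}
"""


def score_checklist(rubric, awarded):
    """Return points earned as a fraction of points available.

    Assumptions (not from the source): an item missing from `awarded`
    earns 0, and awarded points are clamped to [0, max_score].
    """
    total = sum(item["max_score"] for item in rubric["checklist"])
    earned = 0
    for item in rubric["checklist"]:
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, item["max_score"]))
    return earned / total


rubric = json.loads(RUBRIC_JSON)

# Hypothetical grading: the agent removed the contradictory clause but also
# added a compensating "never skip tests" note, so that item earns 0.
awarded = {
    "Contradicting clause identified": 22,
    "Contradiction mechanism explained": 18,
    "Remove/clarify approach taken": 22,
    "Specific text targeted": 18,
    "No compensating additions": 0,
    "Other sections preserved": 7,
    "Pre-review list intact": 3,
}

result = score_checklist(rubric, awarded)
print(f"{result:.0%}")  # prints 90%
```

Because the weights sum to 100, each `max_score` reads directly as a percentage of the scenario: in this hypothetical grading, the compensating addition alone costs 10 points.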