Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
91%
Does it follow best practices?
Impact
92%
1.10x
Average score across 25 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent applies the correct triage logic from the optimization cycle skill: identifying regressions as highest priority, recognizing that high baselines suggest scenarios may be too easy (consider regenerating), and recommending the right corrective actions before proceeding to tile content edits.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Regression identified",
"description": "Report identifies the security-review scenario as a regression (with_context 79% < baseline 84%, delta -5)",
"max_score": 15
},
{
"name": "Regression is highest priority",
"description": "Report explicitly labels the regression as the highest-priority issue, using a phrase such as 'highest priority' or 'most urgent', or equivalent strong language that makes it the top item to address",
"max_score": 18
},
{
"name": "High baseline warning present",
"description": "Report notes that all three scenarios have high baseline scores (84%, 87%, 82% — all above 80%), suggesting the scenarios may be too easy and agents can solve them without the tile",
"max_score": 18
},
{
"name": "Scenario regeneration suggested",
"description": "Report recommends considering regenerating harder scenarios (before or alongside tile content fixes) as a response to the high baseline scores",
"max_score": 15
},
{
"name": "Tile is actively hurting",
"description": "Report states or implies that the tile is actively hurting agent performance on security-review (not just failing to help) — because with-tile scores are lower than without-tile scores on that scenario",
"max_score": 12
},
{
"name": "Per-criterion regression analysis",
"description": "Report identifies at least two specific criteria within security-review that regressed (e.g., flags_injection_risks dropped from 9→7, checks_auth_bypass 8→6, references_cwe 7→5)",
"max_score": 12
},
{
"name": "Correct prioritization order",
"description": "Recommendations are ordered with regression fix BEFORE any suggestion to improve the tile for the other two scenarios — NOT presented as equally weighted items",
"max_score": 10
}
]
}
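The triage order that this checklist rewards can be sketched in a few lines of Python. This is an illustrative sketch, not the skill's actual implementation: `ScenarioResult`, `triage`, and `HIGH_BASELINE` are hypothetical names, and the 80% cutoff is an assumption taken from the "all above 80%" wording in the criteria. Only the security-review numbers (84% baseline, 79% with context) and the other two baselines (87%, 82%) come from the checklist. The other two scenario names and their with-context scores are placeholders.

```python
from dataclasses import dataclass

# Assumption: the cutoff implied by "all above 80%" in the checklist.
HIGH_BASELINE = 80


@dataclass
class ScenarioResult:
    name: str
    baseline: int      # score without the tile, in percent
    with_context: int  # score with the tile, in percent

    @property
    def delta(self) -> int:
        # Negative delta means the tile made the agent worse.
        return self.with_context - self.baseline


def triage(results):
    """Return findings in priority order: regressions first, then the
    high-baseline warning. Tile content edits would come after both."""
    findings = []
    # Largest drop first, so the worst regression leads the report.
    regressions = sorted((r for r in results if r.delta < 0), key=lambda r: r.delta)
    for r in regressions:
        findings.append(
            f"HIGHEST PRIORITY: {r.name} regressed "
            f"({r.with_context}% < {r.baseline}%, delta {r.delta}); "
            "the tile is actively hurting here"
        )
    if results and all(r.baseline > HIGH_BASELINE for r in results):
        findings.append(
            f"WARNING: every baseline is above {HIGH_BASELINE}%; "
            "scenarios may be too easy, consider regenerating harder ones"
        )
    return findings


results = [
    ScenarioResult("security-review", baseline=84, with_context=79),
    # Hypothetical names and with_context values; only the baselines are from the checklist.
    ScenarioResult("scenario-b", baseline=87, with_context=90),
    ScenarioResult("scenario-c", baseline=82, with_context=85),
]

for finding in triage(results):
    print(finding)
```

Returning the findings as an ordered list keeps the regression ahead of every other recommendation, which is what the "Correct prioritization order" criterion checks.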