Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
86
91%
Does it follow best practices?
Impact
86%
1.22x
Average score across 29 eval scenarios
Advisory
Suggest reviewing before use
I rewrote the descriptions on two of my skills this morning, trying to make them clearer and add some natural trigger phrasings. The new wording feels better to me, but I'm worried I might have accidentally narrowed the trigger surface and broken activation for requests that used to route correctly.
I don't want to run a full content eval just to find this out — that's hours of agent time. I just want to know: do the new descriptions still pick up the right user requests?
What's the fastest way to verify this without rerunning the slow scored evals?
Tell me the right command and workflow for verifying that description edits didn't break routing. Include how to compare against the prior state if possible.
skills
compare-skill-model-performance
optimize-skill-instructions
references
optimize-skill-performance
optimize-skill-performance-and-instructions