Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
86
91%
Does it follow best practices?
Impact
86%
1.22x
Average score across 29 eval scenarios
Advisory
Suggest reviewing before use
I'm about to publish my tile and I have 4 skills inside it. Before I commit to a slow content eval that will take a few hours, I want a fast sanity check: are my skills actually reachable from the kinds of questions real users would ask?
In other words — if someone phrases a request the way I expect them to, will Claude pick the right skill out of my tile, or will it just answer from scratch and ignore the tile entirely?
What's the fastest way to run this sanity check? Tell me which command to use, how to read the output, and what counts as "passing" the check versus needing fixes before I move on to the full eval.
evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10
scenario-11
scenario-12
scenario-13
scenario-14
scenario-15
scenario-16
scenario-17
scenario-18
scenario-19
scenario-20
scenario-21
scenario-22
scenario-23
scenario-24
scenario-25
scenario-26
scenario-27
scenario-28
scenario-29
skills
compare-skill-model-performance
optimize-skill-instructions
references
optimize-skill-performance
optimize-skill-performance-and-instructions