Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
> **Example dashboard summary:** score 88 · best practices 94% · impact 1.07x · 88% average score across 24 eval scenarios · passed · no known issues
For a first run, recommend keeping it simple:

> "For a first run, I recommend just using `claude:claude-sonnet-4-6` to keep eval time manageable (~10–15 minutes per scenario). Once you've validated the scenarios are good, you can add more agents to compare. Want to go with the default, or test multiple agents now?
>
> Available agents:
>
> | Agent | Models |
> | --- | --- |
> | `claude` | `claude-sonnet-4-6` (default), `claude-opus-4-6`, `claude-sonnet-4-5`, `claude-opus-4-5`, `claude-haiku-4-5` |
> | `cursor` | `auto`, `composer-1.5` |
>
> Note: each additional agent multiplies the eval run time and cost."
Build the `--agent` flags based on their choice. For multi-agent runs, pass each agent as a separate `--agent` flag (e.g. `--agent=claude:claude-sonnet-4-6 --agent=cursor:auto`):

```
tessl eval run <tile-path> \
  --agent=<agent1:model1> \
  [--agent=<agent2:model2>]
```

Note the eval run URL from the output and share it with the user so they can optionally watch progress in the browser.
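The flag-building step can be sketched in shell; the tile path `./my-tile` and the agent pair below are placeholder choices for illustration, not defaults:

```shell
# Assemble one --agent flag per agent:model pair the user chose.
# (agents and the tile path are placeholders for this sketch.)
agents="claude:claude-sonnet-4-6 cursor:auto"
flags=""
for a in $agents; do
  flags="$flags --agent=$a"
done
echo "tessl eval run ./my-tile$flags"
# → tessl eval run ./my-tile --agent=claude:claude-sonnet-4-6 --agent=cursor:auto
```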
Check the latest run's status with:

```
tessl eval list --mine --limit 1
```

Eval runs take ~10–15 minutes per scenario per agent. Each scenario runs twice (baseline without context + with context). Update the user periodically:
"Evals are running... Status: in_progress. With N scenarios and 1 agent, expect about X–Y minutes total. I'll check again shortly."
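The X–Y estimate in that message is simple arithmetic; the numbers below assume 24 scenarios and one agent, and treat the baseline/with-context pair as covered by the ~10–15 minute per-scenario range:

```shell
# Rough runtime estimate: ~10-15 minutes per scenario per agent.
scenarios=24
agents=1
low=$(( scenarios * agents * 10 ))
high=$(( scenarios * agents * 15 ))
echo "Expect roughly ${low}-${high} minutes total."
# → Expect roughly 240-360 minutes total.
```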
Wait until status shows `completed`. If status shows `failed`, run:

```
tessl eval retry <id>
```

Tile directory layout:

```
evals/
  scenario-1
  scenario-2
  scenario-3
  scenario-4
  scenario-5
  scenario-6
  scenario-7
  scenario-8
  scenario-9
  scenario-10
  scenario-11
  scenario-12
  scenario-13
  scenario-14
  scenario-15
  scenario-16
  scenario-17
  scenario-18
  scenario-19
  scenario-20
  scenario-21
  scenario-22
  scenario-23
  scenario-24
skills/
  compare-skill-model-performance
  optimize-skill-instructions
references/
  optimize-skill-performance
  optimize-skill-performance-and-instructions
```
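The wait-and-retry step above can be sketched as a polling loop. Here `check_status` is a stub standing in for parsing the output of `tessl eval list --mine --limit 1`; the real output format is not specified in this document, so that parsing is an assumption:

```shell
# Stub: in practice, parse `tessl eval list --mine --limit 1` output.
check_status() { echo "completed"; }

while true; do
  status=$(check_status)
  case "$status" in
    completed) echo "Eval run completed."; break ;;
    failed)    echo "Run failed; retry with: tessl eval retry <id>"; break ;;
    *)         echo "Status: $status. Checking again in 60s."; sleep 60 ;;
  esac
done
```

The 60-second sleep keeps polling cheap relative to the ~10–15 minute per-scenario runtime.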