Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
94%
Does it follow best practices?
Impact
88%
1.07x
Average score across 24 eval scenarios
Passed
No known issues
tessl eval view --last
Show the user the overall scores and per-scenario breakdown.
Present the key findings:
Eval Results Summary:
Scenario          Baseline   With-Tile   Delta
checkout-flow     42%        87%         +45
webhook-setup     38%        72%         +34
error-recovery    65%        91%         +26
Overall:          48%        83%         +35
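As a rough illustration of how the rows above fit together, the sketch below recomputes each delta and the overall averages from the per-scenario percentages, then lists the scenarios that remain below the 80% threshold. The names `ScenarioScore`, `summarize`, and `SCORES` are hypothetical helpers for this example and are not part of the tessl CLI. It also assumes the overall row is a plain rounded mean of the scenario scores, which matches the figures shown.

```python
from dataclasses import dataclass

THRESHOLD = 80  # pass mark referenced in the observations (assumed to be a percent)


@dataclass
class ScenarioScore:
    name: str
    baseline: int   # percent score without the tile
    with_tile: int  # percent score with the tile

    @property
    def delta(self) -> int:
        # Improvement in percentage points attributable to the tile
        return self.with_tile - self.baseline


def summarize(scores):
    """Return formatted table lines and the scenarios still below THRESHOLD."""
    lines = [f"{'Scenario':<16}{'Baseline':>9}{'With-Tile':>11}{'Delta':>7}"]
    for s in scores:
        lines.append(f"{s.name:<16}{s.baseline:>8}%{s.with_tile:>10}%{s.delta:>+7}")
    # Overall row: unweighted mean of the scenario scores, rounded to whole percent
    base_avg = round(sum(s.baseline for s in scores) / len(scores))
    tile_avg = round(sum(s.with_tile for s in scores) / len(scores))
    lines.append(f"{'Overall:':<16}{base_avg:>8}%{tile_avg:>10}%{tile_avg - base_avg:>+7}")
    below = [s.name for s in scores if s.with_tile < THRESHOLD]
    return lines, below


# Figures copied from the example summary above
SCORES = [
    ScenarioScore("checkout-flow", 42, 87),
    ScenarioScore("webhook-setup", 38, 72),
    ScenarioScore("error-recovery", 65, 91),
]

if __name__ == "__main__":
    table, below_threshold = summarize(SCORES)
    print("\n".join(table))
    print("Below threshold:", ", ".join(below_threshold))
```

Keeping the threshold check separate from the formatting makes it easy to turn the `below` list into the observation bullets.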
Key observations:
- checkout-flow: Tile adds significant value (+45 points)
- webhook-setup: Good improvement but still below 80% threshold
- error-recovery: Strong improvement, above 80%

If multiple agents were tested, show a comparison:
Agent Comparison:
Agent                       Avg Score   Best Scenario          Worst Scenario
claude:claude-sonnet-4-6    80%         checkout-flow (87%)    webhook-setup (72%)
cursor:auto                 74%         error-recovery (85%)   webhook-setup (58%)
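A comparison like this could be derived from per-agent, per-scenario scores roughly as sketched below. `compare_agents`, `shared_worst`, and `RESULTS` are hypothetical names for this example. The table does not show every per-scenario score, so the values in `RESULTS` that it omits are placeholders chosen only to be consistent with the averages shown, not real eval output.

```python
def compare_agents(results):
    """results maps agent name -> {scenario: percent score}.

    Returns one (agent, average, best_scenario, worst_scenario) tuple per agent.
    """
    rows = []
    for agent, scores in results.items():
        best = max(scores, key=scores.get)
        worst = min(scores, key=scores.get)
        avg = round(sum(scores.values()) / len(scores))
        rows.append((agent, avg, best, worst))
    return rows


def shared_worst(rows):
    """Return the scenario that is worst for every agent, or None if they differ."""
    worsts = {worst for _, _, _, worst in rows}
    return worsts.pop() if len(worsts) == 1 else None


# Placeholder data: best/worst values come from the table above; the remaining
# score per agent (81 and 79) is an assumption that reproduces the listed averages.
RESULTS = {
    "claude:claude-sonnet-4-6": {
        "checkout-flow": 87, "webhook-setup": 72, "error-recovery": 81,
    },
    "cursor:auto": {
        "checkout-flow": 79, "webhook-setup": 58, "error-recovery": 85,
    },
}

if __name__ == "__main__":
    rows = compare_agents(RESULTS)
    for agent, avg, best, worst in rows:
        print(f"{agent:<28}{avg}%  best={best}  worst={worst}")
    common = shared_worst(rows)
    if common:
        print(f"All agents score lowest on {common}; this may point to a tile gap.")
```

A scenario that is the weakest for every agent is more likely to reflect missing guidance in the tile than a quirk of one model, which is the reasoning behind the shared-worst check.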
Observations:
- Claude Sonnet scores highest on average
- Both agents struggle with webhook-setup: likely a tile gap

evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10
scenario-11
scenario-12
scenario-13
scenario-14
scenario-15
scenario-16
scenario-17
scenario-18
scenario-19
scenario-20
scenario-21
scenario-22
scenario-23
scenario-24
skills
compare-skill-model-performance
optimize-skill-instructions
references
optimize-skill-performance
optimize-skill-performance-and-instructions