Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
This phase covers two distinct eval types. Both apply to every tile (single-skill and multi-skill alike).
Activation eval (--solver=activation): observes which skill self-activates per scenario. Does NOT force activation. Tests routing/description quality. Fast: completes in ~2–3 min.

Ordering: for multi-skill tiles, run activation first (it catches routing problems before scored time is invested); for single-skill tiles, either order is fine, and running both in parallel works too. Both are required; the variable is ordering, not whether to run them.
    tessl eval run <tile-path> --solver=activation --label <run-label>

This completes in ~2–3 min (no agent execution needed). Note the eval run URL from the output and share it with the user.
Activation results are reviewed in Phase 5 — they do not produce a numeric score; they produce a per-scenario firing pattern (which skill, if any, fired on each scenario). Pair them with content-eval baseline scores to distinguish "no activation but agent handles it fine" from "no activation and agent needs help" (a real routing gap).
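The pairing described above can be sketched as a small classifier. This is only an illustration: the `ScenarioResult` record, the 0–100 score scale, the `ok_threshold` cutoff, and the sample data are all assumptions made for the example, not part of the eval output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScenarioResult:
    # Hypothetical record combining one activation result with one baseline score.
    name: str
    fired_skill: Optional[str]  # skill that self-activated, or None if nothing fired
    baseline_score: float       # content-eval baseline score (0-100 scale assumed)


def classify(result: ScenarioResult, ok_threshold: float = 80.0) -> str:
    """Label one scenario. ok_threshold is an arbitrary illustrative cutoff."""
    if result.fired_skill is not None:
        return f"activated: {result.fired_skill}"
    if result.baseline_score >= ok_threshold:
        return "no activation, agent handles it fine"
    return "routing gap: no activation and agent needs help"


# Made-up sample data, for illustration only.
results = [
    ScenarioResult("scenario-a", "some-skill", 55.0),
    ScenarioResult("scenario-b", None, 92.0),
    ScenarioResult("scenario-c", None, 40.0),
]
for r in results:
    print(r.name, "->", classify(r))
```

Only the third label marks a real routing gap; the second kind of scenario may not need the skill at all.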
Poll for completion as described in §Polling below.
For a first run, recommend keeping it simple:
"For a first run, I recommend just using claude:claude-sonnet-4-6 to keep eval time manageable (~10–15 minutes per scenario). Once you've validated the scenarios are good, you can add more agents to compare.

Want to go with the default, or test multiple agents now?
Available agents:
| Agent | Models |
| --- | --- |
| claude | claude-sonnet-4-6 (default), claude-opus-4-6, claude-sonnet-4-5, claude-opus-4-5, claude-haiku-4-5 |
| cursor | auto, composer-1.5 |

Note: Each additional agent multiplies the eval run time and cost."
Build the --agent flags based on their choice. For multi-agent, each agent is a separate --agent flag:
    --agent=claude:claude-sonnet-4-6 --agent=cursor:auto

    tessl eval run <tile-path> \
      --agent=<agent1:model1> \
      [--agent=<agent2:model2>] \
      --label <run-label>

Note the eval run URL from the output and share it with the user so they can optionally watch progress in the browser.
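As a rough sketch of how the flags could be assembled from the user's choice, the helper below builds the argument list with one `--agent` flag per agent. The function name `build_eval_command`, the tile path `./my-tile`, and the label `first-run` are hypothetical; only the `tessl eval run`, `--agent`, and `--label` pieces come from the command shown above.

```python
import shlex
from typing import List


def build_eval_command(tile_path: str, agents: List[str], label: str) -> List[str]:
    """Return an argv list for a content eval run, one --agent flag per agent."""
    for agent in agents:
        # Each entry is expected in the agent:model form, e.g. "cursor:auto".
        if ":" not in agent:
            raise ValueError(f"expected agent:model, got {agent!r}")
    cmd = ["tessl", "eval", "run", tile_path]
    cmd += [f"--agent={agent}" for agent in agents]
    cmd += ["--label", label]
    return cmd


# Hypothetical tile path and label, for illustration only.
cmd = build_eval_command(
    "./my-tile",
    ["claude:claude-sonnet-4-6", "cursor:auto"],
    "first-run",
)
print(shlex.join(cmd))
```

Building a list rather than a single string keeps each flag a separate argument, so the result could be passed to `subprocess.run` without shell quoting concerns.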
Polling: check the status of the latest run:

    tessl eval list --mine --limit 1

For content evals, runs take ~10–15 minutes per scenario per agent. Each scenario runs twice (baseline without context, then with context). Update the user periodically:
"Evals are running... Status: in_progress. With N scenarios and 1 agent, expect about X–Y minutes total. I'll check again shortly."
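The X–Y range in that message can be estimated with simple arithmetic. The sketch below assumes that the ~10–15 minute figure already covers both the baseline and with-context runs of a scenario, and that scenarios and agents run one after another; if runs execute in parallel, the real wall-clock time would be lower. The function name `estimate_minutes` is hypothetical.

```python
from typing import Tuple


def estimate_minutes(
    scenarios: int,
    agents: int = 1,
    per_scenario: Tuple[int, int] = (10, 15),
) -> Tuple[int, int]:
    """Return a (low, high) estimate in minutes, assuming sequential execution."""
    low, high = per_scenario
    return scenarios * agents * low, scenarios * agents * high


low, high = estimate_minutes(scenarios=5, agents=1)
print(
    f"Evals are running... With 5 scenarios and 1 agent, "
    f"expect about {low}-{high} minutes total."
)
```

Adding a second agent doubles both ends of the range, which is why each extra agent multiplies run time.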
For activation evals, expect ~2–3 min total — much faster polling.
Wait until status shows completed. If status shows failed, run:
    tessl eval retry <id>
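The wait-and-check logic can be sketched as a generic polling loop. This is a minimal sketch, not the tool's behavior: `get_status` is a hypothetical callable standing in for whatever reads the run status (for example, by inspecting the output of the list command), and `timed_out` is a sentinel invented for this example. Only the `in_progress`, `completed`, and `failed` status values come from the text above.

```python
import time
from typing import Callable


def wait_for_eval(
    get_status: Callable[[], str],
    interval_s: float = 60.0,
    max_checks: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the run is completed or failed, or give up after max_checks."""
    for _ in range(max_checks):
        status = get_status()
        if status in ("completed", "failed"):
            return status  # caller decides whether to retry on "failed"
        sleep(interval_s)
    return "timed_out"  # sentinel invented for this sketch


# Simulated status sequence so the example runs instantly.
fake_statuses = iter(["in_progress", "in_progress", "completed"])
print(wait_for_eval(lambda: next(fake_statuses), sleep=lambda _s: None))
```

For activation evals a much shorter `interval_s` would suit the ~2–3 min total; for content evals a check every minute or so is likely enough.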