Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
This skill orchestrates optimize-skill-instructions, setup-skill-performance, and optimize-skill-performance into a single end-to-end optimization cycle.
The full cycle takes 1–2 hours depending on how many scenarios and improvement iterations are needed. Set this expectation with the user upfront.
Review SKILL.md → Apply quick wins → Generate scenarios → Activation check → Content evals → Analyze → Fix → Re-run → Report
└── optimize-skill-instructions ──┘ └─────────── setup-skill-performance ───────────┘ └──────────── optimize-skill-performance ────────────┘
Two distinct eval types run in this cycle:
- Activation check: observes whether a skill fires naturally, with no forcing.
- Content evals: force activation and score task performance.
Both apply to every tile (single-skill and multi-skill alike). The variable is ordering, not whether to run them.
Every tessl eval run invocation MUST include --label <run-label> so the run is identifiable in tessl eval list. The label is a short, human-readable description of what the run is about — not a structured ID.
Compose <run-label> from whatever helps you recognise the run later when scanning the list. Typical ingredients:
- activation, baseline, initial evals, verification
- description rewrite, plan-solution fixes, clean scenario
- (haiku-4-5), (sonnet-4-6)
- v0.5.0, v4
Examples:
- repro-clean-scenario
- task-prep v0.3.0 baseline
- task-prep v0.5.0 plan-solution fixes
- v4-final-verification
- skill-insights activation (haiku-4-5)
- skill-insights initial evals (haiku-4-5)
Keep it concise: what the run was about should be obvious without opening it.
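As a rough sketch of composing a label from the ingredients above, the shell helper below joins whichever parts are supplied and skips empty ones. `make_label` is a hypothetical name, not part of the tessl CLI; only the `tessl eval run ... --label` invocation shown in the final comment comes from this document.

```shell
# make_label is a hypothetical helper (not a tessl command).
# It joins its non-empty arguments with single spaces.
make_label() {
  label=""
  for part in "$@"; do
    # Skip ingredients that were left empty.
    [ -n "$part" ] || continue
    label="${label:+$label }$part"
  done
  printf '%s\n' "$label"
}

run_label=$(make_label "task-prep" "v0.5.0" "plan-solution fixes")
printf '%s\n' "$run_label"

# Quote the label when passing it on, since it usually contains spaces:
# tessl eval run <tile-path> --agent=claude:claude-sonnet-4-6 --label "$run_label"
```

Quoting matters here because most of the example labels contain spaces or parentheses, which the shell would otherwise split or interpret.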
tessl skill review skills/<name>/SKILL.md # review a skill (Step 1)
tessl scenario generate <tile-path> --count=5 # generate scenarios (Step 2)
tessl eval run <tile-path> --solver=activation --label <run-label> # test skill routing
tessl eval run <tile-path> --agent=claude:claude-sonnet-4-6 --label <run-label> # scored eval
tessl eval view --last --json # check results
Invoke the optimize-skill-instructions skill. This runs tessl skill review on the tile's skill(s), surfaces scoring dimensions and quick wins, and applies approved changes.
Entry criteria: The tile has at least one SKILL.md.
Exit criteria: Review score is presented, approved quick wins are applied. Move to Step 2.
If the review score is already high (>= 85%) and the user is satisfied, skip to Step 2 without changes.
Invoke the setup-skill-performance skill with scope = "Full pipeline". Skip the scope question — go straight to Phase 1.
Before invoking, decide eval ordering by skill count:
ls skills/*/SKILL.md 2>/dev/null | wc -l
Work through all phases of setup-skill-performance (Find Tile → Generate Scenarios → Download & QC → Activation Check → Content Evals → View Results → Next Steps). Key parameters:
- Agent: claude:claude-sonnet-4-6
Decision point after results: If the activation check has been run and reviewed AND the content eval average is ≥ 85% with no regressions, stop and report success. Otherwise, continue to Step 3.
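The decision point can be read as a three-part conjunction. The sketch below is one way to express it in shell, assuming the average is available as an integer percentage and regressions as a count; `should_stop` and its argument order are hypothetical and not something the tessl CLI provides.

```shell
# should_stop is a hypothetical helper, not a tessl command.
# Usage: should_stop <activation_reviewed: yes|no> <content_avg_percent> <regression_count>
# Succeeds (exit status 0) only when every condition of the decision point holds.
should_stop() {
  [ "$1" = "yes" ] && [ "$2" -ge 85 ] && [ "$3" -eq 0 ]
}

if should_stop yes 88 0; then
  echo "stop and report success"
else
  echo "continue to Step 3"
fi
```

A high content average alone is not enough: with the activation check unreviewed (`should_stop no 95 0`), the helper still fails and the cycle continues.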
Before invoking optimize-skill-performance, do a quick triage of the results:
Invoke the optimize-skill-performance skill starting from Phase 1 (it will detect the existing results).
Work through the improve cycle:
Iteration rule: Run up to 2 improve iterations. After the second, report results and stop — the user should review before investing more time.
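A minimal sketch of the iteration rule, assuming the caller tracks how many improve iterations have finished and whether the success gate has been met. `next_action` and `MAX_IMPROVE_ITERATIONS` are hypothetical names used only for illustration.

```shell
MAX_IMPROVE_ITERATIONS=2  # cap taken from the iteration rule

# next_action is a hypothetical helper, not a tessl command.
# Usage: next_action <improve_iterations_done> <gate_passed: yes|no>
next_action() {
  if [ "$2" = "yes" ] || [ "$1" -ge "$MAX_IMPROVE_ITERATIONS" ]; then
    echo "report"   # stop and hand results back to the user
  else
    echo "improve"  # run another improve iteration
  fi
}

next_action 0 no
next_action 2 no
```

The cap is checked independently of the scores, so a run that is still failing after the second iteration is reported rather than retried.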
Present a final summary. Activation and content results are reported separately because they measure different things — activation observes natural firing, content forces activation and scores task performance.
Optimization Complete
Tile: <tile-name>
Review score: XX% → YY%
Scenarios: N scenarios
Iterations: X (1 setup + Y improve rounds)
Activation Results (natural activation, no forcing)
Scenarios where a skill fired:
- Scenario A → fired: skills/<name>
- Scenario C → fired: skills/<name>
Scenarios where NO skill fired:
- Scenario B
- Scenario D
Task Eval Results (forced activation)
Scenario A: baseline XX% → with-context YY% (Δ +ZZ)
Scenario B: baseline XX% → with-context YY% (Δ +ZZ)
...
Average: XX% → YY%
Cross-reference (where the two eval types meet)
No-activation but high baseline (no skill needed — routing is fine):
- Scenario B (88% baseline) — agent already handles it
No-activation AND low baseline (real routing gap — skill helps but doesn't fire):
- Scenario D (25% baseline → 90% with-context) — suggested description edit: …
Criteria improved: [list]
Still failing: [list with brief reason]
Eval runs:
Activation: [URL]
  Content: [URL]
If criteria remain stuck after 2 iterations, note whether the gap is addressable via documentation (suggest specific follow-up) or is inherently hard for the agent (suggest accepting or replacing the scenario).
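The cross-reference section of the summary sorts no-activation scenarios by their baseline score. The sketch below shows one way to do that sorting. The 80% cutoff is an assumption, since the template only gives examples (88% treated as high, 25% as low), and `classify_scenario` is a hypothetical helper.

```shell
HIGH_BASELINE=80  # assumed cutoff; the template gives only examples, not a threshold

# classify_scenario is a hypothetical helper, not a tessl command.
# Usage: classify_scenario <fired: yes|no> <baseline_percent>
classify_scenario() {
  if [ "$1" = "yes" ]; then
    echo "skill fired"
  elif [ "$2" -ge "$HIGH_BASELINE" ]; then
    echo "no skill needed"   # agent already handles it; routing is fine
  else
    echo "routing gap"       # skill would help but does not fire
  fi
}

classify_scenario no 88
classify_scenario no 25
```

Only the "routing gap" bucket calls for a suggested description edit; the "no skill needed" bucket can be reported without follow-up.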
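Stop when:
- The activation check has been run and reviewed, and the content eval average is ≥ 85% with no regressions; or
- Two improve iterations have been completed.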
Note: activation findings (zero-firing skills, scenarios with no activation) drive follow-up actions (description rewrites, scenario edits) but are not a numeric pass/fail gate. The gate is "ran and reviewed", not a coverage percentage — natural activation is scenario-driven, so "X of Y skills fired" is not a useful score.