Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
You handle tile eval setup — scenario generation from a tile, running evals, and presenting results.
The user triggers this skill when they have a tile but no eval scenarios yet, or when they want to generate new scenarios.
Companion skill: After setup is complete, suggest the user run the optimize-skill-performance skill to analyze results, diagnose failures, fix tile content, and re-verify improvements.
Time expectations: Set these upfront so the user isn't surprised:
Every tessl eval run invocation MUST include --label <run-label> so the run is identifiable in tessl eval list. The label is a short, human-readable description of what the run is about — not a structured ID.
Compose <run-label> from whatever helps you recognise the run later when scanning the list. Typical ingredients:
- activation, baseline, initial evals, verification
- description rewrite, plan-solution fixes, clean scenario
- (haiku-4-5), (sonnet-4-6)
- v0.5.0, v4

Examples:

- repro-clean-scenario
- task-prep v0.3.0 baseline
- task-prep v0.5.0 plan-solution fixes
- v4-final-verification
- skill-insights activation (haiku-4-5)
- skill-insights initial evals (haiku-4-5)

Keep it concise — what the run was about should be obvious without opening it.
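The label guidance can be sketched as a small helper. This is only an illustration: `compose_run_label` is a hypothetical name, not part of the tessl CLI, and it simply joins whichever ingredients you pass in the order you pass them.

```python
def compose_run_label(*parts: str) -> str:
    """Join the non-empty ingredients into one short, human-readable label.

    Hypothetical helper for illustration; the tessl CLI only sees the
    final string that is passed to --label.
    """
    cleaned = [p.strip() for p in parts if p and p.strip()]
    if not cleaned:
        raise ValueError("a run label needs at least one ingredient")
    return " ".join(cleaned)


# Tile name, purpose, and model, mirroring the examples above.
print(compose_run_label("skill-insights", "activation", "(haiku-4-5)"))
# Empty ingredients are skipped rather than leaving stray spaces.
print(compose_run_label("task-prep", "v0.3.0", "", "baseline"))
```

Free-form joining is deliberate here: the label is a description, not a structured ID, so the helper does not enforce any ordering or required fields.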
Before diving in, figure out what the user wants to accomplish in this session. If the user's request already makes the scope clear (e.g., "run my evals", "generate scenarios"), skip the question and go straight to the relevant phase.
Otherwise, ask:
"What would you like to do?
- Full pipeline — generate scenarios, run evals, and see results (start-to-finish, ~1 hour)
- Generate scenarios only — generate and download scenarios, but don't run evals yet
- Run evals on existing scenarios — skip generation, just run and compare results on scenarios already in evals/
- Something else — tell me what you need"
Map the user's choice to phases:
| Choice | Phases to run |
|---|---|
| Full pipeline | 1 → 2 → 3 → 4a → 4b → 5 → 6 |
| Generate scenarios only | 1 → 2 → 3 |
| Run evals on existing scenarios | 1 → 4a → 4b → 5 → 6 |
For partial runs, skip phases not in scope — don't load their reference files.
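As a rough sketch, the choice-to-phase table and the "only load what is in scope" rule could be modelled like this. The dictionary and function names are hypothetical; the phase lists and reference file paths are the ones named in this document.

```python
PHASES_BY_CHOICE = {
    "full pipeline": ["1", "2", "3", "4a", "4b", "5", "6"],
    "generate scenarios only": ["1", "2", "3"],
    "run evals on existing scenarios": ["1", "4a", "4b", "5", "6"],
}

REFERENCE_BY_PHASE = {
    "1": "references/phase1-gather-context.md",
    "2": "references/phase2-generate-scenarios.md",
    "3": "references/phase3-download-scenarios.md",
    "4a": "references/phase4-run-evals.md",
    "4b": "references/phase4-run-evals.md",
    "5": "references/phase5-view-results.md",
    "6": "references/phase6-next-steps.md",
}


def phases_for(choice: str) -> list[str]:
    """Return the phases to run for a menu choice (case-insensitive)."""
    key = choice.strip().lower()
    if key not in PHASES_BY_CHOICE:
        # "Something else" has no fixed mapping; the caller should ask the user.
        raise KeyError(f"no fixed phase list for choice: {choice!r}")
    return list(PHASES_BY_CHOICE[key])


def references_for(choice: str) -> list[str]:
    """Reference files needed for the in-scope phases, in order, without repeats."""
    files = [REFERENCE_BY_PHASE[p] for p in phases_for(choice)]
    return list(dict.fromkeys(files))  # 4a and 4b share one file


print(phases_for("Generate scenarios only"))
print(references_for("Run evals on existing scenarios"))
```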
Locate the tile and check for existing scenarios.
Read references/phase1-gather-context.md for the full procedure.
Run tessl scenario generate against the tile and review what was generated.
Read references/phase2-generate-scenarios.md for the full procedure.
Download scenarios to evals/, verify the structure, quality-check for rubric anti-patterns (answer leakage, double-counting, free points), and detect when scenarios need fixtures or setup scripts before proceeding.
Read references/phase3-download-scenarios.md for the full procedure. The fixture / setup-script detection step (§3.4) loads references/phase3-fixtures-and-setup.md — that file owns the signals, skip rules, sourcing procedure, and pre-run summary format.
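Two of the rubric anti-patterns lend themselves to a quick heuristic check. The sketch below is an assumption-heavy illustration: it treats a rubric as a plain list of criterion strings and a task as plain text, which may not match the real scenario file layout, and both function names are hypothetical. Free points are not covered because spotting them needs judgment about what an empty answer would score.

```python
import re


def _norm(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-duplicates compare equal."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def find_double_counting(criteria: list[str]) -> list[str]:
    """Return criteria whose normalized text repeats an earlier criterion."""
    seen: set[str] = set()
    repeats: list[str] = []
    for criterion in criteria:
        key = _norm(criterion)
        if key in seen:
            repeats.append(criterion)
        seen.add(key)
    return repeats


def find_answer_leakage(task_text: str, expected_answers: list[str]) -> list[str]:
    """Return expected answers that already appear verbatim in the task text."""
    haystack = _norm(task_text)
    return [a for a in expected_answers if _norm(a) and _norm(a) in haystack]


criteria = ["Uses --label", "uses  --LABEL!", "Polls for completion"]
print(find_double_counting(criteria))
print(find_answer_leakage("Return the value 42 as JSON", ["the value 42", "yaml"]))
```

A substring match like this only catches the crudest cases; it is meant as a first pass before the manual review described in the reference file.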
Run the activation eval (--solver=activation) to observe which skill self-activates per scenario. Activation does NOT force a skill to fire — it tests routing/description quality. Applies to all tiles, single-skill and multi-skill alike: a single-skill tile with a bad description still won't fire, so this check is required regardless of skill count.
Ordering:
Read references/phase4-run-evals.md §Phase 4a for the full procedure.
Choose agents/models, run tessl eval run (default solver — forces activation), and poll for completion.
Read references/phase4-run-evals.md §Phase 4b for the full procedure.
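The two run modes differ only in the solver flag, which a minimal sketch can make explicit. `eval_run_argv` is a hypothetical helper; it uses only the flags this document names (`--solver=activation` and `--label`) and leaves out the tile path and any agent or model arguments, since their exact form is covered in the reference file rather than here.

```python
def eval_run_argv(label: str, activation: bool = False) -> list[str]:
    """Build an argument list for `tessl eval run`.

    activation=True  -> Phase 4a: observe which skill self-activates.
    activation=False -> Phase 4b: default solver, which forces activation.
    Other arguments the CLI may need are intentionally omitted.
    """
    if not label.strip():
        raise ValueError("every eval run must carry a non-empty --label")
    argv = ["tessl", "eval", "run"]
    if activation:
        argv.append("--solver=activation")
    argv += ["--label", label]
    return argv


print(eval_run_argv("skill-insights activation (haiku-4-5)", activation=True))
print(eval_run_argv("skill-insights initial evals (haiku-4-5)"))
```

Passing the result to `subprocess.run` as a list avoids shell quoting problems with labels that contain spaces or parentheses.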
Show baseline vs. with-context scores and per-scenario breakdown.
Read references/phase5-view-results.md for the full procedure.
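The comparison itself is simple arithmetic, sketched below. The input shape (scenario name mapped to a baseline and a with-context score) and the function name are assumptions for illustration, and the numbers are made up, not real eval results.

```python
def summarize(scores: dict[str, tuple[float, float]]) -> dict:
    """Compare baseline and with-context scores.

    `scores` maps scenario name -> (baseline, with_context).
    Returns per-scenario rows with deltas, both averages, and the
    with-context / baseline multiplier (None if the baseline average is 0).
    """
    if not scores:
        raise ValueError("no scenarios to summarize")
    rows = [(name, base, ctx, ctx - base) for name, (base, ctx) in sorted(scores.items())]
    avg_base = sum(r[1] for r in rows) / len(rows)
    avg_ctx = sum(r[2] for r in rows) / len(rows)
    return {
        "rows": rows,
        "avg_baseline": avg_base,
        "avg_with_context": avg_ctx,
        "multiplier": avg_ctx / avg_base if avg_base else None,
    }


# Illustrative numbers only.
result = summarize({"scenario-1": (50.0, 80.0), "scenario-2": (70.0, 70.0)})
for name, base, ctx, delta in result["rows"]:
    print(f"{name}: baseline={base:.0f} with-context={ctx:.0f} delta={delta:+.0f}")
print(f"average: {result['avg_baseline']:.0f} -> {result['avg_with_context']:.0f} "
      f"({result['multiplier']:.2f}x)")
```

Scenarios with a zero or negative delta are the ones worth flagging in the per-scenario breakdown, since the tile added nothing there.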
Summarize the setup, suggest next actions based on scores, and offer to continue.
Read references/phase6-next-steps.md for the full procedure.
Stop when:
optimize-skill-performance or stop