Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
Quality: 94% (Does it follow best practices?) · Eval score: 88% (average across 24 eval scenarios) · Impact: 1.07x · Status: Passed · No known issues
This skill orchestrates optimize-skill-instructions, setup-skill-performance, and optimize-skill-performance into a single end-to-end optimization cycle. Rather than duplicating their instructions, it sequences them and handles the handoff.
The full cycle takes 1–2 hours depending on how many scenarios and improvement iterations are needed. Set this expectation with the user upfront.
Review SKILL.md → Apply quick wins → Generate scenarios → Run evals → Analyze → Fix → Re-run → Report
└── optimize-skill-instructions ──┘ └── setup-skill-performance ──┘ └────────── optimize-skill-performance ──────────────┘

Step 1: Invoke the optimize-skill-instructions skill. This runs tessl skill review on the tile's skill(s), surfaces scoring dimensions and quick wins, and applies approved changes.
Entry criteria: The tile has at least one SKILL.md.
Exit criteria: Review score is presented, approved quick wins are applied. Move to Step 2.
If the review score is already high (>= 85%) and the user is satisfied, skip to Step 2 without changes.
Step 2: Invoke the setup-skill-performance skill with scope = "Full pipeline". Skip the scope question and go straight to Phase 1.
Work through all phases of setup-skill-performance (Find Tile → Generate Scenarios → Download & QC → Run Evals → View Results → Next Steps). Key parameters:
Model: claude:claude-sonnet-4-6

Decision point after results: If the average eval score is already ≥ 85% with no regressions, stop and report success. Otherwise, continue to Step 3.
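This decision gate can be sketched in Python. The function name and the fraction-based score representation are assumptions for illustration; the 85%-with-no-regressions rule comes from the text above:

```python
EVAL_TARGET = 0.85  # success threshold stated in the decision point above

def continue_to_improve(avg_score: float, regressions: int) -> bool:
    """Return True when the cycle should proceed to Step 3.

    avg_score is the average eval score as a fraction (0.88 == 88%);
    regressions counts scenarios that scored worse than baseline.
    """
    passed = avg_score >= EVAL_TARGET and regressions == 0
    return not passed
```

With an 88% average and no regressions the gate reports success and stops; any regression, or an average below 85%, continues the cycle into Step 3.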
Step 3: Before invoking optimize-skill-performance, do a quick triage of the results:
Invoke the optimize-skill-performance skill starting from Phase 1 (it will detect the existing results).
Work through the improve cycle:
Iteration rule: Run up to 2 improve iterations. After the second, report results and stop — diminishing returns set in quickly, and the user should review before investing more time.
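The iteration rule above can be expressed as a bounded loop. This is a minimal sketch under assumptions: run_evals and apply_fixes are hypothetical callables standing in for the eval run and the fix round, and scores are fractions:

```python
MAX_IMPROVE_ITERATIONS = 2  # per the iteration rule above

def run_improve_cycle(run_evals, apply_fixes, target=0.85):
    """Re-run evals after each fix round, stopping at the target
    score or after MAX_IMPROVE_ITERATIONS improve rounds.

    Returns the score history (first entry is the initial run),
    so the final report can show before/after.
    """
    history = [run_evals()]
    for _ in range(MAX_IMPROVE_ITERATIONS):
        if history[-1] >= target:
            break  # target reached; stop early
        apply_fixes()
        history.append(run_evals())
    return history
```

The cap guarantees the user gets a report after at most two improve rounds, matching the diminishing-returns rationale above.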
Present a final summary:
Optimization Complete
Tile: <tile-name>
Review score: XX% → YY%
Scenarios: N scenarios
Iterations: X (1 setup + Y improve rounds)
Eval before (baseline): XX%
Eval after (with tile): YY% (Δ +ZZpp)
Criteria improved: [list]
Still failing: [list with brief reason]
Eval run: [URL to latest run]

If criteria remain stuck after 2 iterations, note whether the gap is addressable via documentation (suggest specific follow-up) or is inherently hard for the agent (suggest accepting or replacing the scenario).
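The summary template above can be rendered programmatically. A sketch, assuming scores are tracked as fractions and the Δ is reported in whole percentage points (pp) as in the template; the function and parameter names are illustrative:

```python
def format_summary(tile, review_before, review_after,
                   n_scenarios, improve_rounds,
                   baseline, final, improved, still_failing, run_url):
    """Render the final summary block from the template above.

    Scores are fractions (0.88 == 88%); the eval delta is shown in
    percentage points, matching the 'Δ +ZZpp' notation.
    """
    delta_pp = round((final - baseline) * 100)
    return "\n".join([
        "Optimization Complete",
        f"Tile: {tile}",
        f"Review score: {review_before:.0%} → {review_after:.0%}",
        f"Scenarios: {n_scenarios} scenarios",
        f"Iterations: {1 + improve_rounds} (1 setup + {improve_rounds} improve rounds)",
        f"Eval before (baseline): {baseline:.0%}",
        f"Eval after (with tile): {final:.0%} (Δ {delta_pp:+d}pp)",
        f"Criteria improved: {improved}",
        f"Still failing: {still_failing}",
        f"Eval run: {run_url}",
    ])
```

Keeping the delta in percentage points (rather than a relative percentage change) avoids overstating improvements on low baselines.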
Stop when:
evals/
  scenario-1
  scenario-2
  scenario-3
  scenario-4
  scenario-5
  scenario-6
  scenario-7
  scenario-8
  scenario-9
  scenario-10
  scenario-11
  scenario-12
  scenario-13
  scenario-14
  scenario-15
  scenario-16
  scenario-17
  scenario-18
  scenario-19
  scenario-20
  scenario-21
  scenario-22
  scenario-23
  scenario-24
skills/
  compare-skill-model-performance
  optimize-skill-instructions
  references
  optimize-skill-performance
  optimize-skill-performance-and-instructions