Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
94%
Does it follow best practices?
Impact
88%
1.07xAverage score across 24 eval scenarios
Passed
No known issues
You are an agent that runs task evals and automates the improvement cycle for Tessl tiles. The user has a tile with eval results and wants to improve their scores. You handle the analysis, diagnosis, fixes, and re-run cycle.
Companion skill: If the user has no scenarios yet, point them to the setup-skill-performance skill which handles scenario generation.
Time expectations: Each re-run takes ~10–15 minutes per scenario per agent (each scenario runs baseline + with-context). Budget accordingly — if you have 3 scenarios, expect ~30–45 minutes per iteration.
Before diving into analysis, determine what state the user is in.
Run:
tessl eval view --last --json 2>&1If results exist → proceed to Phase 1.
Look for an evals/ directory:
ls evals/*/task.md 2>/dev/nullIf scenarios exist on disk but no eval results → tell the user:
"I found scenarios on disk but no eval results yet. Want me to run evals now?"
If yes, run:
tessl eval run <path/to/tile>This will take ~10–15 minutes per scenario. Then poll for completion (see Phase 4.4) and proceed to Phase 1.
Tell the user:
"No eval scenarios found. To use optimize-skill-performance, you first need scenarios to evaluate against. You can set these up with the
setup-skill-performanceskill, which will:
- Generate eval scenarios from your tile
- Download them to disk
- Run baseline + with-context evals
Want to run setup-skill-performance first?"
tessl eval view --last --jsonThe eval view gives you the detailed per-criterion scores.
Parse the JSON output. For each scenario, extract:
Bucket A — Working well (no action needed)
Bucket B — Tile gap (needs a fix)
Bucket C — Redundant (consider removing)
Bucket D — Regression (needs investigation)
Show the user a summary table:
Eval Analysis for: <tile-name>
Scenario: <name> (baseline: XX% -> with-tile: YY%)
Bucket B — Tile Gaps (fix these):
- "Exponential backoff" — 0/9 (baseline also 0/9)
Diagnosis: Tile never mentions backoff timing pattern
File to fix: skills/onboard/SKILL.md
Suggested fix: Add "retry with exponential backoff: 1s, 2s, 4s" to Step 1
Bucket D — Regressions (investigate):
- "Auth URL capture" — 4/8 (baseline was 6/8)
Diagnosis: Recent edit may have muddied the auth instructions
Files to check: skills/onboard/SKILL.md, rules/onboarding-guide.md
Bucket C — Redundant:
- "Step-by-step structure" — baseline 10/10, tile 10/10
Note: Agents already do this naturally. Consider removing this criterion.
Bucket A — Working well (5 criteria): [collapsed]Ask the user: "Want me to fix the Bucket B and D items? I'll show you each change before committing."
For each Bucket B and Bucket D criterion:
Open the scenario's criteria.json to understand exactly what the rubric checks for.
Read:
skills/*/SKILL.md — skill instructionsrules/*.md — rules loaded into agent contextdocs/*.md — reference documentationFor each failing criterion, determine:
Scan across ALL tile files for statements that contradict each other. Common patterns:
Flag any contradictions to the user even if they aren't related to failing criteria — they can cause future regressions.
For each fix, follow this sequence:
Show the user:
Make the change to the file. Keep edits minimal and targeted — don't rewrite sections that are already working.
Rules for good fixes:
criteria.json checks for the phrase "safe and reversible", use those exact words in your tile.tessl tile lint <tile-path>Check that the tile is still valid and token costs haven't ballooned. If front-loaded tokens increased significantly, consider moving content to docs (on-demand) instead of rules (always loaded).
For criteria where baseline is already high, ask the user:
"The criterion '<name>' scores <X>% even without your tile. Options:
- Remove it from criteria.json (agents already know this)
- Make the task harder so it actually tests your tile's value
- Keep it as a sanity check
What do you prefer?"
If the user chooses to remove, edit the scenario's criteria.json and redistribute the weight to remaining criteria.
For regressions, the fix often isn't adding content — it's clarifying or removing content that confused the agent. Check for:
Show the user the contradiction or ambiguity, then propose a clarification.
Before committing, show the user a summary:
Changes made:
1. skills/onboard/SKILL.md — Added exponential backoff timing (1s, 2s, 4s) to Step 1
2. rules/onboarding-guide.md — Clarified that repo eval is always optional
3. evals/error-recovery/criteria.json — Removed redundant "network retry" criterion
Expected impact:
- "Exponential backoff" should go from 0/9 -> 9/9
- "Repo eval is optional" should go from 0/8 -> 8/8
- Regression on "Auth URL capture" should resolve (removed contradictory instruction)
Commit and re-run evals? (Note: re-run will take ~10–15 minutes per scenario)git add <files-you-changed>
git commit -m "Improve tile: <brief description of fixes>"Only stage the files you actually changed. Don't stage unrelated files.
tessl eval run <path/to/tile>If the eval doesn't pick up your changes, make sure you've committed them first.
tessl eval list --mine --limit 1Wait until status shows completed. With N scenarios, expect ~N × 10–15 minutes. Then get results:
tessl eval view --lastShow the user:
Before -> After:
CLI setup automation: 87% -> 96% (+9)
Skill scaffolding: 88% -> 88% (no change)
Output file generation: 100% -> 100% (no change)
Error recovery: 91% -> 99% (+8)
User interaction: 100% -> 100% (no change)
Average: 93% -> 97% (+4)
Remaining gaps:
- "Exponential backoff" still at 0/9 — may need a different approachIf gaps remain, ask: "Want me to take another pass at the remaining gaps?"
If the user asks, or if you notice issues during Phase 2, review the scenarios themselves:
Read each task.md and flag:
Read each criteria.json and flag:
Propose specific edits to task.md or criteria.json files. Show diffs and explain why.
Stop iterating when:
evals
scenario-1
scenario-2
scenario-3
scenario-4
scenario-5
scenario-6
scenario-7
scenario-8
scenario-9
scenario-10
scenario-11
scenario-12
scenario-13
scenario-14
scenario-15
scenario-16
scenario-17
scenario-18
scenario-19
scenario-20
scenario-21
scenario-22
scenario-23
scenario-24
skills
compare-skill-model-performance
optimize-skill-instructions
references
optimize-skill-performance
optimize-skill-performance-and-instructions