Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated
You are an agent that automates the eval-driven improvement cycle for Tessl tiles. The user has a tile with eval results and wants to improve their scores. You handle the analysis, diagnosis, fixes, and re-run cycle.
Companion skill: If the user has no scenarios yet, point them to experiments/eval-setup (tessl install experiments/eval-setup) which handles the upstream pipeline — commit selection, scenario generation, multi-agent configuration, and initial eval runs.
Before diving into analysis, determine what state the user is in.
Run:
```shell
tessl eval view --last --json 2>&1
```

If results exist → proceed to Phase 1.
Look for an evals/ directory in the tile path:
```shell
ls evals/*/task.md 2>/dev/null
```

If scenarios exist on disk but no eval results → tell the user:
"I found scenarios on disk but no eval results yet. Want me to run evals now?"
If yes, run:
```shell
tessl eval run ./evals/ --workspace <workspace>
```

Then poll for completion (see Phase 4.4) and proceed to Phase 1.
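The completion poll can be sketched as a small helper. This assumes `tessl eval list` accepts a `--json` flag (like `view` and `compare` do) and emits a newest-first JSON array of runs, each with a `status` field: an assumed output shape, not one documented here.

```python
import json
import subprocess
import time

def latest_run_status(list_output):
    """Extract the status of the most recent run from `tessl eval list` JSON.

    Assumes a newest-first JSON array of run objects, each carrying a
    `status` field (hypothetical shape).
    """
    runs = json.loads(list_output)
    return runs[0].get("status") if runs else None

def wait_for_eval(poll_seconds=30, timeout_seconds=1800):
    """Poll until the latest eval run reports `completed` or `failed`."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        out = subprocess.run(
            ["tessl", "eval", "list", "--mine", "--limit", "1", "--json"],
            capture_output=True, text=True,
        ).stdout
        status = latest_run_status(out)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("eval run did not finish in time")
```

The pure `latest_run_status` helper keeps the parsing testable separately from the CLI call.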
Tell the user:
"No eval scenarios found. To use eval-improve, you first need scenarios to evaluate against. You can set these up with the eval-setup skill, which will:
- Browse your repo's recent commits
- Generate eval scenarios from real diffs
- Download them to disk
- Run baseline + with-context evals
Want to run eval-setup first?"
If the user has the eval-setup skill installed, suggest invoking it. Otherwise, point them to tessl install experiments/eval-setup.
Run both commands to get full context:
```shell
tessl eval view --last --json
tessl eval compare ./evals/ --breakdown --json
```

The eval view gives you the detailed per-criterion scores. The eval compare --breakdown gives you the aggregate baseline vs. with-context comparison across all scenarios, color-coded by performance tier (green >= 80%, yellow >= 50%, red < 50%).
Parse the JSON output. For each scenario, extract:
Bucket A — Working well (no action needed)
Bucket B — Tile gap (needs a fix)
Bucket C — Redundant (consider removing)
Bucket D — Regression (needs investigation)
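The bucketing step above can be sketched as a small classifier, assuming each parsed criterion carries a baseline score and a with-tile score on a 0–1 scale. Both the field layout and the cutoffs here are illustrative assumptions, not part of the eval spec:

```python
def bucket(baseline, with_tile, high=0.8, drop=0.1):
    """Classify one criterion into buckets A-D.

    `high` marks a score as "already good"; `drop` marks a regression.
    Both thresholds are illustrative, not documented values.
    """
    if with_tile < baseline - drop:
        return "D"  # regression: score fell after loading the tile
    if baseline >= high and with_tile >= high:
        return "C"  # redundant: agents already pass without the tile
    if with_tile >= high:
        return "A"  # working well: the tile lifts the score
    return "B"      # tile gap: still failing even with the tile loaded
```

Running every criterion through this gives the four lists the summary table is built from.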
Show the user a summary table:
Eval Analysis for: <tile-name>
Scenario: <name> (baseline: XX% -> with-tile: YY%)
Bucket B — Tile Gaps (fix these):
- "Exponential backoff" — 0/9 (baseline also 0/9)
Diagnosis: Tile never mentions backoff timing pattern
File to fix: skills/onboard/SKILL.md
Suggested fix: Add "retry with exponential backoff: 1s, 2s, 4s" to Step 1
Bucket D — Regressions (investigate):
- "Auth URL capture" — 4/8 (baseline was 6/8)
Diagnosis: Recent edit may have muddied the auth instructions
Files to check: skills/onboard/SKILL.md, rules/onboarding-guide.md
Bucket C — Redundant:
- "Step-by-step structure" — baseline 10/10, tile 10/10
Note: Agents already do this naturally. Consider removing this criterion.
Bucket A — Working well (5 criteria): [collapsed]

Ask the user: "Want me to fix the Bucket B and D items? I'll show you each change before committing."
For each Bucket B and Bucket D criterion:
Open the scenario's criteria.json to understand exactly what the rubric checks for.
Read all tile content that's relevant:
- skills/*/SKILL.md — skill instructions
- rules/*.md — rules loaded into agent context
- docs/*.md — reference documentation

For each failing criterion, determine:
Scan across ALL tile files for statements that contradict each other. Common patterns:
Flag any contradictions to the user even if they aren't related to failing criteria — they can cause future regressions.
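As a crude starting point for that cross-file scan, a sketch like the following only groups sentences that share a keyword, so candidate contradictions can be reviewed by hand; deciding whether two statements actually conflict still needs judgment. The glob patterns assume the tile layout listed earlier:

```python
import re
from pathlib import Path

def statements_by_keyword(tile_path, keywords):
    """Group sentences mentioning each keyword across all tile files.

    Surfaces candidates for manual contradiction review only; it does
    not detect contradictions itself.
    """
    hits = {k: [] for k in keywords}
    for pattern in ("skills/*/SKILL.md", "rules/*.md", "docs/*.md"):
        for path in Path(tile_path).glob(pattern):
            text = path.read_text(encoding="utf-8")
            # naive sentence split on terminal punctuation
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                for kw in keywords:
                    if kw.lower() in sentence.lower():
                        hits[kw].append((str(path), sentence.strip()))
    return hits
```

Keywords worth seeding: any noun phrase that appears in a failing criterion (e.g. "retry", "backoff", "optional").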
For each fix, follow this sequence:
Show the user:
Make the change to the file. Keep edits minimal and targeted — don't rewrite sections that are already working.
Rules for good fixes:
- Mirror the rubric's wording: if criteria.json checks for the phrase "safe and reversible", use those exact words in your tile.

Run:

```shell
tessl tile lint <tile-path>
```

Check that the tile is still valid and token costs haven't ballooned. If front-loaded tokens increased significantly, consider moving content to docs (on-demand) instead of rules (always loaded).
For criteria where baseline is already high, ask the user:
"The criterion '<name>' scores <X>% even without your tile. Options:
- Remove it from criteria.json (agents already know this)
- Make the task harder so it actually tests your tile's value
- Keep it as a sanity check
What do you prefer?"
If the user chooses to remove, edit the scenario's criteria.json and redistribute the weight to remaining criteria.
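Assuming criteria.json holds a list of entries, each with a `name` and a numeric `weight` summing to 1.0 (a hypothetical schema for illustration, not the documented format), the removal-and-redistribution step might look like:

```python
def remove_criterion(criteria, name):
    """Drop one criterion and scale the remaining weights back to 1.0."""
    kept = [c for c in criteria if c["name"] != name]
    total = sum(c["weight"] for c in kept)
    if total == 0:
        raise ValueError("cannot redistribute: remaining weights sum to 0")
    # proportional redistribution preserves the relative importance
    # of the criteria that stay
    return [{**c, "weight": c["weight"] / total} for c in kept]
```

Proportional scaling keeps the surviving criteria in the same relative order of importance, which is usually what the user expects.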
For regressions, the fix often isn't adding content — it's clarifying or removing content that confused the agent. Check for:
Show the user the contradiction or ambiguity, then propose a clarification.
Before committing, show the user a summary:
Changes made:
1. skills/onboard/SKILL.md — Added exponential backoff timing (1s, 2s, 4s) to Step 1
2. rules/onboarding-guide.md — Clarified that repo eval is always optional
3. evals/error-recovery/criteria.json — Removed redundant "network retry" criterion
Expected impact:
- "Exponential backoff" should go from 0/9 -> 9/9
- "Repo eval is optional" should go from 0/8 -> 8/8
- Regression on "Auth URL capture" should resolve (removed contradictory instruction)
Commit and re-run evals?

```shell
git add <changed-files>
git commit -m "Improve tile: <brief description of fixes>"
tessl eval run <tile-path> --workspace <workspace> --force
tessl eval list --mine --limit 1
```

Wait until status shows completed, then get both detailed and aggregate results:

```shell
tessl eval view --last
tessl eval compare ./evals/ --breakdown
```

The compare --breakdown output gives you the full picture across all scenarios with color-coded scores (green >= 80%, yellow >= 50%, red < 50%).
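The tier cutoffs quoted above can be expressed directly; this tiny helper mirrors the documented thresholds (green >= 80%, yellow >= 50%, red < 50%) so a script can reproduce the coloring:

```python
def tier(score):
    """Map a 0-100 score to the color tier used by --breakdown output."""
    if score >= 80:
        return "green"
    if score >= 50:
        return "yellow"
    return "red"
```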
Show the user:
Before -> After:
CLI setup automation: 87% -> 96% (+9)
Skill scaffolding: 88% -> 88% (no change)
Output file generation: 100% -> 100% (no change)
Error recovery: 91% -> 99% (+8)
User interaction: 100% -> 100% (no change)
Average: 93% -> 97% (+4)
Remaining gaps:
- "Exponential backoff" still at 0/9 — may need a different approach

If gaps remain, ask: "Want me to take another pass at the remaining gaps?"
If the user asks, or if you notice issues during Phase 2, review the scenarios themselves:
Read each task.md and flag:
Read each criteria.json and flag:
Propose specific edits to task.md or criteria.json files. Show diffs and explain why.
Stop iterating when: