Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
Eval scenarios start in an empty working directory unless they declare environment preparation. Two kinds exist:
- Fixtures: declared under fixtures in scenario.json.
- Setup scripts: declared as setup: ["./setup.sh"] in scenario.json, or auto-detected from a setup.sh placed next to the scenario.
This reference covers when each is needed, how to source or generate them silently when possible, and the skip rules that protect user intent.
A scenario.json with both kinds of preparation looks like:
{
  "fixtures": {
    "codebase": {
      "type": "commit",
      "repoUrl": "https://github.com/acme/example.git",
      "ref": "main"
    }
  },
  "setup": ["./setup.sh"]
}
Fixture types:
- commit: { type: "commit", repoUrl, ref, installPath?, exclude? }, a git snapshot at a ref.
- directory: { type: "directory", path, installPath }, a local directory copied in.
For each scenario, read its task.md, criteria.json, and the parent tile's SKILL.md / docs/ to classify needs.
The fixture signal fires if any of these apply:
- SKILL.md talks about modifying or refactoring existing code, or uses phrases like "this codebase", "your repo", or names file paths the user is expected to edit.
- task.md references files that must already exist, e.g. "fix the bug in src/foo.ts", "update the migration in db/".
- criteria.json checks for edits to existing files rather than from-scratch creation.
The setup-script signal fires if any of these apply:
- The tile or task references install or build commands (npm install, pip install, brew install, cargo build).
Do not generate or overwrite when:
- scenario.json already declares a fixtures record: leave it alone.
- scenario.json already declares setup, or a setup.sh already exists next to the scenario: leave it alone.
Skip rules apply per-scenario. A tile with five scenarios may need fixture generation on three of them and nothing on the other two.
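The signals and skip rules above can be sketched as a small per-scenario classifier. This is a rough illustration only: the function names, the phrase list, and plain substring matching are assumptions, and it covers just the phrase-based and command-based signals (not file-path references in task.md or checks in criteria.json).

```python
from pathlib import Path

# Install/build commands named in this reference.
INSTALL_COMMANDS = ("npm install", "pip install", "brew install", "cargo build")

# Assumption: a small phrase list standing in for the fixture signal.
EXISTING_CODE_PHRASES = ("this codebase", "your repo", "existing code", "refactor")


def user_declared(scenario: dict, scenario_dir: Path) -> dict:
    """Report which kinds of preparation the user has already declared."""
    return {
        "fixtures": "fixtures" in scenario,
        "setup": "setup" in scenario or (scenario_dir / "setup.sh").exists(),
    }


def classify(scenario: dict, scenario_dir: Path, skill_md: str, task_md: str) -> dict:
    """Decide, for one scenario, whether to generate a fixture and/or setup.sh."""
    text = f"{skill_md}\n{task_md}".lower()
    fixture_signal = any(phrase in text for phrase in EXISTING_CODE_PHRASES)
    setup_signal = any(command in text for command in INSTALL_COMMANDS)
    declared = user_declared(scenario, scenario_dir)
    return {
        # Skip rules win: never generate over a user declaration.
        "generate_fixture": fixture_signal and not declared["fixtures"],
        "generate_setup": setup_signal and not declared["setup"],
        "user_declared": declared,
    }
```

Calling classify once per scenario keeps the decision per-scenario, so one tile can yield a mix of generated, user-declared, and untouched scenarios.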
When the fixture signal fires for a scenario:
Scan the tile silently. Look in SKILL.md, the tile's docs/, and any README.md for:
- a repo URL (an https://github.com/… or a git@github.com:… reference) together with an obvious ref (branch name, tag, or commit), OR
- a sample path in the tile (examples/, fixtures/).
If a plausible source is found, write it into scenario.json under fixtures.<name> using the schema above. Name the fixture descriptively: codebase is conventional for the primary repo snapshot; use examples or similar for directory fixtures.
If nothing plausible is found, ask the user once:
"Scenario <slug> looks like it needs an existing codebase to operate on, but I couldn't find a repo URL or sample path in the tile. Want to provide one (<repo-url>#<ref> or a local directory path), or skip fixture generation for this scenario?"
If the user declines or can't provide a source, skip fixture generation for that scenario, record it in the pre-run summary, and continue. Do not stop the phase — the user has explicitly opted into a degraded eval.
Be silent when signals are clear. Only ask when sourcing genuinely can't be inferred.
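As a minimal sketch of the sourcing steps above, the snippet below finds a GitHub reference in tile text and writes a commit fixture into a scenario record. The helper names and the regular expression are hypothetical. Choosing the ref is left to the caller, since what counts as an "obvious ref" is a judgment call.

```python
import re
from typing import Optional

# Assumption: only GitHub HTTPS and SSH references are recognised.
REPO_PATTERN = re.compile(
    r"(https://github\.com/[\w.-]+/[\w.-]+|git@github\.com:[\w.-]+/[\w.-]+)"
)


def find_repo_url(text: str) -> Optional[str]:
    """Return the first GitHub repo reference in the text, or None."""
    match = REPO_PATTERN.search(text)
    if match is None:
        return None
    # Drop a trailing full stop picked up from surrounding prose.
    return match.group(1).rstrip(".")


def add_commit_fixture(
    scenario: dict, repo_url: str, ref: str, name: str = "codebase"
) -> dict:
    """Return a scenario record with a commit fixture added.

    If the scenario already declares fixtures, it is returned unchanged
    (skip rule: leave user declarations alone).
    """
    if "fixtures" in scenario:
        return scenario
    updated = dict(scenario)
    updated["fixtures"] = {
        name: {"type": "commit", "repoUrl": repo_url, "ref": ref}
    }
    return updated
```

When find_repo_url returns None and no sample directory is found either, the flow falls back to asking the user once, as described above.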
When the setup-script signal fires for a scenario:
Generate setup.sh next to the scenario (in the same directory as scenario.json / task.md). The eval system auto-detects this file — there's no need to declare it in scenario.json unless you want to be explicit.
Script content: the minimum init commands implied by the signal. Include a shebang on the first line — the scenario lint rule warns without it.
#!/bin/bash
set -euo pipefail
npm install
chmod +x setup.sh after writing so the eval runner can execute it.
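The generation steps above (write setup.sh next to the scenario, start with a shebang, mark it executable) might look like the following sketch. The function name write_setup_script is hypothetical, and the demo uses a throwaway temporary directory rather than a real scenario directory.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import List


def write_setup_script(scenario_dir: Path, commands: List[str]) -> Path:
    """Write setup.sh next to scenario.json and make it executable."""
    script = scenario_dir / "setup.sh"
    if script.exists():
        # Skip rule: never overwrite an existing setup.sh.
        return script
    lines = ["#!/bin/bash", "set -euo pipefail", *commands]
    script.write_text("\n".join(lines) + "\n")
    # Equivalent of chmod +x: add execute bits for user, group, and others.
    mode = script.stat().st_mode
    script.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return script


# Usage: write a script into a temporary directory and inspect it.
demo_dir = Path(tempfile.mkdtemp())
demo_script = write_setup_script(demo_dir, ["npm install"])
print(demo_script.read_text())
print("executable:", os.access(demo_script, os.X_OK))
```

Returning early when setup.sh already exists keeps the helper safe to call on every scenario without clobbering user-written scripts.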
If the signal isn't specific enough to write a concrete script (e.g. you can see the tile expects a database but not which migrations to run), ask the user to confirm the commands before writing:
"Scenario <slug> looks like it needs setup before the agent runs. My best guess is:
#!/bin/bash
set -euo pipefail
npm install
Use this, or do you want to provide different commands?"
After processing all scenarios — and before kicking off tessl eval run — show the user one concise summary so they can catch wrong inferences early:
Generated 2 fixtures and 1 setup script across 4 scenarios.
- checkout-flow: fixture (commit: acme/example#main), setup.sh (npm install)
- webhook-setup: fixture (commit: acme/example#main)
- custom-scenario: fixture (user-declared), setup.sh (user-declared)
- empty-state: no preparation needed
For scenarios where skip rules fired (a fixtures record, setup array, or setup.sh was already present), list each as user-declared rather than no preparation needed, so the user can confirm their declarations were respected and catch any mismatches.
If any scenario was skipped because the user declined to provide a source, list it here so the user knows the eval will run that scenario in an empty workdir.
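One way to assemble the pre-run summary is sketched below. The record shape (slug, fixture, setup, declined) is an assumption made for illustration, as is the wording of the line for scenarios where the user declined to provide a source.

```python
from typing import List

USER_DECLARED = "user-declared"


def plural(count: int, noun: str) -> str:
    """Format a count with a naively pluralised noun."""
    return f"{count} {noun}" + ("" if count == 1 else "s")


def format_summary(results: List[dict]) -> str:
    """Build the summary text from per-scenario result records."""
    # Only generated items count towards the header; user-declared ones do not.
    fixtures = sum(1 for r in results if r.get("fixture") not in (None, USER_DECLARED))
    setups = sum(1 for r in results if r.get("setup") not in (None, USER_DECLARED))
    lines = [
        f"Generated {plural(fixtures, 'fixture')} and "
        f"{plural(setups, 'setup script')} across {plural(len(results), 'scenario')}."
    ]
    for r in results:
        if r.get("declined"):
            # Assumed wording for scenarios the user chose to skip.
            lines.append(f"- {r['slug']}: skipped by user, runs in an empty workdir")
            continue
        parts = []
        if r.get("fixture"):
            parts.append(f"fixture ({r['fixture']})")
        if r.get("setup"):
            parts.append(f"setup.sh ({r['setup']})")
        lines.append(f"- {r['slug']}: " + (", ".join(parts) or "no preparation needed"))
    return "\n".join(lines)


results = [
    {"slug": "checkout-flow", "fixture": "commit: acme/example#main", "setup": "npm install"},
    {"slug": "webhook-setup", "fixture": "commit: acme/example#main"},
    {"slug": "custom-scenario", "fixture": USER_DECLARED, "setup": USER_DECLARED},
    {"slug": "empty-state"},
]
print(format_summary(results))
```

With these four records the output matches the example summary shown above.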