Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
Models tested by default: claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6 (cheapest to most capable)
Eval command: tessl eval run <path/to/tile> --agent=... --label <run-label>
Every tessl eval run invocation MUST include --label <run-label> so the run is identifiable in tessl eval list. The label is a short, human-readable description of what the run is about — not a structured ID.
Compose <run-label> from whatever helps you recognise the run later when scanning the list. Typical ingredients:
- activation, baseline, initial evals, verification
- description rewrite, plan-solution fixes, clean scenario
- (haiku-4-5), (sonnet-4-6)
- v0.5.0, v4
Examples:
- repro-clean-scenario
- task-prep v0.3.0 baseline
- task-prep v0.5.0 plan-solution fixes
- v4-final-verification
- skill-insights activation (haiku-4-5)
- skill-insights initial evals (haiku-4-5)
Keep it concise: what the run was about should be obvious without opening it.
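As a minimal sketch of composing such a label, the helper below joins a tile name, a purpose, and an optional model in the style of the examples above. The function name `make_label` and its argument order are hypothetical and are not part of the tessl CLI.

```shell
# Hypothetical helper (not a tessl command): build a short, human-readable label.
# Usage: make_label <tile-name> <purpose> [model]
make_label() {
  tile="$1"
  purpose="$2"
  model="${3:-}"
  if [ -n "$model" ]; then
    # Model goes in parentheses, as in "skill-insights activation (haiku-4-5)".
    printf '%s %s (%s)\n' "$tile" "$purpose" "$model"
  else
    printf '%s %s\n' "$tile" "$purpose"
  fi
}
```

For example, `make_label skill-insights activation haiku-4-5` prints a label that can be passed to `--label`.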
Look for a tile.json in the current directory or a parent/sibling directory. Exclude .tessl/ cache directories:
find . -name "tile.json" -not -path "*/node_modules/*" -not -path "*/.tessl/*" 2>/dev/null | head -10
If the user provides a path inside a .tessl/tiles/ directory (an installed tile cache), stop and warn them: that path is Tessl's local install cache, so running evals from there won't work and changes would be overwritten on the next tessl install. Offer two options: point to the original tile source, or copy the tile out of .tessl/tiles/ to a new location (cp -r .tessl/tiles/<workspace>/<tile> ./<tile>).
If multiple tiles are found outside .tessl/, ask the user which one to evaluate. If none are found, explain that this skill evaluates a packaged tile and suggest tessl tile new to get started.
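The install-cache check described above could be sketched as a simple path test. The function name `is_install_cache_path` is hypothetical; it only inspects the path string and does not call tessl.

```shell
# Hypothetical helper: succeed (exit 0) when the path lies inside .tessl/tiles/.
is_install_cache_path() {
  case "$1" in
    .tessl/tiles/*|*/.tessl/tiles/*) return 0 ;;  # installed tile cache
    *) return 1 ;;                                # anything else is treated as a source path
  esac
}
```

A caller might warn and stop when `is_install_cache_path "$user_path"` succeeds, then offer the two options listed above.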
ls <tile-dir>/evals/*/task.md 2>/dev/null
If no scenarios exist, inform the user and provide the quickest path to generate them:
tessl scenario generate <path/to/tile> --count=3
tessl scenario download --last
mv ./evals/ <tile-dir>/evals/
Note that scenario generation takes roughly 1–2 minutes per scenario. Also mention the setup-skill-performance skill for a guided walkthrough.
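To decide between the "no scenarios" and "scenarios exist" branches, one option is to count the task.md files directly. This is a sketch; `count_scenarios` is a hypothetical helper that assumes the `<tile-dir>/evals/<scenario>/task.md` layout used by the ls command above.

```shell
# Hypothetical helper: print how many scenario directories contain a task.md.
count_scenarios() {
  tile_dir="$1"
  count=0
  for task in "$tile_dir"/evals/*/task.md; do
    # When the glob matches nothing it stays literal, so test that the file exists.
    if [ -f "$task" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

A result of 0 corresponds to the generate-and-download path; any other value means the scenarios can be listed.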
If scenarios exist, read the task.md from each scenario directory and list them:
Found N scenarios:
- <scenario-slug>: <one-line description from task.md>
- <scenario-slug>: ...
tessl whoami
If not logged in, ask the user to run tessl login before continuing.
Confirm the default model set with the user: claude-haiku-4-5 (fast/cheap), claude-sonnet-4-6 (default), claude-opus-4-6 (most capable). This runs 3 sequential eval jobs. Each scenario takes roughly 10–15 minutes per model, so with N scenarios expect around N×30–45 minutes total. Ask whether to proceed with all three or a subset.
Ask whether to run each scenario once (default, good for a first pass) or three times (recommended before publishing — triples the time but gives more stable averages). Remind the user of the time implications given N scenarios and 3 models.
If they choose more than 1, add --runs=<n> to all eval run commands in Phase 3.
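The time estimate above is plain arithmetic: scenarios × models × runs × 10–15 minutes. The sketch below makes that explicit; `estimate_minutes` is a hypothetical helper, and the 10 and 15 minute bounds are the rough per-scenario, per-model figures quoted above, not measured values.

```shell
# Hypothetical helper: print a "low-high" range of total minutes.
# Usage: estimate_minutes <scenarios> [models=3] [runs=1]
estimate_minutes() {
  scenarios="$1"
  models="${2:-3}"
  runs="${3:-1}"
  low=$((scenarios * models * runs * 10))   # 10 minutes per scenario per model
  high=$((scenarios * models * runs * 15))  # 15 minutes per scenario per model
  echo "${low}-${high}"
}
```

With 5 scenarios, 3 models, and a single run this gives 150-225 minutes; choosing three runs triples both bounds.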
Run models one at a time, fully completing each before starting the next. Do NOT start multiple eval runs back-to-back — even without bash & or background jobs, kicking off all three then polling them causes them to execute concurrently on the server, which inflates cost and creates noisy interactions between runs.
For each model in your chosen set, in order:
tessl eval run <path/to/tile> --agent=claude:<model> [--runs=<n>] --label <run-label>
Note the run ID from the output (Eval run started: <id>). Store it mapped to the model name for Phase 5.
Note the run URL (https://tessl.io/eval-runs/<id>).
Update the user as each model finishes (e.g., "✔ haiku complete (<id>). Starting sonnet…").
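Capturing the run ID can be sketched as a small text filter over the command's output. Both helpers below are hypothetical. The sketch assumes the ID is the single token following "Eval run started:" and contains no spaces; the exact output format is not confirmed here.

```shell
# Hypothetical helper: read command output on stdin, print the first run ID found.
extract_run_id() {
  sed -n 's/.*Eval run started: *\([^ )]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: build the run URL from an ID, using the URL pattern given above.
run_url() {
  printf 'https://tessl.io/eval-runs/%s\n' "$1"
}
```

One way to use it is to pipe the output of the eval run command through `extract_run_id` and keep the result next to the model name for Phase 5.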
For the current model's run, poll with tessl eval view <id> every few minutes until you see Status: ✔ Completed or Status: ✖ Failed. Only then loop back to Phase 3 to start the next model.
If a run fails, retry it:
tessl eval retry <id>
Once every model in the chosen set has reached Completed, proceed to Phase 5.
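The polling step can be sketched as a classifier over the text printed by tessl eval view. `classify_status` is a hypothetical helper; it matches only the two status strings quoted above and treats everything else as still running.

```shell
# Hypothetical helper: read `tessl eval view <id>` output on stdin and print
# one of: completed, failed, running.
classify_status() {
  output="$(cat)"
  case "$output" in
    *"Status: ✔ Completed"*) echo completed ;;
    *"Status: ✖ Failed"*)    echo failed ;;
    *)                       echo running ;;
  esac
}

# Usage sketch (not executed here; the 180-second interval is an assumption):
#   while [ "$(tessl eval view "$id" | classify_status)" = running ]; do
#     sleep 180
#   done
```

If the loop ends in the failed state, the retry command above applies before moving on to the next model.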
Fetch full results for each run:
tessl eval view <id> --json
Parse both the baseline (without skill) and with skill scores for every scenario and criterion.
Model Comparison — <tile-name>
Model                      Without Skill   With Skill   Delta
─────────────────────────────────────────────────────────────
claude:claude-haiku-4-5    XX%             YY%          +ZZpp
claude:claude-sonnet-4-6   XX%             YY%          +ZZpp
claude:claude-opus-4-6     XX%             YY%          +ZZpp
For each scenario, show its name, a one-line description (from task.md), and both scores per model:
Scenario: <slug>
What it tests: <description>
Model      Without Skill   With Skill   Delta
─────────────────────────────────────────────
haiku      XX%             YY%          +ZZpp
sonnet     XX%             YY%          +ZZpp
opus       XX%             YY%          +ZZpp
Use symbols: ✅ ≥ 80% · 🟡 ≥ 50% · 🔴 < 50%
Criterion Breakdown — with skill
Criterion               haiku     sonnet    opus
─────────────────────────────────────────────────
checks_prerequisites    ✅100%    ✅100%    ✅100%
browses_commits         🔴  0%    🔴 33%    ✅100%
...
Before discussing the skill's impact, note what the without-skill scores reveal: high baselines (≥80%) mean the skill adds little; low baselines (<50%) are where the skill earns its place; baselines that diverge across models indicate which tier benefits most from the skill.
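The symbol thresholds, the delta column, and the baseline reading above can be sketched as three small helpers. All three names are hypothetical, and the sketch assumes scores are whole-number percentages. The "mid baseline" wording is an assumption for the 50–79% range, which the text above does not characterise.

```shell
# Hypothetical helper: map a percentage to the symbol scale (✅ ≥ 80, 🟡 ≥ 50, 🔴 < 50).
score_symbol() {
  if [ "$1" -ge 80 ]; then echo "✅"
  elif [ "$1" -ge 50 ]; then echo "🟡"
  else echo "🔴"
  fi
}

# Hypothetical helper: with-skill minus without-skill, in percentage points.
delta_pp() {
  diff=$(($2 - $1))
  if [ "$diff" -ge 0 ]; then echo "+${diff}pp"; else echo "${diff}pp"; fi
}

# Hypothetical helper: interpret a without-skill (baseline) score.
baseline_note() {
  if [ "$1" -ge 80 ]; then echo "high baseline: skill adds little"
  elif [ "$1" -lt 50 ]; then echo "low baseline: skill earns its place"
  else echo "mid baseline"
  fi
}
```

For a scenario scoring 40% without the skill and 85% with it, `delta_pp 40 85` prints +45pp and `score_symbol 85` prints ✅.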
For criteria that score poorly (with skill):
Give a plain-language summary of the combined baseline + with-skill picture for each model, calling out which scenarios show the largest delta, any regressions, and whether the skill is earning its place across the tested model range.
If regressions exist on any model: Recommend fixing before publishing. Suggest running the optimize-skill-performance skill against the affected run IDs.
If haiku-specific gaps only: Note that the skill works well for sonnet and opus users, and offer to suggest specific wording changes to simplify instructions for haiku.
If all models score well (≥ 80% with skill):
tessl tile publish <path/to/tile>
If results are variable / runs=1 was used: Recommend re-running with --runs=3 before publishing to average out variance.
Always offer: "Want me to re-run the comparison after any fixes to verify improvement?"
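The recommendation rules above can be approximated by a small decision helper. `recommend` is hypothetical and deliberately simplified: it covers only the regression and all-models-pass cases, takes one `without:with` pair of whole-number percentages per model, and treats every other outcome as a gap to investigate.

```shell
# Hypothetical helper: print a recommendation from "without:with" score pairs.
# Usage: recommend 40:85 50:90 60:95
recommend() {
  regress=0
  below=0
  for pair in "$@"; do
    without="${pair%%:*}"   # text before the colon
    with="${pair##*:}"      # text after the colon
    if [ "$with" -lt "$without" ]; then regress=1; fi  # skill made this model worse
    if [ "$with" -lt 80 ]; then below=1; fi            # under the 80% bar with skill
  done
  if [ "$regress" -eq 1 ]; then echo "fix regressions before publishing"
  elif [ "$below" -eq 0 ]; then echo "ready to publish"
  else echo "investigate gaps"
  fi
}
```

The haiku-specific and variance cases still need human judgment, so the output is a starting point for the summary, not a replacement for it.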
Stop when: