Run task evals across multiple Claude models, compare results side-by-side, and identify which skill gaps are model-specific versus universal
Models tested by default: claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-6 (cheapest to most capable)
Eval command: tessl eval run <path/to/tile> --agent=...
Look for a tile.json in the current directory or a parent/sibling directory, excluding .tessl/ cache directories:

```shell
find . -name "tile.json" -not -path "*/node_modules/*" -not -path "*/.tessl/*" 2>/dev/null | head -10
```

If the user provides a path inside a .tessl/tiles/ directory (an installed tile cache), stop and warn them: that path is Tessl's local install cache, so running evals from there won't work and any changes would be overwritten on the next tessl install. Offer two options: point to the original tile source, or copy the tile out of .tessl/tiles/ to a new location (cp -r .tessl/tiles/<workspace>/<tile> ./<tile>).
If multiple tiles are found outside .tessl/, ask the user which one to evaluate. If none are found, explain that this skill evaluates a packaged tile and suggest tessl tile new or tessl-labs/tessl-skill-eval-scenarios to get started.
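The cache-path guard above can be sketched as a small shell check. This is a sketch under assumptions: the helper name `in_tessl_cache` is hypothetical, and only the `.tessl/tiles/` layout comes from the notes above.

```shell
# Hypothetical helper: detect when a tile path points into Tessl's
# local install cache (.tessl/tiles/), where evals should not be run.
in_tessl_cache() {
  case "$1" in
    */.tessl/tiles/*|.tessl/tiles/*) return 0 ;;  # inside the install cache
    *) return 1 ;;
  esac
}

if in_tessl_cache "./.tessl/tiles/my-workspace/my-tile"; then
  echo "warn: this is the install cache; copy the tile out before running evals"
fi
```

A plain `case` pattern match keeps the check POSIX-portable, with no dependence on bashisms.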
Check for existing eval scenarios:

```shell
ls <tile-dir>/evals/*/task.md 2>/dev/null
```

If no scenarios exist, inform the user and provide the quickest path to generate them:

```shell
tessl scenario generate <path/to/tile> --count=3
tessl scenario download --last
mv ./evals/ <tile-dir>/evals/
```

Note that scenario generation takes roughly 1–2 minutes per scenario. Also mention tessl-labs/eval-setup for a guided walkthrough.
If scenarios exist, read the task.md from each scenario directory and list them:

Found N scenarios:
- <scenario-slug>: <one-line description from task.md>
- <scenario-slug>: ...

Check login status with tessl whoami. If not logged in, ask the user to run tessl login before continuing.
Confirm the default model set with the user: claude-haiku-4-5 (fast/cheap), claude-sonnet-4-6 (default), claude-opus-4-6 (most capable). This runs 3 sequential eval jobs. Each scenario takes roughly 10–15 minutes per model, so with N scenarios expect around N×30–45 minutes total. Ask whether to proceed with all three or a subset.
Ask whether to run each scenario once (default, good for a first pass) or three times (recommended before publishing — triples the time but gives more stable averages). Remind the user of the time implications given N scenarios and 3 models.
If they choose more than 1, add --runs=<n> to all eval run commands in Phase 3.
Run one eval per model, in sequence, from the tile's directory. Do NOT run them in parallel; capture each run ID before starting the next.

```shell
tessl eval run <path/to/tile> --agent=claude:<model> [--runs=<n>]
```

Capture the eval run ID from the output (Eval run started: <id>). Store all run IDs mapped to model names. After each run starts, briefly update the user (e.g., "✔ Started haiku: <id>. Starting sonnet…").
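The run-ID capture can be sketched as a tiny parser. The `Eval run started: <id>` line format comes from the note above; the sample ID here is mocked, not a real run, and the live tessl invocation is shown only as a comment.

```shell
# Sketch: pull the run ID out of `tessl eval run` output.
# Assumes the "Eval run started: <id>" line format described above.
extract_run_id() { sed -n 's/.*Eval run started: \(.*\)$/\1/p'; }

# Real usage would pipe tessl output, e.g.:
#   run_id=$(tessl eval run ./my-tile --agent=claude:claude-haiku-4-5 | extract_run_id)
# Here the output line is mocked:
run_id=$(printf 'Eval run started: er_mock_123\n' | extract_run_id)
echo "$run_id"   # prints er_mock_123
```

Keeping the extraction in one function makes it easy to reuse for all three model runs while storing each ID next to its model name.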
After all are started, share the browser monitoring URLs (https://tessl.io/eval-runs/<id>) and note that you'll poll for completion. Poll with tessl eval view <id> every few minutes, checking for Status: ✔ Completed or Status: ✖ Failed.
Report status as runs complete (e.g., haiku ✔, sonnet → In Progress, opus → In Progress).
If a run fails, retry it immediately:

```shell
tessl eval retry <id>
```

Wait until all runs show Completed before proceeding.
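A minimal polling sketch, assuming `tessl eval view` prints the `Status:` line shown above. The live tessl loop is left as a comment since it needs a running eval; the transcript below is mocked, and the helper names are hypothetical.

```shell
# Sketch: parse the Status line from `tessl eval view` output and decide
# whether a run has finished. Line format assumed from the notes above.
run_status() { sed -n 's/^Status: //p'; }
is_done() {
  case "$1" in
    "✔ Completed"|"✖ Failed") return 0 ;;
    *) return 1 ;;
  esac
}

# Polling loop for a live run (commented; requires the tessl CLI):
#   while ! is_done "$(tessl eval view "$run_id" | run_status)"; do sleep 180; done

# Mocked transcript for illustration:
status=$(printf 'Run: er_mock_123\nStatus: ✔ Completed\n' | run_status)
echo "$status"   # prints ✔ Completed
```

Treating both Completed and Failed as terminal lets the loop exit either way, so a failed run can be retried immediately rather than polled forever.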
Fetch full results for each run with tessl eval view <id>. Parse both the baseline (without-skill) and with-skill scores for every scenario and criterion.
Model Comparison — <tile-name>

```
Model                      Without Skill   With Skill   Delta
─────────────────────────────────────────────────────────────
claude:claude-haiku-4-5         XX%            YY%      +ZZpp
claude:claude-sonnet-4-6        XX%            YY%      +ZZpp
claude:claude-opus-4-6          XX%            YY%      +ZZpp
```

For each scenario, show its name, a one-line description (from task.md), and both scores per model:
```
Scenario: <slug>
What it tests: <description>

Model    Without Skill   With Skill   Delta
───────────────────────────────────────────
haiku         XX%            YY%      +ZZpp
sonnet        XX%            YY%      +ZZpp
opus          XX%            YY%      +ZZpp
```

Use symbols: ✅ ≥ 80% · 🟡 ≥ 50% · 🔴 < 50%
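The Delta column and the status symbols above can be computed mechanically. A minimal sketch, assuming integer percentage scores; the helper names `delta_pp` and `symbol_for` are hypothetical:

```shell
# Sketch: format the delta (with-skill minus without-skill) in percentage
# points, and map a score to the symbols used in the tables above.
delta_pp() {
  d=$(( $2 - $1 ))
  if [ "$d" -ge 0 ]; then printf '+%dpp\n' "$d"; else printf '%dpp\n' "$d"; fi
}
symbol_for() {
  if [ "$1" -ge 80 ]; then echo "✅"     # ≥ 80%
  elif [ "$1" -ge 50 ]; then echo "🟡"   # ≥ 50%
  else echo "🔴"                          # < 50%
  fi
}

delta_pp 62 91    # prints +29pp
symbol_for 85     # prints ✅
```

Deltas are reported in percentage points (pp) rather than percent change, so a move from 62% to 91% reads as +29pp.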
Criterion Breakdown — with skill

```
Criterion              haiku    sonnet   opus
─────────────────────────────────────────────
checks_prerequisites   ✅100%   ✅100%   ✅100%
browses_commits        🔴  0%   🟡 33%   ✅100%
...
```

Assess what the baseline (without-skill) scores reveal before discussing the skill's impact:
For criteria that score poorly (with skill):
Give a plain-language summary of the combined baseline + with-skill picture for each model, calling out which scenarios show the largest delta, any regressions, and whether the skill is earning its place across the tested model range.
If regressions exist on any model: recommend fixing before publishing. Suggest running eval-improve against the affected run IDs:

```shell
tessl install tessl-labs/eval-improve
```

Then invoke /eval-improve and share the run ID for the affected model.
If haiku-specific gaps only: Note that the skill works well for sonnet and opus users, and offer to suggest specific wording changes to simplify instructions for haiku.
If all models score well (≥ 80% with skill): suggest publishing:

```shell
tessl tile publish
```

If results are variable or runs=1 was used: recommend re-running with --runs=3 before publishing to average out variance.
Always offer: "Want me to re-run the comparison after any fixes to verify improvement?"
Stop when: