Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins
Overall score: 79% (skill-structure validation: does it follow best practices?)
This is a Tessl skill (published as experiments/eval-setup) that automates the full eval setup pipeline: from browsing commits to generating scenarios, running multi-agent evals, and comparing results. It's the companion to eval-improve — this skill creates the eval foundation, that skill iterates on the results.
```
tessl install experiments/eval-setup
```

**Companion skill:** This skill pairs with eval-improve (`tessl install experiments/eval-improve`), which takes over after evals are running: analyzing results, diagnosing failures, fixing tile content, and re-verifying. Use `eval-setup` first, then `eval-improve` to iterate.
1. Identifies the repo and workspace, and checks for existing scenarios on disk. Offers to merge new scenarios with existing ones or replace them.
2. Browses recent commits with `tessl repo select-commits`, with filtering by keyword, author, and date range. Lets you pick which commits to generate scenarios from and asks how many scenarios you want.
3. Runs `tessl scenario generate` with your chosen commits and context patterns. Polls for completion, reviews what was generated, and asks for approval before downloading.
4. Downloads scenarios to `evals/` with a merge or replace strategy. Offers to review and edit `task.md` and `criteria.json` before running, so you can adjust criteria weights or task descriptions.
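The steps above map onto the CLI commands this skill wraps. A minimal sketch of the sequence, showing only the commands named in this workflow (the skill supplies the remaining arguments interactively):

```shell
# Browse and filter recent commits to pick scenario sources
tessl repo select-commits

# Generate scenarios from the chosen commits, then poll for completion
tessl scenario generate

# After evals have run: detailed baseline vs. with-context scoring per scenario
tessl eval compare --breakdown
```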
Supports multi-agent comparison across:
| Agent | Models |
|---|---|
| claude | claude-sonnet-4-6 (default), claude-haiku-4-5 |
| cursor | auto, composer-1.5 |
| codex | o3 |
Each agent runs baseline (no context) and with-context automatically. You choose the context reference point (infer, HEAD, or a specific commit SHA). The skill asks you which agents to test and explains the cost tradeoffs before running.
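The agent matrix above is driven through the `--agent` flag described in the Tessl docs. A hedged sketch of a multi-agent run — the `eval run` subcommand name here is an assumption for illustration, not confirmed CLI syntax:

```shell
# Hypothetical invocations; only the --agent flag is confirmed by the Tessl docs.
# Each agent gets a baseline (no-context) and a with-context run automatically.
tessl eval run --agent claude   # assumed subcommand name
tessl eval run --agent cursor
tessl eval run --agent codex
```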
Uses tessl eval compare --breakdown for detailed baseline vs. with-context scoring per scenario. For multi-agent runs, shows a side-by-side comparison:
```
Agent Comparison:

Agent                      Avg Score  Best Scenario         Worst Scenario
claude:claude-sonnet-4-6   80%        checkout-flow (87%)   api-versioning (68%)
cursor:auto                74%        error-recovery (85%)  webhook-setup (58%)
codex:o3                   71%        checkout-flow (82%)   webhook-setup (52%)
```

Based on the scores, the skill suggests whether to run eval-improve, generate more diverse scenarios, or tighten eval criteria.
The skill asks for your confirmation at every decision point: which commits to use, how many scenarios to generate, whether to download them, whether to edit them first, which agents to run, and whether to hand off to eval-improve after seeing results.

```
eval-setup                                 eval-improve
─────────────────────────                  ─────────────────────────
commits → scenarios → run evals      →     analyze → diagnose → fix → re-run → verify
        ↑                                                                 │
        └─────────── generate new scenarios for next round ───────────────┘
```

| What you need to do | eval-setup | eval-improve |
|---|---|---|
| Pick which commits to use | Guides the decision with filtering | — |
| Choose context patterns | Explains patterns, suggests defaults | — |
| Generate scenarios from diffs | Runs generation, polls, reviews | — |
| Edit scenarios before running | Offers review of task.md and criteria.json | — |
| Choose agents/models | Presents options, explains cost tradeoffs | — |
| Run evals | Runs with configured agents, polls, retries failures | Re-runs after fixes |
| Compare baseline vs. with-context | eval compare --breakdown + multi-agent tables | eval compare --breakdown on every iteration |
| Interpret what scores mean | Observations + recommendations | 4-bucket classification (Working/Gap/Redundant/Regression) |
| Diagnose why a score is low | — | Reads rubric + tile files, finds gaps and contradictions |
| Fix the tile content | — | Proposes minimal edits matching rubric language, lints |
| Verify fixes worked | — | Re-runs, compares before/after, offers another pass |
| Audit scenario quality | — | Reviews task realism, criteria weighting, coverage gaps |
The official Tessl docs at docs.tessl.io/evaluate/evaluating-your-codebase describe the CLI commands and flags. These two skills turn that reference into an opinionated, agent-driven workflow:
- The docs describe the `--agent` flag; eval-setup turns it into a guided experience with comparison tables
- eval-improve introduces a structured way to classify and act on results
- eval-improve scans tile files for conflicting instructions

Install with the Tessl CLI:

```
npx tessl i experiments/eval-setup
```