Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins
This is a Tessl skill (published as experiments/eval-setup) that automates the full eval setup pipeline: from browsing commits to generating scenarios, running multi-agent evals, and comparing results. It's the companion to eval-improve — this skill creates the eval foundation, that skill iterates on the results.
```
tessl install experiments/eval-setup
```

**Companion skill:** This skill pairs with `eval-improve` (`tessl install experiments/eval-improve`), which takes over after evals are running: analyzing results, diagnosing failures, fixing tile content, and re-verifying. Use `eval-setup` first, then `eval-improve` to iterate.
Identifies the repo and workspace, then checks for existing scenarios on disk. Offers to merge new scenarios with existing ones or replace them.
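The merge-or-replace decision described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation; the one-directory-per-scenario layout and the function name are assumptions.

```python
from pathlib import Path

def plan_download(evals_dir: Path, new_names: list[str], strategy: str) -> dict:
    """Decide what happens to scenarios already on disk.

    strategy: "merge" keeps existing scenarios and adds new ones
              (colliding names are overwritten by the new versions);
              "replace" discards everything currently in evals/.
    Assumes each scenario lives in its own subdirectory of evals/.
    """
    existing = {p.name for p in evals_dir.iterdir() if p.is_dir()} if evals_dir.exists() else set()
    if strategy == "replace":
        return {"keep": set(), "add": set(new_names), "overwrite": set()}
    overlap = existing & set(new_names)
    return {"keep": existing - overlap, "add": set(new_names) - existing, "overwrite": overlap}
```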
Uses a two-stage analysis to find genuinely challenging commits. First, scans the last 50 commits with hard-skip gates (e.g., fewer than 4 source files or fewer than 50 lines of source code) and prefer signals (new modules, cross-directory changes, 100+ lines). Then, deep-reads the shortlisted diffs and scores each on 7 structural complexity signals: new abstractions, cross-cutting scope, wiring/registration, non-obvious control flow, domain-specific logic, interdependent changes, and no single-point solution. Recommends commits scoring 5+/7 and saves the full analysis to `evals/commit-analysis.md` as an audit trail.
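The two-stage filter can be pictured like this. The thresholds come from the description above; the `Commit` fields and the machine-readable signal names are illustrative assumptions, not the skill's real data model.

```python
from dataclasses import dataclass, field

# The 7 structural complexity signals, as named in the skill's description
SIGNALS = {
    "new_abstractions", "cross_cutting_scope", "wiring_registration",
    "non_obvious_control_flow", "domain_specific_logic",
    "interdependent_changes", "no_single_point_solution",
}

@dataclass
class Commit:
    sha: str
    source_files: int
    source_lines: int
    signals: set = field(default_factory=set)  # subset of SIGNALS found in the diff

def shortlist(commits: list[Commit]) -> list[Commit]:
    """Stage 1: hard-skip gates over the last 50 commits."""
    return [c for c in commits[:50] if c.source_files >= 4 and c.source_lines >= 50]

def recommend(commits: list[Commit]) -> list[Commit]:
    """Stage 2: score shortlisted diffs on the 7 signals; keep those at 5+/7."""
    return [c for c in shortlist(commits) if len(c.signals & SIGNALS) >= 5]
```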
Runs `tessl scenario generate` with your chosen commits and context patterns. Polls for completion, reviews what was generated, and asks for approval before downloading.
Downloads scenarios to `evals/` with a merge or replace strategy. Automatically quality-checks downloaded scenarios for common rubric anti-patterns (answer leakage, double-counting criteria, free-point criteria like `no_unrelated_changes`, trivially easy tasks). Offers to review and edit `task.md` and `criteria.json` before running.
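A sketch of the kind of rubric lint described above, run over a downloaded `criteria.json`. The criterion schema assumed here (a list of objects with `id` and `description` keys) is hypothetical; it only illustrates two of the listed anti-patterns, free-point and double-counted criteria.

```python
import json

FREE_POINT_IDS = {"no_unrelated_changes"}  # criteria that pass almost regardless of the work

def lint_criteria(raw_json: str) -> list[str]:
    """Flag free-point criteria and near-duplicate (double-counted) criteria."""
    criteria = json.loads(raw_json)
    warnings = []
    seen_descriptions = set()
    for c in criteria:
        if c["id"] in FREE_POINT_IDS:
            warnings.append(f"free-point criterion: {c['id']}")
        desc = c["description"].strip().lower()
        if desc in seen_descriptions:
            warnings.append(f"double-counted criterion: {c['id']}")
        seen_descriptions.add(desc)
    return warnings
```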
Supports multi-agent comparison across:
| Agent | Models |
|---|---|
| claude | claude-sonnet-4-6 (default), claude-opus-4-6, claude-sonnet-4-5, claude-opus-4-5, claude-haiku-4-5 |
| cursor | auto, composer-1.5 |
Each agent runs baseline (no context) and with-context automatically. You choose the context reference point (`infer`, `HEAD`, or a specific commit SHA). The skill asks you which agents to test and explains the cost tradeoffs before running.
Uses `tessl eval compare --breakdown` for detailed baseline vs. with-context scoring per scenario. For multi-agent runs, shows a side-by-side comparison:
```
Agent Comparison:

Agent                      Avg Score  Best Scenario         Worst Scenario
claude:claude-sonnet-4-6   80%        checkout-flow (87%)   api-versioning (68%)
cursor:auto                74%        error-recovery (85%)  webhook-setup (58%)
```

Based on the scores, it suggests whether to run `eval-improve`, generate more diverse scenarios, or tighten eval criteria.
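The side-by-side numbers above can be derived from per-scenario scores roughly like this; the input shape (a mapping of scenario name to score) is an illustrative assumption, not the CLI's output format.

```python
def summarize(agent: str, scores: dict[str, float]) -> str:
    """Produce one comparison row: average score plus best/worst scenario."""
    avg = sum(scores.values()) / len(scores)
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return (f"{agent}  {avg:.0%}  "
            f"{best} ({scores[best]:.0%})  {worst} ({scores[worst]:.0%})")
```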
The skill asks for your confirmation at every decision point, including whether to run `eval-improve` after seeing results.

```
eval-setup                        eval-improve
─────────────────────────         ─────────────────────────
commits → scenarios → run evals → analyze → diagnose → fix → re-run → verify
   ↑                                                                   │
   └──────────────── generate new scenarios for next round ────────────┘
```

| What you need to do | eval-setup | eval-improve |
|---|---|---|
| Pick which commits to use | Guides the decision with filtering | — |
| Choose context patterns | Explains patterns, suggests defaults | — |
| Generate scenarios from diffs | Runs generation, polls, reviews | — |
| Edit scenarios before running | Offers review of task.md and criteria.json | — |
| Choose agents/models | Presents options, explains cost tradeoffs | — |
| Run evals | Runs with configured agents, polls, retries failures | Re-runs after fixes |
| Compare baseline vs. with-context | `eval compare --breakdown` + multi-agent tables | `eval compare --breakdown` on every iteration |
| Interpret what scores mean | Observations + recommendations | 4-bucket classification (Working/Gap/Redundant/Regression) |
| Diagnose why a score is low | — | Reads rubric + tile files, finds gaps and contradictions |
| Fix the tile content | — | Proposes minimal edits matching rubric language, lints |
| Verify fixes worked | — | Re-runs, compares before/after, offers another pass |
| Audit scenario quality | — | Reviews task realism, criteria weighting, coverage gaps |
The official Tessl docs at docs.tessl.io/evaluate/evaluating-your-codebase describe the CLI commands and flags. These two skills turn that reference into an opinionated, agent-driven workflow:
- The docs describe the `--agent` flag; `eval-setup` turns it into a guided experience with comparison tables
- `eval-improve` introduces a structured way to classify and act on results
- `eval-improve` scans tile files for conflicting instructions