experiments/eval-setup

Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins

Quality: 97% (Does it follow best practices?)
Impact: none yet (No eval scenarios have been run)
Security: Advisory (Suggest reviewing before use)

eval-setup — Codebase Eval Generation & Execution Pipeline

Name: experiments/eval-setup
Rating: 77.6 (1 review)
Author: experiments

This is a Tessl skill (published as experiments/eval-setup) that automates the full eval setup pipeline: from browsing commits to generating scenarios, running multi-agent evals, and comparing results. It's the companion to eval-improve — this skill creates the eval foundation, that skill iterates on the results.

Install

tessl install experiments/eval-setup

Companion skill: This skill pairs with eval-improve (tessl install experiments/eval-improve), which takes over after evals are running — analyzing results, diagnosing failures, fixing tile content, and re-verifying. Use eval-setup first, then eval-improve to iterate.

What it does

Phase 1 — Gather Context

Identifies the repo, workspace, and checks for existing scenarios on disk. Offers to merge new scenarios with existing ones or replace them.

Phase 2 — Select Commits

Browses recent commits with tessl repo select-commits and supports filtering by keyword, author, and date range. Lets you pick which commits to generate scenarios from and asks how many scenarios you want.
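The filtering described above can be pictured with a small Python sketch. This is not how tessl repo select-commits is implemented, and the Commit record and filter_commits helper are hypothetical names invented for illustration; the sketch only shows how keyword, author, and date-range filters might combine when each one is optional.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commit:
    # Hypothetical record; not the CLI's actual output format.
    sha: str
    author: str
    message: str
    committed: date


def filter_commits(commits, keyword=None, author=None, since=None, until=None):
    """Keep the commits that match every filter that was supplied."""
    selected = []
    for c in commits:
        if keyword and keyword.lower() not in c.message.lower():
            continue
        if author and c.author != author:
            continue
        if since and c.committed < since:
            continue
        if until and c.committed > until:
            continue
        selected.append(c)
    return selected


# Made-up sample history, for illustration only.
commits = [
    Commit("a1b2c3d", "dana", "Add checkout retry logic", date(2025, 3, 2)),
    Commit("e4f5a6b", "lee", "Fix webhook signature check", date(2025, 3, 9)),
    Commit("c7d8e9f", "dana", "Bump dependencies", date(2025, 4, 1)),
]

picked = filter_commits(commits, keyword="checkout", author="dana", since=date(2025, 3, 1))
print([c.sha for c in picked])
```

Omitted filters are simply skipped, so calling filter_commits(commits) with no arguments returns the full list.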

Phase 3 — Generate Scenarios

Runs tessl scenario generate with your chosen commits and context patterns. Polls for completion, reviews what was generated, and asks for approval before downloading.

Phase 4 — Download Scenarios

Downloads scenarios to evals/ with merge or replace strategy. Offers to review and edit task.md and criteria.json before running — you can adjust criteria weights or task descriptions.
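To make the merge and replace strategies concrete, here is a minimal Python sketch. It assumes each scenario is identified by a name and holds a task.md and a criteria.json; the combine_scenarios helper, the dictionary shape, and the rule that a newly downloaded scenario wins on a name clash are all assumptions made for illustration, not documented behavior of the skill.

```python
def combine_scenarios(existing, downloaded, strategy):
    """Return the scenario set that would end up under evals/.

    existing and downloaded map a scenario name to its files
    (a hypothetical in-memory shape, not the skill's real data model).
    """
    if strategy == "replace":
        # Discard everything already on disk and keep only the new download.
        return dict(downloaded)
    if strategy == "merge":
        combined = dict(existing)
        # Assumption: on a name clash the downloaded scenario overwrites the old one.
        combined.update(downloaded)
        return combined
    raise ValueError(f"unknown strategy: {strategy!r}")


existing = {
    "checkout-flow": {"task.md": "old task", "criteria.json": "{}"},
    "api-versioning": {"task.md": "old task", "criteria.json": "{}"},
}
downloaded = {
    "checkout-flow": {"task.md": "new task", "criteria.json": "{}"},
    "webhook-setup": {"task.md": "new task", "criteria.json": "{}"},
}

merged = combine_scenarios(existing, downloaded, "merge")
replaced = combine_scenarios(existing, downloaded, "replace")
print(sorted(merged))
print(sorted(replaced))
```

Merging keeps earlier scenarios available for later comparison runs, while replacing starts the eval set over from the newly selected commits.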

Phase 5 — Configure and Run Evals (multi-agent)

Supports multi-agent comparison across:

Agent	Models
`claude`	`claude-sonnet-4-6` (default), `claude-haiku-4-5`
`cursor`	`auto`, `composer-1.5`
`codex`	`o3`

Each agent runs baseline (no context) and with-context automatically. You choose the context reference point (infer, HEAD, or a specific commit SHA). The skill asks you which agents to test and explains the cost tradeoffs before running.
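The cost tradeoff follows from simple multiplication: every selected agent/model pair runs every scenario twice, once as baseline and once with context. The Python sketch below enumerates that run matrix. The agent and model names come from the table above; the plan_runs helper and the assumption that each scenario is run once per variant are illustrative, not taken from the skill.

```python
from itertools import product

# Agents and models as listed in the table above.
AGENT_MODELS = {
    "claude": ["claude-sonnet-4-6", "claude-haiku-4-5"],
    "cursor": ["auto", "composer-1.5"],
    "codex": ["o3"],
}
VARIANTS = ("baseline", "with-context")


def plan_runs(selected, scenarios):
    """Expand (agent, model) pairs into one run per scenario and variant."""
    for agent, model in selected:
        if model not in AGENT_MODELS.get(agent, []):
            raise ValueError(f"unsupported agent/model: {agent}:{model}")
    return [
        (f"{agent}:{model}", scenario, variant)
        for (agent, model), scenario, variant in product(selected, scenarios, VARIANTS)
    ]


runs = plan_runs(
    [("claude", "claude-sonnet-4-6"), ("codex", "o3")],
    ["checkout-flow", "webhook-setup", "api-versioning"],
)
print(len(runs))  # 2 agent/model pairs x 3 scenarios x 2 variants
```

Adding one more agent/model pair to this example would add six runs, which is why the number of agents is worth deciding before starting.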

Phase 6 — View and Compare Results

Uses tessl eval compare --breakdown for detailed baseline vs. with-context scoring per scenario. For multi-agent runs, shows a side-by-side comparison:

Agent Comparison:

  Agent                      Avg Score   Best Scenario          Worst Scenario
  claude:claude-sonnet-4-6   80%         checkout-flow (87%)    api-versioning (68%)
  cursor:auto                74%         error-recovery (85%)   webhook-setup (58%)
  codex:o3                   71%         checkout-flow (82%)    webhook-setup (52%)
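A table of this shape can be derived from per-scenario scores with a few lines of Python. The sketch below is an illustration rather than the skill's own code: the summarize helper is hypothetical, and the per-scenario numbers are invented so that the averages, best scenarios, and worst scenarios line up with the sample table above.

```python
def summarize(scores_by_agent):
    """Return (agent, average, best scenario, worst scenario), highest average first."""
    rows = []
    for agent, scores in scores_by_agent.items():
        best = max(scores, key=scores.get)
        worst = min(scores, key=scores.get)
        average = sum(scores.values()) / len(scores)
        rows.append((agent, average, best, worst))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Invented with-context scores (percent), chosen to match the sample table.
scores_by_agent = {
    "cursor:auto": {
        "checkout-flow": 78, "api-versioning": 75,
        "error-recovery": 85, "webhook-setup": 58,
    },
    "claude:claude-sonnet-4-6": {
        "checkout-flow": 87, "api-versioning": 68,
        "error-recovery": 84, "webhook-setup": 81,
    },
    "codex:o3": {
        "checkout-flow": 82, "api-versioning": 75,
        "error-recovery": 75, "webhook-setup": 52,
    },
}

summary = summarize(scores_by_agent)
for agent, average, best, worst in summary:
    scores = scores_by_agent[agent]
    print(f"{agent:<26} {average:>3.0f}%   {best} ({scores[best]}%)   {worst} ({scores[worst]}%)")
```

Sorting by average puts the strongest agent first, while the worst-scenario column points at where the context is least helpful.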

Phase 7 — Recommend Next Steps

Based on scores, suggests whether to run eval-improve, generate more diverse scenarios, or tighten eval criteria.
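One way to picture this decision is as a small rule set over the average scores. The Python sketch below is purely illustrative: the recommend helper, its inputs, and every threshold in it are assumptions chosen for the example, since the skill's actual criteria are not stated here.

```python
def recommend(baseline_avg, context_avg, scenario_count):
    """Map average scores (percent) to the three next steps named above.

    All thresholds are illustrative assumptions, not values used by the skill.
    """
    suggestions = []
    if baseline_avg >= 90:
        # Agents already pass without context, so the criteria may be too lenient.
        suggestions.append("tighten eval criteria")
    if scenario_count < 5:
        # A small scenario set is unlikely to cover the codebase well.
        suggestions.append("generate more diverse scenarios")
    if context_avg < 80 or context_avg - baseline_avg < 5:
        # Context scores are low, or context adds little over the baseline.
        suggestions.append("run eval-improve")
    return suggestions


print(recommend(baseline_avg=62, context_avg=70, scenario_count=8))
print(recommend(baseline_avg=95, context_avg=96, scenario_count=3))
```

An empty list means none of the assumed rules fired, for example with a low baseline and a large lift from context.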

Human in the loop

The skill asks for your confirmation at every decision point:

Which commits to use for scenario generation
How many scenarios to generate
Whether to review/edit scenarios before running
Which agents and models to test
Whether to proceed to eval-improve after seeing results

How the two skills work together

eval-setup                           eval-improve
─────────────────────────           ─────────────────────────
commits → scenarios → run evals  →  analyze → diagnose → fix → re-run → verify
         ↑                                                               │
         └─────────── generate new scenarios for next round ─────────────┘

What each skill covers

What you need to do	eval-setup	eval-improve
Pick which commits to use	Guides the decision with filtering	—
Choose context patterns	Explains patterns, suggests defaults	—
Generate scenarios from diffs	Runs generation, polls, reviews	—
Edit scenarios before running	Offers review of task.md and criteria.json	—
Choose agents/models	Presents options, explains cost tradeoffs	—
Run evals	Runs with configured agents, polls, retries failures	Re-runs after fixes
Compare baseline vs. with-context	`eval compare --breakdown` + multi-agent tables	`eval compare --breakdown` on every iteration
Interpret what scores mean	Observations + recommendations	4-bucket classification (Working/Gap/Redundant/Regression)
Diagnose why a score is low	—	Reads rubric + tile files, finds gaps and contradictions
Fix the tile content	—	Proposes minimal edits matching rubric language, lints
Verify fixes worked	—	Re-runs, compares before/after, offers another pass
Audit scenario quality	—	Reviews task realism, criteria weighting, coverage gaps

Expanding on the docs

The official Tessl docs at docs.tessl.io/evaluate/evaluating-your-codebase describe the CLI commands and flags. These two skills turn that reference into an opinionated, agent-driven workflow:

Decision guidance at every step — the docs tell you what each command does; the skills tell the agent when to use it and what to ask you first
Multi-agent comparison workflow — the docs show the --agent flag; eval-setup turns it into a guided experience with comparison tables
The 4-bucket framework — not in the docs; eval-improve introduces a structured way to classify and act on results
Cross-file contradiction detection — not in the docs; eval-improve scans tile files for conflicting instructions
Iterative improvement loop — the docs describe a one-shot pipeline; together the skills create a repeatable cycle

Workspace: experiments
Visibility: Public
Created: 4 months ago
Last updated: 4 months ago
Publish Source: CLI