Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins
Download each generation run by its ID (not --last when multiple commits were used):
tessl scenario download <run-id-1> -o ./evals/
Repeat for each generation run ID. Use --strategy merge when adding to existing scenarios; use --strategy replace only if the user explicitly asked to replace.
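The per-run download step can be scripted so that no run ID is skipped. The following is a minimal sketch that only builds the command lines and does not execute them. It uses the `tessl scenario download`, `-o`, and `--strategy` pieces exactly as they appear above; the placement of `--strategy` at the end of the command, the helper name `build_download_commands`, and the run IDs in the demo are assumptions for illustration.

```python
import shlex


def build_download_commands(run_ids, out_dir="./evals/", strategy=None):
    """Return one argv list per generation run ID.

    strategy is None (flag omitted), "merge", or "replace". Where the
    --strategy flag goes on the command line is an assumption.
    """
    if strategy not in (None, "merge", "replace"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    commands = []
    for run_id in run_ids:
        argv = ["tessl", "scenario", "download", run_id, "-o", out_dir]
        if strategy is not None:
            argv += ["--strategy", strategy]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Hypothetical run IDs; substitute the IDs reported by each generation run.
    for argv in build_download_commands(["run-id-1", "run-id-2"], strategy="merge"):
        print(shlex.join(argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` once the commands have been reviewed. Defaulting to no strategy flag keeps `replace` an explicit choice, in line with the rule that it is used only on the user's request.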
To avoid conflicts with existing scenarios, download into a subdirectory:
tessl scenario download <run-id-1> -o ./evals/<repo-name>/
ls evals/*/task.md
Show the user the downloaded scenario structure:
Downloaded scenarios:
evals/
  a1b2c3d-checkout-flow/
    task.md
    criteria.json
    scenario.json
  d4e5f6g-webhook-setup/
    task.md
    criteria.json
    scenario.json

Before asking the user, read each criteria.json and task.md yourself and flag these common problems:
Rubric anti-patterns to catch:
- Does task.md contain specific values (version numbers, URLs, class names) that are also rubric criteria? If a criterion only checks whether the agent copied a value from the task prompt, it's a free point. Remove the value from the task or remove the criterion.
- Is no_unrelated_changes included as a criterion? It scores 1 on nearly every solution, so it doesn't discriminate. Remove it unless the scenario specifically tests scope discipline on a large codebase.
Present your findings:
"I reviewed the downloaded scenarios. Here's what I found:
checkout-flow — Looks good. 7 criteria covering integration, edge cases, and design patterns.
renovate-config — Problem: This is a single-line config change. The rubric has 3 criteria but they all check the same substitution. I recommend removing this scenario and picking a more complex commit.
api-versioning — Minor issue: criterion 'uses_correct_version' checks for version 3.18.0, which is already stated in task.md. I'd remove the version from the task or drop this criterion.
Want me to fix these issues, remove the weak scenarios, or proceed as-is?"
Then offer the standard review options:
"You can also:
- Review task.md — see what the agent will be asked to do
- Review criteria.json — see what the rubric checks for
- Edit criteria weights — adjust which criteria matter most
- Proceed to eval run — use the scenarios as-is"
If the user wants to review, read and display the relevant files. Apply any edits they request.
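A first pass over the two rubric anti-patterns described earlier can be automated before reading the files by hand. The sketch below is illustrative only: the schema of criteria.json is not specified here, so the code walks every string in the file rather than assuming field names, and the patterns for "specific values" (semantic-version-like numbers and URLs) are an assumed, non-exhaustive choice. The helper names `iter_strings` and `review_scenario` are hypothetical.

```python
import json
import re
from pathlib import Path

# Illustrative patterns for "specific values"; class names are not covered.
VALUE_PATTERNS = [
    re.compile(r"\b\d+\.\d+\.\d+\b"),       # version numbers such as 3.18.0
    re.compile(r"https?://[^\s)\"'>]+"),    # URLs
]


def iter_strings(node):
    """Yield every string in a parsed JSON value, including object keys."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def review_scenario(scenario_dir):
    """Return a list of findings for one scenario directory."""
    scenario_dir = Path(scenario_dir)
    task = (scenario_dir / "task.md").read_text(encoding="utf-8")
    criteria = json.loads(
        (scenario_dir / "criteria.json").read_text(encoding="utf-8")
    )
    rubric_text = "\n".join(iter_strings(criteria))

    findings = []
    # Anti-pattern 1: a value stated in the task is also checked by the rubric.
    task_values = sorted(
        {m.group(0) for p in VALUE_PATTERNS for m in p.finditer(task)}
    )
    for value in task_values:
        if value in rubric_text:
            findings.append(
                f"free point: '{value}' appears in both task.md and criteria.json"
            )
    # Anti-pattern 2: a criterion that nearly every solution satisfies.
    if "no_unrelated_changes" in rubric_text:
        findings.append("low signal: no_unrelated_changes is included as a criterion")
    return findings
```

Calling `review_scenario` on each directory under evals/ that contains a task.md gives a starting list of findings to present. A substring match can produce false positives, so each finding still needs a manual read of the criterion before recommending a change.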