Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins
Overall score: 90%
Does it follow best practices?
Validation for skill structure
{
  "context": "Testing whether an agent following the eval-setup skill correctly guides a new user through the full eval setup pipeline, including prerequisites, commit browsing, context file detection, scenario generation, and running the eval.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "checks_prerequisites",
      "description": "The agent verifies the user is logged in (e.g., runs tessl whoami) before proceeding with setup steps.",
      "max_score": 1
    },
    {
      "name": "browses_commits",
      "description": "The agent runs `tessl repo select-commits acme/backend` to show actual commits from the repo, rather than asking the user to supply commit hashes directly without any browsing step.",
      "max_score": 3
    },
    {
      "name": "auto_detects_context_files",
      "description": "The agent searches the repository for context files (CLAUDE.md, *.mdc, AGENTS.md, tessl.json, etc.) automatically, rather than asking the user to specify them without any investigation.",
      "max_score": 2
    },
    {
      "name": "uses_context_flag",
      "description": "The agent includes a `--context` flag when running `tessl scenario generate`, specifying appropriate glob patterns for the detected context files.",
      "max_score": 2
    },
    {
      "name": "workspace_in_eval_run",
      "description": "The agent includes `--workspace=<name>` when running `tessl eval run`. Omitting --workspace would cause the command to fail.",
      "max_score": 2
    },
    {
      "name": "explains_baseline_vs_context",
      "description": "The agent explains that each scenario runs twice — once without context files (baseline) and once with them injected — and that the delta shows whether CLAUDE.md is helping the agent.",
      "max_score": 2
    }
  ]
}
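The checklist above implies a command sequence an agent would walk the user through. A sketch of that pipeline, assembled only from the commands named in the checklist: the workspace name and the exact shape of the `--context` glob arguments are placeholders, and real CLI syntax may differ.

```shell
# 1. Prerequisite check: confirm the user is logged in.
tessl whoami

# 2. Browse real commits from the repo instead of asking the
#    user to paste commit hashes.
tessl repo select-commits acme/backend

# 3. After detecting context files (CLAUDE.md, *.mdc, AGENTS.md,
#    tessl.json), pass them via --context when generating scenarios.
#    The glob pattern here is illustrative.
tessl scenario generate --context "CLAUDE.md"

# 4. Run the eval. --workspace is required; omitting it fails.
#    "my-workspace" is a placeholder name.
tessl eval run --workspace=my-workspace
```

Each generated scenario then runs twice — once without the context files (baseline) and once with them injected — and the score delta indicates whether CLAUDE.md is actually helping the agent.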