Guide for running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison. Covers non-determinism baseline, recommended sample sizes, and result interpretation.
Use this guide when you are trying to answer "did this skill or prompt change actually help?" for agent-tty evals.
The short version: do not trust a single run. This eval stack now supports multi-trial sampling, parallel execution, trial aggregation, and paired baseline comparison because the underlying model behavior is noisy enough that one pass/fail result is not decision-grade.
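To see why one pass/fail result is not decision-grade, it can help to look at how wide the uncertainty on a pass rate is at small trial counts. The sketch below computes a Wilson score interval, which is a standard formula and not something the eval stack is confirmed to use; the function name `wilsonInterval` is made up for this illustration.

```typescript
// Wilson score interval for a pass rate estimated from `passes` out of `trials`.
// z = 1.96 corresponds to roughly 95% confidence.
// Illustrative only: this is not taken from the eval runner's source.
function wilsonInterval(passes: number, trials: number, z = 1.96): [number, number] {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Compare the interval width for 1, 5, and 10 trials.
for (const [passes, trials] of [[1, 1], [4, 5], [8, 10]] as const) {
  const [lo, hi] = wilsonInterval(passes, trials);
  console.log(`${passes}/${trials} passed: [${lo.toFixed(2)}, ${hi.toFixed(2)}]`);
}
```

A single passing trial is compatible with a true pass rate anywhere from about 0.21 to 1.00, which is why the interval only becomes useful once several trials are pooled.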
Always set --trials for real prompt or skill experiments.
Recommended trial counts:
- Prompt lane: --trials 5 to --trials 10
- Execution lane: --trials 3
- Dogfood lane: --trials 2 to --trials 3

Use concurrency to keep those sample sizes affordable:

- Use --concurrency 4 for real-provider runs.
- Use --concurrency 1 only when you explicitly want fully serial behavior.

When --trials is greater than 1, reports automatically include Trial Aggregation in report.md and report.json, including a per-case breakdown.
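As a rough picture of what per-case trial aggregation involves, the sketch below groups trial results by case and computes a pass rate plus score mean and standard deviation. The `TrialResult` and `CaseAggregate` shapes and the function name are hypothetical; they are not the actual report.json schema.

```typescript
// Hypothetical shapes for illustration; the real report.json fields may differ.
interface TrialResult {
  caseId: string;
  passed: boolean;
  score: number;
}

interface CaseAggregate {
  caseId: string;
  trials: number;
  passRate: number;
  meanScore: number;
  scoreStdDev: number;
}

// Group trials by case, then summarize each group.
function aggregateTrials(results: TrialResult[]): CaseAggregate[] {
  const byCase = new Map<string, TrialResult[]>();
  for (const r of results) {
    const bucket = byCase.get(r.caseId) ?? [];
    bucket.push(r);
    byCase.set(r.caseId, bucket);
  }
  return Array.from(byCase.entries()).map(([caseId, trials]) => {
    const n = trials.length;
    const passRate = trials.filter((t) => t.passed).length / n;
    const meanScore = trials.reduce((sum, t) => sum + t.score, 0) / n;
    // Sample variance (n - 1 denominator); zero when there is a single trial.
    const variance =
      n > 1 ? trials.reduce((s, t) => s + (t.score - meanScore) ** 2, 0) / (n - 1) : 0;
    return { caseId, trials: n, passRate, meanScore, scoreStdDev: Math.sqrt(variance) };
  });
}

const sample: TrialResult[] = [
  { caseId: "hello-prompt", passed: true, score: 1 },
  { caseId: "hello-prompt", passed: true, score: 1 },
  { caseId: "hello-prompt", passed: false, score: 0.5 },
  { caseId: "resize-demo", passed: true, score: 1 },
];
console.log(aggregateTrials(sample));
```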
Use a paired baseline comparison whenever you want to know whether a change helped. The comparison report uses paired bootstrap confidence intervals and paired win/loss/tie counts, so it is much more reliable than eyeballing two single runs.
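For intuition, here is a minimal sketch of a paired bootstrap over per-case score deltas, together with win/loss/tie counts. It is not the runner's implementation: the function names, the 2000-iteration default, the percentile method, and the verdict rule (interval excludes 0 and the mean delta reaches a 0.05 threshold) are assumptions made for this example.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface PairedSummary {
  meanDelta: number;
  ciLow: number;
  ciHigh: number;
  wins: number;
  losses: number;
  ties: number;
}

// baseline[i] and candidate[i] must be scores for the same case.
function pairedBootstrap(
  baseline: number[],
  candidate: number[],
  iterations = 2000,
  rng: () => number = Math.random,
): PairedSummary {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("baseline and candidate must be non-empty and the same length");
  }
  const deltas = candidate.map((c, i) => c - (baseline[i] ?? 0));
  const n = deltas.length;
  const means: number[] = [];
  for (let b = 0; b < iterations; b++) {
    // Resample the paired deltas with replacement and record the mean.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rng() * n)] ?? 0;
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(iterations - 1, Math.floor(q * iterations))] ?? 0;
  return {
    meanDelta: deltas.reduce((s, d) => s + d, 0) / n,
    ciLow: at(0.025),
    ciHigh: at(0.975),
    wins: deltas.filter((d) => d > 0).length,
    losses: deltas.filter((d) => d < 0).length,
    ties: deltas.filter((d) => d === 0).length,
  };
}

// Assumed verdict rule: the interval must exclude 0 and the effect must reach the threshold.
function verdict(s: PairedSummary, threshold = 0.05): "improved" | "regressed" | "inconclusive" {
  if (s.ciLow > 0 && s.meanDelta >= threshold) return "improved";
  if (s.ciHigh < 0 && s.meanDelta <= -threshold) return "regressed";
  return "inconclusive";
}

const summary = pairedBootstrap([0.5, 0.75, 1, 0.5], [0.5, 1, 1, 0.75], 2000, mulberry32(42));
console.log(summary, verdict(summary));
```

Pairing matters because each case acts as its own control: case-to-case difficulty cancels out of the deltas, leaving only the change between runs.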
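1. Run the baseline and keep its report.json path.
2. Run the candidate with --compare-baseline <baseline-report-path>.
3. Read each verdict as improved, regressed, or inconclusive.

Practical reading rules:

- Treat a change as real only when the confidence interval excludes 0 and the effect is practically large enough to matter.
- The practical thresholds are a 0.05 score delta and, for overall pass rate, a 0.05 absolute pass-rate delta.

A reliable prompt-lane A/B loop looks like this: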
```bash
# Baseline run: capture the report.json path from the JSON output.
BASELINE_JSON=$(npx tsx evals/run.ts \
  --provider codex \
  --model gpt-5.4 \
  --lane prompt \
  --condition self-load \
  --trials 5 \
  --concurrency 4 \
  --output evals/reports/prompt-baseline \
  --json | jq -r '.jsonReportPath')

# edit the skill or prompt

# Candidate run: same settings, compared against the baseline report.
npx tsx evals/run.ts \
  --provider codex \
  --model gpt-5.4 \
  --lane prompt \
  --condition self-load \
  --trials 5 \
  --concurrency 4 \
  --output evals/reports/prompt-candidate \
  --compare-baseline "$BASELINE_JSON" \
  --json
```

An inconclusive verdict is the default, healthy outcome when nothing meaningful changed. In our same-skill sanity check, 23 of 24 cases were inconclusive and the paired win/loss/tie total was 14W / 15L / 43T.

- For decision-grade comparisons, use --trials plus --compare-baseline.
- --concurrency 1 remains the default and preserves serial behavior.
- --concurrency 4-20 is a reasonable operating range when provider limits and budget allow.
- [...] 1.
- [...] finally blocks, so parallel runs do not share session state.

Use serial mode only when debugging; use parallel mode when sampling.
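The general pattern behind bounded parallel sampling with isolated sessions can be sketched as follows. This is not the runner's code: `mapWithConcurrency`, `demo`, and the `session-N` identifiers are hypothetical names, and the timer stands in for a real provider call.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new Error("limit must be at least 1");
  const results = new Array<R>(items.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until the queue is empty.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => runLane());
  await Promise.all(lanes);
  return results;
}

// Simulate 10 trials, each with its own session that is released in `finally`.
async function demo(limit: number): Promise<{ results: string[]; peak: number; leaked: number }> {
  const openSessions = new Set<string>();
  let peak = 0;
  const trials = Array.from({ length: 10 }, (_, i) => i);
  const results = await mapWithConcurrency(trials, limit, async (trial) => {
    const sessionId = `session-${trial}`; // hypothetical per-trial session
    openSessions.add(sessionId);
    peak = Math.max(peak, openSessions.size);
    try {
      await new Promise<void>((resolve) => setTimeout(resolve, 5)); // stand-in for real work
      return `${sessionId}:done`;
    } finally {
      openSessions.delete(sessionId); // cleanup runs even if the trial throws
    }
  });
  return { results, peak, leaked: openSessions.size };
}

demo(4).then((r) => console.log(`peak in flight: ${r.peak}, leaked sessions: ${r.leaked}`));
```

With a limit of 1 the same helper degenerates to serial execution, which matches the debugging use case described above.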
```bash
npx tsx evals/run.ts \
  --provider claude \
  --model claude-opus-4-6 \
  --lane prompt \
  --condition self-load \
  --trials 5 \
  --concurrency 4 \
  --output evals/reports/prompt-self-load
```

```bash
npx tsx evals/run.ts \
  --provider codex \
  --model gpt-5.4 \
  --lane execution \
  --case hello-prompt \
  --case resize-demo \
  --trials 3 \
  --concurrency 4 \
  --output evals/reports/execution-smoke
```

```bash
npx tsx evals/run.ts \
  --provider claude \
  --model claude-opus-4-6 \
  --lane dogfood \
  --case exploratory-qa \
  --case evidence-completeness \
  --trials 2 \
  --concurrency 4 \
  --output evals/reports/dogfood-sample
```

```bash
BASELINE_JSON=$(npx tsx evals/run.ts \
  --provider codex \
  --model gpt-5.4 \
  --lane all \
  --condition all \
  --trials 3 \
  --concurrency 4 \
  --output evals/reports/baseline \
  --json | jq -r '.jsonReportPath')

npx tsx evals/run.ts \
  --provider codex \
  --model gpt-5.4 \
  --lane all \
  --condition all \
  --trials 3 \
  --concurrency 4 \
  --output evals/reports/candidate \
  --compare-baseline "$BASELINE_JSON" \
  --json
```

```bash
npx tsx evals/run.ts --provider stub --lane prompt --trials 3 --concurrency 4
```

Use stub to validate wiring, not to judge whether a real-provider prompt or skill change helped.
Checklist:
- Prefer evals/authoring/* (promptCase(), executionCase(), dogfoodCase()) over hand-assembled schema objects.
- [...] rawWorkflowCheck(), rawVerifier(), rawArtifactRequirement(), and rawReportRequirement().
- Run npm run test.

Need lifecycle events plus local progress and a machine-readable trace?
```bash
npx tsx evals/run.ts \
  --provider stub \
  --lane execution \
  --reporter jsonl \
  --reporter-output evals/reports/execution-events.jsonl \
  --progress
```

When you omit --reporter, the default final reporter still writes report.json and report.md.
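A JSONL trace is easy to post-process because each line is an independent JSON object. The sketch below counts events by a `type` field; that field name and the sample event names are assumptions for illustration, since the reporter's actual event schema is not documented here.

```typescript
// Count events per type in a JSONL string (one JSON object per line).
// The `type` field is an assumed name, not a confirmed part of the reporter output.
function countEventsByType(jsonl: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline
    const event: unknown = JSON.parse(trimmed);
    let type = "unknown";
    if (typeof event === "object" && event !== null && "type" in event) {
      const candidate = (event as { type: unknown }).type;
      if (typeof candidate === "string") type = candidate;
    }
    counts.set(type, (counts.get(type) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical trace content, standing in for evals/reports/execution-events.jsonl.
const trace = [
  '{"type":"case-start","caseId":"hello-prompt"}',
  '{"type":"case-end","caseId":"hello-prompt"}',
  '{"type":"case-start","caseId":"resize-demo"}',
  "",
].join("\n");

console.log(Array.from(countEventsByType(trace).entries()));
```

In practice the string would come from reading the file passed to --reporter-output.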
Add .workspace('agent-tty-smoke') to an executionCase() or dogfoodCase() when the case needs preset bootstrap/env/template setup. Register custom presets with registerPreset() in a module that loads before runEvalCli().
Use --snapshot-update first, then --snapshot-check --snapshot-threshold 20 against the same --snapshot-dir when you want regression signals over time. Snapshot regressions are warnings only.