Guide for running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison. Covers non-determinism baseline, recommended sample sizes, and result interpretation.
53
58%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./.mux/skills/eval-guide/SKILL.md
Quality
Discovery
40%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description identifies a clear niche (statistically meaningful agent-tty evals) which makes it distinctive, but it reads more like a table of contents than an actionable skill description. It lacks a 'Use when...' clause, which is critical for Claude to know when to select this skill, and the listed concepts are topics rather than concrete actions Claude will perform.
Suggestions
Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to run agent-tty evaluations, compare agent performance with A/B tests, or determine statistical significance of eval results.'
Replace 'Guide for running' with concrete action verbs describing what the skill does, e.g., 'Configures and executes statistically meaningful agent-tty evals, sets up parallel trial runs, performs A/B comparisons, and interprets statistical results.'
Include natural trigger terms users might say, such as 'benchmarks', 'eval results', 'statistical significance', 'compare runs', or 'performance testing'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description names the domain (agent-tty evals) and mentions several concepts (trials, parallelism, A/B comparison, non-determinism baseline, sample sizes, result interpretation), but these read more as topics covered than concrete actions the skill performs. It says 'Guide for running', which is somewhat vague about what actions Claude actually takes. | 2 / 3 |
| Completeness | The description covers 'what' (a guide for running evals with various statistical features) but completely lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' itself is more of a topic list than a clear capability description, warranting a 1. | 1 / 3 |
| Trigger Term Quality | Includes some relevant keywords like 'evals', 'A/B comparison', 'sample sizes', and 'trials', but 'agent-tty' is very specific jargon that most users might not use. Missing common variations like 'benchmarks', 'testing', 'evaluation runs', or 'statistical significance'. | 2 / 3 |
| Distinctiveness Conflict Risk | The description targets a very specific niche ('agent-tty evals' with statistical methodology), which is unlikely to conflict with other skills. The combination of agent-tty, A/B comparison, and statistical eval concepts creates a clear and distinct identity. | 3 / 3 |
| Total | Passed | 8 / 12 |
Implementation
77%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, highly actionable eval guide with excellent concrete commands and clear workflows for A/B comparison. Its main weakness is moderate verbosity in the empirical findings section and the monolithic structure—some sections (authoring, reporter lifecycle, workspace presets) could be split into referenced files. The statistical interpretation guidance and specific threshold values are particularly valuable.
Suggestions
Condense Section 1 (non-determinism findings) to 3-4 concise bullet points—the detailed percentages and cross-provider analysis could be moved to a separate FINDINGS.md reference.
Consider splitting Sections 7-10 (authoring, reporter, presets, snapshots) into a separate ADVANCED.md or REFERENCE.md file, keeping SKILL.md focused on the core eval workflow.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is mostly efficient and information-dense, but includes some unnecessary elaboration: for example, Section 1's detailed retelling of empirical findings (flip rates, cross-provider checks) could be condensed to a few bullet points. Some sections, like the concurrency safety rationale, repeat points. However, it largely avoids explaining things Claude already knows. | 2 / 3 |
| Actionability | Excellent actionability throughout. Every section provides concrete, copy-paste-ready bash commands with specific flags, trial counts, and concurrency settings. The A/B comparison workflow in Section 3 is fully executable with real CLI invocations and jq piping. Recommended parameter values are specific (e.g., --trials 5, --concurrency 4, 0.05 score delta cutoffs). | 3 / 3 |
| Workflow Clarity | The A/B comparison workflow in Section 3 is clearly sequenced (run baseline → save path → make change → run candidate with --compare-baseline → read verdicts). Section 4 provides explicit interpretation rules including what 'inconclusive' means and when to add more trials. The guide covers the full lifecycle from running evals to interpreting results with clear decision criteria and feedback loops (add more trials if noise-dominated). | 3 / 3 |
| Progressive Disclosure | The content is well-structured with numbered sections and clear headings, making navigation easy. However, it's a fairly long monolithic document (~170 lines of substantive content) with no references to external files for deeper topics like authoring APIs, reporter details, or workspace presets; these could benefit from being split out. For a standalone skill with no bundle files, the organization is decent but the length pushes against ideal progressive disclosure. | 2 / 3 |
| Total | Passed | 10 / 12 |
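The interpretation rules credited above (a 0.05 score-delta cutoff, an 'inconclusive' verdict, and adding trials when results are noise-dominated) can be illustrated with a rough sketch. This is not the skill's actual logic: the function name `classify_delta`, the verdict labels other than 'inconclusive', and the use of per-arm standard deviation as the noise estimate are all assumptions made for illustration; only the 0.05 cutoff comes from the review.

```python
from statistics import mean, stdev


def classify_delta(baseline, candidate, cutoff=0.05):
    """Label a candidate run against a baseline from per-trial scores.

    Hypothetical helper: the skill's real verdict logic is not shown in
    this review. Only the 0.05 cutoff is taken from it.
    Each list needs at least two trial scores (stdev requires that).
    """
    delta = mean(candidate) - mean(baseline)
    # Assumed noise estimate: the larger of the two sample standard deviations.
    noise = max(stdev(baseline), stdev(candidate))
    # Small or noise-dominated differences are not trusted; add more trials.
    if abs(delta) < cutoff or abs(delta) <= noise:
        return "inconclusive", delta
    return ("improved" if delta > 0 else "regressed"), delta


if __name__ == "__main__":
    base = [0.60, 0.62, 0.58, 0.61, 0.59]
    cand = [0.80, 0.82, 0.78, 0.81, 0.79]
    verdict, delta = classify_delta(base, cand)
    print(verdict, round(delta, 2))
```

With five trials per arm, as in the `--trials 5` recommendation, a verdict of 'inconclusive' under this sketch would be the cue to rerun with more trials rather than to read anything into the delta.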
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
a05d4e5