
eval-guide

Guide for running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison. Covers non-determinism baseline, recommended sample sizes, and result interpretation.

53

Quality

58%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./.mux/skills/eval-guide/SKILL.md

Quality

Discovery

40%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description identifies a clear niche (statistically meaningful agent-tty evals) which makes it distinctive, but it reads more like a table of contents than an actionable skill description. It lacks a 'Use when...' clause, which is critical for Claude to know when to select this skill, and the listed concepts are topics rather than concrete actions Claude will perform.

Suggestions

Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to run agent-tty evaluations, compare agent performance with A/B tests, or determine statistical significance of eval results.'

Replace 'Guide for running' with concrete action verbs describing what the skill does, e.g., 'Configures and executes statistically meaningful agent-tty evals, sets up parallel trial runs, performs A/B comparisons, and interprets statistical results.'

Include natural trigger terms users might say, such as 'benchmarks', 'eval results', 'statistical significance', 'compare runs', or 'performance testing'.
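The three suggestions above can be checked mechanically. The following is a minimal sketch in Python, not part of Tessl or agent-tty: the function name `check_description` and the constant names are hypothetical, the trigger terms are the ones listed in the suggestion above, and the revised description is simply the two suggested rewrites joined together.

```python
# Hypothetical lint for a skill description; the names here are illustrative only.
TRIGGER_TERMS = (
    "benchmarks",
    "eval results",
    "statistical significance",
    "compare runs",
    "performance testing",
)


def check_description(description: str) -> dict:
    """Report whether a description has a 'Use when' clause and which trigger terms it contains."""
    text = description.lower()
    return {
        "has_use_when": "use when" in text,
        "trigger_terms_found": [term for term in TRIGGER_TERMS if term in text],
    }


# The description as currently published for the skill.
CURRENT_DESCRIPTION = (
    "Guide for running statistically meaningful agent-tty evals with trials, "
    "parallelism, and A/B comparison. Covers non-determinism baseline, "
    "recommended sample sizes, and result interpretation."
)

# The two rewrites suggested in the review, joined into one description.
REVISED_DESCRIPTION = (
    "Configures and executes statistically meaningful agent-tty evals, sets up "
    "parallel trial runs, performs A/B comparisons, and interprets statistical results. "
    "Use when the user wants to run agent-tty evaluations, compare agent performance "
    "with A/B tests, or determine statistical significance of eval results."
)

if __name__ == "__main__":
    print(check_description(CURRENT_DESCRIPTION))
    print(check_description(REVISED_DESCRIPTION))
```

Under this simple substring check, the current description has no 'Use when' clause and none of the listed trigger terms, while the revised one has the clause and two of the terms. The check is only a rough proxy for the reviewer's rubric, which is not specified here.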

Dimension | Reasoning | Score

Specificity

The description names the domain (agent-tty evals) and mentions several concepts (trials, parallelism, A/B comparison, non-determinism baseline, sample sizes, result interpretation), but these read more as topics covered than as concrete actions the skill performs. The phrase 'Guide for running' is vague about what actions Claude actually takes.

2 / 3

Completeness

The description covers the 'what' (a guide for running evals with various statistical features) but lacks a 'Use when...' clause or any explicit guidance on when Claude should select this skill. Per the rubric, a missing 'Use when...' clause caps completeness at 2; because the 'what' is also more of a topic list than a clear capability description, the score drops to 1.

1 / 3

Trigger Term Quality

The description includes some relevant keywords such as 'evals', 'A/B comparison', 'sample sizes', and 'trials', but 'agent-tty' is very specific jargon that most users might not use. It is missing common variations such as 'benchmarks', 'testing', 'evaluation runs', or 'statistical significance'.

2 / 3

Distinctiveness / Conflict Risk

The description targets a very specific niche ('agent-tty evals' with statistical methodology) that is unlikely to conflict with other skills. The combination of agent-tty, A/B comparison, and statistical eval concepts creates a clear and distinct identity.

3 / 3

Total: 8 / 12

Passed

Implementation

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, highly actionable eval guide with concrete commands and clear workflows for A/B comparison. Its main weaknesses are moderate verbosity in the empirical findings section and a monolithic structure: some sections (authoring, reporter lifecycle, workspace presets) could be split into referenced files. The statistical interpretation guidance and specific threshold values are particularly valuable.

Suggestions

Condense Section 1 (non-determinism findings) to 3-4 concise bullet points; the detailed percentages and cross-provider analysis could be moved to a separate FINDINGS.md reference.

Consider splitting Sections 7-10 (authoring, reporter, presets, snapshots) into a separate ADVANCED.md or REFERENCE.md file, keeping SKILL.md focused on the core eval workflow.

Dimension | Reasoning | Score

Conciseness

The content is mostly efficient and information-dense but includes some unnecessary elaboration. For example, Section 1's detailed retelling of empirical findings (flip rates, cross-provider checks) could be condensed to a few bullet points. Some sections, such as the concurrency safety rationale, repeat points. However, it largely avoids explaining things Claude already knows.

2 / 3

Actionability

Excellent actionability throughout. Every section provides concrete, copy-paste-ready bash commands with specific flags, trial counts, and concurrency settings. The A/B comparison workflow in Section 3 is fully executable with real CLI invocations and jq piping. Recommended parameter values are specific (e.g., --trials 5, --concurrency 4, 0.05 score delta cutoffs).

3 / 3

Workflow Clarity

The A/B comparison workflow in Section 3 is clearly sequenced (run baseline → save path → make change → run candidate with --compare-baseline → read verdicts). Section 4 provides explicit interpretation rules including what 'inconclusive' means and when to add more trials. The guide covers the full lifecycle from running evals to interpreting results with clear decision criteria and feedback loops (add more trials if noise-dominated).

3 / 3

Progressive Disclosure

The content is well-structured with numbered sections and clear headings, making navigation easy. However, it is a fairly long monolithic document (~170 lines of substantive content) with no references to external files for deeper topics such as authoring APIs, reporter details, or workspace presets, which could be split out. For a standalone skill with no bundle files, the organization is decent, but the length pushes against ideal progressive disclosure.

2 / 3

Total: 10 / 12

Passed
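The interpretation step praised under Workflow Clarity, together with the 0.05 score-delta cutoff mentioned under Actionability, can be illustrated with a short Python sketch. This is an assumption-laden illustration, not agent-tty's actual logic: the function name `verdict`, the labels 'improved' and 'regressed', and the use of a plain difference of mean trial scores are all hypothetical. Only the 0.05 cutoff and the 'inconclusive' label come from the review text.

```python
from statistics import mean

# Assumption: 0.05 is the score-delta cutoff cited in the review. How agent-tty
# actually applies it is not described here, so this comparison is a guess.
DELTA_CUTOFF = 0.05


def verdict(baseline: list[float], candidate: list[float], cutoff: float = DELTA_CUTOFF) -> str:
    """Classify a candidate run against a baseline by the difference of mean trial scores."""
    delta = mean(candidate) - mean(baseline)
    if abs(delta) < cutoff:
        # Difference is smaller than the cutoff: treat it as noise and add more trials.
        return "inconclusive"
    # Hypothetical labels for a difference that clears the cutoff.
    return "improved" if delta > 0 else "regressed"


if __name__ == "__main__":
    baseline_scores = [0.6, 0.6, 0.6, 0.6, 0.6]  # five trials, matching --trials 5
    print(verdict(baseline_scores, [0.62, 0.62, 0.62, 0.62, 0.62]))
    print(verdict(baseline_scores, [0.8, 0.8, 0.8, 0.8, 0.8]))
```

A mean-difference rule like this ignores the spread between trials. A real comparison would likely also account for per-trial variance, which is why the guide's advice to add trials when results are noise-dominated matters.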

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: coder/agent-tty
Reviewed
