eval-guide

Guide for running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison. Covers non-determinism baseline, recommended sample sizes, and result interpretation.

Quality

72%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Passed

No known issues

Fix and improve this skill with Tessl

tessl review fix ./.mux/skills/eval-guide/SKILL.md

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is highly actionable with executable commands and a well-sequenced, validated A/B workflow, but it carries some explanatory prose that could be trimmed and is entirely inline with no progressive-disclosure structure.

Suggestions

Tighten the narrative in section 1 and the statistical reasoning in section 4 to data + rule, dropping restatements of why the noise floor matters.

Consider extracting the full command catalog in section 6 into a reference file so SKILL.md stays an overview, lifting progressive_disclosure toward 3.

Dimension	Reasoning	Score
Conciseness	Mostly efficient and packed with specific values, but the non-determinism narrative and the statistical reasoning in section 4 include explanatory prose that could be tightened; this falls short of the lean, every-token-earns-its-place form required for a 3.	2 / 3
Actionability	Provides fully executable, copy-paste-ready bash commands with concrete flags (--trials 5, --concurrency 4, --compare-baseline) plus specific practical cutoffs (0.05 score delta, 0.05 pass-rate delta), matching the copy-paste-ready anchor.	3 / 3
Workflow Clarity	Section 3 gives a clearly numbered A/B sequence (run baseline, save path, change, run candidate, read verdicts) with explicit validation/reading checkpoints (CI excludes 0 and effect >= 0.05) and a checklist in section 7, matching the clear-sequence-with-validation anchor.	3 / 3
Progressive Disclosure	Well-organized into 10 numbered sections with a quick-reference, but it is a single self-contained ~200-line file with no progressive splitting or references to separate detail files; the comprehensive content stays inline rather than being an overview pointing to deeper materials.	2 / 3
	Total	10 / 12 Passed
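
The validation checkpoint quoted in the Workflow Clarity row (CI excludes 0 and effect >= 0.05) can be read as a simple decision rule. The following is a minimal Python sketch of that rule, not agent-tty's actual implementation: the function name, its parameters, and the choice to compare the absolute value of the effect are all assumptions made for illustration, and the confidence interval is assumed to be on the candidate-minus-baseline delta.

```python
def is_meaningful_change(ci_low: float, ci_high: float, effect: float,
                         min_effect: float = 0.05) -> bool:
    """Hypothetical helper: apply the 'CI excludes 0 and effect >= 0.05' rule.

    ci_low, ci_high: bounds of the confidence interval on the delta
                     (assumed to be candidate minus baseline).
    effect:          the observed delta (score or pass rate).
    min_effect:      practical cutoff; 0.05 is the value quoted in the review.
    """
    # The interval excludes zero when it lies entirely on one side of it.
    ci_excludes_zero = ci_low > 0.0 or ci_high < 0.0
    # Assumption: the magnitude of the effect is what matters, so a large
    # regression is flagged as meaningful just like a large improvement.
    return ci_excludes_zero and abs(effect) >= min_effect


if __name__ == "__main__":
    print(is_meaningful_change(0.02, 0.12, 0.07))   # interval above 0, effect large enough
    print(is_meaningful_change(-0.01, 0.10, 0.06))  # interval spans 0
    print(is_meaningful_change(0.01, 0.05, 0.03))   # effect below the cutoff
```

Requiring both conditions keeps a statistically detectable but tiny delta from being treated as a real improvement, which is the point of pairing the interval check with a practical cutoff.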

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is specific and occupies a clear niche, but it lacks an explicit 'Use when...' trigger clause and its trigger terms are somewhat technical rather than natural phrasings a user would say.

Suggestions

Add an explicit 'Use when ...' clause listing natural triggers, e.g. 'Use when the user wants to run an agent-tty eval, A/B test a skill or prompt change, or judge whether a change actually helped.'

Soften jargon in the trigger phrasing by adding natural variations users would say ('run an eval', 'compare two versions', 'is this change an improvement').

Dimension	Reasoning	Score
Specificity	Lists multiple concrete actions — 'running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison' plus 'recommended sample sizes' and 'result interpretation' — matching the anchor for several specific concrete actions.	3 / 3
Completeness	It clearly answers 'what does this do' but has no 'Use when...' clause or equivalent explicit trigger guidance, which caps completeness at 2 per the rubric.	2 / 3
Trigger Term Quality	Terms like 'agent-tty evals', 'A/B comparison', and 'trials' are relevant but lean technical; common natural variations a user might say ('run an eval', 'compare two versions', 'A/B test my skill') are not enumerated, so coverage is partial rather than comprehensive.	2 / 3
Distinctiveness Conflict Risk	The narrow niche of 'statistically meaningful agent-tty evals' with A/B comparison and trial sampling is clearly distinct and unlikely to trigger for unrelated skills.	3 / 3
	Total	10 / 12 Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository: coder/agent-tty
Commit: fae02cb

Reviewed: 23 days ago

