eval-guide

Guide for running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison. Covers non-determinism baseline, recommended sample sizes, and result interpretation.

62

Quality

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is highly actionable with executable commands and a well-sequenced, validated A/B workflow, but it carries some explanatory prose that could be trimmed and is entirely inline with no progressive-disclosure structure.

Suggestions

Tighten the narrative in section 1 and the statistical reasoning in section 4 to data + rule, dropping restatements of why the noise floor matters.

Consider extracting the full command catalog in section 6 into a reference file so SKILL.md stays an overview, lifting progressive_disclosure toward 3.

Dimension | Reasoning | Score

Conciseness

Mostly efficient and packed with specific values, but the non-determinism narrative and the statistical reasoning in section 4 include explanatory prose that could be tightened, so it falls short of the lean, every-token-earns-its-place standard required for a 3.

2 / 3

Actionability

Provides fully executable, copy-paste-ready bash commands with concrete flags (--trials 5, --concurrency 4, --compare-baseline) plus specific practical cutoffs (0.05 score delta, 0.05 pass-rate delta), matching the copy-paste-ready anchor.

3 / 3

Workflow Clarity

Section 3 gives a clearly numbered A/B sequence (run baseline, save path, change, run candidate, read verdicts) with explicit validation/reading checkpoints (CI excludes 0 and effect >= 0.05) and a checklist in section 7, matching the clear-sequence-with-validation anchor.

3 / 3

Progressive Disclosure

Well-organized into 10 numbered sections with a quick-reference, but it is a single self-contained ~200-line file with no progressive splitting or references to separate detail files; the comprehensive content stays inline rather than being an overview pointing to deeper materials.

2 / 3

Total: 10 / 12

Passed
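
The reading rule quoted in the Workflow Clarity row (CI excludes 0 and effect >= 0.05) can be sketched in a few lines. This is a minimal illustration, not the skill's or agent-tty's actual implementation: the percentile bootstrap, the function names `bootstrap_ci` and `verdict`, the verdict labels, and the use of the absolute effect size are all assumptions made for the example. Only the 0.05 cutoff comes from the review.

```python
import random
import statistics

MIN_EFFECT = 0.05  # practical cutoff quoted in the review


def bootstrap_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(candidate) - mean(baseline).

    Assumption: the skill may compute its interval differently.
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping its original size.
        b = [rng.choice(baseline) for _ in baseline]
        c = [rng.choice(candidate) for _ in candidate]
        deltas.append(statistics.fmean(c) - statistics.fmean(b))
    deltas.sort()
    low = deltas[int((alpha / 2) * n_resamples)]
    high = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def verdict(baseline, candidate):
    """Apply the two-part rule: CI excludes 0 AND |effect| >= MIN_EFFECT."""
    effect = statistics.fmean(candidate) - statistics.fmean(baseline)
    low, high = bootstrap_ci(baseline, candidate)
    ci_excludes_zero = low > 0 or high < 0
    if ci_excludes_zero and abs(effect) >= MIN_EFFECT:
        return "improved" if effect > 0 else "regressed"
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical per-trial scores for five baseline and five candidate trials.
    base = [0.60, 0.62, 0.58, 0.61, 0.59]
    cand = [0.80, 0.82, 0.78, 0.81, 0.79]
    print(verdict(base, cand))
```

Requiring both conditions means a change that is statistically distinguishable from noise but smaller than the practical cutoff is still reported as inconclusive.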

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is specific and occupies a clear niche, but it lacks an explicit 'Use when...' trigger clause and its trigger terms are somewhat technical rather than natural phrasings a user would say.

Suggestions

Add an explicit 'Use when ...' clause listing natural triggers, e.g. 'Use when the user wants to run an agent-tty eval, A/B test a skill or prompt change, or judge whether a change actually helped.'

Soften jargon in the trigger phrasing by adding natural variations users would say ('run an eval', 'compare two versions', 'is this change an improvement').

Dimension | Reasoning | Score

Specificity

Lists multiple concrete actions — 'running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison' plus 'recommended sample sizes' and 'result interpretation' — matching the anchor for several specific concrete actions.

3 / 3

Completeness

It clearly answers 'what does this do' but has no 'Use when...' clause or equivalent explicit trigger guidance, which caps completeness at 2 per the rubric.

2 / 3

Trigger Term Quality

Terms like 'agent-tty evals', 'A/B comparison', and 'trials' are relevant but lean technical; common natural variations a user might say ('run an eval', 'compare two versions', 'A/B test my skill') are not enumerated, so coverage is partial rather than comprehensive.

2 / 3

Distinctiveness / Conflict Risk

The narrow niche of 'statistically meaningful agent-tty evals' with A/B comparison and trial sampling is clearly distinct and unlikely to trigger for unrelated skills.

3 / 3

Total: 10 / 12

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository: coder/agent-tty
Reviewed
