anthropic-evaluations

This skill should be used when the user asks to "create evals", "evaluate an agent", "build evaluation suite", or mentions agent testing, graders, or benchmarks. Also suggest when building coding agents, conversational agents, or research agents that need quality assurance.

86

1.50x
Quality

Does it follow best practices?

Impact

98%

1.50x

Average score across 3 eval scenarios

Security by Snyk

Passed

No known issues

Quality

Content

87%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is a well-structured, concise overview that delegates detail to a clean set of real reference files and includes concrete formulas and YAML. Its main gap is that the end-to-end eval-building workflow and its validation/feedback checkpoints live entirely in the referenced roadmap rather than being surfaced in SKILL.md.

Suggestions

Surface a short inline sequence of the eval-building steps (with the key validation checkpoint of reviewing transcripts/grades before trusting results) so the workflow is visible without opening the roadmap.

Add a brief feedback-loop note (e.g., 'read transcripts, confirm graders do not reject valid solutions, then iterate') to give the body an explicit validate-fix-retry cycle.

Consider a one-line "Start here" pointer to the roadmap at the top so first-time users immediately reach the sequenced process.
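
The feedback loop suggested above (checking that graders do not wrongly reject valid solutions before trusting results) might look like the following minimal sketch. All names here (`find_false_rejections`, `strict_grader`, `lenient_grader`, and the sample answers) are hypothetical illustrations, not taken from the skill or its roadmap.

```python
from typing import Callable, Iterable, List


def find_false_rejections(
    grader: Callable[[str], bool],
    valid_solutions: Iterable[str],
) -> List[str]:
    """Return every known-valid solution that the grader fails."""
    return [solution for solution in valid_solutions if not grader(solution)]


def strict_grader(output: str) -> bool:
    # Hypothetical grader: exact string match only.
    return output == "4"


def lenient_grader(output: str) -> bool:
    # Hypothetical grader: normalises whitespace and case before comparing.
    return output.strip().lower() in {"4", "4.0", "four"}


# Hypothetical answers a human reviewer judged correct while reading transcripts.
known_valid = ["4", " 4 ", "4.0", "Four"]

print(find_false_rejections(strict_grader, known_valid))   # [' 4 ', '4.0', 'Four']
print(find_false_rejections(lenient_grader, known_valid))  # []
```

A non-empty result points at a grader to fix before its scores are trusted. After the fix, the same check is rerun, which is the validate-fix-retry cycle in miniature.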

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The body is lean and table-driven with no padding; it does not re-explain concepts Claude already knows and every section earns its place. | 3 / 3 |
| Actionability | Provides concrete guidance: a real tracked_metrics YAML block, explicit pass@k/pass^k formulas with worked numbers (98%, 42%), and pointers to executable templates, making it actionable for an instruction/knowledge skill. | 3 / 3 |
| Workflow Clarity | The multi-step process (Steps 0-8) is delegated to the Roadmap reference rather than sequenced in the body, and the body itself lacks explicit validation checkpoints or feedback loops for the eval-building process. | 2 / 3 |
| Progressive Disclosure | Clear overview with well-signaled, one-level-deep references organized into categories (references, templates, annotated examples), all of which resolve to real files, giving easy navigation. | 3 / 3 |
| Total | | 11 / 12 |

Passed
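
The review cites pass@k/pass^k formulas with worked numbers (98%, 42%) but does not reproduce them. As a hedged sketch, the usual definitions for k independent trials with per-trial success rate p are pass@k = 1 - (1 - p)^k (at least one trial succeeds) and pass^k = p^k (every trial succeeds). The values p = 0.75 and k = 3 below are an assumption chosen because they reproduce the quoted figures; the skill itself may use different inputs.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


# Assumed inputs (not stated in the review): 75% per-trial success, 3 trials.
p, k = 0.75, 3

print(f"pass@{k} = {pass_at_k(p, k):.0%}")   # pass@3 = 98%
print(f"pass^{k} = {pass_pow_k(p, k):.0%}")  # pass^3 = 42%
```

The two metrics diverge as k grows: pass@k approaches 100% while pass^k falls toward 0%. The choice between them therefore depends on whether one success is enough or consistency across every run is required.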

Description

72%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has strong trigger-term coverage and a clear, low-conflict niche, but states "when" far more explicitly than "what" — the skill's actual capabilities are implied through triggers rather than named directly. Tightening it to lead with a concrete capability statement would raise specificity and completeness.

Suggestions

Lead with a concrete capability statement (e.g., 'Builds evaluation suites for AI agents: defines tasks, graders, and metrics') before the trigger clauses so the "what" is explicit, not implied.

Add a few concrete action verbs the skill performs (design tasks, configure graders, track metrics) to lift specificity from naming the domain to naming the actions.

Trim the longer "Also suggest when building..." clause slightly to keep the description concise while preserving its trigger coverage.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (agent evaluations) and several trigger actions ("create evals", "evaluate an agent", "build evaluation suite", "graders", "benchmarks"), but describes when to invoke the skill rather than enumerating the concrete actions the skill itself performs, so it is not fully comprehensive. | 2 / 3 |
| Completeness | Explicitly answers "when" with multiple trigger clauses ("should be used when...", "Also suggest when..."), but the "what" is conveyed only indirectly through those triggers rather than as a direct capability statement. | 2 / 3 |
| Trigger Term Quality | Covers natural phrasings a user would actually say ("create evals", "evaluate an agent", "build evaluation suite", "agent testing", "graders", "benchmarks"), giving good coverage of common variations. | 3 / 3 |
| Distinctiveness / Conflict Risk | The niche (agent evaluations, graders, benchmarks, QA for agents) is specific with distinct triggers, making it unlikely to fire for unrelated skills. | 3 / 3 |
| Total | | 10 / 12 |

Passed

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 15 / 16 passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| Total | | 15 / 16 |

Passed

Repository: dwmkerr/claude-toolkit (Reviewed)