anthropic-evaluations

This skill should be used when the user asks to "create evals", "evaluate an agent", "build evaluation suite", or mentions agent testing, graders, or benchmarks. Also suggest when building coding agents, conversational agents, or research agents that need quality assurance.

Quality

80%

Does it follow best practices?

Impact

98%

1.50x

Average score across 3 eval scenarios

Security

Passed

No known issues

Fix and improve this skill with Tessl

tessl review fix ./plugins/toolkit/skills/anthropic-evaluations/SKILL.md

Quality

Content

87%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is a well-structured, concise overview that delegates detail to a clean set of real reference files and includes concrete formulas and YAML. Its main gap is that the end-to-end eval-building workflow and its validation/feedback checkpoints live entirely in the referenced roadmap rather than being surfaced in SKILL.md.

Suggestions

Surface a short inline sequence of the eval-building steps (with the key validation checkpoint of reviewing transcripts/grades before trusting results) so the workflow is visible without opening the roadmap.

Add a brief feedback-loop note (e.g., 'read transcripts, confirm graders do not reject valid solutions, then iterate') to give the body an explicit validate-fix-retry cycle.

Consider a one-line "Start here" pointer to the roadmap at the top so first-time users immediately reach the sequenced process.
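
The validation checkpoint these suggestions describe could be sketched roughly as follows. This is a minimal illustration, not the skill's actual tooling: the `Grader` type, `find_false_rejections`, `exact_match_grader`, and the sample answers are all hypothetical names and data introduced here. The idea is to run a grader over solutions already verified by hand and list any it wrongly rejects before trusting its scores.

```python
from typing import Callable

# Hypothetical grader signature (an assumption for this sketch):
# takes a candidate solution and returns True if it passes.
Grader = Callable[[str], bool]


def find_false_rejections(grader: Grader, known_valid: list[str]) -> list[str]:
    """Return the known-valid solutions that the grader wrongly rejects."""
    return [solution for solution in known_valid if not grader(solution)]


def exact_match_grader(solution: str) -> bool:
    """Illustrative grader that is too strict: it accepts one exact string only."""
    return solution == "4"


# Hand-verified answers to "what is 2 + 2?" that should all pass (made-up data).
known_valid = ["4", "4.0", " 4 "]

rejected = find_false_rejections(exact_match_grader, known_valid)
print(rejected)  # a non-empty list means the grader needs fixing before iterating
```

A non-empty result is the signal to fix the grader and rerun, which is the validate-fix-retry cycle the suggestion asks the skill body to make explicit.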

Dimension	Reasoning	Score
Conciseness	The body is lean and table-driven with no padding; it does not re-explain concepts Claude already knows and every section earns its place.	3 / 3
Actionability	Provides concrete guidance — a real tracked_metrics YAML block, explicit pass@k/pass^k formulas with worked numbers (98%, 42%), and pointers to executable templates — making it actionable for an instruction/knowledge skill.	3 / 3
Workflow Clarity	The multi-step process (Steps 0-8) is delegated to the Roadmap reference rather than sequenced in the body, and the body itself lacks explicit validation checkpoints or feedback loops for the eval-building process.	2 / 3
Progressive Disclosure	Clear overview with well-signaled, one-level-deep references organized into categories (references, templates, annotated examples), all of which resolve to real files, giving easy navigation.	3 / 3
	Total	11 / 12 Passed
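
The Actionability row mentions pass@k and pass^k formulas with worked numbers of 98% and 42%, but this review does not reproduce the formulas or their inputs. As a hedged sketch, the commonly used definitions under independent trials are shown below. The per-trial success rate of 0.75 and k = 3 are assumptions chosen here because they happen to reproduce those two figures; the skill itself may use different inputs.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


# Assumed inputs (not stated in the review): 75% per-trial success, 3 trials.
p, k = 0.75, 3

print(f"pass@{k} = {pass_at_k(p, k):.0%}")   # chance of at least one success
print(f"pass^{k} = {pass_pow_k(p, k):.0%}")  # chance of success on every trial
```

The gap between the two values is the point of tracking both: pass@k rewards an agent that can succeed at least once, while pass^k measures whether it succeeds consistently.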

Description

72%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has strong trigger-term coverage and a clear, low-conflict niche, but states "when" far more explicitly than "what" — the skill's actual capabilities are implied through triggers rather than named directly. Tightening it to lead with a concrete capability statement would raise specificity and completeness.

Suggestions

Lead with a concrete capability statement (e.g., 'Builds evaluation suites for AI agents: defines tasks, graders, and metrics') before the trigger clauses so the "what" is explicit, not implied.

Add a few concrete action verbs the skill performs (design tasks, configure graders, track metrics) to lift specificity from naming the domain to naming the actions.

Trim the longer "Also suggest when building..." clause slightly to keep the description concise while preserving its trigger coverage.

Dimension	Reasoning	Score
Specificity	Names the domain (agent evaluations) and several trigger actions ("create evals", "evaluate an agent", "build evaluation suite", "graders", "benchmarks"), but describes when to invoke the skill rather than enumerating the concrete actions the skill itself performs, so it is not fully comprehensive.	2 / 3
Completeness	Explicitly answers "when" with multiple trigger clauses ("should be used when...", "Also suggest when..."), but the "what" is conveyed only indirectly through those triggers rather than as a direct capability statement.	2 / 3
Trigger Term Quality	Covers natural phrasings a user would actually say — "create evals", "evaluate an agent", "build evaluation suite", "agent testing", "graders", "benchmarks" — giving good coverage of common variations.	3 / 3
Distinctiveness Conflict Risk	The niche (agent evaluations, graders, benchmarks, QA for agents) is specific with distinct triggers, making it unlikely to fire for unrelated skills.	3 / 3
	Total	10 / 12 Passed

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
allowed_tools_field	'allowed-tools' contains unusual tool name(s)	Warning
	Total	15 / 16 Passed

Repository: dwmkerr/claude-toolkit
Path: plugins/toolkit/skills/anthropic-evaluations/SKILL.md
Commit: 18a4d6c

Reviewed: 18 days ago
