Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
69
24%
Does it follow best practices?
Impact
100%
2.08x
Average score across 6 eval scenarios
Advisory
Suggest reviewing before use
Optimize this skill with Tessl
npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md
Quality
Discovery
22%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description is too abstract and jargon-heavy, reading more like an academic title than an actionable skill description. It fails to specify concrete actions the skill performs and lacks any 'Use when...' guidance, making it difficult for Claude to know when to select it. The EDD acronym provides some niche identity but is not sufficient to compensate for the missing detail.
Suggestions
Add concrete actions the skill performs, e.g., 'Creates evaluation criteria, designs test cases, scores Claude Code session outputs, and tracks improvement metrics.'
Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user wants to evaluate Claude Code output quality, set up evals, measure performance, or implement eval-driven development workflows.'
Replace or supplement the jargon 'EDD principles' with plain-language equivalents that users would naturally say, such as 'testing', 'benchmarking', 'quality assessment', or 'evaluation scoring'.
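The suggestions above can be turned into a quick self-check before resubmitting a description. The following is a minimal sketch, not Tessl's scoring logic: `check_description` and the `TRIGGER_TERMS` list are hypothetical names introduced here, and the term list simply reuses the plain-language words the review mentions.

```python
import re

# Plain-language trigger terms named in the review (an assumed list, not Tessl's rubric).
TRIGGER_TERMS = ("evaluate", "test", "benchmark", "score", "measure")


def check_description(description: str) -> list[str]:
    """Return human-readable issues found in a skill description."""
    issues = []
    text = description.lower()
    if "use when" not in text:
        issues.append("missing 'Use when...' clause")
    # Match at a word start so 'test' does not match inside 'latest'.
    found = [t for t in TRIGGER_TERMS if re.search(rf"\b{t}", text)]
    if not found:
        issues.append("no plain-language trigger terms")
    return issues


# The description under review, and a revision assembled from the review's own examples.
current = (
    "Formal evaluation framework for Claude Code sessions "
    "implementing eval-driven development (EDD) principles"
)
revised = (
    "Creates evaluation criteria, designs test cases, and scores "
    "Claude Code session outputs. Use when the user wants to "
    "evaluate output quality, set up evals, or measure performance."
)

print(check_description(current))
print(check_description(revised))
```

A check like this only catches the two gaps the review calls out; it says nothing about whether the description is distinctive enough to avoid conflicts with other skills.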
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions. It doesn't specify what the skill actually does (e.g., create test cases, run benchmarks, score outputs, generate reports). | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (a framework for evaluation) and completely lacks a 'when' clause. There are no explicit triggers or guidance on when Claude should select this skill. | 1 / 3 |
| Trigger Term Quality | It includes some relevant terms like 'eval-driven development', 'EDD', and 'Claude Code sessions', but these are fairly niche/jargon-heavy. Common user phrases like 'evaluate', 'test', 'benchmark', 'score', or 'measure quality' are missing. | 2 / 3 |
| Distinctiveness Conflict Risk | The mention of 'eval-driven development (EDD)' and 'Claude Code sessions' provides some specificity, but 'formal evaluation framework' is broad enough to potentially overlap with testing, QA, or code review skills. | 2 / 3 |
| Total | | 6 / 12 Passed |
Implementation
27%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is overly verbose, explaining many concepts Claude already understands (eval-driven development philosophy, what pass@k metrics are, basic testing concepts). The slash commands (/eval define, /eval check) appear to be fictional, with no backing implementation, which reduces actionability. The content would benefit greatly from being condensed to its unique value (the specific templates and file conventions), with detailed examples and grader definitions split into separate reference files.
Suggestions
Cut the philosophy section, metrics explanations, and best practices to about 2-3 lines each, since Claude already understands testing concepts, pass@k, and why evals matter.
Either implement the /eval slash commands as actual scripts/tools or remove them and replace with concrete executable steps Claude can actually perform.
Split grader type definitions, the authentication example, and eval storage conventions into separate referenced files to improve progressive disclosure.
Add explicit validation checkpoints and error recovery steps in the workflow (e.g., what to do when capability evals fail, how to debug regression failures).
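One way to act on the second and fourth suggestions is to back the check step with a small script instead of an unimplemented slash command. The sketch below is an illustration under assumptions, not the skill's actual tooling: `EvalResult`, `run_eval`, and `check` are hypothetical names, and the demo commands are stand-ins for whatever a project would really run (for example `npm test`).

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    passed: bool
    output: str


def run_eval(name: str, command: list[str], timeout: int = 60) -> EvalResult:
    """Run one code-based grader; exit code 0 counts as a pass."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        # A timeout or a missing executable is reported as a failed eval.
        return EvalResult(name, False, str(exc))
    return EvalResult(name, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def check(evals: dict[str, list[str]]) -> list[EvalResult]:
    """Run every eval and print a checkpoint line for each result."""
    results = [run_eval(name, cmd) for name, cmd in evals.items()]
    for r in results:
        print(f"[{'PASS' if r.passed else 'FAIL'}] {r.name}")
        if not r.passed:
            # Recovery hint: surface the grader output so the failure can be debugged.
            print(f"    fix before continuing: {r.output or 'no output'}")
    return results


# Stand-in commands so the sketch runs anywhere Python is installed.
demo = {
    "always-passes": [sys.executable, "-c", "print('ok')"],
    "always-fails": [sys.executable, "-c", "import sys; sys.exit(1)"],
}
results = check(demo)
```

Printing a PASS or FAIL line per eval gives the workflow an explicit checkpoint, and echoing the grader output on failure is the simplest form of the error-recovery guidance the review asks for.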
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose for what it conveys. Explains basic concepts Claude already knows (what pass@k means, what evals are, what regression testing is). The philosophy section, grader type explanations, and best practices are largely common knowledge for Claude. The document is ~150+ lines but could convey its unique value in ~40 lines. | 1 / 3 |
| Actionability | Provides templates and markdown formats that are somewhat concrete, but the '/eval define', '/eval check', '/eval report' commands are not real executable commands; they appear to be aspirational slash commands with no implementation. The bash examples (grep, npm test) are concrete but generic. Most content is descriptive templates rather than executable guidance. | 2 / 3 |
| Workflow Clarity | The 4-step workflow (Define → Implement → Evaluate → Report) is clearly sequenced, but lacks validation checkpoints and error recovery. There's no guidance on what to do when evals fail, no feedback loops for fixing issues, and the 'Implement' step is just '[write code]' with no detail. The integration pattern commands aren't real, undermining practical workflow execution. | 2 / 3 |
| Progressive Disclosure | Monolithic wall of text with no references to external files despite mentioning a file structure (.claude/evals/). Everything is inline: the eval type definitions, grader types, metrics explanations, best practices, and full examples could be split into referenced files. No bundle files exist to support the referenced paths like '.claude/evals/'. | 1 / 3 |
| Total | | 6 / 12 Passed |
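The review refers to pass@k without defining it. For readers who want the quantity pinned down, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of trials and c the number that passed. The function name `pass_at_k` is introduced here for illustration; whether the skill computes the metric this way is not stated in the review.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples passes), given c passes in n trials."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, so the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 3 passes out of 10 trials: single-sample and three-sample estimates.
print(round(pass_at_k(10, 3, 1), 4))
print(round(pass_at_k(10, 3, 3), 4))
```

With k = 1 the estimate reduces to the plain pass rate c / n, which is why a skill aimed at Claude can state the metric in one line rather than a full section.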
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
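The `frontmatter_unknown_keys` warning can be reproduced locally with a simple scan of the SKILL.md frontmatter. This is a rough sketch under stated assumptions: the `ALLOWED_KEYS` set is a guess made for illustration (the review does not list the keys the validator accepts), the `tools` key in the sample is hypothetical, and the parser only handles top-level `key: value` lines rather than full YAML.

```python
import re

# Assumed allow-list for illustration; the validator's real list is not given in the review.
ALLOWED_KEYS = {"name", "description", "license", "allowed-tools", "metadata"}


def unknown_frontmatter_keys(skill_md: str) -> list[str]:
    """Return top-level frontmatter keys that are not in ALLOWED_KEYS."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at all
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        # Top-level keys start at column 0; indented lines belong to nested values.
        match = re.match(r"([A-Za-z_][\w-]*)\s*:", line)
        if match:
            keys.append(match.group(1))
    return [k for k in keys if k not in ALLOWED_KEYS]


# Hypothetical SKILL.md content; 'tools' stands in for whichever key triggered the warning.
sample = """---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions
tools: Read, Write
metadata:
  origin: example
---
# Eval Harness
"""

print(unknown_frontmatter_keys(sample))
```

Following the validator's own advice, an unknown key can either be deleted or nested under `metadata`, where the indented form is ignored by this check.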
79cc4e3