Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
- 42%: Does it follow best practices?
- Impact: 100%
- 2.08x average score across 6 eval scenarios
- Advisory: Suggest reviewing before use
Optimize this skill with Tessl:

```
npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md
```

## Quality
### Discovery: 22%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and lacks actionable detail. It fails to explain what concrete actions the skill enables (e.g., creating test cases, running evaluations, scoring outputs) and provides no guidance on when Claude should select this skill. The technical jargon 'EDD principles' assumes familiarity without explanation.
Suggestions:

- Add specific concrete actions the skill performs, e.g. 'Creates evaluation test cases, runs automated scoring, tracks performance metrics across iterations'
- Include an explicit 'Use when...' clause with natural trigger terms like 'evaluate my code', 'test quality', 'measure performance', 'run evals', 'benchmark'
- Define what 'eval-driven development' means in practical terms so Claude knows the specific scenarios this skill addresses
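One way to fold the three suggestions above into the skill's frontmatter is sketched below. The wording and trigger phrases are illustrative, not taken from the skill under review; only the `name` is inferred from the path in the review command.

```yaml
---
name: eval-harness
description: >
  Creates evaluation test cases, runs automated scoring, and tracks
  pass rates across iterations for Claude Code sessions (eval-driven
  development: write evals before implementing, then iterate until
  they pass). Use when the user asks to "run evals", "benchmark this
  feature", "measure output quality", or "test quality".
---
```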
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions Claude would perform. No specific capabilities are mentioned. | 1 / 3 |
| Completeness | Only vaguely addresses 'what' (evaluation framework) with no explicit 'when' clause or trigger guidance. Missing a 'Use when...' statement entirely. | 1 / 3 |
| Trigger Term Quality | Contains some relevant terms like 'eval-driven development', 'EDD', and 'evaluation' that users might say, but lacks common variations and natural phrases users would actually use when needing this skill. | 2 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'eval-driven development' and 'EDD' provides some specificity, but 'evaluation framework' and 'Claude Code sessions' are generic enough to potentially overlap with other testing or development skills. | 2 / 3 |
| **Total** | | 6 / 12 (Passed) |
### Implementation: 62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill provides a comprehensive framework for eval-driven development with good workflow structure and clear phases. However, it suffers from being overly explanatory about concepts Claude already understands (pass@k metrics, grader types) and includes hypothetical slash commands that aren't actually implemented. The content would benefit from being more concise and providing truly executable examples rather than templates.
Suggestions:

- Remove explanatory content about basic concepts such as pass@k definitions and grader types; Claude already knows these
- Replace hypothetical '/eval' slash commands with actual executable bash scripts, or provide implementation details for these commands
- Split detailed examples and templates into separate reference files (e.g., TEMPLATES.md, EXAMPLES.md) and link them from the main skill
- Convert markdown template examples into actual executable test scripts that can be run
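The suggestions above can be made concrete with a minimal, runnable eval runner in place of hypothetical slash commands. Everything here is an illustrative sketch: the case names, commands, and substring grader are assumptions, not part of the skill under review.

```python
#!/usr/bin/env python3
"""Minimal eval-runner sketch: define cases, run them, report pass counts."""

import json
import subprocess

# Hypothetical eval cases: each pairs a command with a substring grader.
CASES = [
    {"name": "greets-user", "cmd": ["echo", "hello world"], "expect": "hello"},
    {"name": "reports-version", "cmd": ["echo", "v1.2.3"], "expect": "v1."},
]


def run_case(case):
    """Run one case and grade it with a simple substring check."""
    out = subprocess.run(case["cmd"], capture_output=True, text=True).stdout
    return case["expect"] in out


def run_suite(cases):
    """Run every case and aggregate per-case results into a summary dict."""
    results = {c["name"]: run_case(c) for c in cases}
    return {"passed": sum(results.values()), "total": len(cases), "results": results}


if __name__ == "__main__":
    print(json.dumps(run_suite(CASES), indent=2))
```

A real harness would load cases from a directory such as the skill's proposed `.claude/evals/` layout and swap the substring check for richer graders, but even this shape is directly executable, which is what the review asks for.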
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably efficient but includes some explanatory content Claude already knows (e.g., explaining what pass@k means, basic grader concepts). The philosophy section and some descriptions could be tightened. | 2 / 3 |
| Actionability | Provides templates and examples, but many are pseudocode or markdown templates rather than executable code. The bash examples are concrete, but workflow commands like '/eval define feature-name' appear to be hypothetical slash commands without implementation details. | 2 / 3 |
| Workflow Clarity | Clear 4-phase workflow (Define → Implement → Evaluate → Report) with explicit sequencing. The example at the end demonstrates the complete flow with checkpoints and status tracking. | 3 / 3 |
| Progressive Disclosure | Content is organized with clear sections and headers, but everything is in one monolithic file. References to '.claude/evals/' storage structure are mentioned but no actual linked files exist. The content could benefit from splitting detailed examples into separate reference files. | 2 / 3 |
| **Total** | | 9 / 12 (Passed) |
### Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Skill structure validation: 10 / 11 checks passed.
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| **Total** | | 10 / 11 (Passed) |
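The one warning above can typically be cleared by nesting non-spec keys under `metadata`, as the check's message suggests. The report does not say which key was flagged, so `version` below is a hypothetical stand-in:

```yaml
# Before: a top-level key outside the spec trips frontmatter_unknown_keys.
---
name: eval-harness
description: ...
version: 0.1.0
---

# After: the unknown key is nested under metadata.
---
name: eval-harness
description: ...
metadata:
  version: 0.1.0
---
```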