> Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
- **Score:** 69
- **Quality:** 24% (Does it follow best practices?)
- **Impact:** 100% (2.08x average score across 6 eval scenarios)
- **Advisory:** Suggest reviewing before use
Optimize this skill with Tessl: `npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md`

## Quality
### Discovery: 22%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and jargon-heavy to be effective for skill selection. It fails to specify concrete actions the skill performs and lacks any explicit trigger guidance ('Use when...'). The term 'EDD' provides some niche identity but the overall description would not help Claude reliably choose this skill from a large pool.
**Suggestions**

- Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user wants to evaluate prompt quality, run evals, benchmark Claude responses, or implement eval-driven development workflows.'
- Replace 'formal evaluation framework' with specific, concrete actions, e.g., 'Creates evaluation test cases, runs scoring rubrics against Claude outputs, tracks eval metrics across iterations, and generates pass/fail reports.'
- Include natural user-facing keywords like 'evals', 'benchmark', 'test cases', 'scoring', 'prompt testing', and 'quality measurement' to improve trigger-term coverage.
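Applied together, these suggestions point toward a frontmatter description along the following lines. This is an illustrative sketch, not the skill's actual metadata:

```yaml
---
name: eval-harness
description: >
  Creates evaluation test cases, runs scoring rubrics against Claude
  outputs, tracks eval metrics across iterations, and generates
  pass/fail reports. Use when the user wants to run evals, benchmark
  Claude responses, test prompt quality, score outputs, or implement
  eval-driven development (EDD) workflows.
---
```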
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions. It doesn't specify what the skill actually does (e.g., create test cases, run benchmarks, score outputs, generate reports). | 1 / 3 |
| Completeness | The description weakly addresses 'what' (a framework for evaluation) but provides no 'when' guidance. There is no 'Use when...' clause or equivalent trigger guidance, and the 'what' itself is too vague to be useful. | 1 / 3 |
| Trigger Term Quality | It includes some relevant terms like 'eval-driven development', 'EDD', and 'evaluation framework', but these are fairly niche and jargon-heavy. Common user phrases like 'test my prompt', 'benchmark', 'measure quality', or 'score responses' are missing. | 2 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'eval-driven development' and 'EDD' provides some distinctiveness, but 'formal evaluation framework' and 'Claude Code sessions' are broad enough to potentially overlap with testing, QA, or other assessment-related skills. | 2 / 3 |
| Total | | 6 / 12 (Passed) |
### Implementation: 27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is overly verbose and descriptive, spending significant tokens explaining concepts (EDD philosophy, what pass@k means, grader type descriptions) rather than providing lean, actionable instructions. While it covers the topic comprehensively, it reads more like a tutorial or documentation page than a concise skill file. The content would benefit greatly from aggressive trimming, splitting into referenced sub-files, and adding concrete validation/error-recovery steps.
**Suggestions**

- Cut the philosophy section and metric definitions entirely; Claude already understands these concepts. Focus only on the specific templates, commands, and file structures unique to this project's eval workflow.
- Split grader types, eval templates, and the authentication example into separate referenced files (e.g., GRADERS.md, TEMPLATES.md, EXAMPLES.md) to reduce the main skill to an actionable overview.
- Add explicit validation and error-recovery steps to the workflow: what happens when evals fail, how to diagnose failures, and when to re-run vs. investigate.
- Make the /eval commands actionable: either provide the actual implementation (scripts/aliases) or clarify that these are conceptual patterns the user should adapt, with concrete alternatives.
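To make the last two suggestions concrete, a code-based grader can be a small script instead of a conceptual /eval command. The sketch below is an assumption about one possible shape; the `grade` helper, the log path, and the example checks are all illustrative, not part of the skill:

```shell
# grade.sh: minimal sketch of a code-based grader (names and paths
# are illustrative). Runs any check command, captures its output for
# later diagnosis, and emits a PASS/FAIL line the eval report can record.
grade() {
  if "$@" > /tmp/eval-output.log 2>&1; then
    echo "PASS: $*"
  else
    echo "FAIL: $* (output saved to /tmp/eval-output.log)"
    return 1
  fi
}

# Example checks an eval definition might call:
grade true                 # stand-in for a real check, e.g. `npm test`
grade false || true        # a failing check; report it without aborting
```

Keeping the raw output in a log file gives the workflow a built-in diagnosis step when an eval fails, addressing the missing error-recovery guidance noted above.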
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose, explaining concepts like eval-driven development philosophy, what pass@k means, and general best practices that Claude already knows. The content could be reduced by 60%+ while preserving all actionable information. Many sections are descriptive rather than instructive. | 1 / 3 |
| Actionability | There are some concrete examples (bash commands for code-based graders, markdown templates for eval definitions and reports), but much of the content is template/pseudocode rather than executable. The /eval commands appear to be hypothetical slash commands without implementation details. The grep/npm examples are useful, but most content describes formats rather than providing copy-paste-ready workflows. | 2 / 3 |
| Workflow Clarity | The 4-step workflow (Define → Implement → Evaluate → Report) is clearly sequenced, but lacks validation checkpoints and error recovery. There's no guidance on what to do when evals fail, no feedback loops for fixing issues, and the 'Implementation' step is just '[write code]' with no concrete guidance. | 2 / 3 |
| Progressive Disclosure | The entire skill is a monolithic wall of text with no references to external files. All content (philosophy, eval types, grader types, metrics, workflow, storage, best practices, and a full example) is inlined in a single document. The eval storage section mentions .claude/evals/ files but doesn't reference separate documentation for any of these topics. | 1 / 3 |
| Total | | 6 / 12 (Passed) |
### Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

10 / 11 checks passed.

**Validation for skill structure**
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
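The one warning is typically resolved by nesting non-standard frontmatter keys under a `metadata` block, as the check itself suggests. A hypothetical before/after, with an invented `version` key for illustration:

```yaml
# Before: unrecognized top-level key triggers frontmatter_unknown_keys
---
name: eval-harness
description: ...
version: 1.0        # unknown key at the top level
---

# After: the same key moved under metadata
---
name: eval-harness
description: ...
metadata:
  version: 1.0
---
```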