# Skill Review: eval-harness

> Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

| Metric | Score | Notes |
|---|---|---|
| Overall | 69 | |
| Quality | 24% | Does it follow best practices? |
| Impact | 100% | 2.08x average score across 6 eval scenarios |

Advisory: suggest reviewing before use.

To optimize this skill with Tessl, run:
`npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md`

## Quality
### Discovery (22%)

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and jargon-heavy to be effective for skill selection. It fails to specify concrete actions the skill performs and lacks any 'Use when...' guidance. The term 'EDD' may help narrow the domain slightly, but without concrete capabilities or trigger terms, Claude would struggle to reliably select this skill.
#### Suggestions

- Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user wants to evaluate Claude Code session quality, set up eval benchmarks, or implement eval-driven development workflows.'
- Replace 'formal evaluation framework' with specific actions, e.g., 'Creates evaluation test cases, runs automated scoring against expected outputs, and generates performance reports for Claude Code sessions.'
- Include natural trigger terms users might say, such as 'evaluate', 'benchmark', 'test quality', 'score responses', 'evals', or 'grading criteria'.
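Putting these suggestions together, a rewritten frontmatter description might look like the following. This is an illustrative sketch, not the skill's actual frontmatter; the field names follow common SKILL.md conventions.

```yaml
---
name: eval-harness
description: >
  Creates evaluation test cases, runs automated scoring against expected
  outputs, and generates performance reports for Claude Code sessions.
  Use when the user wants to evaluate session quality, set up eval
  benchmarks, score responses, or implement eval-driven development (EDD).
---
```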
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions. It doesn't specify what the skill actually does (e.g., create test cases, run evaluations, score outputs, generate reports). | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (a framework for evaluation) and completely lacks a 'when' clause. There are no explicit triggers or guidance on when Claude should select this skill. | 1 / 3 |
| Trigger Term Quality | It includes some relevant terms like 'eval-driven development', 'EDD', and 'Claude Code sessions', but these are fairly niche and jargon-heavy. Common user phrases like 'evaluate', 'test', 'benchmark', 'score', or 'assess quality' are missing. | 2 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'eval-driven development (EDD)' and 'Claude Code sessions' provides some specificity, but 'formal evaluation framework' is broad enough to potentially overlap with testing, QA, or benchmarking skills. | 2 / 3 |
| **Total** | | **6 / 12 Passed** |
### Implementation (27%)

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is a comprehensive but overly verbose conceptual guide to eval-driven development. It spends too many tokens explaining philosophy and definitions that Claude already understands, while the actual actionable guidance (specific commands, executable code, error handling) is thin. The content would benefit greatly from aggressive trimming and splitting into separate reference files.
#### Suggestions

- Cut the philosophy section and conceptual explanations (eval type definitions, metric definitions) to brief one-liners; Claude already understands these concepts.
- Make the /eval commands actionable by either providing actual implementation scripts or clarifying that these are conventions the user must implement, with concrete file templates.
- Add explicit error recovery steps: what to do when capability evals fail, when regression evals fail, and how to iterate on the eval definitions themselves.
- Split grader type details, the metrics reference, and the authentication example into separate files referenced from a concise overview SKILL.md.
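As a concrete starting point for the "actual implementation scripts" suggestion, a minimal code-based grader might look like the sketch below. The transcript path, expected-strings file, and JSON result shape are assumptions for illustration, not part of the skill.

```python
#!/usr/bin/env python3
"""Minimal code-based grader: checks a session transcript for expected strings."""
import json
import sys


def grade(transcript: str, expected: list[str]) -> dict:
    """Return pass/fail plus the fraction of expected strings found."""
    hits = [s for s in expected if s in transcript]
    score = len(hits) / len(expected) if expected else 0.0
    return {
        "passed": score == 1.0,
        "score": score,
        "missing": [s for s in expected if s not in hits],
    }


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical invocation: grader.py transcript.txt expected.json
    with open(sys.argv[1]) as f, open(sys.argv[2]) as g:
        print(json.dumps(grade(f.read(), json.load(g)), indent=2))
```

A script like this can back a code-based grader entry in an eval definition, and its JSON output gives the report step something machine-readable to aggregate.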
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose, explaining concepts like eval-driven development philosophy, what pass@k means, and various grader types at length. Much of this is conceptual explanation that Claude already understands. The content could be reduced by 60%+ while preserving all actionable guidance. | 1 / 3 |
| Actionability | There are some concrete examples (bash commands for code-based graders, markdown templates for eval definitions and reports), but most content is template/pseudocode rather than executable. The /eval commands appear to be hypothetical slash commands with no actual implementation details. The grader examples mix executable bash with non-executable markdown templates. | 2 / 3 |
| Workflow Clarity | The 4-step workflow (Define → Implement → Evaluate → Report) is clearly sequenced, but validation checkpoints are weak. There's no explicit feedback loop for what to do when evals fail, no error recovery guidance, and the 'Implement' step is just 'write code.' The integration pattern commands (/eval define, /eval check, /eval report) lack detail on what happens when checks fail. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with everything inline. The skill references storing evals in .claude/evals/ but doesn't split its own content across files. Eval types, grader types, metrics, workflow, best practices, and a full example are all in one long document with no references to separate detailed guides. | 1 / 3 |
| **Total** | | **6 / 12 Passed** |
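Since the skill reportedly explains pass@k at length, a compact reference implementation could replace much of that prose. This is a sketch of the standard unbiased estimator, where `n` is the number of samples and `c` the number that passed; it is not taken from the skill itself.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n total, c of them correct, passes."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, pass@1 comes out to roughly 0.3, matching the intuitive per-sample pass rate.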
### Validation (90%)

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

#### Validation for skill structure (10 / 11 passed)
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to `metadata` | Warning |
| **Total** | **10 / 11 Passed** | |
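The one warning above can be checked mechanically before publishing. The sketch below flags top-level frontmatter keys outside an allowed set; the allowed-key list is an assumption based on common SKILL.md frontmatter fields, not the validator's actual rules.

```python
# Assumed allowed SKILL.md frontmatter keys; adjust to match the spec you target.
ALLOWED_KEYS = {"name", "description", "license", "allowed-tools", "metadata"}


def unknown_frontmatter_keys(skill_md: str) -> list[str]:
    """Return top-level frontmatter keys not in ALLOWED_KEYS."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block to check
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        # Top-level keys start at column 0 and contain a colon.
        if line and not line[0].isspace() and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return [k for k in keys if k not in ALLOWED_KEYS]
```

Moving any flagged key under `metadata`, as the warning suggests, would clear the check.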