Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
Install with Tessl CLI
npx tessl i github:ysyecust/everything-claude-code --skill eval-harness
Quality: 42% (does it follow best practices?)
Impact: 100% (2.08x average score across 6 eval scenarios)
Optimize this skill with Tessl:
npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

Discovery
22%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and lacks actionable detail. It fails to specify what concrete actions the skill enables and provides no guidance on when Claude should select it. The technical jargon ('EDD principles') may not match natural user language.
Suggestions
Add a 'Use when...' clause with explicit triggers like 'Use when the user wants to evaluate code quality, run benchmarks, or implement eval-driven development workflows'
Replace abstract 'formal evaluation framework' with specific actions like 'Runs evaluation suites, scores model outputs, compares baseline vs candidate results, generates evaluation reports'
Include natural trigger terms users would say: 'eval', 'benchmark', 'test performance', 'measure accuracy', 'score outputs'
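Applying the three suggestions above, a rewritten SKILL.md frontmatter might look like the following sketch. The wording and field values are illustrative assumptions, not content taken from the skill itself:

```yaml
---
name: eval-harness
description: >
  Runs evaluation suites for Claude Code sessions: scores model outputs,
  compares baseline vs candidate results, and generates evaluation reports.
  Use when the user wants to evaluate code quality, run benchmarks, measure
  accuracy, or set up eval-driven development (EDD) workflows. Trigger terms
  include "eval", "benchmark", "test performance", "score outputs".
---
```

Note how the rewrite leads with concrete actions, adds an explicit "Use when..." clause, and closes with natural trigger terms, addressing all three suggestions at once.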
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions Claude would perform. No specific capabilities like 'run tests', 'score outputs', or 'compare results' are mentioned. | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (an evaluation framework) and completely lacks a 'when' clause. There is no explicit trigger guidance for when Claude should select this skill. | 1 / 3 |
| Trigger Term Quality | Contains some relevant terms ('evaluation', 'eval-driven development', 'EDD') but uses technical jargon that users may not naturally say. Missing common variations like 'test', 'benchmark', 'assess', 'measure performance'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'Claude Code sessions' and 'EDD' provides some specificity, but 'evaluation framework' is generic enough to potentially conflict with testing, QA, or code review skills. | 2 / 3 |
| Total | | 6 / 12 (Passed) |
Implementation
62%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill provides a comprehensive framework for eval-driven development with clear workflow structure and good conceptual organization. However, it leans toward being a reference document rather than actionable guidance: the /eval commands aren't real, code examples are illustrative rather than executable, and the content could be more concise by removing explanations of concepts Claude already understands.
Suggestions
Replace placeholder /eval commands with actual executable implementations or scripts that Claude can run
Convert pseudocode examples (like the bash graders) into complete, copy-paste ready scripts with real file paths and error handling
Remove or condense the Philosophy and Metrics explanation sections - Claude understands pass@k concepts
Split detailed examples and templates into separate reference files (e.g., TEMPLATES.md, EXAMPLES.md) and keep SKILL.md as a concise overview
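As an example of the second suggestion, an illustrative bash grader rewritten as a complete function might look like the sketch below. The function name, file paths, and PASS-marker convention are assumptions for illustration, not part of the skill itself:

```shell
# Hypothetical grader sketch: checks that a candidate output file exists
# and contains an expected marker, printing PASS or FAIL and returning
# 0/1 so an eval harness can tally pass@k results over many runs.
grade_output() {
  local output_file="$1"   # path to the candidate output being graded
  local expected="$2"      # marker string the output must contain

  if [[ ! -f "$output_file" ]]; then
    echo "FAIL: missing output file: $output_file"
    return 1
  fi

  if grep -q "$expected" "$output_file"; then
    echo "PASS"
    return 0
  else
    echo "FAIL: expected marker '$expected' not found"
    return 1
  fi
}
```

A harness could then loop `grade_output run-$i/output.txt PASS` over repeated runs and count successes; the key design point is that the grader handles the missing-file edge case and signals results through its exit code rather than only through text.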
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably efficient but includes some explanatory content Claude already knows (e.g., explaining what pass@k means, basic grader concepts). The philosophy section and some descriptions could be tightened. | 2 / 3 |
| Actionability | Provides templates and examples but many are pseudocode or placeholder patterns rather than executable code. The /eval commands reference a system that isn't defined, and the bash examples are illustrative rather than copy-paste ready for a real implementation. | 2 / 3 |
| Workflow Clarity | Clear 4-phase workflow (Define → Implement → Evaluate → Report) with explicit sequencing. The example at the end demonstrates the complete flow with checkpoints, and the integration patterns section provides clear trigger points. | 3 / 3 |
| Progressive Disclosure | Content is well organized with clear sections, but everything is in one monolithic file. References to .claude/evals/ storage structure suggest external files, but the skill doesn't link to actual reference materials. The 200+ line document could benefit from splitting detailed examples into separate files. | 2 / 3 |
| Total | | 9 / 12 (Passed) |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |
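The one warning can typically be cleared by nesting unrecognized top-level frontmatter keys under metadata. A minimal sketch, assuming hypothetical offending keys (the warning does not name them):

```yaml
---
name: eval-harness
description: ...
metadata:
  author: ysyecust   # hypothetical key, moved out of the top level
  version: "1.0"     # hypothetical key, moved out of the top level
---
```

Only spec-defined keys remain at the top level; everything else lives under metadata, which validators treat as free-form.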