eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

Install with Tessl CLI

npx tessl i github:ysyecust/everything-claude-code --skill eval-harness

What are skills?

2.08x

Quality

42%

Does it follow best practices?

Impact

100%

2.08x

Average score across 6 eval scenarios

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

SKILL.md

Review

Evals

Discovery

22%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description is too abstract and lacks actionable detail. It fails to specify what concrete actions the skill enables and provides no guidance on when Claude should select it. The technical jargon ('EDD principles') may not match natural user language.

Suggestions

Add a 'Use when...' clause with explicit triggers like 'Use when the user wants to evaluate code quality, run benchmarks, or implement eval-driven development workflows'

Replace abstract 'formal evaluation framework' with specific actions like 'Runs evaluation suites, scores model outputs, compares baseline vs candidate results, generates evaluation reports'

Include natural trigger terms users would say: 'eval', 'benchmark', 'test performance', 'measure accuracy', 'score outputs'

Dimension	Reasoning	Score
Specificity	The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions Claude would perform. No specific capabilities like 'run tests', 'score outputs', or 'compare results' are mentioned.	1 / 3
Completeness	The description only vaguely addresses 'what' (an evaluation framework) and completely lacks a 'when' clause. There is no explicit trigger guidance for when Claude should select this skill.	1 / 3
Trigger Term Quality	Contains some relevant terms ('evaluation', 'eval-driven development', 'EDD') but uses technical jargon that users may not naturally say. Missing common variations like 'test', 'benchmark', 'assess', 'measure performance'.	2 / 3
Distinctiveness Conflict Risk	The mention of 'Claude Code sessions' and 'EDD' provides some specificity, but 'evaluation framework' is generic enough to potentially conflict with testing, QA, or code review skills.	2 / 3
	Total	6 / 12 Passed

Implementation

62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill provides a comprehensive framework for eval-driven development with clear workflow structure and good conceptual organization. However, it leans toward being a reference document rather than actionable guidance - the /eval commands aren't real, code examples are illustrative rather than executable, and the content could be more concise by removing explanations of concepts Claude already understands.

Suggestions

Replace placeholder /eval commands with actual executable implementations or scripts that Claude can run

Convert pseudocode examples (like the bash graders) into complete, copy-paste ready scripts with real file paths and error handling

Remove or condense the Philosophy and Metrics explanation sections - Claude understands pass@k concepts

Split detailed examples and templates into separate reference files (e.g., TEMPLATES.md, EXAMPLES.md) and keep SKILL.md as a concise overview

Dimension	Reasoning	Score
Conciseness	The skill is reasonably efficient but includes some explanatory content Claude already knows (e.g., explaining what pass@k means, basic grader concepts). The philosophy section and some descriptions could be tightened.	2 / 3
Actionability	Provides templates and examples but many are pseudocode or placeholder patterns rather than executable code. The /eval commands reference a system that isn't defined, and the bash examples are illustrative rather than copy-paste ready for a real implementation.	2 / 3
Workflow Clarity	Clear 4-phase workflow (Define → Implement → Evaluate → Report) with explicit sequencing. The example at the end demonstrates the complete flow with checkpoints, and the integration patterns section provides clear trigger points.	3 / 3
Progressive Disclosure	Content is well-organized with clear sections, but everything is in one monolithic file. References to .claude/evals/ storage structure suggest external files but doesn't link to actual reference materials. The 200+ line document could benefit from splitting detailed examples into separate files.	2 / 3
	Total	9 / 12 Passed

Validation

90%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 10 / 11 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	10 / 11 Passed

Reviewed: 5 days ago

Table of Contents

Discovery Implementation Validation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.