# Skill Review: eval-harness

> Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

| Metric | Score | Notes |
|---|---|---|
| Overall | 69 | |
| Quality | 24% | Does it follow best practices? |
| Impact | 100% | 2.08x average score across 6 eval scenarios |

Advisory: suggest reviewing before use.

To optimize this skill with Tessl, run:
`npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md`

## Quality
### Discovery (22%)

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and jargon-heavy to be effective for skill selection. It fails to specify concrete actions the skill performs and lacks any 'Use when...' guidance. The term 'EDD' may help narrow the domain slightly, but without concrete capabilities or trigger terms, Claude would struggle to reliably select this skill.
#### Suggestions

- Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user wants to evaluate Claude Code session quality, set up eval benchmarks, or implement eval-driven development workflows.'
- Replace 'formal evaluation framework' with specific actions, e.g., 'Creates evaluation test cases, runs automated scoring against expected outputs, and generates performance reports for Claude Code sessions.'
- Include natural trigger terms users might say, such as 'evaluate', 'benchmark', 'test quality', 'score responses', 'evals', or 'grading criteria'.
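Putting these suggestions together, a rewritten frontmatter description might look like the following. This is an illustrative sketch, not the skill's actual frontmatter; the field names follow common SKILL.md conventions.

```yaml
---
name: eval-harness
description: >
  Creates evaluation test cases, runs automated scoring against expected
  outputs, and generates performance reports for Claude Code sessions.
  Use when the user wants to evaluate session quality, set up eval
  benchmarks, score responses, or implement eval-driven development (EDD).
---
```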
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions. It doesn't specify what the skill actually does (e.g., create test cases, run evaluations, score outputs, generate reports). | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (a framework for evaluation) and completely lacks a 'when' clause. There are no explicit triggers or guidance on when Claude should select this skill. | 1 / 3 |
| Trigger Term Quality | It includes some relevant terms like 'eval-driven development', 'EDD', and 'Claude Code sessions', but these are fairly niche and jargon-heavy. Common user phrases like 'evaluate', 'test', 'benchmark', 'score', or 'assess quality' are missing. | 2 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'eval-driven development (EDD)' and 'Claude Code sessions' provides some specificity, but 'formal evaluation framework' is broad enough to potentially overlap with testing, QA, or benchmarking skills. | 2 / 3 |
| **Total** | | **6 / 12 Passed** |
### Implementation (27%)

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is a comprehensive but overly verbose conceptual guide to eval-driven development. It spends too many tokens explaining philosophy and definitions that Claude already understands, while the actual actionable guidance (specific commands, executable code, error handling) is thin. The content would benefit greatly from aggressive trimming and splitting into separate reference files.
#### Suggestions

- Cut the philosophy section and conceptual explanations (eval type definitions, metric definitions) to brief one-liners; Claude already understands these concepts.
- Make the /eval commands actionable by either providing actual implementation scripts or clarifying that these are conventions the user must implement, with concrete file templates.
- Add explicit error recovery steps: what to do when capability evals fail, when regression evals fail, and how to iterate on the eval definitions themselves.
- Split grader type details, the metrics reference, and the authentication example into separate files referenced from a concise overview SKILL.md.
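As a concrete starting point for the "actual implementation scripts" suggestion, a minimal code-based grader might look like the sketch below. The transcript path, expected-strings file, and JSON result shape are assumptions for illustration, not part of the skill.

```python
#!/usr/bin/env python3
"""Minimal code-based grader: checks a session transcript for expected strings."""
import json
import sys


def grade(transcript: str, expected: list[str]) -> dict:
    """Return pass/fail plus the fraction of expected strings found."""
    hits = [s for s in expected if s in transcript]
    score = len(hits) / len(expected) if expected else 0.0
    return {
        "passed": score == 1.0,
        "score": score,
        "missing": [s for s in expected if s not in hits],
    }


if __name__ == "__main__" and len(sys.argv) > 2:
    # Hypothetical invocation: grader.py transcript.txt expected.json
    with open(sys.argv[1]) as f, open(sys.argv[2]) as g:
        print(json.dumps(grade(f.read(), json.load(g)), indent=2))
```

A script like this can back a code-based grader entry in an eval definition, and its JSON output gives the report step something machine-readable to aggregate.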
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose, explaining concepts like eval-driven development philosophy, what pass@k means, and various grader types at length. Much of this is conceptual explanation that Claude already understands. The content could be reduced by 60%+ while preserving all actionable guidance. | 1 / 3 |
| Actionability | There are some concrete examples (bash commands for code-based graders, markdown templates for eval definitions and reports), but most content is template/pseudocode rather than executable. The /eval commands appear to be hypothetical slash commands with no actual implementation details. The grader examples mix executable bash with non-executable markdown templates. | 2 / 3 |
| Workflow Clarity | The 4-step workflow (Define → Implement → Evaluate → Report) is clearly sequenced, but validation checkpoints are weak. There's no explicit feedback loop for what to do when evals fail, no error recovery guidance, and the 'Implement' step is just 'write code.' The integration pattern commands (/eval define, /eval check, /eval report) lack detail on what happens when checks fail. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with everything inline. The skill references storing evals in .claude/evals/ but doesn't split its own content across files. Eval types, grader types, metrics, workflow, best practices, and a full example are all in one long document with no references to separate detailed guides. | 1 / 3 |
| **Total** | | **6 / 12 Passed** |
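Since the skill reportedly explains pass@k at length, a compact reference implementation could replace much of that prose. This is a sketch of the standard unbiased estimator, where `n` is the number of samples and `c` the number that passed; it is not taken from the skill itself.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n total, c of them correct, passes."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, pass@1 comes out to roughly 0.3, matching the intuitive per-sample pass rate.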
### Validation (90%)

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

#### Validation for skill structure (10 / 11 passed)
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to `metadata` | Warning |
| **Total** | **10 / 11 Passed** | |
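The one warning above can be checked mechanically before publishing. The sketch below flags top-level frontmatter keys outside an allowed set; the allowed-key list is an assumption based on common SKILL.md frontmatter fields, not the validator's actual rules.

```python
# Assumed allowed SKILL.md frontmatter keys; adjust to match the spec you target.
ALLOWED_KEYS = {"name", "description", "license", "allowed-tools", "metadata"}


def unknown_frontmatter_keys(skill_md: str) -> list[str]:
    """Return top-level frontmatter keys not in ALLOWED_KEYS."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block to check
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        # Top-level keys start at column 0 and contain a colon.
        if line and not line[0].isspace() and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return [k for k in keys if k not in ALLOWED_KEYS]
```

Moving any flagged key under `metadata`, as the warning suggests, would clear the check.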