> Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
- **Score:** 69
- **Quality:** 24% (Does it follow best practices?)
- **Impact:** 100% (2.08x average score across 6 eval scenarios)
- **Advisory:** Suggest reviewing before use
Optimize this skill with Tessl: `npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md`

## Quality
### Discovery: 22%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and jargon-heavy to be effective for skill selection. It fails to specify concrete actions the skill performs and lacks any explicit trigger guidance ('Use when...'). The term 'EDD' provides some niche identity but the overall description would not help Claude reliably choose this skill from a large pool.
**Suggestions**

- Add a 'Use when...' clause with explicit triggers, e.g., 'Use when the user wants to evaluate prompt quality, run evals, benchmark Claude responses, or implement eval-driven development workflows.'
- Replace 'formal evaluation framework' with specific, concrete actions, e.g., 'Creates evaluation test cases, runs scoring rubrics against Claude outputs, tracks eval metrics across iterations, and generates pass/fail reports.'
- Include natural user-facing keywords like 'evals', 'benchmark', 'test cases', 'scoring', 'prompt testing', and 'quality measurement' to improve trigger-term coverage.
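Applied together, these suggestions point toward a frontmatter description along the following lines. This is an illustrative sketch, not the skill's actual metadata:

```yaml
---
name: eval-harness
description: >
  Creates evaluation test cases, runs scoring rubrics against Claude
  outputs, tracks eval metrics across iterations, and generates
  pass/fail reports. Use when the user wants to run evals, benchmark
  Claude responses, test prompt quality, score outputs, or implement
  eval-driven development (EDD) workflows.
---
```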
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions. It doesn't specify what the skill actually does (e.g., create test cases, run benchmarks, score outputs, generate reports). | 1 / 3 |
| Completeness | The description weakly addresses 'what' (a framework for evaluation) but provides no 'when' guidance. There is no 'Use when...' clause or equivalent trigger guidance, and the 'what' itself is too vague to be useful. | 1 / 3 |
| Trigger Term Quality | It includes some relevant terms like 'eval-driven development', 'EDD', and 'evaluation framework', but these are fairly niche and jargon-heavy. Common user phrases like 'test my prompt', 'benchmark', 'measure quality', or 'score responses' are missing. | 2 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'eval-driven development' and 'EDD' provides some distinctiveness, but 'formal evaluation framework' and 'Claude Code sessions' are broad enough to potentially overlap with testing, QA, or other assessment-related skills. | 2 / 3 |
| Total | | 6 / 12 (Passed) |
### Implementation: 27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is overly verbose and descriptive, spending significant tokens explaining concepts (EDD philosophy, what pass@k means, grader type descriptions) rather than providing lean, actionable instructions. While it covers the topic comprehensively, it reads more like a tutorial or documentation page than a concise skill file. The content would benefit greatly from aggressive trimming, splitting into referenced sub-files, and adding concrete validation/error-recovery steps.
**Suggestions**

- Cut the philosophy section and metric definitions entirely; Claude already understands these concepts. Focus only on the specific templates, commands, and file structures unique to this project's eval workflow.
- Split grader types, eval templates, and the authentication example into separate referenced files (e.g., GRADERS.md, TEMPLATES.md, EXAMPLES.md) to reduce the main skill to an actionable overview.
- Add explicit validation and error-recovery steps to the workflow: what happens when evals fail, how to diagnose failures, and when to re-run vs. investigate.
- Make the /eval commands actionable: either provide the actual implementation (scripts/aliases) or clarify that these are conceptual patterns the user should adapt, with concrete alternatives.
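To make the last two suggestions concrete, a code-based grader can be a small script instead of a conceptual /eval command. The sketch below is an assumption about one possible shape; the `grade` helper, the log path, and the example checks are all illustrative, not part of the skill:

```shell
# grade.sh: minimal sketch of a code-based grader (names and paths
# are illustrative). Runs any check command, captures its output for
# later diagnosis, and emits a PASS/FAIL line the eval report can record.
grade() {
  if "$@" > /tmp/eval-output.log 2>&1; then
    echo "PASS: $*"
  else
    echo "FAIL: $* (output saved to /tmp/eval-output.log)"
    return 1
  fi
}

# Example checks an eval definition might call:
grade true                 # stand-in for a real check, e.g. `npm test`
grade false || true        # a failing check; report it without aborting
```

Keeping the raw output in a log file gives the workflow a built-in diagnosis step when an eval fails, addressing the missing error-recovery guidance noted above.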
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose, explaining concepts like eval-driven development philosophy, what pass@k means, and general best practices that Claude already knows. The content could be reduced by 60%+ while preserving all actionable information. Many sections are descriptive rather than instructive. | 1 / 3 |
| Actionability | There are some concrete examples (bash commands for code-based graders, markdown templates for eval definitions and reports), but much of the content is template/pseudocode rather than executable. The /eval commands appear to be hypothetical slash commands without implementation details. The grep/npm examples are useful, but most content describes formats rather than providing copy-paste-ready workflows. | 2 / 3 |
| Workflow Clarity | The 4-step workflow (Define → Implement → Evaluate → Report) is clearly sequenced, but lacks validation checkpoints and error recovery. There's no guidance on what to do when evals fail, no feedback loops for fixing issues, and the 'Implementation' step is just '[write code]' with no concrete guidance. | 2 / 3 |
| Progressive Disclosure | The entire skill is a monolithic wall of text with no references to external files. All content (philosophy, eval types, grader types, metrics, workflow, storage, best practices, and a full example) is inlined in a single document. The eval storage section mentions .claude/evals/ files but doesn't reference separate documentation for any of these topics. | 1 / 3 |
| Total | | 6 / 12 (Passed) |
### Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

10 / 11 checks passed.

**Validation for skill structure**
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
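The one warning is typically resolved by nesting non-standard frontmatter keys under a `metadata` block, as the check itself suggests. A hypothetical before/after, with an invented `version` key for illustration:

```yaml
# Before: unrecognized top-level key triggers frontmatter_unknown_keys
---
name: eval-harness
description: ...
version: 1.0        # unknown key at the top level
---

# After: the same key moved under metadata
---
name: eval-harness
description: ...
metadata:
  version: 1.0
---
```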