eval-harness

A formal evaluation framework for Claude Code sessions that implements evaluation-driven development (EDD) principles

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Advisory

Suggest reviewing before use

Quality

Content

65%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is highly actionable, with concrete templates, executable commands, and a clear high-level workflow. However, it is verbose because of internal redundancy, it is monolithic with no progressive disclosure, and its batch-eval workflow lacks validation feedback.

Suggestions

De-duplicate the pass@k/pass^k explanations and grader-type lists (currently repeated in both the core sections and '产品评估 (v1.8)', "Product Evals (v1.8)") into a single canonical treatment.

Add an explicit validation/feedback loop in the evaluate step — e.g. 'if any regression eval FAILS, stop, fix the cause, and re-run before reporting' — to raise workflow clarity.

Split the worked '添加身份验证' ("Add Authentication") example and the v1.8 product-eval detail into separate reference files linked one level deep from SKILL.md to improve progressive disclosure.
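The feedback loop proposed in the second suggestion could look roughly like the sketch below. It assumes each regression eval is a shell command whose exit status signals PASS or FAIL. The function names and the `(name, command)` pairing are hypothetical and are not taken from the skill itself.

```python
import subprocess
from typing import Callable, Iterable, Optional, Tuple


def run_command(cmd: str) -> bool:
    """Return True when the shell command exits with status 0 (PASS)."""
    return subprocess.run(cmd, shell=True).returncode == 0


def first_failing_eval(
    evals: Iterable[Tuple[str, str]],
    runner: Callable[[str], bool] = run_command,
) -> Optional[str]:
    """Run (name, command) pairs in order and stop at the first FAIL.

    Returns the failing eval's name, or None when every eval passes.
    """
    for name, cmd in evals:
        if not runner(cmd):
            # Stop here: the cause should be fixed and the suite re-run
            # before any report is written.
            return name
    return None
```

A caller might pass a list such as `[("build", "npm run build")]` and refuse to generate the report whenever the return value is not `None`. Stopping at the first failure keeps the loop short; collecting all failures before stopping would be an equally reasonable design.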

Dimension	Reasoning	Score
Conciseness	The body is mostly efficient templates and lists, but it is padded with redundancy: pass@k/pass^k and grader types are each explained twice, once in '指标' ("Metrics") / '评分器类型' ("Grader Types") and again in '产品评估 (v1.8)' ("Product Evals (v1.8)"), so it could be tightened toward the lean level-3 anchor.	2 / 3
Actionability	It provides concrete, executable guidance — runnable bash (`grep -q ... && echo PASS`, `npm test -- --testPathPattern`, `npm run build`) and copy-paste-ready markdown templates for capability/regression evals and reports — matching the fully executable level-3 anchor.	3 / 3
Workflow Clarity	The define→implement→evaluate→report sequence is clearly laid out, but the evaluate step uses a placeholder ('[Run each capability eval, record PASS/FAIL]') with no explicit validation/failure-feedback checkpoint for batch regression runs, capping this at 2 per the destructive/batch feedback-loop guideline.	2 / 3
Progressive Disclosure	Sections are organized, but the file is monolithic: templates, grader types, the worked authentication example, and the v1.8 product-eval material all live inline with no one-level-deep references or navigation to separate files, fitting the level-2 'structure present but should be split' anchor.	2 / 3
	Total	9 / 12 Passed
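For the de-duplication that the Conciseness row calls for, a single canonical treatment of the two metrics might reduce to something like the sketch below. It assumes the common definitions (pass@k: at least one of k attempts succeeds; pass^k: all k attempts succeed) and independent attempts with a fixed per-attempt success rate `p`. The review does not reproduce the skill's own wording, so these definitions and the function names are assumptions.

```python
def _check(p: float, k: int) -> None:
    """Validate the shared arguments of both metrics."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    _check(p, k)
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    _check(p, k)
    return p ** k
```

Under these assumptions the two metrics coincide at k = 1 and diverge as k grows: pass@k rises toward 1 while pass^k falls toward 0, which is why one is usually read as a capability signal and the other as a reliability signal.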

Description

35%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description uses correct third-person voice and names its domain, but is generic and lacks any explicit trigger guidance or natural user keywords, leaving both completeness and trigger-term quality weak.

Suggestions

Add an explicit 'Use when...' clause naming concrete triggering situations, e.g. '...Use when setting up eval-driven development for Claude Code tasks, defining pass/fail criteria, or building regression test suites for prompts or agents.'

Surface natural keywords users would actually say ('eval-driven development', 'pass@k', 'agent regression tests', 'benchmarking agent reliability') instead of relying on the EDD acronym.

List 2–3 concrete actions the skill performs (define capability/regression evals, run graders, generate eval reports) to lift specificity toward the level-3 anchor.
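A quick self-check for the first two suggestions can be sketched as follows. This is a hypothetical helper, not how the reviewer scores descriptions: it only looks for a literal "Use when" clause and for the example keywords quoted in the suggestion above.

```python
from typing import List

# Example keywords copied from the suggestion above; extend as needed.
NATURAL_KEYWORDS = (
    "eval-driven development",
    "pass@k",
    "agent regression tests",
    "benchmarking agent reliability",
)


def description_gaps(description: str) -> List[str]:
    """List the obvious discovery gaps in a skill description."""
    text = description.lower()
    gaps = []
    if "use when" not in text:
        gaps.append("missing 'Use when...' clause")
    if not any(keyword in text for keyword in NATURAL_KEYWORDS):
        gaps.append("no natural keywords")
    return gaps
```

An empty list only means the two surface checks pass; it says nothing about specificity or distinctiveness, which still need a human or model judgment.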

Dimension	Reasoning	Score
Specificity	The description names the domain ('正式评估框架', "formal evaluation framework"; '评估驱动开发(EDD)', "evaluation-driven development (EDD)") and one abstract action ('实施...原则', "implementing ... principles"), but unlike a level-3 anchor it does not enumerate multiple concrete actions, only a single conceptual one.	2 / 3
Completeness	It states what the skill is (a formal evaluation framework implementing EDD), but provides no explicit 'Use when...' trigger guidance, so per the judging guidelines completeness is capped at 2.	2 / 3
Trigger Term Quality	The phrasing relies on technical jargon ('评估驱动开发(EDD)', "evaluation-driven development (EDD)"; 'EDD原则', "EDD principles") and includes no natural keywords a user would actually say when reaching for this skill, matching the level-1 'technical jargon, no natural keywords' anchor.	1 / 3
Distinctiveness / Conflict Risk	The EDD/eval-harness niche is somewhat specific and unlikely to broadly conflict, but the description is general enough that it could overlap with generic testing or QA skills rather than having a sharply distinct trigger.	2 / 3
	Total	7 / 12 Passed

Validation

93%


Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	15 / 16 Passed
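The `frontmatter_unknown_keys` warning can be reproduced locally with a check along these lines. It is a minimal sketch: the allow-list is an assumption and should be confirmed against the spec the validator actually uses, and the parser only recognizes unindented `key: value` lines rather than full YAML.

```python
from typing import List, Set

# Assumed allow-list of top-level frontmatter keys; verify against the spec.
ALLOWED_KEYS: Set[str] = {
    "name",
    "description",
    "license",
    "compatibility",
    "metadata",
    "allowed-tools",
}


def unknown_frontmatter_keys(skill_md: str, allowed: Set[str] = ALLOWED_KEYS) -> List[str]:
    """Return top-level frontmatter keys that are not in the allow-list."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at the top of the file
    unknown = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        # Indented lines belong to a nested value, so skip them.
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            key = line.split(":", 1)[0].strip()
            if key not in allowed:
                unknown.append(key)
    return unknown
```

Any key the function reports could then be removed or moved under `metadata`, as the warning recommends. The review does not say which key triggered the warning, so the sketch makes no guess about it.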

Repository: affaan-m/everything-claude-code
Commit: 4130457

Reviewed: 4 days ago

