A formal evaluation framework for Claude Code sessions, implementing Evaluation-Driven Development (EDD) principles
33
17%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Advisory
Suggest reviewing before use
Optimize this skill with Tessl
npx tessl skill review --optimize ./docs/zh-CN/skills/eval-harness/SKILL.md
Quality
Discovery
7%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too vague and abstract, relying on jargon ('评估驱动开发/EDD') without explaining concrete actions or providing trigger guidance. It fails to answer both 'what does this do' in specific terms and 'when should Claude use it'. The description needs substantial improvement across all dimensions to be useful for skill selection.
Suggestions
Add specific concrete actions the skill performs, e.g., 'Creates evaluation rubrics, scores Claude Code session outputs against criteria, generates improvement recommendations'
Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks to evaluate, assess, grade, or review Claude Code session quality, or mentions EDD or evaluation-driven development'
Replace or supplement the jargon-heavy framing with plain language that describes the practical outcomes, and consider providing the description in English or bilingually for broader discoverability
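Putting these suggestions together, an improved frontmatter description might read along these lines (the wording is illustrative, not taken from the actual skill):

```yaml
---
name: eval-harness
description: >
  Creates evaluation rubrics, scores Claude Code session outputs against
  defined criteria, and generates improvement recommendations, following
  Evaluation-Driven Development (EDD). Use when the user asks to evaluate,
  assess, grade, or review Claude Code session quality, or mentions EDD
  or evaluation-driven development.
---
```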
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions '正式评估框架' (formal evaluation framework) and '评估驱动开发(EDD)原则' (Evaluation-Driven Development principles), but these are abstract concepts without concrete actions. No specific actions like 'creates rubrics', 'scores outputs', or 'generates evaluation reports' are listed. | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (a formal evaluation framework) and completely lacks a 'when' clause. There is no explicit guidance on when Claude should select this skill, which per the rubric caps completeness at 2 maximum, but the 'what' is also too vague to merit even a 2. | 1 / 3 |
| Trigger Term Quality | The description uses specialized jargon like '评估驱动开发(EDD)' which is not a term users would naturally say. It lacks natural trigger keywords that a user might use when needing this skill. Additionally, being entirely in Chinese limits discoverability for non-Chinese speakers. | 1 / 3 |
| Distinctiveness Conflict Risk | The mention of 'EDD' and '克劳德代码会话' (Claude Code sessions) provides some specificity to a niche domain, making it somewhat distinguishable. However, '评估框架' (evaluation framework) is still broad enough to potentially overlap with other evaluation or testing skills. | 2 / 3 |
| Total | | 5 / 12 Passed |
Implementation
27%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is a comprehensive but overly verbose conceptual framework that explains EDD philosophy and eval taxonomy at length, much of which Claude already understands. The actionability is undermined by pseudo-commands (`/eval define`) that have no actual implementation, and templates that are illustrative rather than executable. The content would benefit significantly from being condensed to ~50 lines of concrete, actionable guidance with supporting details moved to referenced files.
Suggestions
Cut the content by 60-70%: remove the philosophy section, deduplicate pass@k explanations (appears 3 times), and eliminate generic best practices that Claude already knows.
Make the `/eval define`, `/eval check`, `/eval report` commands actionable by providing actual shell scripts or Claude Code slash command definitions rather than aspirational placeholders.
Split detailed content (grader types, metrics definitions, examples) into separate referenced files like GRADERS.md, METRICS.md, and EXAMPLES.md to improve progressive disclosure.
Add explicit validation/error-recovery steps to the workflow: what to do when evals fail, how to debug flaky graders, and when to escalate to human review.
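As a minimal sketch of the suggestion to back `/eval check` with a real script: a shell runner over a hypothetical `.claude/evals/*.eval` layout, where each file carries a `command:` and an `expected:` line. Neither the layout nor the line format comes from the skill itself; both are assumptions for illustration.

```shell
# Hypothetical runner backing the /eval check command. The .claude/evals
# layout and the "command:"/"expected:" line format are assumptions, not
# the skill's actual spec.
pass=0; fail=0
for eval_file in .claude/evals/*.eval; do
  # Glob expands to itself when no files match; skip that case.
  [ -e "$eval_file" ] || continue
  cmd=$(sed -n 's/^command: //p' "$eval_file")
  expected=$(sed -n 's/^expected: //p' "$eval_file")
  actual=$(sh -c "$cmd" 2>&1)
  if [ "$actual" = "$expected" ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    echo "FAIL $eval_file: expected '$expected', got '$actual'"
  fi
done
echo "evals: $pass passed, $fail failed"
```

A real implementation would also need the error-recovery steps noted above: re-running failures to detect flakiness, and escalating persistent disagreements between grader and expectation to human review.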
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~200+ lines, with significant redundancy. Concepts like pass@k are explained multiple times, eval types are defined then re-explained, and the 'Product Evals (v1.8)' section largely repeats earlier content. Much of this (what evals are, EDD philosophy, basic concepts) is knowledge Claude already possesses. The 'best practices' section contains generic advice that doesn't earn its tokens. | 1 / 3 |
| Actionability | The skill provides markdown templates and some bash commands (grep, npm test), but most 'code' is actually pseudocode or template placeholders rather than executable implementations. The `/eval define`, `/eval check`, `/eval report` commands are referenced but never defined as actual implementations; they appear to be aspirational slash commands with no backing code or scripts provided. | 2 / 3 |
| Workflow Clarity | The 4-phase workflow (Define → Implement → Evaluate → Report) is clearly sequenced, and the integration pattern section shows when to run evals. However, there are no explicit validation checkpoints or error recovery steps. What happens when an eval fails? The skill says 'fix and re-run' implicitly but never provides a feedback loop for debugging failures or handling partial passes. | 2 / 3 |
| Progressive Disclosure | The content is a monolithic wall of text with no references to external files despite mentioning a file structure (.claude/evals/). Everything is inline: the eval type definitions, grader types, metrics explanations, workflow, examples, and product evals section could all benefit from being split into separate referenced documents. No bundle files are provided to support the referenced paths. | 1 / 3 |
| Total | | 6 / 12 Passed |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |
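To clear the `frontmatter_unknown_keys` warning, unrecognized top-level keys can be moved under a `metadata` block, as the check itself suggests. A sketch (the `version` key is a hypothetical example of an offending key, not one reported by this run):

```yaml
---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions.
metadata:
  version: "1.8"   # moved here from the top level to satisfy the check
---
```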
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.