A formal evaluation framework for Claude Code sessions, implementing Evaluation-Driven Development (EDD) principles
33
17%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Advisory
Suggest reviewing before use
Optimize this skill with Tessl
npx tessl skill review --optimize ./docs/zh-CN/skills/eval-harness/SKILL.md
Quality
Discovery
7%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too vague and abstract, relying on jargon ('评估驱动开发/EDD') without explaining concrete actions or providing trigger guidance. It fails to answer both 'what does this do' in specific terms and 'when should Claude use it'. The description needs substantial improvement across all dimensions to be useful for skill selection.
Suggestions
Add specific concrete actions the skill performs, e.g., 'Creates evaluation rubrics, scores Claude Code session outputs against criteria, generates improvement recommendations'
Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks to evaluate, assess, grade, or review Claude Code session quality, or mentions EDD or evaluation-driven development'
Replace or supplement the jargon-heavy framing with plain language that describes the practical outcomes, and consider providing the description in English or bilingually for broader discoverability
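Putting these suggestions together, an improved frontmatter description might read along these lines (the wording is illustrative, not taken from the actual skill):

```yaml
---
name: eval-harness
description: >
  Creates evaluation rubrics, scores Claude Code session outputs against
  defined criteria, and generates improvement recommendations, following
  Evaluation-Driven Development (EDD). Use when the user asks to evaluate,
  assess, grade, or review Claude Code session quality, or mentions EDD
  or evaluation-driven development.
---
```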
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions '正式评估框架' (formal evaluation framework) and '评估驱动开发(EDD)原则' (Evaluation-Driven Development principles), but these are abstract concepts without concrete actions. No specific actions like 'creates rubrics', 'scores outputs', or 'generates evaluation reports' are listed. | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (a formal evaluation framework) and completely lacks a 'when' clause. There is no explicit guidance on when Claude should select this skill, which per the rubric caps completeness at 2 maximum, but the 'what' is also too vague to merit even a 2. | 1 / 3 |
| Trigger Term Quality | The description uses specialized jargon like '评估驱动开发(EDD)' which is not a term users would naturally say. It lacks natural trigger keywords that a user might use when needing this skill. Additionally, being entirely in Chinese limits discoverability for non-Chinese speakers. | 1 / 3 |
| Distinctiveness Conflict Risk | The mention of 'EDD' and '克劳德代码会话' (Claude Code sessions) provides some specificity to a niche domain, making it somewhat distinguishable. However, '评估框架' (evaluation framework) is still broad enough to potentially overlap with other evaluation or testing skills. | 2 / 3 |
| Total | | 5 / 12 Passed |
Implementation
27%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is a comprehensive but overly verbose conceptual framework that explains EDD philosophy and eval taxonomy at length, much of which Claude already understands. The actionability is undermined by pseudo-commands (`/eval define`) that have no actual implementation, and templates that are illustrative rather than executable. The content would benefit significantly from being condensed to ~50 lines of concrete, actionable guidance with supporting details moved to referenced files.
Suggestions
Cut the content by 60-70%: remove the philosophy section, deduplicate pass@k explanations (appears 3 times), and eliminate generic best practices that Claude already knows.
Make the `/eval define`, `/eval check`, `/eval report` commands actionable by providing actual shell scripts or Claude Code slash command definitions rather than aspirational placeholders.
Split detailed content (grader types, metrics definitions, examples) into separate referenced files like GRADERS.md, METRICS.md, and EXAMPLES.md to improve progressive disclosure.
Add explicit validation/error-recovery steps to the workflow: what to do when evals fail, how to debug flaky graders, and when to escalate to human review.
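As a minimal sketch of the suggestion to back `/eval check` with a real script: a shell runner over a hypothetical `.claude/evals/*.eval` layout, where each file carries a `command:` and an `expected:` line. Neither the layout nor the line format comes from the skill itself; both are assumptions for illustration.

```shell
# Hypothetical runner backing the /eval check command. The .claude/evals
# layout and the "command:"/"expected:" line format are assumptions, not
# the skill's actual spec.
pass=0; fail=0
for eval_file in .claude/evals/*.eval; do
  # Glob expands to itself when no files match; skip that case.
  [ -e "$eval_file" ] || continue
  cmd=$(sed -n 's/^command: //p' "$eval_file")
  expected=$(sed -n 's/^expected: //p' "$eval_file")
  actual=$(sh -c "$cmd" 2>&1)
  if [ "$actual" = "$expected" ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    echo "FAIL $eval_file: expected '$expected', got '$actual'"
  fi
done
echo "evals: $pass passed, $fail failed"
```

A real implementation would also need the error-recovery steps noted above: re-running failures to detect flakiness, and escalating persistent disagreements between grader and expectation to human review.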
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~200+ lines, with significant redundancy. Concepts like pass@k are explained multiple times, eval types are defined then re-explained, and the 'Product Evals (v1.8)' section largely repeats earlier content. Much of this (what evals are, EDD philosophy, basic concepts) is knowledge Claude already possesses. The 'best practices' section contains generic advice that doesn't earn its tokens. | 1 / 3 |
| Actionability | The skill provides markdown templates and some bash commands (grep, npm test), but most 'code' is actually pseudocode or template placeholders rather than executable implementations. The `/eval define`, `/eval check`, `/eval report` commands are referenced but never defined as actual implementations; they appear to be aspirational slash commands with no backing code or scripts provided. | 2 / 3 |
| Workflow Clarity | The 4-phase workflow (Define → Implement → Evaluate → Report) is clearly sequenced, and the integration pattern section shows when to run evals. However, there are no explicit validation checkpoints or error recovery steps. What happens when an eval fails? The skill says 'fix and re-run' implicitly but never provides a feedback loop for debugging failures or handling partial passes. | 2 / 3 |
| Progressive Disclosure | The content is a monolithic wall of text with no references to external files despite mentioning a file structure (.claude/evals/). Everything is inline: the eval type definitions, grader types, metrics explanations, workflow, examples, and product evals section could all benefit from being split into separate referenced documents. No bundle files are provided to support the referenced paths. | 1 / 3 |
| Total | | 6 / 12 Passed |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |
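To clear the `frontmatter_unknown_keys` warning, unrecognized top-level keys can be moved under a `metadata` block, as the check itself suggests. A sketch (the `version` key is a hypothetical example of an offending key, not one reported by this run):

```yaml
---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions.
metadata:
  version: "1.8"   # moved here from the top level to satisfy the check
---
```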
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.