A formal evaluation framework for Claude Code sessions, implementing Evaluation-Driven Development (EDD) principles
33 · 17%: Does it follow best practices?

Impact: Pending (no eval scenarios have been run)

Advisory: Suggest reviewing before use
Optimize this skill with Tessl: `npx tessl skill review --optimize ./docs/zh-CN/skills/eval-harness/SKILL.md`

Quality
Discovery — 7%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is too abstract and jargon-heavy, failing to specify concrete actions the skill performs or when it should be triggered. It relies on the niche concept of 'Evaluation-Driven Development (EDD)' without explaining what that entails in practical terms. The lack of natural trigger terms and explicit 'Use when...' guidance makes it very difficult for Claude to correctly select this skill from a pool of options.
Suggestions
- Add specific concrete actions the skill performs, e.g., 'Creates evaluation rubrics, scores Claude Code session outputs against criteria, generates quality reports'
- Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks to evaluate, score, or assess Claude Code session quality, or mentions EDD or evaluation-driven development'
- Replace abstract jargon with user-facing language that describes the practical outcomes, such as 'measures response quality', 'benchmarks session performance', or 'applies scoring criteria'
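Taken together, the suggestions above might yield frontmatter along these lines. This is a hypothetical sketch: the listed actions and trigger terms come from the suggestions themselves and would need to match what the skill actually does.

```yaml
---
name: eval-harness
description: >
  Creates evaluation rubrics, scores Claude Code session outputs against
  defined criteria, and generates quality reports. Use when the user asks
  to evaluate, score, or assess Claude Code session quality, or mentions
  EDD or evaluation-driven development.
---
```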
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions '正式评估框架' (formal evaluation framework) and '评估驱动开发(EDD)原则' (Evaluation-Driven Development principles), but these are abstract concepts without concrete actions. No specific actions like 'creates rubrics', 'scores outputs', or 'generates evaluation reports' are listed. | 1 / 3 |
| Completeness | The description vaguely addresses 'what' (a formal evaluation framework) but provides no 'when' clause or explicit trigger guidance. There is no 'Use when...' or equivalent statement indicating when Claude should select this skill. | 1 / 3 |
| Trigger Term Quality | The description uses specialized jargon like '评估驱动开发(EDD)' which is not a term users would naturally say. It lacks natural trigger keywords that a user might use when needing this skill. The term '克劳德代码会话' (Claude Code sessions) is somewhat relevant but overly broad. | 1 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'EDD' (Evaluation-Driven Development) provides some niche specificity that distinguishes it from generic skills, but '评估框架' (evaluation framework) for '克劳德代码会话' (Claude Code sessions) is still broad enough to potentially overlap with other evaluation or quality-related skills. | 2 / 3 |
| Total | | 5 / 12 Passed |
Implementation — 27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is an overly verbose conceptual framework document rather than a lean, actionable skill. It explains EDD philosophy and eval concepts at length but provides limited executable guidance—most examples are markdown templates rather than real code. The content would benefit greatly from aggressive trimming, splitting into referenced sub-files, and adding concrete validation/error-recovery steps.
Suggestions
- Cut the content by at least 50%: remove the philosophy section, deduplicate pass@k explanations (appears 3 times), and eliminate the authentication example, which just restates the workflow.
- Split detailed content (grader types, metrics definitions, anti-patterns) into separate referenced files like GRADERS.md and METRICS.md, keeping SKILL.md as a concise overview.
- Add explicit validation and error-recovery steps to the workflow: what to do when evals fail, how to debug failing graders, and when to escalate to human review.
- Make the /eval commands actionable: either provide real implementation scripts or clarify these are conceptual patterns the user must implement, with a concrete starter script.
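A concrete starter script for the last suggestion could be as small as the sketch below. Everything here is an assumption about the skill's shape: the scenario-file path, the `grader_cmd` field name, and the JSON layout are hypothetical, chosen only to show the pattern of "run each grader command, count exit-code-zero results as passes".

```python
import json
import subprocess
from pathlib import Path


def run_scenario(scenario: dict) -> bool:
    """Run one grader shell command; exit code 0 counts as a pass."""
    result = subprocess.run(
        scenario["grader_cmd"], shell=True, capture_output=True
    )
    return result.returncode == 0


def run_evals(path: str) -> tuple[int, int]:
    """Run every scenario in a JSON file and return (passed, total)."""
    scenarios = json.loads(Path(path).read_text())
    passed = sum(run_scenario(s) for s in scenarios)
    return passed, len(scenarios)
```

Usage would look like `passed, total = run_evals("evals/scenarios.json")` against a file such as `[{"id": "auth-01", "grader_cmd": "grep -q token out.log"}]` (file name and fields again hypothetical).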
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~200+ lines, with significant redundancy. Concepts like pass@k are explained multiple times, the authentication example repeats the workflow already described, and the 'Product Evals (v1.8)' section largely duplicates earlier content. Much of this (what evals are, EDD philosophy) is conceptual padding Claude doesn't need. | 1 / 3 |
| Actionability | The skill provides some concrete examples (bash grader commands, markdown templates, file layout) but most 'code' is actually markdown templates or pseudocode. The /eval commands aren't real executable commands—they're aspirational patterns. The bash graders are the most actionable part but are simple grep/npm examples. | 2 / 3 |
| Workflow Clarity | The 4-step workflow (Define → Implement → Evaluate → Report) is clearly sequenced, and the integration pattern section shows before/during/after phases. However, there are no explicit validation checkpoints or error recovery steps—what happens when evals fail? The 'fix and retry' loop is absent from the workflow. | 2 / 3 |
| Progressive Disclosure | The entire skill is a monolithic wall of text with no references to external files despite the content being long enough to warrant splitting. Eval type definitions, grader details, metrics explanations, and examples could all be separate referenced files. Everything is inline, making it hard to navigate. | 1 / 3 |
| Total | | 6 / 12 Passed |
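Since pass@k is explained three separate times in the skill, a single canonical definition would suffice. One common unbiased estimator (probability that at least one of k samples, drawn from n attempts of which c passed, is correct) can be stated in a few lines:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator over n attempts with c correct.

    Returns 1 - C(n - c, k) / C(n, k): the chance that a random
    k-subset of the n attempts contains at least one correct one.
    """
    if n - c < k:
        return 1.0  # fewer than k failures exist, so every k-subset passes
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 attempts of which 1 passed, `pass_at_k(2, 1, 1)` gives 0.5.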
Validation — 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |
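The single warning above can typically be cleared by moving any non-standard top-level frontmatter keys under `metadata`, as the check itself suggests. The key names below are illustrative, not taken from the skill:

```yaml
---
name: eval-harness
description: A formal evaluation framework for Claude Code sessions
metadata:
  version: "1.8"   # hypothetical custom key, moved out of the top level
---
```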