
eval-harness

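克劳德代码会话的正式评估框架,实施评估驱动开发(EDD)原则 (English: "A formal evaluation framework for Claude Code sessions, implementing evaluation-driven development (EDD) principles")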

26

Quality

17%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/zh-CN/skills/eval-harness/SKILL.md

Quality

Content

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is a conceptual framework document rather than an actionable skill. It is verbose and redundant: pass@k metrics, grader types, and eval structures are each explained multiple times across sections. The content reads more like documentation for a hypothetical eval system than concrete instructions Claude can execute, and its pseudo-commands (such as /eval define) do not correspond to real tooling.
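For readers unfamiliar with the metric named above, pass@k is commonly read as the probability that at least one of k sampled attempts passes. The sketch below uses the widely used unbiased estimator computed from n recorded attempts of which c passed. Whether the reviewed skill defines pass@k in exactly this way is an assumption, and `pass_at_k` is a hypothetical helper name, not something taken from the skill.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts, c of which passed.

    Uses 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n attempts contains no passing attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 attempts, 3 passed.
    print(round(pass_at_k(10, 3, 1), 4))
    print(round(pass_at_k(10, 3, 3), 4))
```

Defining the metric once in a single helper like this, and referring to it elsewhere, is one way to avoid the repeated explanations the review flags.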

Suggestions

Cut the content by at least 50%: remove the Philosophy section, deduplicate pass@k explanations (appears 3 times), merge 'Grader Types' with 'Product Evals' grader section, and remove the authentication example which just restates the workflow.

Make the skill actionable by providing actual executable scripts or concrete tool invocations rather than conceptual markdown templates and fake slash commands like '/eval define'.

Split into SKILL.md (overview + quick start) and separate reference files for eval templates, grader examples, and metrics definitions to improve progressive disclosure.

Add explicit validation/feedback loops: what should Claude do when an eval fails? Include concrete error recovery steps rather than just recording pass/fail.
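The feedback loop requested in the last suggestion could look roughly like the following. This is a minimal sketch under stated assumptions, not part of the reviewed skill: `EvalResult`, `eval_with_recovery`, `run_eval`, `attempt_fix`, and the retry limit are all hypothetical names chosen for illustration, and the eval and fix steps are stand-ins for whatever real tooling the skill would call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalResult:
    """Outcome of one eval run (hypothetical structure)."""
    passed: bool
    failures: List[str] = field(default_factory=list)


def eval_with_recovery(
    run_eval: Callable[[], EvalResult],
    attempt_fix: Callable[[List[str]], None],
    max_attempts: int = 3,
) -> List[EvalResult]:
    """Run the eval, and on failure apply a fix and re-run, up to max_attempts."""
    history: List[EvalResult] = []
    for attempt in range(1, max_attempts + 1):
        result = run_eval()
        history.append(result)
        if result.passed:
            break
        # Only attempt a fix if another eval run will follow it.
        if attempt < max_attempts:
            attempt_fix(result.failures)
    return history


# Stand-in eval that starts passing after two fixes have been applied.
state = {"fixes": 0}


def fake_eval() -> EvalResult:
    if state["fixes"] >= 2:
        return EvalResult(passed=True)
    return EvalResult(passed=False, failures=[f"check {state['fixes'] + 1} failed"])


def fake_fix(failures: List[str]) -> None:
    state["fixes"] += 1


history = eval_with_recovery(fake_eval, fake_fix, max_attempts=3)
print([r.passed for r in history])  # [False, False, True]
```

Returning the full history rather than only the final result keeps the failed runs available for the report step, so a failure is recorded together with what was tried in response.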

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is extremely verbose at ~250+ lines, with significant redundancy. Concepts like pass@k are explained multiple times, the eval types and grader types sections overlap with the later 'Product Evals' section, and the philosophy/best practices sections explain things Claude already understands. The authentication example essentially restates the entire workflow that was already described. | 1 / 3 |
| Actionability | The skill provides some concrete examples (bash grep commands, markdown templates, directory structures) but most content is template/pseudocode rather than executable. The /eval commands referenced (e.g., '/eval define feature-name') are not real commands and there's no actual implementation, just conceptual frameworks and markdown templates. | 2 / 3 |
| Workflow Clarity | The 4-step workflow (Define → Implement → Evaluate → Report) is clearly sequenced, but validation checkpoints are weak. There's no explicit error recovery loop: what happens when evals fail? The 'eval check' step lacks detail on how to actually run evaluations, and there's no feedback loop for fixing failures beyond the vague 'fix and re-run' implication. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no references to external files despite being long enough to warrant splitting. The eval types, grader types, metrics, workflow, integration patterns, best practices, examples, and product evals sections are all inline. The 'Product Evals (v1.8)' section at the end partially duplicates earlier content (grader types, pass@k) without clear differentiation. | 1 / 3 |
| Total | | 6 / 12 (Passed) |

Description

7%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description is too abstract and lacks concrete actions, natural trigger terms, and explicit 'when to use' guidance. It reads more like a title or tagline than a functional skill description. The use of specialized jargon (EDD) without explanation and the absence of actionable detail make it difficult for Claude to reliably select this skill in the right context.

Suggestions

Add specific concrete actions the skill performs, e.g., 'Creates evaluation rubrics, scores Claude Code session outputs, generates quality reports based on EDD principles'.

Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks to evaluate, score, or assess Claude Code session quality, or mentions EDD, evaluation rubrics, or session grading'.

Consider providing the description in English (or bilingually) to improve discoverability, and replace the abstract 'formal evaluation framework' with concrete examples of what the framework does.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | The description mentions '正式评估框架' (formal evaluation framework) and '评估驱动开发(EDD)原则' (Evaluation-Driven Development principles), but these are abstract concepts without concrete actions. No specific actions like 'creates rubrics', 'scores outputs', or 'generates evaluation reports' are listed. | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (a formal evaluation framework) and completely lacks a 'when' clause or any explicit trigger guidance for when Claude should select this skill. | 1 / 3 |
| Trigger Term Quality | The description uses specialized jargon like '评估驱动开发(EDD)' which is not a commonly recognized term users would naturally say. It lacks natural trigger keywords that a user might use when seeking this skill. Additionally, the description is entirely in Chinese, which limits discoverability for non-Chinese speakers. | 1 / 3 |
| Distinctiveness Conflict Risk | The mention of 'EDD' and '克劳德代码会话' (Claude Code sessions) provides some specificity that narrows the domain, but the overall vagueness of 'evaluation framework' could overlap with other assessment or testing-related skills. | 2 / 3 |
| Total | | 5 / 12 (Passed) |

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 (Passed) |

Repository: affaan-m/everything-claude-code (Reviewed)
