eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

76

2.08x
Quality: 42%. Does it follow best practices?

Impact: 100% (2.08x). Average score across 6 eval scenarios.

Security by Snyk: Advisory. Suggest reviewing before use.

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

Quality

Discovery: 22%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description is too abstract and lacks actionable detail. It fails to explain what concrete actions the skill enables (e.g., creating test cases, running evaluations, scoring outputs) and provides no guidance on when Claude should select this skill. The technical jargon 'EDD principles' assumes familiarity without explanation.

Suggestions

Add specific concrete actions the skill performs, e.g., 'Creates evaluation test cases, runs automated scoring, tracks performance metrics across iterations'

Include an explicit 'Use when...' clause with natural trigger terms like 'evaluate my code', 'test quality', 'measure performance', 'run evals', 'benchmark'

Define what 'eval-driven development' means in practical terms so Claude knows the specific scenarios this skill addresses
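The first two suggestions can be checked mechanically before resubmitting a description. The following is a minimal sketch, assuming a hypothetical list of trigger phrases and a hypothetical rewritten description; neither comes from the skill itself, and this is not how the registry computes its Discovery score.

```python
import re

# Hypothetical trigger phrases, loosely based on the suggestions above.
TRIGGER_TERMS = ["run evals", "evaluate", "benchmark", "test quality", "measure performance"]


def check_description(description: str) -> dict:
    """Report whether a description has a 'Use when' clause and which trigger terms it contains."""
    text = description.lower()
    has_use_when = re.search(r"\buse when\b", text) is not None
    found = [term for term in TRIGGER_TERMS if term in text]
    return {"has_use_when": has_use_when, "trigger_terms": found}


# The description currently shown on this page.
current = (
    "Formal evaluation framework for Claude Code sessions "
    "implementing eval-driven development (EDD) principles"
)

# A hypothetical rewrite, for illustration only.
revised = (
    "Creates evaluation test cases, runs automated scoring, and tracks "
    "pass rates across iterations. Use when the user asks to run evals, "
    "benchmark a change, or evaluate code quality."
)

if __name__ == "__main__":
    print(check_description(current))
    print(check_description(revised))
```

The check is plain substring matching, so it only flags the absence of a "Use when" clause and of literal trigger phrases; it says nothing about whether the description is actually clear.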

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions Claude would perform. No specific capabilities are mentioned. | 1 / 3 |
| Completeness | Only vaguely addresses 'what' (evaluation framework) with no explicit 'when' clause or trigger guidance. Missing a 'Use when...' statement entirely. | 1 / 3 |
| Trigger Term Quality | Contains some relevant terms like 'eval-driven development', 'EDD', and 'evaluation' that users might say, but lacks common variations and natural phrases users would actually use when needing this skill. | 2 / 3 |
| Distinctiveness Conflict Risk | The mention of 'eval-driven development' and 'EDD' provides some specificity, but 'evaluation framework' and 'Claude Code sessions' are generic enough to potentially overlap with other testing or development skills. | 2 / 3 |
| Total | | 6 / 12 (Passed) |

Implementation: 62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill provides a comprehensive framework for eval-driven development with good workflow structure and clear phases. However, it suffers from being overly explanatory about concepts Claude already understands (pass@k metrics, grader types) and includes hypothetical slash commands that aren't actually implemented. The content would benefit from being more concise and providing truly executable examples rather than templates.

Suggestions

Remove explanatory content about basic concepts like pass@k definitions and grader type explanations - Claude already knows these

Replace hypothetical '/eval' slash commands with actual executable bash scripts or provide implementation details for these commands

Split detailed examples and templates into separate reference files (e.g., TEMPLATES.md, EXAMPLES.md) and link from the main skill

Convert markdown template examples into actual executable test scripts that can be run
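One way to act on the second and fourth suggestions is a small script that runs each eval check as a real command and records whether it passed. The following is a minimal sketch, assuming every check is a command whose exit code 0 means pass. The `Check` type, `run_checks`, `summarize`, and the two placeholder checks are hypothetical and are not part of the skill.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    command: list  # argv list, executed without a shell


def run_checks(checks, timeout=60):
    """Run each check and map its name to True (exit code 0) or False."""
    results = {}
    for check in checks:
        try:
            proc = subprocess.run(check.command, capture_output=True, timeout=timeout)
            results[check.name] = proc.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # A timeout or a missing executable counts as a failed check.
            results[check.name] = False
    return results


def summarize(results):
    """Return a one-line pass count for a results mapping."""
    passed = sum(results.values())
    return f"{passed}/{len(results)} checks passed"


if __name__ == "__main__":
    # Placeholder checks; a real project would call its test runner or linter here.
    checks = [
        Check("always-passes", [sys.executable, "-c", "raise SystemExit(0)"]),
        Check("always-fails", [sys.executable, "-c", "raise SystemExit(1)"]),
    ]
    results = run_checks(checks)
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(summarize(results))
```

Passing commands as argument lists rather than shell strings avoids quoting problems and keeps the checks portable across platforms.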

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is reasonably efficient but includes some explanatory content Claude already knows (e.g., explaining what pass@k means, basic grader concepts). The philosophy section and some descriptions could be tightened. | 2 / 3 |
| Actionability | Provides templates and examples but many are pseudocode or markdown templates rather than executable code. The bash examples are concrete, but the workflow commands like '/eval define feature-name' appear to be hypothetical slash commands without implementation details. | 2 / 3 |
| Workflow Clarity | Clear 4-phase workflow (Define → Implement → Evaluate → Report) with explicit sequencing. The example at the end demonstrates the complete flow with checkpoints and status tracking. | 3 / 3 |
| Progressive Disclosure | Content is organized with clear sections and headers, but everything is in one monolithic file. References to '.claude/evals/' storage structure are mentioned but no actual linked files exist. The content could benefit from splitting detailed examples into separate reference files. | 2 / 3 |
| Total | | 9 / 12 (Passed) |

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 (Passed) |
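A warning of this kind can be approximated locally before publishing. The following is a minimal sketch, assuming the frontmatter is delimited by `---` lines. The `ALLOWED_KEYS` set is an assumption for illustration rather than the validator's actual rule, and the sample frontmatter (including its `tools` key) is hypothetical, since the page does not say which key triggered the warning.

```python
import re

# Assumed set of recognized top-level keys; adjust to the spec you validate against.
ALLOWED_KEYS = {"name", "description", "license", "metadata", "allowed-tools"}


def frontmatter_keys(markdown: str) -> list:
    """Return top-level keys from a leading '---' delimited frontmatter block."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Top-level keys start at column 0; indented lines are nested values.
        match = re.match(r"^([A-Za-z0-9_-]+):", line)
        if match:
            keys.append(match.group(1))
    return keys


def unknown_keys(markdown: str) -> list:
    """Return frontmatter keys that are not in ALLOWED_KEYS, in file order."""
    return [key for key in frontmatter_keys(markdown) if key not in ALLOWED_KEYS]


# Hypothetical SKILL.md content, for illustration only.
SAMPLE = """---
name: eval-harness
description: Formal evaluation framework for Claude Code sessions
tools: Read, Write, Bash
---
# Body
"""

if __name__ == "__main__":
    print(unknown_keys(SAMPLE))
```

The parser only reads top-level keys with a regular expression, which is enough to flag unexpected names but is not a substitute for a full YAML parser.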

Repository: haniakrim21/everything-claude-code (Reviewed)
