
eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

66

2.08x
Quality: 17%
Does it follow best practices?

Impact: 100% (2.08x)
Average score across 6 eval scenarios

Security by Snyk

Advisory

Suggest reviewing before use

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

Quality

Content

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is a conceptual framework document rather than an actionable skill. It over-explains eval-driven development concepts that Claude already understands while under-delivering on concrete, executable guidance. The content would benefit significantly from being condensed to its unique value (specific templates and file conventions) and splitting detailed sections into referenced files.

Suggestions

Cut the philosophy, metrics definitions, and best practices sections (Claude knows these concepts) and focus on the specific eval templates, file conventions, and slash command behaviors that are unique to this project's EDD approach.

Make the `/eval define`, `/eval check`, `/eval report` commands actionable by either providing actual implementation scripts or clearly stating these are conventions for Claude to interpret, with exact expected behaviors.

Add validation/error recovery steps to the workflow: what happens when evals fail, how to diagnose failures, when to re-run vs. fix code.

Split grader types, the authentication example, and storage conventions into separate referenced files to reduce the monolithic structure and improve progressive disclosure.
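The error-recovery suggestion above could be made concrete along the following lines. This is only a sketch under stated assumptions: the names `EvalResult` and `run_eval`, the three-attempt default, and the verdict strings are all hypothetical and do not come from the reviewed skill. It treats a grader as any command whose zero exit code means "pass", and uses repeated runs to separate consistent failures (fix the code) from mixed results (suspect the grader or environment).

```python
import subprocess
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Outcome of running one grader command several times (hypothetical type)."""
    name: str
    passes: int
    attempts: int

    @property
    def verdict(self) -> str:
        # Every attempt passed: nothing to do.
        if self.passes == self.attempts:
            return "pass"
        # No attempt passed: treat as a real failure and fix the code.
        if self.passes == 0:
            return "fail: fix code"
        # Mixed results: likely flaky, so inspect the grader or environment.
        return "flaky: investigate grader"


def run_eval(name: str, command: list[str], attempts: int = 3) -> EvalResult:
    """Run a grader command `attempts` times; a zero exit code counts as a pass."""
    passes = 0
    for _ in range(attempts):
        completed = subprocess.run(command, capture_output=True, text=True)
        if completed.returncode == 0:
            passes += 1
    return EvalResult(name=name, passes=passes, attempts=attempts)
```

A caller might run `run_eval("unit-tests", ["npm", "test"])` and branch on `verdict`, re-running only when the result is flaky and going back to the implementation phase when it is a consistent failure.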

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | Extremely verbose for what it conveys. Explains basic concepts Claude already knows (what pass@k means, what evals are, what regression testing is). The philosophy section, grader type explanations, and best practices are largely common knowledge. The document is ~150+ lines but could convey its unique value in under 50 lines. | 1 / 3 |
| Actionability | Provides some concrete templates (eval definition format, report format, bash grader examples) but much is pseudocode or placeholder markdown rather than truly executable. The `/eval define`, `/eval check`, `/eval report` commands appear to be aspirational slash commands with no actual implementation. The bash grader examples are the most actionable part. | 2 / 3 |
| Workflow Clarity | The 4-phase workflow (Define → Implement → Evaluate → Report) is clearly sequenced, but lacks validation checkpoints and error recovery. There's no guidance on what to do when evals fail, no feedback loops for fixing issues, and the 'Implementation' phase is just '[write code]' with no concrete steps. | 2 / 3 |
| Progressive Disclosure | Monolithic wall of text with no references to external files despite the content being long enough to warrant splitting. The eval types, grader types, metrics, workflow, integration patterns, storage, best practices, and example could each be separate referenced documents. No bundle files exist to support this either. | 1 / 3 |
| Total | | 6 / 12 (Passed) |
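The Conciseness row mentions pass@k without stating it. For reference, here is a minimal sketch of one common definition: the unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts were sampled and c of them passed. Whether the reviewed skill uses this estimator or a simpler "at least one success in k tries" reading is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: total attempts sampled, c: attempts that passed, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made up only of failing attempts;
    # it is 0 whenever fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 attempts of which 3 passed, `pass_at_k(10, 3, 1)` is approximately 0.3, matching the raw pass rate as expected for k = 1.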

Description

7%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description is too abstract and jargon-heavy to be effective for skill selection. It fails to specify concrete actions the skill performs, lacks natural trigger terms users would use, and provides no guidance on when Claude should select this skill. The only slight strength is the niche reference to 'eval-driven development' which provides minimal distinctiveness.

Suggestions

List specific concrete actions the skill performs, e.g., 'Creates evaluation test cases, defines scoring rubrics, runs eval suites against Claude Code outputs, and tracks performance metrics.'

Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks to evaluate code quality, set up evals, benchmark Claude Code outputs, or implement eval-driven development workflows.'

Replace jargon like 'formal evaluation framework' and 'EDD principles' with plain-language descriptions of what the skill does, while keeping the acronym as a secondary trigger term.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions. It doesn't specify what the skill actually does (e.g., create test cases, run benchmarks, score outputs, generate reports). | 1 / 3 |
| Completeness | The description weakly addresses 'what' (a framework for evaluation) but provides no 'when' guidance whatsoever. There is no 'Use when...' clause or equivalent explicit trigger guidance. | 1 / 3 |
| Trigger Term Quality | The terms used are highly technical jargon ('eval-driven development', 'EDD principles', 'formal evaluation framework') that users are unlikely to naturally say. Common trigger terms like 'evaluate', 'test', 'benchmark', 'score', or 'assess' are missing. | 1 / 3 |
| Distinctiveness Conflict Risk | The mention of 'eval-driven development (EDD)' and 'Claude Code sessions' provides some specificity to a niche, but the vague 'evaluation framework' phrasing could overlap with testing, QA, or code review skills. | 2 / 3 |
| Total | | 5 / 12 (Passed) |

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 (Passed) |
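A check like `frontmatter_unknown_keys` can be approximated locally before submitting a skill. The sketch below is not the validator's actual implementation: the allow-list in `ALLOWED_KEYS` is a hypothetical stand-in (only `metadata` is implied by the warning text), and the parser handles just simple top-level `key: value` lines rather than full YAML.

```python
# Hypothetical allow-list; the real spec's permitted keys may differ.
ALLOWED_KEYS = {"name", "description", "metadata"}


def frontmatter_keys(text: str) -> list[str]:
    """Return top-level keys from a leading '---' delimited front matter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Top-level keys start in column 0; indented lines belong to a parent key.
        if line and not line[0].isspace() and not line.startswith("#") and ":" in line:
            keys.append(line.split(":", 1)[0].strip())
    return []  # no closing delimiter: treat as no front matter


def unknown_keys(text: str, allowed: set[str] = ALLOWED_KEYS) -> list[str]:
    """List front matter keys that are not in the allow-list."""
    return [key for key in frontmatter_keys(text) if key not in allowed]
```

Running `unknown_keys` on the contents of a `SKILL.md` file returns the keys that would need to be removed or nested under `metadata`, given whatever allow-list is supplied.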

Repository
ysyecust/everything-claude-code
Reviewed
