
eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

Overall score: 69

Quality: 24% (Does it follow best practices?)

Impact: 100% (2.08x average score across 6 eval scenarios)

Security (by Snyk): Advisory. Suggest reviewing before use.

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

Quality

Discovery: 22%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is too abstract and jargon-heavy, reading more like a title than a functional description. It fails to specify concrete actions the skill performs and lacks any 'Use when...' guidance, making it difficult for Claude to know when to select it. The EDD acronym provides some niche identity but is not sufficient to compensate for the missing detail.

Suggestions

Add concrete actions the skill performs, e.g., 'Creates evaluation rubrics, runs test cases against Claude Code outputs, scores session quality, and tracks improvement metrics.'

Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user wants to evaluate Claude Code session quality, set up evals, measure coding performance, or implement eval-driven development workflows.'

Replace or supplement the jargon 'EDD principles' with plain-language equivalents that users would naturally say, such as 'testing', 'benchmarking', 'quality assessment', or 'evaluation scoring'.
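
Taken together, the suggestions above might produce frontmatter along these lines. This is a sketch only: the action list and trigger phrases are illustrative examples from the suggestions, not text taken from the actual SKILL.md.

```yaml
# Hypothetical SKILL.md frontmatter; wording is illustrative
name: eval-harness
description: >-
  Creates evaluation rubrics, runs test cases against Claude Code outputs,
  scores session quality, and tracks improvement metrics. Use when the user
  wants to evaluate Claude Code session quality, set up evals, benchmark
  coding performance, or implement eval-driven development (EDD) workflows.
```

The rewritten description leads with concrete actions and ends with a 'Use when...' clause containing plain-language trigger terms, addressing all three suggestions at once.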

Scores by dimension:

Specificity (1 / 3): The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions. It doesn't specify what the skill actually does (e.g., create test cases, run evaluations, score outputs, generate reports).

Completeness (1 / 3): The description only vaguely addresses 'what' (a framework for evaluation) and completely lacks a 'when' clause. There are no explicit triggers or guidance on when Claude should select this skill.

Trigger Term Quality (2 / 3): It includes some relevant terms like 'eval-driven development', 'EDD', and 'Claude Code sessions', but these are fairly niche and jargon-heavy. Common user phrases like 'evaluate', 'test', 'benchmark', 'score', or 'assess quality' are missing.

Distinctiveness / Conflict Risk (2 / 3): The mention of 'eval-driven development (EDD)' and 'Claude Code sessions' provides some specificity, but 'formal evaluation framework' is broad enough to potentially overlap with testing, QA, or benchmarking skills.

Total: 6 / 12 (Passed)

Implementation: 27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is a conceptual framework document rather than an actionable skill. It over-explains concepts Claude already understands (testing terminology, pass@k metrics, what regression tests are) while under-delivering on concrete, executable guidance. The integration commands appear to be aspirational rather than functional, and the entire document could be condensed to ~40 lines of templates and workflow steps.

Suggestions

Cut the philosophy, metrics definitions, and grader type explanations — Claude knows these concepts. Focus only on the specific eval format templates and workflow steps unique to this project.

Make the integration commands (/eval define, /eval check) actionable by either providing actual script implementations or replacing them with real shell commands that Claude can execute.

Add explicit error recovery steps to the workflow: what should Claude do when capability evals fail? When regression evals fail? Include a feedback loop (fix → re-run → verify).

Split detailed content (grader prompt templates, full authentication example, best practices) into separate referenced files to reduce the main skill's token footprint.
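
The error-recovery suggestion can be sketched as a fix → re-run → verify loop in shell. Everything here is hypothetical: `run_evals` is a stand-in name for whatever command actually executes the project's evals, and the "fix" step is simulated with a marker file.

```shell
#!/usr/bin/env sh
# Sketch of a fix -> re-run -> verify feedback loop.
# run_evals is a stand-in for the project's real eval runner.
rm -f /tmp/eval_fix_applied

run_evals() {
  # Stand-in check: "passes" only once a fix has been applied
  [ -f /tmp/eval_fix_applied ]
}

attempts=0
until run_evals; do
  attempts=$((attempts + 1))
  echo "evals FAILED (attempt $attempts); applying fix before re-running"
  touch /tmp/eval_fix_applied   # stand-in for the actual fix step
done
echo "evals PASSED after $attempts fix attempt(s)"
```

In a real workflow the loop body would diagnose the failing eval, apply a targeted fix, and cap the number of retries rather than looping unconditionally.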

Scores by dimension:

Conciseness (1 / 3): Extremely verbose for what it conveys. Explains basic concepts Claude already knows (what pass@k means, what evals are, what regression testing is). The philosophy section, grader type explanations, and best practices are largely common knowledge for Claude. The document is ~180 lines but could convey its unique value in under 50 lines.

Actionability (2 / 3): Provides templates and markdown formats that are somewhat concrete, but the integration commands (/eval define, /eval check, /eval report) appear to be fictional slash commands with no actual implementation. The bash examples for code-based graders are real commands, but the overall framework is more conceptual than executable.

Workflow Clarity (2 / 3): The 4-phase workflow (Define → Implement → Evaluate → Report) is clearly sequenced, and the authentication example walks through all phases. However, there are no validation checkpoints or error recovery steps: what happens when evals fail? The 'fix and retry' feedback loop is absent, and the evaluate step just says 'run evals, record PASS/FAIL' without guidance on handling failures.

Progressive Disclosure (1 / 3): Monolithic wall of text with no references to external files despite being a complex, multi-faceted topic. The eval storage section mentions .claude/evals/ files but doesn't reference any actual bundle files. Content like grader types, metrics definitions, and the full authentication example could easily be split into separate reference documents.

Total: 6 / 12 (Passed)
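
A code-based grader of the kind the Actionability row describes can be as small as a command whose exit status decides PASS or FAIL. This sketch greps an output file for an expected string; the file name and string are illustrative, not taken from the skill.

```shell
# Minimal code-based grader sketch: PASS iff the expected string
# appears in the output under test. File and string are illustrative.
printf 'login ok\nsession created\n' > /tmp/session_output.txt

if grep -q 'session created' /tmp/session_output.txt; then
  echo "PASS"
else
  echo "FAIL"
fi
```

Because the grader is just a command with an exit code, it can be wired directly into a workflow step, unlike the fictional /eval slash commands the review flags.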

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 10 / 11 passed

Criteria results:

frontmatter_unknown_keys: Unknown frontmatter key(s) found; consider removing or moving to metadata (Warning)

Total: 10 / 11 (Passed)
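
The frontmatter_unknown_keys warning is typically resolved by nesting nonstandard keys under a metadata block, as the check itself suggests. A sketch, assuming hypothetical `author` and `version` keys were the offenders (the actual offending keys are not named in the report):

```yaml
# Before (hypothetical offending keys at the top level):
#   name: eval-harness
#   author: ysyecust
#   version: 1.0
#
# After: nonstandard keys moved under metadata
name: eval-harness
description: Formal evaluation framework for Claude Code sessions
metadata:
  author: ysyecust
  version: 1.0
```

This keeps the top level limited to keys the skill spec recognizes while preserving the extra information.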

Repository: ysyecust/everything-claude-code (reviewed)
