CtrlK
CommunityDocumentationLog inGet started
Tessl Logo

eval-harness

tessl i github:ysyecust/everything-claude-code --skill eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

54%

Overall

SKILL.md
Review
Evals

Validation

69%
CriteriaDescriptionResult

description_trigger_hint

Description may be missing an explicit 'when to use' trigger hint (e.g., 'Use when...')

Warning

metadata_version

'metadata' field is not a dictionary

Warning

license_field

'license' field is missing

Warning

frontmatter_unknown_keys

Unknown frontmatter key(s) found; consider removing or moving to metadata

Warning

body_output_format

No obvious output/return/format terms detected; consider specifying expected outputs

Warning

Total

11

/

16

Passed

Implementation

63%

This skill provides a comprehensive framework for eval-driven development with clear workflow phases and good structural organization. However, it suffers from verbosity in explaining concepts Claude already knows, and the actionability is limited by hypothetical slash commands and template-style examples rather than executable implementations.

Suggestions

Remove explanatory content about what evals are and why they matter - focus on the 'how' rather than the 'why'

Replace hypothetical '/eval' commands with actual executable scripts or clarify these are conventions to implement

Extract detailed grader implementations and metric calculations to separate reference files, keeping SKILL.md as a quick-start overview

Add concrete, copy-paste ready code for at least one complete eval implementation rather than markdown templates

DimensionReasoningScore

Conciseness

The skill contains useful information but includes some unnecessary explanations (e.g., explaining what pass@k means when Claude would know this). The philosophy section and some descriptions could be tightened.

2 / 3

Actionability

Provides templates and examples but many are pseudocode or markdown templates rather than executable code. The bash examples are concrete, but the workflow commands like '/eval define feature-name' appear to be hypothetical rather than actual implemented commands.

2 / 3

Workflow Clarity

Clear 4-phase workflow (Define → Implement → Evaluate → Report) with explicit sequencing. The example at the end demonstrates the complete flow with checkpoints at each stage.

3 / 3

Progressive Disclosure

Content is reasonably organized with clear sections, but everything is in one monolithic file. References to '.claude/evals/' storage structure are mentioned but no actual linked files for detailed content like grader implementations or baseline schemas.

2 / 3

Total

9

/

12

Passed

Activation

22%

This description is too abstract and lacks actionable detail. It fails to specify what concrete actions the skill performs and provides no guidance on when Claude should select it. The technical jargon ('EDD principles', 'formal evaluation framework') may not match natural user language.

Suggestions

Add a 'Use when...' clause with explicit triggers like 'Use when the user wants to evaluate code quality, run benchmarks, or implement test-driven workflows'

Replace abstract language with concrete actions such as 'Runs evaluation suites, scores code outputs against criteria, tracks performance metrics across sessions'

Include natural trigger terms users would say: 'evaluate', 'test', 'benchmark', 'score', 'assess', 'measure quality'

DimensionReasoningScore

Specificity

The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions Claude would perform. No specific capabilities like 'run tests', 'score outputs', or 'compare results' are mentioned.

1 / 3

Completeness

The description only vaguely addresses 'what' (an evaluation framework) and completely lacks any 'when' guidance. There is no 'Use when...' clause or explicit trigger guidance.

1 / 3

Trigger Term Quality

Contains some relevant terms ('evaluation', 'eval-driven development', 'EDD', 'Claude Code sessions') but uses technical jargon that users may not naturally say. Missing common variations like 'test', 'benchmark', 'assess', 'measure performance'.

2 / 3

Distinctiveness Conflict Risk

The mention of 'eval-driven development' and 'EDD' provides some specificity, but 'evaluation framework' and 'Claude Code sessions' are broad enough to potentially overlap with testing, QA, or other assessment-related skills.

2 / 3

Total

6

/

12

Passed

Reviewed

Table of Contents

ValidationImplementationActivation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.