```shell
tessl i github:ysyecust/everything-claude-code --skill eval-harness
```

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
## Validation

69%

| Criteria | Description | Result |
|---|---|---|
| description_trigger_hint | Description may be missing an explicit 'when to use' trigger hint (e.g., 'Use when...') | Warning |
| metadata_version | 'metadata' field is not a dictionary | Warning |
| license_field | 'license' field is missing | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| body_output_format | No obvious output/return/format terms detected; consider specifying expected outputs | Warning |
| Total | 11 / 16 Passed | |
## Implementation

63%

This skill provides a comprehensive framework for eval-driven development with clear workflow phases and good structural organization. However, it suffers from verbosity in explaining concepts Claude already knows, and its actionability is limited by hypothetical slash commands and template-style examples rather than executable implementations.
### Suggestions

- Remove explanatory content about what evals are and why they matter - focus on the 'how' rather than the 'why'
- Replace hypothetical '/eval' commands with actual executable scripts, or clarify that these are conventions to implement
- Extract detailed grader implementations and metric calculations to separate reference files, keeping SKILL.md as a quick-start overview
- Add concrete, copy-paste ready code for at least one complete eval implementation rather than markdown templates
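As a sketch of that last suggestion, a complete minimal eval can be quite small. The file name, commands, and pass criteria below are hypothetical illustrations, not taken from the skill itself:

```python
# minimal_eval.py - illustrative minimal eval harness (hypothetical, not from the skill)
import json
import subprocess

def grade_output(output: str, must_contain: list[str]) -> bool:
    """Binary grader: pass if every required substring appears in the output."""
    return all(term in output for term in must_contain)

def run_eval(cases: list[dict]) -> dict:
    """Run each case's shell command, grade its stdout, and summarize results."""
    results = []
    for case in cases:
        proc = subprocess.run(case["cmd"], capture_output=True, text=True, shell=True)
        results.append({"name": case["name"],
                        "passed": grade_output(proc.stdout, case["must_contain"])})
    passed_count = sum(r["passed"] for r in results)
    return {"results": results, "score": f"{passed_count} / {len(results)}"}

if __name__ == "__main__":
    cases = [
        {"name": "echo-smoke", "cmd": "echo hello world", "must_contain": ["hello"]},
    ]
    print(json.dumps(run_eval(cases), indent=2))
```

Something this size would let users run an end-to-end eval immediately, with graders and case definitions growing from there.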
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill contains useful information but includes some unnecessary explanations (e.g., explaining what pass@k means when Claude would know this). The philosophy section and some descriptions could be tightened. | 2 / 3 |
| Actionability | Provides templates and examples, but many are pseudocode or markdown templates rather than executable code. The bash examples are concrete, but workflow commands like '/eval define feature-name' appear to be hypothetical rather than actual implemented commands. | 2 / 3 |
| Workflow Clarity | Clear 4-phase workflow (Define → Implement → Evaluate → Report) with explicit sequencing. The example at the end demonstrates the complete flow with checkpoints at each stage. | 3 / 3 |
| Progressive Disclosure | Content is reasonably organized with clear sections, but everything is in one monolithic file. A '.claude/evals/' storage structure is mentioned, but no actual linked files exist for detailed content like grader implementations or baseline schemas. | 2 / 3 |
| Total | | 9 / 12 Passed |
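For reference, the pass@k metric mentioned in the Conciseness row has a standard unbiased combinatorial estimator (the form commonly used in code-generation benchmarks); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n total attempts (c of which are correct) passes."""
    if n - c < k:
        # fewer incorrect samples than k draws: at least one correct is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 attempts of which 5 are correct, pass@1 works out to 0.5.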
## Activation

22%

This description is too abstract and lacks actionable detail. It fails to specify what concrete actions the skill performs and provides no guidance on when Claude should select it. The technical jargon ('EDD principles', 'formal evaluation framework') may not match natural user language.
### Suggestions

- Add a 'Use when...' clause with explicit triggers like 'Use when the user wants to evaluate code quality, run benchmarks, or implement test-driven workflows'
- Replace abstract language with concrete actions such as 'Runs evaluation suites, scores code outputs against criteria, tracks performance metrics across sessions'
- Include natural trigger terms users would say: 'evaluate', 'test', 'benchmark', 'score', 'assess', 'measure quality'
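Putting these suggestions together, a revised description might look like the following frontmatter sketch (all wording and field values are illustrative, not the skill's actual metadata):

```yaml
# SKILL.md frontmatter - illustrative wording only
name: eval-harness
description: >
  Runs evaluation suites for Claude Code sessions: scores outputs against
  defined criteria, tracks pass@k and regression metrics, and reports
  results. Use when the user wants to evaluate, test, benchmark, score,
  or assess code quality or session performance.
license: MIT  # example value; use whichever license actually applies
```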
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses abstract language like 'formal evaluation framework' and 'EDD principles' without listing any concrete actions Claude would perform. No specific capabilities like 'run tests', 'score outputs', or 'compare results' are mentioned. | 1 / 3 |
| Completeness | The description only vaguely addresses 'what' (an evaluation framework) and completely lacks any 'when' guidance. There is no 'Use when...' clause or explicit trigger guidance. | 1 / 3 |
| Trigger Term Quality | Contains some relevant terms ('evaluation', 'eval-driven development', 'EDD', 'Claude Code sessions') but uses technical jargon that users may not naturally say. Missing common variations like 'test', 'benchmark', 'assess', 'measure performance'. | 2 / 3 |
| Distinctiveness Conflict Risk | The mention of 'eval-driven development' and 'EDD' provides some specificity, but 'evaluation framework' and 'Claude Code sessions' are broad enough to potentially overlap with testing, QA, or other assessment-related skills. | 2 / 3 |
| Total | | 6 / 12 Passed |
Reviewed