Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
Install with Tessl CLI
npx tessl i github:ysyecust/everything-claude-code --skill eval-harness
Quality: 42% (Does it follow best practices?)
Impact: 100% (2.08x). Average score across 6 eval scenarios.
Optimize this skill with Tessl
npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

EDD pre-implementation eval definitions
Criterion                            Without context   With context
CAPABILITY EVAL header               0%                100%
Success criteria checkboxes          0%                100%
Expected output field                0%                100%
REGRESSION EVAL header               0%                100%
Regression baseline reference        0%                100%
Regression test list with status     0%                100%
.claude/evals/ directory             0%                100%
Feature-named eval file              50%               100%
No implementation code               100%              100%
Covers new capabilities              50%               100%
Covers regression risk               50%               100%
Without context: $0.6419 · 2m 33s · 26 turns · 4,409 in / 8,982 out tokens
With context: $0.2723 · 1m 18s · 12 turns · 266 in / 4,389 out tokens
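The criteria in this scenario read as a checklist over the text and location of an eval-definition file. The sketch below shows one hypothetical way to apply such a checklist. The marker strings ("CAPABILITY EVAL", "- [ ]", "Expected Output", "REGRESSION EVAL", "Baseline"), the sample file, and the function name `check_eval_definition` are assumptions for illustration, not the skill's documented format.

```python
from pathlib import PurePosixPath

# Hypothetical marker strings; the skill's real file format may differ.
RUBRIC_MARKERS = {
    "capability_header": "CAPABILITY EVAL",
    "success_checkboxes": "- [ ]",
    "expected_output": "Expected Output",
    "regression_header": "REGRESSION EVAL",
    "regression_baseline": "Baseline",
}


def check_eval_definition(path: str, text: str) -> dict[str, bool]:
    """Report which rubric items an eval-definition file appears to satisfy."""
    results = {name: marker in text for name, marker in RUBRIC_MARKERS.items()}
    # The rubric expects eval files to live under a .claude/evals/ directory.
    parts = PurePosixPath(path).parts
    results["in_evals_dir"] = any(
        parts[i : i + 2] == (".claude", "evals") for i in range(len(parts) - 1)
    )
    return results


# Made-up sample: a capability eval with no regression section.
sample = """[CAPABILITY EVAL: login]
Success Criteria:
- [ ] User can sign in
Expected Output: session token
"""
report = check_eval_definition(".claude/evals/login.md", sample)
missing = sorted(name for name, ok in report.items() if not ok)
```

With this sample, `missing` lists the two regression items, which mirrors how the scenario scores capability and regression coverage separately.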
Grader type selection and scoring formats
Criterion                            Without context   With context
Code-based scorer present            100%              100%
Code scorer applied correctly        100%              100%
MODEL GRADER PROMPT header           0%                100%
Model grader has numbered questions  0%                100%
Model grader score scale             100%              100%
HUMAN REVIEW REQUIRED header         0%                100%
Human review risk level              100%              100%
Human review applied to security     100%              100%
Deterministic preference stated      100%              100%
Model grader for qualitative checks  100%              100%
Without context: $0.3425 · 1m 43s · 16 turns · 23 in / 5,491 out tokens
With context: $0.3574 · 1m 50s · 15 turns · 18 in / 5,437 out tokens
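The criteria above imply a preference order among grader types: deterministic code-based scoring where possible, a model grader for qualitative checks, and human review for security-related work. A minimal sketch of that selection logic follows. The `Check` fields and the `choose_grader` function are hypothetical names, and routing every security-sensitive check to a human first is an assumption about precedence rather than something the page states.

```python
from dataclasses import dataclass


@dataclass
class Check:
    """One thing to grade. All fields are illustrative assumptions."""
    name: str
    deterministic: bool = False        # verifiable by a command or assertion
    security_sensitive: bool = False   # touches auth, secrets, permissions


def choose_grader(check: Check) -> str:
    """Pick a grader type following the preferences listed in the rubric."""
    if check.security_sensitive:
        return "human"   # assumed precedence: security work gets human review
    if check.deterministic:
        return "code"    # prefer deterministic, code-based scoring
    return "model"       # qualitative checks fall back to a model grader


# Made-up checks for illustration.
checks = [
    Check("unit tests pass", deterministic=True),
    Check("error messages are helpful"),
    Check("token storage change", deterministic=True, security_sensitive=True),
]
assignments = {c.name: choose_grader(c) for c in checks}
```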
EDD workflow report with pass@k metrics
Criterion                            Without context   With context
Capability evals section             100%              100%
Regression evals section             100%              100%
Individual test results listed       100%              100%
Capability totals                    100%              100%
pass@1 metric computed               0%                100%
pass@3 metric computed               0%                100%
pass^k for regression                40%               100%
Status line present                  100%              100%
Report in .claude/evals/             0%                100%
Feature-named report file            100%              100%
Regression totals                    100%              100%
Without context: $0.2994 · 1m 9s · 17 turns · 129 in / 4,172 out tokens
With context: $0.3450 · 1m 18s · 16 turns · 19 in / 4,486 out tokens
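The pass@1, pass@3, and pass^k criteria can be illustrated with a small calculation. The sketch below takes pass@k to mean "at least one of k attempts succeeds" and pass^k to mean "all k attempts succeed". That is the usual reading, assumed here rather than taken from this page. It uses the plain empirical form over recorded trials, not a statistical estimator. The task names and trial outcomes are made up.

```python
def pass_at_k(trials: list[bool], k: int) -> bool:
    """True if at least one of the first k trials succeeded."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return any(trials[:k])


def pass_hat_k(trials: list[bool], k: int) -> bool:
    """True only if every one of the first k trials succeeded (pass^k)."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return all(trials[:k])


def rate(outcomes: list[bool]) -> float:
    """Fraction of tasks for which the metric held."""
    return sum(outcomes) / len(outcomes)


# Made-up trial outcomes per task: True means the attempt passed.
runs = {
    "parse-config": [False, True, True],
    "retry-logic": [True, True, True],
    "auth-flow": [False, False, False],
}

pass_1 = rate([pass_at_k(t, 1) for t in runs.values()])    # 1 of 3 tasks
pass_3 = rate([pass_at_k(t, 3) for t in runs.values()])    # 2 of 3 tasks
pass_hat_3 = rate([pass_hat_k(t, 3) for t in runs.values()])  # 1 of 3 tasks
```

pass^k is the stricter of the two, which is why it suits regression checks: a task that passes only some of the time counts toward pass@3 but not toward pass^3.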