eval-harness

A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles.

Install with Tessl CLI

npx tessl i github:ysyecust/everything-claude-code --skill eval-harness

Quality: 42% — Does it follow best practices?

Impact: 100% · 2.08x — Average score across 6 eval scenarios

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

Evaluation results

Real-Time Notifications Module: Eval Planning

EDD pre-implementation eval definitions

Without context: 79% · With context: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| CAPABILITY EVAL header | 0% | 100% |
| Success criteria checkboxes | 0% | 100% |
| Expected output field | 0% | 100% |
| REGRESSION EVAL header | 0% | 100% |
| Regression baseline reference | 0% | 100% |
| Regression test list with status | 0% | 100% |
| .claude/evals/ directory | 0% | 100% |
| Feature-named eval file | 50% | 100% |
| No implementation code | 100% | 100% |
| Covers new capabilities | 50% | 100% |
| Covers regression risk | 50% | 100% |

Without context: $0.6419 · 2m 33s · 26 turns · 4,409 in / 8,982 out tokens

With context: $0.2723 · 1m 18s · 12 turns · 266 in / 4,389 out tokens
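The criteria in this scenario are structural checks on a pre-implementation eval file. A deterministic checker for them might look like the sketch below; the section names come from the criteria above, while the file format, function name, and regex are assumptions, not eval-harness internals:

```python
import re

def check_eval_file(text: str) -> dict[str, bool]:
    """Deterministic structural checks mirroring the scenario's criteria.

    The checked strings (CAPABILITY EVAL, REGRESSION EVAL, checkbox
    syntax, Expected output) come from the criteria table; the exact
    format eval-harness expects is an assumption.
    """
    return {
        "capability_header": "CAPABILITY EVAL" in text,
        "regression_header": "REGRESSION EVAL" in text,
        # Markdown-style checkboxes: "- [ ]" (pending) or "- [x]" (done)
        "checkboxes": bool(re.search(r"- \[[ x]\]", text)),
        "expected_output": "Expected output" in text,
    }
```

A file placed at, say, `.claude/evals/<feature-name>.md` would pass when every check returns `True`.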

AI Code Assistant Quality Evaluation Pipeline

Grader type selection and scoring formats

Without context: 32% · With context: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Code-based scorer present | 100% | 100% |
| Code scorer applied correctly | 100% | 100% |
| MODEL GRADER PROMPT header | 0% | 100% |
| Model grader has numbered questions | 0% | 100% |
| Model grader score scale | 100% | 100% |
| HUMAN REVIEW REQUIRED header | 0% | 100% |
| Human review risk level | 100% | 100% |
| Human review applied to security | 100% | 100% |
| Deterministic preference stated | 100% | 100% |
| Model grader for qualitative checks | 100% | 100% |

Without context: $0.3425 · 1m 43s · 16 turns · 23 in / 5,491 out tokens

With context: $0.3574 · 1m 50s · 15 turns · 18 in / 5,437 out tokens
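This scenario contrasts deterministic code-based scorers (preferred for objective checks) with model graders that take numbered questions on a score scale (for qualitative checks). A hypothetical sketch of the two grader shapes — the function names and prompt wording are illustrative, not taken from the skill:

```python
def code_scorer(output: str, expected: str) -> float:
    """Deterministic, code-based check: does the output contain the
    expected snippet? Objective criteria should prefer scorers like
    this over a model grader."""
    return 1.0 if expected in output else 0.0

def model_grader_prompt(questions: list[str], scale: int = 5) -> str:
    """Build a model-grader prompt with numbered questions and an
    explicit score scale, for qualitative checks a code scorer can't
    decide (e.g. readability). The header and wording are assumptions."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return (
        "MODEL GRADER PROMPT\n"
        f"Answer each question with a score from 1 to {scale}.\n"
        f"{numbered}"
    )
```

High-risk areas such as security would additionally be routed to a HUMAN REVIEW REQUIRED step rather than either automated grader.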

Search Autocomplete: Eval Report Generation

EDD workflow report with pass@k metrics

Without context: 43% · With context: 100%

| Criteria | Without context | With context |
| --- | --- | --- |
| Capability evals section | 100% | 100% |
| Regression evals section | 100% | 100% |
| Individual test results listed | 100% | 100% |
| Capability totals | 100% | 100% |
| pass@1 metric computed | 0% | 100% |
| pass@3 metric computed | 0% | 100% |
| pass^k for regression | 40% | 100% |
| Status line present | 100% | 100% |
| Report in .claude/evals/ | 0% | 100% |
| Feature-named report file | 100% | 100% |
| Regression totals | 100% | 100% |

Without context: $0.2994 · 1m 9s · 17 turns · 129 in / 4,172 out tokens

With context: $0.3450 · 1m 18s · 16 turns · 19 in / 4,486 out tokens
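The pass@1, pass@3, and pass^k criteria above are standard sampling metrics: pass@k estimates the chance that at least one of k sampled attempts succeeds, while pass^k (for regressions) is the chance that all k runs succeed. A sketch using the widely used unbiased pass@k estimator — whether eval-harness computes them exactly this way is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples passes, given n total attempts of which c passed.
    Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample must include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(p: float, k: int) -> float:
    """pass^k for regression evals: probability that all k independent
    runs pass, given per-run pass probability p. Regressions demand
    consistency, hence the stricter all-runs-pass metric."""
    return p ** k
```

For example, with 2 passes out of 4 attempts, pass@1 is 0.5, while a 90%-reliable regression test has pass^3 of only about 0.73.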

Evaluated with agent: Claude Code · model: Claude Sonnet 4.6

