eval-harness

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

76 · 2.08x

Quality: 42% (Does it follow best practices?)

Impact: 100% · 2.08x (Average score across 6 eval scenarios)

Security (by Snyk): Advisory. Suggest reviewing before use.

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/zh-TW/skills/eval-harness/SKILL.md

Evaluation results

100%

79%

Real-Time Notifications Module: Eval Planning

EDD pre-implementation eval definitions

| Criteria | Without context | With context |
| --- | --- | --- |
| CAPABILITY EVAL header | 0% | 100% |
| Success criteria checkboxes | 0% | 100% |
| Expected output field | 0% | 100% |
| REGRESSION EVAL header | 0% | 100% |
| Regression baseline reference | 0% | 100% |
| Regression test list with status | 0% | 100% |
| .claude/evals/ directory | 0% | 100% |
| Feature-named eval file | 50% | 100% |
| No implementation code | 100% | 100% |
| Covers new capabilities | 50% | 100% |
| Covers regression risk | 50% | 100% |

100%

32%

AI Code Assistant Quality Evaluation Pipeline

Grader type selection and scoring formats

| Criteria | Without context | With context |
| --- | --- | --- |
| Code-based scorer present | 100% | 100% |
| Code scorer applied correctly | 100% | 100% |
| MODEL GRADER PROMPT header | 0% | 100% |
| Model grader has numbered questions | 0% | 100% |
| Model grader score scale | 100% | 100% |
| HUMAN REVIEW REQUIRED header | 0% | 100% |
| Human review risk level | 100% | 100% |
| Human review applied to security | 100% | 100% |
| Deterministic preference stated | 100% | 100% |
| Model grader for qualitative checks | 100% | 100% |

100%

43%

Search Autocomplete: Eval Report Generation

EDD workflow report with pass@k metrics

| Criteria | Without context | With context |
| --- | --- | --- |
| Capability evals section | 100% | 100% |
| Regression evals section | 100% | 100% |
| Individual test results listed | 100% | 100% |
| Capability totals | 100% | 100% |
| pass@1 metric computed | 0% | 100% |
| pass@3 metric computed | 0% | 100% |
| pass^k for regression | 40% | 100% |
| Status line present | 100% | 100% |
| Report in .claude/evals/ | 0% | 100% |
| Feature-named report file | 100% | 100% |
| Regression totals | 100% | 100% |
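
The pass@1, pass@3, and pass^k criteria above are not defined on this page. The following is a minimal sketch, assuming the commonly used definitions: pass@k is the probability that at least one of k sampled attempts passes, and pass^k is the probability that all k sampled attempts pass. The function names `pass_at_k` and `pass_pow_k` are hypothetical and are not taken from the skill itself, which may compute these metrics differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts passes), given c passes in n attempts.

    Assumed definition: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    chance that k attempts drawn without replacement are all failures.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts pass), given c passes in n attempts.

    Assumed definition: C(c, k) / C(n, k), the chance that k attempts
    drawn without replacement are all passes.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical run: 3 attempts at a capability eval, 1 of which passed.
    print(f"pass@1 = {pass_at_k(3, 1, 1):.2f}")
    print(f"pass@3 = {pass_at_k(3, 1, 3):.2f}")
    # Hypothetical run: 3 attempts at a regression eval, 2 of which passed.
    print(f"pass^3 = {pass_pow_k(3, 2, 3):.2f}")
```

Under these assumptions, pass@k rewards a capability that succeeds at least once in k tries, while pass^k is the stricter measure suited to regression checks, where every run is expected to pass.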

Repository: haniakrim21/everything-claude-code
Evaluated agent: Claude Code
Evaluated model: Claude Sonnet 4.6
