
eval-harness

A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles

Install with Tessl CLI

npx tessl i github:affaan-m/everything-claude-code --skill eval-harness

47

Does it follow best practices?

Validation for skill structure


Evaluation results

100%

59%

User Profile Update API

EDD pre-implementation eval definition

| Criteria | Without context | With context |
| --- | --- | --- |
| Capability eval present | 0% | 100% |
| Capability eval format: task | 50% | 100% |
| Capability eval format: criteria checkboxes | 0% | 100% |
| Capability eval format: expected output | 37% | 100% |
| Regression eval present | 25% | 100% |
| Regression eval format: baseline | 0% | 100% |
| Regression eval format: test list | 37% | 100% |
| Correct storage path | 0% | 100% |
| Plan describes define-first order | 100% | 100% |
| Four-step workflow present | 70% | 100% |
| No implementation code | 100% | 100% |

Without context: $0.4005 · 1m 40s · 17 turns · 550 in / 6,104 out tokens

With context: $0.4221 · 1m 33s · 21 turns · 539 in / 5,399 out tokens

88%

26%

Payment Module Evaluation Strategy

Scorer type selection and formatting

| Criteria | Without context | With context |
| --- | --- | --- |
| Code-based scorer present | 40% | 100% |
| Code scorer format | 12% | 100% |
| Model-based scorer present | 40% | 100% |
| Model scorer format: numbered questions | 37% | 0% |
| Model scorer format: rating scale | 100% | 50% |
| Human review scorer present | 40% | 100% |
| Human review format: fields | 12% | 100% |
| Human review for security | 100% | 100% |
| Code scorer preferred for deterministic checks | 90% | 100% |
| Rationale distinguishes scorer types | 100% | 100% |
| No security check fully automated | 100% | 100% |

Without context: $0.3470 · 2m · 18 turns · 25 in / 5,824 out tokens

With context: $0.4538 · 2m 1s · 22 turns · 26 in / 6,264 out tokens
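
The criterion "Code scorer preferred for deterministic checks" can be illustrated with a minimal sketch of a code-based scorer. This is not the skill's own format: the function `run_code_scorer`, the `payment` record, and every field name and check below are hypothetical, chosen only to show checks that a program can decide without a model or a human reviewer.

```python
from typing import Callable, Dict

# Hypothetical check type: takes the module's output and returns pass/fail.
Check = Callable[[dict], bool]


def run_code_scorer(output: dict, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run each deterministic check and record pass/fail by check name."""
    return {name: bool(check(output)) for name, check in checks.items()}


# Illustrative payment result; all field names are assumptions.
payment = {"amount_cents": 1999, "currency": "USD", "status": "captured"}

# Each check is a pure function of the output, so reruns give the same verdict.
checks = {
    "amount is positive": lambda o: o["amount_cents"] > 0,
    "currency is a 3-letter uppercase code": lambda o: (
        len(o["currency"]) == 3 and o["currency"].isupper()
    ),
    "status is a known value": lambda o: o["status"] in {"captured", "refunded", "failed"},
}

results = run_code_scorer(payment, checks)
score = sum(results.values()) / len(results)  # fraction of checks passed
```

Checks like these suit a code-based scorer because they are repeatable. Judgments that cannot be reduced to such predicates, such as the security review the criteria above reserve for humans, would stay outside it.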

100%

33%

Order Processing Reliability Report

pass@k metrics and eval report format

| Criteria | Without context | With context |
| --- | --- | --- |
| Capability section present | 100% | 100% |
| Regression section present | 100% | 100% |
| pass@k used for capability | 0% | 100% |
| pass^k used for critical path | 75% | 100% |
| pass@k computed correctly | 0% | 100% |
| pass^k computed correctly | 100% | 100% |
| Metrics section present | 50% | 100% |
| Status/conclusion present | 100% | 100% |
| Critical path justification | 90% | 100% |
| Pass@1 computed for capability | 75% | 100% |
| Report completeness | 87% | 100% |

Without context: $0.1612 · 1m 5s · 8 turns · 13 in / 3,494 out tokens

With context: $0.3082 · 1m 24s · 14 turns · 17 in / 4,371 out tokens
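
The criteria above refer to pass@k and pass^k without defining them on this page. The sketch below assumes the common reading: pass@k means at least one of k attempts succeeds, and pass^k means all k attempts succeed. The function names are hypothetical, and the closed-form versions further assume independent attempts with a fixed per-attempt success rate `p`.

```python
from typing import Sequence


def pass_at_k(trials: Sequence[bool], k: int) -> bool:
    """pass@k (assumed definition): at least one of the first k trials passed."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return any(trials[:k])


def pass_pow_k(trials: Sequence[bool], k: int) -> bool:
    """pass^k (assumed definition): every one of the first k trials passed."""
    if not 1 <= k <= len(trials):
        raise ValueError("k must be between 1 and the number of trials")
    return all(trials[:k])


def expected_pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k


def expected_pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k
```

Under these assumptions pass@k grows with k while pass^k shrinks, which is consistent with the criteria pairing pass@k with capability evals and pass^k with critical paths, where every run is expected to succeed.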

Evaluated agent: Claude Code
