
tessl-labs/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated

94

1.30x

Quality: 89% (Does it follow best practices?)

Impact: 98%, 1.30x (Average score across 7 eval scenarios)

Security by Snyk: Passed (No known issues)


Evaluation results

80%

40%

Scenario 1

| Criteria | Without context | With context |
| --- | --- | --- |
| reads_tile_files_before_fixing | 0% | 100% |
| proposes_before_applying | 100% | 100% |
| targeted_fix_not_rewrite | 0% | 100% |
| commits_before_rerun | 66% | 66% |
| workspace_in_eval_run | 0% | 0% |

Without context: $0.3257 · 2m 38s · 40 turns · 412 in / 15,014 out tokens

With context: $0.1912 · 1m 36s · 33 turns · 312 in / 8,577 out tokens

83%

83%

Scenario 2

| Criteria | Without context | With context |
| --- | --- | --- |
| runs_eval_view | 0% | 100% |
| runs_eval_compare_with_workspace | 0% | 50% |
| classifies_into_four_buckets | 0% | 75% |
| prioritizes_bucket_d | 0% | 100% |
| asks_before_fixing | 0% | 100% |

Without context: $0.2341 · 2m 16s · 29 turns · 283 in / 15,234 out tokens

With context: $0.1459 · 1m 57s · 22 turns · 434 in / 9,368 out tokens

100%

Data Pipeline Tile: Consistency Audit

| Criteria | Without context | With context |
| --- | --- | --- |
| Retry count contradiction found | 100% | 100% |
| Auth failure contradiction found | 100% | 100% |
| All three files referenced | 100% | 100% |
| File attribution per contradiction | 100% | 100% |
| Auth contradiction despite scope | 100% | 100% |
| Verbatim quotes included | 100% | 100% |

Without context: $0.0621 · 49s · 10 turns · 111 in / 5,671 out tokens

With context: $0.0773 · 1m 1s · 11 turns · 128 in / 7,043 out tokens

85%

-11%

Payments Tile Eval Analysis

| Criteria | Without context | With context |
| --- | --- | --- |
| Bucket A: idempotency key | 100% | 100% |
| Bucket B: webhook signature | 100% | 87% |
| Bucket C: HTTP status codes | 100% | 100% |
| Bucket B: currency precision | 100% | 87% |
| Bucket D: API version pinning | 100% | 100% |
| Bucket D highest priority | 100% | 100% |
| Bucket B diagnosis present | 100% | 46% |
| Bucket C action suggested | 70% | 60% |
| Bucket A no-action | 100% | 100% |
| 80% threshold applied | 90% | 90% |

Without context: $0.0528 · 40s · 9 turns · 104 in / 4,286 out tokens

With context: $0.0497 · 35s · 8 turns · 82 in / 3,562 out tokens

100%

20%

API Integration Tile: Eval Rubric Review

| Criteria | Without context | With context |
| --- | --- | --- |
| All redundant criteria identified | 100% | 100% |
| Options presented per criterion | 100% | 100% |
| Useful criteria preserved | 100% | 100% |
| Weight redistribution correct | 0% | 100% |
| 80% threshold applied | 100% | 100% |
| Non-redundant scores unchanged | 100% | 100% |
| Below-threshold excluded | 100% | 100% |
| Removal option named explicitly | 100% | 100% |

Without context: $0.0705 · 1m 3s · 11 turns · 122 in / 6,441 out tokens

With context: $0.1113 · 1m 34s · 19 turns · 203 in / 8,521 out tokens

100%

Code Review Tile: Regression Investigation

| Criteria | Without context | With context |
| --- | --- | --- |
| Contradicting clause identified | 100% | 100% |
| Contradiction mechanism explained | 100% | 100% |
| Remove/clarify approach taken | 100% | 100% |
| Specific text targeted | 100% | 100% |
| No compensating additions | 100% | 100% |
| Other sections preserved | 100% | 100% |
| Pre-review list intact | 100% | 100% |

Without context: $0.0513 · 36s · 11 turns · 129 in / 3,622 out tokens

With context: $0.0492 · 39s · 11 turns · 155 in / 3,666 out tokens

89%

-1%

Webhook Processor Tile: Retry Reliability Fix

| Criteria | Without context | With context |
| --- | --- | --- |
| Explicit retry intervals | 100% | 100% |
| Rubric language used | 100% | 100% |
| HMAC section unchanged | 100% | 100% |
| TLS section unchanged | 100% | 100% |
| Observability section unchanged | 100% | 100% |
| Processing section unchanged | 100% | 28% |
| Retry section only changed | 100% | 50% |
| Concise addition | 0% | 100% |
| Max retry count preserved | 100% | 100% |
| Fast acknowledgement preserved | 100% | 100% |

Without context: $0.0347 · 28s · 7 turns · 81 in / 2,724 out tokens

With context: $0.0711 · 49s · 15 turns · 159 in / 4,431 out tokens

Evaluated
Agent: Claude
Model: Claude Haiku 4.5
