
experiments/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.

Does it follow best practices?

Evaluation: 100%

1.02x agent success when using this tile

Validation for skill structure


Evaluation results

100%

8%

Payments Tile Eval Analysis

Eval Bucket Classification

| Criteria | Without context | With context |
| --- | --- | --- |
| Bucket A: idempotency key | 100% | 100% |
| Bucket B: webhook signature | 100% | 100% |
| Bucket C: HTTP status codes | 100% | 100% |
| Bucket B: currency precision | 100% | 100% |
| Bucket D: API version pinning | 100% | 100% |
| Bucket D highest priority | 100% | 100% |
| Bucket B diagnosis present | 100% | 100% |
| Bucket C action suggested | 50% | 100% |
| Bucket A no-action | 88% | 100% |
| 80% threshold applied | 80% | 100% |

Without context: $0.2341 · 1m 11s · 9 turns · 58 in / 3,805 out tokens

With context: $0.6562 · 2m 52s · 22 turns · 69 in / 8,147 out tokens
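As a rough illustration of how the Payments scenario scores above could be read, the sketch below computes the per-criterion change between the two runs and lists criteria that fall under a cutoff without context. The scores are copied from the table. The 80 cutoff and the strict "below" comparison are assumptions borrowed from the "80% threshold applied" criterion name, and the function names are hypothetical, not part of the tile or the Tessl CLI.

```python
# Scores copied from the Payments table, in percent: (without context, with context).
SCORES = {
    "Bucket A: idempotency key": (100, 100),
    "Bucket B: webhook signature": (100, 100),
    "Bucket C: HTTP status codes": (100, 100),
    "Bucket B: currency precision": (100, 100),
    "Bucket D: API version pinning": (100, 100),
    "Bucket D highest priority": (100, 100),
    "Bucket B diagnosis present": (100, 100),
    "Bucket C action suggested": (50, 100),
    "Bucket A no-action": (88, 100),
    "80% threshold applied": (80, 100),
}

# Assumed cutoff; the page names an 80% threshold but does not define how it is applied.
THRESHOLD = 80


def context_lift(scores):
    """Return {criterion: with - without} for criteria whose score changed."""
    return {
        name: with_ctx - without_ctx
        for name, (without_ctx, with_ctx) in scores.items()
        if with_ctx != without_ctx
    }


def below_threshold(scores, threshold=THRESHOLD):
    """Return criteria whose without-context score is strictly below the threshold."""
    return [name for name, (without_ctx, _) in scores.items() if without_ctx < threshold]


if __name__ == "__main__":
    for name, delta in context_lift(SCORES).items():
        print(f"{name}: {delta:+d} points with context")
    print("Below threshold without context:", below_threshold(SCORES))
```

Under these assumptions, only three criteria change between the runs, and only "Bucket C action suggested" scores under the cutoff without context.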

100%

Webhook Processor Tile: Retry Reliability Fix

Targeted Tile Editing

| Criteria | Without context | With context |
| --- | --- | --- |
| Explicit retry intervals | 100% | 100% |
| Rubric language used | 100% | 100% |
| HMAC section unchanged | 100% | 100% |
| TLS section unchanged | 100% | 100% |
| Observability section unchanged | 100% | 100% |
| Processing section unchanged | 100% | 100% |
| Retry section only changed | 100% | 100% |
| Concise addition | 100% | 100% |
| Max retry count preserved | 100% | 100% |
| Fast acknowledgement preserved | 100% | 100% |

Without context: $0.2012 · 46s · 9 turns · 10 in / 2,794 out tokens

With context: $0.5078 · 1m 51s · 20 turns · 267 in / 6,192 out tokens

100%

Data Pipeline Tile: Consistency Audit

Cross-file Contradiction Detection

| Criteria | Without context | With context |
| --- | --- | --- |
| Retry count contradiction found | 100% | 100% |
| Auth failure contradiction found | 100% | 100% |
| All three files referenced | 100% | 100% |
| File attribution per contradiction | 100% | 100% |
| Auth contradiction despite scope | 100% | 100% |
| Verbatim quotes included | 100% | 100% |

Without context: $0.2704 · 1m 21s · 10 turns · 11 in / 4,402 out tokens

With context: $0.4628 · 2m 5s · 14 turns · 12 in / 6,910 out tokens

100%

Code Review Tile: Regression Investigation

Regression Root Cause Analysis

| Criteria | Without context | With context |
| --- | --- | --- |
| Contradicting clause identified | 100% | 100% |
| Contradiction mechanism explained | 100% | 100% |
| Remove/clarify approach taken | 100% | 100% |
| Specific text targeted | 100% | 100% |
| No compensating additions | 100% | 100% |
| Other sections preserved | 100% | 100% |
| Pre-review list intact | 100% | 100% |

Without context: $0.2102 · 55s · 11 turns · 60 in / 2,721 out tokens

With context: $0.4734 · 1m 50s · 23 turns · 268 in / 5,486 out tokens

100%

API Integration Tile: Eval Rubric Review

Redundant Criteria Management

| Criteria | Without context | With context |
| --- | --- | --- |
| All redundant criteria identified | 100% | 100% |
| Options presented per criterion | 100% | 100% |
| Useful criteria preserved | 100% | 100% |
| Weight redistribution correct | 100% | 100% |
| 80% threshold applied | 100% | 100% |
| Non-redundant scores unchanged | 100% | 100% |
| Below-threshold excluded | 100% | 100% |
| Removal option named explicitly | 100% | 100% |

Without context: $0.3455 · 1m 42s · 12 turns · 13 in / 5,637 out tokens

With context: $0.4472 · 1m 47s · 21 turns · 17 in / 5,930 out tokens

Install with Tessl CLI

npx tessl i experiments/eval-improve@0.4.0
Evaluated agent: Claude Code
