Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.
Does it follow best practices?
Evaluation: 100%
↑ 1.02x agent success when using this tile
Validation for skill structure
Eval Bucket Classification
| Criterion | Without context | With context |
| --- | --- | --- |
| Bucket A: idempotency key | 100% | 100% |
| Bucket B: webhook signature | 100% | 100% |
| Bucket C: HTTP status codes | 100% | 100% |
| Bucket B: currency precision | 100% | 100% |
| Bucket D: API version pinning | 100% | 100% |
| Bucket D highest priority | 100% | 100% |
| Bucket B diagnosis present | 100% | 100% |
| Bucket C action suggested | 50% | 100% |
| Bucket A no-action | 88% | 100% |
| 80% threshold applied | 80% | 100% |
Without context: $0.2341 · 1m 11s · 9 turns · 58 in / 3,805 out tokens
With context: $0.6562 · 2m 52s · 22 turns · 69 in / 8,147 out tokens
Targeted Tile Editing
| Criterion | Without context | With context |
| --- | --- | --- |
| Explicit retry intervals | 100% | 100% |
| Rubric language used | 100% | 100% |
| HMAC section unchanged | 100% | 100% |
| TLS section unchanged | 100% | 100% |
| Observability section unchanged | 100% | 100% |
| Processing section unchanged | 100% | 100% |
| Retry section only changed | 100% | 100% |
| Concise addition | 100% | 100% |
| Max retry count preserved | 100% | 100% |
| Fast acknowledgement preserved | 100% | 100% |
Without context: $0.2012 · 46s · 9 turns · 10 in / 2,794 out tokens
With context: $0.5078 · 1m 51s · 20 turns · 267 in / 6,192 out tokens
Cross-file Contradiction Detection
| Criterion | Without context | With context |
| --- | --- | --- |
| Retry count contradiction found | 100% | 100% |
| Auth failure contradiction found | 100% | 100% |
| All three files referenced | 100% | 100% |
| File attribution per contradiction | 100% | 100% |
| Auth contradiction despite scope | 100% | 100% |
| Verbatim quotes included | 100% | 100% |
Without context: $0.2704 · 1m 21s · 10 turns · 11 in / 4,402 out tokens
With context: $0.4628 · 2m 5s · 14 turns · 12 in / 6,910 out tokens
Regression Root Cause Analysis
| Criterion | Without context | With context |
| --- | --- | --- |
| Contradicting clause identified | 100% | 100% |
| Contradiction mechanism explained | 100% | 100% |
| Remove/clarify approach taken | 100% | 100% |
| Specific text targeted | 100% | 100% |
| No compensating additions | 100% | 100% |
| Other sections preserved | 100% | 100% |
| Pre-review list intact | 100% | 100% |
Without context: $0.2102 · 55s · 11 turns · 60 in / 2,721 out tokens
With context: $0.4734 · 1m 50s · 23 turns · 268 in / 5,486 out tokens
Redundant Criteria Management
| Criterion | Without context | With context |
| --- | --- | --- |
| All redundant criteria identified | 100% | 100% |
| Options presented per criterion | 100% | 100% |
| Useful criteria preserved | 100% | 100% |
| Weight redistribution correct | 100% | 100% |
| 80% threshold applied | 100% | 100% |
| Non-redundant scores unchanged | 100% | 100% |
| Below-threshold excluded | 100% | 100% |
| Removal option named explicitly | 100% | 100% |
Without context: $0.3455 · 1m 42s · 12 turns · 13 in / 5,637 out tokens
With context: $0.4472 · 1m 47s · 21 turns · 17 in / 5,930 out tokens
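The per-scenario cost lines above can be compared directly to see how much extra spend the tile adds. The following is a minimal sketch, assuming the dollar figures are copied by hand from the "Without context" and "With context" lines; the names `RUNS`, `cost_ratio`, and `summarize` are illustrative and are not part of the tile or the Tessl CLI.

```python
# (scenario, cost without context in USD, cost with context in USD),
# copied by hand from the results listed above.
RUNS = [
    ("Eval Bucket Classification", 0.2341, 0.6562),
    ("Targeted Tile Editing", 0.2012, 0.5078),
    ("Cross-file Contradiction Detection", 0.2704, 0.4628),
    ("Regression Root Cause Analysis", 0.2102, 0.4734),
    ("Redundant Criteria Management", 0.3455, 0.4472),
]


def cost_ratio(without_ctx: float, with_ctx: float) -> float:
    """Return how many times more the with-context run cost than the baseline."""
    return with_ctx / without_ctx


def summarize(runs):
    """Map each scenario name to its cost ratio, rounded to two decimals."""
    return {name: round(cost_ratio(a, b), 2) for name, a, b in runs}


if __name__ == "__main__":
    for name, ratio in summarize(RUNS).items():
        print(f"{name}: {ratio:.2f}x")
```

On these figures, every scenario costs more with context than without, and Eval Bucket Classification comes out at roughly 2.80x its baseline cost.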
Install with Tessl CLI
npx tessl i experiments/eval-improve