experiments/eval-improve

Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated

Review — 71%

Does it follow best practices?

Evaluation — 100%

↑ 1.02x

Agent success when using this tile

Validation — 11 / 11 Passed

Validation for skill structure

Payments Tile Eval Analysis

Name: experiments/eval-improve
Author: experiments

Problem Description

Your team has been running evals on your payments-gateway tile for three months. A senior engineer is planning the next round of tile improvements and wants a clear picture of where the tile is actually earning its keep versus where it's redundant or causing problems.

Specifically, they want to know:

Which criteria is the tile genuinely helping with (agents perform much better with it than without)?
Which criteria are agents already handling on their own, even without the tile?
Which criteria should be top priorities because agents are doing worse with the tile than without it?
For anything problematic: what's a plausible explanation and where in the tile should someone look?

The engineer doesn't want raw scores — they want an actionable breakdown that tells them exactly what to do next.
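The questions above amount to comparing each criterion's baseline score with its with-tile score. A minimal sketch of one way to turn that comparison into status labels follows; the function name `classify`, the label strings, and the 0.8 and 0.5 cut-offs are all illustrative assumptions, not part of the task.

```python
def classify(baseline: float, with_tile: float, max_score: float) -> str:
    """Assign a status label to one criterion.

    The 0.8 and 0.5 cut-offs are assumptions chosen for illustration only.
    """
    if max_score <= 0:
        raise ValueError("max_score must be positive")

    baseline_pct = baseline / max_score          # how well agents do alone
    lift = (with_tile - baseline) / max_score    # gain from the tile, as a share of max

    if with_tile < baseline:
        return "REGRESSION"        # agents do worse with the tile: top priority
    if baseline_pct >= 0.8:
        return "ALREADY HANDLED"   # agents manage without the tile
    if lift >= 0.5:
        return "TILE HELPS"        # large gain from the tile
    return "NEEDS WORK"            # some gain, but still well short of max


print(classify(2, 13, 15))  # TILE HELPS
print(classify(9, 10, 10))  # ALREADY HANDLED
print(classify(6, 4, 10))   # REGRESSION
```

The regression check comes first so that a criterion with a high baseline that still drops with the tile is never reported as merely redundant.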

Output Specification

Produce a file called analysis_report.md that:

Categorizes each criterion into a status group with a short label
For each criterion, shows its scores (baseline and with-tile) and the percentage of max it achieves
For criteria that need action, includes a short diagnosis (1-2 sentences on what's likely missing or wrong) and which file in the tile to look at
Ends with a prioritized list of recommended next steps
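As a rough illustration of the per-criterion score line, the sketch below computes the percentage of max for both scores and formats one row. The helper names `pct_of_max` and `report_line`, and the row layout itself, are hypothetical; the specification does not prescribe a format.

```python
def pct_of_max(score: float, max_score: float) -> int:
    """Return score as a whole-number percentage of max_score."""
    return round(100 * score / max_score)


def report_line(name: str, baseline: int, with_tile: int, max_score: int) -> str:
    """Format one criterion row showing both scores and their share of max."""
    return (
        f"{name}: baseline {baseline}/{max_score} "
        f"({pct_of_max(baseline, max_score)}%), "
        f"with tile {with_tile}/{max_score} "
        f"({pct_of_max(with_tile, max_score)}%)"
    )


print(report_line("Stripe idempotency key", 2, 13, 15))
# Stripe idempotency key: baseline 2/15 (13%), with tile 13/15 (87%)
```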

Input Files

The following files are provided as inputs. Extract them before beginning.

=============== FILE: inputs/eval_results.json ===============
{
  "tile": "payments-gateway",
  "eval_id": "eval_2026_03_15_payments",
  "scenarios": [
    {
      "name": "checkout-flow",
      "criteria": [
        {
          "name": "Stripe idempotency key",
          "description": "Uses idempotency key in Stripe charge requests to prevent duplicate charges",
          "max_score": 15,
          "baseline_score": 2,
          "with_context_score": 13
        },
        {
          "name": "Webhook signature validation",
          "description": "Validates Stripe webhook signatures using the signing secret before processing events",
          "max_score": 10,
          "baseline_score": 1,
          "with_context_score": 5
        },
        {
          "name": "HTTP status code handling",
          "description": "Returns appropriate HTTP status codes (200, 400, 422, 500) in API responses",
          "max_score": 10,
          "baseline_score": 9,
          "with_context_score": 10
        },
        {
          "name": "Currency precision",
          "description": "Represents all currency values as integer cents rather than floating point dollars",
          "max_score": 15,
          "baseline_score": 3,
          "with_context_score": 7
        },
        {
          "name": "API version pinning",
          "description": "Pins the Stripe API version string (e.g. '2023-10-16') in all API requests",
          "max_score": 10,
          "baseline_score": 6,
          "with_context_score": 4
        }
      ]
    }
  ]
}
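A small sketch of how a file shaped like this might be walked to rank criteria by the change the tile produces. The function name `rank_by_delta` is hypothetical, and `SAMPLE` is a two-criterion excerpt of the data above, embedded so the snippet runs on its own; in practice the JSON would be read from `inputs/eval_results.json`.

```python
import json

# Two criteria excerpted from the input file, inlined for a self-contained example.
SAMPLE = """
{"scenarios": [{"name": "checkout-flow", "criteria": [
  {"name": "Stripe idempotency key", "max_score": 15,
   "baseline_score": 2, "with_context_score": 13},
  {"name": "API version pinning", "max_score": 10,
   "baseline_score": 6, "with_context_score": 4}
]}]}
"""


def rank_by_delta(results: dict) -> list[tuple[str, int]]:
    """Return (criterion name, with-tile minus baseline), worst change first."""
    rows = []
    for scenario in results["scenarios"]:
        for criterion in scenario["criteria"]:
            delta = criterion["with_context_score"] - criterion["baseline_score"]
            rows.append((criterion["name"], delta))
    return sorted(rows, key=lambda row: row[1])


ranked = rank_by_delta(json.loads(SAMPLE))
for name, delta in ranked:
    print(f"{delta:+d}  {name}")
# -2  API version pinning
# +11  Stripe idempotency key
```

Sorting ascending puts negative deltas at the top, which is one simple way to surface regressions first in a prioritized list.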

Install with Tessl CLI

npx tessl i experiments/eval-improve@0.4.0

evals

scenario-1

rubric.json

task.md

scenario-2

scenario-3

scenario-4

scenario-5

skills

README.md

tile.json
