Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated
Your team has been running evals on your payments-gateway tile for three months. A senior engineer is planning the next round of tile improvements and wants a clear picture of where the tile is actually earning its weight versus where it's redundant or causing problems.
Specifically, they want to know:
The engineer doesn't want raw scores; they want an actionable breakdown that tells them exactly what to do next.
Produce a file called analysis_report.md that:
The following files are provided as inputs. Extract them before beginning.
=============== FILE: inputs/eval_results.json ===============
{
  "tile": "payments-gateway",
  "eval_id": "eval_2026_03_15_payments",
  "scenarios": [
    {
      "name": "checkout-flow",
      "criteria": [
        {
          "name": "Stripe idempotency key",
          "description": "Uses idempotency key in Stripe charge requests to prevent duplicate charges",
          "max_score": 15,
          "baseline_score": 2,
          "with_context_score": 13
        },
        {
          "name": "Webhook signature validation",
          "description": "Validates Stripe webhook signatures using the signing secret before processing events",
          "max_score": 10,
          "baseline_score": 1,
          "with_context_score": 5
        },
        {
          "name": "HTTP status code handling",
          "description": "Returns appropriate HTTP status codes (200, 400, 422, 500) in API responses",
          "max_score": 10,
          "baseline_score": 9,
          "with_context_score": 10
        },
        {
          "name": "Currency precision",
          "description": "Represents all currency values as integer cents rather than floating point dollars",
          "max_score": 15,
          "baseline_score": 3,
          "with_context_score": 7
        },
        {
          "name": "API version pinning",
          "description": "Pins the Stripe API version string (e.g. '2023-10-16') in all API requests",
          "max_score": 10,
          "baseline_score": 6,
          "with_context_score": 4
        }
      ]
    }
  ]
}
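As a starting point for the analysis, one plausible way to triage this data is to compute, per criterion, the lift the tile provides (`with_context_score - baseline_score`) and bucket each criterion by outcome. The bucket thresholds below (baseline ≥ 80% of max means "redundant", with-context < 70% of max means "needs work") are illustrative assumptions, not part of the task spec; the JSON is trimmed to the fields the sketch uses.

```python
import json

# Trimmed copy of inputs/eval_results.json (descriptions omitted for brevity).
RAW = """
{
  "tile": "payments-gateway",
  "scenarios": [
    {"name": "checkout-flow", "criteria": [
      {"name": "Stripe idempotency key", "max_score": 15, "baseline_score": 2, "with_context_score": 13},
      {"name": "Webhook signature validation", "max_score": 10, "baseline_score": 1, "with_context_score": 5},
      {"name": "HTTP status code handling", "max_score": 10, "baseline_score": 9, "with_context_score": 10},
      {"name": "Currency precision", "max_score": 15, "baseline_score": 3, "with_context_score": 7},
      {"name": "API version pinning", "max_score": 10, "baseline_score": 6, "with_context_score": 4}
    ]}
  ]
}
"""

verdicts = {}
for scenario in json.loads(RAW)["scenarios"]:
    for c in scenario["criteria"]:
        lift = c["with_context_score"] - c["baseline_score"]
        share = c["with_context_score"] / c["max_score"]
        if lift < 0:
            verdict = "regression"   # tile actively hurts this criterion
        elif c["baseline_score"] / c["max_score"] >= 0.8:
            verdict = "redundant"    # agent already did well without the tile
        elif share < 0.7:
            verdict = "needs work"   # tile helps, but criterion is still weak
        else:
            verdict = "working"
        verdicts[c["name"]] = verdict
        print(f"{c['name']}: lift {lift:+d}, {share:.0%} of max -> {verdict}")
```

On this data the sketch flags "API version pinning" as a regression (the tile drops the score from 6 to 4) and "HTTP status code handling" as redundant (9/10 without the tile), which is exactly the kind of earning-its-weight breakdown the engineer is asking for.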