Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
94%
Does it follow best practices?
Impact
88%
1.07x
Average score across 24 eval scenarios
Passed
No known issues
Your team has been running evals on your payments-gateway tile for three months. A senior engineer is planning the next round of tile improvements and wants a clear picture of where the tile is actually earning its weight versus where it's redundant or causing problems.
Specifically, they want to know:
The engineer doesn't want raw scores — they want an actionable breakdown that tells them exactly what to do next.
Produce a file called analysis_report.md that:
The following files are provided as inputs. Extract them before beginning.
=============== FILE: inputs/eval_results.json ===============
{
  "tile": "payments-gateway",
  "eval_id": "eval_2026_03_15_payments",
  "scenarios": [
    {
      "name": "checkout-flow",
      "criteria": [
        {
          "name": "Stripe idempotency key",
          "description": "Uses idempotency key in Stripe charge requests to prevent duplicate charges",
          "max_score": 15,
          "baseline_score": 2,
          "with_context_score": 13
        },
        {
          "name": "Webhook signature validation",
          "description": "Validates Stripe webhook signatures using the signing secret before processing events",
          "max_score": 10,
          "baseline_score": 1,
          "with_context_score": 5
        },
        {
          "name": "HTTP status code handling",
          "description": "Returns appropriate HTTP status codes (200, 400, 422, 500) in API responses",
          "max_score": 10,
          "baseline_score": 9,
          "with_context_score": 10
        },
        {
          "name": "Currency precision",
          "description": "Represents all currency values as integer cents rather than floating point dollars",
          "max_score": 15,
          "baseline_score": 3,
          "with_context_score": 7
        },
        {
          "name": "API version pinning",
          "description": "Pins the Stripe API version string (e.g. '2023-10-16') in all API requests",
          "max_score": 10,
          "baseline_score": 6,
          "with_context_score": 4
        }
      ]
    }
  ]
}
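One way to turn these scores into the actionable breakdown the scenario asks for is to compute, per criterion, the change from baseline to with-context and sort each criterion into a bucket. The sketch below is illustrative only: the bucket names (working, gap, redundant, regression) and the 0.8 cut-offs are assumptions chosen for this example and are not defined anywhere in the input file. The score tuples are copied from inputs/eval_results.json.

```python
# Scores copied from inputs/eval_results.json (scenario "checkout-flow").
# Each tuple is (criterion, max_score, baseline_score, with_context_score).
CRITERIA = [
    ("Stripe idempotency key", 15, 2, 13),
    ("Webhook signature validation", 10, 1, 5),
    ("HTTP status code handling", 10, 9, 10),
    ("Currency precision", 15, 3, 7),
    ("API version pinning", 10, 6, 4),
]

# Hypothetical cut-offs, expressed as fractions of max_score.
# They are assumptions for this sketch, not values from the eval tooling.
HIGH_BASELINE = 0.8  # baseline already this good: the tile adds little
SOLVED = 0.8         # with-context score this good: the gap is closed


def classify(max_score, baseline, with_context):
    """Bucket one criterion by how the tile changed its score."""
    if with_context < baseline:
        return "regression"   # the tile made the result worse
    if baseline / max_score >= HIGH_BASELINE:
        return "redundant"    # the model already does this unaided
    if with_context / max_score >= SOLVED:
        return "working"      # the tile closes most of the gap
    return "gap"              # the tile helps, but not enough


def analyze(criteria):
    """Return (name, delta, bucket) rows, largest improvement first."""
    rows = [
        (name, ctx - base, classify(mx, base, ctx))
        for name, mx, base, ctx in criteria
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, delta, bucket in analyze(CRITERIA):
        print(f"{name:30} {delta:+3d}  {bucket}")
```

Under these assumed cut-offs, the idempotency criterion lands in "working", HTTP status code handling in "redundant", and API version pinning in "regression", since its with-context score (4) is below its baseline (6). The two remaining criteria fall into "gap".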