Optimize your skills and tiles: review SKILL.md quality, generate eval scenarios, run evals, compare across models, diagnose gaps, and re-run until scores improve.
88
Does it follow best practices? 94%
Impact: 88% (1.07x), average score across 24 eval scenarios
Passed: no known issues
{
  "context": "Tests whether the agent correctly classifies eval criteria into four distinct performance buckets using the 80% threshold rules, and whether the analysis includes the right diagnosis, priority signals, and action recommendations for each bucket type.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Bucket A: idempotency key",
      "description": "The 'Stripe idempotency key' criterion (13/15 = 87% with-context, significantly above 2/15 baseline) is classified as working well / no action needed",
      "max_score": 8
    },
    {
      "name": "Bucket B: webhook signature",
      "description": "The 'Webhook signature validation' criterion (5/10 = 50% with-context, low baseline 1/10) is classified as a tile gap that needs a fix",
      "max_score": 8
    },
    {
      "name": "Bucket C: HTTP status codes",
      "description": "The 'HTTP status code handling' criterion (9/10 baseline already high, >= 80% without tile) is classified as redundant / agents already know this",
      "max_score": 8
    },
    {
      "name": "Bucket B: currency precision",
      "description": "The 'Currency precision' criterion (7/15 = 47% with-context, low baseline 3/15) is classified as a tile gap",
      "max_score": 8
    },
    {
      "name": "Bucket D: API version pinning",
      "description": "The 'API version pinning' criterion (with-context 4/10 is LOWER than baseline 6/10) is classified as a regression",
      "max_score": 15
    },
    {
      "name": "Bucket D highest priority",
      "description": "The regression criterion (API version pinning) is explicitly marked as highest priority or most urgent to fix",
      "max_score": 10
    },
    {
      "name": "Bucket B diagnosis present",
      "description": "Both Bucket B criteria include a diagnosis (what the tile is likely missing) and a reference to which tile file should be updated",
      "max_score": 15
    },
    {
      "name": "Bucket C action suggested",
      "description": "The Bucket C criterion (HTTP status codes) is flagged for potential removal from criteria.json or for making the eval task harder",
      "max_score": 10
    },
    {
      "name": "Bucket A no-action",
      "description": "The Bucket A criterion (idempotency key) is noted as leave alone / no action needed / tile's strength",
      "max_score": 8
    },
    {
      "name": "80% threshold applied",
      "description": "Classifications are consistent with the 80% of max threshold (e.g., 87% passes Bucket A; 50%, 47% fall into Bucket B; 90% baseline triggers Bucket C)",
      "max_score": 10
    }
  ]
}
}
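The bucket rules that the checklist expects can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tile's actual implementation: the `Criterion` type, the `classify` function, and the order in which the rules are checked (regression first, then redundancy, then the 80% pass check) are hypothetical. The with-context score for "HTTP status code handling" is not given in the checklist, so the value 9/10 used below is a placeholder.

```python
from dataclasses import dataclass

# 80% of max, kept as an integer ratio so comparisons avoid float rounding.
THRESHOLD_NUM, THRESHOLD_DEN = 4, 5


@dataclass
class Criterion:
    # Hypothetical container; field names are illustrative only.
    name: str
    with_context: int  # score achieved with the tile loaded
    baseline: int      # score achieved without the tile
    max_score: int


def meets_threshold(score: int, max_score: int) -> bool:
    """True when score is at least 80% of max_score."""
    return score * THRESHOLD_DEN >= max_score * THRESHOLD_NUM


def classify(c: Criterion) -> str:
    """Return the bucket letter for one criterion (assumed rule order)."""
    if c.with_context < c.baseline:
        return "D"  # regression: the tile makes things worse
    if meets_threshold(c.baseline, c.max_score):
        return "C"  # redundant: agents already do this without the tile
    if meets_threshold(c.with_context, c.max_score):
        return "A"  # working well: no action needed
    return "B"      # tile gap: needs a fix


CRITERIA = [
    Criterion("Stripe idempotency key", 13, 2, 15),
    Criterion("Webhook signature validation", 5, 1, 10),
    # with_context=9 is a placeholder; the checklist only states the baseline.
    Criterion("HTTP status code handling", 9, 9, 10),
    Criterion("Currency precision", 7, 3, 15),
    Criterion("API version pinning", 4, 6, 10),
]

results = {c.name: classify(c) for c in CRITERIA}

for name, bucket in results.items():
    print(f"{bucket}: {name}")
```

Checking for a regression before anything else reflects the checklist's emphasis on Bucket D as the highest priority; a different rule order could change the result for criteria that satisfy more than one condition.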