Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.
Does it follow best practices?
Evaluation — 100%
↑ 1.02x Agent success when using this tile
Validation for skill structure
{
  "context": "Tests whether the agent correctly classifies eval criteria into four distinct performance buckets using the 80% threshold rules, and whether the analysis includes the right diagnosis, priority signals, and action recommendations for each bucket type.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Bucket A: idempotency key",
      "description": "The 'Stripe idempotency key' criterion (13/15 = 87% with-context, significantly above 2/15 baseline) is classified as working well / no action needed",
      "max_score": 8
    },
    {
      "name": "Bucket B: webhook signature",
      "description": "The 'Webhook signature validation' criterion (5/10 = 50% with-context, low baseline 1/10) is classified as a tile gap that needs a fix",
      "max_score": 8
    },
    {
      "name": "Bucket C: HTTP status codes",
      "description": "The 'HTTP status code handling' criterion (9/10 baseline already high, >= 80% without tile) is classified as redundant / agents already know this",
      "max_score": 8
    },
    {
      "name": "Bucket B: currency precision",
      "description": "The 'Currency precision' criterion (7/15 = 47% with-context, low baseline 3/15) is classified as a tile gap",
      "max_score": 8
    },
    {
      "name": "Bucket D: API version pinning",
      "description": "The 'API version pinning' criterion (with-context 4/10 is LOWER than baseline 6/10) is classified as a regression",
      "max_score": 15
    },
    {
      "name": "Bucket D highest priority",
      "description": "The regression criterion (API version pinning) is explicitly marked as highest priority or most urgent to fix",
      "max_score": 10
    },
    {
      "name": "Bucket B diagnosis present",
      "description": "Both Bucket B criteria include a diagnosis (what the tile is likely missing) and a reference to which tile file should be updated",
      "max_score": 15
    },
    {
      "name": "Bucket C action suggested",
      "description": "The Bucket C criterion (HTTP status codes) is flagged for potential removal from criteria.json or for making the eval task harder",
      "max_score": 10
    },
    {
      "name": "Bucket A no-action",
      "description": "The Bucket A criterion (idempotency key) is noted as leave alone / no action needed / tile's strength",
      "max_score": 8
    },
    {
      "name": "80% threshold applied",
      "description": "Classifications are consistent with the 80% of max threshold (e.g., 87% passes Bucket A; 50%, 47% fall into Bucket B; 90% baseline triggers Bucket C)",
      "max_score": 10
    }
  ]
}
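
The bucket rules this checklist tests can be sketched in a few lines of Python. This is a minimal illustration, not the tile's actual implementation: the `Criterion` class, the `classify` function, and the order in which the checks run (D, then C, then A, then B) are assumptions, since the checklist states the condition for each bucket but not their precedence. The with-context score for 'HTTP status code handling' is not given in the checklist, so the value used below is a placeholder.

```python
from dataclasses import dataclass

THRESHOLD = 0.80  # the 80%-of-max rule named in the checklist


@dataclass
class Criterion:
    name: str
    with_context: int  # score with the tile loaded
    baseline: int      # score without the tile
    max_score: int


def classify(c: Criterion) -> str:
    """Return the bucket letter for one criterion.

    Check order (D, C, A, B) is an assumption; the checklist only gives
    the condition for each bucket, not which one wins on overlap.
    """
    with_pct = c.with_context / c.max_score
    base_pct = c.baseline / c.max_score
    if with_pct < base_pct:
        return "D"  # regression: the tile makes results worse
    if base_pct >= THRESHOLD:
        return "C"  # redundant: agents already pass without the tile
    if with_pct >= THRESHOLD:
        return "A"  # working well: leave alone
    return "B"      # tile gap: needs a fix


criteria = [
    Criterion("Stripe idempotency key", 13, 2, 15),
    Criterion("Webhook signature validation", 5, 1, 10),
    # With-context score is not stated in the checklist; 9 is a placeholder.
    Criterion("HTTP status code handling", 9, 9, 10),
    Criterion("Currency precision", 7, 3, 15),
    Criterion("API version pinning", 4, 6, 10),
]

buckets = {c.name: classify(c) for c in criteria}
for name, bucket in buckets.items():
    print(f"{bucket}: {name}")
```

Checking the regression case first reflects the checklist's own emphasis, which marks Bucket D as the highest priority; a different precedence would only matter for criteria that satisfy more than one condition.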