Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.
Does it follow best practices?
Evaluation — 100%
↑ 1.02x agent success when using this tile
Validation for skill structure
{
"context": "Tests whether the agent correctly identifies a regression as caused by ambiguous or contradictory content in the tile (rather than missing content), and proposes a removal or clarification fix rather than adding more emphasis on the failing behavior.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Contradicting clause identified",
"description": "The analysis identifies the 'documentation-only or clearly trivial... you may skip the test run' sentence (or equivalent 'at your discretion' exception) as the specific cause of the regression",
"max_score": 22
},
{
"name": "Contradiction mechanism explained",
"description": "The analysis explains that this clause contradicts 'Always run the full test suite' and gives agents a justification to skip tests — not just that 'it's confusing'",
"max_score": 18
},
{
"name": "Remove/clarify approach taken",
"description": "The proposed fix removes or substantially rewrites the exception clause — NOT just adding another 'always run tests' statement elsewhere to compensate",
"max_score": 22
},
{
"name": "Specific text targeted",
"description": "The fix targets the specific problematic sentence(s), not a broad rewrite of the entire Pre-Review Checklist section",
"max_score": 18
},
{
"name": "No compensating additions",
"description": "The updated SKILL.md does NOT add a new reinforcing instruction (e.g., 'Note: never skip tests even for trivial changes') alongside or instead of removing the contradictory clause",
"max_score": 10
},
{
"name": "Other sections preserved",
"description": "The Submitting for Review and Responding to Feedback sections are not modified in the updated SKILL.md",
"max_score": 7
},
{
"name": "Pre-review list intact",
"description": "The numbered list structure of the Pre-Review Checklist is maintained in the updated file — the fix does not replace the entire checklist with prose",
"max_score": 3
}
]
}
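
The `max_score` values in the checklist above sum to 100, so each criterion's weight can be read directly as a percentage of the total. The sketch below shows one plausible way a `weighted_checklist` result could be aggregated. The scoring rule (sum of awarded points over sum of `max_score`, with each award clamped to its maximum) and the helper names `total_max_score` and `weighted_percentage` are assumptions made for illustration, not the evaluator's documented behavior.

```python
# Names and max_score values copied from the checklist above; descriptions omitted.
CHECKLIST = [
    {"name": "Contradicting clause identified", "max_score": 22},
    {"name": "Contradiction mechanism explained", "max_score": 18},
    {"name": "Remove/clarify approach taken", "max_score": 22},
    {"name": "Specific text targeted", "max_score": 18},
    {"name": "No compensating additions", "max_score": 10},
    {"name": "Other sections preserved", "max_score": 7},
    {"name": "Pre-review list intact", "max_score": 3},
]


def total_max_score(checklist):
    """Return the sum of all max_score values in the checklist."""
    return sum(item["max_score"] for item in checklist)


def weighted_percentage(checklist, awarded):
    """Return the percentage score for a dict mapping criterion name to points.

    Assumed rule (hypothetical): missing criteria score 0, and each award is
    clamped to the range [0, max_score] before summing.
    """
    earned = 0
    for item in checklist:
        points = awarded.get(item["name"], 0)
        earned += max(0, min(points, item["max_score"]))
    return 100.0 * earned / total_max_score(checklist)


if __name__ == "__main__":
    # Illustrative awards only, not real eval output.
    example = {
        "Contradicting clause identified": 22,
        "Remove/clarify approach taken": 22,
    }
    print(total_max_score(CHECKLIST))
    print(weighted_percentage(CHECKLIST, example))
```

Under this assumed rule, the two removal-focused criteria ("Contradicting clause identified" and "Remove/clarify approach taken") together account for 44 of the 100 points, which matches the scenario's emphasis on finding and removing the contradictory clause instead of adding compensating text.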