Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated
Does it follow best practices?
Evaluation — 100%
↑ 1.02xAgent success when using this tile
Validation for skill structure
{
"name": "experiments/eval-improve",
"version": "0.4.0",
"summary": "Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated",
"private": false,
"skills": {
"eval-improve": {
"path": "skills/eval-improve/SKILL.md"
}
}
}