Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals — the full improvement loop automated
Does it follow best practices?
Evaluation — 100%
↑ 1.02x agent success when using this tile
Validation for skill structure
Your team's code review tile has been performing well for months. Last week, a teammate added some "helpful flexibility" to the tile based on developer feedback — they wanted agents to be less rigid about certain steps. Shortly after the update was committed, the eval score for the "pre-review checklist" scenario dropped from 8/10 to 3/10. The teammate's changes seemed reasonable in isolation, but something is clearly wrong.
Your job is to figure out which of the recent tile changes is causing agents to underperform on this scenario, and to propose a targeted fix. The eval criterion that is failing checks that agents "always run the full test suite before submitting code for review."
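One way to confirm where the regression lives is to diff per-scenario eval scores from before and after the tile change. The sketch below is illustrative only: the score dictionaries, the `find_regressions` helper, and the drop threshold are all hypothetical stand-ins for whatever format your eval harness actually emits.

```python
# Hedged sketch: locate scenarios whose eval score dropped sharply after a
# tile change. Score data and threshold are hypothetical, not a real eval API.

def find_regressions(before, after, threshold=2):
    """Return scenarios whose score dropped by at least `threshold` points."""
    regressions = {}
    for scenario, old_score in before.items():
        new_score = after.get(scenario)
        if new_score is not None and old_score - new_score >= threshold:
            regressions[scenario] = (old_score, new_score)
    return regressions

# Example scores mirroring the scenario described above (hypothetical data).
before = {"pre-review checklist": 8, "comment etiquette": 9}
after = {"pre-review checklist": 3, "comment etiquette": 9}

print(find_regressions(before, after))
# → {'pre-review checklist': (8, 3)}
```

Narrowing the diff to one scenario first keeps the analysis focused: you only need to explain the tile edits that plausibly affect that scenario's criterion, not every change in the commit.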
Produce a file called regression_analysis.md that:
Then apply the fix directly to inputs/SKILL.md.
The following files are provided as inputs. Extract them before beginning.
Before submitting any code for review, complete the following steps in order:
Run the linter (`lint --fix` if available). For straightforward changes, you may skip the linter step if the CI pipeline runs it automatically. If the changes are documentation-only or clearly trivial (typo fixes, comment updates), you may skip the test run at your discretion; the reviewer can always request tests if they feel it's needed.
Address all reviewer comments before merging. For each comment:
Do not merge until all conversations are resolved.