Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.
94
Quality: 89% (Does it follow best practices?)
Impact: 98% (1.30x average score across 7 eval scenarios)
Passed: No known issues
Analysis of eval results has identified two failing criteria:
The user says: "Go ahead and fix these. Show me what you're going to change before you make the edits."
Fix both issues, commit the changes, and re-run the evals.
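The "show me what you're going to change" step can be sketched as a diff preview that is generated before any file is written. This is a minimal illustration using Python's standard `difflib`, not the tile's actual mechanism; the path `docs/example.md` and the before/after text are hypothetical placeholders.

```python
import difflib


def preview_change(path: str, before: str, after: str) -> str:
    """Return a unified diff of the proposed edit to `path` without writing anything."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical tile content, for illustration only.
before = "Run the evals.\nReport the scores.\n"
after = "Run the evals.\nReport the scores per criterion.\n"

preview = preview_change("docs/example.md", before, after)
print(preview)  # show the user the planned change before applying it
```

An empty preview means the proposed text is identical to the current text, so there would be nothing to commit or re-evaluate.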