Analyze eval results, diagnose low-scoring criteria, fix tile content, and re-run evals: the full improvement loop, automated.
94
Quality: 89% (Does it follow best practices?)
Impact: 98% (1.30x average score across 7 eval scenarios)
Passed: No known issues
You have eval results for a tile. The user wants you to analyze them.
The user says: "My eval scores look bad. Can you look at the latest results and tell me what needs to be fixed and in what priority order?"
Analyze the eval results by running the appropriate commands, classify each criterion into the correct bucket, and present a clear summary to the user.
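As a rough illustration of the classification and prioritization step, the sketch below sorts criteria by score and assigns each one a bucket. The bucket names, the thresholds, and the `Criterion` type are all hypothetical; the tile's actual buckets and the shape of its eval output are not given here and may differ.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the tile's real bucket definitions may differ.
LOW_THRESHOLD = 0.5
HIGH_THRESHOLD = 0.8


@dataclass
class Criterion:
    """One eval criterion (assumed shape, not the real eval schema)."""
    name: str
    score: float  # fraction of available points earned, 0.0 to 1.0


def classify(score: float) -> str:
    """Map a score to an illustrative bucket name."""
    if score < LOW_THRESHOLD:
        return "fix-first"
    if score < HIGH_THRESHOLD:
        return "needs-work"
    return "passing"


def prioritize(criteria: list[Criterion]) -> list[tuple[str, str, float]]:
    """Return (bucket, name, score) tuples, lowest score first."""
    ranked = sorted(criteria, key=lambda c: c.score)
    return [(classify(c.score), c.name, c.score) for c in ranked]


if __name__ == "__main__":
    # Made-up sample data, only to show the output shape.
    sample = [
        Criterion("uses-correct-command", 0.9),
        Criterion("explains-priority", 0.2),
        Criterion("summarizes-clearly", 0.6),
    ]
    for bucket, name, score in prioritize(sample):
        print(f"{bucket:10s} {name:24s} {score:.0%}")
```

Sorting lowest score first puts the items most in need of attention at the top of the summary, which matches the user's request for a priority order.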