Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins
Does it follow best practices? 90%
Impact: 91% (3.37x)
Average score across 2 eval scenarios
Advisory
Suggest reviewing before use
checks_prerequisites            50%    100%
browses_commits                 0%     16%
auto_detects_context_files      0%     100%
uses_context_flag               50%    100%
workspace_in_eval_run           0%     100%
explains_baseline_vs_context    100%   100%
does_not_use_last_only          0%     100%
finds_generation_ids            75%    100%
downloads_each_separately       33%    100%
explains_why                    0%     75%
Table of Contents