Generate eval scenarios from repo commits, configure multi-agent runs, execute baseline + with-context evals, and compare results — the full setup pipeline before improvement begins
94
94%
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Advisory
Suggest reviewing before use
Loading evals