General-purpose coding policy for Baruch's AI agents
91
92%
Does it follow best practices?
Impact
91%
1.24x
Average score across 9 eval scenarios
Advisory
Suggest reviewing before use
Merge flags: --merge | 0% | 100%
Merge flags: --delete-branch | 100% | 100%
Fast-forward only pull | 0% | 100%
Checkout main first | 100% | 100%
Local branch deletion | 50% | 100%
Remote ref pruning | 100% | 100%
Verify merge on main | 100% | 100%
Publish CI check | 100% | 100%
Report merged PR URL | 100% | 100%
Pre-merge: CI green gate | 100% | 100%
Patch: no manifest update | 0% | 100%
Patch: automation mentioned | 0% | 100%
Minor: manifest updated | 100% | 100%
Major: manifest updated | 100% | 100%
Readiness: tests | 100% | 100%
Readiness: linter | 0% | 100%
PR title convention | 50% | 100%
Accepted reply format | 50% | 100%
Declined reply format | 0% | 100%
All threads replied | 80% | 100%
CI watch command | 0% | 0%
Review state API call | 50% | 0%
Inline comments API call | 50% | 0%
PR number parameterized | 100% | 100%
Review state surfaced | 100% | 100%
Inline comments surfaced | 100% | 100%
scenario-a bleeding detected | 0% | 100%
scenario-a bleeding fixed | 0% | 100%
scenario-a leaking detected | 100% | 100%
scenario-a leaking fixed | 100% | 100%
scenario-b vague messages detected | 100% | 100%
scenario-b vague messages fixed | 100% | 100%
scenario-b misaligned criteria detected | 100% | 100%
scenario-b misaligned criteria fixed | 100% | 100%
scenario-c deleted | 100% | 100%
audit report produced | 100% | 100%
CI failure: fix required | 100% | 100%
Accepted reply format | 33% | 100%
Declined reply format | 0% | 100%
All threads replied | 100% | 100%
Push to same branch | 100% | 100%
Apply reasonable suggestion | 100% | 100%
Decline over-engineered suggestion | 100% | 100%
No dangling threads | 100% | 80%
Refuses to merge | 100% | 100%
No skip-CI suggestion | 100% | 100%
Addresses the test failure | 100% | 100%
Addresses the lint warnings | 100% | 100%
Time pressure acknowledged but not accepted | 100% | 100%
Actionable path to green | 100% | 100%
Detects main branch | 100% | 100%
Explains why it's blocked | 100% | 100%
Provides recovery steps | 100% | 100%
Does not create PR from main | 100% | 100%
Preserves the commit | 100% | 100%
Identifies missing production-without-approval case | 100% | 100%
Identifies missing unhealthy-rollback case | 100% | 100%
Writes production-rejection scenario | 100% | 100%
Writes rollback scenario | 100% | 100%
New scenarios have correct structure | 50% | 70%
No bleeding in new scenarios | 80% | 100%
New criteria have meaningful descriptions | 100% | 100%
Coverage analysis explains why gaps matter | 100% | 100%
GraphQL mutation used | 0% | 100%
Correct Copilot bot ID | 0% | 46%
PR node ID retrieval | 0% | 100%
Bot ID fallback included | 20% | 100%
Review request verification | 87% | 100%
Feature branch guard | 30% | 100%
PR title format | 0% | 100%
PR body Summary section | 50% | 100%
PR body Test plan section | 100% | 100%
Pre-push readiness | 100% | 100%