General-purpose coding policy for Baruch's AI agents
Score: 95
Follows best practices: 91%
Impact: 96%
Average score across 10 eval scenarios: 1.31x
Advisory: suggest reviewing before use
{
  "context": "Tests whether the agent applies this tile's prescribed review-reply conventions — not whether a baseline agent reasons to some equivalent reply. The tile invents specific short phrases (`Fixed in <sha>` for accepts, `Declining — <reason>` for declines) so review threads scan consistently across a team. Baseline agents would produce ad-hoc replies (`done`, `no thanks`, `I disagree`). The specific-format criteria below measure whether the tile's convention was applied; the substance criteria measure whether the reply has the content the convention is meant to carry.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "CI failure: fix required",
      "description": "For the CI failure (failing unit test), the guide states the test must be fixed. Does NOT suggest skipping, ignoring, deferring, or `[skip ci]`",
      "max_score": 10
    },
    {
      "name": "Applies the reasonable suggestion",
      "description": "For the clearly-correct suggestion (simplify the nested conditional), the guide says to apply it — does NOT suggest declining or deferring",
      "max_score": 10
    },
    {
      "name": "Declines the over-engineered suggestion",
      "description": "For the over-engineered suggestion (unnecessary abstraction for current scope), the guide says to push back — does NOT say to blindly implement it",
      "max_score": 10
    },
    {
      "name": "All three threads get replies",
      "description": "The guide produces a reply for EVERY thread (CI failure, accept, decline) — not two of three. An empty reply or 'no response needed' on any thread scores zero",
      "max_score": 12
    },
    {
      "name": "Accept reply uses the `Fixed in <sha>` format",
      "description": "The accept reply follows the tile's prescribed format — it begins with the literal phrase 'Fixed in' followed by a SHA (either a real commit hash or, per the task's placeholder instruction, the literal `<sha>` marker). A reply that says 'Done' or 'Resolved' or 'Applied the fix' scores zero even if semantically equivalent, because the tile exists to standardize the specific phrasing across the team",
      "max_score": 15
    },
    {
      "name": "Decline reply uses the `Declining — <reason>` format",
      "description": "The decline reply follows the tile's prescribed format — it begins with the literal word 'Declining', an em dash (—), then a reason (either prose or, per the task's placeholder instruction, the literal `<reason>` marker). A reply that says 'Disagree' or 'No thanks' or 'Not doing this' scores zero — the tile prescribes this specific opening precisely so declines are unambiguous in thread scans",
      "max_score": 15
    },
    {
      "name": "Decline reply cites a verifiable reference",
      "description": "The decline reply names at least one externally-verifiable reference for the rejection: a file:line, quoted spec/docs clause, linked prior PR/issue, or named scope constraint from the PR description. This complements the format check — the format alone is not enough; the tile also requires substance",
      "max_score": 8
    },
    {
      "name": "Fixes pushed to the same branch",
      "description": "The guide states any code changes go on the EXISTING feature branch of the PR, not a new branch. Opening a second PR for review fixes scores zero",
      "max_score": 10
    },
    {
      "name": "No dangling threads before merge",
      "description": "The guide explicitly states that EVERY thread must have a reply before the PR is merged — nothing left unresolved",
      "max_score": 10
    }
  ]
}
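The two format criteria above are mechanical enough to check with regexes. A minimal sketch of how a grader might classify a reply against the tile's conventions; the function name and patterns here are assumptions for illustration, not part of the rubric itself:

```python
import re

# Accept replies must open with the literal phrase "Fixed in" followed by
# either a hex commit hash or the literal "<sha>" placeholder (per the rubric).
ACCEPT_RE = re.compile(r"^Fixed in (?:[0-9a-f]{7,40}|<sha>)")

# Decline replies must open with "Declining", an em dash, then a reason
# (prose or the literal "<reason>" placeholder).
DECLINE_RE = re.compile(r"^Declining — (?:.+|<reason>)")

def classify_reply(reply: str) -> str:
    """Hypothetical helper: return which convention a review reply matches."""
    if ACCEPT_RE.match(reply):
        return "accept"
    if DECLINE_RE.match(reply):
        return "decline"
    return "ad-hoc"  # e.g. "done", "no thanks" — scores zero on the format criteria
```

Replies like `Fixed in 3f2a1bc` or `Declining — out of scope per PR description` pass, while semantically equivalent ad-hoc phrasings ("Done", "Disagree") do not, which is exactly the distinction the format criteria are scoring.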