General-purpose coding policy for Baruch's AI agents
95
91%
Does it follow best practices?
Impact
96%
1.31xAverage score across 10 eval scenarios
Advisory
Suggest reviewing before use
{
"context": "Tests whether the script follows this tile's prescribed mechanisms for polling CI status and scraping reviews. The tile prescribes a specific trio — `gh pr checks <N> --json name,bucket` for CI (structured output the caller parses in a polling loop), `gh api repos/<o>/<r>/pulls/<N>/reviews` for review state, `gh api repos/<o>/<r>/pulls/<N>/comments` (pull-request review comments, NOT `/issues/<N>/comments`) for inline comments. Baseline agents typically pick a different combination: ad-hoc greps of `gh pr view` text output, `gh run list` polling, or the issue-comments endpoint (which returns wrong data). The specific-mechanism criteria below measure whether the tile was applied.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Uses `gh pr checks` with structured output",
"description": "The script retrieves CI state by invoking `gh pr checks <N>` with `--json` (the prescribed mechanism returns structured `name`/`bucket` fields the caller can parse into a terminal state). `gh run list`, ad-hoc greps of `gh pr view` text output, or UI polling score zero — those are different approaches that do not give the caller a structured terminal-state signal",
"max_score": 15
},
{
"name": "Uses `gh api .../pulls/<N>/reviews` for review state",
"description": "The script hits the pull-request reviews endpoint (`gh api repos/<owner>/<repo>/pulls/<N>/reviews`) to retrieve per-reviewer state. Other endpoints (`gh pr view`, `gh api` on a different path) score zero — the tile prescribes this endpoint specifically because it returns the per-review structure needed for multi-bot setups",
"max_score": 15
},
{
"name": "Uses `gh api .../pulls/<N>/comments` for inline comments",
"description": "The script retrieves inline review comments via the pull-request review-comments endpoint (`gh api repos/<owner>/<repo>/pulls/<N>/comments`). This is a tile-prescribed endpoint choice that avoids the issue-comments trap",
"max_score": 15
},
{
"name": "Does NOT use `/issues/<N>/comments`",
"description": "The script must NOT use the issue-comments endpoint (`gh api repos/<o>/<r>/issues/<N>/comments`) for inline review comments. That endpoint returns a different object shape (PR-level comments, not inline review comments) and surfaces wrong data. A script that conflates the two scores zero here — this is a known GitHub-API gotcha the tile exists to prevent",
"max_score": 10
},
{
"name": "Retrieves per-reviewer state distinctly",
"description": "The script captures the latest review state for EACH reviewer distinctly, not a single aggregate. Multi-bot setups (Copilot + gh-aw) each have their own state, and the tile prescribes exposing them separately so the developer sees which bot is gating the merge",
"max_score": 10
},
{
"name": "No hardcoded PR, owner, or repo in the script body",
"description": "OBSERVABLE: the script body contains no literal for the target PR, owner, or repo — all three are pulled from runtime inputs (args, env vars). Running against a new target requires changing only the invocation, not the script source",
"max_score": 10
},
{
"name": "Waits for CI to finish before surfacing state",
"description": "The script (or the prescribed caller loop around it) polls until CI is in a terminal state before declaring a final verdict. Surfacing mid-run `pending` as the summary outcome scores zero — any reasonable wait mechanism (the caller looping on the script's output, `gh run watch` on the CI run, etc.) is acceptable",
"max_score": 4
},
{
"name": "Surfaces CI state in the summary",
"description": "The final summary prints a terminal CI verdict (success / failure / cancelled / timed-out), parsed from `gh pr checks` JSON output. `pending` is not a valid final verdict",
"max_score": 4
},
{
"name": "Surfaces review states in the summary",
"description": "The final summary prints each reviewer's latest state — not just raw JSON the developer has to parse",
"max_score": 8
},
{
"name": "Surfaces inline comment content or count",
"description": "The final summary includes inline comment bodies (for low counts) or a count with a pointer to read them, so the developer can tell whether there is unaddressed feedback",
"max_score": 9
}
]
}