General-purpose coding policy for Baruch's AI agents
95
91%
Does it follow best practices?
Impact
96%
1.31x
Average score across 10 eval scenarios
Advisory
Suggest reviewing before use
Creates a feature branch
100%
100%
Populates .github/workflows with both source + lock pairs
73%
100%
Commits both sources and both locks
100%
100%
Pushes and opens a PR
100%
100%
PR body lists OPENAI_API_KEY
100%
100%
PR body lists ANTHROPIC_API_KEY
100%
100%
PR body lists TESSL_TOKEN
100%
100%
Does not merge
100%
100%
Does not bypass pre-commit hooks
100%
100%
Explains the cross-family reviewer rationale
100%
100%
Identifies the missing dependency
32%
100%
Stops before making changes
100%
100%
Provides the install command
40%
100%
Explains why gh-aw is needed
80%
100%
Invites re-invocation
100%
100%
Detects existing workflow
100%
100%
Refuses to overwrite
100%
100%
No downstream actions after refusal
100%
100%
Explains why the guard matters
100%
100%
Offers an actionable next step
100%
100%
Preserves existing file
100%
100%
Identifies at least two uncovered decision branches
100%
100%
Writes new scenario directories
100%
100%
Criteria files follow the weighted_checklist format prescribed by the tile
100%
100%
Criteria weights sum to 100 and are not equally distributed
100%
100%
New task.md files pass the no-bleeding check
66%
100%
New criteria don't leak tile internals
100%
100%
Failure descriptions are specific
100%
100%
At least one new scenario exercises a negative case
100%
100%
Coverage analysis justifies each gap
100%
100%
scenario-a bleeding detected
0%
100%
scenario-a bleeding fixed
0%
100%
scenario-a leaking detected
100%
100%
scenario-a leaking fixed
100%
100%
scenario-b vague messages detected
100%
100%
scenario-b vague messages fixed
100%
100%
scenario-b misaligned criteria detected
100%
100%
scenario-b misaligned criteria fixed
100%
100%
scenario-c deleted
100%
100%
audit report produced
100%
100%
Uses `gh pr checks` with structured output
0%
100%
Uses `gh api .../pulls/<N>/reviews` for review state
100%
100%
Uses `gh api .../pulls/<N>/comments` for inline comments
100%
100%
Does NOT use `/issues/<N>/comments`
100%
100%
Retrieves per-reviewer state distinctly
100%
100%
No hardcoded PR, owner, or repo in the script body
100%
100%
Waits for CI to finish before surfacing state
100%
100%
Surfaces CI state in the summary
100%
100%
Surfaces review states in the summary
100%
100%
Surfaces inline comment content or count
100%
100%
Uses GraphQL `requestReviews` mutation
0%
100%
Inline comment explains why REST doesn't work
0%
100%
Pinned bot ID with fallback to dynamic discovery
0%
100%
Resolves the PR's GraphQL node ID
0%
100%
Verifies the review request was registered
100%
100%
Feature-branch guard
0%
100%
PR title follows conventional-commits format
0%
100%
PR body structure
0%
100%
Pre-push readiness checks
0%
33%
No hardcoded inputs in the script body
100%
100%
Merge strategy uses `gh pr merge --merge`
0%
100%
Merge includes `--delete-branch`
100%
100%
Fast-forward-only pull after merge
100%
100%
Safe local-branch delete with `git branch -d`
25%
100%
Stale remote-tracking refs pruned
100%
100%
Pre-merge CI gate
20%
100%
Pre-merge review gate
0%
100%
Verifies merge landed on main
75%
62%
Publish CI verification
80%
100%
Final summary includes merged PR URL
100%
71%
Graceful failure on unmet preconditions
40%
100%
No hardcoded PR, owner, or repo in the script body
100%
100%
CI failure: fix required
100%
100%
Applies the reasonable suggestion
100%
100%
Declines the over-engineered suggestion
100%
100%
All three threads get replies
100%
100%
Accept reply uses the `Fixed in <sha>` format
0%
0%
Decline reply uses the `Declining — <reason>` format
0%
46%
Decline reply cites a verifiable reference
100%
100%
Fixes pushed to the same branch
100%
100%
No dangling threads before merge
100%
60%
Patch: no manual manifest update
0%
100%
Patch: explains CI auto-bump
0%
100%
Minor: manifest bumped to `1.5.0`
100%
100%
Major: manifest bumped to `2.0.0`
100%
100%
Major: flags breaking-change impact for downstream
100%
100%
Release sequencing: patch first, major last
100%
100%
Readiness gate: tests + linter
50%
100%
Runbook covers all three changes separately
100%
100%