General-purpose coding policy for Baruch's AI agents
90
91%
Does it follow best practices?
Impact
90%
1.30x
Average score across 18 eval scenarios
Advisory
Suggest reviewing before use
scenario-a bleeding detected
0%
100%
scenario-a bleeding fixed
0%
100%
scenario-a leaking detected
100%
100%
scenario-a leaking fixed
100%
100%
scenario-b vague messages detected
100%
100%
scenario-b vague messages fixed
100%
100%
scenario-b misaligned criteria detected
100%
50%
scenario-b misaligned criteria fixed
100%
0%
scenario-c deleted
100%
100%
audit report produced
100%
100%
Uses GraphQL `requestReviews` mutation
0%
100%
Inline comment explains why REST doesn't work
25%
100%
Pinned bot ID with fallback to dynamic discovery
25%
100%
Resolves the PR's GraphQL node ID
0%
100%
Verifies the review request was registered
85%
100%
Feature-branch guard
40%
100%
PR title follows conventional-commits format
12%
100%
PR body structure
57%
100%
Pre-push readiness checks
0%
0%
No hardcoded inputs in the script body
100%
100%
CI failure: fix required
100%
100%
Applies the reasonable suggestion
100%
100%
Declines the over-engineered suggestion
100%
100%
All three threads get replies
100%
100%
Accept reply uses the `Fixed in <sha>` format
0%
100%
Decline reply uses the `Declining — <reason>` format
66%
100%
Decline reply cites a verifiable reference
100%
100%
Fixes pushed to the same branch
100%
100%
No dangling threads before merge
100%
100%
Patch: no manual manifest update
0%
0%
Patch: explains CI auto-bump
0%
0%
Minor: manifest bumped to `1.5.0`
100%
100%
Major: manifest bumped to `2.0.0`
100%
100%
Major: flags breaking-change impact for downstream
100%
100%
Release sequencing: patch first, major last
100%
100%
Readiness gate: tests + linter
50%
50%
Runbook covers all three changes separately
100%
100%
names canonical cause
50%
100%
prescribes rewrite-criteria
100%
100%
rejects fix-task and retire
80%
85%
replacement criteria are tile-specific
100%
100%
names canonical cause
100%
100%
prescribes fix-task
100%
100%
preserves the criterion
100%
100%
task rewrite strips technique, keeps situation
100%
100%
names canonical cause
34%
100%
prescribes retire
0%
100%
reasoning cites baseline equivalence
90%
100%
no spurious fix-task or rewrite-criteria
0%
100%
Uses `gh pr checks` with structured output
0%
100%
Uses `gh api .../pulls/<N>/reviews` for review state
100%
100%
Uses `gh api .../pulls/<N>/comments` for inline comments
100%
100%
Does NOT use `/issues/<N>/comments`
50%
100%
Retrieves per-reviewer state distinctly
100%
100%
No hardcoded PR, owner, or repo in the script body
100%
100%
Waits for CI to finish before surfacing state
100%
100%
Surfaces CI state in the summary
50%
100%
Surfaces review states in the summary
100%
100%
Surfaces inline comment content or count
100%
100%
Surfaces merge-readiness state for conflict diagnosis
0%
100%
Merge strategy uses `gh pr merge --merge`
100%
87%
Merge includes `--delete-branch`
0%
100%
Fast-forward-only pull after merge
100%
100%
Safe local-branch delete with `git branch -d`
100%
100%
Stale remote-tracking refs pruned
100%
100%
Pre-merge CI gate
60%
100%
Pre-merge review gate
33%
100%
Verifies merge landed on main
60%
100%
Pre-merge registry baseline captured
0%
100%
SHA-bound publish-run resolution
0%
100%
Watch publish run to terminal state
0%
100%
Conjunction check: run-success AND registry-advanced
0%
100%
Final summary includes merged PR URL
100%
100%
Graceful failure on unmet preconditions
66%
100%
No hardcoded PR, owner, or repo in the script body
100%
100%
Identifies at least two uncovered decision branches
100%
100%
Writes new scenario directories
100%
100%
Criteria files follow the weighted_checklist format prescribed by the tile
100%
100%
Criteria weights sum to 100 and are not equally distributed
100%
100%
New task.md files pass the no-bleeding check
100%
100%
New criteria don't leak tile internals
100%
100%
Failure descriptions are specific
100%
100%
At least one new scenario exercises a negative case
100%
100%
Coverage analysis justifies each gap
100%
100%
Diagnoses why the fork PR is not reviewed
100%
100%
Brings the branch into the base repo
100%
100%
Preserves the contributor's commits unchanged
100%
100%
Opens a same-repo PR from the adopted branch
100%
100%
Leaves the original fork PR open
100%
100%
Links the adopted PR back to the original
100%
100%
Does not fabricate an Author-Model declaration
100%
100%
Identifies the PR as originating in the repository itself
0%
100%
Recognizes the reviewer already covers it
40%
100%
Creates no branch and pushes nothing
100%
100%
Opens no duplicate PR
100%
100%
Reports the PR's status
100%
100%
Creates a feature branch
100%
100%
Plan populates .github/workflows with both source + lock pairs
100%
100%
Commits both sources and both locks
100%
100%
Pushes and opens a PR
100%
100%
PR body lists OPENAI_API_KEY
100%
100%
PR body lists ANTHROPIC_API_KEY
100%
100%
PR body lists TESSL_TOKEN
100%
100%
Does not merge
100%
100%
Does not bypass pre-commit hooks
100%
100%
Explains the cross-family reviewer rationale
100%
100%
Rule file frontmatter declares alwaysApply: false
0%
100%
Rule file frontmatter declares applyTo with glob patterns
100%
100%
applyTo value combines globs with a natural-language clause
0%
100%
plugin.json rules array includes the new rule path
100%
100%
Rule body has H1 title matching the filename concept
100%
100%
Existing rules and manifest entries are preserved unchanged
100%
100%
Rule file frontmatter flipped to alwaysApply: false
0%
100%
Rule file frontmatter gains applyTo with glob patterns
100%
100%
applyTo value combines globs with a natural-language clause
0%
100%
plugin.json carries no per-rule config and its rules array is intact
100%
100%
Rule body content is preserved unchanged
100%
100%
Existing rule (commit-conventions) is preserved unchanged
100%
100%
Rule file frontmatter declares alwaysApply: true
100%
100%
Rule file frontmatter declares no scoping fields
100%
100%
plugin.json rules array includes the new rule path
100%
100%
Rule body covers the stdlib-first practice
100%
100%
Rule body covers the dependency-pinning practice
100%
100%
Existing rules and manifest entries are preserved unchanged
100%
100%
Detects existing workflow
100%
100%
Refuses to overwrite
48%
80%
No downstream actions after refusal
26%
20%
Explains why the guard matters
100%
100%
Offers an actionable next step
100%
100%
Preserves existing file
50%
50%
identifies suite as clean
0%
0%
does not fabricate diagnoses
0%
0%
recognizes negative-case acceptability
0%
0%
output is appropriately minimal
20%
0%