General-purpose coding policy for Baruch's AI agents
95
91%
Does it follow best practices?
Impact
96%
1.31x
Average score across 10 eval scenarios
Advisory
Suggest reviewing before use
{
"context": "Tests whether the agent applies this tile's prescribed release sequencing and version-bump policy. The tile prescribes: patch bumps are delegated to the CI pipeline (`tesslio/patch-version-publish` auto-bumps the patch segment on every merge) so manual manifest edits for patch changes are wrong; minor and major bumps require manual manifest updates to specific target numbers; breaking changes need explicit impact surfacing in the runbook; PRs should be sequenced patch → minor → major for risk management. Baseline agents typically bump all three manually and do not sequence the releases.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Patch: no manual manifest update",
"description": "For the bug-fix change (null-pointer guard, backward-compatible, no API change), the runbook does NOT prescribe a manual manifest version bump. Baseline agents typically bump it manually; the tile prescribes delegating the patch bump to CI.",
"max_score": 12
},
{
"name": "Patch: explains CI auto-bump",
"description": "The runbook explains WHY the patch case skips a manual bump: the CI pipeline (`tesslio/patch-version-publish` or equivalent) auto-bumps the patch segment on every merge. Naming the specific automation or its role is tile-prescribed knowledge; a baseline agent without the tile would not know this wiring exists.",
"max_score": 10
},
{
"name": "Minor: manifest bumped to `1.5.0`",
"description": "For the CSV-export change (additive new route, backward-compatible), the runbook updates the manifest version to `1.5.0` (not `1.4.3`, not `2.0.0`). Scores both correct classification (minor) and correct arithmetic.",
"max_score": 12
},
{
"name": "Major: manifest bumped to `2.0.0`",
"description": "For the v1-routes removal (breaking change), the runbook updates the manifest version to `2.0.0` (not `1.4.3`, not `1.5.0`). Scores both correct classification (major) and correct arithmetic.",
"max_score": 12
},
{
"name": "Major: flags breaking-change impact for downstream",
"description": "The runbook calls out that the v1-routes removal will break existing clients AND surfaces migration guidance in a place a downstream consumer would see (release notes, migration guide, or a deployment checklist). An agent that treats the major bump the same way as the minor bump (no impact statement, no migration guidance) scores zero.",
"max_score": 12
},
{
"name": "Release sequencing: patch first, major last",
"description": "The runbook orders the three PRs patch → minor → major with explicit rationale (patch is low-risk and gets CI's auto-bump first; major gives downstream the longest notice and comes last). Any coherent risk-based sequencing with stated reasoning passes; no sequencing, or a reversed order without justification, scores zero. Baseline agents typically do not sequence.",
"max_score": 12
},
{
"name": "Readiness gate: tests + linter",
"description": "The runbook requires tests AND linter to pass before creating any of the three PRs. Scored once as a cross-cutting gate, not duplicated per change",
"max_score": 10
},
{
"name": "Runbook covers all three changes separately",
"description": "The `RELEASE_RUNBOOK.md` has a distinct section for each of the three pending changes, each answering the readiness / versioning / sequencing questions in the task's output spec. A single conflated section scores zero. A runbook with only two of the three sections scores two-thirds.",
"max_score": 20
}
]
}
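
The version-bump and sequencing policy that this checklist scores can be sketched in a few lines of Python. This is an illustrative sketch, not the tile's implementation: the function names `manual_target` and `sequence` are hypothetical, and the starting version `1.4.2` is an assumption inferred from the checklist's target numbers (`1.4.3`, `1.5.0`, `2.0.0`), not something the rubric states.

```python
from typing import List, Optional, Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def manual_target(current: str, change: str) -> Optional[str]:
    """Return the version to write into the manifest by hand.

    Returns None for patch changes, because the rubric says the CI
    pipeline auto-bumps the patch segment on merge (hypothetical helper).
    """
    major, minor, _patch = parse(current)
    if change == "patch":
        return None  # delegated to CI; no manual manifest edit
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "major":
        return f"{major + 1}.0.0"
    raise ValueError(f"unknown change type: {change!r}")


# Lower number = lower risk = released earlier.
RISK_ORDER = {"patch": 0, "minor": 1, "major": 2}


def sequence(changes: List[str]) -> List[str]:
    """Order pending changes patch -> minor -> major."""
    return sorted(changes, key=lambda change: RISK_ORDER[change])


if __name__ == "__main__":
    current = "1.4.2"  # assumed starting version, inferred from the rubric
    for change in sequence(["major", "patch", "minor"]):
        print(change, "->", manual_target(current, change) or "CI auto-bump")
```

Under the assumed starting version, the minor and major targets match the `1.5.0` and `2.0.0` values the checklist expects, and the patch case produces no manual target at all.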