Review existing code, diffs, branches, or pull requests by spawning mandatory concern-specific reviewer subagents, then synthesize a ship-it / needs-review / blocked verdict.
92
97%
Does it follow best practices?
Impact
81%
1.22x Average score across 4 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent produces an evidence-backed review report with the correct verdict shape, ordered findings, explicit unverified markers, and file references with line numbers, and whether it incorporates repo guidance from CLAUDE.md. Also tests that the mandatory default reviewer gang is spawned and that the silent-failures and cleanup personas are applied.",
"type": "weighted_checklist",
"checklist": [
{
"name": "CLAUDE.md loaded",
"description": "The report references or applies at least one rule from CLAUDE.md (e.g. MAX_RETRIES constant, dead-letter requirement, logging with payment ID, error classification)",
"max_score": 10
},
{
"name": "File references in findings",
"description": "At least two findings cite a specific file name (e.g. retry.ts, retry.test.ts) rather than only vague descriptions",
"max_score": 8
},
{
"name": "Line-level evidence",
"description": "At least one finding references a specific line number or code snippet within a file",
"max_score": 8
},
{
"name": "Verdict present",
"description": "The report contains exactly one verdict label: 'ship it', 'needs review', or 'blocked'",
"max_score": 8
},
{
"name": "Findings by severity",
"description": "Findings are explicitly labeled or ordered by severity (e.g. high/medium/low, or ranked from most to least critical)",
"max_score": 8
},
{
"name": "Error classification finding",
"description": "The report identifies that transient vs permanent payment errors are not classified before retrying (all errors lead to PENDING_RETRY), which violates the CLAUDE.md rule",
"max_score": 10
},
{
"name": "Silent failure finding",
"description": "The report identifies that the catch block in scheduleRetry only logs a vague warning without the error message or type, losing failure signal",
"max_score": 8
},
{
"name": "Mock-heavy test concern",
"description": "The report notes that the tests mock the database, charge provider, and logger, so the tests would pass even if real integrations broke",
"max_score": 6
},
{
"name": "Dead code identified",
"description": "The report identifies retry.old.ts as dead/deprecated code that should be removed",
"max_score": 6
},
{
"name": "Unverified surfaces marked",
"description": "The report explicitly marks at least one area as unverified (e.g. actual runtime behavior, payment provider behavior, monitoring integration)",
"max_score": 8
},
{
"name": "Recommended follow-up",
"description": "The report recommends a specific follow-up (such as implementation, runtime verification, readiness setup, or documentation cleanup) or explicitly states that none is needed",
"max_score": 6
},
{
"name": "No nit inflation",
"description": "The report does NOT elevate purely stylistic issues (naming conventions, formatting) to the same severity as functional defects",
"max_score": 6
},
{
"name": "Reviewer gang listed",
"description": "The report shows the spawned default reviewer gang in compact metadata or evidence; it does not force persona details into the verdict footer",
"max_score": 8
},
{
"name": "Compact verdict block",
"description": "After detailed findings, the final verdict footer is compact: no more than 4 labeled lines, no duplicated finding paragraphs, no scope/persona noise unless needed to resolve ambiguity, and command evidence is summarized by command or surface name rather than pasted logs",
"max_score": 8
}
]
}
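The rubric above is a "weighted_checklist": 14 items whose max_score values sum to 108. The page does not say how item scores are combined, so the following is only a minimal sketch, assuming the overall result is the awarded points divided by the total weight. The function name score_checklist, the clamping rule, and the awarded points in the example are all hypothetical.

```python
from typing import Mapping


def score_checklist(rubric: Mapping, awarded: Mapping[str, float]) -> float:
    """Return the awarded share of the rubric's total weight, as a percentage.

    Assumption: items combine as a plain weighted sum. The source does not
    document the real aggregation rule.
    """
    items = rubric["checklist"]
    total = sum(item["max_score"] for item in items)
    if total <= 0:
        raise ValueError("rubric has no positive max_score weights")

    earned = 0.0
    for item in items:
        points = awarded.get(item["name"], 0)  # an ungraded item earns nothing
        # Clamp so no item contributes less than 0 or more than its weight.
        earned += min(max(points, 0), item["max_score"])
    return 100.0 * earned / total


# Two-item excerpt of the rubric above; the awarded points are made up.
rubric = {
    "type": "weighted_checklist",
    "checklist": [
        {"name": "CLAUDE.md loaded", "max_score": 10},
        {"name": "Dead code identified", "max_score": 6},
    ],
}
awarded = {"CLAUDE.md loaded": 10, "Dead code identified": 2}
print(f"{score_checklist(rubric, awarded):.1f}%")  # prints 75.0%
```

Clamping each item to its own max_score keeps a single over-scored check from masking a missed one, which matters here because the two heaviest items (10 points each) are the CLAUDE.md checks.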