Evidence-first pull request review with independent critique, selective challenger review, and human handoff.
Eval dashboard: best practices 93%, impact 87%, 1.31x average score across 43 eval scenarios.
Risk rating: Risky. Do not use without reviewing.
Independent critique of a pull request from a clean reviewer context.
Use this skill after pr-evidence-builder has produced an evidence pack. Allowed context:

- the reviewer-independence rule (e.g. .github/review-rules/reviewer-independence.md)
- the evidence pack from pr-evidence-builder (see skills/pr-evidence-builder.md)

Do NOT use any other context. Do not read authoring prompts, agent reasoning, or tool call history from the authoring session.
A valid evidence pack typically contains:
- risk_lane: green | yellow | red
- changed_files: list of file paths touched
- subsystems: list of subsystems affected
- hotspots: list of high-risk file/line areas flagged by verifiers
- verifier_output: structured results from automated checks (linting, tests, security scans)
- stated_intent: description of what the PR is supposed to do

If any of these fields are missing or empty, note the gap in your output but proceed with what is available. Do not block on an incomplete evidence pack; flag the limitation as a low-confidence contextual finding.
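The gap-flagging behavior described above can be sketched as follows. This is a minimal illustration, not part of the pipeline: the `EvidencePack` interface mirrors the field list, and the `flagPackGaps` helper name is hypothetical.

```typescript
// Hypothetical sketch: report missing or empty evidence-pack fields as
// low-confidence contextual notes instead of blocking the review.
type RiskLane = "green" | "yellow" | "red";

interface EvidencePack {
  risk_lane?: RiskLane;
  changed_files?: string[];
  subsystems?: string[];
  hotspots?: string[];
  verifier_output?: unknown[];
  stated_intent?: string;
}

// Returns one gap note per missing or empty field; an empty array
// means the pack is complete.
function flagPackGaps(pack: EvidencePack): string[] {
  const required: (keyof EvidencePack)[] = [
    "risk_lane", "changed_files", "subsystems",
    "hotspots", "verifier_output", "stated_intent",
  ];
  return required
    .filter((k) => {
      const v = pack[k];
      return v === undefined || v === null ||
        (Array.isArray(v) && v.length === 0) || v === "";
    })
    .map((k) => `low-confidence contextual finding: evidence pack field "${k}" is missing or empty`);
}
```

The review proceeds either way; the gap notes simply travel along with the real findings so downstream synthesis knows what context was unavailable.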
Spend your effort on finding the most important issues, not on being comprehensive. A review with 2 precise, well-grounded findings is better than one with 8 findings that includes noise. Read the diff carefully, identify what actually matters, and stop. Do not pad the review with low-value observations.
Write findings as you go. Do not wait until you have reviewed the entire diff to start producing output. Write each finding as you discover it.
Read the evidence pack. Understand what changed, which subsystems are touched, what the risk lane is, what the verifiers found, and where the hotspots are.
Review the raw diff. Focus attention on hotspots and risk-contributing areas first. Review in this priority order — security issues first, because they are most likely to be high-severity:
- Security issues. Name the concrete mechanism (e.g. "formula injection via =cmd prefix", not "potential injection").
- Type assumptions at parse boundaries. Where data crosses JSON.parse or similar, check whether downstream code assumes specific types; a wrong-type value may silently produce NaN or undefined rather than throwing.
- Destructive infrastructure changes. For example, apply_immediately = true on production databases, or Terraform engine changes that destroy and recreate resources.

If you identify an unsanitized input flow, emit it as a finding; do not merely mention it in the TL;DR or risk summary. An observation that doesn't become a finding is invisible to downstream synthesis.
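The parse-boundary failure mode above can be made concrete. The `OrderLine` shape and `computeTotal` helper here are invented for illustration; the point is that JSON.parse performs no type validation, so a wrong-type field flows through arithmetic silently.

```typescript
// Hypothetical sketch: JSON.parse trusts the cast, so a wrong-type value
// propagates as NaN instead of throwing at the boundary.
interface OrderLine { price: number; quantity: number; }

function computeTotal(raw: string): number {
  // No runtime check: the parse result is asserted, not validated.
  const line = JSON.parse(raw) as OrderLine;
  return line.price * line.quantity;
}

// Well-typed input behaves as expected.
const ok = computeTotal('{"price": 2, "quantity": 3}');        // 6

// quantity arrives as a non-numeric string: multiplication yields NaN,
// and the bad value keeps flowing downstream without an error.
const bad = computeTotal('{"price": 2, "quantity": "three"}'); // NaN
```

This is why the review step checks downstream type assumptions rather than only whether the parse itself is wrapped in a try/catch.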
Produce candidate findings. For each issue found, emit a structured finding. Only emit findings that meet the evidence threshold (see validation guidance below).
Classify each finding. Assign confidence, severity, and action recommendation.
Before emitting a finding, verify all of the following:
- Grounded: the finding cites concrete evidence (a specific hunk, a verifier_output entry, or a repo policy), not speculation.
- Call-site verified: for findings about changed logic, the call sites have been read to confirm intent. If the call sites show the new behavior is correct (e.g. they confirm OR semantics for a switch to .some()), the finding is a false positive; discard it.

If a candidate finding fails any check, downgrade its confidence or discard it. Do not emit findings that fail the "Grounded" or "Call-site verified" checks.
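This gating logic can be sketched as a small filter. The check names follow the text; the `gateFinding` helper and the `otherChecksPass` catch-all are hypothetical.

```typescript
// Hypothetical sketch: hard checks discard a candidate finding outright;
// soft-check failures downgrade its confidence instead.
type Confidence = "high" | "medium" | "low";

interface Checks {
  grounded: boolean;          // backed by a hunk, verifier output, or policy
  callSiteVerified: boolean;  // call sites read for logic-change findings
  otherChecksPass: boolean;   // remaining, softer validation checks
}

interface Candidate { title: string; confidence: Confidence; }

const downgrade: Record<Confidence, Confidence> =
  { high: "medium", medium: "low", low: "low" };

// Returns null when the finding must not be emitted at all.
function gateFinding(f: Candidate, c: Checks): Candidate | null {
  // "Grounded" and "Call-site verified" are hard requirements.
  if (!c.grounded || !c.callSiteVerified) return null;
  // A soft-check failure still emits, but one confidence level lower.
  if (!c.otherChecksPass) return { ...f, confidence: downgrade[f.confidence] };
  return f;
}
```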
State conclusions directly. If a UI element is hidden but the server endpoint has no authorization check, say "hiding a UI element is not authorization — the endpoint is accessible to any authenticated user via direct API call." Do not merely note that the check is "client-side only" and leave the reader to infer the consequence.
When classifying risk, use the lane name from the risk routing rule (green, yellow, red) as the top-level label — not synonyms like "HIGH" or "CRITICAL." The lane name is the contract between the evidence builder and the review pipeline.
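One way to keep that contract intact is to validate lane labels at the output boundary rather than mapping synonyms. A minimal sketch, with a hypothetical `assertLane` helper:

```typescript
// Hypothetical sketch: enforce the lane-name contract; severity words like
// "HIGH" or "CRITICAL" are rejected, not translated.
const LANES = ["green", "yellow", "red"] as const;
type Lane = (typeof LANES)[number];

function assertLane(label: string): Lane {
  const lane = label.toLowerCase();
  if (!(LANES as readonly string[]).includes(lane)) {
    throw new Error(`"${label}" is not a risk lane; use green | yellow | red`);
  }
  return lane as Lane;
}
```

Failing loudly here surfaces contract drift immediately, instead of letting a synonym silently fork the vocabulary between the evidence builder and the review pipeline.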
- finding_id: UUID
- source: "fresh_eyes"
- title: concise issue description
- file: impacted file path
- line_start / line_end: specific lines (null if file-level)
- hunk: relevant code snippet (null if not applicable)
- why_it_matters: impact explanation, not diff restatement
- evidence: type (verifier_output | hunk_level_code | repo_policy | contextual_reasoning) and detail
- confidence: high | medium | low
- severity: critical | high | medium | low
- action: fix | verify | discuss
- requires_human: true if the issue needs human judgment

Example:

{
"finding_id": "a3f2c1d0-84bb-4e10-9abc-000011112222",
"source": "fresh_eyes",
"title": "Null dereference on missing user object before permission check",
"file": "src/auth/permissionGuard.ts",
"line_start": 42,
"line_end": 44,
"hunk": "const role = user.profile.role;\nif (role === 'admin') { ... }",
"why_it_matters": "If `user` is null (e.g. unauthenticated request that bypasses the middleware), this throws before the permission gate is reached, potentially exposing a 500 error with stack trace.",
"evidence": {
"type": "hunk_level_code",
"detail": "No null guard on `user` before property access at line 42; middleware that populates `user` is optional according to router config at routes/api.ts:18."
},
"confidence": "high",
"severity": "high",
"action": "fix",
"requires_human": false
}

For green-lane PRs (docs-only, test-only, safe renames, formatting), apply a lighter review. Do not flag documentation quality issues (incomplete coverage, missing package docs, style preferences) as findings; these are not defects. Only surface findings on green-lane PRs if they reveal an actual correctness or security problem introduced by the change. If the PR is correctly classified green and has no real defects, it is correct to produce zero findings.
Before flagging a logic change as a vulnerability, read the call sites to determine intent. This is mandatory for any finding about authorization logic changes.
A change from every() to some() on a role check looks like a permission downgrade in isolation. But if call sites pass role arrays like ['admin', 'manager'] to mean "admin OR manager", then some() is correct and every() was the bug (it required a user to hold both roles simultaneously). Check every invocation: if arrays represent alternative roles (OR semantics), some() is correct; if arrays represent cumulative permissions (AND semantics), every() is correct.
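The every()/some() distinction above can be made concrete. The role names and `hasRole` helper here are illustrative, not from any particular PR:

```typescript
// Illustrative sketch: when a role array means "admin OR manager",
// some() is the correct predicate and every() is the bug.
type Role = "admin" | "manager" | "viewer";

const userRoles: Role[] = ["manager"];
const hasRole = (r: Role) => userRoles.includes(r);

// Call sites pass alternatives: "admin OR manager may approve".
const allowed: Role[] = ["admin", "manager"];

// some(): any one matching role suffices; the manager is allowed.
const orCheck = allowed.some(hasRole);   // true

// every(): demands the user hold BOTH roles; the manager is denied.
const andCheck = allowed.every(hasRole); // false
```

Under AND semantics (cumulative permissions) the situation inverts, which is why reading the call sites, not the diff in isolation, determines which predicate is the bug.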
Flagging a correct fix as a security vulnerability is worse than missing a real issue. It erodes reviewer trust and wastes human review time.