Evidence-first pull request review with independent critique, selective challenger review, and human handoff.
Reference Python implementation for the deduplication and ranking logic. The dedupe_findings.py script in scripts/ implements this fully; this document explains the algorithm for manual application or adaptation.
For each pair of findings, check if they should be merged:

```python
def should_merge(a, b):
    # Must be in the same file
    if a["file"] != b["file"]:
        return False
    # Check line overlap (within a 3-line tolerance)
    if lines_overlap(a["line_start"], a["line_end"],
                     b["line_start"], b["line_end"], tolerance=3):
        # Overlapping lines — check description similarity
        if text_similarity(a["title"], b["title"]) > 0.7:
            return True  # Exact match
        if text_similarity(a["why_it_matters"], b["why_it_matters"]) > 0.6:
            return True  # Semantic overlap
    # No line overlap — check for very similar descriptions
    if text_similarity(a["title"], b["title"]) > 0.8:
        return True  # Same issue, different line attribution
    return False
```

For each cluster of related findings:
- Keep the highest-ranked finding (by severity × confidence) as the canonical one.
- Record the other sources in `corroborated_by`.
- Mark the cluster contested when the recommended actions disagree, and boost confidence only when multiple independent `source` values agree:
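The pairwise `should_merge` test only compares two findings at a time; turning those decisions into clusters requires grouping by transitive links. A minimal union-find sketch of that grouping step (`cluster_findings` is a name introduced here, not taken from the reference script):

```python
def cluster_findings(findings, should_merge):
    """Group findings into clusters by transitive should_merge links."""
    parent = list(range(len(findings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Union every pair the pairwise test says should merge
    for i in range(len(findings)):
        for j in range(i + 1, len(findings)):
            if should_merge(findings[i], findings[j]):
                union(i, j)

    # Collect findings by their cluster root
    clusters = {}
    for i, f in enumerate(findings):
        clusters.setdefault(find(i), []).append(f)
    return list(clusters.values())
```

Each resulting cluster is then passed to the merge step below.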
```python
def merge_cluster(findings):
    canonical = max(findings, key=lambda f: (
        SEVERITY_ORDER[f["severity"]],
        CONFIDENCE_ORDER[f["confidence"]],
    ))
    sources = [f["source"] for f in findings]
    actions = set(f["action"] for f in findings)
    merged = canonical.copy()
    merged["corroborated_by"] = [
        f["source"] for f in findings if f is not canonical
    ]
    if len(actions) > 1 and {"fix", "discuss"}.issubset(actions):
        merged["contested_by"] = [
            f["source"] for f in findings
            if f["action"] != canonical["action"]
        ]
    else:
        merged["contested_by"] = []
    # Boost confidence if independently corroborated
    if len(set(sources)) > 1:
        merged["merged_confidence"] = boost(canonical["confidence"])
    else:
        merged["merged_confidence"] = canonical["confidence"]
    return merged
```

After deduplication, rank survivors by:
```
score = SEVERITY_WEIGHT[severity] × CONFIDENCE_WEIGHT[confidence] × VERIFICATION_WEIGHT[verification_support]
```

Where:
| Weight | Severity | Confidence | Verification support |
|---|---|---|---|
| 3 | critical | high | verifier-backed |
| 2 | high | medium | hunk-level code |
| 1 | medium / low | low | contextual reasoning |
Example:
| Finding | Severity | Confidence | Verification | Score |
|---|---|---|---|---|
| SQL injection in login handler | 3 (critical) | 3 (high) | 3 (semgrep hit) | 27 |
| Null deref in permission guard | 2 (high) | 3 (high) | 2 (hunk-level) | 12 |
| Possible race in cache update | 2 (high) | 1 (low) | 1 (contextual) | 2 |
The third finding (score 2) would likely be suppressed by the evidence threshold since its only evidence is contextual reasoning at low confidence.
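The scoring formula can be sketched directly from the weight table. The dictionary contents mirror the table above; the key names for verification support and the `rank_score` function name are assumptions introduced here (the table's "medium / low" row is read as both severities mapping to weight 1):

```python
# Weights taken from the ranking table; "low" severity assumed to share weight 1
SEVERITY_WEIGHT = {"critical": 3, "high": 2, "medium": 1, "low": 1}
CONFIDENCE_WEIGHT = {"high": 3, "medium": 2, "low": 1}
VERIFICATION_WEIGHT = {
    "verifier_backed": 3,      # deterministic verifier hit
    "hunk_level": 2,           # evidence in the changed code itself
    "contextual_reasoning": 1, # inference from surrounding context
}

def rank_score(finding):
    """Multiplicative score used to rank deduplicated findings."""
    return (SEVERITY_WEIGHT[finding["severity"]]
            * CONFIDENCE_WEIGHT[finding["confidence"]]
            * VERIFICATION_WEIGHT[finding["verification_support"]])
```

Applied to the example rows, this reproduces the scores 27, 12, and 2.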
After ranking, suppress findings that:
- Are style-only and already covered by a linter — check whether the verifier results include a linter pass for the same file. If the linter ran and passed, style findings from reviewers are noise.
- Have low confidence with only contextual reasoning — if `evidence.type == "contextual_reasoning"` and `confidence == "low"`, suppress. The evidence threshold rule requires at least one concrete evidence source.
- Duplicate verifier output — if a reviewer finding restates exactly what a verifier already reported (same file, same line, same issue), suppress the reviewer's copy and keep the verifier finding, which has deterministic evidence.
Suppressed findings are retained with `suppressed: true` and `suppression_reason` set. They appear in eval data but not in the reviewer packet.
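The three suppression rules can be sketched as a single pass over the ranked findings. The `evidence["type"]` field and the `suppressed` / `suppression_reason` outputs follow the document; the function name, the `category == "style"` check, and the parameter shapes are assumptions for illustration:

```python
def apply_suppressions(findings, linter_passed_files, verifier_findings):
    """Mark findings suppressed per the three rules; nothing is deleted."""
    # Index verifier output for rule 3 (same file, same line, same issue)
    verifier_keys = {(v["file"], v["line_start"], v["title"])
                     for v in verifier_findings}
    for f in findings:
        f["suppressed"] = False
        f["suppression_reason"] = None
        # Rule 1: style-only finding in a file the linter already passed
        if f.get("category") == "style" and f["file"] in linter_passed_files:
            f["suppressed"] = True
            f["suppression_reason"] = "style_covered_by_linter"
        # Rule 2: low confidence backed only by contextual reasoning
        elif (f["confidence"] == "low"
              and f["evidence"]["type"] == "contextual_reasoning"):
            f["suppressed"] = True
            f["suppression_reason"] = "below_evidence_threshold"
        # Rule 3: restates what a verifier already reported
        elif (f["file"], f["line_start"], f["title"]) in verifier_keys:
            f["suppressed"] = True
            f["suppression_reason"] = "duplicate_of_verifier"
    return findings
```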
When two reviewers disagree on the same code:
- The finding is marked with `contested_by: [disagreeing source]`.
- Its action is set to `"discuss"` regardless of the individual recommendations.
- `requires_human` is set to `true`.

This ensures contested findings always reach a human reviewer for adjudication.
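The `should_merge` sketch earlier leaves `lines_overlap` and `text_similarity` undefined. One minimal stdlib implementation, using `difflib` for similarity (the reference `dedupe_findings.py` may use a different similarity measure, so treat this as an assumption):

```python
from difflib import SequenceMatcher

def lines_overlap(a_start, a_end, b_start, b_end, tolerance=0):
    # Two ranges overlap if each starts before the other ends,
    # with both ends padded by the tolerance.
    return a_start <= b_end + tolerance and b_start <= a_end + tolerance

def text_similarity(a, b):
    # Case-insensitive character-level similarity ratio in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

With `tolerance=3`, ranges 10-12 and 14-16 count as overlapping, which matches the 3-line tolerance in `should_merge`.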
```
evals/
  scenario-1 … scenario-43
rules/
skills/
  challenger-review
  finding-synthesizer
  fresh-eyes-review
  human-review-handoff
  pr-evidence-builder
  review-retrospective
```