Evidence-first pull request review with independent critique, selective challenger review, and human handoff.
Evidence-first pull request review. Builds a dossier, classifies risk, hands a structured brief to a human who makes the call.
AI code review comments get adopted 1-19% of the time. Human reviewer comments land at significantly higher rates. The gap is signal-to-noise: AI reviewers flood PRs with findings that are either obvious or wrong. Developers learn to ignore the firehose.
This plugin doesn't try to be a better bug finder. It builds an evidence pack about the PR, classifies the risk into lanes, and produces a structured brief so the human reviewer can focus on the questions only a human can answer.
The design is grounded in the Cross-model AI PR Review Research Brief and the PR Review Guardrails Spec.
Six skills, executed in sequence:
1. Evidence Builder: reads the diff, maps changed files, runs deterministic verifiers (linters, type checkers, secret scanners), and classifies risk into lanes: green (routine), yellow (needs attention), red (security-relevant, requires deep review). Everything downstream flows from this classification.
2. Fresh-Eyes Reviewer: gets the evidence pack and the raw diff. Hunts for problems, but only problems the evidence supports. No hallucinating security vulnerabilities in a docs-only PR. Produces candidate findings, not verdicts.
3. Challenger (optional): a second independent review pass. Cross-model or same-model, configurable. Strengthens or weakens candidate findings. Research says independent review works as a verification layer.
4. Finding Synthesizer: deduplicates, ranks, and compresses all findings into a single brief with evidence, confidence levels, and a recommendation for what to focus on.
5. Human Handoff: formats the brief for the person who actually decides whether to merge. Risk classification up front, findings with evidence, and explicit gaps where the plugin didn't have coverage.
6. Retrospective (optional): runs after the human makes their call. Compares the plugin's findings against the human's decision. The feedback loop.
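The ordering above, with its two optional stages, can be sketched roughly as follows. This is an illustration only, not the plugin's actual orchestration code: the `Stage` type and `plan` function are hypothetical, and the stage names are simply borrowed from the skill directory names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    optional: bool = False


# Stage order as listed above. Names mirror the skill directories;
# the Stage type itself is a hypothetical illustration.
PIPELINE = [
    Stage("pr-evidence-builder"),
    Stage("fresh-eyes-review"),
    Stage("challenger-review", optional=True),
    Stage("finding-synthesizer"),
    Stage("human-review-handoff"),
    Stage("review-retrospective", optional=True),
]


def plan(enabled_optional: set[str]) -> list[str]:
    """Return the stage names to run, in order.

    Required stages always run; optional stages run only when enabled.
    """
    return [
        stage.name
        for stage in PIPELINE
        if not stage.optional or stage.name in enabled_optional
    ]


default_plan = plan(set())
with_challenger = plan({"challenger-review"})
print(default_plan)
print(with_challenger)
```

With no optional stages enabled, the plan is the four required skills in order; enabling the challenger slots it in between the fresh-eyes review and the synthesizer.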
A brief, not a wall of findings. It starts with risk classification — green, yellow, or red — so you know immediately how much attention the PR needs. Each finding comes with evidence: specific lines, why it was flagged, and a confidence level. The brief also tells you what it didn't check.
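To make the shape of such a brief concrete, here is a minimal sketch of one possible data model. Every name in it (`Lane`, `Finding`, `Brief`, `synthesize`, `render`), the 0.0 to 1.0 confidence scale, and the sample file paths are assumptions made for illustration; the plugin's real output format may differ. The `synthesize` helper shows one simple way to dedupe and rank candidate findings: keep the highest-confidence finding per file span, then sort by confidence.

```python
from dataclasses import dataclass, field
from enum import Enum


class Lane(Enum):
    GREEN = "green"    # routine
    YELLOW = "yellow"  # needs attention
    RED = "red"        # security-relevant, requires deep review


@dataclass(frozen=True)
class Finding:
    path: str
    start_line: int
    end_line: int
    reason: str
    confidence: float  # hypothetical scale: 0.0 (guess) to 1.0 (certain)


@dataclass
class Brief:
    lane: Lane
    findings: list[Finding] = field(default_factory=list)
    not_checked: list[str] = field(default_factory=list)


def synthesize(candidates: list[Finding]) -> list[Finding]:
    """Keep the highest-confidence finding per span, then rank by confidence."""
    best: dict[tuple[str, int, int], Finding] = {}
    for f in candidates:
        key = (f.path, f.start_line, f.end_line)
        if key not in best or f.confidence > best[key].confidence:
            best[key] = f
    return sorted(best.values(), key=lambda f: f.confidence, reverse=True)


def render(brief: Brief) -> str:
    """Format the brief with the risk lane first and coverage gaps last."""
    lines = [f"Risk: {brief.lane.value}"]
    for f in brief.findings:
        lines.append(
            f"- {f.path}:{f.start_line}-{f.end_line} ({f.confidence:.0%}) {f.reason}"
        )
    if brief.not_checked:
        lines.append("Not checked: " + ", ".join(brief.not_checked))
    return "\n".join(lines)


# Made-up sample data, for illustration only.
candidates = [
    Finding("api/charge.py", 40, 44, "amount not validated before charge", 0.6),
    Finding("api/charge.py", 40, 44, "missing input validation on amount", 0.8),
    Finding("api/refund.py", 12, 12, "unused import", 0.3),
]
brief = Brief(Lane.YELLOW, synthesize(candidates), ["database migrations"])
text = render(brief)
print(text)
```

Putting the lane on the first line and the unchecked areas on the last mirrors the description above: the reader sees how much attention the PR needs before any individual finding, and sees the coverage gaps explicitly.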
    tessl install tessl-labs/pr-review-guardrails

Point it at a PR you've already reviewed and compare its brief against what you found.
The evals run 43 scenarios across four test repositories (payments-api, web-dashboard, data-service, deploy-infra).
The plugin scores 97.7% against a 66.6% baseline (Claude Opus with no guidance). The gap comes from false positive suppression and risk classification, not raw bug detection.
| Rule | Purpose |
|---|---|
| review-boundaries | What the plugin reviews and what it leaves to humans |
| reviewer-independence | Reviewer context must be isolated from authoring context |
| evidence-threshold | Findings must be grounded, scoped, and non-duplicative |
| comment-quality | Precision over volume: fewer, sharper findings |
| risk-routing | Green/yellow/red lane classification criteria |
| human-escalation | When and how to escalate to human review |
    tile.json                   # Tile manifest
    skills/
      pr-evidence-builder/      # Risk classification and evidence pack
      fresh-eyes-review/        # Independent critique
      challenger-review/        # Optional second review pass
      finding-synthesizer/      # Dedupe, rank, compress
      human-review-handoff/     # Format brief for human
      review-retrospective/     # Post-decision feedback loop
    rules/                      # Steering rules
    scripts/                    # Supporting automation
    evals/                      # 43 eval scenarios with criteria

See tile.json for tile metadata.