9 Apr 2026 · 9 minute read

I Built an AI PR Reviewer That Catches Bugs by Not Looking for Bugs

TLDR
- Humans don't want to review AI-generated code (why spend an hour reading something that took 30 seconds to generate?), and AI reviewers get ignored 81-99% of the time. PR review is broken from both sides.
- The plugin that hit 97.7% accuracy doesn't hunt for bugs. It builds an evidence pack, classifies risk into lanes, and hands a structured brief to a human who makes the actual call.
- Install it with `tessl install tessl-labs/pr-review-guardrails` and point it at a real PR. You'll know in five minutes whether this approach works for your codebase.
PR review is broken from both sides.
Humans don't want to do it. The effort asymmetry is brutal. An agent generates a PR in 30 seconds, and now a human is supposed to spend an hour carefully reading code they didn't write, didn't design, and can't ask clarifying questions about. That's a hard sell even when the code is good. When the code is AI-generated, the motivation drops further. Who wants to be a proofreader for a glorified word-guessing monkey?
So hand it to another AI? The 2025 research says that doesn't work either. AI code review comments get adopted 1-19% of the time, depending on the study, while human reviewer comments land at significantly higher rates. The gap is signal-to-noise. AI reviewers flood PRs with findings, most of them either obvious (the linter already caught it) or wrong (the code is fine, the reviewer hallucinated a vulnerability). Developers learn to ignore the firehose.
I built a Tessl plugin to try a different approach. A Tessl plugin (used to be called a "tile") is a context artifact: a bundle of skills, rules, and scripts that gives an AI coding agent domain-specific context. Think npm packages, but for agent behavior instead of code. Mine doesn't try to be a better bug finder. It builds a dossier of evidence about the PR, classifies the risk, and hands a structured brief to a human who makes the actual call.
Where it started
Earlier this year, I spent some time researching how AI-generated PRs are wrecking open source maintainers. That became the Good OSS Citizen plugin, teaching agents how to contribute. But while studying the flood of AI-generated PRs, I kept circling back to the other side: who reviews all this code?
The research said something useful: AI is good at local, checkable problems. Buffer overflows, missing null checks, SQL injection in a query builder. Things where you can point at a specific line and say "this is wrong because X." What AI is bad at is intent, architecture, and trade-offs: the stuff that requires understanding why the code exists, not just what it does.
So the design question became: what if the AI reviewer's job isn't to find bugs? What if its job is to gather evidence and let the human make the call?
Build the dossier first. Let the opinions follow from that.
How the evidence-first review pipeline works
The plugin has six skills. The first one matters most.
The evidence builder reads the diff, maps which files changed, figures out what kind of change this is, and classifies risk into lanes: green (routine), yellow (needs attention), red (security-relevant, requires deep review). Everything downstream flows from this classification. A README fix gets a green lane and a light pass. A change to the auth middleware gets red and the full treatment.
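To make the lane idea concrete, here's a minimal sketch of path-based lane classification. The patterns, lane names aside, and the rule ordering are my assumptions for illustration; the real plugin's evidence builder uses much richer signals than file paths.

```python
# Hypothetical sketch: classify a PR's risk lane from its changed file paths.
# Patterns below are illustrative assumptions, not the plugin's actual rules.
import fnmatch

LANE_RULES = [
    ("red", ["*auth*", "*security*", "*secrets*", "*payment*"]),
    ("yellow", ["src/*", "lib/*"]),
    ("green", ["*.md", "docs/*", "*.txt"]),
]

def classify_lane(changed_files):
    """Return the highest-risk lane matched by any changed file."""
    severity = {"green": 0, "yellow": 1, "red": 2}
    lane = "green"
    for path in changed_files:
        for candidate, patterns in LANE_RULES:
            if any(fnmatch.fnmatch(path, p) for p in patterns):
                if severity[candidate] > severity[lane]:
                    lane = candidate
                break  # first matching rule wins for this file
    return lane

print(classify_lane(["README.md"]))                 # green
print(classify_lane(["src/cache/cache_layer.py"]))  # yellow
print(classify_lane(["src/auth/middleware.py"]))    # red
```

The point of the sketch is the monotonicity: one red file makes the whole PR red, no matter how many green files ride along with it.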
Then the fresh-eyes reviewer gets the evidence pack and the code. It hunts for problems, but only problems the evidence supports. If the evidence builder classified a PR as green-lane, the reviewer isn't going to invent an exotic attack vector in a README change. If I enabled the optional challenger (a second model checking the first reviewer's work), that runs next. The research says cross-model review works as a verification layer, and I wanted to test that claim.
After the review, a synthesizer compresses everything into a single brief with findings, evidence, confidence levels, and a recommendation for what a human should focus on. The human handoff formats that brief for the person who actually decides whether to merge.
There's also a retrospective skill that's supposed to run after the human makes their call, comparing the plugin's findings against the human's decision: a feedback loop meant to improve the plugin over time.
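The six skills above chain into a simple pipeline shape. Here's a runnable sketch of that flow; the stage names and stub logic are my assumptions (the real stages are Tessl skills driven by an agent, not Python functions):

```python
# Hypothetical sketch of the pipeline's shape. Stubs exist only so the
# end-to-end flow is runnable; they are not the plugin's actual logic.

def build_evidence(diff):
    # Stage 1: map changed files, classify the risk lane.
    return {"lane": "red" if "auth" in diff else "green", "diff": diff}

def fresh_eyes_review(evidence):
    # Stage 2: hunt only for problems the evidence supports.
    if evidence["lane"] == "red":
        return [{"severity": "HIGH", "claim": "auth-adjacent change, trace data flow"}]
    return []

def synthesize(evidence, findings):
    # Stage 4: compress everything into one brief for the human.
    return {"lane": evidence["lane"], "findings": findings}

def review_pipeline(diff, challenger=None):
    evidence = build_evidence(diff)
    findings = fresh_eyes_review(evidence)
    if challenger:  # Stage 3: optional cross-model check of the findings
        findings = challenger(evidence, findings)
    # Stages 5-6 (human handoff formatting, retrospective) omitted here.
    return synthesize(evidence, findings)

brief = review_pipeline("--- a/src/auth/middleware.py ...")
print(brief["lane"])  # red
```

Note the ordering: opinions (stage 2 onward) only ever consume what stage 1 produced, which is the "dossier first" constraint in code form.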
What an AI code review brief looks like
When you run the plugin on a PR, the human reviewer gets a brief. It looks like this:
The brief starts with risk classification (green, yellow, or red) so you know immediately how much attention this PR needs. A green-lane config change gets a one-paragraph summary. A red-lane auth change gets the full breakdown: which files are security-relevant, what data flows through them, what the specific risks are, and what to look for when you read the code.
Each finding comes with evidence: the specific lines, why the plugin flagged them, and a confidence level. A finding that says "this user input reaches the SQL query on line 47 without sanitization" is something a developer acts on. A finding that says "potential security concern in this module" gets ignored before the developer finishes reading it. The plugin is built to produce the first kind.
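A finding of the first kind has a fixed shape: claim, location, evidence, confidence. Here's a sketch of what that record might look like; the field names and example values are assumptions, not the plugin's actual schema.

```python
# Hypothetical shape of a single finding. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str   # HIGH / MEDIUM / LOW
    action: str     # verify / fix / discuss
    claim: str      # the specific, actionable statement
    file: str       # path the evidence points at
    line: int       # exact line, not "somewhere in this module"
    evidence: str   # why the plugin believes the claim

f = Finding(
    severity="HIGH",
    action="verify",
    claim="user input reaches the SQL query without sanitization",
    file="src/db/queries.py",
    line=47,
    evidence="taint trace: request.args['q'] -> build_query() -> execute()",
)
print(f"[{f.severity} / {f.action}] {f.claim} ({f.file}:{f.line})")
```

Every field except `claim` exists to make the claim checkable: a reviewer can jump to the line and replay the evidence trail instead of taking the plugin's word for it.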
The brief also tells you what it didn't check. If the PR touches areas outside the plugin's domain knowledge, it says so instead of pretending it reviewed everything.
Here's what the plugin produced for a real PR that changes Redis cache TTL configuration in a payments API:
```
PR: #5 — Update Redis cache TTL configuration
Risk lane: RED
- Cache invalidation logic changes with auth-adjacent session_data prefix
- TTL=0 introduces keys that never expire (memory and security implications)
- Mandatory human review required (auth/security, cache invalidation)

Finding 1 [HIGH / verify]: session_data TTL config entry has no consumer
  File: src/cache/cache_layer.py:17
    "session_data": 0,  # sessions managed by auth layer, no TTL needed
  Evidence: grep for session_data across src/ returns zero results.
    src/auth/sessions.py manages its own Redis keys with 24h TTL,
    bypassing the cache layer entirely.

Finding 2 [HIGH / fix]: Zero-TTL cache entries persist forever
  File: src/cache/cache_layer.py:47
    if ttl == 0: r.set(key, json.dumps(value))
  Evidence: No background cleanup, no maxmemory-policy safeguard,
    no monitoring for key count growth. Gradual Redis memory leak.

Finding 3 [MEDIUM / discuss]: payment_details staleness window widened to 5min
Finding 4 [MEDIUM / fix]: New ttl==0 branch in set_cached is untested

Questions for human reviewer:
1. Is there a planned follow-up PR that routes session data through cache?
2. Are Stripe webhooks invalidating cached payment details on status changes?
3. What is the Redis maxmemory-policy in production?
```

Four findings, each with the specific file, line, code, and evidence trail. The human reviewer knows exactly what to focus on and why.
What AI code review still can't do
The plugin doesn't replace the human reviewer. The research is clear on this: AI review catches local, checkable problems. Intent, architecture, trade-offs: those are still yours. The plugin's job is to do the tedious forensic work (trace this data flow, check this input path, verify this config isn't exposed) so the human can focus on the questions only a human can answer: should this feature exist? Does this design make sense? Is this the right trade-off?
It also doesn't have real-world adoption data yet. I validated that it finds the right problems across 43 eval scenarios (97.7% accuracy against a 66.6% baseline). I did not validate whether developers trust what it finds and act on it. That's the honest gap. If you run the retrospective skill after a real review, you'll have more data than I do.
In Part 2 (coming soon), I'll show how I built the eval, what I learned from eight rounds of iteration, and the debugging story where I spent a week fixing the wrong skill.
Try it
```
tessl install tessl-labs/pr-review-guardrails
```

The plugin, the eval corpus, and the research brief are all in the GitHub repo. Point it at a PR you've already reviewed and compare its brief against what you found. That's the fastest way to know if the evidence-first approach works for your codebase.
Further reading:
Good OSS Citizen Part 1 (the research that started this) | Research brief and eval corpus
