9 Apr 2026 · 9 minute read

I Built an AI PR Reviewer That Catches Bugs by Not Looking for Bugs

TLDR
- Humans don't want to review AI-generated code (why spend an hour reading something that took 30 seconds to generate?), and AI reviewers get ignored 81-99% of the time. PR review is broken from both sides.
- The plugin that hit 97.7% accuracy doesn't hunt for bugs. It builds an evidence pack, classifies risk into lanes, and hands a structured brief to a human who makes the actual call.
- Install it with `tessl install tessl-labs/pr-review-guardrails` and point it at a real PR. You'll know in five minutes whether this approach works for your codebase.
PR review is broken from both sides.
Humans don't want to do it. The effort asymmetry is brutal. An agent generates a PR in 30 seconds, and now a human is supposed to spend an hour carefully reading code they didn't write, didn't design, and can't ask clarifying questions about. That's a hard sell even when the code is good. When the code is AI-generated, the motivation drops further. Who wants to be a proofreader for a glorified word-guessing monkey?
So hand it to another AI? The 2025 research says that doesn't work either. AI code review comments get adopted 1-19% of the time, depending on the study, while human reviewer comments land at significantly higher rates. The gap is signal-to-noise. AI reviewers flood PRs with findings, most of them either obvious (the linter already caught it) or wrong (the code is fine, the reviewer hallucinated a vulnerability). Developers learn to ignore the firehose.
I built a Tessl plugin to try a different approach. A Tessl plugin (used to be called a "tile") is a context artifact: a bundle of skills, rules, and scripts that gives an AI coding agent domain-specific context. Think npm packages, but for agent behavior instead of code. Mine doesn't try to be a better bug finder. It builds a dossier of evidence about the PR, classifies the risk, and hands a structured brief to a human who makes the actual call.
Where it started
Earlier this year, I spent some time researching how AI-generated PRs are wrecking open source maintainers. That became the Good OSS Citizen plugin, teaching agents how to contribute. But while studying the flood of AI-generated PRs, I kept circling back to the other side: who reviews all this code?
The research said something useful: AI is good at local, checkable problems. Buffer overflows, missing null checks, SQL injection in a query builder. Things where you can point at a specific line and say "this is wrong because X." What AI is bad at is intent, architecture, and trade-offs: the stuff that requires understanding why the code exists, not just what it does.
So the design question became: what if the AI reviewer's job isn't to find bugs? What if its job is to gather evidence and let the human make the call?
Build the dossier first. Let the opinions follow from that.
How the evidence-first review pipeline works
The plugin has six skills. The first one matters most.
The evidence builder reads the diff, maps which files changed, figures out what kind of change this is, and classifies risk into lanes: green (routine), yellow (needs attention), red (security-relevant, requires deep review). Everything downstream flows from this classification. A README fix gets a green lane and a light pass. A change to the auth middleware gets red and the full treatment.
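To make the lane idea concrete, here's a minimal sketch of path-based lane classification. The patterns, lane names aside, and the rule ordering are my assumptions for illustration; the real plugin's evidence builder uses much richer signals than file paths.

```python
# Hypothetical sketch: classify a PR's risk lane from its changed file paths.
# Patterns below are illustrative assumptions, not the plugin's actual rules.
import fnmatch

LANE_RULES = [
    ("red", ["*auth*", "*security*", "*secrets*", "*payment*"]),
    ("yellow", ["src/*", "lib/*"]),
    ("green", ["*.md", "docs/*", "*.txt"]),
]

def classify_lane(changed_files):
    """Return the highest-risk lane matched by any changed file."""
    severity = {"green": 0, "yellow": 1, "red": 2}
    lane = "green"
    for path in changed_files:
        for candidate, patterns in LANE_RULES:
            if any(fnmatch.fnmatch(path, p) for p in patterns):
                if severity[candidate] > severity[lane]:
                    lane = candidate
                break  # first matching rule wins for this file
    return lane

print(classify_lane(["README.md"]))                 # green
print(classify_lane(["src/cache/cache_layer.py"]))  # yellow
print(classify_lane(["src/auth/middleware.py"]))    # red
```

The point of the sketch is the monotonicity: one red file makes the whole PR red, no matter how many green files ride along with it.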
Then the fresh-eyes reviewer gets the evidence pack and the code. It hunts for problems, but only problems the evidence supports. If the evidence builder classified a PR as green-lane, the reviewer isn't going to invent an exotic attack vector in a README change. If I enabled the optional challenger (a second model checking the first reviewer's work), that runs next. The research says cross-model review works as a verification layer, and I wanted to test that claim.
After the review, a synthesizer compresses everything into a single brief with findings, evidence, confidence levels, and a recommendation for what a human should focus on. The human handoff formats that brief for the person who actually decides whether to merge.
There's also a retrospective skill that's supposed to run after the human makes their call, comparing the plugin's findings against the human's decision: a feedback loop meant to improve the plugin over time.
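The six skills above chain into a simple pipeline shape. Here's a runnable sketch of that flow; the stage names and stub logic are my assumptions (the real stages are Tessl skills driven by an agent, not Python functions):

```python
# Hypothetical sketch of the pipeline's shape. Stubs exist only so the
# end-to-end flow is runnable; they are not the plugin's actual logic.

def build_evidence(diff):
    # Stage 1: map changed files, classify the risk lane.
    return {"lane": "red" if "auth" in diff else "green", "diff": diff}

def fresh_eyes_review(evidence):
    # Stage 2: hunt only for problems the evidence supports.
    if evidence["lane"] == "red":
        return [{"severity": "HIGH", "claim": "auth-adjacent change, trace data flow"}]
    return []

def synthesize(evidence, findings):
    # Stage 4: compress everything into one brief for the human.
    return {"lane": evidence["lane"], "findings": findings}

def review_pipeline(diff, challenger=None):
    evidence = build_evidence(diff)
    findings = fresh_eyes_review(evidence)
    if challenger:  # Stage 3: optional cross-model check of the findings
        findings = challenger(evidence, findings)
    # Stages 5-6 (human handoff formatting, retrospective) omitted here.
    return synthesize(evidence, findings)

brief = review_pipeline("--- a/src/auth/middleware.py ...")
print(brief["lane"])  # red
```

Note the ordering: opinions (stage 2 onward) only ever consume what stage 1 produced, which is the "dossier first" constraint in code form.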
What an AI code review brief looks like
When you run the plugin on a PR, the human reviewer gets a brief. It looks like this:
The brief starts with risk classification (green, yellow, or red) so you know immediately how much attention this PR needs. A green-lane config change gets a one-paragraph summary. A red-lane auth change gets the full breakdown: which files are security-relevant, what data flows through them, what the specific risks are, and what to look for when you read the code.
Each finding comes with evidence: the specific lines, why the plugin flagged them, and a confidence level. A finding that says "this user input reaches the SQL query on line 47 without sanitization" is something a developer acts on. A finding that says "potential security concern in this module" gets ignored before the developer finishes reading it. The plugin is built to produce the first kind.
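A finding of the first kind has a fixed shape: claim, location, evidence, confidence. Here's a sketch of what that record might look like; the field names and example values are assumptions, not the plugin's actual schema.

```python
# Hypothetical shape of a single finding. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str   # HIGH / MEDIUM / LOW
    action: str     # verify / fix / discuss
    claim: str      # the specific, actionable statement
    file: str       # path the evidence points at
    line: int       # exact line, not "somewhere in this module"
    evidence: str   # why the plugin believes the claim

f = Finding(
    severity="HIGH",
    action="verify",
    claim="user input reaches the SQL query without sanitization",
    file="src/db/queries.py",
    line=47,
    evidence="taint trace: request.args['q'] -> build_query() -> execute()",
)
print(f"[{f.severity} / {f.action}] {f.claim} ({f.file}:{f.line})")
```

Every field except `claim` exists to make the claim checkable: a reviewer can jump to the line and replay the evidence trail instead of taking the plugin's word for it.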
The brief also tells you what it didn't check. If the PR touches areas outside the plugin's domain knowledge, it says so instead of pretending it reviewed everything.
Here's what the plugin produced for a real PR that changes Redis cache TTL configuration in a payments API:
```
PR: #5 — Update Redis cache TTL configuration
Risk lane: RED
- Cache invalidation logic changes with auth-adjacent session_data prefix
- TTL=0 introduces keys that never expire (memory and security implications)
- Mandatory human review required (auth/security, cache invalidation)

Finding 1 [HIGH / verify]: session_data TTL config entry has no consumer
  File: src/cache/cache_layer.py:17
    "session_data": 0,  # sessions managed by auth layer, no TTL needed
  Evidence: grep for session_data across src/ returns zero results.
    src/auth/sessions.py manages its own Redis keys with 24h TTL,
    bypassing the cache layer entirely.

Finding 2 [HIGH / fix]: Zero-TTL cache entries persist forever
  File: src/cache/cache_layer.py:47
    if ttl == 0: r.set(key, json.dumps(value))
  Evidence: No background cleanup, no maxmemory-policy safeguard,
    no monitoring for key count growth. Gradual Redis memory leak.

Finding 3 [MEDIUM / discuss]: payment_details staleness window widened to 5min
Finding 4 [MEDIUM / fix]: New ttl==0 branch in set_cached is untested

Questions for human reviewer:
1. Is there a planned follow-up PR that routes session data through cache?
2. Are Stripe webhooks invalidating cached payment details on status changes?
3. What is the Redis maxmemory-policy in production?
```

Four findings, each with the specific file, line, code, and evidence trail. The human reviewer knows exactly what to focus on and why.
What AI code review still can't do
The plugin doesn't replace the human reviewer. The research is clear on this: AI review catches local, checkable problems. Intent, architecture, trade-offs: those are still yours. The plugin's job is to do the tedious forensic work (trace this data flow, check this input path, verify this config isn't exposed) so the human can focus on the questions only a human can answer: should this feature exist? Does this design make sense? Is this the right trade-off?
It also doesn't have real-world adoption data yet. I validated that it finds the right problems across 43 eval scenarios (97.7% accuracy against a 66.6% baseline). I did not validate whether developers trust what it finds and act on it. That's the honest gap. If you run the retrospective skill after a real review, you'll have more data than I do.
In Part 2 (coming soon), I'll show how I built the eval, what I learned from eight rounds of iteration, and the debugging story where I spent a week fixing the wrong skill.
Try it
```
tessl install tessl-labs/pr-review-guardrails
```

The plugin, the eval corpus, and the research brief are all in the GitHub repo. Point it at a PR you've already reviewed and compare its brief against what you found. That's the fastest way to know if the evidence-first approach works for your codebase.
Further reading:
Good OSS Citizen Part 1 (the research that started this) | Research brief and eval corpus
