Passing tests are not enough

12 Mar 2026 · 5 minute read

Macey Baker

Macey Baker is Tessl’s Founding Community Engineer, helping shape the future of AI-native development. Some of her best friends are LLMs.

A useful pattern is emerging in coding agent evaluation.

On February 23, 2026, OpenAI published Why SWE-bench Verified no longer measures frontier coding capabilities, arguing that contamination and test-design issues now limit SWE-bench Verified as a frontier progress metric.

A couple of weeks later, on March 10, METR published Many SWE-bench-Passing PRs Would Not Be Merged into Main, showing that many test-passing patches still fail maintainer merge standards. Comments on Hacker News broadly agree: code can be functionally correct and still create maintenance drag, among other problems.

In aggregate, these findings point to the same operational takeaway: benchmark pass rates alone are no longer enough to estimate real development usefulness.

What METR adds to the picture

METR reviewed 296 AI-generated PRs with active maintainers and found a consistent gap between automated pass rates and merge decisions. Their headline number is clear: maintainer merge decisions were about 24% lower than automated SWE-bench pass rates on average.

The important nuance is in the note itself: METR does not frame this as a hard capability ceiling. The agents were not iterating on review feedback the way human contributors typically do. The result is better read as a warning against naive interpretation of benchmark scores, not a claim that agents cannot improve.

Why this gap appears

Automated checks mostly answer: "did this patch satisfy the test harness?"

Maintainers also ask:

  • Is this implementation idiomatic for this codebase?
  • Does it fit the creator’s conventions and intent?
  • Does it add avoidable complexity?
  • Will it be easy to maintain in six months?

Those questions are central to merge decisions, but they are often weakly represented in benchmark grading.

The bigger narrative arc

OpenAI’s February post and METR’s March note land on the same point from different directions. OpenAI argues that benchmark integrity can drift over time through contamination and test-design artefacts, while METR shows that even clean test-passing outcomes can still miss the standards maintainers use to merge code. Together, they shift the conversation away from a single headline benchmark number and toward evaluation methods that better reflect how software is actually reviewed, accepted, and maintained.

How Tessl evaluates for merge-quality, not only test pass

At Tessl, our repo evals are designed to capture team-specific merge gates.

A typical repo eval scenario includes:

  • task.md: the concrete engineering task.
  • criteria.json: weighted scoring criteria.
  • Repeatable runs across agent and model configurations.

The key is the rubric layer. It makes quality expectations explicit: style fit, architecture adherence, side-effect risk, and other relevant signals that simple pass/fail checks miss.
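To make the rubric layer concrete, here is a minimal sketch of weighted criteria scoring. The criterion names, weights, and threshold below are illustrative assumptions, not Tessl's actual criteria.json schema; the point is that each merge-relevant signal gets an explicit weight, and the failed criteria are surfaced rather than collapsed into a single pass/fail bit.

```python
# Hypothetical rubric layer: weighted per-criterion scoring.
# Names, weights, and scores below are illustrative, not Tessl's
# real criteria.json format.

CRITERIA = {
    "tests_pass":       {"weight": 0.40, "score": 1.0},  # automated harness result
    "style_fit":        {"weight": 0.20, "score": 0.5},  # matches codebase idioms?
    "architecture":     {"weight": 0.25, "score": 0.8},  # respects module boundaries?
    "side_effect_risk": {"weight": 0.15, "score": 0.3},  # avoids risky global changes?
}

def weighted_score(criteria):
    """Combine per-criterion scores (0.0-1.0) into one weighted total."""
    total_weight = sum(c["weight"] for c in criteria.values())
    return sum(c["weight"] * c["score"] for c in criteria.values()) / total_weight

def failed_criteria(criteria, threshold=0.6):
    """The useful review signal: which criteria fell below the bar."""
    return [name for name, c in criteria.items() if c["score"] < threshold]

print(round(weighted_score(CRITERIA), 3))   # 0.745
print(failed_criteria(CRITERIA))            # ['style_fit', 'side_effect_risk']
```

Note that a patch can score 1.0 on tests_pass and still land well below the merge bar, which is exactly the gap the METR note describes.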

In practice, this turns evals into a feedback loop teams can run continuously, not a one-off benchmark snapshot.

A practical loop teams can run this week

A useful starting point is to run an eval on your actual commit history: select representative commits, generate scenarios from those commits, and score multiple agent/model setups against merge-relevant rubric criteria. The command flow below does exactly that, and the interesting signal usually comes from the failed criteria and review notes, which then inform the next iteration of context and prompting.

Example command flow:

tessl repo select-commits org/repo --count=10 --since=2025-01-01
tessl eval generate-scenarios org/repo --commits=<sha1>,<sha2>
tessl eval run ./evals/ --agent=claude:claude-sonnet-4-5 --agent=cursor:auto

The direction here is less about replacing benchmarks and more about grounding them in real engineering outcomes. The teams getting the most value from coding agents are the ones measuring what reviewers actually care about, then iterating quickly on that feedback loop.