
What 1,281 agent runs reveal about coding agent failure in large codebases

Data from 1,281 agent runs reveals five reasons coding agents fail in large codebases — and what engineering teams can do about each

Paul Sawers

20 May 2026 · 8 min read

Coding agents are improving all the time, but within large enterprise codebases, many still struggle with tasks that experienced engineers would consider routine: tracing dependencies, locating the correct service, understanding architectural intent, or making changes across multiple repositories without breaking something somewhere.

New research from Sourcegraph, the code intelligence platform, suggests that the bottleneck is increasingly less about model capability itself, and more about the infrastructure surrounding the model.

Drawing on data from 1,281 agent runs across more than 40 enterprise-scale open source repositories — sourced from Sourcegraph’s CodeScaleBench benchmark, alongside internal research into context retrieval and code navigation — the company identified five recurring failure patterns that repeatedly surface when coding agents operate inside large software environments.

Stephanie Jarmak, agent advocate at Sourcegraph, said: "The difference between complete failure and near-perfect completion wasn't intelligence — it was efficient access to context."


1. Above 400,000 lines, grep isn't a strategy — it's a liability

There's a rough threshold in the data — around 400,000 lines of code — below which standard local tools work well enough. Above it, agents relying on tools like grep, file read, and glob start failing systematically. The search space, put simply, has outgrown the approach.

Following imports through 22,000 files leads nowhere useful, and the environment has become structurally harder in a way that better prompting can’t fix. Part of what makes this solvable is giving agents better orientation before they start — which is where context engineering enters the fray.

Tessl is set up to address this, enabling teams to encode knowledge of internal APIs, libraries, and architectural conventions as versioned, evaluated skills, so agents arrive with a working map of the codebase rather than having to reconstruct one through trial and error.

2. Finding code and finding the right code are different problems

Keyword search is a blunt instrument. In a large codebase, a single term can surface hundreds of matches spanning test files, legacy code, documentation, and the actual logic an agent needs — with nothing to indicate which is which.

Agents tend to reach for the most straightforward tool available, which means they frequently anchor on the wrong result and proceed from there. When every package has a handler.go and every module has an __init__.py, text matching gives an agent no reliable way to distinguish the one that matters from the rest.
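To make the ambiguity concrete, here is a minimal Python sketch. The file listing, the `group_by_name` and `resolve` helpers, and the package path are all hypothetical, invented for illustration; they are not drawn from the benchmark or from any Sourcegraph tooling. It shows how a name-only lookup returns several equally plausible candidates, while a lookup that also knows the owning package returns one.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical file listing, invented for illustration only.
FILES = [
    "services/billing/handler.go",
    "services/auth/handler.go",
    "services/auth/handler_test.go",
    "legacy/auth_v1/handler.go",
    "docs/auth/handler.md",
]


def group_by_name(paths):
    """Group paths by bare file name, which is all a name search can see."""
    groups = defaultdict(list)
    for path in paths:
        groups[PurePosixPath(path).name].append(path)
    return dict(groups)


def resolve(paths, package, name):
    """Return only the file that lives directly in the given package."""
    target = PurePosixPath(package) / name
    return [path for path in paths if PurePosixPath(path) == target]


name_hits = group_by_name(FILES)["handler.go"]
exact_hits = resolve(FILES, "services/auth", "handler.go")

print(len(name_hits), "files share the name:", name_hits)
print("resolved by package:", exact_hits)
```

The second lookup only works because something already knows which package the agent cares about, and that is the information plain text matching does not supply.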

Structural navigation — tooling that understands the relationships between code rather than just its surface text — is what closes that gap. The Tessl registry addresses a related dimension of that problem: with documentation for over 10,000 open-source packages kept version-matched to your dependencies, agents have the right context about imports, conventions, and APIs before they start reaching for the wrong ones.

3. A half-finished refactoring isn't progress — it's a bug in waiting

It is tempting to think that an agent that completes part of a refactoring across interdependent files has made at least some progress. In practice, it has probably introduced an inconsistency and left the codebase worse off than before.

The CodeScaleBench data surfaced this repeatedly: changes that were locally correct but left the broader system in a broken state, with nothing at the surface to indicate anything had gone wrong. The change looks like progress and might pass a quick review, but the problem only emerges downstream, by which point it is significantly harder to trace back to its cause.

The same dynamic gets worse across repository boundaries. The benchmark shows a larger performance gap in multi-repo tasks than single-repo ones — which makes sense, because the more the relevant code is distributed across organisational lines, the lower the probability that an agent finds all of it.
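A partial rename is the simplest version of this failure. The Python sketch below assumes a hypothetical three-file repository in which a function called `charge` was renamed to `charge_customer` and one caller was missed; the file names, contents, and the `stale_references` helper are invented for illustration. It is a crude textual check rather than the structural navigation the research describes, but it shows what "locally correct, globally broken" looks like.

```python
import re


def stale_references(files, old_name):
    """Return paths whose text still mentions old_name as a whole word."""
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    return sorted(path for path, text in files.items() if pattern.search(text))


# Hypothetical repository state after a partial rename of
# `charge` to `charge_customer`. Contents are invented for illustration.
repo = {
    "billing/api.py": "def charge_customer(order): ...",
    "billing/jobs.py": "from billing.api import charge_customer",
    "reports/export.py": "from billing.api import charge",
}

stale = stale_references(repo, "charge")
print("files still using the old name:", stale)
```

Each edited file is fine when read on its own; the break sits in the file the agent never opened, which is why it tends to surface only downstream.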

In short: a half-finished refactoring isn't a starting point. It's a bug in waiting.

4. “Tool thrashing”: why agents without good retrieval cost more and deliver less

When structured retrieval isn't available, agents improvise: grepping for variations, reading adjacent files, backtracking, trying different directories. Sourcegraph's data puts a number on it: one benchmark task saw a baseline agent make 96 tool calls over 84 minutes; the same task with proper tooling took five calls and under five minutes. Across the full dataset, Sourcegraph reports a 30% cost reduction and a 38% speed improvement, which came almost entirely from eliminating that improvisation.
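For a sense of scale, the single-task figures can be turned into ratios. The short Python sketch below uses only the numbers quoted above and treats "under five minutes" as an upper bound of five, which is an assumption, so the speed-up it prints is a floor rather than an exact figure.

```python
# Figures quoted in the article for one benchmark task.
baseline_calls, baseline_minutes = 96, 84
tooled_calls = 5
# Assumption: "under five minutes" is treated as an upper bound of 5.
tooled_minutes_upper = 5

call_ratio = baseline_calls / tooled_calls
min_speedup = baseline_minutes / tooled_minutes_upper

print(f"tool calls: {call_ratio:.1f}x fewer")
print(f"wall-clock time: at least {min_speedup:.1f}x faster")
```

These ratios describe one task only and should not be read against the 30% and 38% dataset-wide averages, which are a different measurement.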

The compounding effect of what Jarmak calls "tool thrashing" is what makes this particularly damaging, and it goes beyond wasted time and money: each failed search makes the next one harder.

"Tool thrashing isn't just slower — it's structurally worse," Jarmak writes. "Each backtrack leaves residue in the conversation history, file contents that are no longer relevant but still consume context. By the time the agent finds the right files, it may have less context to produce output than it would have had if it had found them on the first try."

5. More tools made it worse. The problem isn't retrieval volume, it's retrieval noise

One of the more striking findings in the data: on some tasks, agents given access to more tools performed worse than those without them. Given additional search capability, they used it to pull in more code, read more files, and dilute their own context in the process. Retrieval quality is what matters — an agent that retrieves precisely the right files outperforms one that retrieves those same files buried among dozens of irrelevant ones.
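One way to express the distinction is retrieval precision: the share of retrieved files that were actually needed. The Python sketch below is illustrative only. The file names and counts are hypothetical, and the article does not say that Sourcegraph scores runs this way.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved files that are relevant (0.0 if none retrieved)."""
    retrieved_set = set(retrieved)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & set(relevant)) / len(retrieved_set)


# Hypothetical task: three files are actually needed.
relevant = ["auth/handler.go", "auth/session.go", "auth/token.go"]

# A focused retrieval returns exactly those files.
focused = list(relevant)
# A noisy retrieval returns the same files plus 27 unrelated ones.
noisy = relevant + [f"misc/file_{i}.go" for i in range(27)]

focused_precision = precision(focused, relevant)
noisy_precision = precision(noisy, relevant)

print(f"focused: {focused_precision:.2f}, noisy: {noisy_precision:.2f}")
```

Both retrievals contain every file the task needs, so a recall measure would rate them equally; only precision captures how much unrelated material the agent then has to carry in its context.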

This is where the infrastructure argument lands. The models can reason just fine, but what they often struggle with is navigating the environment they've been placed in, because the tooling around them wasn't built for the complexity of real production codebases.

Tessl is working on the adjacent layer: how the context and skills feeding agents are managed, versioned, and evaluated over time — operating on the premise that you can’t improve what you can’t measure. Better retrieval infrastructure tells an agent where to look. Better context engineering determines what it knows before it starts. Reliable agents need both.


Paul Sawers

Freelance tech writer at Tessl, former TechCrunch senior writer covering startups and open source

