
The model's solved, now comes the hard part: Reviewability as the bottleneck

AI engineering shifts focus from model development to ensuring system reviewability, emphasizing manageable task sizes for reliable and governable outputs.

Paul Sawers

·2 Jun 2026·9 min read

It's something you'll likely be hearing more and more: the model is no longer the big sticking point in AI engineering. The question keeping teams up at night is how to build reliable, governable systems around it.

Kilo, the open source coding agent built on VS Code, recently crossed three million downloads and processed more than 40 trillion tokens. The lessons from that volume of real-world usage had little to do with model intelligence and everything to do with reviewability, context, and operational control.

‘Task size should be bounded by what a human can review in a single sitting’

Forty trillion tokens sounds like some sort of success metric, but Kilo's own assessment is a little more measured. At that volume, small problems in the surrounding system become expensive fast. A missing context file becomes repeated tool calls; a poorly scoped task produces a diff too large for any engineer to sensibly review; and a vague permission setting becomes a blocker the moment a second team tries to adopt the tool.

The conclusion Kilo drew from its own usage data was pointed: task size should be bounded by what a human can review in a single sitting. If the output can't be reviewed, it can't be trusted, and if it can't be trusted, it won't be merged.

To illustrate the point, Kilo describes splitting a single feature into three parallel workstreams (a billing API endpoint, a test suite, and a documentation update), each handled by a separate agent with a narrow, explicit instruction. One diff touches the endpoint, another touches the tests, and a third touches the docs. If the tests fail, the failure is scoped. If the docs agent guesses, the mistake is visible.
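As a rough illustration of that three-way split, the sketch below checks whether an agent's diff stays inside its task boundary and within a review budget. It is not Kilo's implementation: the `Task` class, the `review_problems` helper, the file paths, and the 400-line budget are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# Hypothetical review budget: the most changed lines one reviewer is
# assumed to handle in a single sitting. A real team would tune this.
MAX_REVIEWABLE_LINES = 400


@dataclass(frozen=True)
class Task:
    name: str
    instruction: str
    allowed_paths: tuple[str, ...]  # glob patterns the agent may touch


def review_problems(task: Task, changed: dict[str, int]) -> list[str]:
    """Return reasons a diff is not reviewable; an empty list means it passes.

    `changed` maps each touched file path to its number of changed lines.
    """
    problems = []
    for path in changed:
        if not any(fnmatch(path, pattern) for pattern in task.allowed_paths):
            problems.append(f"{task.name}: {path} is outside the task boundary")
    total = sum(changed.values())
    if total > MAX_REVIEWABLE_LINES:
        problems.append(f"{task.name}: {total} changed lines exceeds the review budget")
    return problems


# One narrow task per agent, mirroring the endpoint / tests / docs split.
tasks = [
    Task("endpoint", "Add the billing API endpoint.", ("src/billing/*",)),
    Task("tests", "Write tests for the billing endpoint.", ("tests/billing/*",)),
    Task("docs", "Document the billing endpoint.", ("docs/*",)),
]

# A docs agent that also edits source code is flagged immediately.
for problem in review_problems(tasks[2], {"docs/billing.md": 30, "src/billing/api.py": 5}):
    print(problem)
```

Because each task carries its own path patterns, a mistake shows up as a named, scoped problem instead of being buried in one large diff.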

“The job changes from ‘write every line’ to ‘design the loop,’” Brendan O'Leary, developer relations engineer at Kilo, writes in a blog post. “You decide the task boundary, the model, the permissions, the environment, and the verification step. The agent writes code. You decide whether that code should exist.”

Kilo's findings fit into a broader pattern emerging elsewhere in the industry. Sourcegraph, the code intelligence platform, recently analysed 1,281 agent runs across more than 40 enterprise-scale open source repositories and found that the gap between success and failure had almost nothing to do with the underlying model.

"The difference between complete failure and near-perfect completion wasn't intelligence — it was efficient access to context," Stephanie Jarmak, agent advocate at Sourcegraph, said.

One benchmark task saw an agent make 96 tool calls over 84 minutes without proper retrieval tooling. The same task, with the right infrastructure in place, took five calls and under five minutes.
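To put those figures in proportion, the arithmetic can be worked through directly. The numbers come from the benchmark task described above; because the faster run took "under five minutes", five minutes is treated as an upper bound, so the time ratio is a minimum.

```python
# Figures reported for the single benchmark task.
calls_without, minutes_without = 96, 84
calls_with, minutes_with_upper_bound = 5, 5  # "under five minutes"

# How many times more tool calls the run without retrieval tooling made.
call_ratio = calls_without / calls_with

# Lower bound on the speedup, since the faster run took less than 5 minutes.
min_speedup = minutes_without / minutes_with_upper_bound

print(f"{call_ratio:.1f}x the tool calls, and at least {min_speedup:.1f}x the time")
```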

The lesson from both Kilo and Sourcegraph is that the systems surrounding the model increasingly determine the outcome.

The infrastructure around the model is the engineering challenge

Kilo's experience also surfaced a more granular picture of what production-grade agentic engineering actually requires. The full loop — plan, scope, run, verify, review, merge — needs dedicated infrastructure at every step. Planning needs modes and file-backed handoffs. Scoping needs explicit permissions and task boundaries. Running needs model choice, tool calls, and environment isolation. Verification needs tests, CI integration, and sometimes a second agent with fresh context. Review needs a diff a human can understand. When any one part is missing, the agent may still produce code, but the team just won't trust it enough to merge.
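The loop described above can be sketched as a sequence of gates, where a missing or failing stage blocks the merge. This is a minimal illustration under stated assumptions, not a description of Kilo's architecture: the stage names come from the article, while `run_loop`, the handler signature, and the `tests_passed` flag are hypothetical.

```python
from typing import Callable

# Stage names are taken from the article; everything else is hypothetical.
STAGES = ("plan", "scope", "run", "verify", "review", "merge")


def run_loop(handlers: dict[str, Callable[[dict], bool]], state: dict) -> str:
    """Run each stage in order and stop at the first gap or failure.

    Returns "merged" on success, otherwise a message naming the stage
    that blocked the change.
    """
    for stage in STAGES:
        handler = handlers.get(stage)
        if handler is None:
            return f"blocked: no infrastructure for '{stage}'"
        if not handler(state):
            return f"blocked: '{stage}' failed"
    return "merged"


def passes(state: dict) -> bool:
    """Placeholder handler that always succeeds."""
    return True


handlers = {stage: passes for stage in STAGES}
# Verification only succeeds when the (hypothetical) test run passed.
handlers["verify"] = lambda state: state.get("tests_passed", False)

print(run_loop(handlers, {"tests_passed": True}))
print(run_loop(handlers, {"tests_passed": False}))
```

The agent may still have produced code in the second case; the loop simply refuses to treat it as mergeable until verification passes.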

But reviewability is only part of that picture. OpenAI's most recent enterprise guidance, drawing on deployments at companies including BBVA, Philips, and JetBrains, shows that organisations seeing the most traction are those focused on evaluation systems, context management, orchestration, and governance — not on which model sits underneath.

"The organisations that win with AI won't be the ones that tried it first — they'll be the ones that operationalised it best," said Sanj Bhayro, OpenAI's managing director for EMEA.

The emerging picture is of a new engineering layer forming around AI systems: evaluation tooling that runs against real codebases, shared context registries, permission controls, usage analytics, and observability infrastructure. Kilo's own roadmap reflects this directly — its next priorities centre on portable sessions that survive moving between VS Code, the terminal, Slack, and cloud environments, and on ensuring every agent workflow ends in an artifact a human can judge.

Governance before autonomy: teams won't adopt what they can't explain

One of the less obvious lessons from Kilo's experience is the difference between individual and team adoption. Individual developers adopt tools when they save time. Teams adopt them when they can explain the risk to finance, to security, and to whoever owns the production environment.

“That means agentic engineering needs controls that feel boring until you need them,” O’Leary notes.

Kilo learned that the hard way. Its early free credits attracted tens of thousands of throwaway accounts, generating billing pressure, infrastructure strain, and weeks of engineering time spent in merge conflicts rather than shipping product. The experience sharpened Kilo's thinking on what enterprise-grade agentic tooling actually needs: model allowlists, usage visibility before a billing surprise arrives, permission prompts that can block tool calls, isolated cloud environments for sensitive work, and source visibility for security review.
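Two of those controls, model allowlists and permission prompts that can block tool calls, might look something like the sketch below. It assumes a simple in-memory policy object; `TeamPolicy`, `check_tool_call`, and the model and tool names are hypothetical and do not reflect Kilo's actual configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class TeamPolicy:
    allowed_models: set[str]
    blocked_tools: set[str] = field(default_factory=set)
    tools_needing_approval: set[str] = field(default_factory=set)


def check_tool_call(
    policy: TeamPolicy, model: str, tool: str, approved: bool = False
) -> tuple[bool, str]:
    """Decide whether a tool call may proceed, and give the reason."""
    if model not in policy.allowed_models:
        return False, f"model '{model}' is not on the allowlist"
    if tool in policy.blocked_tools:
        return False, f"tool '{tool}' is blocked by policy"
    if tool in policy.tools_needing_approval and not approved:
        return False, f"tool '{tool}' needs human approval"
    return True, "allowed"


# Hypothetical policy: one approved model, deploys blocked outright,
# shell commands gated behind a permission prompt.
policy = TeamPolicy(
    allowed_models={"model-a"},
    blocked_tools={"deploy"},
    tools_needing_approval={"shell"},
)

print(check_tool_call(policy, "model-a", "read_file"))
print(check_tool_call(policy, "model-a", "shell"))
print(check_tool_call(policy, "model-b", "read_file"))
```

Returning the reason alongside the decision is a deliberate choice here: it gives a team something concrete to show finance or security when a call is refused.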

Those requirements map directly onto the questions Kilo found developers asking about any open source AI tool: can I inspect what runs against my code? Can I bring my own model key? Can I control which models my team is allowed to use? Can I see usage before a bill arrives? Can I keep sensitive work local? And crucially — can I leave if the product stops fitting how my team works?

The infrastructure layer is still being built

Sourcegraph's retrieval findings, OpenAI's governance lessons, and Kilo's focus on reviewability all point toward the same challenge: reliable AI systems depend on reliable infrastructure around the model.

Kilo's own roadmap frames the next phase in three parts: portable, meaning sessions that survive moving between VS Code, the terminal, Slack, and cloud environments; governed, meaning teams can set model policies, inspect usage, and control permissions; and review-first, meaning every agent workflow ends in an artifact a human can judge — a diff, a test result, a PR comment, a deployment preview.

Forty trillion tokens and three million downloads later, Kilo's conclusion is that generating code is only part of the problem. Teams still need ways to review it, verify it, govern it, and trust it. The model may be good enough, but the systems around it are still being built.


Paul Sawers

Freelance tech writer at Tessl, former TechCrunch senior writer covering startups and open source
