Why Warp is betting engineering leaders are done picking a favourite coding agent
CEO Zach Lloyd talks multi-harness orchestration, why the CFO is now in the room for AI tool discussions, and what governed autonomy actually looks like on a “factory floor.”

Paul Sawers

Engineering leaders have spent the past year trying to get their teams to adopt AI coding tools as quickly as possible. Now, a new set of questions has taken over: how do you measure whether any of it is worth the money, and how do you stop agents from running unchecked on production systems?
Developer tooling company Warp, maker of an AI-native terminal and development environment, thinks the answer isn't picking a single agent and standardising on it — it's giving teams a way to run several at once, compare them, and govern all of them from a single control plane.
As Tessl wrote back in February, orchestration has emerged as a discipline in its own right: a dedicated layer of tooling for coordinating, supervising and directing multiple agents running in parallel. That same month, Warp launched Oz as a cloud platform for running and managing coding agents at scale.
Now, Warp is taking things a step further. In May, the company expanded Oz into what it's calling the first multi-harness control plane — meaning teams can now run Claude Code, Codex and Warp Agent simultaneously through a single interface, rather than committing to any one of them.
Tessl caught up with Warp CEO Zach Lloyd to discuss how engineering leaders are thinking about agent fleets, what the harness layer actually changes, and where the lines between autonomy and human oversight are really being drawn.
"The wild west": how the agent gold rush became a budget problem
Zach spent several years at Google, leading engineering on Docs and Sheets, before co-founding photo-editing startup SelfMade. He later served as interim CTO at Time before founding Warp in 2020, raising some $30 million in funding from the likes of Google Ventures, Figma co-founder Dylan Field, and Salesforce co-founder Marc Benioff.
That background — building collaborative tools at Google scale, then navigating the startup world — gives Zach a particular vantage point on how quickly the engineering tooling landscape has moved. A year and a half ago, he says, most companies were still trying to get developers to use AI autocomplete tools. Then, about a year ago, the conversation moved to interactive agents — Claude Code, Codex, Warp — where engineers were directing tools to build features and fix issues end to end.
Now, he says, that phase too has largely passed — and the CFO's arrival in the conversation is perhaps the clearest sign of it.
"Companies right now have moved from a 'can we get people to adopt' mindset to a 'how do you measure ROI' mindset," Zach explained. "They're paying a lot of money for these tools, and the CFO has gotten involved — these costs are showing up, and so they are thinking through how to go from the wild west, where every engineer is just spending as much as they can on different agents, to a world where they're still getting as much productivity as possible, but they want to measure it, they want to put quotas and budgets, and they want to use different agents for different types of tasks."
That last point is central to Warp's multi-harness bet. Rather than standardising on a single agent, Zach argues that engineering teams want the ability to route different tasks to different agents depending on what each does best, while keeping the governance layer consistent across all of them.
"The biggest trend that we see is: can you use open-weight models for some tasks when you have to be at the frontier?" Zach said. "The way that we're positioning Oz is that you can basically not lock into one source of intelligence. You can use Claude Code, you can use Codex, you can use open-weight models, but you can still confidently invest in a layer of infrastructure for governance that is not tightly coupled to any one particular agent."
The economics driving that are already visible. Open-weight models — DeepSeek, Kimi, Qwen — have gone from lagging well behind the frontier to matching it on many tasks, and at a fraction of the inference cost. Tessl also recently switched its default eval model from Claude Sonnet 4.6 to GLM 5.1 for exactly this reason — finding that for skill evaluation work, a cheaper open-weight model produced near-identical signal at meaningfully lower cost.
Elsewhere, AI agent startup Lindy recently moved 100% of its traffic from Anthropic to DeepSeek v4, with CEO Flo Crivello claiming the company would be saving millions in the process.
It's worth noting that Warp has been doubling down on openness more broadly, open-sourcing its client earlier this year and using Oz itself to manage the repo — agents handle the implementation, community contributors handle direction and verification.
“We now have a lot of confidence in code that is generated by Oz with our rules, context and verification, so anyone contributing should have a high chance of success coding a feature correctly,” Zach said at the time.
The move also serves as a live test of Warp's own thesis — if the orchestration layer is good enough to run a public repo at scale, it's good enough for enterprise teams to trust with their own.
“Leaning on agents creates pressure for us to nail orchestration, memory, handoff, and all of the other parts of agentic engineering that are core to our business,” Zach continued. “There’s a virtuous loop here.”
That loop extends to customers too. The things that matter most — context management, memory, audit logs — can all be separated from the agent itself, Zach argues. That's the point of Oz: a container layer for all of it, so that when the best model or harness changes — and Zach is clear that it will, every few months — teams aren't starting from scratch.
The model isn't enough: why the harness and context matter just as much
The natural question is whether multi-harness is a solution in search of a problem. If Claude Code and Warp Agent can both run on Anthropic models, what is the harness actually changing?
Zach's answer is that performance is a function of three things working together: the model, the harness, and the context.
"The harness is what feeds the context in," Zach said. "You want a harness that is good at managing the context window — when do you take different sources of external context and put them in? If you put too much context in, the model has to summarise and it loses information on the current task. How you manage that context window is really important. Different harnesses excel at different things — Claude Code is a great harness, Codex is a really good harness, Warp's agent harness is [also] really good."
The model and the harness are table stakes. The third element — organisational context — is where Warp is investing most heavily right now, through what it calls cross-harness memory. The idea is that as agents complete tasks, the system captures what worked and surfaces it automatically in future runs, across whichever harness is being used.
"Every time one of these agents runs, it does some task, and maybe in the course of figuring out some problem, with the guidance of a human, they arrive at some solution," Zach said. "What you don't want to do is throw that away and start from scratch next time. If you have a memory system, think of it as a layer that is observing what all of your agents are doing and being like: this seems like an important thing to remember."
Cross-harness agent memory is currently in research preview with a small number of pilot customers.
More autonomy, more controls: Warp's answer to an uncomfortable balancing act
The tension at the heart of Oz's pitch is one that Zach doesn't try to resolve so much as manage. On the one hand, the platform promises agents that can handle complex, long-running tasks — migrations, production deployments — with less human oversight. On the other, the same release adds approval gates, per-user authentication, and least-privilege permissions.
Those two things pull in opposite directions.
"I think there's a fundamental tension, but I think it's necessary," Zach said. "From talking to our customers, I don’t think companies are ready to be fully hands-off. The ideal system at this moment looks like a factory floor, where you want to put stuff that can be automated through an automation process, but then you want a human to step in and say: ‘was this done right?’"
The logic Zach applies is essentially risk-tiering. The parts of the stack where errors are cheapest get automated first; the parts where they are most costly stay human-supervised longest.
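As a rough illustration of that tiering logic, the sketch below gates a task on its risk tier. It is not Warp's or Oz's implementation; the `Risk` tiers, the `Task` type and the `needs_human_approval` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    """Hypothetical risk tiers; higher values mean costlier mistakes."""
    LOW = 1     # e.g. copy changes on a marketing website
    MEDIUM = 2  # e.g. application code that goes through review
    HIGH = 3    # e.g. changes that touch production data


@dataclass(frozen=True)
class Task:
    description: str
    risk: Risk


def needs_human_approval(task: Task, auto_approve_up_to: Risk = Risk.LOW) -> bool:
    """Return True when the task's tier is above the automation threshold."""
    return task.risk > auto_approve_up_to


website = Task("update pricing page copy", Risk.LOW)
migration = Task("migrate customer table", Risk.HIGH)

print(needs_human_approval(website))    # low-risk work runs unattended
print(needs_human_approval(migration))  # high-risk work waits for sign-off
```

Raising `auto_approve_up_to` one tier at a time would correspond to guardrails coming off the lower-risk work first.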
"The parts that can be most automated are the parts where the risks are lowest — this is common sense," Zach said. "Making changes to our website is way lower risk than making changes to our data. So you'll see more and more of the guardrails go away on the low risk things before they go on the high risk things."
As for who inside an enterprise actually draws those lines, Zach says it's rarely one team. Platform teams or dedicated AI developer productivity functions tend to lead, with security always involved and finance increasingly so.
"The security team is always involved — probably the team that's most scared," Zach said. "Increasingly there is a cost management component. What's the budget for this? What's the token budget per engineer? What's the way that you see ROI? It's starting to become a significant line item for all of these customers."
Evals: measuring the factory floor
Which brings the conversation to evals — how teams actually know whether any of this is working. Zach's framing here draws again on the factory floor analogy: what you want, ultimately, is a bird's eye view of how work flows from idea to shipped product.
Warp has built a live version of this for its own open-source repository at build.warp.dev, where anyone can pull up a view of how issues move through the agent pipeline. Zach uses it as a reference point for what enterprise teams should be aiming for.
"The things you can measure are throughput of code as one basic measurement," Zach said. "Ideally, in a more sophisticated world, you would go all the way from measuring throughput of code to throughput of user or customer impact — be able to tie back: ‘a ticket came in asking for this feature, an agent was able to build it, it cost this number of dollars or tokens, and in production it was used by XYZ customers’. That's the dream loop. The code part is not that hard — that's where we can just deliver."
Token efficiency per PR is the baseline metric Warp currently offers. The harder problem — tying agent output to business outcomes — remains what Zach calls the “holy grail.”
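One way to picture a token-efficiency-per-PR figure is the minimal sketch below. The article does not say how Warp computes its metric, so the `AgentRun` record, its fields and the choice to count tokens from unmerged attempts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AgentRun:
    """Hypothetical record of one agent run against a pull request."""
    pr_id: str
    tokens: int
    merged: bool


def tokens_per_merged_pr(runs: Sequence[AgentRun]) -> Optional[float]:
    """Total tokens spent across all runs, divided by distinct merged PRs.

    Tokens from runs whose PR never merged still count as spend,
    so abandoned work makes the figure worse rather than vanishing.
    """
    merged_prs = {run.pr_id for run in runs if run.merged}
    if not merged_prs:
        return None  # nothing shipped, so the ratio is undefined
    total_tokens = sum(run.tokens for run in runs)
    return total_tokens / len(merged_prs)


runs = [
    AgentRun("pr-1", 120_000, merged=True),
    AgentRun("pr-1", 30_000, merged=True),   # follow-up run on the same PR
    AgentRun("pr-2", 50_000, merged=False),  # abandoned attempt
]
print(tokens_per_merged_pr(runs))
```

Tying this number to customer impact, the harder problem described above, would need data that sits outside the agent pipeline.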
The agent builder: a new role that doesn't require an engineering background
One of the more striking parts of the conversation is what Zach describes happening to engineering teams themselves as agent fleets become the norm — at Warp and at the companies it works with.
The background profile of engineers Warp hires hasn't changed much, he says. What has changed is what they do.
"The day to day of a software engineer now is not about writing code," Zach said. "It's about: can you accurately specify a user requirement to an agent? Can you make sure that the technical plan an agent comes up with makes sense? Is it building in the right part of the codebase? Is it repeating a bunch of code? Is it using the same quality of abstraction that a human would use?"
Beyond that shift in existing roles, Warp has also introduced a new function it calls the agent builder — a full-time role focused on building internal automations using agents. Notably, the people filling it don't come from engineering backgrounds.
"The people who are in this role are people with product and design backgrounds," Zach said. "They are not engineers by training, and I don't think you need that. For internal tooling use cases you can hire people who are more generic builders. One of the cool things that's come out of all this new technology is a democratisation of who gets to build stuff."
The caveat is that this only holds where the stakes are low — customer-facing product, he implies, is a different matter. "As long as it's not customer-facing, I think it's pretty much fine for that to work that way," Zach said.
Among the companies Warp works with, Zach sees two distinct camps emerging. Larger organisations with dedicated developer productivity teams are building their own internal software factories from scratch — the complexity is manageable if you have the headcount. Smaller ones are buying, because the build cost simply doesn't justify the investment. What they share, he says, is the destination: a centralised system where agents handle the routine work and humans focus on the exceptions.
What that means in practice for engineering leaders is less about which agent to pick and more about building the layer around it — the governance, the memory, the measurement — that makes any agent trustworthy enough to run at scale.
For all the variation in how companies are approaching this (different tools, different team structures, different risk tolerances), Zach sees them all heading toward the same place.
"The goal of most companies right now is to get to what I would call an internal software factory: a centralised system where agents are taking in issues, judging, building, verifying, pushing," Zach said. "They don't want to do that for 100% of the issues, and they don't want to take humans out of the loop. But they're all trying to stand up this same kind of machine. And different companies are further along on this journey than others."

Paul Sawers
Freelance tech writer at Tessl, former TechCrunch senior writer covering startups and open source