ainativedev/aidevcon-2026-ldn

AI Native DevCon 2026 London — all conference sessions as interactive skills

Transcript -- Harness Engineering: How to Build Software When Humans Steer and Agents Execute

Speaker: Ryan Lopopolo (OpenAI)

Date: 2026-06-01

Source: user-provided transcript, single continuous block, no per-speaker labels

⚠️ Attribution warning. This transcript was supplied as a single block with no per-speaker labels. The bulk is a monologue by Ryan Lopopolo. A Q&A section near the end contains audience questions interleaved without labels — you can usually tell from second-person framing ("You mentioned earlier in your talk..."). Post-talk closing chatter at the very end is informal cross-talk between unnamed people. Do not invent attributions. Also note: the speaker references the OSS project artichoke/rand_mt, which is publicly associated with Ryan Lopopolo, not Ryan Lopopolo — flag this as an inconsistency if it matters.

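Transcription artifacts likely present: "IMAX out" → probably "I max out"; "bottle attention" / "bottle context" → probably "model attention" / "model context"; "ll and as judges" → "LLM-as-judges"; "es links" / "es languages" → "ESLint"; "links" (as in "the links we write") → "lints"; "the chase" → possibly "the case"; "annies" → likely "anys"; "hotel" → probably "OTel" (OpenTelemetry); "EST" → probably "AST"; "agents on MD" → "agents.md"; "random T" → "rand_mt"; "percent twister" → "Mersenne Twister"; "fim ray" → unclear; "xcorks" → possibly "Xvfb"; "ffm picked" → "ffmpeg"; "dish" / "disk" / "gifts" → contextually "diff" / "diffs" in several places. These are preserved verbatim below.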


Intro & "invented the term"

I'm excited to kind of be in the home stretch here on this first day, which has been jam-packed and super fun. It has been super fun to be here and I'm kind of excited today to talk to you about harness engineering, which is a thing that is kind of near and dear to my heart, kind of having invented the term here. And to me, the way that we go about working with these agents is something that fundamentally is brand new, and we don't really know all the good parts yet. But hopefully today I can walk you through some of what I believe the good parts of working with these agents are and how to be effective in your own code bases.

Origin story: trying to automate his own job

To give a little bit of context on why you should listen to me about this. Back in June of last year when we had just had the earliest reasoning models around o3 and the very earliest versions of Claude CLI, which is Anthropic's coding agent, I had an insane idea that I would try and get this tool to do my job. And at the time with less capable models, that wasn't true. I asked the agent to read my alerts channel in slack and triage a page. It would not do that. And kind of got myself into this operating of presenting myself as a tool to the model. In order to empower it, to solve problems, issues, and write code on my behalf. And ended up in this very quickly accumulating snowball of effective use of this tool by giving it more and more powerful tools and more and more context around what it means to do the job. There's a bunch of patterns here that make that effective and stack really, really well to your team. We got to go through today.

The pace of disruption

I know I'm preaching to the choir here. Everybody's AI native. That's why we're at the con here. But the way we build software has changed pretty significantly in the last six months. I would say in December with the introduction of o1, GPT-4.5, we really reach singularity levels of software engineering and code production being something that these tools do insanely well. And this is a level of disruption that I think we have typically only seen once every decade here. The last one that I can think of is probably like the existence of the cloud as a tool to accelerate ourselves. And with that sort of like cadence of disruptive innovation, we have had a longer time to internalize changes to our workflows and the way we go about building. But here. The technology keeps changing so rapidly with every point release of these models where I find myself very often having to reevaluate my priors on what even is possible to achieve with these tools. And I think if you're not in the habit of kind of completely retooling your stack in the way you work with every point release of the model, you are in a way missing out on what it is that you can achieve with these tools.

New constraints after code-production constraint dies

The reasons that the way we had built software. Has changed and continues changing at an increasingly rapid pace because we have kind of upended some of the core axioms of what it means to build software. Right now, I'm telling you, the models are good enough in order to do significant parts of the software engineering life cycle, not just writing code, but debugging, triaging, responding to customers, planning, scheduling work, all these other bits that are outside of the core, would you say production function of a software engineer. A lot of the way we have tooled teams and organizations and roadmaps have built around this idea that the production of code is this very, very expensive thing that is going to dominate most of our headcount resources and is slow. And in this world where we can give a call to a coding agent and get a PR or six out of it, that constraint is no longer true. And we kind of have these teams who were doing the bulk of the production for software that need to figure out ways to increase their leverage by delegating increasing parts of that responsibility to these machines. So for all the software engineers and engineering managers and product managers and designers who are trying to incorporate this technology into your work, all of your goals are to be how to unblock your execution team, these coding agents from being able to make your ideas, your vision, and your products a reality.

So having just told everybody here that the core constraints on software engineering no longer apply, what are those core constraints, right? We have kind of a new set of problems to contend with using agents in order to produce our software. And to me, these three things are the foundational limits that remain in a world where we are, as a team of humans and agents producing software. Human time is the fundamentally scarce resource that we have. I know IMAX out probably had three concurrent sessions on my laptop if I want to be more parallel and have higher throughput, I must find ways to remove my own synchronous attention from the process. Human and bottle attention are these foundational limits in the architecture of these LLMs, attention must sub to one thrashing of the agents by having them do more and more work with conflicting and overbearing requirements in the course of a task is something that is always going to degrade performance. Less and less over time, but it is one of those core limits of the models. So we need to retool the way we work in order to be more parallel fork off a bunch more tasks. Be willing to accept smaller or larger or many more PRs in order to let the agents explore what it means to do the job that we need them to do. And finally, you all probably deeply lift this bottle context window. Things that get bigger over time, still a scarce resource, something we need to protect. I will say in my own experience with the GPT series of models, auto compaction is fantastic. I never speak about a context window anymore. I can let a task go for six, 12, 36 hours and still get good results. But the context window being obliterated and rebuilt over the course of these auto compactions is something you must contend with and there are ways that we structure the context we give the model or continually resurface context to the model to deal with this constraint that context windows are continually being emptied and filled.

Writing down what "good job" means

Okay, so we've got these agents that we hope can produce more and more of our software that we can remove humans more and more from the loop in order to produce more code, more features, solve more user needs, address more critical user journeys with higher quality and fidelity. How do we make sure that we as a team with our agents do a good job? And I think. A newish thing here is we have to actually articulate that. We have to write it down. I know it used to be the case, oh, you know, we'll have people go to the office, we'll have meetings through osmosis. Like people will understand what it means for us as a team to ride high quality software to work effectively together. And agents just do not have that capability. They don't have presence in our stand up. They don't have this durable memory that accumulates context and battle scars over time. So we have to find ways to make all these nonfunctional requirements of writing good software legible to the agent. And as an llm, the thing that it craves, the thing that drives it is text. So figuring out ways to take the definition of what it means to do a good job and write it down is a net view function for a software engineering team in 2026. But it's not enough to just write things down. We need to make sure that this text is a thing that the agent will look at because it doesn't do much to say you will write reliable network code by making sure that retries and timeouts are consistently applied. If that text never makes it to the agent. So figuring out ways that not only we can write things down, but also have them pulled into context at the right time in ways that don't thrash the agent and still lead it to be creative and reason, which are the power of these models is the important thing.

The React/suspense onboarding analogy

So to kind of take a step back and look at some. In the small instances of context and what it means to kind of like think systematically and close loops for the models and for your team. If I were onboarding a new engineer to my team and we were, I was reviewing some react code that they had written. And I knew for this particular set of components, we use suspense because that leads to better performance in the front end. I would be able to give that feedback once to the human and they would incorporate it into their mental model to code base what it means for these different screens to relate to each other well. And I would largely solve that problem going forward by empowering my teammate to know more about what it means to do a good job. But I can't really do that with an agent in the same way. So I kind of have to step back, give that review feedback on an agent produced PR and then myself figure out a way to make these mistakes statically impossible going forward. And it might be the case that I'm looking at all those review comments seeing this is missing content that the agent was not able to pull in at the right time to know that the code that it wrote was misaligned. How do I figure out where I can write it down, what links I can have fail, what tests need to exist, whether or not I can power a viewer agent to look at all the proposed diffs through the lens of these guardrails to make it so that this feedback is actually durably encoded as a static guardrail that we apply to every PR going forward. It's not enough to do point in time fixes with these models. We want to make every mistake something that is just not possible. I never want to give the same review feedback twice.

Defining harness engineering

And this is really the core of what harness engineering is. Harness engineering is making context around what it means to do a good job legible. And then just in time surface to the agent over the course of its trajectories in order to steer and refine its output to make sure that every PR we get adheres to the golden thread of what we consider to be acceptable, high quality aligned software.

Shift right, not left

It's kind of funny when you work with these agents, then what I would normally consider to be like good practice or a dev ops and shifting left as far as possible in order to make things cheaper earlier in the process. I don't do that at all when working with agents. In fact, I try and put my interventions as far right in the process as I can. In order to minimize my own synchronous time having to engage with these issues. For example, if I'm working on a PR and I realize I get a bad result. It might just be the chase that'll trash it change my prompt and probably get something good out of it. But that's not really a durable thing. It's not reliable. I don't socialize those improvements to my team. So sort of the next level of shifting that left to write it down. And if writing it down is not enough. Writing down and then empowering or a few agent judge every dish is another way I can shift that left and then I can shift it left further into statically verifiable lints and guardrails and tests and on and on and on earlier in the process.

Pruning latent space

And we think about this as. Needing to surface to the agent. All those sets of nonfunctional requirements. It is not the case that these agents don't know how to write high quality software. They absolutely do. But as artifacts of their training, they have seen every possible permutation of every possible choice that goes into producing software. And it's up to us to prune latent space, to tell it which choices we want to make. If I am using these things to prototype a new data science model in a Jupyter notebook, I have a very different set of choices I make in the production of those gifts than I do if I am working on adding a new index type to a database. You just fundamentally different tasks. So it's up to us as owners of our code bases to make legible the sets of decisions that we make in order to produce our code. What it means for something to be a prototype versus production feature that requires a stage role like with maybe test and feature flags. And if we write this down and give the agent some tools to reason about what type of change is being made to find runbooks that are appropriate to refining its output over the course of its PRs and epics. We can give it bounds and context, but still give it the space to reason, be creative and cook.

Code-as-prompt; unify on patterns

One maybe not obvious thing is that because the agents crave text every bit of text that we feed them is in some sense prompting. It's going to inform what tokens get predicted, which means it's going to inform the code and the disk that we produce. This means all the code in the repository of itself outside of the documentation knowledge base is also prompts. So if we think about aligning the code base or unifying it all on the same patterns, we kind of limit the amount of attention the model needs in order to do a good job. If I am able to standardize on hotel across my entire stack for example, when the model stinks observability, it's able to translate context that is in one part of the repository over to something halfway across the code base without any loss of quality or intelligence. But if I have six of observability stacks in the code base, the model is going to have to spend a lot more time figuring out which one do I use here? Is this migrated or not? What is canonical be good?

Three phases of context delivery

So over the course of the pinaura there's sort of three phases I think about when we're talking about context delivery. And because we are curating the code base in order to make it efficient to deliver context to these agents, we also want to encode that in the operating loop we give the model.

Phase 1: Grounding (agents.md)

To me, the most important thing that ends up in that agents.md Is a numbered set of steps that we expect the model to go through over every rollout that we do, over every session. We first wanted to ground itself in the documentation knowledge base and the ticket that is proposed. We wanted to spider through our history of ADRs and design docs to figure out how this might impact other features of our code base. We want it to look at critical user journeys to inform itself around what screens and user services are impacted so it can keep the QA plan in mind over the course of its execution. We expect some amount of slowness during this process because we want to page in all the context around what it means for this feature to slot in globally.

Phase 2: Messy middle (just-in-time injection)

Then there's sort of a messy middle part of the run where the agent is writing code, running test, exploring the code base. And for that we exploit the fact that these agents are going to call a bunch of tools, run a bunch of tests in order to use them to just in time prompt inject the agent to steer its output back to baseline. The tests we write, the links we write for agents that are very different than the ones that we write for humans. They by default recognize that agents are going to truncate tool called outputs, but they respond really well to descriptive error messages that point them to run books for remediation steps. We are willing to have very many of these things that are kind of fiddly to write and I wouldn't normally think about. To go back to this sort of network code example, I am sure all of you have been paged at some point in your careers around an outage that boiled down to a missing timeout and a retry on a cross service network call. And the collective amount of engineering time that has been spent on this very common failure mode is astounding. But still today there's no code that asserts that we pass retries around. There's no like es links plugin that I can swap into my code base is going to do this for me. But because the production of code is very, very cheap now, we can absolutely vibe a set of guardrails into place. With 100% code coverage and exhausted table driven tests. And migrate the code base all in one go and just in time surfaces failure to the model every time it writes another fetch call and never have to worry about this again. And because we don't have to pollute context window up front and we can exploit the fact that a tool called alcohol is going to be given less weight during an auto compaction, just in time correct the model. And still let it go off and do the complex work that we want it to in our original prompt.
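Editor's sketch (not from the talk): the timeout guardrail described above could look roughly like the following. It is a minimal Python analogue of the "fetch call" check, assuming the code base uses the `requests` library; the runbook path and the "shared retry helper" named in the message are hypothetical. The point it illustrates is the short, descriptive failure message that tells the agent what to do and where to read more.

```python
import ast

# Hypothetical runbook path; the talk does not name one.
RUNBOOK = "docs/runbooks/network-calls.md"
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "request"}


def find_missing_timeouts(source: str, filename: str = "<string>") -> list[str]:
    """Return one short, actionable message per requests.<method>() call without timeout=."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
            and func.attr in HTTP_METHODS
        )
        if not is_requests_call:
            continue
        # A call that only forwards **kwargs is still flagged; the check is deliberately blunt.
        if any(kw.arg == "timeout" for kw in node.keywords):
            continue
        # Keep the message short (tool output may be truncated) and point at the runbook.
        message = (
            f"{filename}:{node.lineno}: requests.{func.attr}() has no timeout. "
            f"Pass timeout=... and use the shared retry helper. See {RUNBOOK}."
        )
        hits.append((node.lineno, message))
    return [message for _, message in sorted(hits)]


sample = '''
import requests
ok = requests.get("https://example.com", timeout=5)
bad = requests.post("https://example.com", json={})
'''
for message in find_missing_timeouts(sample, "client.py"):
    print(message)
```

Run as a test or pre-commit step, a check like this only surfaces text to the agent when it actually writes an offending call, which is the just-in-time behavior the passage describes.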

Phase 3: Review & merge

And then sort of after the run, we have a much easier task of determining whether or not the code, the diff, the artifact is aligned because it's a static thing and we have static sets of guardrails and can use very, very many elements to look at the code. Operationalize it with a set of free of static guardrails. This is what it means to write reliable code. This is what it means to write performant react. And make a determination. Is this good or bad? And if it's bad, why isn't that. Because the llms create text, these ll and as judges can collaborate with the implementation agent over that PR thread, give more text back to the implementation agent and further realign the proposed diff back to baseline.

agents.md as a map, not a rulebook

So we've got agents.md kind of as this map that shows where the context is during what types of work the model might want to look at that text. But otherwise not being prescriptive around any of the guardrails. We don't want to jam a ton of rules in here. Because we're going to chop up latent space too much. We're going to make it difficult for the model to spider through the code base with creativity. I find it very, very useful from this agents.md to point to a curated set of review personas that are essentially bolded lists of guardrails.

Slack-thread-to-PR loop

And I find this really, really neat from an interfacing-with-the-other-humans-on-the-team perspective, because it is so cheap as a team to have a Slack conversation in a thread around what it means to fix that performance regression and then @-mention the agent in it to say "yoink all of this and put up a PR that adds it to our static set of guardrails." So cheap in order to continually refine and improve the output of our agents in that way. I also think it's really neat to take that same sort of pattern and apply it toward documenting what your product features are, or what the critical user journeys are, or why their apps even exist, what user problems they solve. All this context that we can give the agent helps ground it in what we are trying to do and why, how our team thinks about working, because all of this is going to produce more and more aligned output.

Coarse tools in the messy middle

In that messy middle, we can kind of use. Tests on the EST tests on the structure of the files on disk. Really blunt hammers around file line counts or whether or not snapshot tests exist. These very, very coarse grain tools in order to make the model do what we know is good. Just requiring that every react component in our code base has a snapshot test that gives 100% branch coverage means that the model is naturally decomposing these things and making them pure where possible and not doing prop drilling and putting hooks close to where the data is used because that makes it easier for it to fill the requirement that there must be snapshot tests. And we can do this, we can assert this because it's free to produce the code that spiders through disk and matches up the snapshot test to the underlying component. Another failure mode that I hear folks talk about a bunch is that these agents are doing type shape probing all the time. I end up with these annies or unknowns all over the code base. And the way I've approached it is to just statically disallow any function that has a type of any or unknown unless it's parsing input in a route handler or from the database. Other than that with es languages ban the existence of that type would require the code base to be 100% typed, which means all this bad behavior and weird type probing just kind of falls out because we require 100% code coverage. These functions cannot possibly be exercised because the unknown types can exist. And we get more line code. More online code that I would consider acceptable, high quality, maintainable and all these other sorts of properties. And having these failing checks tell the agent why they failed and what to do instead means that it's able to self heal.
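Editor's sketch (not from the talk): the "blunt hammer" checks on files on disk could be as small as the following. The naming convention (`Foo.tsx` next to `Foo.test.tsx`) and the 300-line limit are assumptions for illustration, and the check only verifies that a test file exists, not that it contains a snapshot assertion or reaches any coverage target.

```python
from pathlib import Path

MAX_LINES = 300  # assumed limit; the talk does not give a number


def check_components(root: Path, max_lines: int = MAX_LINES) -> list[str]:
    """Walk the tree and report components lacking a sibling test file or exceeding the line cap."""
    failures = []
    for component in sorted(root.rglob("*.tsx")):
        if component.name.endswith(".test.tsx"):
            continue
        rel = component.relative_to(root)
        # Assumed convention: Button.tsx is covered by Button.test.tsx in the same directory.
        test_file = component.with_name(component.stem + ".test.tsx")
        if not test_file.exists():
            failures.append(f"{rel}: missing snapshot test. Add {test_file.name} next to it.")
        line_count = len(component.read_text(encoding="utf-8").splitlines())
        if line_count > max_lines:
            failures.append(
                f"{rel}: {line_count} lines exceeds the {max_lines}-line limit. "
                "Split it into smaller components."
            )
    return failures
```

Each failure says why the check failed and what to do instead, which is what lets the agent self-heal without a human stepping in.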

Treating agents as teammates at merge

Ultimately, as we move into that third phase of review and merge, we want to treat the model as if it's another member of the team and it needs to convince me to merge its code. I'm not shoulder surfing. I mean, my teammates and VS code or fim ray when they put up a PR and they attest that they tested the code, I take their word, you know, and if I am unsure, I'll ask them to show me the looks and the staging deploy or to post a screenshot of them exercising the feature in the app. And we can require these agents to do the same thing. It does a lot easier these days now that we have things like computer use and browser use, Claude is fantastic, highly recommended. But even without that, you know, vibing yourself up on xcorks, connected headless display in a docker container and wiring up ffm picked that stream to record a reproduction video is within reach because I don't care about how gross this code is. And Claude is able to sling ffmpeg better than anybody in this room probably.

On the back half of things where we are. Looking for ways to accept the dish. I'm treating it again like I would my human teammates benefit of the doubt biased toward merge. What are the p2 and above things that would be necessary for me to accept this code? Use the reviewer agents, which is really just a matrix CI job that points out a bunch of markdown files. To judge this thing, drive these remediations to completion. Get the coding agent to pick them up. Implement it, get the reviewers to be happy and off we go. And this sort of process with me observing along the way of which review feedback is regularly getting surfaced. Why is it making into this part of the pipeline? Maybe I need to use that as a signal that I need to shift some of these guardrails left. And then I can spend my time at the reviewer agent that spend their time on more bespoke or more complex changes that we need them to look at.
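Editor's sketch (not from the talk): one way to read "a matrix CI job that points at a bunch of markdown files" is one review job per persona file, with only P2-and-above findings blocking the merge. Everything named here is an assumption: the `Finding` shape, the `Judge` callable that stands in for whatever LLM call a team uses, and the convention that priority 0 is the most severe.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Finding:
    persona: str
    priority: int  # assumed scale: 0 is most severe, so "P2 and above" means priority <= 2
    comment: str


# Hypothetical judge signature: (persona name, guardrail text, diff) -> findings.
Judge = Callable[[str, str, str], list[Finding]]


def build_matrix(persona_dir: Path) -> list[dict[str, str]]:
    """One matrix entry per persona markdown file."""
    return [
        {"persona": path.stem, "guardrails": path.read_text(encoding="utf-8")}
        for path in sorted(persona_dir.glob("*.md"))
    ]


def blocking_findings(diff: str, matrix: list[dict[str, str]], judge: Judge) -> list[Finding]:
    """Run every persona over the diff and keep only findings that should block the merge."""
    blocking = []
    for entry in matrix:
        for finding in judge(entry["persona"], entry["guardrails"], diff):
            if finding.priority <= 2:
                blocking.append(finding)
    return blocking
```

In a real pipeline each matrix entry would run as its own CI job and post its findings back to the PR thread for the implementation agent to pick up; the loop here just keeps the sketch self-contained.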

Systematizing feedback capture

Another thing that you should be thinking about doing as a team is how to systematize capturing all of this human feedback. Every review comment, every time you have had to interrupt the agent, every agentic intervention, every failed build, every exception in production: all of these in some sense are signals that context was missing to the implementation agent, that it did not consider the full end-to-end consequences of the code that it wrote and whether or not it would be successfully deployed. And what we are trying to do, which I expect you'll learn about in the next talk, is slurp all this data up and dream over it every night, pointing a bunch of sub agents at it, trying to distill whether or not there's anything that humans can do better in their prompting, whether there's missing guardrails that should exist in the code base that disallow this behavior, and how we can get to a world where we're more and more headless, less human interrupt dependent, and able to trust the agent to do more and more complex things headlessly.

Vibe coding's role

I think vibe coding is a big part of what it takes to be successful here because there's a ton of guardrails that only affect my local development process. This code can be gross, but it brings into possibility this idea that I don't need to care about some parts of the software production function. This lets me operate like a group tech lead or an org lead where I don't have visibility into every single engineer's activity on the keyboard. But the things I care about are invariants, interfaces, and whether or not the components that they're producing do what they say on the tin with high reliability. And with that, I'll just leave it with: y'all can go build things. These tools are fantastic. Go get after it.

Q&A — shift-right vs lint rules

Take some questions now. Hold on one second.

[Audience member]: Hello, great talk. You mentioned earlier in your talk that you find that you don't need to shift left as much as before; you stay more right. And I'm curious about that, because isn't it better for agents to see something in a lint rule rather than as we give feedback, for example? Like, what do you mean by staying more right and not shifting left?

So I think once you kind of put these structures in place to surface these requirements to the models at the right time, it becomes pretty easy to rely on them for the most part to auto discover this stuff. It is very often the case that our agents on MD paints a picture of which guardrail files are relevant for which categories of changes back end working on the design system, these sorts of things where the models will just naturally page those sense of persona oriented guardrails into context. Which means I very often just don't see patterns of misbehavior in that way. Only if, for example, guardrails are commonly required over tasks that span 15 context windows and by then the context in those guardrails files has been auto compacted away, then that's the sort of thing that I would use as a signal that, okay, this is the thing that I need to shift my further on. But I do recognize here that it is sort of predicated on making sure that like those auto discovery functions are things that are reliable.

Q&A — practical implementations

Time for one more here.

[Audience member]: Is there a practical implementation of those of the harness you just mentioned in terms of some end-to-end implementation of those capabilities? Before, during and after.

I have started to bring some of these techniques to my open source work. I used to long ago build a Ruby interpreter in Rust called artichoke. There's a bunch of crates out of that work that I still actively maintain. Probably the most interesting one for you to take a peek at is rand mt artichoke slash random T. It's sort of a percent twister implementation. Been doing a lot of fun stuff exploiting automations in the Claude app to basically take my hands off the wheel for a ton of the maintenance tasks of this OSS work. I haven't quite gotten to putting those review agents in place yet, but it's coming.

Close & post-talk chatter

Any final questions? No? Okay, big round of applause for Simon. Thank you. Thank you. Next talk starts in just about 10 minutes in this room. Other rooms are available, but this room is the best. I'm. Alan. Nice to meet you. Great. To meet you. Wow. So. You have the black. Box. Say. My other stuff like. This. One. Yeah. We've got jump. S down. Yeah. There's no. Point. With. All the nice.
