AI Native DevCon 2026 London — all conference sessions as interactive skills
⚠️ Speaker-label warning. This transcript has no per-speaker labels. The body is a single continuous block. Almost all of it is Luke Marsden delivering the talk; the final segment is Q&A in which an MC, audience members (one named "Samuel"), and Marsden are all run together without breaks. A third party "Ivan" is mentioned but not present. When attributing quotes, prefer "Marsden said…" for the main talk and "an audience member asked…" / "in response Marsden said…" for the Q&A. Do not invent attributions. Speech-to-text artefacts (e.g. "didn't analogs business" likely = "Dotscience", "gas tower" likely = "Gas Town", "Tesla conference" likely = "Tessl conference", "type cluster exploding" likely = "tech cluster exploding") are preserved verbatim below.
Thanks very much. It's nice to be here. Yeah, I'm going to talk about basically how we use agents ourselves and the things that we built to do it better. So just to introduce myself, I'm Luke Marsden, I'm self-identifying as a client human. I'm also the CEO at Helix, where we do private agents, and before that I didn't analogs business. I was SIG cluster-lifecycle lead for the Kubernetes project, and I had sort of a DevOps company doing storage for Docker and Kubernetes back in the early Docker days. So my thesis is that all information work is eventually going to become about managing agents. And it's just software engineering where that's really started, and it kind of makes sense why it started in software engineering, because that's also where AI came from. But eventually everyone who is moving information around in any way, basically all white-collar workers, are going to end up having to interact with agents one way or another. It's a bit difficult to see on this slide, but this is from Steve Yegge, who is the creator of gas tower, and he talks about how there are these different stages of AI adoption that people go through. And you go from just kind of using chat completion, like GitHub Copilot, through to running, like, a single CLI agent, like Claude Code. And then you kind of get bored of babysitting a single agent running on your computer, at which point you're like, oh, I should be able to go faster if I run more than one of these in parallel.
And that's where the problems start, because the naive approach to that is to just run many agents on the same working directory. And then you end up in horrible situations, like when I was rushing to prepare a presentation for a customer in Paris and I was up early on a Monday morning and I was trying to get my astronomy around five agents in parallel to get everything done. And one of them decided to git stash the work that all the other ones were doing. And then the following day a different agent decided to rm -rf . on my git checkout. And this actually happened. So I've been working since 2023 on making it possible to run agents and LLMs and RAG and all that stuff around it in a way that runs entirely on your own infrastructure. We're working with people who care about that. And so kind of late last year I got really obsessed with this idea of making the snake eat its own tail by actually using our own stuff that we were building to build itself. And we were also lucky to have customers who wanted to push us towards the sort of coding agent side of the equation.
And so we've spent a lot of time thinking about kind of this design space for systems that run agents, and in particular systems that run agents on the server. So all I want to say here is, and we'll go through each of these in turn in the slides: content warning, contains opinions. And the reason I'm sharing these opinions is that I hope they will be helpful to anyone else who's building systems like those. So I want to kind of both lay out the shape of the landscape and then also carve out some specific opinions, but also why I came to have them. Because you might have different opinions. Or you might not have thought about some of these things.
So the first big decision, in my opinion, is: do you want to carry on letting every developer run whatever agent they like, or maybe the same agent that everyone uses in the company, but on their own specific development environment that's kind of like a snowflake? Or do you want to create a pool of agents that run on your organization's infrastructure that many humans can interact with? And my strong argument is for going with the centralized approach. And when I say centralized, it doesn't mean you have to, like, hand over all your data to OpenAI or run on their infrastructure or anything. You can still run it on your own infrastructure, on Kubernetes or something. But there are some significant benefits from having a centralized approach. For example, you can have a globally distributed team all around the world, and you can say that when the sun sets in Tokyo and rises in London, a different human can carry on doing the work that a certain agent was working on, especially if that agent truly has its own computer for that specific task. And so there's a really nice article written by someone we've been working with at a company called Devicon. And what they said was, to be honest, agentic coding is not new. Going from doing it the easier way, which is the way on the left, which is unsafe and per-engineer, to something global, per-company and safe is what they found interesting in this approach. So that's kind of my argument. My opinion number one is you should give each agent their own computer, not each human. It's almost like you wouldn't hire a team of software developers and ask them all to share one computer. You should give each agent their own computer.
The next opinion that I've come to is on the question: do we still need an IDE? Claude Code started making us, like, not look at the code so much. And I genuinely feel like it made me stupider. My opinion is that you should still have an IDE. These agents will increasingly do long-running tasks, but when it comes to actually stepping in and pair programming with an agent, you do still need to be able to look at code. And having a visual display that follows what the agent is doing is actually really useful when you want to understand what it's doing. Another thing that really matters is that things have to be fast. I got incredibly frustrated with Cursor, back when I was a Cursor user, at just how slow it was at handling keyboard input. And then Claude, sorry, Anthropic came up with Claude Code. And do you know Claude Code is implemented in React? And so they've ended up having to implement basically a web browser in the terminal. It's just, like, can't we just have a working text box?
So anyway, the next thing I want to talk about is how you organize your agents. And there's kind of two different schools of thought here. The first one is: do you scale your agents out by number of tasks? Or do you try and build, like, an organization out of your agents, and you hire a CEO agent and they hire a VP of engineering agent and the VP of engineering agent hires some engineers and you give them all names or something? And we started with the left-hand approach because it's easier, but we're researching the right-hand approach. But so far what we've found in the research is, if you go too fine-grained on giving each agent their own specific role and you give them these special channels they can communicate on, like agent Slack or something, they end up kind of devolving into enterprise politics, because they were trained on all of this human data of how humans argue with each other about stupid stuff. So you end up burning a lot of tokens like that. But what I actually think is ultimately going to be appropriate here is to combine the models, where you have these coarse-grained categorizations of roles, like marketing agents and sales agents and engineering agents, because they will need different tools and connectivity to different systems. But within those, you scale the agents by task. Let's say you have a sort of pool of bees maybe and so on.
So yeah, I also wanted to share some demos as I go to illustrate the design decisions I'm talking about. So I'll start with this one. So what we've put together here is basically this idea of scaling by task. And the notion there is that Kanban is quite a natural fit for scaling by tasks, because you want to constrain the parallelism. So you can kick off, this is a sample project that has, let's say, a to-do list app. And what you can see hopefully is that by clicking play on the Kanban board, we've spun up three separate computers for those three agents that are working on those three different tasks. One of them is adding a dark mode toggle, another one of them is fixing a bug, another one is adding, like, customized categories. And so what you can see here is that we ended up going down this rabbit hole of, like, okay, if you want a good background agent experience that feels like the foreground agent experience, you end up implementing GPU-accelerated desktops, because non-GPU-accelerated desktops are not very nice to use. But if we're doing AI infrastructure, we've got some GPUs lying around. It just so happens that the same GPUs you use for inference, a lot of them also support, like, WebGPU; they were originally for graphics. And they also support hardware video encoding. So we borrowed a lot of the technology from the gaming industry. I don't know if anyone's ever played, like, computer games in the cloud. It's actually possible to get pretty low latency and good performance. So we're using the same tech as that. But basically what's happening here from a process perspective is the agent has started up inside the IDE. We use Zed because it's fast. And we also forked it so that you can remote control it. So we can inject prompts into the Zed agent. And the other thing that's nice about Zed is that it supports MCP, which means it integrates with all of the major agent harnesses. So it works with Claude and Claude Code. We're not trying to reinvent that at all.
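Editor's sketch, not shown in the talk: one way to picture how a Kanban board constrains parallelism is a work-in-progress limit that decides how many agent desktops get started at once. The `Board` class, its `wip_limit` field and the method names are hypothetical and say nothing about how Helix actually implements its board.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Board:
    """Hypothetical Kanban board with a work-in-progress limit."""

    wip_limit: int
    backlog: Deque[str] = field(default_factory=deque)
    in_progress: List[str] = field(default_factory=list)
    done: List[str] = field(default_factory=list)

    def add(self, task: str) -> None:
        self.backlog.append(task)

    def start_ready_tasks(self) -> List[str]:
        """Move tasks into progress until the WIP limit is reached.

        In a real system each started task would get its own agent
        desktop; here we only return the names that would be started.
        """
        started = []
        while self.backlog and len(self.in_progress) < self.wip_limit:
            task = self.backlog.popleft()
            self.in_progress.append(task)
            started.append(task)
        return started

    def finish(self, task: str) -> None:
        self.in_progress.remove(task)
        self.done.append(task)


board = Board(wip_limit=3)
for name in ["dark mode toggle", "fix delete bug", "custom categories", "export"]:
    board.add(name)

first_batch = board.start_ready_tasks()   # three tasks start, one waits
board.finish("fix delete bug")
second_batch = board.start_ready_tasks()  # the waiting task now starts
```

The point of the limit is that pressing play on a large backlog does not start an unbounded number of desktops; a slot only frees up when a task leaves the in-progress column.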
Then, yeah, so the next thing is, like, well, what's that agent actually going to do when it starts up? And this is a Tesla conference, so it would be bad not to talk about spec-driven development, but I'm actually a huge fan of spec-driven development. So the approach that we've taken is that the spec is central to what's going on. The human prompt that comes in is something really short, like add dark mode support to my to-do list app, or something kind of really trivial. And that prompt goes into the agent. It's actually the same agent; it's just that the agent has an explicit planning phase and a later implementation phase. So we prompt the agent to say, please write a plan. Because it turns out that if you get the agent to write a plan before it does the work, you get much better results, because you can then correct it and you can kind of do that design work before the agent has gone off in the wrong direction, when it's harder to steer it: it's in the middle of doing implementation and it's misunderstood something fundamental. So you put the prompt into the planning phase. And then you have the human in the loop commenting, which you'll see in a second. But the spec is the combination of the human prompt and the planning agent reading code. So I don't know if you can see this red arrow here. We've got a git checkout for the project, which is one of the repos you've attached. And the job of the plan phase is, based on the user's requirement, to write a good spec based on also looking at the code, so that you understand what you're going to do in quite a lot more detail than the user specified. And then once you approve the spec, so there's a human in the loop commenting on the spec, there's also a human in the loop on approving the spec to say, yes, I'm happy with that spec. Then you go to the implementation phase. And because each agent has its own desktop environment, you can now do some interesting things, like the agent can test and QA it, if it's a web application. And lots and lots of things are web applications. So you can then have the agent put up a PR on GitHub, and then also have a human in the loop review both the PR and the running application in that agent's own computer. And because each agent has its own isolated computer, they're actually Docker-in-Docker environments or Docker Kubernetes environments, they're not treading on each other's toes when they're bringing up the Chrome instance and MCP to navigate around.
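Editor's sketch, not shown in the talk: the plan, review, approve, implement, review flow described above can be pictured as a small state machine with human gates. The class name `SpecTask`, the `Phase` values and the method names are hypothetical, chosen only to mirror the steps Marsden describes.

```python
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    SPEC_REVIEW = "spec_review"
    IMPLEMENTING = "implementing"
    PR_REVIEW = "pr_review"
    DONE = "done"


class SpecTask:
    """Hypothetical task record that moves through human-gated phases."""

    def __init__(self, prompt: str) -> None:
        self.prompt = prompt          # short human prompt
        self.spec = ""                # detailed spec written by the agent
        self.comments = []            # (line_no, text) review comments
        self.phase = Phase.PLANNING

    def _require(self, expected: Phase) -> None:
        if self.phase is not expected:
            raise RuntimeError(
                f"expected {expected.value}, but task is in {self.phase.value}"
            )

    def submit_spec(self, spec: str) -> None:
        """Agent finishes (or revises) the plan and hands it to a human."""
        self._require(Phase.PLANNING)
        self.spec = spec
        self.phase = Phase.SPEC_REVIEW

    def comment(self, line_no: int, text: str) -> None:
        """Human comments on a spec line; the agent goes back to planning."""
        self._require(Phase.SPEC_REVIEW)
        self.comments.append((line_no, text))
        self.phase = Phase.PLANNING

    def approve_spec(self) -> None:
        """Human approval gate: only now may implementation start."""
        self._require(Phase.SPEC_REVIEW)
        self.phase = Phase.IMPLEMENTING

    def open_pr(self) -> None:
        self._require(Phase.IMPLEMENTING)
        self.phase = Phase.PR_REVIEW

    def approve_pr(self) -> None:
        self._require(Phase.PR_REVIEW)
        self.phase = Phase.DONE


task = SpecTask("add dark mode support to my to-do list app")
task.submit_spec("spec v1")
task.comment(12, "also respect the OS colour scheme")
task.submit_spec("spec v2")
task.approve_spec()
task.open_pr()
task.approve_pr()
```

Calling a transition out of order, for example `approve_pr()` before a pull request exists, raises `RuntimeError`, which is the property that keeps a human in the loop at each gate.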
So yeah, I'll show you this spec review phase. So here we had these three tasks, or four tasks, and like 30 seconds or a minute later, the system's agents have each created their own spec. And so if we go and look at the bug where deleting a to-do item from the to-do list doesn't delete it from storage, what you can see is it's come up with this spec. So it was given that simple prompt, there's a bug with deletion, and it's gone and written out a much more detailed document. And then you can comment on the document, because I believe that the terminal, like Claude Code, is actually a terrible place for reviewing and commenting on documents. It's much nicer to use a Google Docs style UI, which is why we added one. So you can see there's this kind of wrapper around the agent where you can comment on a line in the document and it will push changes to the spec. And the specs are all just markdown files in a git repo. And we put all those markdown files on a special branch that's separate from the rest of the repo, so all the agents can see all the other specs that all the other agents are working with. So they're pretty isolated from each other, but they can all see the history of all the work that's happened or is happening. So then we fast forward this a little bit. And what I did was I added a request: on deletion, let's also do, like, a fiery CSS animation, so it looks like the item is burning. And then I can review the rest of the plan. There's basically a to-do list. And then that's the human approval gate. You click approve design. And again we put the prompt in there, and then it spins up the actual implementation part. So I've got 15 minutes left. But then you can see this is the really cool part. The agent starts QA'ing the application. I'm not typing here. The agent is typing buy groceries and walk the dog, and then watch: it's going to delete buy groceries, I think. And then we get to see the animation. And I'm like, that wasn't a very good fire animation.
So I'm going to prompt it and I'm going to say that was a bit rubbish. I want you to make it look like it was actually burning in hell, is I think the phrase I used earlier. So, yeah, the CSS animation was not cool enough. And then you can press enter twice and it will interrupt the agent that's running in the background. So it does some thinking, it comes up with some better CSS animation. And then you can kind of collaboratively QA what the agent has done, directly in the browser. And it's pretty cool. So it's actually kind of nice to give the agents their own computer and to have a high-quality way of interacting with it that you'd actually want to use. And just to give you a sense of what the board looks like after this, you get to see that you've still got these items to review the spec on, but then you've got the in-progress task over here.
So there's a few other things that we learned from kind of the changes that come about when developers go from wanting to spend all of their time looking at their IDE to only needing to interact with each agent for, like, a few minutes every hour. Of course they can parallelize to try and do more, but they can also go to the gym and do some exercise while their agents are doing work. And so we also wanted to optimize for the mobile experience. So we made sure that everything works on mobile and on tablet. I put a LinkedIn message out that was like, our niche pitch is that this is the best way to run Zed on your iPad while you're at the gym. And I had one person commenting, like, I feel seen. So I guess that's our target audience. Anyway. So the other interesting thing about this is that, notice, this diagram I put up earlier had many users looking at the same screen, or many users looking at the same agent. And so we added this kind of Figma-like ability to have multiple mouse pointers. And so multiple humans can basically pair on the same agent's desktop, and they can join in and they can both interact with it. And one of those humans might be at the gym on their iPhone. Anyway.
I'll go quickly. The next thing that we found was really important, and it actually stopped us using it. Remember I was saying we wanted to have the snake eat its own tail. We wanted to dogfood this to make it build itself. The next pain point we had was that it took 40 minutes to build our stack from scratch in Docker. And there's really no point being able to spawn agent sandboxes rapidly if those agents are going to take 40 minutes before they can do any useful work, because they actually want to be able to test and QA the application that they're changing themselves. So we did some quite interesting things with ZFS as a file system to make clones. Most development environments are in Docker; that's our kind of working assumption. And Docker-in-Docker can actually go 16 levels deep, which is crazy. We don't go 16, but we go, like, I think three. But that means that you can give each agent its own Docker environment, and we also prime the data directory in that Docker environment with the latest version of the startup script running against the latest main commit. So every time an agent starts up, it's got a really fresh, fully cached Docker environment to start up from.
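Editor's sketch, not shown in the talk: the talk does not say which ZFS operations Helix uses, but the usual way to hand out fast copies of a pre-built data directory is a snapshot followed by copy-on-write clones. `zfs snapshot` and `zfs clone` are standard OpenZFS subcommands; the dataset names and the helper functions below are hypothetical, and the helpers only build command lines rather than running them.

```python
from typing import List


def prime_commands(golden: str, snapshot: str) -> List[List[str]]:
    """Snapshot the primed dataset once, after the stack has been built
    against the latest main commit (hypothetical dataset layout)."""
    return [["zfs", "snapshot", f"{golden}@{snapshot}"]]


def clone_commands(golden: str, snapshot: str, agent_dataset: str) -> List[List[str]]:
    """Give one agent a copy-on-write clone of the primed snapshot.

    A clone shares blocks with the snapshot, so it is created quickly
    and only consumes extra space as the agent writes to it.
    """
    return [["zfs", "clone", f"{golden}@{snapshot}", agent_dataset]]


# Assumed names: a golden Docker data directory and one dataset per agent.
prime = prime_commands("tank/docker-golden", "main-latest")
clone = clone_commands("tank/docker-golden", "main-latest", "tank/agents/task-42")

for argv in prime + clone:
    print(" ".join(argv))
```

Under these assumptions, the 40-minute build is paid once when the golden dataset is primed, and each new agent starts from a clone that already contains the cached images and build output.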
And so this is how we use Helix to build Helix. So here we've got an example of a task done, which was, can you review this external contribution? And, like, lots of other tasks in progress, like we're doing Notion integration for one user. The desktops shut themselves down after an hour of being idle, so I can start this one back up and then interact with it. And I think I sped up the video here. But basically everything we do now is done inside the tool. And so we've achieved that goal of kind of the snake eating its own tail. And most of my day now when I'm doing technical work is that I'm reviewing the specs, and I find two lines in the spec where the agent got it wrong or is making a poor design decision or is unaware of some other area of the code. And I just comment on those two lines, and it then becomes fairly mechanical for the agent to get each code change all the way through to sort of being closed. And then the pull requests that come out of it are quite nice, because we put quite a lot of effort into telling the agent to take screenshots of changes that it's made. So I quite often review pull requests now by just looking at whether I like the change. And that's quite powerful. And we also link automatically to the spec that was used to construct it.
Now I want to talk briefly about token costs, privacy and Donald Trump. The type cluster exploding and we spend a lot of money now on tokens, which as a bootstrapped startup is kind of painful. But we also have one of these machines. These open source models, like Llama 3.1 and so on, can now do pretty much 80% of the work that you need to do. And so if you take, as a meaningfully sized organization, kind of your next three-month token budget and instead invest in some hardware with, like, 8x RTX 6000 Pro in it, and then you use something, or you build something, that allows you to easily switch between different agents and different models, then you can avoid getting locked into, like, the OpenAI and Anthropic ecosystem, but you can still kind of burst out to Claude Opus 4.1 if you hit something really hard. But for all of the tedious stuff you should be using local models, where you just have to pay for the electricity. In my opinion.
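Editor's sketch, not shown in the talk: the "local by default, burst out when it is hard" idea can be expressed as a tiny routing function. The model names, the `Model` type and the `pick_model` signature are hypothetical placeholders; how a task gets classified as hard is left open, as it is in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    hosted: bool  # True if requests leave your own infrastructure


# Placeholder names, not real endpoints.
LOCAL = Model("local-open-model", hosted=False)
HOSTED = Model("hosted-frontier-model", hosted=True)


def pick_model(task_is_hard: bool, allow_hosted: bool = True) -> Model:
    """Default to the local model; burst out to the hosted one only for
    hard tasks, and only if policy allows data to leave the premises."""
    if task_is_hard and allow_hosted:
        return HOSTED
    return LOCAL


routine = pick_model(task_is_hard=False)
difficult = pick_model(task_is_hard=True)
private = pick_model(task_is_hard=True, allow_hosted=False)
```

Keeping the choice behind one function is what makes switching cheap: agents ask for "a model" and the policy, not the agent, decides where the tokens are spent.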
Now the very last thing I'll say is that I'm intrigued by this idea of building a self-improving business. At the bottom layer you've got a self-improving code base, which is what you can kind of already do here by putting software issues into a Kanban board. If you then plug that into agents that deal with product and support roles, then you can start automatically improving the product based on user feedback. And then if you also add agents that can do sales and marketing and finance and legal things, as well as kind of the founder-level stuff of, like, what's the hypothesis, what's the direction of the market and so on, then, and all of this is with humans in the loop, you can start, like, going a lot faster with fewer people than you could before, which is really interesting. And so the very last thing I'll show you is we also use this to, like, log into LinkedIn. And it's fun because you're collaborating with the agent. And so the agent can be like, oh, can you do two-factor auth please? Now, LinkedIn thinks I'm human. And I kind of am. I'm a human that's, like, using an agent kind of like a mechanical suit to do a lot more LinkedIn outreach than I would normally have the patience for. And so I got it to build me a list of 200 people I wanted to reach out to when I was visiting the Bay Area recently. And I actually got a bunch of good meetings. I wouldn't have had the patience to do that by hand. So it's pretty useful.
Coming up to time. So just to recap, we've looked at the design space of systems that run agents. I hope this has provided some useful ideas. Talked about local versus centralized. Obviously my opinion is that we should go centralized. My opinion is that we still need, like, an IDE. Task scaling seems to be the most appropriate place to start, but I think the org shapes are interesting. Spec-driven development is a must. Make things mobile, multiplayer. Get your dev environment setup sorted. Think about how you can reduce token costs, because if it's not hurting now it's going to be hurting soon. And also think about how you can apply it beyond software engineering. So thank you very much. That's me on LinkedIn, or if you'd just like to compare notes, I'd love to chat. So thank you so much. Incredible. Thank you very much.
Do we have any questions from anybody? Oh, you've got two. So our assistant will get the mic and run it down for you. We've got one this side and one this side. Thank you. Yes. All right. Yeah, that was a really, really cool talk. And yeah, I think one of the things that I was interested in is the kind of security aspect. So obviously agents can do lots of things, and you just let it log into your LinkedIn and play around with that. Have you got any guardrails and security things in there, and to what extent? Because I find one of the hard things is letting agents run wild versus making them secure but then having to have lots of input manually.
Well, I think my first argument there is that we're already doing better than, like, OpenClaw on your Mac mini. When I saw the OpenClaw 1Password extension, I was like. So by just logging it into one website, it's only got one login and it can't go and log in to everything else. So that's step one. Now you can also configure on a per-project basis what MCP servers you expose. But I think there is definitely a need for, and I think Ivan's working on something interesting in the space, like, there's definitely a need for tooling that helps you lock down what you're giving permission to and what not. It's not something we necessarily want to try and bite off as well. I think we'd rather partner with people who are doing good things in that space. But it's definitely a need. And the more people who ask me for, like, a governance for that agent control plane, I'm like, oh, we need to put the governance panel in our thing. So yeah, especially. Thank you. For.
Samuel. I'm just really curious how the GPU VMs work.
Yeah, so it's pretty interesting. The way it works is that we run Mutter, which is a Wayland compositor, inside a Docker container. And well documented. And then we use the GStreamer plugins to... there's this project called Wolf that does this for the gaming community, that was written in C++, and for a long time we tried to use that. And then we found that, honestly, I got really bored of trying to write C++, and it was pretty flaky. It ran the whole thing in, like, one process for all of the different containers. So when it crashed it brought everything down. So we ended up just taking the core pieces of that, which is this Rust plugin for handling NVIDIA CUDA stuff in particular. And this is all on GitHub if you want to go and have a look. But yeah, also feel free to ping me. I can tell you more about that.
You talked about still wanting an IDE early on. What do you find you want to do with that? I'm assuming that it doesn't need to look like VS Code from 2020. So can you say a bit about what it does have to do?
Yeah, I mean we just embedded Zed, which is a really nice IDE written in Rust that I like because it's fast. And also when you're running hundreds of these things on a machine you care about the memory you've got currently. I really like just being able to watch the agent zip around between different files. And also I find that watching, like, a flow of the agent doing the work, you just ambiently absorb more knowledge about the code base as it's doing that, versus the black box that is Claude Code these days. And then when something gets difficult and the agent actually needs help, then you've already got a decent IDE that you can search for things in and look around. So my take is you need an IDE on the inside. You also need the meta IDE, which is like the control plane for all of the different agents you're running, and the higher-level view. Thank you. Thank you so much. Yes. So next up we've got back in the box who's going to be talking to us about co-develop cells, the rise of the agent. Use one of these. Yeah it's easier.