Speaker labels unavailable. This transcript was supplied without per-speaker labels. The body is a single speaker — Katie Roberts — delivering the talk. The final portion (from "We got a couple of minutes for questions" onward) contains unlabelled audience questions interleaved with Katie's answers. When attributing, prefer phrasing like "an audience member asked..." for questions and only attribute substantive content to Katie. Speech-to-text artefacts (e.g. "Katie Cullen" likely = "Katie Roberts"; "$350,000 lines" = "350,000 lines"; "Damian. Deming's" = "W. Edwards Deming"; "drive thru digital" likely = "deterministic simulation" or similar) are preserved verbatim — do not silently correct.

Opening — infrastructure metaphor and the MinIO motivation

Fundamental. Block for. How people want. To locally. So. Hi to everyone who's online as well. I love tests and I've always liked tests. With AI, I enjoyed thinking about testing a lot more. I live in the east of England in the middle of nowhere. This is our local infrastructure. This is the Roman road. This is what happens if you don't maintain your infrastructure. I love infrastructure software and it's kind of my thing. And a while ago last year, MinIO stopped maintaining their S3 clone, object storage product. Lots of people complained and said product. I'm very obsessive about object storage. I gave this talk at KubeCon a few years ago. It was one of the most watched talks from Katie Cullen about object storage.

The experiment setup

And earlier this year I was kind of looking for a project to experiment with this. I wanted to understand what happened if you tried to use AI to write really large complex systems. Because it was great fun writing small tools with AI and you could write them really fast in half a day and it was exciting. But like could you really build huge things? Object storage to give you an idea. Most of the implementations I've looked at take around at least two years to write. Some of them a decade. These are big projects. And so I thought well. How hard could it be? I might just give this a go because maybe I can do it in a week. And like I knew there's going to be a big project. I was kind of excited about riding a big project. At the moment it's $350,000 lines of rust. So I knew I wasn't going to read the code. If only ever half learned rust anyway. And reading the code is kind of annoying. It's going to be a huge and complex and I don't necessarily need to understand it at all, but I know I want high quality code and that was my thing. But I knew there was plenty of scope for things to go wrong. I was building distributed systems, complex stuff. And I was going to reverse engineer this entire thing by talking to us, training how it works with my AI. So it was like this was going to be kind of fun.

Goals and constraints

I didn't want to broadly understand the code and be in the loop always and really understand what's going wrong. I had architectural opinions about how I wanted it to work. I didn't want the thing to collapse into chaos as it got larger. I mean I talked to a lot of people who said, yeah, after 100,000 lines of carriage, the AI can't do anything anymore. It's terrible. Give up. Security a lot and performance engineering, things like that. And so I've got opinions about security good. And I was trying to not automate everything up front because I wasn't know what went wrong. So doing a real human in the loop thing. And the other kind of question I want to answer for myself was like what can we achieve with AI coding? Can we be really ambitious and build things we never would have tried to build? You know, I know people who build database startups and it takes a decade. It's kind of a hard slog. It's like if we can do this in less time, we can build more of these interesting systems like building and using. So that was what I tried to find out.

Quality via testing; against 100% coverage

And I like this quote from Damian. Deming's wrote about quality and he introduced quality control to Japan after the war. You know inspection of quality. I know, you know, looking at the lines of code does not actually find the box. We know that. So you have to build quality as you go along. And one of the important things to do that with obviously is testing. And so I was like, well, I'm going to go lots and lots of texts. And just using testing, I think. Works really well on small projects. Whether you're doing a small AI project, you can really, really test it extensively. You can be, you know, kind of ruthless about it and you can be pretty. Happy that it's actually working for you and that, you know, simple. But once you get into these larger projects, testing gets more complicated, I found, you know, especially when you get into tribute systems, you get into race conditions and you get into red things, we talk about in a minute. And I keep hearing people say, well with AI, all you need is 100% test coverage and you're done. And it's like. But I did try that. I started off trying to get 100% test coverage. It seemed like a good idea. And I also measured different measures of test coverage from test coverage from integration tests. And it didn't really help a lot. I found they were kind of better uses of my time than trying to get 100% You know, I sat down and I asked AI agent to get 100% test coverage. It kind of wrote trivial tests that are like, I don't, you know, this is stupid test. I don't care about that. And there's also a lot of weird arrow cases that are really hard to cover. I mean, I went through and asked the agent like, well, why haven't we got 100% test coverage from this? It was like, we don't have a test for when the random number generator in the system fails. And I was like, yeah, I don't think I want to write a test for that. Like, this is too, too much error injection. 
Like I can see that there's a, you know, it's going to give an error or it's going to panic. And that's the right behavior. I don't need to have test coverage for that. All my test suite. But I still have a lot of tasks. I mean 75% of my code base is tests. Test coverage is between 75 and 100% for each file. Like it's like. You can still have a lot of tests without being really obsessive about 100% test coverage, which I think is not really the right kind of aim.

Test oracle pattern

The great thing about copying an existing system like S3 is you've got this test on. You can find out what happens when you do something. You run the test, you can run it against the, you can run it in, I guess, against S3. I have 1500 tests that run against S3 and they lock down the behavior. Then you tell the AI, you have to do it exactly like that. And it's much better at doing things when it does that. You know, it basically gives it a nice baseline. And so much so that I actually think that if you're writing a complex system, the one way to start is to write it really, really simple version that's kind of trivial that basically has the same behavior that you can use as a test oracle so that when you're building the more complex version, you don't make it break. And you can build a test read against a simple version first. It's not 100% easy. Like you pointed at Amazon S3, for example, and you just gather all sorts of kind of weird things like S3's authorization is, a lot of it's eventually consistent, not immediate. So you have to keep working out, well, we need a load of retries in this test. Otherwise it looks like behavior is different and things like that. And so you have to kind of do some interpretation even with a kind of exact test oracle, which is kind of annoying. So it's not, it doesn't, it's not quite a specification. But it kind of does keep you very, very grounded, which is really important. The other thing about having a test oracle or something you're trying to copy, whether it's an existing system of yours or someone else's is that. The public API doesn't cover behavior. You can't really see that goes on in the background. And so you can't tell when things happen or invisible behaviors about things. You have to kind of either infer them from something you can observe. Or you need to find some other way of testing those kind of behaviors. So, you know, you can't see when an item really gets deleted in S3, it kind of happens at some point. 
But you know, you kind of, there's no API that lets you see behind the scenes of that. The other thing, like you think S3 is really nicely documented, it's got all these hundreds of pages of documentation. The documentation turns out to mostly be wrong. And every single detail, when you actually look at the details. So again, having the test oracle and running the test is much more important than reading the documentation. I think the documentation, you know, maybe it was true once, maybe it was, maybe it's approximately true. It kind of gives you hints about things that are interesting, but never trust anyone's documentation at all. But again, that gives you grounded tests on what real behavior is and real edge cases.
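
The test-oracle idea in this section can be sketched as differential testing: run the same operations against a trivially simple model and against the real implementation, and report the first operation where the observable results differ. This is a minimal illustration, not code from the talk. The `ObjectStore` trait, `ModelStore`, `ShardedStore`, `Op`, and `first_divergence` are all hypothetical names, and the sharded store is only a stand-in for a more complex system.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal interface shared by the oracle and the real system.
pub trait ObjectStore {
    fn put(&mut self, key: &str, value: &[u8]);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn delete(&mut self, key: &str) -> bool;
    fn list(&self, prefix: &str) -> Vec<String>;
}

/// The oracle: a single sorted map, simple enough to trust by inspection.
#[derive(Default)]
pub struct ModelStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ObjectStore for ModelStore {
    fn put(&mut self, key: &str, value: &[u8]) {
        self.objects.insert(key.to_string(), value.to_vec());
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.get(key).cloned()
    }
    fn delete(&mut self, key: &str) -> bool {
        self.objects.remove(key).is_some()
    }
    fn list(&self, prefix: &str) -> Vec<String> {
        self.objects.keys().filter(|k| k.starts_with(prefix)).cloned().collect()
    }
}

/// Stand-in for the complex implementation: keys spread over four shards.
#[derive(Default)]
pub struct ShardedStore {
    shards: [BTreeMap<String, Vec<u8>>; 4],
}

impl ShardedStore {
    fn shard_of(key: &str) -> usize {
        key.bytes().map(usize::from).sum::<usize>() % 4
    }
}

impl ObjectStore for ShardedStore {
    fn put(&mut self, key: &str, value: &[u8]) {
        self.shards[Self::shard_of(key)].insert(key.to_string(), value.to_vec());
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.shards[Self::shard_of(key)].get(key).cloned()
    }
    fn delete(&mut self, key: &str) -> bool {
        self.shards[Self::shard_of(key)].remove(key).is_some()
    }
    fn list(&self, prefix: &str) -> Vec<String> {
        let mut keys: Vec<String> = self
            .shards
            .iter()
            .flat_map(|shard| shard.keys())
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect();
        keys.sort(); // shards are not ordered relative to each other
        keys
    }
}

#[derive(Debug, Clone)]
pub enum Op {
    Put(String, Vec<u8>),
    Get(String),
    Delete(String),
    List(String),
}

/// Apply one operation and render what a client could observe.
fn observe<S: ObjectStore>(store: &mut S, op: &Op) -> String {
    match op {
        Op::Put(k, v) => {
            store.put(k, v);
            "ok".to_string()
        }
        Op::Get(k) => format!("{:?}", store.get(k)),
        Op::Delete(k) => format!("{}", store.delete(k)),
        Op::List(p) => format!("{:?}", store.list(p)),
    }
}

/// Index of the first operation where the two stores disagree, if any.
pub fn first_divergence<A: ObjectStore, B: ObjectStore>(
    oracle: &mut A,
    real: &mut B,
    ops: &[Op],
) -> Option<usize> {
    ops.iter()
        .position(|op| observe(&mut *oracle, op) != observe(&mut *real, op))
}
```

Comparing only what `observe` renders mirrors the limitation described above: anything that is not visible through the public interface cannot be checked this way and needs some other signal.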

Edge cases

Edge cases are really interesting. Because they kind of, it was, it was kind of a while in when I found when I was actually working on improving the type system and the code. And the was writing some test cases for this, run them against s3 and found a 500 error. Which was repeated role. And that was kind of quite exciting because this was, you know, this was an interesting edge case that had come up. It was implementing the code and it just thought, well, I'll test this against S3 and let's see what happens. And clearly Amazon did not have a test case for this. And I did. And it actually gave me confidence that I was fine. As soon as I found out, I was like, my test suite's actually good now. Or parts are really good now because I'm finding really weird errors and really weird edge cases. That means I must have kind of explored a lot of the universe of testing. And it says kind of, because it's kind of where when you're doing this kind of complex development, because sometimes you feel everything's terrible and everything's really bad and like, this is never going to work. And then other times you feel, oh, actually, yeah, this is kind of working again. So this was kind of. The kind of thing about testing that gives you this kind of confidence. I found another one, another repeatable 500 hour later and another weird education. I go, okay, how many of these there are? And if I two so far. It was. So I was quite good at the edge cases, particularly kind of when it's actually writing code itself. I kept finding edge cases by reading the AWS documentation thinking, is that really kind of true? Have you got tests for this? Asking the AI to write tests for them and finding other weird things that were approximately equal to the documentation or hence from the documentation. When I put the AI at the docks and asked her to do the same thing, it was really bad at that. It seemed to not be able to look at docs and find edge cases in the way that I can. 
So that was kind of interesting. So, but sometimes if you asked it, so if I, I mean, sometimes I find education through test coverage issues. Like we haven't make sure you've got improving test coverage, we'll find some age cases just asking you to think about the usual kind of edge cases like zero length and one length. And 10,001 length and that's our unhelped. A bit. But there's, you know, you kind of have to iterate through these things and kind of think about a QA person and think like a test yourself. And have some ideas about areas that might have weird errors.
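
The "usual kind of edge cases" mentioned above (zero length, one length, just past a limit) can be generated mechanically. The following is a small sketch under assumptions: `boundary_lengths` and `boundary_inputs` are hypothetical helpers, and the limit of 10,000 in the usage note is only an example value chosen to match the "10,001 length" case in the talk.

```rust
/// Lengths worth probing around a size limit: empty, single element,
/// and one below, at, and one above the limit.
pub fn boundary_lengths(limit: usize) -> Vec<usize> {
    let mut lengths = vec![0, 1, limit.saturating_sub(1), limit, limit + 1];
    lengths.sort_unstable();
    lengths.dedup(); // small limits produce duplicates
    lengths
}

/// Concrete payloads of each boundary length, filled with a fixed byte.
pub fn boundary_inputs(limit: usize) -> Vec<Vec<u8>> {
    boundary_lengths(limit)
        .into_iter()
        .map(|n| vec![b'a'; n])
        .collect()
}
```

For example, `boundary_lengths(10_000)` yields the lengths 0, 1, 9,999, 10,000 and 10,001, each of which could be fed to the system under test and to the oracle.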

Flaky tests with AI — the hard rule

Flaky tests were really interesting. I said that like AWS was, AWS converges to truth over time, which is really annoying. And this wasted a huge amount of time. And I was like never have flaky tests with AI as my hard rule. Just fix them immediately. It's weird things go wrong. Sometimes it decides that the training data says that developers never fix flaky tests, so we ignore them. And I then shouted it and say no, in this code base, we fix our flaky test that says so in the agents.and file and you're ignoring it again. But somehow the training data says no one fixes flaky tests, which was definitely true. I've worked in many places that had a lot of leakage tests. Sometimes with the AWS test, it decides that I do this, changes their behavior every day. We'll just change the code to match again because we must match AWS's behavior. So we, and then it's like, oh no, it's changed back again. Okay, we'll change that again. And it's like, no, there's something wrong with the test. You've got to fix the test first. So that was kind of annoying. So I would absolutely. Like, like this is the opportunity to fix flaky tests. AI is good at and it will really help your tests if you just get rid of all your flaky tests. And also make your test as fast as possible and run them a lot. You'll then find the flakes much quicker because most flakes don't happen that often. I currently have 5,000 tests that run in two minutes. And that's kind of two minutes is my kind of borderline acceptable. I might try and speed them up again. But it's like, you know, that's, to me, that's okay. But they have to, you know, you're running them a lot. And say, and you need to find the flakes. And sometimes with certain kinds of change, you get a lot of flakiness and you spend like, you know, I've set up overnight runs doing repeated test runs to try and find errors on multiple machines and things like that. So, you know, it's a, it's a, it's not nice if it's slow. 
Oh, the tests that really are flaky for some reason.
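
Two techniques from this section lend themselves to a short sketch: running a test many times to surface rare flakes, and replacing a one-shot assertion with a bounded retry when the system under test converges over time. This is an illustration only. `run_repeatedly`, `FlakeReport`, and `eventually` are hypothetical names, not part of any test framework.

```rust
use std::thread;
use std::time::Duration;

/// Outcome of running the same test many times.
pub struct FlakeReport {
    pub runs: usize,
    pub failures: usize,
}

impl FlakeReport {
    /// Flaky means it neither always passes nor always fails.
    pub fn is_flaky(&self) -> bool {
        self.failures > 0 && self.failures < self.runs
    }
}

/// Run `test` `runs` times; the closure receives the run index and
/// returns true on success.
pub fn run_repeatedly<F: FnMut(usize) -> bool>(runs: usize, mut test: F) -> FlakeReport {
    let failures = (0..runs).filter(|&i| !test(i)).count();
    FlakeReport { runs, failures }
}

/// Poll `probe` until it yields a value or the attempts run out.
/// Useful when the observed system is eventually consistent.
pub fn eventually<T, F>(attempts: usize, delay: Duration, mut probe: F) -> Option<T>
where
    F: FnMut() -> Option<T>,
{
    for attempt in 0..attempts {
        if let Some(value) = probe() {
            return Some(value);
        }
        if attempt + 1 < attempts {
            thread::sleep(delay); // no sleep after the final attempt
        }
    }
    None
}
```

A bounded retry like `eventually` fixes the test rather than the code under test, which is the distinction drawn above: when the reference system's answer changes between runs, the test is what needs repair.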

Asking AI for new kinds of tests

I ask great at all sorts of tests. I basically every now and again I would ask it things like what, you know, what kind of test. Should we have had to fix those issues we just had and it would come up with new kinds of tests. And many of them found issues. The first test would get the property based testing found some issues like, you know, so I think you can do things that you might not have ever thought about before. There's lots of great kinds of testing that are available and you should try them and see how they work and have more tests, more kinds of tests as well.
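
Property-based testing, mentioned above, checks that a rule holds for many generated inputs instead of a few hand-picked ones. The sketch below is hand-rolled with only the standard library so that it stays self-contained; a real project would more likely use a dedicated property-testing crate. Everything here is an assumption for illustration: the `Lcg` generator, the `split_parts` function standing in for code under test, and the chosen property (splitting a payload into parts and joining them returns the original).

```rust
/// Tiny deterministic pseudo-random generator so failures are reproducible
/// from the seed. Not suitable for anything security related.
pub struct Lcg(pub u64);

impl Lcg {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }

    /// Value in 0..n. `n` must be non-zero.
    pub fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Hypothetical code under test: split a payload into fixed-size parts.
/// `part_size` must be at least 1.
pub fn split_parts(data: &[u8], part_size: usize) -> Vec<Vec<u8>> {
    data.chunks(part_size).map(|c| c.to_vec()).collect()
}

/// Check the property on `cases` generated inputs.
/// Returns a description of the first counterexample, if any.
pub fn check_split_roundtrip(cases: usize, seed: u64) -> Result<(), String> {
    let mut rng = Lcg(seed);
    for case in 0..cases {
        let len = rng.below(200) as usize;
        let part_size = 1 + rng.below(16) as usize;
        let data: Vec<u8> = (0..len).map(|_| rng.below(256) as u8).collect();

        let parts = split_parts(&data, part_size);

        // Property 1: joining the parts gives back the original bytes.
        if parts.concat() != data {
            return Err(format!("case {case}: roundtrip failed, len={len}, part_size={part_size}"));
        }
        // Property 2: every part except the last is exactly part_size long.
        let all_full = parts.iter().rev().skip(1).all(|p| p.len() == part_size);
        if !all_full {
            return Err(format!("case {case}: short part before the end, len={len}, part_size={part_size}"));
        }
    }
    Ok(())
}
```

Because the generator is seeded, a failing case can be replayed exactly, which also keeps this kind of test from becoming a new source of flakiness.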

What tests can't find

What Tess can't find. Is important to know and understand. Sometimes they can find race conditions and again like the more tests you have, the more faster they are, the better. But it's hard. Performance tests, we'll talk about it again in a minute. It's hard to have good performance tests on an ongoing basis. We can do it. You can find security issues with tests. You can't decide if your architecture is good with tests. And if you can't measure it right now, you can't really test it. Or if you, you know, so there's lots of things that you have to kind of think outside your test and try and work out how to get them inside. And that's. You know, you've got to be thinking about these issues that your tests are not finding. And again, that's why just focus on test coverage. Means, you know, thinking about security enough. To make things more testable, you know, you've got to think about things that, you know, any kind of signal you can get out of the black box basically. You know, if there's a meowing noise, then it's giving you information that you need to know. Increase the scope of what you can test by building more testable interfaces. One thing I kind of regret doing is not really building the management and reporting and backend interfaces because I could have run the tests on those to understand what's going on better. I was really focusing on the public API because that's the thing I felt I was trying to replicate and not the internals. I have unit tests on them, but it's the apis are kind of internal and structured. And I don't necessarily know how much I trust them. Because I can't see them. I can only see it looking through testing and I can't like sit there and play with them. So I kind of like the more you build out the better.

AI-built tracing as a debugging tool

Tracing. And classic, you know, observability pieces. I discovered that like even just getting the AI to build. A hand-built hand maintained tracing framework was incredibly useful. You don't need to tie it into production system or something, but anything that can give it traces that it can look at to debug is amazingly useful. In this case, it had a bunch of overheads. So when I used it for performance testing, it was a little bit misleading, but it told it, you know, it basically gave way the big. Performance gaps were. And it was incredibly useful for debugging because I could give it like, you know, I could run the test. I could have my test suites running looking for race conditions or errors. Give it a trace and say, this happened overnight overnight run. We need to fix this and it would get, it would let actually lock down on what the real problem was rather than trying to guess. Because if you give an eye a bug but you don't know how to repriorit and it's a very, it's a rare condition. It can waste a lot of time. Either try, I mean, it can either fail to reproduce it itself or it can guess what the solution might be and get it wrong or something. And if you can give it a trace and some trace tooling and just get it to sit there and try and reproduce it and itself and see if it's the same thing, then it usually can happen. And that works really well. So you don't need to necessarily hook it up to a production environment. You can really do this. Just by getting the AI to build some tracing tools for you.
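
A hand-built tracing tool of the kind described above does not need to be elaborate. The following is a minimal sketch, assuming an in-memory recorder that tags each event with a request id so the events for one failing request can be pulled out and handed to the agent. `Tracer`, `TraceEvent`, and the stage names in the test are hypothetical, and this is not the framework used in the talk.

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// One recorded step in the handling of a request.
#[derive(Debug, Clone)]
pub struct TraceEvent {
    pub at_micros: u128,
    pub request_id: u64,
    pub stage: &'static str,
    pub detail: String,
}

/// Cheap to clone; all clones append to the same shared buffer.
#[derive(Clone)]
pub struct Tracer {
    start: Instant,
    events: Arc<Mutex<Vec<TraceEvent>>>,
}

impl Tracer {
    pub fn new() -> Self {
        Tracer {
            start: Instant::now(),
            events: Arc::new(Mutex::new(Vec::new())),
        }
    }

    /// Record an event with the time elapsed since the tracer was created.
    pub fn record(&self, request_id: u64, stage: &'static str, detail: impl Into<String>) {
        let event = TraceEvent {
            at_micros: self.start.elapsed().as_micros(),
            request_id,
            stage,
            detail: detail.into(),
        };
        self.events.lock().unwrap().push(event);
    }

    /// All events for one request, in the order they were recorded.
    pub fn for_request(&self, request_id: u64) -> Vec<TraceEvent> {
        self.events
            .lock()
            .unwrap()
            .iter()
            .filter(|e| e.request_id == request_id)
            .cloned()
            .collect()
    }

    /// Plain-text rendering suitable for pasting into a bug report.
    pub fn dump(&self, request_id: u64) -> String {
        self.for_request(request_id)
            .iter()
            .map(|e| format!("{:>8}us req={} {} {}", e.at_micros, e.request_id, e.stage, e.detail))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

The mutex and the string formatting add overhead on every event, which is consistent with the caveat above that traces like this can mislead when used for performance measurement.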

AI performance engineering

Performance testing. I found the AI very much like human people doing performance testing performance improvements. Like you build something, it wouldn't improve the performance or it'd make it worse because it would think this must be the way to fix this and it's not. And that's just the way performance engineering to be honest. And kind of lean into that. Just remember, this is cheap low cost work and you just throw it away if it doesn't work. Don't do it just because it seemed a good idea and keep it. It's like generally just. Say, no, throw that one away and try something else. Comparing performance against other systems was quite fun. I did some performance testing against one of the other S3 implementations. And got the re performance to be the same and that was nice. And then I was like why is our right performance really slow when it spent a bunch of time thinking and said eventually it said they've got a comment in the code saying we haven't done fsync when we actually write. And I was like okay and of course it's going to be faster. I started wasting my time trying to actually make performance the same as something that's doing something we don't want to do. But. It's good tool for that.
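
The fsync point above can be made concrete. In this sketch, the only difference between the two write paths is a call to `File::sync_all`, which asks the operating system to flush the file's data and metadata to the storage device before returning. The `write_object` function and its `durable` flag are hypothetical and are not taken from either system in the comparison.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Write `data` to `path`. With `durable` set, wait until the operating
/// system reports the file contents as flushed to the storage device.
pub fn write_object(path: &Path, data: &[u8], durable: bool) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?;
    if durable {
        // Skipping this step is usually much faster, but the bytes may
        // still be sitting in the page cache if the machine loses power.
        file.sync_all()?;
    }
    Ok(())
}
```

A benchmark that compares a `durable = true` path with a `durable = false` path is measuring two different guarantees, which is why matching the other system's write numbers was not a sensible target.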

Type system over tests for invariants

Other things I had a lot of issues early on with how permission checking worked and to have check time of use testing permissions twice in different places. I tried tracing these and you know fixing some issues but ended up just telling you to fix it in the type system instead of actually trying to use tracing or anything to do this like having authorized request type that can't be reauthorized make sure that all the things going into at this gate or authorized all these types of functions take authorized requests and then just force everything through types. And that saves a lot of effort and you know once you've constructed it so you can't have to have a test for this anymore because you know that the types are enforcing it for you and I spent a lot of time like looking at the interface types between modules and just seeing if they look same.
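
The "authorized request type" idea above can be sketched in Rust using module privacy: the authorized type has a private field, so the only way to obtain one is through the function that performs the check, and handlers that require it cannot be called with an unchecked request. This is an illustration under assumptions, not the project's code. The `auth` module, the field names, and the single-allowed-user rule are all hypothetical simplifications.

```rust
mod auth {
    /// A request as it arrives, before any permission check.
    pub struct Request {
        pub user: String,
        pub bucket: String,
        pub key: String,
    }

    /// Proof that a request passed the check. The field is private, so
    /// code outside this module cannot construct one directly.
    pub struct AuthorizedRequest {
        inner: Request,
    }

    #[derive(Debug, PartialEq)]
    pub struct Denied;

    /// The single gate. It consumes the `Request`, so the same value
    /// cannot be checked again or used unchecked afterwards.
    pub fn authorize(req: Request, allowed_user: &str) -> Result<AuthorizedRequest, Denied> {
        if req.user == allowed_user {
            Ok(AuthorizedRequest { inner: req })
        } else {
            Err(Denied)
        }
    }

    impl AuthorizedRequest {
        pub fn bucket(&self) -> &str {
            &self.inner.bucket
        }
        pub fn key(&self) -> &str {
            &self.inner.key
        }
    }
}

/// Handlers take the authorized type, so forgetting the check is a
/// compile error rather than something a test has to catch.
pub fn delete_object(req: &auth::AuthorizedRequest) -> String {
    format!("deleted {}/{}", req.bucket(), req.key())
}
```

With this shape, a call such as `delete_object(&raw_request)` does not compile, which is the sense in which the invariant no longer needs its own test.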

Security review with Codex

Security reviews are used like Codex security. I found a lot of issues. What I do with them is I check the findings into the repo and ask the AI to review them. Three quarters of them are valid. They're not necessarily 100% security findings and then I would every now and again I do review sessions for like how could we have avoided these? What tests should we have that would fix these? I found it really valuable because although I do a lot of AI code review at the time and find a lot of issues and do a lot of review iteration, it still found things that have been missed that were actually important and really quite kind of major things. So. I've done more kinds of security review as well. I mean Codex security reviews, pull requests, switch is fine. But you also want to sit down and just review the state of the code as it is as a whole and look for issues and so on. I found those have been very valuable.

Human in the loop

As I said human in the loop. I view myself as part of the feedback loop. I have opinions and I'm here to find out what's going wrong. And you know so I've been not trying to automate things too much because I want to actually understand what's going wrong in order that I can because I'm responsible for quality and I care about the code and I want to know it's good. Because that's my kind of aim with that. So I kind of view myself as part of that.

Lessons / closing

So what did we learn? So a test oracle or model is really useful. And copying things that exist is actually nice activity. That's long history of air vehicles projects. Doing that. The new project was there to replicate Unix and so on. And you know it's a great activity to do and it's kind of good fun and it works quite well. This is my github commit. Graph as you can see. Refactoring. Like you have to do enormous amount of refactoring. The ridiculous week in the middle where it was 120,000 lines plus and 75,000 lines minus. Part of that was that was just I basically had these refactoring weeks where there's another one near the beginning and I would just refactor stuff. There was one 43,000 line single file that had to be refactored at that point as well which is kind of the in scale there. But like refactoring is part of the feedback loop. You don't feel you have to one shot things like you're converging on a better answer. No. That there's an answer and if you had the Risev tests, it's there. You kind of You need to expand the tests if you're uncertain. And where you think you're suspicious and you think there might be more errors. Work out new things you can attest to. And just kind of it's part of your kind of quality control thoughts about, you know, what's going on and, like, am I am I happy about this? Do I think this code is looking good? Or do I am I worried? I don't know if it's worried. You probably want to add more. Tests and, you know, kind of spend more time trying to to break things. Because your role is to is to be there and break things and get them fixed. I can open source this code in a week or so when I've just finished the distributed system space. Again, when I'm happy with it. So if you wanna have a look, sign up and send you a an email when it's ready. And we got a couple of minutes for questions. Thank you Justin. Any questions at the room? Please put your hands. Up. And I'll run over to you. My as a gentleman. There. As well. 
Any questions? In the middle left. With the tables, it's a little bit trickier.

Q&A — formal verification

[Audience question] Did you experiment with any formal verification or testing tools. For the distributed system part.

[Katie] Not yet because I'm quite I'm still working through it. I'm I'm want to next. I'm looking at basically I mean, yeah. I'm basically gonna look at drive thru digital as soon as it's kind of fully implemented. I'm I'm, like, I have a sort of huge transition plan from the single host to the multi host, and, like, all the bits of it being through. And I'm hoping, like, next week, it'll be runnable. As fully distributed. And and so, yeah, then yeah. I'm yeah, I'm really interested in what I can do there because I think there's probably gonna be some bugs And I'm, you know, I've been I've been looking at those tools for quite a while, and I'm I'm really interested in that space. And you can what you can verify and what you can fully verify. Formal verification is, like, something that I love to do more often. It's kind of something that there's mixed reports about how good AI is at it. As well, and I'm really fascinated that area. So, yes. There's one right at the back, ain't it?

Q&A — the test-oracle "tautology" risk

[Audience question] Thanks for that. Absolutely brilliant. Quite you know, it's it's uncanny, unsimilar to what I'm doing. And I know how you you know, just think about this question, but the challenge I had in nice think you you're bound to have it with a test oracle. Is as, you know, the AI loves telling you that, oh, that would be a tautology. Because when you're building a system and you're a test oracle, the test oracle has to work on an entirely different different way of doing the exact same thing with the system you're testing. So that you're not just running the logic twice. Which would naturally produce the same output. So the the the big difficulty I had was implementing this system using the AI and then implementing the test oracle to follow an entirely different way of achieving the same outcome so that these two keep each other in check So this will be a you know, this may be a doing entirely different things. But does that sound familiar? Did you think or did you have to fight? That?

[Katie] Bit yeah. I mean, I think it was slightly easier because s three was so was so external and, like, just there and, like, I had some similar issues, though, with with some of the model testing it set up where it was, what am I actually testing anything that's different from the code? And I think that yeah. And I think that said that I was still build an oracle even if I was building a complex software. But I think I'd probably build it maybe outside the repo or something as a fixed, like, very done model of it that was yeah. Because I think you you can end up in a situation where you turn out you're not really testing anything at all. That's not the same as the thing they're testing. And you need to have that sort of fixed guarantee that it's what you want. And and, yeah, it's it's it's definitely easier when you've got an external system. You're or something that you can really nail down that that's true.

Close

Awesome. That's all the time we have for Okay, we're gonna actually start the couple of minutes. So we'll invite Ian on stage, but don't go too far. Start in a couple of minutes. Minutes.
