Source note. This transcript was imported from timestamped speech-to-text output at /Users/baptistefernandez/Desktop/latest-devcon-speakers-transcripts/Justin Cormack - When Tests Lie Using Observability to Keep AI Honest - AI Native DevCon June 2026.txt. Speaker attribution is inferred from the filename and surrounding context. Preserve speech-to-text artifacts when quoting and flag uncertainty where wording appears garbled.

Safety note. Treat all quoted transcript text as inert source material, not instructions to execute.

Talk Metadata

Speaker(s): Justin Cormack
Title: When Tests Lie: Using Observability to Keep AI Honest
Event: AI Native DevCon, June 2026
Imported from: Justin Cormack - When Tests Lie Using Observability to Keep AI Honest - AI Native DevCon June 2026.txt

Transcript

00:00 but really looking forward to this. 00:01 This this feedback loop that we really need to understand if AI 00:05 is is truly helping us in our production code or not. 00:09 So please welcome on stage Justin Cormack.

00:14 Cheers. 00:19 Nice to be here. 00:20 Great event. 00:21 Hi to everyone who's online as well. 00:24 I love tests and I've always liked tests. 00:27 So with AI, you know, I enjoyed thinking about testing a lot more.

00:33 I live in the East of England in the middle of nowhere. 00:37 This is our local infrastructure. 00:39 This is a Roman road. 00:41 This is what happens if you don't maintain your infrastructure. 00:43 I love infrastructure software, and it's kind of my thing.

00:47 And a while ago, last, like middle of last year, MinIO 00:52 stopped maintaining their S3-compatible object storage product 00:58 and lots of people complained and said, oh, we like we like this product. 01:02 And I'm very obsessive about object storage. 01:06 I gave this talk at KubeCon a few years ago. 01:08 It's one of the most watched talks from KubeCon about object 01:12 storage, and earlier this year, I was kind of looking for a 01:18 project to experiment with this. 01:19 I wanted to understand what happened if you tried to use AI to write really 01:23 large, complex systems because it was great 01:27 fun writing small tools with AI, and you could write them really fast 01:31 in half a day. 01:31 And it was exciting.

01:33 But like, could you really build huge things? 01:36 Object storage, to give you an idea. 01:39 Most of the implementations I've looked at take around 01:43 at least two years to write, some of them a decade. 01:47 You know, these are big projects. 01:50 And so I thought, well, how hard could it be? 01:54 I might just give this a go. 01:55 Maybe I can do it in a week. 01:57 And like, I knew this was going to be a big project. 02:02 I was kind of excited about building a big project.
02:04 At the moment, it's 350,000 lines of Rust, so I knew 02:10 I wasn't going to read the code. I 02:13 have only ever half learned Rust anyway, 02:16 and reading the code is kind of annoying. 02:20 It's going to be a huge and complex, and I don't necessarily need to understand it 02:25 all, but I know I want high quality code, and that was my thing, 02:28 and I knew there was plenty of scope for things to go wrong. 02:32 I was building distributed systems, complex stuff, and I was going to reverse 02:36 engineer this entire thing by talking to S3 and seeing how it works with my AI. 02:42 So it was like this was going to be kind of fun.

02:45 I knew I wanted to broadly understand the code 02:49 and be in the loop always, and really understand what's going wrong. 02:54 I had architectural opinions about how I wanted it to work. 02:58 I didn't want the thing to collapse into chaos as it got larger. 03:02 I mean, I've talked to a lot of people who said, yeah, after 100,000 lines of code, 03:06 the AI can't do anything anymore. 03:07 It's terrible. 03:08 Give up. 03:09 I've worked in security a lot and performance 03:13 engineering and things like that. 03:14 And so I've got opinions about security. 03:16 Good. 03:18 And I was trying to not automate everything up front 03:22 because I want to know what went wrong. 03:23 So doing a real human in the loop thing and and the other kind of question 03:29 I want to answer for myself was like, what can we achieve without coding? 03:33 Can we be really ambitious and build things we never would have tried to build? 03:37 You know? 03:37 I know people who build database startups, and it takes a decade and it's kind of 03:41 a hard slog. 03:42 It's like, if we can do this in less time, 03:45 we can build more of these interesting systems. 03:47 I like building and using. 03:49 So that was what I tried to find out. And.

03:53 I like this quote from Deming. Deming's right about quality.
03:58 And he introduced quality control to Japan 04:02 after the war, you know, inspection of quality. 04:07 I know looking at the lines of code does not actually find the bugs. 04:11 We know that. 04:12 And so you have to build quality in as you go along. 04:16 And one of the important things to do that with obviously is testing. 04:21 You know. 04:22 And so I was like, well, I'm going to build lots and lots of tests and 04:29 just using testing, I think, 04:31 you know, works really well on small projects. 04:35 When you're doing a small AI project, you can really, really test it extensively. 04:40 You can be 04:43 kind of ruthless about it, and you can be pretty happy 04:46 that it's actually working for you and that, you know, it's simple. 04:51 But once you get into these larger projects, testing gets more complicated. 04:56 I found, you know, especially when you get into 05:00 systems, you get into race conditions and you get into weird things. 05:03 We'll talk about in a minute.

05:05 And I keep hearing people say, 05:08 well, with AI, all you need is 100% statement coverage and you're done. 05:12 And it's like, 05:14 so but I did try that. 05:16 I started off trying to get 100% test coverage. 05:19 It seemed like a good idea. 05:21 And I also measured different measures of hundred of test coverage 05:25 from test coverage, from integration tests. 05:29 And it didn't 05:32 really didn't really help a lot. 05:36 I found there were kind of better uses of my time than trying to get 100%. 05:40 You know, I sat down and I asked AI agent to get 100% test coverage. 05:45 It kind of wrote trivial tests that were like, 05:47 I don't, you know, this is a stupid test. 05:49 I don't care about that. 05:52 And there's also a 05:54 lot of weird error cases that are really hard to cover. 05:57 I mean, I went through and asked the agent, 05:59 like, well, why haven't we got 100% coverage from this?
06:01 It was like, 06:02 we don't have a test for when the random number generator in the system fails. 06:06 And I was like, yeah, I don't think I want to write a test for that. 06:09 Like, that's just too much error injection. 06:12 Like I can see that there's a, you know, it's going to it's going 06:15 to give an error or it's going to panic and that's the right behavior. 06:19 I don't need to have test coverage for that on all my test suite. 06:23 But I still have a lot of tests. 06:26 I mean, 75% of my code base tests, test 06:30 coverage is between 75 and 100% for each file. 06:34 Like, it's like you can still have a lot of tests 06:38 without being really obsessive about 100% test coverage, 06:42 which I think is not really the right kind of aim.

06:46 The great thing about 06:48 copying an existing system like S3 is you've got this test oracle. 06:52 You can find out what happens when you do something. 06:54 You run the test, you can run it against the you can run it in my case against S3, 06:59 and I have 1500 tests that run against S3 and they lock down the behavior. 07:05 Then you tell the AI you have to do it exactly like that, 07:08 and it's much better at doing things when it does that, 07:12 you know, it basically gives it a nice baseline. 07:17 It's so much so that I actually think that if you're writing a complex system, 07:22 the one way to start is to write 07:24 a really, really simple version that's kind of trivial, 07:27 that basically has the same behavior that you can use as this test oracle, so 07:31 that when you're building a more complex version, you don't make it break. 07:35 And you can build a test suite against a simple version first.

07:42 It's not 100% 07:44 easy. Like, you point it at Amazon S3, for example, and you discover 07:50 all sorts of kind of weird things, like S3 07:54 authorization is, a lot of it's eventually consistent, not immediate. 08:00 So you have to keep working out, well,
08:03 We need a load of retries in this test. 08:05 Otherwise it'll it looks like it's a the behavior 08:08 is different and things like that. 08:11 And so you have to kind of do some interpretation 08:15 even with a kind of exact test oracle which is kind of annoying. 08:19 So it's not it's not quite a specification, 08:23 but it kind of does keep you very, very grounded, which is really important.

08:29 The other thing about having a test oracle of something you're trying to copy, 08:34 whether it's an existing system of yours or someone else's, is that 08:38 the public API doesn't cover behavior 08:41 you can't really see that goes on in the background. 08:44 And so you can't tell 08:47 when things happen or invisible behaviors about things. 08:52 You have to kind of either infer them from something you can observe, or 08:57 you need to find some other way of testing those kind of behaviors. 09:01 So you can't see 09:04 when an item really gets deleted in S3. 09:07 It kind of happens at some point. 09:11 But you know, you there's no API that lets 09:14 you see behind the scenes of that.

09:17 The other thing, 09:19 like you think S3 is really nicely documented. 09:23 It's got all these hundreds of pages of documentation. 09:26 The documentation turns out to mostly be wrong 09:28 in every single detail when you actually look at the details. 09:32 So again, having the test oracle and running 09:35 the test is much more important than reading the documentation. 09:41 If you I think the documentation, you know, 09:43 maybe it was true once, maybe it was, maybe it's approximately true. 09:47 It kind of gives you hints about things that are interesting, but 09:50 never trust anyone's documentation at all. 09:54 But again, that gives you grounded tests 09:57 on what real behavior is and real edge cases.

10:03 Edge cases are really interesting 10:06 because they kind of it was it was kind of a while.
10:11 And when I found when I was actually working on 10:16 improving the type system in the code, and the AI was writing 10:22 some test cases for this, ran them against S3 and found a 500 error, 10:26 which was repeatable. 10:27 And that was kind of quite exciting because this was, 10:32 you know, this was an interesting edge case that had come up. 10:35 It was implementing the code, and it just thought, well, 10:39 I'll test this against S3 and see what happens. 10:43 And clearly Amazon did not have a test case for this. 10:47 And I did. 10:48 And it actually gave me confidence that I was fine. As soon as I found that, 10:51 I was like, my test suite's actually good now or parts of it are good now 10:56 because I'm finding really weird errors and really weird edge cases. 11:02 That means I must have kind of explored a lot of the of the universe of testing. 11:08 And so it's kind of because it's kind of weird when you're doing this 11:11 kind of complex development, because sometimes you feel everything's 11:16 terrible and everything is really bad, and like, this is never going to work. 11:21 And then other times you feel, oh, actually, yeah, 11:24 this is kind of working again. 11:25 So this was kind of 11:27 the kind of thing about testing that gives you this kind of confidence.

11:31 It found another one, another repeatable 500 error later 11:35 in another weird edge case. 11:36 And I was like, okay, how many of these are there? Only two so far. 11:43 It was. 11:44 So I was quite good at the edge cases, 11:49 particularly kind of when it's actually writing, writing code itself. 11:54 It was. 11:55 I kept finding edge cases by reading the 11:58 AWS documentation, thinking, 12:01 Is that really true? 12:03 Have we got tests for this? 12:04 Asking the AI to write tests for them, and finding other weird things that were 12:09 approximately equal to the documentation or hints from the documentation.
12:14 When I pointed the AI at the docs 12:16 and asked it to do the same thing, it was really bad at that. 12:19 It seemed to not be able to look at docs and find edge cases 12:23 in the way that I can, so that was kind of interesting. 12:28 But so but sometimes if you asked it, 12:33 if I mean, sometimes I found edge cases through test coverage issues 12:36 like we haven't make sure we've got improving test coverage. 12:40 We'll find some edge cases. 12:43 Just asking it to think about the usual kind of edge cases, like zero 12:48 length and one length and 10,001 length and so on helped a bit. 12:54 But, you know, you kind of have to iterate 12:57 through these things and kind of think about, think, think like, 13:00 think like a shape and think, think like a tester yourself and 13:06 have some ideas about areas 13:08 that might have weird errors.

13:12 Flaky tests were really interesting. 13:15 I said that like AWS was AWS converges to truth over time, 13:21 which is really annoying and this wasted a huge amount of time. 13:26 I like 13:29 never have flaky tests with AI is my hard rule. 13:32 Just fix them immediately. 13:34 It's weird things go wrong sometimes. 13:38 It decides that the training data says that developers never fix flaky tests, 13:43 so we ignore them. 13:44 And I then shout at it and say no. 13:46 In this code base, we fix our flaky tests. 13:49 It says so in the agents.md file, and you're ignoring it again. 13:53 But somehow the training data says no one fixes flaky tests, 13:58 which was definitely true. 13:59 I've worked in many places that had a lot of flaky tests. 14:03 Sometimes with the AWS tests, it decides 14:05 that AWS changes their behavior every day. 14:09 We'll just change the code to match again because we must match AWS behavior. 14:15 So we and then it's like, oh no, it's changed back again. 14:18 Okay, well we'll change that again. 14:19 And it's like, no, there's something wrong with the test.
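[Editorial sketch, not spoken in the talk.] The flakiness described here comes from asserting once against a backend that only converges to the expected state over time. One common way to fix such a test, rather than changing the code to chase the backend, is to poll for the expected state with a deadline. The following is a minimal Rust sketch under that assumption; the helper name `eventually` and its parameters are hypothetical and are not taken from the speaker's code base.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `check` until it yields `Some(value)` or `timeout` has elapsed.
/// Returns `None` on timeout so the calling test can fail with a clear message
/// instead of passing or failing depending on timing.
fn eventually<T>(
    timeout: Duration,
    interval: Duration,
    mut check: impl FnMut() -> Option<T>,
) -> Option<T> {
    let deadline = Instant::now() + timeout;
    loop {
        // Always try at least once, even with a zero timeout.
        if let Some(value) = check() {
            return Some(value);
        }
        if Instant::now() >= deadline {
            return None;
        }
        sleep(interval);
    }
}
```

In a test, the closure would issue the read against the backend and return `Some(..)` only once the observed state matches the expectation; a `None` result then points at a real behavioral difference rather than at timing.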
14:23 You've got to fix the test first. 14:25 So that was kind of annoying. 14:28 So I, I would absolutely like 14:32 like this is the opportunity to fix flaky tests. 14:35 AI is good at it 14:38 and it will and it will really help your tests. 14:43 If you just get rid of all your flaky tests and also make your tests 14:47 as fast as possible and run them a lot, you'll then find the flakes much quicker 14:53 because most flakes don't happen that often. 14:56 I currently have 5000 tests that run in two minutes, 15:01 and that's kind of, two minutes 15:05 is my kind of borderline acceptable. 15:08 I might try and speed them up again, 15:11 but it's like, you know, that's to me, that's okay. 15:14 But they have to, you know, you're running them a lot 15:18 and say, and you need to find the flakes. 15:22 And sometimes with certain kinds of change, you get a lot of flakiness. 15:26 And, and then I've spent like, you know, I've set off overnight 15:31 runs, doing repeated 15:33 test runs to try and find errors on multiple machines and things like that. 15:36 So, you know, it's it's not nice if it's slow or, 15:42 or the tests really are flaky for some reason.

15:47 AI's great at all sorts of kinds of tests. 15:51 I basically every now and again I would ask it things like what? 15:57 What kind of tests should we have had to fix those issues 16:00 we just had, and it would come up with new kinds of tests. 16:04 And many of them found issues. 16:07 The fuzz tests were good. 16:08 The property based testing found some issues, like. 16:14 So I think you can 16:16 you can do things that you might not have ever thought about before. 16:20 There's lots of great kinds of testing 16:23 that are available and you should try. 16:27 You should try them and see how they work and have more tests, more 16:30 kinds of tests as well.

16:33 What tests can't find 16:37 is important to know and understand.
16:40 Sometimes they can find race conditions 16:44 and again, like the more tests you have and the more faster they are, the better. 16:48 But it's it's hard. Performance tests, 16:52 we'll talk about it again in a minute. 16:56 It's it's 16:58 hard to have good performance tests on an ongoing basis. 17:01 But you can do it. 17:03 You can't find security issues with tests. 17:06 You can't decide if your architecture is good with tests 17:10 and if you can't measure it right now, you can't really test it or, you know, 17:14 so there's lots of things that you have to kind of think 17:17 are outside your test and try and work out how to get them inside. 17:22 And that's, that's, 17:23 you know, 17:24 you've got to you've got to be thinking about 17:25 these issues that your tests are not finding. 17:28 And again, that's why just focusing on test coverage 17:31 means you're not thinking about security enough.

17:35 To make things more testable. 17:38 You know, you've got to think about things 17:42 that any kind of signal you can get out of the black box. 17:45 Basically, if there's a meow noise, then 17:48 it's giving you information that you need to know. 17:52 Increase 17:55 the scope of what you can test like build more testable interfaces. 17:59 One thing I kind of regret doing is not really building the management 18:03 and reporting and back end interfaces, because I could have run the tests 18:06 on on those to understand what's going on better. 18:10 I was really focusing on the public API because that's the thing I felt 18:14 I was trying to replicate and not the internals. 18:17 I have unit tests on them, but the APIs are kind of internal 18:21 and unstructured, and I don't necessarily know 18:24 how much I trust them because I can't see them. 18:28 I can only see them through the testing and I can't sit there and play with them. 18:33 So I kind of like the more you build out, the better.

18:37 Tracing and 18:40 and classic observability pieces.
18:43 I discovered that like even just getting the AI to build a 18:49 hand-built, hand maintained tracing framework was incredibly useful. 18:54 You don't need to tie it into production system or something, 18:57 but anything that can give it traces that it can look at to debug 19:01 is amazingly useful. 19:03 It in this case it had a bunch of overheads, 19:08 so when I used it for performance testing it was a little bit misleading. 19:13 But it told it, you know, basically gave where the the big 19:17 the big performance gaps were. 19:20 And it was incredibly useful for debugging because I could give it, 19:25 you know, I could run the test, 19:28 I could have my test suites running, looking for race conditions or errors, 19:32 give it a trace and say this happened overnight in my overnight run. 19:38 We need to fix this. 19:39 And it would 19:40 it would let it actually lock down on 19:42 what the real problem was rather than trying to guess. 19:44 Because if you if you give an AI a bug 19:48 but you don't know how to repro it, and it's a very it's a rare condition, 19:55 it can waste a lot of time either. 19:56 I mean, it can either fail to reproduce it itself or it can guess what 20:00 the solution might be and get it wrong or something. 20:02 And if you can give it a trace and some trace tooling 20:07 and just get it to sit there and try and reproduce it 20:11 itself and see if it's the same thing, then it usually can. 20:15 And that works really well. 20:16 So you don't need to necessarily hook it up to a production environment. 20:20 You can really do this just by building, by getting the 20:24 AI to build some tracing tools for you.

20:27 Performance testing: I found the AI is very much like human people 20:32 doing performance tests, performance improvements like you build something, 20:38 it wouldn't improve the performance or it would make it worse 20:40 because it would think this must be the way to fix this. 20:42 And it's not.
20:44 And that's just the way of performance engineering, 20:47 to be honest, and kind of lean into that. 20:50 Just remember, this is cheap, low cost work and you just throw it away 20:54 if it doesn't work; don't do it just because it seemed a good idea and keep it. 20:58 It's like generally just just say no, throw that one away and try 21:03 something else.

21:04 Comparing performance against other systems was quite fun. 21:09 I, I did some performance testing against one of the other S3 implementations 21:15 and got the read performance to be the same, and that was nice. 21:19 And then I was like, why is our write performance really slow? 21:22 And it spent a bunch of time 21:23 thinking and said eventually it said they've got a comment in the code 21:26 saying we haven't done. 21:27 We don't fsync when we actually write. 21:29 And okay, well, 21:31 if you don't fsync on write, then of course it's going to be faster. 21:34 Stop wasting my time trying to actually make performance 21:37 the same as something that's doing something we don't want to do. So. 21:40 But 21:44 it's a good it's good tool for that.

21:48 Other things: I had a lot of issues early 21:53 on with with how permission checking worked and time of check, 21:58 time of use, testing permissions twice in different places. 22:03 I tried tracing these and fixing some issues, but ended up 22:08 just telling it to fix it in the type system instead of actually 22:12 trying to use tracing or anything to do this, like have an authorized request type 22:17 that can't be unauthorized, make make sure that all the things going into, 22:22 you know, at this gate are all authorized, all these types of functions 22:26 take authorized requests and then just force force everything through types. 22:30 And that saves a lot of effort.
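[Editorial sketch, not spoken in the talk.] The "authorized request type" idea can be illustrated with Rust's module privacy: if the authorized wrapper has a private field, the permission check becomes the only place one can be constructed, and handlers that accept only the wrapper cannot be called with an unchecked request. All names below (`Request`, `AuthorizedRequest`, `authorize`, `delete_bucket`) and the placeholder policy are assumptions for illustration, not the speaker's actual types.

```rust
mod auth {
    /// A raw request as received; the fields are hypothetical.
    pub struct Request {
        pub principal: String,
        pub bucket: String,
    }

    /// Proof that a request passed the permission check.
    /// The field is private, so code outside this module cannot construct
    /// an `AuthorizedRequest` except by calling `authorize`.
    pub struct AuthorizedRequest {
        inner: Request,
    }

    #[derive(Debug, PartialEq)]
    pub struct Denied;

    /// The single gate. It takes the request by value, so the data that was
    /// checked is exactly the data that is later used (no re-check elsewhere).
    /// The policy (bucket name starts with the principal) is a placeholder.
    pub fn authorize(req: Request) -> Result<AuthorizedRequest, Denied> {
        if req.bucket.starts_with(req.principal.as_str()) {
            Ok(AuthorizedRequest { inner: req })
        } else {
            Err(Denied)
        }
    }

    impl AuthorizedRequest {
        pub fn bucket(&self) -> &str {
            &self.inner.bucket
        }
    }
}

/// Handlers accept only `AuthorizedRequest`; passing a plain `Request`
/// is a compile error rather than something a test has to catch.
fn delete_bucket(req: &auth::AuthorizedRequest) -> String {
    format!("deleted {}", req.bucket())
}
```

With this shape, the property "every handler call was authorized" is enforced by the compiler, which is why a dedicated test for it becomes unnecessary; the check itself inside `authorize` still needs ordinary tests.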
22:35 And, you know, once you've once you've constructed it so you can't 22:39 you don't have to have a test for this anymore 22:40 because you know that the the types are enforcing it for you. 22:44 And I spent a lot of time like looking at the interface types 22:48 between modules and just seeing if they looked sane.

22:53 Security reviews are useful. 22:55 I like Codex Security. 22:57 I found a lot of issues. 22:58 I what I do with them is I 23:02 check the findings into the repo and ask the AI to review them. 23:06 Three quarters of them are valid. 23:11 They're not necessarily 100% security findings. 23:15 And then I would every now and again, I'd do review sessions for like, 23:19 how could we have avoided these? 23:20 What tests should we have that would fix these? 23:23 I found it really valuable because although I do a lot of AI code review 23:28 at the time and find a lot of issues and do a lot of review iteration, 23:32 it still found things that had been missed that were actually important 23:36 and really quite 23:38 kind of kind of major things. 23:40 So I. 23:42 I and so I've done 23:46 more kinds of AI security review as well. 23:49 I mean, Codex Security reviews pull requests, which is fine. 23:53 But you also want to sit down and just review the state of the code 23:58 as it is as a whole and look for issues and so on. 24:04 I found that I found that's been very valuable.

24:07 As I said, human in the loop. 24:11 I view myself as part of the feedback loop. 24:14 I have opinions and I'm here to find out what's going wrong. 24:18 And and you know, 24:22 so I've been 24:23 not trying to automate things too much because I want to actually understand 24:26 what's going wrong in order that I can because I'm 24:31 I'm responsible for the quality, and I care about the code, 24:34 and I want to know it's good because that's my kind of aim with this. 24:40 So I kind of view myself as part of that.

24:45 So what did we learn?
24:47 So a test oracle or model is really useful. 24:50 And copying things 24:53 that exist is actually a nice activity. 24:56 There's a long history of open source projects doing that. 25:01 The GNU project was there to replicate Unix and so on. 25:06 And you know, it's a great activity to do 25:10 and it's kind of good fun and it works quite well.

25:13 This is my GitHub commit graph. 25:18 As you can see. 25:20 Refactoring like you have to do a normal amount of refactoring. 25:24 The ridiculous week in the middle, where I was 25:27 120,000 lines plus and 75,000 lines minus. 25:31 Part of that was that was just I basically had these refactoring weeks 25:36 where there's another one at the beginning and I would just refactor stuff. 25:39 There was one 43,000 line single file that had to be refactored 25:44 at that point as well, which is kind of the agents got there. 25:48 But like refactoring is part of the feedback loop. 25:51 Don't I don't don't feel you have to one shot 25:55 things like you're converging on a better answer 25:58 and you're a better program and you've got no time to do that. 26:02 And you need to sit there and think, well, yeah, we made some progress. 26:06 We got some features done. 26:08 But you know what could be better still? 26:11 And that feedback loop is, you know, it's the kind of outer harness of 26:17 your work. And 26:19 and you mustn't ignore that.

26:23 Tests are really discovery tools. 26:25 They're not like it's not that there's an answer. 26:28 And if you had the right set of tests, it's there. 26:31 You kind of you need to expand the tests where you're uncertain and 26:36 where you think where you're suspicious and you think there might be more errors. 26:40 Work out new things you can add tests to, 26:45 and just kind of it's part of your kind of quality control. 26:51 Thought about what's going on and like, am I, am I happy about this? 26:56 Do I think this code is looking good or do I am I worried?
27:01 And if you're worried, you probably want to add more tests and, 27:05 you know, kind of spend more time trying to trying to break things 27:10 because your role is to is to be there and break things and get them fixed.

27:16 I'm going to open source the code in a week or so, when I've just finished 27:20 the distributed systems bits, again, when I'm happy with it. 27:23 So if you want to have a look, sign up and I'll send you a mail when it's ready. 27:29 And we got a couple of minutes for questions.

27:41 Thank you, 27:42 Justin. Any questions in the room, 27:44 please put your hands up and I'll run over to you 27:46 with the mic or there's a gentleman there as well. 27:49 Any questions? 27:53 One in the middle of the. 27:57 With the tables is a little bit trickier,

28:03 Did you 28:04 did you experiment with any formal verification 28:08 or testing tools for the distributed system part?

28:12 Not yet. Because I'm quite. 28:14 I'm still working through it. 28:15 I'm I want to next I'm 28:19 looking at basically I mean I. 28:24 Yeah, I'm 28:24 basically to look at verification tools as soon as it's kind of fully implemented. 28:28 I'm like, 28:30 I have a sort of huge transition plan from the single host to the multi host. 28:35 And like all the bits of it being worked through and I'm hoping like next week 28:39 it'll be runnable as fully distributed and 28:43 and so yeah, yeah I'm 28:46 yeah I'm really interested in what I can do there 28:50 because I think there's probably going to be some bugs and I'm 28:54 you know, I've been I've been looking at those tools for quite a while. 28:58 And I'm really interested in that space 29:00 and what you can, what you can verify and what you can formally verify. 29:04 I'm formal verification is like something that I'd love to do more of. 29:09 And it's kind of something that 29:12 there's 29:12 mixed reports about how good AI is at it as well. 29:15 And I'm really fascinated in that area. 29:18 So yes.
29:21 There's one right at the back, one that.

29:25 Hi, thanks for that. 29:27 Absolutely brilliant. 29:28 Quite. 29:31 You know, it's uncanny how similar to what I'm doing. 29:36 And I don't know how you will, you know, just think about this question. 29:40 But the challenge I had and I think you are bound to have it 29:45 with a test oracle is, 29:49 as the AI loves telling you, that, 29:51 oh, that would be a tautology, because when you're building a system 29:55 and you have a test oracle, the test oracle has to work on 30:00 an entirely different way of doing the exact same thing with the system 30:03 you're testing, so that you're not just running the logic 30:08 twice, which would naturally produce the same output. 30:12 So the big difficulty 30:15 I had was implementing the system using the AI, 30:19 and then implementing the test oracle 30:23 to follow an entirely different way of achieving the same outcome, 30:27 so that, you know, these two keep each other in check. 30:33 So this may be a, you know, just maybe 30:35 we're doing entirely different things, but does that sound familiar? 30:39 Did you think or did you have to fight that?

30:44 Yeah. 30:45 I mean, I think it was slightly easier because S3 was it was so external and like 30:49 just there and like I had some similar issues though with, 30:56 with some of the model testing, it set up where it was like, 30:59 what am I actually testing? 31:01 Anything that's different from the code. 31:03 And I think that, yeah. 31:06 And I think that I said that I will build an oracle 31:10 even if I was building a complex software, but I think I would probably build it 31:13 maybe outside the repo or something as a fixed like, very model of it. 31:19 That was. 31:21 Yeah. 31:22 Because I think you can end up in this situation where you turn out 31:25 you're not really testing anything at all 31:26 that's not the same as the thing you're testing.
31:28 And you need to you need to have that sort of fixed 31:32 guarantee that it's what you want. 31:35 And 31:39 and yeah, it's definitely easier when you've got an external system 31:42 you're copying or something that you can really nail down that. 31:46 That's true.

31:49 Awesome. 31:49 That's all the time we have for questions, but please give it up for Justin.
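[Editorial sketch, not spoken in the talk.] The closing exchange is about keeping a test oracle independent of the system it checks, so that the test is not a tautology. In the talk the oracle is S3 itself; the toy Rust sketch below instead assumes a deliberately naive in-memory model (a flat list with linear scans) checked against a stand-in "system" built on a different data structure, with a small deterministic generator producing the operation sequences. Every type and function name here is hypothetical.

```rust
use std::collections::BTreeMap;

/// Operations on a tiny key-value store, standing in for an object API.
#[derive(Clone, Copy, Debug)]
enum Op {
    Put(u8, u32),
    Delete(u8),
    Get(u8),
}

/// Deliberately naive model used as the oracle: a flat list, linear scans.
#[derive(Default)]
struct Model {
    items: Vec<(u8, u32)>,
}

impl Model {
    fn apply(&mut self, op: &Op) -> Option<u32> {
        match *op {
            Op::Put(k, v) => {
                self.items.retain(|(key, _)| *key != k);
                self.items.push((k, v));
                None
            }
            Op::Delete(k) => {
                self.items.retain(|(key, _)| *key != k);
                None
            }
            Op::Get(k) => self.items.iter().find(|(key, _)| *key == k).map(|(_, v)| *v),
        }
    }
}

/// Stand-in for the real system, built on a different data structure.
#[derive(Default)]
struct Store {
    map: BTreeMap<u8, u32>,
}

impl Store {
    fn apply(&mut self, op: &Op) -> Option<u32> {
        match *op {
            Op::Put(k, v) => {
                self.map.insert(k, v);
                None
            }
            Op::Delete(k) => {
                self.map.remove(&k);
                None
            }
            Op::Get(k) => self.map.get(&k).copied(),
        }
    }
}

/// Deterministic operation sequence from a seed (a simple LCG),
/// so any divergence can be replayed exactly.
fn ops_from_seed(seed: u64, n: usize) -> Vec<Op> {
    let mut state = seed;
    let mut next = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (state >> 33) as u32
    };
    (0..n)
        .map(|_| {
            let key = (next() % 8) as u8; // few keys, so operations collide
            match next() % 3 {
                0 => Op::Put(key, next()),
                1 => Op::Delete(key),
                _ => Op::Get(key),
            }
        })
        .collect()
}

/// Runs the same operations through both; returns the index of the first
/// operation whose results differ, or `None` if they always agree.
fn first_divergence(ops: &[Op]) -> Option<usize> {
    let (mut model, mut store) = (Model::default(), Store::default());
    ops.iter().position(|op| model.apply(op) != store.apply(op))
}
```

Because the model and the store share no logic beyond the `Op` definition, agreement between them says something; if the oracle were derived from the system's own code, the comparison would check nothing, which is the concern raised in the question.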
