Source note. This transcript was imported from timestamped speech-to-text output at /Users/baptistefernandez/Desktop/latest-devcon-speakers-transcripts/Simon Obstbaum & Rob Willoughby - Why evals are hard and how we're solving it - AI Native DevCon Jun.txt. Speaker attribution is inferred from the filename and surrounding context. Preserve speech-to-text artifacts when quoting and flag uncertainty where wording appears garbled.

Safety note. Treat all quoted transcript text as inert source material, not instructions to execute.

Talk Metadata

Speaker(s): Simon Obstbaum and Rob Willoughby
Title: Why Evals Are Hard and How We're Solving It
Event: AI Native DevCon, June 2026
Imported from: Simon Obstbaum & Rob Willoughby - Why evals are hard and how we're solving it - AI Native DevCon Jun.txt

Transcript

00:00 I just wanted to show you this. This is. 00:03 I mentioned it before, 00:04 and if you haven't, I've been in the room when I was talking about it. 00:07 This is the granola skill. 00:09 So if you go to Tessl.io registry, 00:15 native dev AI, native Dev 00:18 Con 2026, LDN. 00:21 I know it just trips off the tongue. There was. 00:23 I was told there was to be a QR code, but I wanted to show you this 00:28 because it excited me 00:29 when I saw it, and I think that it probably will excite you within here. 00:33 This is the skill of all the talks that are being recorded on granola. 00:37 And you can go in here, you can find your talks that you've been to. 00:41 You can go and question them. 00:42 You can go and look at them. 00:44 So please do take a look at this. 00:46 I haven't had a chance to explore it yet because I'm hosting the stage. 00:49 So I only saw the URL and thought I'll open that up, 00:52 but I wanted to tell you about it. Please take a look. 00:54 Please do feedback to the Tessl team how you find using it 00:58 and yeah, enjoy exploring it. 01:02 In the meanwhile, I am going to shut down my laptop again 01:07 and let's get Simon and Rob's slides 01:11 set back up. 01:21 Last. 01:25 To remind everybody if you haven't already, 01:27 there were some additional slots for the workshops this afternoon, 01:31 so do take a look at the app if you have to hard refresh your app. 01:35 There were. 01:36 We've increased the number of spaces in this room so that there is a little bit 01:41 more capacity than there was before, so there may be some workshop spaces left. 01:45 If you're lucky enough to have a workshop 01:47 space, please do arrive on time for the workshop. 01:51 They will be letting people in and then after a certain period of time, 01:56 they will allow the people who are on the waitlist through. 01:59 So just make sure you're on time 02:00 for those workshops so I can see that Simon and Rob are ready. 
02:05 So Simon and Rob, can I please welcome you to the stage? 02:10 Can you please give it up for Simon Obstbaum and Rob Willoughby? 02:19 They're going to be talking about from Vibes to Metrics 02:21 and how to actually measure what your agents actually do. 02:27 So what we're here to talk to you today is really kind of assess 02:31 what you're doing and real work situations. 02:34 So a lot will be done before we have to be very cool that they're going quick. 02:41 Does this seem like it's producing the right thing to do 02:45 and just a software engineering? 02:46 That's an interesting one. 02:48 We also think that that's not true. 02:50 Is it just one wall. 02:51 And then you're going to lean more towards that, actually understanding what's 02:54 going on behind and actually understanding how you can help improve that. 02:59 So I think everyone wanted to go on up. 03:02 We've got two different views on how do you get the metrics 03:06 top down, looking at correlational studies across a whole bunch of 03:12 up, really zooming in on 03:14 how many uses one skill to complete one skip. 03:18 So in terms of we are all I will get and I'm going to continue with Simon 03:23 some as a researcher with the Software Engineering Productivity Research Group 03:27 at Stanford, missing both sides of this from CTO and industry, 03:31 and then also from the researching and consulting view. 03:35 The research focuses on the macro view of adoption across the industry 03:39 from that dimension before. 03:41 So really looking at how 03:44 I say here and specifically 1 in 1 03:47 model are able to perform on a single task using a set of context, 03:52 a set of scripture that I'm going to test about this morning to support. 03:57 Okay. 03:58 So here we want now first of all we are going to 04:01 soon one step out. 04:04 So at the Stanford SEPR Lab 04:08 we put together the largest software engineering productivity study. 
04:12 And we currently have roughly 150,000 engineers 04:16 that are in one way or another enrolled in the study. 04:19 The companies that we that we work with 04:23 committed to a one year period. 04:26 In that period, we look at their entire source 04:30 code repository, all of their engineers across all other programing languages, 04:34 and we try to observe behavior patterns, changes in the data 04:40 that has been shared in, in a number of media outlets. 04:44 The first piece that we covered was Repair By published 04:48 a piece on AI engineers, found that roughly 10% of the engineers. 04:54 And if you think meaningful 04:56 between the organization one form or another. 04:58 And but we do get featured in 05:04 the bigger media 05:06 as well and on research to relevant conferences. 05:09 So when we have a pretty 05:14 big data set, many organizations involved, 05:17 the first problem is how we understand what's going on. 05:21 So what we wanted to find is a better way to deliver it than just counting, 05:26 committed counting, counting cars, looking at lines of code. 05:30 And so we spent a lot of time into thinking, 05:34 how could we help with and and subsequently productivity. 05:38 So what we found to work is 05:42 that we had here 05:46 he writes the code, and then we have a kind of experts 05:50 who look at the code and we ask them a set of questions, 05:54 and we ask them to questions along the lines of, okay, 05:57 how long do you think it took the first to implement this? 06:02 How long? 06:02 How long do you think it would be taking you to implement this? 06:05 And other questions to maintain ability and flexibility? 06:09 Then we found out that in that panel and also cross 06:13 handles, people tend to be very high agreement. 06:17 And to me I've been in engineering meetings 06:19 that that was a big surprise, that that was the first prize of the study, 06:23 that people can actually agree on something now. 06:28 Right. 
06:28 And and since 06:33 we have this high agreement and coordinating also quite nicely 06:37 without backtrack, we can have certain tickets, we have the time spent some time. 06:43 So from any way you looked at it kind of okay. 06:46 So we now how do we run it a tail. 06:50 So we did the model. 06:51 We put it in addition learning model. 06:53 And with that we were able to run at scale. 06:58 So basically what we tend to and what strategies is we 07:02 look at the analysis through the machine learning model of that tries 07:06 to replicate the expert panel. 07:11 All right. 07:12 Now with with that context being said, I'll be presenting a few ideas 07:16 and then take you back to the lower level to understand how 07:21 working we 07:23 I will show you a little bit on on how 07:25 I is impacting teams. 07:29 So this is the time. 07:32 So we looked at 46 teams that 07:34 we've been looking at them over 501 period. 07:39 At one point in the study, 07:43 we kind of have to change the methodology a little bit. 07:46 So for us it was important that we 07:49 that we find that the keys. 07:53 So we look at the difference. 07:55 We're kind of 4.25 in terms of Asian. 08:00 And we then 08:05 against these that that didn't see the problem 08:08 that encountered at one point here was that basically everyone's 08:14 trying to build a kind of the control kind of 08:18 control went away, however. 08:21 So we have to adapt the methodology into line. 08:25 But overall what what you can see. 08:27 So there wasn't a big difference between teams that that would be using AI 08:31 and teams that are using the I or even the teams that using I. 08:37 And now you can see over time that basically has has. 08:41 So over like 5% different difference in 2023. 08:46 In 2003 you can see it like up to 60% 08:50 difference in July 2026. 08:53 Just to give you some kind of exciting things are going, 08:59 yeah. 
09:00 Now if we look at this on a on a team level, 09:05 you can see also that the discrepancy 09:08 per team is is vastly different. 09:12 So the laggards there's no meaningful 09:16 change and output the bottom 09:21 25th fifth percentile is doing better. 09:24 But when you look top to bottom it's a it's a very it's 09:29 a very meaningful difference in terms of in terms of health. And. 09:35 Now in in terms of. 09:40 Looking out inside of the teams 09:42 and looking at the individuals, we see that, you know, 09:48 and I think everyone here in this room encountered someone. 09:52 You have an individual's that, you know, maybe they may not like it. 09:57 Are we into it. 09:59 Study in just kind and adopted kind of need it. 10:02 And there is an. 10:07 It's a significant difference. 10:09 It's the biggest difference that we've ever seen to be honest 10:12 I started this study in in 2020. 10:16 That's going to be, you know, thought about how do we measure productivity. 10:22 For me, I had worked in Silicon Valley and the connector 10:27 was something that I really liked and a certain extent, 10:30 and I wanted to show that it doesn't exist and it didn't exist. 10:34 But over time, we're starting to see it now. 10:38 So people that know how to create agents, people that know how to work, they 10:44 they achieve significantly better outcomes. 10:49 Now. On top of that, 10:52 and I think this is a really interesting side in terms of 10:57 how performance shifting. 11:01 So once here is the top and the bottom again. 11:07 And we 11:11 saw a pretty high rank ability direct. 11:14 The pivoted was 0.70. 11:16 Yeah. 11:17 And policy it dropped to 0.45. 11:21 So we believe also like that the tactics that anybody looking to get to perform 11:27 no longer enabling you to get top performer today. 11:31 So what we believe is what happens 11:35 to be honest, a lot of other people from the 11:38 the top performers, they move to the to the top performance. 
11:45 One hypothesis is that like these were people 11:48 that were coding would be using the teams. 11:52 And now in terms of how maybe some of that automated, some 11:55 that have got 15 way and now they became the top of form. 12:00 So arguably that's the hypothesis is right. 12:02 I don't understand why because observed that happens. 12:06 They were busy helping others deliver something delivered good outcomes. 12:11 But now they have time and they're killing it. 12:15 Yeah. 12:16 But we also see that, you know, previously 12:19 people performing the top are going down. 12:23 Other than that from from from the to 12:27 you see most of the, you know a downward. 12:33 So that's what we observe in terms of 12:37 changing landscape. 12:39 Yeah. 12:40 Now moving on 12:44 to something somewhat more technical. 12:47 What we're doing is starting to look at how people 12:52 orchestrate their agents. 12:55 And we look at that loft patterns. 12:59 We see in terms of what we see in the positive 13:03 and what the artifacts could possibly be. 13:07 So this is a brief from the paper. 13:09 We submitted it to ace. 13:11 It's it's part of the IEEE Automation Software Engineering Conference. 13:16 And it'll happen in the quarter. 13:18 So we don't have to pay for how the peer reviews 13:21 and the peer reviews are available. 13:25 And 13:28 so we look at the artifacts, we we look at the embeddings 13:33 and assign medals and correlate levels with the output of measurement 13:38 that we have shown in the beginning. 13:42 So just when you look at control system for levels. 13:46 So in terms of scientific foundations. 13:50 This comes back to the strong. 13:52 And we we can see the impact of this is quite clear. 13:58 So in terms of looking at it and it it is like beautiful. 14:04 Now what does it mean. 14:08 When you look 14:09 at it we see repos adopted by config. 
14:12 So once we start recording and run on it and you don't have any harnessing 14:17 tooling instruction skills or whatnot, we see that in terms of, 14:23 you know, change after agent. 14:27 It's just that, you know, you see more difficult 14:31 to be more Saturday mornings codification. 14:34 The volume is changing. 14:37 But it's just 14:40 yeah, it just it feels like 14:43 so the tooling and instrumentation is essential to it. 14:48 Now when we look at L2 14:51 you see a significant increase in your foot. 14:55 You see. 14:56 Even a decrease 15:00 in verb qualification goes down, 15:03 goes down. 15:04 So all metrics that we analyzed 15:08 are actually now improving today with applying. 15:13 And that wasn't always in the beginning. 15:16 A lot of the argument in the feeds that didn't want an adopted as a 15:21 little just introduce defects 15:24 within wrote. 15:25 And we don't see that anymore in level 15:29 two, level three teams. 15:32 Now, I think with that I'm happy to get over to 15:35 to Rob, who will tell you a little bit how to optimize 15:40 into the details of WP. 15:43 Cool. 15:43 So I hope that to convince you that that structure and actually kind 15:48 of the things that we care about in terms of the quality of the goodness, 15:51 what I want to do now is to go into a little bit of a deep dive 15:54 in terms of how do we put that structure in place, 15:56 using skills kind of as the artifact to focus on there. 15:59 This is the switch from that to the bottom. 16:03 So we've got our review finding to with presenting. 16:08 And I think the observation over 150,000 engineers and creating 16:12 kind of three times, once you've got that structure in context 16:16 of information sharing at the same speed or faster. 16:19 So next to kind of 16:22 just a single task and a single purpose that we're going to use 16:26 to influence the age of that house, 16:28 which in this case, the skill and the skill is that still free? 
16:31 Or you're just talking about the things that I want you to keep in mind is cool. 16:37 The work is getting, finishing the way we're seeing, 16:39 still getting shifts at the same level of the the hospital still getting. 16:43 So if you've got this idea of structure, what's that supposed to be changing? 16:47 What is that structure having? 16:50 So as I say, just because 16:54 but what changes is the agentic for us? 16:57 It's the construction of the theorem. 16:59 I want to talk a little bit about what kind of task 17:02 and what I mean by full completion and introduction. 17:05 Following so that we're running is we've got 500 skills. 17:10 We have 1000 tasks. 17:11 We got 1920s, 19 are complement of models and purposes. 17:16 So we got 19 permutations of different models, different rises 17:19 because those have had on each other and also affect the performance. 17:23 And then the trust that we're using those that are or a synthetic. 17:28 But the their branding skill itself, they're meant to be something 17:32 that they could expect skill to trigger for. 17:35 So if you don't feel I want to say, how do you in place 17:38 API security, the task that we're going to construct that that is 17:42 great. 17:43 Or change the authentication from password to some other mechanism for that, 17:48 or something else that might not be related 17:51 to expect them to be picking up on the security and looking for that. 17:55 And so hopefully potentially not trying to do 17:58 like super hard pushing boundaries here. 18:01 Think of it as kind of like well scoped to your tickets and expecting engineers 18:05 to be getting that one. 18:07 And so we see that 18:09 hitting a threshold of 1,993% of us, 18:11 whether it's in their office in holding, this is the metrics 18:16 that are specifically in the agenda, what they do. 
18:20 So if you have your own internal design for how you want to do whatever you hold 18:24 or to make sure that you're updating two supervision versus another, 18:27 that's the information that we did in the skill 18:29 and then encoded in the instruction following improvement that we see there. 18:33 One really interesting thing about that number specifically 18:36 is that we see 55% following the instructions of skill. 18:40 Even when the skill is not present, 18:42 that means the information that is encoded in the skill. 18:45 That's actually the weights of the model ready. 18:47 And so when you're talking to you, it was getting rid of anyway. 18:51 So that means that's actually valuable because burning inference 18:54 tokens and paying money down in Google 18:57 when the model is going to get in ways that they can get to do that. 19:01 So you're giving this will be finishing. 19:03 But the things that you care about, 19:06 the structure of these skills or changes, how it does it and how well it does it. 19:10 So so that's 19:14 what what it said, 19:15 something changed that the position stays the same. 19:19 But is it done? 19:22 It's not improved meant jump. 19:23 And here I dug into three categories 19:26 on what specifically that we see repeated one 19:30 gradient guy forces you might not have sell sell guidelines. 19:34 You might say, hey, we're going to allow us 19:35 to use this set of libraries when we want to do that, to use, 19:39 or we have to make sure that we're not going in and we think 19:42 that it is what is it, 70 days. 19:46 But then again, so those that information is going to be 19:49 agencies are going to know. 19:50 And so what are you going to do. 19:53 Whatever is in there are parts that, hey, if you want to hit this users that out 19:58 the service, you also need to configure the API credentials in your argument. 
20:01 This way you need to make sure you have the land that is going to to apply, 20:06 so that gets deployed before he goes, and the rest are going to have that time. 20:09 That's information that you need to set up 20:12 that they are going to know. 20:14 And then finally prohibited or deprecated patterns. 20:17 Maybe you really don't ever want an agent to create something 20:20 that is exposed to the internet and it goes through or. 20:24 You can just give it another unless you have to. 20:27 So again, what is that you need that you need of how you 20:32 want agency automating. 20:34 So this is kind of a concrete example that we found that I found interesting. 20:38 So there's a lot of focus as you might 20:41 have had an old CI 20:44 rates and a client feature. 20:47 And whenever the computer taking case 20:51 came in after model training data costs for all of the recent frontier models. 20:56 So you might say, hey agent, go implement this when you take 21:01 my script using that isn't working. 21:03 It's going to read it on the left. 21:04 And when the programs that community like anchors on, they don't want you 21:08 to be passing your potential as not as part of long enough token. 21:13 And so it's you can do the old thing and we see this because the old thing, 21:18 if you're just looking at cost confusion because it's not a building, 21:22 doesn't agree that this thing because it's like 21:24 if there's a lot of agents use this. 21:25 And so this will pass the test completion. 21:28 But it means that could be something 21:29 that that taking time and you're not going to be able to afford. 21:32 I'm not going to be able to get to use this new thing 21:35 until it gets into different data to get into one of the past, the model 21:39 data, and then six months in the future and in the future so that the future. 21:43 So this is how you then start to change that and to provide automatic. 21:47 And so we're just looking at just looking at identical. 
21:51 Still going to do the same thing. 21:53 But if it's helpful to a different standard. 21:55 And then it's that standard operating and structure. 21:58 There's a couple of key bits where we saw repeated 22:01 instances of this happening so early 22:04 with the construction following reverse, so that the tool is doing. 22:09 And this is going to be to just kind of 22:13 take a 22:13 paper ordering to figure out a different categories. 22:17 The ones where we saw the biggest improvements 22:20 are the ones that are, you know, 22:22 where you want to do something a little bit differently, 22:25 or whether there isn't as much prior knowledge in the near term data. 22:28 So we be in the same conventions and we want to be left is content. 22:33 And I hope everyone has a style isn't just run on a hard to write 22:38 whatever they want for the walls. 22:40 Security, compliance. 22:41 Your security policies can be seen. 22:43 Organization or infrastructure for editors are not to be something 22:48 that's going to be a face off the internet for data processing. 22:51 That is a bit more generic, 22:53 because there's only so many ways that you can kind of just transform. 22:57 Or do you get a whole system step and then testing today? 23:01 This one is interesting for me. 23:03 It makes a lot of sense 23:05 to do something that they've seen costs and all the test differences. 23:11 Unique testing is getting structure 23:14 to what other countries are doing and that different is better. 23:20 So this is kind of these kind of like ones where there is a difference. 23:23 And they were able to close the business. 23:25 Not sure how much value there is in terms of saying we want to force this 23:29 context in. 23:30 So investing in structure where the conventions 23:33 are local importance for business, that's where the value is. 23:37 And then finally, quickly, just in terms of what 23:40 you should be measuring and how do you want to be measuring? 
23:42 I would make the claim that kind of just looking at 23:45 that, that's just looking after, that's just looking at the time 23:48 when we actually want to be doing is measuring across discovery one is this 23:53 one notes from 1 to 100 and 30 or 40. 23:58 And just to get down there description. 24:01 As it says it takes up 1.2% budget by default. 24:04 So you need to figure out a way to actually be assessing whether you're 24:07 going for circumstances and to optimize that description. 24:11 So they can do that 24:13 second trajectory. 24:14 So many steps and accessing to do the thing. 24:18 Are you looking at the following the intended for flow 24:20 that can be used sequence as one and two. 24:23 And then the third 24:25 one that's going to do the thing that we're going to do to pass the test. 24:29 One really interesting 24:32 about that is that if you change the around 24:35 the model, you can actually move scores by 200%. 24:39 And it will have massive differences within the models hold, 24:43 because those models are going to be trained extensively on the system models. 24:47 So it's not just testing if it's not just in which 24:52 the Pacific version of the food or in open hands or in open or in any of the other 24:57 and all that, there is a single test, the whole system altogether. 25:02 Measurement is how you do this to go from 25:05 1 to 2, 12, three, 12, four and 25:09 to spend this. 25:12 Cool. 25:12 So maybe one thing worth mentioning the analysis is open 25:18 source is on our website. 25:20 Let me share that with you on the on the left side. 25:24 So based on everything that you see, like a dimension that we're lacking 25:29 in, in our research is a little bit different connection of, you know, output 25:33 and outcome is achieved in in relation with the Senate. 25:38 I think there's also 25:40 amount of uncertainty on appropriate amount of consent. 
25:44 So we decided to 25:47 spend it's and basically published 25:50 on the developer status based on our participants. 25:55 So what you can do is you can sign up and you can 25:59 you can submit your own organization data 26:02 and exchange a little bit more data. 26:06 So that will help us in the future to connect our research more in terms 26:12 of total spend, actually, in terms of the the research organizations that have 26:19 integrated with our analysis, 26:21 but they're also connecting some of their 26:25 agents and labs, and we're also getting extended data from there. 26:29 We're not able to budget all of them. 26:31 But in terms of modeling, we're working on relooking at like 26:35 what's best way to choose tokens 26:38 and how to get the most out of them. 26:42 Yeah, that's what I had. 26:45 Or yeah, here's the the websites 26:49 on our lab and the index. 26:54 And I think you all know where to go. 27:00 And I think we've got a couple of minutes for questions as well. 27:02 And so thank you very much Simon. 27:13 We've got a question right there. 27:15 Thanks. 27:16 That was first. 27:17 Thank you. 27:18 Absolutely brilliant. Thank you so much both of you. 27:20 One question about that fascinating switch 27:24 in the productivity of engineers 27:27 and in your own words, who are now killing it. 27:30 I know it may be a bit tricky to collect that dimension of data. 27:35 Yeah. 27:35 By any chance, did you include age in the groups? 27:40 We did not include age, 27:43 but we have some level of data. 27:47 We have tithing, we have region. 27:53 So so there is a bunch of data on 27:56 because under some regulation, no data. 28:00 Yes, exactly. 28:01 People can get out of it. 28:03 So we decided to not include that. 28:05 But look, we we kind of infer 28:08 that more or less by title. 28:11 That's some correlation with that. 28:13 And some companies that we analyzed all their presenters 28:18 and some people are always around for ten plus years. 
28:23 And we see that. 28:24 Thank you. 28:25 So I'll just say this and. I tell you. 28:29 No no, no I know I know I'm going to say 28:31 is it's just it may be interesting for the audience in the room as well. 28:35 I go to a lot of events and this is the first one where you have 28:39 dev development developer in the title and in the focus, 28:43 and the age distribution is very interestingly skewed towards 28:47 more senior and experienced, let's call it that way, ages. 28:52 That's a very interesting scenario, which I don't see in other events. 28:56 I've been curious since the beginning. 28:59 Are these the seniors coming back from, 29:01 you know, just management into the trenches or is, you know, something else? 29:04 But anyway, thank you. If you. 29:07 I can tell you one thing with regards to the one 29:11 to talk for a minute. 29:14 We see a lot of time on staff in here for this year. 29:18 So that's why we made a formal data office 29:22 that maybe they were helping and it didn't have time to go. 29:24 And and now, you know, we have a whole 29:27 there are other ways to figure out their time. 29:30 They can do really well. 29:38 Thank you on this side as well. 29:40 I've been reading quite a lot 29:41 on line about scenarios where people, sorry, where employees have been given 29:45 bad performance ratings explicitly because they are not using AI. 29:49 How does that factor into this stat? 29:51 And could that bias as a tool, like is it that people who don't want to use 29:55 AI are just automatically being treated as not good performers anymore? 29:59 No. Because as we look at all Indians, whether or not they use or not. 30:04 Yeah. 30:04 So, it's we really just look at their output 30:08 for the model analysis that should be on first line. 30:12 So it's the expert panel algorithmic analysis that most of your output 30:17 and have contributed it. 30:18 And and based on that we put you in the forecast. 30:23 So if you did all of that. 
30:28 Doesn't matter. 30:29 Yeah. We don't look at any data. 30:31 So this is really just based on the the radio panel. 30:36 Thank you. 30:40 We've gone back there 30:41 and then one down here. 30:45 You mentioned it 30:46 matters in what harness the model runs. 30:50 Do you regularly publish those data as well? 30:55 Maybe I wasn't precise in my language. 30:57 It matters that you harness and that you have clear instructions. 31:03 Or is it okay? Sorry. Yeah. 31:04 We made the claim that it does matter. 31:07 You're probably a better bet. 31:10 So that's just about. 31:14 Where we've done this 1950s to have a look at the communications 31:18 of all the learning system and what what we want to do that 31:22 moving forward to be sharing those numbers more widely. 31:25 So it's a bit of answer, but also on the after 31:30 everything else is going to be those. 31:31 And maybe maybe share them like right on a regular basis. Yep. 31:35 Not just one paper. Yeah. Great. 31:37 We're going to draw, you know, all the rules as well. 31:39 Just when you know there's a lot of people don't understand how performance. 31:44 Great. 31:44 Thank you. And I think we had another question today. 31:47 Oh another couple of questions down here. 31:49 We're okay for a couple more. 31:55 Thanks again. 31:56 Just questions for Rob. 31:57 I think you mentioned in one of your slides that 32:02 there is a code structure quality that you unlock, 32:06 or you've come to consensus or there's some, some idea 32:09 that the quality of the code matters or has an influence in the outcome to you. 32:15 Do you have like an idea of what that quality is, what conventions it is? 32:21 I mean, it doesn't have to be a concrete answer, but like following 32:24 some of the engineering practices related qualities, or you've come up 32:28 with some new ways of doing this that will help improve the outcomes. 
32:33 So I think what we are looking at more specifically is just kind of 32:37 following any instructions specifically, not degeneration, but there's 32:41 going by 32:42 clicking code or any other kind of any tangles, because 32:46 that is very particular to normalization in terms of how they want to reconcile 32:51 and to have a kind of like a manufacturing technology decision. 32:56 So no, we didn't know about that. 32:58 It's been really. 33:00 Was there any kind of findings from what you get to them, 33:03 like how people structure kind of principles of like 33:07 to go with that, that kind of get back to the performance. 33:11 We don't look at this specifically. 33:14 We're actually working on a paper that we're getting ready with application 33:20 performance management tools 33:22 that will give us staff dimensions, that we have nothing back. 33:26 So so there is some, some, some thought or inquisitive 33:30 on code quality might have an impact. 33:33 And you have investigated yet. 33:35 So we when I go back to the initial let me show you on the radio panel. 33:40 So there are questions on on quality and 33:46 quality. 33:46 That's pretty subjective. 33:48 So and if you can look at that maybe you see there is 33:53 this is a good question with the lowest removed 33:57 we have since then done separate 34:00 panel analysis specifically on maintenance and in terms 34:04 of how easy it is to maintain it if there is significant and higher. 34:09 So that's all we use is now. 34:10 Now this is when we talk about making ability, 34:13 whether that up or down that's going to use. 34:20 Adjusted name. 34:22 I'm interested. 34:23 In the in the diagram we're looking at before where people do. 34:25 I mean how long is it taking people to get into these high performing states. 34:30 And you know, what 34:32 can you see what kind of triggers there are and what kind of things are? 34:36 You know, the high performer is kind of getting there very quickly. 
34:39 Or are people gradually still moving across into these things? 34:44 Because your first graph of the kind of exponential growth 34:49 showed a very fast take off 34:51 very recently, but I kind of interested in like how are people getting there? 34:55 Do you understand what triggers and is it is it is it something that they can 34:59 people are still. Hitting for. Interesting. 35:02 We just haven't done. 35:03 Yeah. 35:04 We we take note that this is how it is. 35:09 What we see is that. 35:12 So once you're classified 35:15 as AI top performer you tend to say that. 35:18 So it seems to be a mindset which that takes place. 35:25 And we see that like when you have a that are very strongly 35:30 I feel like they just. 35:35 I guess even if you're having an individual, 35:37 you know, sprint into it, you know how to do it. 35:40 If you don't have to drag your team along, then you will be able to keep it. 35:45 Yeah. 35:45 I mean, the 35:45 same thing is a very interesting measure because I know people who have left 35:49 because the rest of their team is not working in a way that's compatible 35:52 with them and things like that. 35:53 And so the team trying to understand the team dynamic changes 35:57 would be interesting as well. 35:58 Sure, our some of our study participants access to an analysis 36:02 that is entitled to feel that they're losing. Our 36:07 farmers are hiring farmers. 36:11 So they do all that we we have and we offer them a panel. 36:16 And they couldn't in principle look at it. 36:18 It's just such an interesting. 36:22 Okay. I'm afraid. 36:23 That's all we've got time for now. 36:24 So thank you again to Rob and Simon. 36:34 We're now going into a break,
