Speaker-label warning. The source transcript has no per-speaker labels. May Walter is the sole presenter throughout the body of the talk; the only other voice is one audience member asking a question near the end (the DataDog/Sentry differentiation question). When attributing, prefer "Walter said…" for the talk body and "an audience member asked…" for the Q&A. Do not invent attributions.

Transcription artifacts preserved verbatim. The source contains several speech-to-text errors that have been left in place rather than silently corrected. Notable ones: "Granola breathe" almost certainly should be "agent breathe" or similar; "auditory" appears in a list where it probably means a different tool; "in rain" likely means "ingrained"; "five years"/"50 years"/"six years" all appear when referring to codebase age and may be transcription drift. Quote these as-is; flag the artifact in the answer if it materially affects meaning.

Participants known: May Walter (speaker) + an unnamed introducer + one unnamed audience question.

Section 1 — Opening / framing the pain

Added to the thing. Would you mind just skewing. Lots of introverts in the room? Thank you. We have got a couple minutes. I'm just gonna let people go. Straight to someone. Okay. Hi, everyone. Thank you so much for coming. I hope you're not too sleepy after lunch. Can you raise your hand if you were able to get the amount of food that you wanted? Okay, excellent. Great. Glad to hear it. Cool. Well, welcome back to the tool call. This room today and tomorrow is all about practical stuff. This is where I want to be. It's no, like, sea level, heady takes about what AI isn't. It's like, what can we actually use? Today and tomorrow? So May Walter is here. She is the co-founder of Hud, and she's going to be talking about the chasm, if you will, between what agents think they do well. And what they actually do well. So can we give her a round of applause for you? For taking the time and joining after lunch?

Hopefully it will. I have added some points in which you might wake up a little bit and might, might be funny. We'll see. So I wanted to start with the situation that kind of led us to this actual use case, which is a situation you might know there's this PM saying, oh, my God, this page is so slow. We have to do something about it, and then they text the engineering manager, like, so can we optimize it? And then, then we say probably that I have today needs to find out. And then they say, oh, never. Mind. Then. Right? Because they're busy. They're doing things, right?

And then a few weeks later, she said, no, no, no, no, no, no. But this is actually too slow. Now we're going to have to do something about it. How long will it take? He says, well, I don't know, right? Somewhere between an hour and a week. I'll have to look. Okay. And, and who can even take this? The answer is only there. He is the only one that has any idea what's going on in that code base. The. Rest are not in the company for five years. So suggest. One of our customers thought. It would this situation where they were constantly in this. We have to fix this. We have no idea how, but we can prioritize it. Based. And then we said, well, here we do something about it.

Section 2 — What Hud is building

So my co-founder and CTO have had, we're building a runtime intelligence layer for coding agents. So it's basically a sensor that runs with your app in production and captures what coding agents need to reason over production. So, like, for every function, how often it runs, how long it takes, whether it's paying. And if something goes wrong, it proactively captured the forensic context about why that happened so that when you ask why is this low or why is this failing, you can actually get an answer.
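A minimal sketch of the kind of per-function data described here (how often a function runs, how long it takes, whether it fails). This is not Hud's sensor: the `observed` decorator, the `STATS` store, and `load_invoice` are hypothetical names used only for illustration.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from functools import wraps


@dataclass
class FunctionStats:
    calls: int = 0
    errors: int = 0
    total_seconds: float = 0.0

    @property
    def mean_seconds(self) -> float:
        return self.total_seconds / self.calls if self.calls else 0.0


# Hypothetical in-process store; a real sensor would ship this data elsewhere.
STATS = defaultdict(FunctionStats)


def observed(func):
    """Record call count, duration, and failures for the wrapped function."""
    name = f"{func.__module__}.{func.__qualname__}"

    @wraps(func)
    def wrapper(*args, **kwargs):
        stats = STATS[name]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats.errors += 1
            raise
        finally:
            # Runs on success and on failure, so every call is counted.
            stats.calls += 1
            stats.total_seconds += time.perf_counter() - start

    return wrapper


@observed
def load_invoice(invoice_id: int) -> dict:
    # Hypothetical application function used to exercise the decorator.
    if invoice_id < 0:
        raise ValueError("invoice_id must be non-negative")
    return {"id": invoice_id}
```

After a few calls to `load_invoice`, the matching `STATS` entry holds the call count, error count, and cumulative time for that one function, which is the granularity a coding agent reasons at.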

And the big problem is that, you know, it got easier to build and to generate code, and then you kind of have this like, yeah, looks good to me and a bunch of pull requests. That looks good. You have one agent that build them. You have another agent that says, yeah, that looks great. This is a great idea. And then the question that you ask yourself is. Why is this broken? Like, but, but, but why? It, like a lip test were passing and everything was fine. And code still finds creative ways to fading production. And we still do that, like, with the best tools and the best engineers and the best days of the universe. And the coding agents have no idea how that actually behaves. And we are basically obsessing about how come this. Can live alongside this. At scale. So maybe there's an agent we can build for that.

And basically, I'm going to go over the journey that we had there. Why was that even important for them? How they did it from how we did it from a tech perspective and most importantly, how we thought we would do it and didn't work and we did about that. And also how I work from a process perspective, which I think is also very important today to understand how to connect all these dots from it kind of work from technical perspective to people are actually using it. And summarize what we can take out of this.

Section 3 — Why performance work stalls (the leaky bucket)

So that makes faster than we fail. And I think that was true for humans as well. We kind of ignore the issue and then it degrades and then it becomes a crisis. And now we have to prioritize it. So we, we fix it. I don't know. Does anyone relate or is it just. Okay. Okay. We're good. We're good.

So you're okay. It just so happens that everyone is doing this. It's not because we're stupid and it's not because we don't care. It's because we're very busy building features, which is what we are here to do. We're engineers, we're builders. So we try to ignore things because they might not be as important. And then once they get to that crisis mode, we want to fix them as soon as possible so that we can go back to building. That is a leaky bucket. By definition.

So the research phase is a big problem there, because if I knew that I can fix it in a day, I would probably prioritize that. But what I have to do is to pay just to find out. It's kind of like taking something from the grocery store and then going all the way to the cashier just to know how much it cost so that you understand if you want it or not. And that's what we've been doing for years. We were like, okay, let's invest a few hours and then we can understand what can be done and magically, there's always something we can do, right? But we weren't aware of it. And then when we know that something could take an hour to three weeks, we don't want to prioritize because we don't know what we're going to get out of it. We don't know how much optimization we're going to get. So you don't know how much it costs. You don't know what impact it has and you're expected to just prioritize that. In your day-to-day life comparing to features that give value to the customer. So that doesn't really add up.

Section 4 — The reframe: automate investigation, not the fix

So our thought was to automate the investigation part. It wasn't even about fixing it. It was just about knowing what can be done. Automatically. And then. Being able to streamline the fix. So what if we just run on a weekly basis or bi-weekly depending on your spring planning sequences and just know about real sweet spot. With high ROI opportunities? That are safe? So we can just run it every week. And that performance sprint that we've been trying to get to for months can just happen regularly. Sounds pretty good. Except that it was pretty hard. So let's talk about what we did there and what was helpful.

All right. Did I convince you about why this is needed? Great. That's good. It's a good start. I think a lot of time when you're talking about agentic workflows, the first question is like, does it really need. To be built? And it gets so much easier to build stuff that we just stumble on building things because they're fun. You know, auditory and all these things are like very good examples of that. I think what was interesting about this case is that we were trying to automate something that is not a part of our day-to-day lives in the same way. This is built for how agents work, which is kind of like this automatic thing. You won't stop every week and ask an engineer to kind of go over and find some interesting performance optimizations. But if you knew that they exist and that engineer would just like slap their team lead saying, hey, I just found this thing like in two hours, you can get this down 30. They would say, wow, that's amazing. Right? So maybe that's. Sort of the first wedge to change that behavior.

Section 5 — Where the agent runs (neutrality, GitHub Actions)

So let's talk about how that works. First of all, we needed to ask the question of where can this run? Because agentic workflows don't, we don't want them to run on our computers. We want something to run in the cloud. And it just so happens that not every customer we work with even has their own setup for that. And we have players like Cursor and others that are building their automations so that it's very, very easy to install. But at the same time, you're now locked to that specific vendor, which does not make sense at scale. So we want something that's better neutral, both in terms of the compute, as in where it runs. In terms of the harness, as in which coding engine is and which model we choose. We don't know. No one knows what's the best. There are seasons.

And then what we wanted to do is to make it easy to switch. We wanted it to be secure, of course, in terms of the permissions. The tool calls, the authentication. We don't want to build something for scratch. We want to use something that we can trust. We want to set up triggers that could be web hooks like, hey, this is slow. Let's investigate it. And also on schedules that are per deployments or on a weekly basis or whatever that you different use of. And we wanted it to be easy to maintain and update, which is what I think people get. Most, most wrong about agentic workflows is you don't just ship them. They are like code. You ship them and you find out something you just want to change between that little thing so that it will be easier to work with. And then if it's hard to do that. Then people just won't. And in our case, the biggest question was what matters? That's something that is live and changes with your business and product. So it was very important for us.

Typically, we chose GitHub Workflows. I don't know if you're going to do that, but it was actually fine. In the audience, you can talk to later. And the idea was that people already have GitHub Actions or those similar, not everyone, but many of them. So we can just start with something that we already know that they already trust that it has these parameters in place. Again, it's not the only solution, but it was something feasible. We don't want to innovate. We don't want to build this new agentic infrastructure for whatever and prioritize the platform team. We have GitHub Actions. So now we can have those in a way that works. And again, it's about reducing friction and finding that path of least resistance destroying something else and see if it works.

Section 6 — The workflow shape

So that's kind of how it looks like. And if we go over it. A bit more. Big enough to do this. The first thing is you actually choose the engine. As in where do you want to run? And is it Claude or Claude or whatever? Which permissions you have, the network, the tools and so on, the MCP servers. In this case, Hud's MCP server. And then you just start talking. So the task is basically a prompt that runs every once in a while or trigger and does that.

And if I go back. To here. You can see that it's basically a witty report of an AI performance and reliability engineer for the repo that generates this deep insight and context. About. The repo and white dust. And this was actually, we started with the repository and then we even like did it more in rain because a lot of people are still using Monorepo. And the idea is not to report about all the cool things that we can theoretically do. It's about the thing that the tech lead for that specific service cares about. So they are getting designated reports only on the parts that they care about.

So basically we have a coding agent, it could be Claude or anything else specifically we use Claude. And run through intelligence run weekly on GitHub Actions and sends the report to Slack. And again, I'm not saying this is the only way to make it work. What I'm actually saying is find a way that matches your stack and your tools so that it would be as easy as possible. They were using Claude, GitHub and Slack. So that's why it made a lot of sense. For us to kind of latch onto that.
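A rough sketch of the last step, sending a per-service report to Slack, assuming a Slack incoming webhook (which accepts a JSON body with a `text` field). The finding fields (`service`, `endpoint`, `summary`) and both function names are assumptions for illustration, not the format used in the talk.

```python
import json
import urllib.request


def build_report(service: str, findings: list) -> dict:
    """Build a Slack message body listing only the findings for one service."""
    relevant = [f for f in findings if f["service"] == service]
    lines = [f"Weekly performance report for {service}"]
    for finding in relevant:
        lines.append(f"- {finding['endpoint']}: {finding['summary']}")
    if not relevant:
        lines.append("- No new opportunities this week.")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Filtering by service before formatting reflects the point that each tech lead should only see the parts they care about.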

But let's talk about what it does, right? What we basically want is to start with the production context. How the agent analyze it to find anti patterns and opportunities. Then we want to score and flag them because not everything that can be done should be done. And that's because we still have to review that code and merge it to production. It's because we want to make sure that it actually has an impact. And we don't want to take a risk on deployment. If it's not going to be that impactful. And again, I hope no one relates to that. But a lot of times what happens in performance is you start with a thesis and then you fix it and then you deploy. And it's not exactly as impactful as you expected it to be. It was really great in staging, but production. The problem isn't really that. So we try to avoid that and to sort of convince something that's worth doing and then go over the dip. And merge the PR with a human review. And that sort of closes the loop.

Section 7 — What went wrong first

And then it didn't work. Right? Because that's how it is. First of all, some of these offers were plausible but unverified, which is kind of like what we do when we go over the code. It's like, well, we can probably optimize this. What we don't know if that's really the bottleneck. If something's taking 20 seconds. Or five seconds, I'm not sure exactly where that time spent. So things that sound right are not necessarily the ones that are going to move the needle.

Second, what's specifically for queries. So queries can be very complex. And also they are very dependent on how the code actually runs and where the data is if you have that customer with like 10,000 rows returning from the database. It's not a big surprise that it's going to take longer. It's just that you have no idea how that looks like.

And of course the biggest problem is what we call lazy fix, which means, oh, there's an exception. Let's catch it. This is great, but it's not helpful at all. A lot of time what happens is that. If you try to optimize for the syntax error or the long run in query, then the question and the answer will be local. And we actually want to look at it from a broader perspective.

So those are like things. That. I guess it's fine that they will start that way. But in time we understood that we want to explain what a stick does look like and what am I interested in? Kind of like a step engineer would explain to someone who just joined the team.

Section 8 — The context problem (prod-to-code)

And all you need is context. Right? And there are only two problems with context. One is you might have too much of it and then it's really hard to understand what matters. And the other is you might not have enough. And then it won't matter. Because the agent wouldn't just assume what's going on there. Or what I would do as an engineer. Is I would probably add more logs or metrics. Deploy a version. Gather formats, understand where the bottleneck is. And then go and fix it. What the agent will do is to say, okay, this is what I have. Let's find the shortest path. Let's assume six consumption and see where that goes. We can't do that at scale, especially if we want to interrupt engineers flow, right? That's like the most important. We love that.

So if I am going to flag an opportunity. I can't risk just saying something that won't really make sense because then they will read my report. They will open a pull request, they will review it and then deploy it and then find out that it didn't change anything. And then I just wasted their time. So it was really important for us to make sure that we have just enough context so that we can understand and have confidence in the fact that it's actually going to do something.

But that's actually hard to do because they don't speak the same language. You know, the production context is usually service level or endpoint level. This endpoint takes five seconds and the p99 is six seconds and the p100 is 70 seconds. And then the agent reasons over code like the local function files and class methods. It can kind of deduct the connection between the two, but they don't speak the same language. And that's where most of the problem is when you ask why is it long? You could go over the code and it might get it right. Don't get me wrong. Some of the issues can be found in static code analysis. But when we want to make that transition between human led to an agenda workflow, we have to be quite certain. In our confidence so that we can truly automate it. If something works 90% of the time, it's not an automation. It's streamlined in humans.

And that's why we built what we call prod to code, which is basically a mapping of what's going on in production to the function level. So you have what the end point or the service and the endpoints or event consumers and the front jobs that it runs. And then the mapping of the functions that are involved within. So one thing that we got from this is the ability to ask, this is slow. Why? But it also opened a huge window of opportunities to ask the inverse question of, I'm going to touch this. What does it impact? And should I care about it? Is it going to touch my payments and my authorization or not?
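The two questions in this passage ("this is slow, why?" and "I'm going to touch this, what does it impact?") can be sketched as a mapping and its inverse. The entry points, function names, and the `critical_prefixes` rule below are all hypothetical; the actual prod-to-code data model is not described in the talk.

```python
from collections import defaultdict

# Hypothetical mapping from production entry points to the functions they invoke.
ENTRYPOINT_TO_FUNCTIONS = {
    "POST /payments/charge": ["payments.charge", "db.save_order", "auth.check_token"],
    "GET /orders": ["orders.list_orders", "db.fetch_orders", "auth.check_token"],
    "cron nightly_cleanup": ["cleanup.purge_sessions", "db.delete_rows"],
}


def build_inverse(mapping):
    """Invert the mapping: for each function, which entry points reach it."""
    inverse = defaultdict(set)
    for entrypoint, functions in mapping.items():
        for function in functions:
            inverse[function].add(entrypoint)
    return inverse


def impact_of_change(function, inverse, critical_prefixes=("POST /payments",)):
    """Answer 'if I touch this function, which production flows are affected?'"""
    entrypoints = sorted(inverse.get(function, set()))
    critical = [e for e in entrypoints if e.startswith(critical_prefixes)]
    return {"entrypoints": entrypoints, "touches_critical_flow": bool(critical)}
```

With this data, `impact_of_change("auth.check_token", build_inverse(ENTRYPOINT_TO_FUNCTIONS))` reports both endpoints and flags the payments flow, while a change to `db.delete_rows` only reaches the nightly job.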

So basically. What the sensor is doing in that aspect is complete function level context. In a way that's connected to the endpoint and deep forensic context only when it needed. And the assumption here is that we can't just go through like gigabytes of logs and spans and traces to be able to do that. But if we have for each of these functions the ability to connect them to how often it runs, which business flows, does it impact? How long it takes and whether it's failing. Then we can start going over the deeper context where the slots or traces or whatever, when we kind of have the area of where we are trying to tap into. The same level in which the cottage is reason. Able.

Section 9 — Layered architecture (query language → skills → automations)

So we have the basic query language layer. We specifically use ClickHouse for that. But of course the team has their own. And on top of that, we can build a set of skills. Because it's not just enough to connect skills, to connect the data and kind of ask the agent to see what happens. I saw like five different talks about why that fails. So we have the skills of how to approach an HTTP 500, how to approach a memory leak. How to approach a performance degradation. And all of a sudden it's not just safe that can tap into that knowledge and expand it.

And on top of that already the level of the automations. So r effects or dead code removal are building on top of those skills that are building on top of the query language. And I think it was really important in terms of what we learned to allow all of these layers. The coding agent might need something that I cannot express in words. And that's why tools are not enough. But at the same time, I don't want to waste my tokens and my chain of thought. On things that I know. I do know how to tackle a memory. I want to look at the pod that had that memory. I want to look at the different one that didn't. I want to see what happened there that didn't happen there. Right. When I look at the performance degradation, I want to understand where the time is being spent.

So. That small level of. Involvement and ownership on the methodology was very impactful for us. And then we can say something like look for artificial delays like sleeps and timeouts. Look for n plus 1 queries. Look for missing indexes or synchronous blocking your sequential weights. All of these things. And when you think of the code base that exists, in their case for 50 years, obviously these things are there. It's just that we need to kind of dig them out. And know that they exist. And then you have these moments where like, oh my god, this was like this for six years. Are you serious? But if we look at it from a positive perspective, there's like so much we can do.
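One of the checks named here, N+1 queries, can be illustrated with a small heuristic: within a single request, the same query shape repeated many times is suspicious. This is a sketch under assumptions, not the skill used in the talk; the `(request_id, sql)` log format and the threshold of 5 are hypothetical.

```python
import re
from collections import Counter


def normalize(sql: str) -> str:
    """Reduce a query to its shape by replacing literals with placeholders."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def find_n_plus_one(query_log, threshold=5):
    """Return (request_id, query_shape, count) for shapes repeated within one request."""
    counts = Counter((request_id, normalize(sql)) for request_id, sql in query_log)
    return sorted(
        (request_id, shape, count)
        for (request_id, shape), count in counts.items()
        if count >= threshold
    )
```

A request that runs one query for a list of orders and then one query per order would surface here as a single repeated shape with a high count.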

And I think in that aspect that gave us that. Balance. Between kind of letting Granola breathe and having that full freedom, but also kind of guiding as to how we know that engineering practices and skills work. And then we can ask something like why are my implement statements so long? Like my product manager asked and get an answer. Oh, it's actually this thing. With an impulse one that existed in the cold base for like six years. Maybe we can do something about this.

Section 10 — Why blind auto-PRs failed

And we were very happy and please about the result. So we said, okay, maybe we can just like add open pull requests for all of these high impact low risk changes that don't require migrations or anything super scary. And we could just open a pull request. We don't want any pull requests that no one's going to go over. And we don't want to go over those 80 focal press. Even though they exist.

And I think that's like if you can take one thing out of this, it is like after you get it working and you want to automate it, think about the scale. And think about the humans that are still in the loop or not in a point where no one just cares about it. And if you open a pull request, it's kind of like opening your DataDog and Sentry with like 700 issues. That's all they say is like, well. I'm not going to be able to fix this, so I'm not even going to. Try. So of course we started with automated pull requests. We realized no one cares about that. And if it's not prioritized, then we're not going to do something about it. And again, I'm not complaining. I'm not judging. We're builders, right? We have prioritization. It exists for a reason. No one has time to just go over a bunch of pull requests built by the agent that are statistically huge and try to understand what happens. We don't want to own that. We don't want to go into it.

So we actually need to convince the human that it's worth the attention instead of convincing the agent that it's worth the token that people always think that they are.

Section 11 — The working pattern: scored quick wins

So we map the hard paths. The endpoints that are invoked. The business impact of that. So is this impacting payments of authentication or the set of things that you already know that you care about for your business? And also the risk? We are not looking for the best optimizations. We are looking for the highest impact, lowest risk changes. That can be done so that we can be able to develop our community. Hey, listen. That small and it's going to do like 30% improvement. So we should do that.
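A sketch of ranking by "highest impact, lowest risk" rather than by best optimization. The scoring formula, the field names, and the 1 to 5 risk scale are assumptions made for illustration; the talk does not specify how scores are computed.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    weekly_calls: int              # how hot the path is
    seconds_saved_per_call: float  # estimated improvement
    business_weight: float         # e.g. higher for payments or authentication
    risk: int                      # 1 (trivial change) to 5 (needs a migration)

    def score(self) -> float:
        # Hypothetical formula: estimated weekly time saved, weighted by
        # business importance and discounted by risk.
        impact = self.weekly_calls * self.seconds_saved_per_call * self.business_weight
        return impact / self.risk


def top_quick_wins(opportunities, limit=3):
    """Return the few highest-scoring opportunities instead of everything found."""
    return sorted(opportunities, key=lambda o: o.score(), reverse=True)[:limit]
```

Capping the output with `limit` matters as much as the formula: a short ranked list is something a team lead can prioritize, while a long one is ignored.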

And we want to make that humanly readable. So if a human gets something like, hey, this endpoint. Is like the p90s around 100 milliseconds except it kind of takes 45 seconds every once in a while and we have no idea why. Theoretically speaking. And then, oh, this is happening because you always use the mongoose distinct function. If you switch it to a search. Then that should slow you down. Like improve like 30, 40% This is something that we can actually prioritize.
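The endpoint described here, usually around 100 milliseconds but occasionally 45 seconds, shows why a single percentile can hide the problem. The sketch below uses the nearest-rank definition of a percentile on made-up latencies; the sample mix (995 fast requests, 5 slow ones) is an assumption for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 995 fast requests and 5 very slow ones.
latencies = [0.1] * 995 + [45.0] * 5

p90 = percentile(latencies, 90)
p99 = percentile(latencies, 99)
p100 = percentile(latencies, 100)
```

With this sample both p90 and p99 sit at the fast value and only the maximum exposes the 45-second outliers, so a report that quotes p90 or p99 alone would describe the endpoint as healthy.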

And once we started talking about these quick wins, you can see that they can either open to dive deeper, they can create a ticket or create a PR because again, I don't want to be opinionated on exactly how advanced they are. That's fine. Either of those work. You can do it later, we really know. Obviously both are labeled so that we can measure. It. And all of a sudden the end point that used to take 100 milliseconds except the fact that it kind of takes 45 seconds every once in a while. Well, it doesn't anymore.

And I think that was a very, very big learning for us that we just want to get to a point where the humans kind of made the decision that it's worthwhile. The fact that merging to production did not become, it's not free, it's just cheaper. To merge and to ship. And then we kind of have the ability to learn more about them.

Section 12 — Four takeaways

So four things you can take from our experience so far.

One, we still need to define what matters. And I think if I can be a bit philosophical, even though it's a use case. I think that's going to be a huge part. Of what we do as engineers in the future is decide what's worthwhile, worth the tokens, worth the time, whether it's a human that's reviewing it or an agent. We still define what matters more and what matters less and the ability to score by what will be impactful for business. Is really important. And I think that's something that we can all take to our day-to-day lives and if we inform our agents on how we look at what they do, then they're going to do better. Because they don't just lack the runtime numbers. They also like the business context.

Second part is to automate the investigation phase to maintain prior organizations. And this is an example that we have for performance. We're doing several things for reliability and the other things. But I'm not saying that the backlog should be empty. Just because we have agents. I'm saying that we have a way to use the agents to make better decisions on what to prioritize. And this is a use case that is not as common as what we're seeing. Like a lot of people are using agents. To do what they were planned to do faster. And this was a way for us to help people. Prioritize the ROI and not just the task itself.

Context over cleverness, I guess like it's, it seems obvious. But the agents get useful once they see what's going on. In our case, it was production context. Being able to understand where it is and to have very high signal data. About it. So that they can. Build it. It's the same way that a stack engineer cannot just move to a different company and shift to production on day one. That's the context that matters and it's going to change their behavior the same way that it changes our behavior.

And last but not least a genetic engineering is not like coding with an agent. It is different. If we want something to work as a workflow. It has to actually work and it has to be confident in what it does. It is better to do less, but the right things. And if I, in any time I would suggest something that won't work, we lose trust in the process. So the automation unlocks continuous impact, but it also requires a higher confidence level. So choose the things that you feel that your repo and your automation level are right for. But I think that's where we're going.

Section 13 — Vision and value-along-the-way

And when we try to think about that art, we're engineers. When we know the scale, we can architect the system. So I think the scale at what we're going to see. The question that I'm asking myself is how would it look like to get to a point where some engineers and some of our customers companies would actually just click that merge button. And get that fixed as is. And because I can measure it now. We can get better and better until we get to that point.
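One way to measure progress toward that point is to label every suggestion with what happened to it and track the share merged unchanged. The outcome labels below are hypothetical; the talk only says that tickets and PRs are labeled so they can be measured.

```python
from collections import Counter

# Hypothetical outcome labels for each suggestion the workflow produced.
OUTCOMES = ("ignored", "ticket_created", "merged_with_edits", "merged_as_is")


def outcome_rates(labels):
    """Return the fraction of suggestions that ended in each outcome."""
    unknown = set(labels) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    counts = Counter(labels)
    total = len(labels)
    return {outcome: (counts[outcome] / total if total else 0.0) for outcome in OUTCOMES}
```

Tracking the `merged_as_is` rate over time gives a concrete signal for when a team might trust the workflow with more autonomy.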

And I think, you know, we can talk a lot about how that would be perfect if it just works. But the thing that was, I think really about the process here. Is the ability to start with something that is still net positive and it allows you that fly with understanding where it got stuck. Where the inside's not good enough. Were they good enough? But it was hard to merge them. Did we merge them, but they didn't give that effect that we tried. And that lets us learn and improve. Before we get to that point of the full optimization. And again. My dream is to get to a point where people would say, well, 80% of the time you just did the exact same thing and merged it to me. And then we will know that we're ready before providing value along the way as we're getting there when models are getting better, context is getting better. And everyone can kind of spend their time on the right things. So thank you. I hope that was somewhat helpful. In this tough hour and we have some time for questions.

Section 14 — Q&A

We have one minute. If someone has a very good question. One in the back. If someone has a question that requires more than a minute. I'll stay outside. Now. To answer. That.

[Audience member] Thanks also for the talk. I have one question. So most people, we probably already have integrity like DataDog or Sentry or some other like platform. You need to ingest loads of data to be able to get all these collection metrics. So how do you stay competitive with them? Because you need to also just step around the paper, but then people will wonder like, oh, we just like also pay you 10k one or whatever to be able to do these stuff. So how do you help?

[Walter] That's a huge question and especially for startups. So we always like to say that we're like an espresso shop. We only have an espresso, but it's the best one in town. We try to focus on the things that we do better. If they have DataDog or all these skills will ask, okay. But do you still suffer from production incidents and issues? Do you still spend way too much time? Are you still surprised by issues that your customers report? And if so, let's run together. Obviously, like for example distributed tracing if they have that, we will just connect to what they already have. The thing is the function level context and the forensic context that is being proactively captured. And we focus on that and integrate to whatever it is that they have.

Okay, thank you so much. And I think. We can meet them. All. The way. To find the. Kind of reg. Ret like every. Day. And. It. S more. It tries to. Find. Like. Let me. Try. I think. This. Should be. This type. Of. Review. First. I think it's possible.

ainativedev/latest-aidevcon-speakers-london-2026

Transcript — From Blind Spots to Merged PRs