AI Native DevCon 2026 London — all conference sessions as interactive skills
Speaker-label warning. The source transcript has no per-speaker labels. The body is delivered by Shachar, a Product Manager at Buzz (Tel Aviv), who introduces himself early ("My name is Shachar… Big product in buzz"). The event metadata names Simon Maple (Head of Developer Relations at Tessl, AI Native Dev co-host) as the speaker/host of the welcome session, but his introductory remarks are not labelled separately in this transcript. The Q&A section contains three unnamed audience questioners. Do not invent attributions. Prefer "the speaker said…" or "an audience member asked…" unless the transcript explicitly names someone. The transcript also contains speech-to-text artifacts (e.g. "Tessla" likely = Tessl; "platinum" in the planner/verifier section = "planner"; "Asian sessions" = "agent sessions"; "vehic" = "veteran"); the examples listed here are preserved verbatim below.
I'm going to start with the question that is haunting me day by day, and probably through the whole night when I'm asleep, for this talk. It's 2026. Why are you overviews and code reviews are still on like an engineering team? I'll start by introducing myself. My name is Shachar. Quite hard to pronounce in English. Sorry about that. Big product in buzz and the mission of building the best and the most precise AI code review agent in the market. But when I'm not doing product management, I'm surfing on waves. So that's where I am. I'm based in Tel Aviv. So when I'm not in the office, I'm in the sea. I'm a product manager vehic. I really love dealing with product management. I've been doing it for the last decade. And I think the last year or two have been the most exciting time to be a product manager and a builder. So I hope you share the same passion, and this is what I'm going to talk about during this conversation. Right now I'm based in Tel Aviv and I'm very grateful to be here. A couple of weeks ago it seemed that the odds of taking a flight from Tel Aviv to London would be very low. So thank you for hosting me here and thank you for spending your morning with me. I'm planning to relocate and establish a new site for our company in San Francisco. So if someone here is from the Bay Area and would love to connect, it will make my time there better and I would love to be in touch.
So back to the question that I started with. I'm meeting dozens of engineering teams on a weekly basis, from all over the world and of different sizes, from startups that are five to ten people to enterprise teams that have a thousand, two thousand developers. And I hear the same explanation again and again: that code review became a bottleneck. And as a good product manager, I'm trying to understand why this is happening. And I go into kind of discovery calls. Why, even though we're using the best coding agents in the world, with costly and expensive models available, are we still hitting the point where human beings need to review code? Where engineering managers explain that they have five times or three times more PRs that are waiting to be merged, but nothing is happening and velocity is actually becoming slower rather than faster.
One of the main things that repeated in those conversations was the matter of trust. People explain that they don't really rely on the agent to close the feedback loop. A lot of the users told us that the problem they had is that they have specs, and they see that the coding agent multiple times just ignores most of the specs, or part of them. And when they see the end feature that was implemented, it's not really how they intended it to look. As a product manager, when you hear that kind of pain, you decide, okay, this is going to be my mission. I hear this pain that nobody else is solving. I'm going to solve it. And what you're going to see in this session is the journey that we did to build the tools and the context to give coding agents the ability to really verify that your feature was implemented as you asked.
If you don't believe this is an issue, I'm going to show it with an example from my day-to-day life. So I'm a product manager. I define, I design skills, I design features, I decide what the company is going to implement. And this is an example of a ticket that we were supposed to implement as part of the sprint. We have as part of our product the ability to give our customers the option to integrate their ticketing system. It can be Jira, Linear, and fiber is one of them. Not a lot of people are using it and not a lot of people are familiar with it. But that was one of the features that we had to complete. And part of what we wanted to do is to add this to the onboarding flow of the product. So when a user is onboarding to Buzz, they will have the option to integrate whichever ticketing system they have. And because I'm reviewing every product, every feature that is being released, I asked to make sure that the same bug that happened when we integrated another ticketing system won't repeat again. I literally added a recording from a previous time when we added another integration, where the continue button was overlapping the new integration that was added.
A couple of days later, the developer sent me a Slack message telling me, hey, I just completed that feature. Here's the vector preview environment. Please review it. Guess what? The continue button is exactly where I asked him not to put it. Guys. Like I couldn't be more explicit than that. And I'm showing this because this is a really simple example, right? For those of you who are part of an engineering team or product design, this is our day-to-day frustration. We're writing specs. They are very detailed. And we're very happy to deliver them to the engineering teams that we're working with. But unfortunately, when we see the results, they are not what we expected. And you can understand that if this is such a simple front end task that went wrong, what would the results be if we're doing a complex task?
So the bottom line is that the coding agents that we are using are not focused on verifying features. They are amazing at generating code, and I think one of the themes that is repeated in a lot of the talks here and in other conferences is the fact that it's so easy to build right now. But the way that the coding agents are built, they are focused on generating features; they are not focused on fully extracting all the specs that we have, all the specs that we run. And they're not focused on verifying the feature as it is. And when I say verifying the feature, I don't mean verifying the code. For those of you who are working in a team and you have your preview environment, your staging, your databases, your other features that are connected to the specific feature that you're working on, you understand that comparing code to code isn't the answer. You need to deploy the feature in your staging environment and really see how it looks. The good news is we don't need to create any data that nobody has. We have specs. Specs are there. And if you're a good team, and it's a really good practice, you focus on writing specs.
So this is my dream as a product manager. This is the overview of the architecture that I started to create together with my CTO. I'm going to create an agent that is called spec reviewer. It's going to have access to specs. Specs will be saved in a ticketing system. It can be Jira. It can be Linear. It can be monday.com, Notion. It can be on GitHub. It will be able to access designs, which is something that is not so standard: the visual assets in the designs. And it will be able to access the feature that is deployed in the staging environment or in a preview environment, and verify it. The way I see it is that the agent is able to extract the requirements, all of them, and go through every requirement that is in the specs and validate whether it was really implemented or not. I'm thinking about an agent that is navigating through different types, impersonating different roles in the product, and just going and clicking and trying different cases in the product.
So as a product manager, already vibe coding, I have access to the best tools and I'm going to ask Claude what he thinks about this idea and if we can implement it. And the answer that I get is this. And this is the moment that I'm very concerned about. My mind tells me this unhappy, but when Claude tells me there's some, that's the point where I need to be concerned, and I click enter. And I get this. The reason I'm getting this is because it's a very hard task to do. And you can see here this arrow shows that the context window just explodes. So what I'm going to show now is how we took the simplistic idea of implementing an agent that is able to extract requirements and verifying them to make it actually.
So this was the first problem. The problem is that one agent just can't do it. One agent can't take all the requirements that are in the ticket, all the requirements that are in the specs, all the design aspects that were introduced, and this was a specific feature, and also go and verify them. It just explodes the agent. And by the way, that's probably one of the reasons that your coding agents don't do it.
So what was our solution to deal with crashing sessions? The idea was to divide the task between two different agents. It's also another best practice you can use for other agentic tasks. It's dividing it between planning and execution. So we're going to have one agent that is the platinum. Planner's role is to extract the requirements from the spec and understand what are going to be the failure cases that I'm going to verify through the verification process. Only one task: to extract requirements. The second agent is going to be the verification agent. The verification agent is going to navigate through different files, through the UI, through the design, and understand if the specific specs that were provided by the planner were met in this feature.
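The planning/verification split described here can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the `Requirement` and `Verdict` types, the bullet-line parsing in `plan`, and the `observed_facts` set are all hypothetical stand-ins for what would be model calls and real observations of a deployed feature.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Requirement:
    rid: int
    text: str


@dataclass
class Verdict:
    requirement: Requirement
    passed: bool


def plan(spec: str) -> List[Requirement]:
    """Planner: its only job is to turn a spec into a list of requirements.

    Hypothetical stand-in: every line starting with "- " is one requirement.
    A real planner would be a model call.
    """
    lines = [ln.strip() for ln in spec.splitlines()]
    items = [ln[2:].strip() for ln in lines if ln.startswith("- ")]
    return [Requirement(rid=i + 1, text=t) for i, t in enumerate(items)]


def verify(req: Requirement, check: Callable[[str], bool]) -> Verdict:
    """Verifier: receives one requirement from the planner and checks it."""
    return Verdict(requirement=req, passed=check(req.text))


SPEC = """
Add the new ticketing integration to onboarding.
- Integration tile appears in the onboarding step
- Continue button does not overlap the integration list
"""

# Hypothetical observation of the deployed feature: statements found to hold.
observed_facts = {"Integration tile appears in the onboarding step"}

requirements = plan(SPEC)
verdicts = [verify(r, lambda text: text in observed_facts) for r in requirements]
for v in verdicts:
    print(v.requirement.rid, "PASS" if v.passed else "FAIL", v.requirement.text)
```

The point of the split is that the planner never touches the product and the verifier never has to re-read the whole spec.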
Okay. Good news. Sessions are running. We're not crashing anymore. But unfortunately the results that we're getting are hard. What we understood is that the agents started skipping some of the requirements. We had tickets that included 10 or 12 requirements. Sometimes we had PRDs that included 20 requirements, and the agent just consistently skipped some of them. The second thing that happened was that we noticed that the quality of the verification and the quality of the requirements that were extracted was inconsistent. And we understood that even though we divided the task between two different agents, it's still too much context for one agent.
So this is kind of going to be the only graph that I'm going to show in this presentation, I promise. This is an illustration of the context problem that we're having in this task. So here is how the verification agent works. It gets the list from the planner of what the potential failure scenarios that I'm going to review are, and starts with the first requirement from the spec: going into specific files, navigating through them, testing the UI, checking if the requirement that it's supposed to check was implemented or not. It defines what the verdict would be and proceeds to the next one and the next one and the next one. You can understand that at the point where I'm getting to the fifth or the sixth requirement, I'm having loads of context that is irrelevant for that specific task. And that makes the agent just dumb.
The way to solve it would be to delegate that task between multiple sub-agents. It's understanding that I can verify each of the sub requirements in the spec with a different agent. So instead of one agent that is checking 10 or 12 or 15 requirements sequentially, I'm going to have 15 or 12 agents that are running in parallel. Each of them reaches a specific verdict, and there's an orchestrator that collects all the verdicts at the end of the process.
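A rough sketch of that fan-out, assuming each sub-agent can be modelled as a function that sees only its own requirement. The names `verify_one` and `orchestrate`, and the `deployed_facts` set standing in for real observations, are hypothetical; only the standard library's thread pool is used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, FrozenSet, List


def verify_one(requirement: str, deployed_facts: FrozenSet[str]) -> Dict[str, object]:
    """One sub-agent: starts with a fresh context holding a single requirement."""
    context = [requirement]  # nothing is carried over from other requirements
    return {
        "requirement": requirement,
        "passed": requirement in deployed_facts,
        "context_items": len(context),
    }


def orchestrate(requirements: List[str], deployed_facts: FrozenSet[str]) -> List[Dict[str, object]]:
    """Orchestrator: one sub-agent per requirement, run in parallel, verdicts collected."""
    workers = max(1, len(requirements))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so verdicts line up with requirements.
        return list(pool.map(lambda r: verify_one(r, deployed_facts), requirements))


if __name__ == "__main__":
    facts = frozenset({"tile is visible", "button is clickable"})
    for verdict in orchestrate(["tile is visible", "button does not overlap"], facts):
        print(verdict)
```

Because every sub-agent begins from an empty context, the sixth requirement is checked with the same amount of context as the first.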
Great, we're having sessions that are running. We have all the requirements that are extracted and validated. Now we started to notice another problem that is related to the quality of the context. We notice that the agent is starting to make up requirements that nobody asked it to. This is an example from one of the first reports of our spec reviewer. You can see it on GitHub here. For example, you can see here the agent decided that we need to introduce a new responder command. Like nobody asked that. The second example is maintaining backward compatibility. This is again something that the product manager, the CTO that designed a specific task, like nobody really had that specific requirement.
The reason, we understood, for what's happening here is the fact that we gave the agent only the specs. And what's really interesting here, and I added this GIF from Inception, is the fact that when we gave the agent the tickets, it only understood the specs and the design. It only understood what's going to be the goal, like what is the intent in that specific moment. But without being grounded in the code, it's just not connected to reality. The specs are a snapshot of what we're trying to achieve in this coding task. But the code is how we ground that agent to reality.
There are two interesting takeaways in that area. One: we understood that if we give the agent the base branch and the base code and not the diff, we're getting better results. Why? This was a really interesting case. We understood that if we're giving it the diff, it's biased toward the specific solution that the engineer chose to implement. But if we give it the base branch before the change, the agent is open-minded to different kinds of approaches and is more critical about the solution that was chosen. The second way to improve that kind of result was to scope the agent to understand what it's reviewing right now. So for example, if I'm reviewing a front end feature, there's no reason to be concerned about backend issues, because I'm just going to create noise that is irrelevant for this specific feature.
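Both takeaways can be read as rules for assembling the context handed to the agent: take files from the base branch rather than the diff, and keep only the paths that match the scope under review. The sketch below assumes a hypothetical `SCOPE_PREFIXES` mapping and an in-memory dict of base-branch files; neither comes from the talk.

```python
from typing import Dict, List

# Hypothetical mapping from a review scope to repository path prefixes.
SCOPE_PREFIXES: Dict[str, List[str]] = {
    "frontend": ["web/", "ui/"],
    "backend": ["api/", "services/"],
}


def build_planner_context(spec: str, base_branch_files: Dict[str, str], scope: str) -> str:
    """Combine the spec with in-scope files as they were BEFORE the change.

    The diff is deliberately not included, so the agent is not anchored to
    the solution the engineer happened to choose.
    """
    prefixes = tuple(SCOPE_PREFIXES[scope])
    parts = ["SPEC:\n" + spec.strip()]
    for path in sorted(base_branch_files):
        if path.startswith(prefixes):
            parts.append(f"BASE FILE {path}:\n{base_branch_files[path].strip()}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    files = {
        "web/Onboarding.tsx": "export const Onboarding = () => null;",
        "api/billing.py": "def charge(): ...",
    }
    print(build_planner_context("- Continue button must not overlap", files, "frontend"))
```

With the `frontend` scope, the billing file never reaches the agent, which is the noise reduction described above.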
So the good news is we have sessions that are running. We have requirements that are extracted perfectly and we're verifying them amazingly. Now we're getting to a really exciting point of sharing it with our customers that are waiting for this specific feature. And we get to the point where we're starting to plan how the launch of this feature is going to look. And we get to the point where we're saying, okay, now we need our customers to integrate Buzz into their 3D environment. So that means that Buzz needs to access multiple URLs, unknown ones. It can be teams that are just giving us access to this URL. And you can understand that that's not a good idea, right? Clicking on unknown URLs is just like clicking on a phishing link that I'm getting in an SMS or an email. And I'm going to do it a hundred times a month. My agent is sitting on our S3 bucket where our code is, where our data is, where our customers' data is, where our OpenAI credentials are. That's not a good idea to run arbitrary code on my.
So the way we decided to solve that, and again, this is our personal approach, but I'm going to talk about how you can do it again, is using a third party tool. There's a lot of nonsense people are doing there with coding agents. People are just vibe coding a lot of software right now. And if you're entering an area that you think is a dangerous area related to cybersecurity or vulnerabilities, prefer using a third party tool instead of developing it yourself. This specific solution didn't really impact the results that we generated. It just helped us deal with things that we didn't want to deal with as a dev tools company. In this case, what we implemented and used together with AWS AgentCore is the ability to run an ephemeral sandbox for every requirement that is being validated. That means that when the agent needs to validate a specific requirement, there will be a sandbox running for that specific requirement, with a browser session checking the specific feature through the premium farming, and sending back the verdict to the agent that is in our cloud. And then the agent can understand, for that specific requirement, if it was implemented or not.
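The lifecycle of a sandbox per requirement (create, check, return only the verdict, destroy) can be illustrated locally. This is only an analogy under stated assumptions: a temporary directory stands in for the isolated sandbox, no real browser or third-party service is called, and `ephemeral_sandbox`, `verify_in_sandbox`, and the example URL are hypothetical names, not the API of any product mentioned in the talk.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Dict, Iterator


@contextmanager
def ephemeral_sandbox() -> Iterator[Path]:
    """Local stand-in for an isolated sandbox: created on entry, destroyed on exit."""
    with tempfile.TemporaryDirectory(prefix="req-sandbox-") as tmp:
        yield Path(tmp)


def verify_in_sandbox(
    requirement: str,
    preview_url: str,
    check: Callable[[Path], bool],
) -> Dict[str, object]:
    """Run one requirement check inside its own sandbox and return only the verdict."""
    with ephemeral_sandbox() as box:
        # A real sandbox would open a browser session against preview_url here.
        (box / "session.log").write_text(f"visited {preview_url} for: {requirement}\n")
        verdict = {"requirement": requirement, "passed": check(box), "sandbox": str(box)}
    # The sandbox filesystem no longer exists at this point; only the verdict survives.
    return verdict


if __name__ == "__main__":
    result = verify_in_sandbox(
        "Continue button does not overlap",
        "https://preview.example.test/onboarding",  # hypothetical URL
        check=lambda box: (box / "session.log").exists(),
    )
    print(result["passed"], Path(str(result["sandbox"])).exists())
```

The design choice the talk argues for is that whatever the untrusted URL does, it happens somewhere that holds no credentials and is thrown away afterwards.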
And this is the dream come true. And this is the moment where spec reviewer is running on our code. I waited for this moment for close to six months. And you can see here multiple sessions that are not human sessions, but Asian sessions that are navigating through our UI. You can see here cases where the agent is navigating through dashboards, checking data integrity when we added different features to the dashboard. You can see cases where the agent is looking for different integrations and trying to integrate and activate a specific integration. You can see the agent trying to subscribe through Stripe, which sometimes breaks, and that's probably the most frustrating point for you as a product manager, when someone tries to pay and the product isn't working, and also trying to onboard with a Google account or with Tessla. And this is running multiple times, like dozens of times, in every PR that we're doing, to validate every feature, to ensure regression tests, and to ensure that the features are really implemented as we described them.
The last slide. The key takeaways, if you want to take this dream and build it by yourself. First of all, it's 2026 and context engineering is still a hard problem. There's a feeling that when you look at what's going on today, this is something so easy to do, right? Like context windows are so big and I have so many tokens. It's just so easy to do it. But the truth is it's not. Complex tasks that are not out of the box for coding agents require a lot of context engineering, and two tips that I can give here are, one, dividing between planning and execution, and the other one would be delegating agentic tasks between multiple sub-agents.
The second takeaway here is the fact that you have your specs and you have your code. And you don't need other resources except them. And if you combine them together, it's really a gold mine. So you don't need to create more resources to ensure that your agent is running smoothly.
And the third takeaway is the one that I talked about earlier. If you identify high-risk areas that you think can be vulnerable as part of your architecture, prefer using proven third party tools instead of exposing yourself to security issues.
The last takeaway is a personal one, for someone that is a builder and a product manager and part of the startup life. These days can be very frustrating, I think. Someone that is looking at what's happening on Twitter or in the news can think that there's no way to beat the big companies. Every day Anthropic, Cursor, or Codex release a feature where you're saying, oh damn, this is what I was thinking about. This is the big feature that I was planning to do next quarter. But the idea is that these big companies are overlooking a lot of cracks that real teams are looking for products to solve. Spec reviewer is one of those examples. For me as a product manager, it was mainly to tune in to the pains that I felt in our customers, trying to find a place where coding agents don't really solve it, and finding a way to build it. And I think this would be the best tip that I can give you from this conversation. Look for the gaps that the big coding agents are not able to fill and build the product there. That would probably be the best way for you to be tech. That's it. Thank you very much for your time.
Three hands in the end. Like you just don't care for it if you're on twist. Did you say you run reportedness? Regression tests every time you implement a new feature? Yeah. So you test everything that's already there. That's right.
So what spec reviewer does, that was the next part of it. So the first version of it was to make sure that everything that was in the spec was really implemented. So for example, the colors are correct, the resolution is right, that everything that the product manager or the designer asked for, in all the states, was implemented. But then what customers started asking was, okay, this agent is able to navigate through the UI and ensure that everything is working as expected. Let's also give it a set of critical flows where I want to ensure that none of them is breaking.
So for example, I give this example to every engineer that is coming. I say, you can do a lot of things, but if there's something I'm going to remember, something I won't ever forget, it's if you're going to break subscription. Because one time I looked at a screen recording of a customer trying to subscribe with like 100 seats, and it just didn't work. It was unsuccessful and they just ditched the product, and I said, okay, I just missed a customer. And this would be something that I would check in every PR from now on. Like ensure that it is working. And you can think about your product and your critical flows and ensure that they are intact all the time and nothing is happening to them even though you're introducing new features.
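One way to picture "spec requirements plus a pinned set of critical flows, checked in every PR" is a function that merges the two lists. This is an assumed shape, not the product's actual configuration: `CRITICAL_FLOWS` and `checks_for_pr` are hypothetical names, and the flow descriptions are illustrative.

```python
from typing import List, Sequence

# Hypothetical pinned flows; the subscription flow is the one named in the talk.
CRITICAL_FLOWS: Sequence[str] = (
    "user can subscribe with multiple seats",
    "user can complete onboarding sign-in",
)


def checks_for_pr(
    spec_requirements: Sequence[str],
    critical_flows: Sequence[str] = CRITICAL_FLOWS,
) -> List[str]:
    """Checks for one PR: its own spec requirements first, then the pinned flows.

    Duplicates are dropped while keeping first-seen order.
    """
    seen = set()
    ordered: List[str] = []
    for item in [*spec_requirements, *critical_flows]:
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ordered


if __name__ == "__main__":
    for check in checks_for_pr(["new integration tile appears in onboarding"]):
        print(check)
```

The pinned flows run even when a PR's spec never mentions them, which is what turns the verifier into a regression check.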
Thank you. Thank. You, Father. Doc. My question is about. Verification. It could be. Executed into itself. You can apply around the agentic session or at speaking. And from there with that of clever things, for example, let the agent to go through the application and verify it. Or you can build a description of the scenario. And then make the agent code this test and then just run this test code. Like a script. From committee to the code base. So once you click on the ratio resolve to one approach, one to another. How you make these decisions.
Okay, so it's a good question. The problem that I have with tests is that they test things that nobody really, like, they're not testing real life. QA teams are sitting and thinking about these imaginary scenarios that are not going to happen anytime in your product. And there are plenty of AI coding tools that generate tests. Like, here, take 100, 200, 500 unit tests. And at the end of the day, there is one scenario that you're not thinking about. The reason we chose this specific approach is that we wanted to leverage data that is already there. I don't want someone to think about scenarios that might happen or might not. Let's use the specs to do that.
By the way, the byproduct of this process, and I didn't talk about it, but I heard about it from a lot of our customers, is the fact that when they knew that the specs are used by the agents to verify the feature, it encouraged them to write better specs. So it's like renewable energy, right? I'm bringing a new tool that improves best practices and encourages product managers and designers to be more specific in the way they write specs. And I think that if testing was the answer, there wouldn't be room for this.
Thank you. Hello. Thank you for the talk. Two small questions. First, you mentioned the sub-agents. Sort of simplifying and speaking in different. Acceptance criteria to check. Rather than give that to different sub-agents. Have you used lighter models for those, given that those are faster and tokens are becoming more and more expensive? So I wanted to understand if you have a way with that. And the second one is, as you mentioned, regression testing, which we kind of do also with time to invest in your testing after release.
So I'll address the first question first. We are using different models; we're specifically using OpenAI and Anthropic ones for different tasks. For extracting the requirements we're using a heavy model, because it's a big task. The fact is that human beings aren't consistent about how they write specs. So there will be people that will say, this is something we talked about at breakfast. Here's the print screen. Good luck. And the other one would write a really big spec. So we need a model with a big brain that is able to understand what was there and ground it to what the requirements really are. While for the verification ones, we used small agents that just needed to navigate through the product, get to the right section in the product, and say, the color is the color, the resolution is the resolution, or the click is working or not.
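The routing described in this answer (a heavy model for extraction, light models for each verification) amounts to a small lookup. The sketch below uses tier labels only; no vendor model names were given in the talk, so `Task`, `MODEL_TIER`, and `planned_calls` are hypothetical.

```python
from enum import Enum
from typing import Dict


class Task(Enum):
    EXTRACT_REQUIREMENTS = "extract_requirements"
    VERIFY_REQUIREMENT = "verify_requirement"


# Tier labels only; which concrete model backs each tier is left open.
MODEL_TIER: Dict[Task, str] = {
    Task.EXTRACT_REQUIREMENTS: "heavy",  # messy, inconsistent specs need more reasoning
    Task.VERIFY_REQUIREMENT: "light",    # navigate, look, report one verdict
}


def planned_calls(n_requirements: int) -> Dict[str, int]:
    """Count model calls per tier for one spec: one extraction, one check per requirement."""
    calls = {"heavy": 0, "light": 0}
    calls[MODEL_TIER[Task.EXTRACT_REQUIREMENTS]] += 1
    calls[MODEL_TIER[Task.VERIFY_REQUIREMENT]] += n_requirements
    return calls


if __name__ == "__main__":
    print(planned_calls(12))
```

Under this assumed split, the number of heavy calls stays at one per spec while the light calls grow with the number of requirements.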
Can you repeat the second question? I just want to make sure that I'm addressing it correctly. For exploratory testing. So with the regression testing we have those core journeys that we want to make, and we want to make sure that we have not regressed in what's called functionality. But what about exploratory testing? Because as we know, every time you change something, maybe something unrelated will break. That might not be that. Culture.
So what would be an example of what you would like to test? Exploratory testing is just generally in the QA. So if you do it manually, QA just navigate the product in certain functionalities to understand if things are working. Yeah. So we focus there on where it matters and not on all the app. So there will be things that, like, I will not go and test everything in the product. At the end of the day it's like putting an agent to work. And using tokens and an agentic workflow, we'll be focusing on the one or 5% in the product that matters most, and not going into every part of the product and doing that. For that you can have static testing, testing your code. It's mainly focusing on where it matters most and putting the agent there. Like I wouldn't like it to cost tens of thousands of dollars for verifying every quality and the product, and only focusing where I'm really needed.
Thanks for the investigation. I know he said your short one, but how did you make sure that the specs that you gave the agent were actually implemented the way the spec was written? Because you said it was dropping out at points, right? Yeah. The other bit is the very important one: you said specs plus code is a gold mine, and there's an argument here that a lot of us don't care about the code that's being generated. I think it's a bit of a mistake, because if the agents are creating spaghetti code, the next time you go and try to either create a new feature or fix a bug, I think you're going to go into this negative spiral and then end up getting more bugs. So how do you address both of these?
So I have three seconds. And it's okay. Sure. Thank you very much. Next. Up in here. I'm going to. Ask if this is. Really my money. Whatever. Way it's gener. Ating the code, the. Unfortunately.