
Smaller Context, Bigger Impact

Guy Podjarny, Founder & CEO, Tessl

What Holds Devs Back From Multi-Agent Thinking

with Guy Podjarny

Chapters

  • Trailer [00:00:00]
  • Introduction [00:01:07]
  • Keynote Speaker Introduction [00:02:05]
  • The Evolution of AI in Software Development [00:03:40]
  • Challenges and Solutions in AI Agent Reliability [00:04:44]
  • Context Engineering in Practice [00:11:52]
  • Conclusion: The Future of AI and Context Engineering [00:24:42]

In this episode

In this episode of AI Native Dev, host Guy Podjarny explores the shift from AI-assisted coding to agentic development, where AI systems autonomously handle tasks. He discusses the importance of spec-driven development, emphasising the need to provide AI agents with concise, targeted context to ensure reliability and productivity, and shares practical strategies for integrating this approach into individual, team, and ecosystem-wide workflows.

AI is transforming how we build software, but letting agents “do the coding” introduces a new reliability problem. In this episode of AI Native Dev, Guy Podjarny unpacks spec-driven (or context-driven) development: how to feed AI agents just enough, but not too much, of the right information so they can deliver trustworthy work. Drawing on lessons from Snyk’s developer-first security journey, Guy connects the mindset shift required for AI-native development to practical tactics you can adopt today—moving from single-player use to team workflows and, ultimately, to ecosystem-wide practices.

From Autocomplete to Delegation: The Agentic Shift

The first wave of AI dev tooling boosted individual productivity with autocomplete and inline chat—pioneered by tools like GitHub Copilot and Cursor. The second wave has been agentic: developers now delegate tasks to AI systems that plan, search, edit, and test code. This shift promises step-function productivity gains, but it also surfaces new failure modes. Agents can be “spooky good” one minute and confidently wrong the next, breaking unrelated parts of a codebase or declaring tasks complete when they aren’t.

Guy frames this as the capability–reliability gap. Studies like METR's show developers expect AI to speed them up, yet observed completion times can be slower due to rework and verification. Ignoring agents isn't viable, because the tech is too powerful, but harnessing it requires new operating principles. The question isn't "should we use agents?" but "how do we help agents succeed predictably?"

Beyond Silver Bullets: What Worked, What Didn’t

The industry has cycled through “this will fix it” phases. Fine-tuning can teach brand-new skills but struggles to override entrenched model behaviors like coding style or architectural preferences. Retrieval-augmented generation (RAG) helps when your domain has unique terms and scattered docs, but context gathering is messy—relevant facts are often implicit, non-obvious, or not well-indexed. “Just use huge context windows” also disappoints: as context balloons to millions of tokens, attention dilutes. More input often leads to less focus.

Agentic search is a real step forward—let the agent explore the codebase, docs, and tooling as a developer would. But unconstrained exploration can be slow, expensive, or misdirected. The modern consensus Guy endorses is context engineering: stating the problem with precisely the information an intelligent system needs to plausibly solve it without additional fishing. As Shopify’s Tobias Lütke put it, the craft is in composing context so the task is solvable without mind reading. In practice, that looks a lot like writing specs.

Context Engineering = Spec-Driven Development (Single Player)

Start with the base context: your code. Good file names, clear module boundaries, and up-to-date docs enable an agent to perform agentic search and load the right files automatically. In Guy’s demo, asking an agent to “add an Edit button to each to-do item” prompts it to scan the project, locate UI components and state, and make changes without any handholding. But aesthetics and conventions are rarely encoded in code. The first attempt renders a blue button that clashes with the project’s Jurassic theme.

Enter explicit context—your “spec.” A small agents.md (or claude.md, cursor rules, etc.) that states “use the site’s theme colors” and any other non-code conventions (even “use British spelling”) nudges the agent to the correct outcome. The lesson: keep these specs short and targeted. Overlong guidance dilutes attention; short, well-scoped instructions travel further. Treat context files as code: version them, keep them close to the repo, and prefer concise “rules of the road” over exhaustive style treatises. For task work, add micro-specs in the prompt or PR description, scoped to the change you want.
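
For a sense of scale, a global spec can be a handful of lines. The block below is a hypothetical agents.md for the to-do demo; the theme file path and the specific rules are assumptions, not content from the episode:

```
# agents.md (hypothetical example)
- Use the site's theme colours defined in src/theme.css; never hard-code hex values.
- Use British spelling in UI copy and comments.
- New UI components live under src/components/, one folder per component.
- Ask before adding a new dependency.
- Run the existing test suite before declaring a task complete.
```

A handful of rules like these travel with every task; much beyond that and attention starts to dilute.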

Practically, think in three tiers (a concrete illustration follows the list):

  • Base context: the codebase and in-repo docs the agent can read.
  • Global explicit context: a brief, evergreen spec with org/project conventions.
  • Task context: a small, situational spec attached to the change request.
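
To make the layering concrete, here is how the three tiers might combine for the Edit-button task from the demo (the file names are assumptions):

```
Base context   : src/components/TodoItem.tsx and README.md, found via agentic search
Global context : agents.md ("use the site's theme colours", "use British spelling")
Task context   : the PR description, e.g. "Add an Edit button to each to-do item;
                 clicking it makes the title editable inline; keep it keyboard accessible."
```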

Measure Before You Optimise: Evaluating Agent Work

“You can’t optimise what you can’t measure” isn’t just for ops anymore. Agents are statistical systems; success should be expressed as rates, not absolutes. Guy emphasises building lightweight evaluation harnesses so you can iterate on prompts, specs, and tools with feedback loops.

His team’s experiment illustrates why brevity beats breadth. They asked agents to add session-backed authentication to a small “dodge the blocks” game. Alongside the implementation prompts, they generated a security scorecard rubric (using agents) to grade the result. Three modes were tested: no guidance; a precise ~3KB slice of OWASP authentication guidance; and a longer ~20KB version that subsumed the short one. Results: no guidance scored ~65%; the short OWASP context jumped to ~85%; the long version fell to ~81%—more words, worse focus. They ran this across multiple agents (Claude, Codex, Cursor) and saw the same pattern.

Action this by creating a mini-benchmark for your codebase (a rough harness sketch follows the list):

  • Define realistic tasks (e.g., add feature X, refactor module Y).
  • Pair each task with tests and a rubric covering correctness, security, style, and performance.
  • Run agents multiple times per condition to capture variance.
  • Track success rates, time-to-completion, diff size, and human review effort.
  • Modify specs and constraints, then re-measure. Keep what moves the metrics.
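
A harness does not need to be elaborate. The sketch below assumes a hypothetical my-agent CLI and a pytest suite; it resets the workspace, runs each task under each condition several times, and reports success rates and median wall-clock time:

```python
# Minimal eval-harness sketch. The "my-agent" CLI and its --context flag are
# hypothetical stand-ins for whichever agent you actually run.
import json
import statistics
import subprocess
import time

TASKS = [
    {"name": "add-edit-button", "prompt": "Add an Edit button to each to-do item."},
    {"name": "session-auth", "prompt": "Add session-backed authentication."},
]
CONDITIONS = {
    "no-spec": [],
    "short-spec": ["--context", "agents.md"],
}
RUNS = 5  # agents are statistical; report rates, not single runs

def run_once(task, extra_args):
    """Reset the workspace, run the agent, then score the result with the test suite."""
    subprocess.run(["git", "checkout", "--", "."], check=True)
    start = time.time()
    subprocess.run(["my-agent", "run", task["prompt"], *extra_args], check=False)
    tests = subprocess.run(["pytest", "-q"], capture_output=True)
    return {"passed": tests.returncode == 0, "seconds": round(time.time() - start, 1)}

results = {}
for task in TASKS:
    for condition, args in CONDITIONS.items():
        runs = [run_once(task, args) for _ in range(RUNS)]
        results[f"{task['name']}/{condition}"] = {
            "success_rate": sum(r["passed"] for r in runs) / RUNS,
            "median_seconds": statistics.median(r["seconds"] for r in runs),
        }

print(json.dumps(results, indent=2))
```

Diff size and review effort are harder to score automatically, but logging git diff --stat per run alongside the rates gets you most of the way.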

From Single Player to Teams and Ecosystems

Spec-driven development becomes truly valuable when it scales beyond an individual. For teams, establish a shared, minimal global spec—naming conventions, architectural decisions, security posture, logging standards, and UI tokens. Keep it short and stable. Then augment with module-level specs (e.g., “payments module uses this idempotency pattern; API errors follow RFC7807”), and attach per-task micro-specs in tickets or PRs. Encourage contributions from design (tokens and interaction patterns), product (acceptance criteria), and security (threat models, guardrails) so agents inherit institutional knowledge.

Constrain agentic search to trustworthy sources. Point agents at the repo and a curated docs folder rather than the whole internet. If you add RAG, index only vetted docs. Prefer tool usage that keeps the agent inside the project boundary (e.g., read_file, run_tests) and make external calls explicit and auditable. Small, composable specs also unlock reuse at the ecosystem level: versioned spec packages for security practices, API contracts, or UI systems that can be imported across repos. This enables teams to share conventions without copying walls of prose into every prompt.
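
One way to keep the agent inside the project boundary is to expose only a small, allowlisted tool layer. The sketch below is illustrative rather than any particular framework's API: file reads are confined to the repo, tests run locally, and external fetches must go through an explicit, logged call.

```python
# Illustrative project-scoped tool layer; not tied to a specific agent framework.
import pathlib
import subprocess

REPO_ROOT = pathlib.Path(".").resolve()
AUDIT_LOG: list[str] = []

def read_file(rel_path: str) -> str:
    """Read a file only if it resolves inside the repository."""
    target = (REPO_ROOT / rel_path).resolve()
    if REPO_ROOT != target and REPO_ROOT not in target.parents:
        raise PermissionError(f"{rel_path} is outside the project boundary")
    return target.read_text()

def run_tests() -> bool:
    """Run the project's test suite; the agent only sees pass or fail."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def fetch_external(url: str) -> str:
    """External access is explicit and auditable; disabled by default in this sketch."""
    AUDIT_LOG.append(url)
    raise PermissionError("external fetches are off; wire in an approved fetcher if needed")
```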

Finally, put specs in the loop. Add a “spec check” to code review: the agent summarises how the diff adheres to the spec and flags divergences. Use CI to fail builds when generated code breaks spec-defined invariants (e.g., missing auth checks, violating logging formats). Measure adherence and outcomes across teams to see which specs improve reliability—and prune the rest.
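
A spec check in CI can start as a very small script. The example below is a rough sketch with made-up invariants (an auth decorator on new routes, no raw print() logging); real checks would be derived from your own spec:

```python
# Rough CI "spec check" sketch; the invariants are illustrative examples only.
import re
import subprocess
import sys

diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout
added = [line[1:] for line in diff.splitlines()
         if line.startswith("+") and not line.startswith("+++")]

violations = []

# Invariant 1 (crude): a newly added route should come with an auth decorator
# somewhere in the same diff.
if (any("@app.route(" in line for line in added)
        and not any("@requires_auth" in line for line in added)):
    violations.append("new route added without a corresponding @requires_auth")

# Invariant 2: no raw print() calls in added application code; the spec mandates
# the structured logger.
violations += [f"print() instead of the structured logger: {line.strip()}"
               for line in added if re.search(r"\bprint\(", line)]

if violations:
    print("Spec check failed:")
    print("\n".join(f"  - {v}" for v in violations))
    sys.exit(1)
print("Spec check passed")
```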

Practical Workflow Patterns Developers Can Adopt Today

A reliable agent workflow blends code context, small specs, and constrained search. Start by investing in code readability: descriptive file names, clear directory structures, and up-to-date READMEs make agentic search effective. Add a repo-level agents.md with 10–15 bullet rules max. For each feature, create a micro-spec in the issue/PR as a checklist of acceptance criteria, edge cases, and constraints. Ask the agent to plan first, then implement, then run tests, then justify how the result satisfies each checklist item.
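
A per-task micro-spec can be as small as a checklist pasted into the issue or PR body. A hypothetical example:

```
Task: add CSV export to the reports page

- [ ] Export respects the user's currently applied filters
- [ ] Large exports stream rather than loading the whole table into memory
- [ ] Dates use ISO 8601; currency values keep two decimal places
- [ ] No new dependencies without sign-off
- [ ] Covered by a test that round-trips a sample report
```

Asking the agent to justify each item once it finishes doubles as a review artefact.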

On the tooling side, lean on agents that support dynamic file reading, test execution, and iterative planning. Use big context windows judiciously—reserve them for complex, tightly scoped tasks where the extra context is truly relevant. Prefer “bring the right 3KB” over “dump 2M tokens.” If you need RAG, keep the corpus tight and the retrieval focused on task-relevant namespaces. When you’re tempted to fine-tune, ask whether a small, well-placed spec would achieve the behavior change faster and more reliably.

Key Takeaways

  • Treat context as a spec: Write down the minimum information an intelligent agent needs to plausibly solve the task—no more, no less.
  • Keep specs short and layered: global repo spec (stable, minimal), module specs (focused), and per-task micro-specs (temporary, precise).
  • Constrain the agent’s world: Prefer curated sources, explicit tool calls, and in-repo docs over unbounded web search or bloated context.
  • Measure everything: Build small evaluation harnesses with tests and rubrics; track success rates, time-to-completion, and review effort.
  • Optimise by subtraction: Shorter, sharper guidance often beats long-form policies; attention dilutes with context bloat.
  • Evolve to team and ecosystem: Version specs, share them across repos, add spec checks in review/CI, and let design/security/product contribute rules of the road.

The bottom line: agentic development works when we pair powerful models with disciplined context engineering. Start small, measure, and scale your specs from single-player wins to team-wide and ecosystem-level reliability.