
Why Context Beats Every Prompt You'll Ever Write
Transcript
[00:00:00] Simon Maple: In this episode, we're gonna be asking the big question: is agentic development just coding faster with agents, or is there a bigger, more fundamental paradigm shift that we need to make with our software development? Today, we are gonna be asking the questions: do I just need to learn how to talk to agents better, or are there greater things that we need to learn in the practices of agentic development?
[00:00:22] Simon Maple: So there are a lot of problems there, Guy. What is the solution to this? I'm guessing you're gonna say something about context.
[00:00:29] Guy Podjarny: Well, you know, now that you put this in context, it is so. Yes, I think there are a lot of challenges over there, and some of them are human, and we need to deal with them.
[00:00:39] Guy Podjarny: But I do think that the hammer that we have as we look at all these nails or nail-like problems is context. And the reason for that is that LLMs eventually are just sort of stateless machines that we pass a bunch of context to, and they calculate their weights, and they figure out what the next words are.
[00:00:59] Guy Podjarny: And so managing what are the words that come in and what is the information that comes in is really our primary tool that is both our human means of alignment. Those are the words, the institutional knowledge preservation, and how do we make sure that we work as a team and as an ecosystem? They are the way in which you communicate and convey things to the LLM. I guess an easy way to understand this is to think about humans. What are your tools for managing a team?
[00:01:28] Simon Maple: Mm-hmm.
[00:01:29] Guy Podjarny: Communication. That's really how you can work with it. How do you incentivise behaviour, as in, what do you respond to, and how do you respond to it? I think a lot of those analogies really come down to context.
[00:01:44] Guy Podjarny: And so I think the core competency in agentic development is context management. And there are a lot of types of context. There are rules that you are kind of explicitly and aggressively pushing. There are skills that you are hinting at and making available to the agent to try and pull down.
[00:02:04] Guy Podjarny: There are docs, that is, information that is available for the agents to find and use in their own time. And I'm sure there will be other variations of how you drive the agents to consume and load the right context. Then you have to think for yourself: when you think about the development paradigm, how do you manage that context?
[00:02:25] Simon Maple: And that actually keeps the parallel with humans and teams. It's really interesting because rules, for example, things that are mandatory for an agent to learn, are very similar to the rules that we have to abide by, whether they're organisational rules or development rules that we want to follow.
[00:02:41] Simon Maple: Docs, for example, could be reference docs. A developer doesn't keep all of that in their mind, but as and when they need to learn something, they know where to go and then they can learn that. So realistically, it's a little bit like onboarding these agents with amounts of knowledge and intelligence or those types of things.
[00:02:58] Simon Maple: You wouldn't expect a human to do that without this type of knowledge. So I guess when we use something like context, what is the key to using context and making agentic development successful?
[00:03:11] Guy Podjarny: So I think there's a bit of a loop here that we can talk about more, but the sequence of steps is: one, you have to define and capture, kind of write down, what it is that you want the agents to do.
[00:03:23] Guy Podjarny: That's oftentimes quite hard. I think we've had some analogies in other conversations where if someone asks you, "What do you want for your birthday?" it's easy to say, "Well, you should just know it."
[00:03:37] Simon Maple: But I've got a list, Guy. I never thought you'd ask.
[00:03:39] Simon Maple: We can start.
[00:03:41] Guy Podjarny: So a list is useful. It's not that it does anything for me, but it makes you-
[00:03:46] Simon Maple: That's good, isn't it? When you know it's hard, you better go, "Oh, this is what I want." You have to actually spend time to think about-
[00:03:54] Guy Podjarny: Exactly.
[00:03:54] Simon Maple: What is the answer.
[00:03:54] Guy Podjarny: Yeah. And even more so when you have a team that have opinions and preferences.
[00:03:59] Guy Podjarny: Not everybody agrees on things. And so you have to have some of those conversations and write that down. So you have to define the correct behaviour. And there are many levels of that. There might be the correct behaviour in a specific product, in the specific screen that you're modifying. And there might be an overall company-
[00:04:18] Guy Podjarny: Practices or an ecosystem's best practices. But you have to define that. You have to capture that. It's totally okay to use agents to use LLMs to help you write this down and then refine it. But these are the documents or the definitions that are important for you to review and to ensure that they are correct.
[00:04:35] Guy Podjarny: Call them specs, call them docs, call them whatever it is that you want, but you have to define those. Once you've done those, you need to evaluate how well they work. I think it's easy to understand that if you wrote a 20-page document for the LLM that was repetitive and full of analogies and things like that, it would be harder for the LLM to understand what you meant than if you gave it a very concise set of bullets.
[00:05:02] Guy Podjarny: Again, very similar to humans. So those are relatively easy to imagine how one might be better than the other, but it really is a lot more elaborate than that. And you need to consider different models and understand different formats of communication differently; different types of instructions merit, for instance, code examples versus others that merit more looseness.
[00:05:27] Guy Podjarny: There are many, many different variations. And so for that, and we've spoken about this at length, including in our last conversation on this podcast, you have to build a competency to evaluate. I find the best analogy here is to think about monitoring runtime systems. The closest analogy to having non-deterministic systems is servers as they run.
[00:05:48] Guy Podjarny: And we understand that you have to instrument a system; you have to observe them. DevOps has taught us that. And so similarly for agentic development, you have to be able to assess and evaluate how well something works so that you can monitor it, you can try it, and you can see how often it works.
[00:06:07] Guy Podjarny: The last thing was a bad analogy, because that was for the evaluation, not the observation. Let me preserve that, too; I want to use that analogy for the observation.
[00:06:20] Simon Maple: Yeah. Oh, we were talking about observing, though, right?
[00:06:23] Guy Podjarny: No, I was
[00:06:24] Simon Maple: Or did you miscommunicate?
[00:06:25] Guy Podjarny: I was in the-
[00:06:25] Simon Maple: I did not miscommunicate. Did you?
[00:06:27] Simon Maple: Okay.
[00:06:27] Guy Podjarny: Yeah. Let me do all of this section a little bit faster because I do wanna get to it, and I'm gonna repeat that a little bit with a cycle. So the first thing you need to do is you need to write down what it is that you want the agent to do, right? Consider you have a team; people have different opinions.
[00:06:41] Guy Podjarny: Consider that there's a lot of institutional knowledge. You have to write those down. You should absolutely use agents to help you write that content and refine it. But you need to review it and understand what is written. Again, think about a team and think about aligning between the team.
[00:06:57] Guy Podjarny: At some point, you need to talk through the possibilities and choose which ones you want. The second bit is you have to evaluate how well the agents listen. Agents vary in the type of content, and the length and density of it, that they need in order to apply changes. It's really the models that vary, not the agents.
[00:07:18] Guy Podjarny: But even so, some are tuned to use more tools. Some are smaller models that need more explicit instruction. Some are very big models that lean into intelligence. So we've discussed many of these things on this podcast, and you really have to, after you've defined it, optimise the way you say things.
[00:07:36] Guy Podjarny: Once again, human analogies work very well. You can communicate some things better or worse, and it depends on your audience. The third is you have to then kind of communicate it out or broadcast this to the broader set. Understand that. Make sure that the agents that need to hear your words hear those words.
[00:07:53] Guy Podjarny: Then lastly, you want to observe what happens in the real world. If evaluations are more like your tests before you roll something out, more like the focus group that you've tested your message with, then the observation is actually polling to see what happened after.
[00:08:10] Guy Podjarny: And just like we've learned from DevOps, in the non-deterministic systems that are our runtime servers, you can't just anticipate whether they will go down or not; you have to monitor them and respond. Similarly, agents are non-deterministic creatures, and you have to assess: okay, are they actually working the way I expect them to learn from that and come back?
[00:08:31] Guy Podjarny: So these are notions of defining and capturing, evaluating, communicating, or distributing or broadcasting your context and then observing what happened. And rinse and repeat.
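The four-step loop Guy describes (define and capture, evaluate, distribute or broadcast, observe) can be sketched in code. This is an illustrative sketch only; `ContextAsset`, `lifecycle_step`, and the 0.8 quality bar are hypothetical names and values, not any real product API.

```python
from dataclasses import dataclass


@dataclass
class ContextAsset:
    """One unit of context (a rule, skill, or doc) and its latest eval score."""
    name: str
    body: str
    pass_rate: float = 0.0


def lifecycle_step(asset, evaluate, distribute, observe):
    """One turn of the loop. Step 1 (define/capture) happened when
    asset.body was written and reviewed by humans."""
    asset.pass_rate = evaluate(asset)   # step 2: evaluate how well agents listen
    if asset.pass_rate >= 0.8:          # hypothetical quality bar
        distribute(asset)               # step 3: broadcast to the agents that need it
    observe(asset)                      # step 4: watch real-world behaviour, feed back
    return asset
```

The point of the sketch is the ordering: context is only distributed once it clears an evaluation, and observation always runs so failures feed the next iteration.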
[00:08:43] Simon Maple: And this isn't something that you quickly run back-to-back and then kind of learn from straightaway. This happens in our existing workflow at different times, right?
[00:08:51] Simon Maple: There are different loops that can kind of help you with this feedback. Where would you say each of those fits in with today's development workflow?
[00:08:59] Guy Podjarny: There definitely are various places in which you'd apply these different steps. I think, especially for a developer audience, the most useful thing to imagine is the DevOps loop.
[00:09:08] Guy Podjarny: We're all familiar with that sort of infinite loop. We always keep going back to that loop, right? You might say it's infinite.
[00:09:16] Simon Maple: It is infinite
[00:09:17] Guy Podjarny: Returning to it. And so clearly the DevOps loop has more dev on the left side and more ops on the right side.
[00:09:26] Guy Podjarny: But really you can start at any point depending on what your current situation is. So you can think that on the more dev side of it, the left side, you can think about analysing your current situation. So what do you already know about your code base, about what has happened, and about people's desires?
[00:09:44] Guy Podjarny: So that's a little bit of that defined capture. So you analyse that, then you generate the right documentation, and then you evaluate that, and you can continue with that loop as much as you can. You know as much as you feel is needed. So you've evaluated, you've seen whether it's good, and maybe you've learned something.
[00:09:59] Guy Podjarny: So you come back, and you analyse those results you generate again. So you repeat that as much as you think is right. And then once you're ready, once your context is good, then you go on to more of the op side, which is you distribute similarly to deploying, and you leverage. So you actually kind of execute; you run the context. Nontrivial effort over there, right?
[00:10:18] Guy Podjarny: You have to think about activation. There's a lot of complexity over there. Then you observe, coming back to the running system. And so this type of infinite loop continues again and again; you observe, and you now have more information. You might modify your synthetic tests to represent the real world again, and that might imply some problems.
[00:10:38] Guy Podjarny: You identify, therefore you regenerate the context, and so on and so forth. And in that infinite loop, I think there are the classic needs and sort of tools that you need that are similar to those that we have in development. There are some build-time, interactive, sort of development-time systems.
[00:10:55] Guy Podjarny: There are some tests, like the evaluations, that we need to put at certain points in time. You might not have a comprehensive test, but they need to be sufficiently representative of reality so you don't regress with the different changes. And then in the runtime you need things that are more scalable.
[00:11:11] Guy Podjarny: You might sample behaviour versus tackle all of it. You need statistically successful behaviour, which might not need perfection. So I think you need that type of looping. And this is, I guess, what we think of as the context development lifecycle as it evolves, one that complements the SDLC.
[00:11:34] Simon Maple: Absolutely. And it is funny how we keep going back to that figure of eight. But I think what is most interesting is we use the right tools at the right time. And so, for example, evaluate, as you were mentioning, the evals that we use there; they are fast because we are able to do it with the data that we have at that right time.
[00:11:54] Simon Maple: They are not necessarily going to be the most precise given all the data that we have. But it is that feedback that then adds back into that eval to actually provide us with the data that we know that actually this is a correct eval, or actually we need to update this eval in this way. And so that kind of gets us closer to that perfection.
[00:12:11] Simon Maple: Yeah.
[00:12:12] Guy Podjarny: Correct. And we are familiar with that from tests, right? We have unit tests, we have integration tests, and we have end-to-end tests, and they are increasingly expensive and harder to run. And so we use smaller-scope checks first. So maybe we evaluate our pieces of context frequently, every time we change some policy instruction or some documentation about the library.
[00:12:34] Guy Podjarny: You might want to evaluate that every time. But other evaluations, like evaluating a repo's context as a whole, so to say: there are a lot of bits of context in this repository. Can agents handle that? I want to run an evaluation for that. I want to draw conclusions about how well it works and what I can do to improve it.
[00:12:56] Guy Podjarny: So those might be more expensive. You might run them more like end-to-end tests every now and then, but not regularly. So I do not know if the one-to-one analogies of unit tests, integration tests, and end-to-end tests will work. But I think there will be a version of kind of lightweight evaluations and local evaluations that expand into broader system ones.
[00:13:16] Simon Maple: Yeah, absolutely. It gets us closer to where we need to be to tune.
[00:13:20] Guy Podjarny: Exactly. I think the very first guest we had on the podcast was Des from Intercom. And he talked about how at the time, even for the agents that they had built, and this was before coding agents were really a thing, they had their regression test evaluations for their agent to know if it was doing the right thing.
[00:13:34] Guy Podjarny: Then they had their torture test, the comprehensive one, and so they did not run the torture test for every change to their prompts.
[00:13:54] Guy Podjarny: They ran it when a new model came along. And I think as we use agents in our own unique development environments, we similarly need to develop our regression evals and our torture evals for agents as they enter those environments.
[00:14:01] Simon Maple: Yeah. Still the best test name, by the way, that I have heard: the torture test.
[00:14:04] Guy Podjarny: Yeah. Torture. Much better than end-to-end.
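The split Guy describes, cheap regression evals on every context change and an expensive "torture" suite only when the model changes, can be sketched as a gating function. All names here are hypothetical; the suites are just dicts of named check callables.

```python
def run_evals(context_changed: bool, model_changed: bool,
              regression_suite: dict, torture_suite: dict) -> dict:
    """Run the cheap regression suite on every context or model change;
    run the expensive 'torture' suite only when the model changes."""
    results = {}
    if context_changed or model_changed:
        results.update({name: check() for name, check in regression_suite.items()})
    if model_changed:
        results.update({name: check() for name, check in torture_suite.items()})
    return results
```

This mirrors how Intercom reportedly gated their comprehensive suite: not on every prompt tweak, but whenever a new model came along.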
[00:14:06] Simon Maple: Yeah, it is true. Very true. Let us get deeper into context, and maybe we can talk about use cases of context and what people want to use context for.
[00:14:13] Simon Maple: What would you say?
[00:14:14] Guy Podjarny: Yeah, I think that is probably the part that has evolved the most. And we had to see it in kind of the real world. I would say today I see three types of context. I know I sort of keep repeating things in threes, and maybe it is too ingrained into the corporate world. But I feel there are these three types of context that we see development teams and organisations roll out.
[00:14:33] Guy Podjarny: One is more policy- or best-practice-related context. So this might be a security policy or disseminating a choice of what is good design over here, or sometimes constraints around finances, like how much to optimise for budget versus for speed or something like that.
[00:14:58] Guy Podjarny: So those are oftentimes wrapped in skills or in some other document. They are not tied to a specific piece of code-
[00:15:05] Simon Maple: And fairly reusable across an organisation.
[00:15:07] Guy Podjarny: And they should be. Actually, what we see in enterprises is that they are hierarchical. You might have, just like with any policy, just like you would communicate to humans, something that is company-wide, and then there might be a slight override within a business unit, and maybe an even more specific override in a specific application team.
[00:15:25] Guy Podjarny: And so they might augment and inherit from one another, a nod to the Java fans in our audience. And so you have those. In those cases you want to create the words. Write it down; that is the core of it. You want to evaluate how well agents would listen.
[00:15:42] Guy Podjarny: And in this case, it is very important to define what good looks like. Because you might say, here is a policy: make sure my code is secure. Well, that is a very broad definition. Be specific; get a little bit more detailed.
[00:15:59] Guy Podjarny: And oftentimes the evaluations are the ones that indicate what you mean by the words that you say. And so you want to evaluate it, and then you can optimise to that evaluation. If you did not invest in the evaluations, you are going to optimise for the wrong thing. So be mindful. Once you have that, you want to distribute that and make sure the right agents are getting the right information and that they update that over time. And of course, observe behaviour.
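The hierarchical policy layering described above (company-wide defaults, business-unit overrides, team-specific additions) can be sketched as a simple last-wins merge. This is a minimal illustration; the keys and values are invented examples, not a real policy schema.

```python
def effective_policy(*layers: dict) -> dict:
    """Merge policy layers from broadest to most specific: a later (more
    specific) layer overrides keys set by an earlier one and inherits the rest."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged


company = {"require_license_check": True, "cloud": "aws"}
business_unit = {"cloud": "gcp"}            # overrides the company default
team = {"max_context_tokens": 8000}         # adds a team-specific rule
policy = effective_policy(company, business_unit, team)
```

The team's effective policy keeps the company-wide licence rule, takes the business unit's cloud choice, and adds its own token budget.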
[00:16:18] Guy Podjarny: So that is kind of one strand, right? That is the policy/practice path. The second one that we see is documenting your internal platform. That is maybe the most common, which is why I have libraries of my own in my organisation. I have my billing system, and I have my technical cloud infrastructure that I am using, which all my applications need to be deployed on.
[00:16:41] Guy Podjarny: There is no reason for the agent to know any of this stuff. It should not. Unless it somehow leaked into the weights in some of the first waves of the LLMs, the model does not know about it. It is not in its weights.
[00:16:55] Guy Podjarny: And so you have to inform the agent about that. Yes, technically the agent can sometimes go out and try and find it amidst the code base and extract that out. That is error-prone and expensive. Very inefficient, because you're gonna need this information again and again. And if it gets something wrong and you don't notice, well, how do you find out?
[00:17:14] Guy Podjarny: Because oftentimes the consumer of the platform is not the one that actually has built the platform. And so this notion of a central rollout of the knowledge of your ecosystem, of your technology, is very, very common. Once again, you want to generate the documentation. In this case, you typically have a source of truth, which is-
[00:17:32] Guy Podjarny: The actual code of that platform, examples of usage of it. So that generation can oftentimes be quite automated. You want to evaluate it to know that modifications to it do not regress. You want to roll that out and make it available to anybody consuming the platform, and then you want to observe that behaviour and evolve it.
[00:17:48] Guy Podjarny: Importantly, for this one, also, you have to maintain it because your library, your platform, will change over time. You now need regular processes to update that. Then the last one, I would say, is more application context or in repo context. So this is the case where increasingly people understand that with agentic development you have to be somewhat disciplined around capturing the definition of what your app does.
[00:18:15] Guy Podjarny: You know, what is the functionality here? Because otherwise, if you ask the agent to make a change and it goes sideways, you don't know where it went wrong. There was never a definition of what right is. So you have to have some definition of what "correct" is. You have to have some documentation, and those get captured.
[00:18:34] Guy Podjarny: But again, just like software rots, context will rot. So over time, if you created those docs but you're not methodical about knowing that they are good, that they haven't regressed, that they have stayed up to date as things changed, they will drift. So what we see over there is actually a flow that starts with evaluation. So you have your first question: can agents handle my code base?
[00:18:55] Guy Podjarny: Maybe if it is small enough and it is written very well, they can handle the code base fine. You don't need any additional context. But as the systems grow, they would need more support to be able to effectively-
[00:19:08] Simon Maple: The challenges that you kind of outlined at the very start amplify as that code base grows?
[00:19:14] Simon Maple: And the team grows.
[00:19:15] Guy Podjarny: Precisely: the team grows, the code base grows, and so do the complexity and subtlety of some decisions. And so I would say this flow starts with an evaluation. It starts with taking the repository, going back in history, and extracting out a set of representative commits or pull requests that show typical work in this repository.
[00:19:31] Guy Podjarny: And turn those into evaluation scenarios. And that is, by the way, already very interesting to see, well, how well can agents develop here? And out of those you can say, okay, seeing the failures that they have had, what type of context changes might I need over here?
[00:19:50] Guy Podjarny: What can I add? What can I remove? You will find that there are some documents that you have that are just not necessary. The agent totally gets it and does not need it. So you're just wasting context space. And then you would find there are other cases in which the agent is failing. It does not understand some type systems.
[00:20:03] Guy Podjarny: It keeps erring on some specific piece of the code. And so for those, you can create context. Once you have that evaluation, you do that optimisation. You come back to make sure that context is available, observe real-world scenarios, and keep that context fresh as it moves.
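The flow Guy describes, mining repository history for representative commits and turning them into evaluation scenarios, could be bootstrapped with something like the sketch below. The scenario shape (`base`, `task`, `reference`) and both function names are assumptions for illustration, not a real tool's format.

```python
import subprocess


def parse_log(log_text: str) -> list:
    """Turn `git log --pretty=format:%H%x09%s` output into eval scenarios:
    the parent commit is the starting state, the subject line is the task
    given to the agent, and the commit itself is the reference solution."""
    scenarios = []
    for line in log_text.splitlines():
        sha, _, subject = line.partition("\t")
        if sha:
            scenarios.append({"base": sha + "^", "task": subject, "reference": sha})
    return scenarios


def recent_commits(repo_path: str, n: int = 50) -> list:
    """Pull the last n commits from a repository as candidate scenarios."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"-{n}", "--pretty=format:%H%x09%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_log(out)
```

In practice you would filter these candidates down to commits that represent typical work, then replay each one from its `base` state and compare the agent's output to the `reference` diff.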
[00:20:19] Simon Maple: And that is the important thing.
[00:20:20] Simon Maple: Keep it fresh because this is a point in time of how this project looks today, in six months' time, and in one year's time. It is going to be different, and you need to keep that up to date. Yes. And this is an evolving process. Continuous process that needs to continually update.
[00:20:34] Guy Podjarny: Yeah. Yeah. And I would say the weight on each of these different steps changes from team to team, from application to application.
[00:20:41] Guy Podjarny: There might be some cases in which observing, for instance, is sufficient for you. Your team is very nimble, very fast moving. The application is sufficiently small that you feel like it is okay for you to observe and identify context failures. Create some new context and roll it out without evaluating, without tests.
[00:20:59] Guy Podjarny: Right. And you might be fine with that for a while because evaluations or creating tests is an effort. But I think, as we have seen with tests, you are going to regret that decision at some point as the system grows, because it means every time you make a context change, you have no way of knowing whether you have just introduced a problem.
[00:21:20] Guy Podjarny: It means when a new platform comes along, a new model, or if you want to run it on something cheaper, you have no means of knowing whether that system would work. The only way that you have to do that is to roll it out and hope for the best and see how people respond or even observe the logs.
[00:21:38] Guy Podjarny: And so I think different emphasis on do you want to lead with evaluation, lead with observation, lead with generation, or lead with optimisation, all of these things, they are choices, and they are okay, and they will vary from time to time, but I think they all repeat in various orders in these three use cases of kind of policy dissemination or sort of practice dissemination, platform documentation, and kind of application context.
[00:22:03] Guy Podjarny: So we talk about this a lot. So alongside talking about it, we actually have been building products around this for quite a few months now and working with partners and building it out. And we have a platform. We, from a business perspective, think of it as an agent enablement platform.
[00:22:21] Guy Podjarny: So it is a platform that helps agents be enabled and successful. It onboards agents onto your environment with all those policies and practices. And then it helps continuously educate them; continuously improve the context. And so we have that on a technical level. It is more of a context development and distribution platform.
[00:22:40] Guy Podjarny: So we give you that sort of CDLC, a dubious choice of acronym for it.
[00:22:46] Simon Maple: Just drop a new acronym in there.
[00:22:47] Guy Podjarny: Yeah. The context development lifecycle, I am not sure we are going to use that. But we enable this context development lifecycle. We help you generate context, evaluate that context, and extract evaluation scenarios from existing knowledge.
[00:23:00] Guy Podjarny: We help you, of course, run them and observe. So we do all of that sequence so you can develop and own your context, and then we, of course, help you distribute. You have sort of seen that we are the package manager for skills. We are, and we support and enable skills within organisations as well, but also externally.
[00:23:17] Guy Podjarny: And so we were very committed to making these types of evaluation services and context distribution available for the open-source context as well. And so we help you do both of those. And in general, coming back maybe to the root of it, we think there is a new development paradigm in agentic development.
[00:23:37] Guy Podjarny: It is substantially different than what we have seen in software. And it does not actually replace the SDLC. It integrates into the SDLC in various places. And we think that context is needed in many places. For instance, we see the same skill or the same tile, which is our kind of package of context, get consumed while someone is locally developing and trying to get the agent to succeed.
[00:24:05] Guy Podjarny: We see that very same tile be used in code review when you are deploying it or maybe in incident review when it tries to understand what has happened and it needs some information about how the system operates. The same context is applied across different agents and across different models.
[00:24:22] Guy Podjarny: And so we think context is a separate asset that you want to develop alongside your SDLC. Right now it is very critical, and over the years it will become closer to a build system: a very critical system within your platform, one that you should invest in, and one that is mostly run by bots and used by bots. Humans set it up and configure it, but most of the activity within it is done by automated workflows.
[00:24:57] Guy Podjarny: So we are excited to be the sort of new agent enablement platform, maybe. CDLC platform? We will sort of see, and you will see us use the word "skills" in our communications a lot. We think context is broader than skills, but "skills" is a very helpful term right now as people think about a unit of context that they move around.
[00:25:18] Guy Podjarny: And so we embrace that. We think terminology will probably shift three or four more times as the world moves around. And that is not our focus.
[00:25:26] Simon Maple: Maybe in and out of CDLC, maybe.
[00:25:30] Guy Podjarny: Yeah, maybe CDLC is premature, because who is to say context is gonna be the word there. We started our journey with specs.
[00:25:35] Guy Podjarny: And it is the same thing. It does not really matter what it is called. And so we are emphasising the term "skills," although skills are really just a subset of the context that we support. But really we are committed to this notion that whether it is skills or it is context or it is tiles or it is specs, we help you create it.
[00:25:54] Guy Podjarny: We help you own it. We help you develop it over time, and we think that will become the core competency for a software development organisation.
[00:26:01] Simon Maple: Awesome. And if you are interested to learn more, why don't you head over to tessl.io, where you can actually do a lot of self-service? You can go ahead and discover.
[00:26:09] Simon Maple: You can use, download, and use. You can even publish your own context, skills, and tiles indeed. And if you wanted to learn more, maybe you have a more complex, larger environment that you are trying to use context to enable. Why don't you reach out at contact at tessl.io. But for now, thank you very much, Guy. That was really enlightening about how the space has evolved and changed and how Tess is positioned in that.
[00:26:32] Guy Podjarny: Thanks, Simon. And you know, again, a good opportunity to say thanks to the amazing team that we have that has been building all these things. I am just here sharing a bunch of their wisdom and hard work. And thank you to the amazing users and early customers that have helped to shape a bunch of these practices and truly are shaping the future of software development.
[00:26:49] Guy Podjarny: So, looking forward to hearing from more of you amazing people and organisations as you reach out.
[00:26:55] Simon Maple: Thanks very much, and tune into the next episode.
In this episode
Most teams think agentic dev is about writing better prompts. It's not.
Guy Podjarny and Simon Maple explain why managing context, not crafting prompts, is what separates teams that scale with agents from teams that don't. They walk through a practical framework for building, evaluating, and distributing the context your agents actually need.
In this episode:
- Why agents fail without structured context about your internal platform
- The 3 context layers: policies, platform docs, and application context
- How to build regression evals and torture tests for your agents
- The Context Development Lifecycle (CDLC) - a new loop for agentic dev
Your agents are only as good as the context you give them.
Context Engineering Is the Core Competency of Agentic Development
Agentic development is not just about coding faster with AI. It represents a fundamental shift in how software gets built, and the teams succeeding with it are those who have recognised that context engineering sits at the centre of everything. In a recent episode of the AI Native Dev podcast, Guy Podjarny and Simon Maple explored what this means in practice, mapping out the workflows, evaluation strategies, and types of context that development organisations need to master.
The conversation surfaced a framework that appears increasingly relevant as teams scale their use of coding agents: a context development lifecycle that mirrors the DevOps loop developers already know.
Why Context Management Defines Agentic Success
LLMs are stateless machines. They receive a bundle of context, apply their fixed weights to it, and predict the next tokens. This architectural reality means that managing what information goes in, and how it is structured, becomes the primary lever for influencing agent behaviour. As Guy explained during the conversation, "The core competency in agentic development is context management."
The analogy to human teams proves useful here. When managing a team of developers, the tools available are fundamentally about communication: how information is conveyed, what incentives shape behaviour, and how alignment is achieved across different perspectives. Context serves the same function for agents. Rules push explicit constraints. Skills hint at capabilities the agent can pull down when needed. Docs provide reference material for just-in-time retrieval. Each serves a distinct purpose in the overall architecture of AI agent context management.
This framing suggests that developers working with agents need to think less about prompting and more about designing information systems. The question shifts from "How do I ask this better?" to "What information architecture does this agent need to succeed?"
The Context Development Lifecycle
The conversation introduced a workflow that maps naturally onto the DevOps infinity loop. On the development side, teams analyse their current situation, generate documentation and specifications, then evaluate how well agents respond to that context. This evaluation step proves critical, as different models interpret the same instructions differently. Some respond better to concise bullet points. Others handle longer, more nuanced documentation. The only way to know is to test systematically.
"You have to build a competency to evaluate," Guy noted. "The best analogy here is to think about monitoring runtime systems. The closest analogy to having non-deterministic systems is servers as they run. And we understand that you have to instrument a system; you have to observe them."
On the operations side, context gets distributed to agents and then observed in real-world usage. This creates a feedback loop: observations inform updates to context, which get evaluated, refined, and redistributed. The parallels to DevOps practices are intentional. Just as runtime systems require monitoring because their behaviour cannot be fully predicted, agents require observation because their responses are non-deterministic.
The evaluation strategy itself appears to follow a familiar pattern. Lightweight, fast evaluations run frequently, catching obvious regressions whenever context changes. More comprehensive "torture tests," as one early AI Native Dev podcast guest termed them, run less often but cover edge cases that matter. Intercom's Des Traynor described using this tiered approach for their support agents, running regression tests on prompt changes but reserving comprehensive evaluations for model upgrades.
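The tiered approach described above can be sketched as a tiny harness. Everything here is illustrative: `run_agent` stands in for whatever invokes your coding agent, `fake_agent` is a stub, and the two checks are placeholders for real assertions about agent output.

```python
# Hypothetical tiered evaluation harness: lightweight "smoke" evals run on
# every context change; heavier "torture" evals run less often (e.g. on
# model upgrades). Names and structure are assumptions, not a real API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Eval:
    name: str
    check: Callable[[str], bool]  # inspects the agent's output
    tier: str                     # "smoke" or "torture"

def run_evals(evals: List[Eval], run_agent: Callable[[str], str],
              prompt: str, tier: str) -> Dict[str, bool]:
    """Run every eval of the requested tier against one agent response."""
    output = run_agent(prompt)
    return {e.name: e.check(output) for e in evals if e.tier == tier}

# Stub agent and two sample evals, purely for demonstration.
def fake_agent(prompt: str) -> str:
    return "def add(a, b):\n    return a + b"

evals = [
    Eval("produces_a_function", lambda out: "def " in out, "smoke"),
    Eval("handles_edge_cases", lambda out: "raise" in out, "torture"),
]

print(run_evals(evals, fake_agent, "write add()", "smoke"))
# {'produces_a_function': True}
```

In practice the smoke tier would run in CI whenever a context file changes, while the torture tier covers the expensive edge-case scenarios.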
Three Types of Context Every Organisation Needs
The podcast surfaced three distinct categories of context that development organisations are deploying, each with its own workflow and maintenance requirements.
Policy and best practice context captures organisational decisions: security requirements, architectural constraints, budget optimisation preferences. These tend to be hierarchical, with company-wide policies that business units or teams can augment or override. The challenge here is evaluation. Telling an agent to "write secure code" is too broad. Teams need specific evaluations that define what "secure" means in their environment, then optimise context to meet those standards.
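To make the point concrete: "write secure code" only becomes evaluable once it is decomposed into specific checks. The sketch below shows one hypothetical concretisation, flagging SQL built by string interpolation; a real policy suite would contain many such checks, likely backed by proper static analysis rather than regexes.

```python
# One concrete check behind a vague "secure code" policy: generated code
# must not build SQL queries via string interpolation. The regexes are a
# deliberately crude illustration, not a production-grade detector.
import re

def uses_sql_string_interpolation(code: str) -> bool:
    """Flag f-strings or %-formatting that embed values into SQL text."""
    patterns = [
        r'f["\'].*SELECT.*\{',            # f"SELECT ... {user_input}"
        r'["\'].*SELECT.*%s.*["\']\s*%',  # "SELECT ... %s" % value
    ]
    return any(re.search(p, code, re.IGNORECASE) for p in patterns)

bad = 'cursor.execute(f"SELECT * FROM users WHERE id = {uid}")'
good = 'cursor.execute("SELECT * FROM users WHERE id = %s", (uid,))'
print(uses_sql_string_interpolation(bad))   # True
print(uses_sql_string_interpolation(good))  # False
```

Each check like this becomes an assertion the context must be optimised against: if agents keep producing the "bad" pattern, the policy context is not working.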
Platform documentation addresses the knowledge gap around internal systems. Agents have no inherent knowledge of internal billing systems, custom cloud infrastructure, or proprietary libraries. While agents can theoretically explore codebases to discover this information, that approach proves error-prone and expensive. Centralised, maintained documentation for internal platforms gives agents reliable knowledge they will need repeatedly. The key requirement is maintenance: platforms evolve, and context must evolve with them.
Application context captures the definition of what a specific codebase does and how it should behave. Without this, agents making changes have no reference for what "correct" looks like. Guy suggested starting with evaluation here: extract representative commits from repository history, turn them into test scenarios, and measure how well agents can replicate that work. The failures reveal what context is missing. The successes reveal what context might be unnecessary.
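The commit-replay idea above can be sketched minimally. This is an assumption-laden illustration: in reality the `(message, diff)` pairs would come from something like `git log --format=%s -p`, and scoring would use semantic comparison rather than raw line overlap.

```python
# Sketch of turning repository history into agent eval scenarios: each
# historical commit message becomes a task prompt, and the real diff
# becomes the reference answer to score the agent against.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Scenario:
    prompt: str    # the commit message, framed as a task
    expected: str  # the actual diff, used as the reference

def scenarios_from_commits(commits: List[Tuple[str, str]]) -> List[Scenario]:
    """commits: (message, diff) pairs extracted from repository history."""
    return [Scenario(prompt=f"Implement: {msg}", expected=diff)
            for msg, diff in commits]

def score(agent_diff: str, scenario: Scenario) -> float:
    """Crude overlap score: fraction of expected lines the agent produced.
    Real evals would use semantic checks, not line-level overlap."""
    expected = set(scenario.expected.splitlines())
    produced = set(agent_diff.splitlines())
    return len(expected & produced) / max(len(expected), 1)

commits = [("add retry to billing client",
            "+    for attempt in range(3):\n+        try:")]
scens = scenarios_from_commits(commits)
print(score("+    for attempt in range(3):", scens[0]))  # 0.5
```

Low scores point at missing context (the agent could not reproduce the work); consistently high scores on some scenario class suggest that context may be redundant there.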
Toward a Context-First Development Practice
The framework presented suggests that context will become a first-class asset in software organisations. It integrates with the SDLC at multiple points: local development, code review, incident analysis. The same unit of context might inform an agent helping with implementation, another reviewing the pull request, and a third investigating a production issue.
This has implications for how teams invest their time. Context, like software, can rot. Documents that once reflected how a system works may drift from reality. Evaluations that once covered typical scenarios may miss new patterns. The discipline required appears similar to test maintenance: ongoing effort to keep assertions aligned with evolving systems.
For developers getting started, the practical path seems clear. Begin by evaluating how well agents handle your existing codebases. Identify the failure patterns. Create context that addresses those failures. Build evaluations that let you know whether changes help or hurt. Observe what happens when agents use that context in real work. Iterate.
The full conversation offers additional depth on each of these themes. Worth a listen for teams working to make agentic development reliable at scale.
