

Yaniv Aknin
Founding engineer, Tessl

Building an AI Agent in 100 Lines of Code

with Yaniv Aknin


Chapters

Introduction [00:01:01]
Research on AI Agents and Context [00:04:06]
Deep Dive into System Prompts and Tools [00:05:38]
Defensive Tasks and Authorization [00:21:41]
Tool Usage in Language Models [00:24:26]
Planning and Agentic Harness [00:32:43]

In this episode

In this episode of AI Native Dev, host Simon Maple and guest Yaniv Aknin explore the balance between built-in system contexts and developer-added instructions in coding agents. Yaniv demonstrates how a simple 100-line "nano agent" can effectively generate code, highlighting the importance of minimal system prompts and well-chosen tools. The discussion sheds light on how developers can optimise agent performance by designing complementary contexts and leveraging benchmarks alongside real-world scenarios.

In this episode of AI Native Dev, host Simon Maple welcomes Yaniv Aknin, a software engineer at Tessl, to unpack a deceptively simple question with big implications: how much of an agent’s power comes from its built-in system context and tools, and how much comes from the context we add as developers? Yaniv walks through hands-on research—from a 100-line “nano agent” to observations about flagship agents like Claude Code and Gemini—to show how system prompts, tool descriptions, and evaluation benchmarks shape agent performance.

From Zero to Agent: The Nano-Agent Baseline

Yaniv starts with a live, minimal example: a working coding agent implemented in under 100 lines of Python. Built with Simon Willison’s LLM Python library, the agent uses a tiny system prompt (about 250 bytes) that essentially says “you’re a coding agent—do well,” and exposes a few tools: execute (shell commands), read_file, and write_file. The core is a simple run loop that submits the conversation (system prompt + user goal) to an LLM, processes tool calls returned by the model, executes them, appends results, and repeats until the model says it’s done or the step budget is reached.
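To make that loop concrete, here is a minimal sketch of the structure Yaniv describes. It uses the raw Anthropic SDK rather than the LLM Python library from the episode (whose tool-calling calls are not quoted in the conversation), so the exact client calls, schemas, and model alias are assumptions; the tiny system prompt, the execute/read_file/write_file tools, and the step budget mirror the description above.

```python
# A hedged sketch of the nano-agent run loop, written against the Anthropic SDK.
# Intended to run inside a container, as in the episode; the schemas and model
# alias are illustrative assumptions rather than the episode's actual code.
import subprocess

import anthropic

SYSTEM = "You are a coding agent. Use your tools to accomplish the user's goal, then stop."

TOOLS = [
    {"name": "execute",
     "description": "Run a shell command in the project root.",
     "input_schema": {"type": "object",
                      "properties": {"command": {"type": "string"}},
                      "required": ["command"]}},
    {"name": "read_file",
     "description": "Read a text file and return its contents.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "write_file",
     "description": "Write text to a file, creating it if needed.",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"},
                                     "contents": {"type": "string"}},
                      "required": ["path", "contents"]}},
]

def run_tool(name: str, args: dict) -> str:
    """Execute one tool call and return its result as text for the model."""
    if name == "execute":
        proc = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return proc.stdout + proc.stderr
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    if name == "write_file":
        with open(args["path"], "w") as f:
            f.write(args["contents"])
        return "ok"
    return f"unknown tool: {name}"

def run_agent(goal: str, max_steps: int = 20) -> None:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):  # the step budget
        response = client.messages.create(
            model="claude-sonnet-4-5",  # assumed alias; pin whichever model you actually run
            max_tokens=4096,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            break  # no more tool calls: the model considers the goal done
        tool_results = [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": run_tool(block.name, block.input)}
            for block in response.content if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})

if __name__ == "__main__":
    run_agent("Build a minimal to-do web app with a test, then run the test.")
```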

Despite its simplicity, this nano agent reliably creates usable artifacts (like a functional to-do app) with a modern model (Yaniv ran it with Claude Sonnet 4.5). Running the agent inside a container gives it a predictable environment and a safe place to execute commands. The code structure is intentionally lean: a short system prompt, a minimal tool surface, and a straightforward loop. In practice, that’s enough to bootstrap code generation, execute tests, and iterate toward a goal in a few tool-invocation turns.

For developers, the lesson is profound: you don’t need a complex framework to get started. A baseline agent can be a simple message loop with 2–3 well-chosen tools. Start there to understand how your model reasons about goals and tools before layering on fancy orchestration, multi-agent handoffs, or retrieval pipelines. Keep the system prompt minimal so your added task context gets more attention in the token budget.

What Benchmarks Reveal—and Miss

Yaniv anchors the discussion in TerminalBench, a respected agent evaluation used internally at Tessl. A mini agent from the SWE-bench team—Mini SWE Agent—places 15th on TerminalBench with a similarly small codebase (roughly 100 lines for the loop and a small amount more for utilities). It’s not top of the leaderboard, but it’s in a very credible cohort, often close to much heavier-weight agents.

This suggests two things. First, baseline agents with little context and a few tools can be surprisingly capable on structured coding tasks. Second, benchmark success doesn’t fully capture what richer system prompts, bespoke tools, and domain-specific context buy you in real-world scenarios. Benchmarks are essential—but they’re an imperfect proxy for production work where repos are messy, dependencies are unclear, and the “definition of done” is nuanced.

Developers should use benchmarks for regression checks and model comparisons, then validate on in-the-wild tasks. A good approach is dual-track evaluation: keep a TerminalBench (or similar) run for quantitative signal and pair it with a realistic scenario suite (e.g., “add feature X to repo Y with tests Z”) that exercises your actual workflow, dependencies, and runtime. Instrument tool calls and outcomes so you can see where the model stalls, loops, or misuses tools.
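One way to set up the scenario half of that dual track is sketched below; the Scenario fields, repo paths, check commands, and the run_agent entry point are placeholders rather than Tessl's actual suite.

```python
# A hedged sketch of a realistic-scenario suite to pair with a TerminalBench run.
# Scenario names, repos, goals, and check commands are illustrative placeholders.
import subprocess
import time
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    repo: str   # path to a checked-out copy of the target repo
    goal: str   # the prompt handed to the agent
    check: str  # shell command that exits 0 when the task is genuinely done

SCENARIOS = [
    Scenario("add-endpoint", "repos/api-service",
             "Add a /healthz endpoint and a test for it.", "make test"),
    Scenario("fix-flaky-test", "repos/worker",
             "Make test_retry deterministic.", "pytest -q tests/test_retry.py"),
]

def run_suite(run_agent) -> None:
    """run_agent(goal, cwd=...) is whatever agent entry point you are evaluating;
    here it is assumed to return a dict of counters such as tool calls made."""
    for s in SCENARIOS:
        start = time.time()
        stats = run_agent(s.goal, cwd=s.repo)
        passed = subprocess.run(s.check, shell=True, cwd=s.repo).returncode == 0
        print(f"{s.name}: passed={passed} "
              f"tool_calls={stats.get('tool_calls', '?')} "
              f"seconds={time.time() - start:.1f}")
```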

Inside the Black Box: System Context and Tooling

The episode’s central investigation is: what system context do flagship coding agents include by default, and how does that interact with the context you add? Without claiming insider knowledge, Yaniv observes that agents like Claude Code or Gemini send sizeable system contexts up front: high-level behavior (“you are a helpful coding assistant”), persona and safety instructions, and detailed tool descriptions including schemas and usage guidelines. Those tool descriptions often specify when to use a tool, what the parameters mean, and constraints or safeguards.

Crucially, all of this context—system prompt, tool manifests, your custom instructions, and the user’s latest request—arrives as a single input to the LLM. The model doesn’t “know” which parts came from you versus the platform; it just predicts the next token conditioned on everything. Order, phrasing, and relative length matter. If the built-in system prompt is long and prescriptive, your domain-specific guidance might be diluted unless you keep it concise and structured.

For developers, this has two practical implications: your custom context must coexist with strong built-in priors, and your tool design must be easy for the LLM to reason about amid many other available tools. Think clearly named tools, concise one-line descriptions followed by an explicit “use when…” directive, and schemas that reduce ambiguity. The simpler and more discriminative your tool interface, the more consistently the model will call it.

Designing Context That Plays Nicely with Built-ins

Yaniv frames context in two layers: the system context (shipper-provided defaults inside the agent) and the task/domain context you add (repo details, objectives, run commands). Because everything merges at inference time, your goal is not to override the system but to complement it. Keep instructions narrowly scoped to the task: define the goal, constraints, and the environment’s “rules of the road” (e.g., “run tests with make test; code lives in src/; prefer FastAPI; follow PEP8”).

Tool descriptions benefit from being explicit about intent and side effects. For example: “execute: run a shell command in the project root. Use to install dependencies, run tests, or scaffold. Side effects: changes filesystem and environment.” Pair this with read_file and write_file tools that clarify default paths, allowed file sizes, and expected encodings. The model is better at planning when tools declare both capabilities and boundaries.
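In the tool-manifest form most chat APIs accept, a description along those lines might look like the sketch below; the wording and schema are illustrative, not quoted from the episode.

```python
# A sketch of an execute tool whose description states intent, "use when" guidance,
# and side effects; the exact wording is an illustrative assumption.
EXECUTE_TOOL = {
    "name": "execute",
    "description": (
        "Run a shell command in the project root. "
        "Use when you need to install dependencies, run tests, or scaffold files. "
        "Side effects: changes the filesystem and environment."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "The shell command to run."}
        },
        "required": ["command"],
    },
}
```

Applying the same pattern to read_file and write_file (default paths, size limits, encodings) keeps the whole manifest easy for the model to discriminate between.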

Also, constrain the step budget and encourage summarization. A short system prompt like “Operate in as few tool calls as possible. Write small, verifiable changes. After each tool result, summarise the new state” can reduce thrashing and make logs easier to inspect. Finally, avoid burying key details in long prose. Use concise bullet points and explicit labels (Goal, Constraints, Commands, Repo Layout) so the model can “pattern match” the structure and retrieve the right facts at the right time.
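Spelled out as a template, that structured task context might look like the sketch below; the commands, paths, and framework choices are placeholders for your own repo's rules of the road.

```python
# A hedged sketch of a structured task context with explicit labels.
# Everything under the labels is a placeholder for your own project's conventions.
TASK_CONTEXT = """\
Goal: add a /healthz endpoint that returns {"status": "ok"}, with a test.
Constraints:
- Operate in as few tool calls as possible; make small, verifiable changes.
- After each tool result, summarise the new state in one or two sentences.
Commands:
- Run tests with `make test`; lint with `make lint`.
Repo Layout:
- Application code lives in src/; tests live in tests/; prefer FastAPI and follow PEP8.
"""
```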

A Practical Evaluation Loop for Teams

Yaniv’s workflow at Tessl is pragmatic: start with a minimal, inspectable agent, then iteratively add context and observe deltas. Begin with the nano baseline (short system prompt, execute/read/write tools, containerised runtime) and run a small suite of tasks. Add one change at a time—e.g., expand tool descriptions, include repo layout hints, or add a “test-first” directive—and measure completion rates, tool call counts, and time to solution.

Use TerminalBench (or similar) for repeatable checks, but pair it with an internal scenario bank that mirrors your customers’ realities. Log every model message, tool call, and return value so you can replay failure cases. Track where the model hesitates: missing context (e.g., how to run tests), tool confusion (e.g., wrong working directory), or environment issues (e.g., missing dependencies). Each failure class suggests a targeted context or tooling fix.
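A minimal sketch of that instrumentation, assuming one JSONL trace file per run (the record fields and paths are illustrative):

```python
# Append every model message, tool call, and tool result to a JSONL trace so
# failure cases can be replayed later. The record fields are an assumed schema.
import json
import time

def log_event(trace_path: str, kind: str, payload: dict) -> None:
    record = {"ts": time.time(), "kind": kind, **payload}
    with open(trace_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example calls from inside the run loop:
# log_event("runs/todo-app.jsonl", "tool_call", {"name": "execute", "args": {"command": "make test"}})
# log_event("runs/todo-app.jsonl", "tool_result", {"name": "execute", "output_tail": "2 passed"})
```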

Finally, sandbox execution. Yaniv runs the agent inside a container so execute is powerful but contained. In production, adopt even tighter controls: non-root users, network restrictions, resource limits, and an allowlist of commands. Consider adding a “dry-run” option or a plan-then-execute pattern for risky operations. Human-in-the-loop checkpoints can also be valuable during initial rollouts, especially when agents touch customer repos.
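One way to express the allowlist and dry-run ideas is sketched below; the allowed prefixes and the policy itself are illustrative choices, not recommendations from the episode.

```python
# A hedged sketch of a command allowlist and dry-run guard in front of the execute tool.
import shlex

ALLOWED_PREFIXES = ("pytest", "make test", "pip install", "ls", "cat")  # illustrative policy

def guarded_execute(command: str, dry_run: bool = False) -> str:
    first_word = shlex.split(command)[0] if command.strip() else ""
    if not command.strip().startswith(ALLOWED_PREFIXES):
        return f"refused: '{first_word}' is not on the allowlist"
    if dry_run:
        return f"dry-run: would execute '{command}'"
    # Delegate to the real execute tool here: a subprocess inside the container,
    # running as a non-root user with network and resource limits applied.
    raise NotImplementedError("wire this to your sandboxed executor")
```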

Key Takeaways

  • Start small: a 100-line agent with a tiny system prompt and 2–3 tools (execute, read_file, write_file) is enough to ship working results, especially with a modern LLM like Claude Sonnet 4.5.
  • Benchmarks are helpful but incomplete: minimal agents like Mini SWE Agent rank respectably on TerminalBench. Use benchmarks plus a curated set of real tasks from your environment to measure what actually matters.
  • System context matters—and it mixes with yours: flagship agents include long prompts and detailed tool descriptions. Keep your added context concise, structured, and complementary so it doesn’t get drowned out.
  • Design tools for discriminability: clear names, short “use when…” guidance, explicit schemas, and side-effect notes help the model choose the right tool at the right time.
  • Constrain and guide the loop: set a step budget, encourage small changes and summaries, and log everything. This reduces thrash and makes debugging tractable.
  • Sandbox execution: run agents in containers with limited privileges and explicit allowlists. Add human checkpoints for sensitive operations.
  • Iterate with intent: add one context or tooling change at a time and measure its impact on success rate, tool calls, and runtime. Treat agent design as an engineering feedback loop, not a one-shot prompt.

As Yaniv underscores, none of this depends on insider knowledge of proprietary agents. It’s about understanding how the model consumes context and designing your prompts, tools, and evaluations so the agent—and your developers—can do their best work.