
The End of Fragmented Agent Context
Transcript
[00:00:00] Simon Maple: Before we jump into this episode, I wanted to let you know that this podcast is for developers building with AI at the core. So whether that's exploring the latest tools, the workflows, or the best practices, this podcast is for you. A really quick ask: 90% of people who are listening to this haven't yet subscribed.
[00:00:25] Simon Maple: So if this content has helped you build smarter, hit that subscribe button and maybe a like. Alright, back to the episode. Hello and welcome to another episode of the AI Native Dev podcast. My name is Simon Maple. I'm your co-host for the episode. And joining me is Guy Podjarny.
[00:00:41] Guy Podjarny: I'm also co-hosting here, CEO and founder of Tessl.
[00:00:45] Simon Maple: Wonderful. And today's gonna be a nice special episode where we're gonna talk about something there's a lot of hype around right now: skills, agent skills, just skills in general. What is a skill? How should they be used? When should they be used? What is a skill compared to other types of context that an agent may or may not use?
[00:01:02] Simon Maple: And also, what are the things we need to do to make sure a skill is right, a skill is good, a skill is accurate? And also some sneak peeks into some of the great things that Tessl announced last week around our support for agent skills and skills generally.
[00:01:19] Simon Maple: So there's a lot to pack into this episode, Guy.
[00:01:21] Guy Podjarny: Yeah.
[00:01:22] Simon Maple: Let’s get started.
[00:01:23] Guy Podjarny: Yeah. Let's talk about it, and skills are all the rage.
[00:01:25] Simon Maple: Absolutely. So why don't we just start, why don't we just define: what is a skill, Guy?
[00:01:29] Guy Podjarny: Yeah. I think that's sort of an interesting question coming from you as, you know, I sort of thought that you're sort of a skillful person.
[00:01:34] Guy Podjarny: No?
[00:01:34] Simon Maple: I'm a highly skilled person, but I struggle to define it. Lots of people find it hard to put their finger on it.
[00:01:40] Guy Podjarny: Maybe you'll learn something. No. So, skills are basically a standard unit of context to provide to agents that teaches them to do something, to acquire a skill. They were initially introduced by Claude as part of, you know, the Claude Skills capability.
[00:02:00] Guy Podjarny: Although they were very, very similar to things like Cursor rules or other kinds of built-in guidance that existed in other agents before. But still, Claude Skills definitely paved the path for this as a concept. And effectively it's just a standard file structure.
[00:02:17] Guy Podjarny: It defines a SKILL.md file that has some definition of what this does, and some metadata about when the agent should use it, alongside supporting files designed in a certain way so that the agent can acquire that skill, use it, and do so in an LLM-friendly way.
[00:02:39] Guy Podjarny: For instance, a core element is this progressive disclosure of having sort of bits of knowledge that the agent can traverse through, so it's loading the right information at the right time and doesn't overload the context window.
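To make that concrete, here's a minimal, hypothetical SKILL.md in the Agent Skills format: YAML frontmatter with a name and a description that tells the agent when to invoke it, a short markdown body, and pointers to supporting files that are only read on demand (the progressive disclosure Guy describes). The skill name and referenced files are illustrative, not taken from any real skill.

```markdown
---
name: commit-messages
description: Use when writing git commit messages in this repo. Enforces the team's conventional-commit format and scope rules.
---

# Commit messages

Write commits as `type(scope): summary`, in the imperative mood, under 72 characters.

- For the allowed types and scopes, read [scopes.md](scopes.md).
- For multi-commit stacked changes, follow [stacking.md](stacking.md).
```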
[00:02:47] Simon Maple: Yeah. A couple of interesting things that you said there. First of all, this isn't actually a new thing at all. You mentioned Cursor rules. I can't even remember when that came out.
[00:02:53] Simon Maple: That was a while ago. Claude Skills, again, is something that's been around for a little while. Since
[00:03:00] Guy Podjarny: September, I think.
[00:03:00] Simon Maple: Yeah, since September. So it's interesting that it's kind of heating up right now. Another thing that really interested me there is when you talk about the SKILL.md: it's just plain text.
[00:03:11] Simon Maple: Again, we're just talking about passing some text to an agent that an LLM reads that text and then does things based on that. So there's a lot, really, in terms of us being skillful as well about how we construct that, how we write that. And I think we'll talk a little bit about evals.
[00:03:28] Simon Maple: So why have skills all of a sudden become an overnight success? And everyone's talking about it and everyone's releasing various skill support. What's happened?
[00:03:36] Guy Podjarny: Well, the key thing that happened is similar maybe to MCP: it's become a standard. So Anthropic, very late last year, introduced an open standard.
[00:03:44] Guy Podjarny: Standard is a bit of a funny term; they've kind of coined one, called it Agent Skills, and published the definition of it. And a little bit similar to what we've seen with MCP, the rest of the industry lined up fairly quickly. And so now, within a short span of time, maybe within a month, you've seen comments from Cursor and Codex and, I think, Gemini and others about supporting skills themselves.
[00:04:08] Guy Podjarny: And what that does is it suddenly gives the users, the creators, anybody that's actually looking to use agents, an ability to create one thing, the agent skill, and roll that out. And so at the beginning of the year, you've seen Cursor make statements that they will embrace skills and actually phase out rules.
[00:04:33] Guy Podjarny: Again, they're very similar in many of their traits, so they will phase that out. And you've seen kind of this explosion of an ecosystem. The other very smart thing that Anthropic has done is that they created a "create skill" skill. Yes. That just made it very easy for people to start accumulating it.
[00:04:52] Guy Podjarny: And maybe the last important thing is that skills actually were initially introduced in Claude less as a development thing and more as an ability to create some instruction for non-developers when you're using Claude to, whatever, create an Excel sheet, right? Or perform some type of musical review, right?
[00:05:12] Guy Podjarny: And that approach carried over to agent skills as a whole. And so it also introduced a lot of non-developers to this opportunity of creating reusable artifacts that they can share across the team. That has been very, very exciting.
[00:05:32] Guy Podjarny: Or like reusable workflows. So a little bit of a taste of software development style capabilities of reuse to the non-dev world. And all of those combined to just create this flywheel. And now there's huge excitement over something that we've believed all along here at Tessl, which is context should be reusable.
[00:05:48] Guy Podjarny: It should be something that you don't want to reinvent the wheel. You know, and there's a difference between intelligence and knowledge.
[00:05:54] Simon Maple: Oh, absolutely. Yeah. Good. Nice little takeaway there. And let's go into a couple of things there. First of all, I've actually been doing a little bit of work with some of the folks at CodeGuard, and they have a body of context describing how you can review code in the most secure way, following very specific security rules.
[00:06:15] Simon Maple: And I think one of the interesting things, or one of the things that they saw as a big drawback, was the fact that if they want to produce this, they know that the people who are consuming it, some are gonna be using Windsurf, some are gonna be using Cursor, some are gonna be using Claude.
[00:06:27] Simon Maple: And so they actually need to structure it in multiple ways so that every different agent can consume it. And I think creating this standard that all agents can kind of build from allows you to effectively be able to deliver one thing, fingers crossed, hopefully, as long as everyone starts supporting it, that people can then just kind of pull in.
[00:06:46] Simon Maple: So I think that's a really important piece. Let's talk a little bit about context and context engineering and where skills sit in that, because really skills are part of overall context engineering. What would you say are the differences between skills specifically and the general context that an agent or an LLM can use?
[00:07:06] Guy Podjarny: Definitely there are many ways to basically introduce more words into the context window of the agents. You know, we need to remember that agents eventually still remain just interfaces to the LLMs. So every time there's a request to the LLM, and the question is: what is the data, right?
[00:07:25] Guy Podjarny: What are the tokens that get put into that? And so all of these different means of context engineering, they're just about giving the right sort of path to choosing the right words to include in the message, right? So in that sense, they're all a zero-sum game, right? Of what do you put in, and you can easily put too many things into the context window.
[00:07:44] Guy Podjarny: If you have so much context, so much like setting the groundwork, the actual instruction gets lost; or too few things, so things don't get known. So all of that is the sort of constant exercise. When you're talking about external context that you provide, I'd say there are sort of three primary buckets.
[00:08:01] Guy Podjarny: There are rules, which you kind of shove into the context window whether you like it or not, right? Come hell or high water, you would put them into that CLAUDE.md or into a must-use rule in Cursor, although Cursor sometimes ignores them. And those are mandated. They are very important things, but they take up context window space, so you can't put full-on documents this way.
[00:08:24] Simon Maple: You've got to be quite careful about the amount of context you add that way. Because you can bloat the context window if you push too much context in as mandatory reading that the agent always has to do; the context window is limited.
[00:08:39] Guy Podjarny: Exactly. So it's the most forceful way to guide an agent.
[00:08:42] Guy Podjarny: Yeah. But you have to be limited in content. Oftentimes they are an initial instruction and some link, some pointer to another location, right? Then you've got skills. Skills include, like Cursor rules, a tiny bit of data that goes into the context window so the agents can choose to invoke the skill.
[00:09:00] Guy Podjarny: That's implicit invocation. You can also choose to call a skill actively, like a command, in which case you don't need those things there. But if you expect the agents to invoke the skill at the relevant time, you have to put that little breadcrumb in the window. There's some development right now around maybe separating those out within the agents.
[00:09:18] Guy Podjarny: Could the agent go and consult some directory instead? For now, the breadcrumb is the reality. And then you've got docs. Docs are just information available for the agent to find, but they're not automatically surfaced to it. And so you either need to leave a breadcrumb somewhere, like with a rule, or you need to name them in a way that allows greps and other agentic search to find them.
[00:09:40] Guy Podjarny: But they're purely loaded on demand and they don't have any price to pay. So you can have as many docs as you want. If you have a thousand skills loaded, you might really cripple your agents today. But if you have a thousand docs, they're just available. It's more about which docs the agent loads at the right time and can it actually find them.
[00:09:57] Guy Podjarny: So these are the three types we have today. I'm sure we'll have more types over time. Again, all coming down to reusable context so that the agent doesn't need to infer everything from intelligence every time.
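As a concrete sketch of where those three buckets might live in a repo, using Claude Code's directory conventions as an example (other agents have their own equivalents, and the file names here are illustrative):

```
.claude/
  CLAUDE.md                 # rule: always loaded, so keep it short
  skills/
    commit-messages/
      SKILL.md              # skill: only its name/description sit in context up front
      scopes.md             # supporting file, read on demand (progressive disclosure)
docs/
  api-conventions.md        # doc: zero standing cost, found via grep/agentic search
```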
[00:10:07] Simon Maple: Yeah, and I think the interesting thing there is something that we call activation, and I'm sure many others refer to it as well. It leans back to the MCP days, like it was 30 years ago, when we were creating MCP servers, whereby you needed to be very clear in the names of the tools and their descriptions so that the agent would use them at the right times.
[00:10:29] Simon Maple: It's very, very similar when we talk about those various skills. You need to be very clear in the skill name and in the skill description so that it doesn't just try to do things by itself just because it can; it needs to use those skills at the appropriate times, right? Rather than just trying to do things its own way.
[00:10:43] Simon Maple: So I think those are very, very important.
[00:10:47] Guy Podjarny: And it's a good time to remember as well that while it's a standard format, the models are not standard. And so the same instruction in the skill text right now would be loaded by different agents, by different models.
[00:10:58] Guy Podjarny: And we know for a fact, we have repeated data to show, that the same words would not trigger Haiku and Sonnet and Opus to the same action. You know, Opus is much more of a smart-ass, right? It can choose to say, "No, I know better," and won't do it. And Haiku might need more detailed instructions.
[00:11:16] Guy Podjarny: And so at the moment, skills don't really solve for that. They are one standard unit of context, but it's not certain that the same words will be the optimal ones for different agents. So I'd say, you know, think of skills as an amazing new capability. It is absolutely worth leaning into.
[00:11:34] Guy Podjarny: We're doing that in Tessl. We think they're amazing. They are maybe the most standard way right now to reuse context. But like MCP, I think they are also a piece of the puzzle. And there will be all sorts of tools or just helpers that we would want to reuse for making agents successful.
[00:11:52] Simon Maple: That's really interesting. So essentially the standard provides a kind of standard bolt fitting, in the sense that you can integrate it with the agents, but your mileage may vary depending on which agent you use. Because each agent will assess the wording and use it differently, depending on the model behind it.
[00:12:10] Simon Maple: So that's a good insight as well. Okay. So skills, we can use them as indie developers, as hobbyists, as open-source developers. We can also use them in an organization to describe how we want to do certain things: the methodologies and processes that are key to our organizational requirements.
[00:12:31] Simon Maple: Now, when we think about adding skills as a first-class citizen into our organizational process, our development process, what do we need to consider? When building and owning those skills and actually distributing them and expecting other professional developers in our organization to make the best use of that, what do we need to consider?
[00:12:52] Guy Podjarny: Yeah, I think it's a good question, and I think the allure of skills is that they have this immediate impact. You create some static markdown file, you do it with a create skill, very, very low lift. And it helps, like right now; you can immediately, anecdotally see that it works.
[00:13:07] Guy Podjarny: But I think just like with software, there are differences between a one-time "it worked" and therefore it's awesome, and something that is a long-lived asset that you now need to live with. These are competencies that you want to reuse across the team. So I think to take skills to a professional level, to a team level, to an organization level, you're better served by thinking of skills not as a markdown file, but as a unit of software.
[00:13:26] Guy Podjarny: This is a competency, a reusable competency that you want the agent to have. So this puts a different lens on what are the tools that you need to be able to own skills, right?
[00:13:45] Guy Podjarny: And to be able to operate them and collaborate on them over time. There are probably many things to handle, but I'd say the three to focus on is: one, you need to be able to test skills. Just like software, if you want it to remain working, or even just assess whether it works today, you have to think about what's correct and then test, or in the world of AI, evaluate, against those.
[00:14:03] Guy Podjarny: Two is really thinking about how you distribute that software. We can talk about this more, but at the moment there's a sad reality: people are copying skills all over. They're designed to be reusable, and yet we duplicate them and copy them everywhere.
[00:14:25] Guy Podjarny: We've seen that movie and we know where that ends, so that's not awesome. And then the third is you have to think about how do you own them long term. You know, they will fall out of date just like any docs or anything like that. The models will change and so a new model will come along.
[00:14:38] Guy Podjarny: It will think it's very smart, right? Or you would want to use it with some sort of cheap model or open model. And so we have to think about how do we maintain those skills? How do we keep them up to date? How do we allow a team to collaborate when the person that wrote them leaves the organization?
[00:14:54] Guy Podjarny: And so the whole lifecycle management of it, I think those are the core ones to think of.
[00:14:59] Simon Maple: So these are three key pieces to skills: evals, package management capabilities, and lifecycles. Now, it's really exciting that Tessl, last week, released support for skills and in fact, would you believe it, Guy?
[00:15:14] Simon Maple: There is coverage and functionality for evaluating skills, distributing skills through a package manager style mechanism, and also support for skills in a more complete lifecycle way. Can you believe that? What are the chances of covering all three?
[00:15:29] Guy Podjarny: Thoughts become things.
[00:15:30] Simon Maple: Thoughts become things. So we're really excited about that. And you can go to tessl.io/registry to learn more and see some of those skills in that registry.
[00:15:38] Guy Podjarny: It's a good point also to just give a shout-out to the amazing Tessl team, because I think this has been built superbly and quickly.
[00:15:44] Guy Podjarny: I couldn't be more proud of the team for pulling together, excited by the capability. Possibly more excited even by just the sheer power that was seen in the team as we put it together.
[00:15:57] Simon Maple: Yeah. Amazing. Let's go into these three.
[00:16:01] Simon Maple: Let's break them down a little bit. Let's learn about why we need them and what they are. And then let's talk a little bit about what Tessl offers as part of this release. Start with evals. There are a couple of different types of evals we can offer here: review evals and task evals.
[00:16:18] Simon Maple: Break them down a little bit for us, explain what they are, and then how does Tessl fit in this wonderful world of evals?
[00:16:24] Guy Podjarny: Perfect. So evals are, most broadly, a way to answer the question of: how good is my skill? Right? Or is it just that you wrote some words?
[00:16:33] Guy Podjarny: I mean, I think you'd easily understand that if you put in unrelated text or you rambled on for an hour, it won't be good. So it's easy to imagine something terrible. And across their varieties, evals are about having a systematic way to say: how good is my skill?
[00:16:50] Simon Maple: And that's really important. Because when we talk about the anecdotal approach which you mentioned earlier in the episode, that's what we rely on today, which just isn't good enough. Imagine if we were writing code and said, "Anecdotally, this feels like it works. Let's ship it." It's just not good enough.
[00:17:04] Simon Maple: And we wouldn't do it for any other piece of software that we consider a key part of our workflow. So why would we do that with skills?
[00:17:09] Guy Podjarny: Yeah. And we all know people that take forever to get to the point of it, maybe like I am right now. So it's like, "When I was a child, I wanted this..." all of that type of context will not be very useful.
[00:17:18] Guy Podjarny: At the end of it, it's like, "And therefore when you git commit, use uppercase letters." Yeah, that's not very useful. So those are evals as a whole and we might add additional ones. We do have review evals and task evals as you mentioned. Let me actually read out a little bit.
[00:17:32] Guy Podjarny: Review evals focus on just making sure that the skill adheres to the best practices that Anthropic has released. Agent skills are cross-agent, but Anthropic is clearly leading the charge. And they created a bunch of best practices on their site that are good guidance on how to build one.
[00:17:52] Guy Podjarny: They're saying things like "be concise": conciseness is key. "Set appropriate degrees of freedom": make sure you have the right structure for the skill. And so the review evaluations start by assessing it against those metrics. That has already given us some pretty interesting insights.
[00:18:12] Guy Podjarny: First of all, we've seen that if you use the Anthropic "create skill" skill, it indeed scores quite highly. Not that surprising. And to be frank, we're using Claude to assess those. So Claude, when it creates using those instructions, probably has a bunch of that input.
[00:18:29] Guy Podjarny: And so we're seeing, "Great job, Anthropic." If you're using a skill to create skills, you're probably gonna score well on all of these paths. The other thing that we've seen is maybe a couple of points against Anthropic, which is they might not be using their own "create skill" to build that.
[00:18:44] Guy Podjarny: And we've actually now run review evals on several hundreds, maybe even past a thousand, different skills. They include a bunch of the Anthropic ones and we've seen some in which they don't score terribly well. I want to pick a little bit on the Anthropic's Canvas Design skill, which as the name implies, kind of helps you design the canvas content.
[00:19:08] Guy Podjarny: And we separate our scoring between the description as a whole and the content. On the content side, it got fairly low scores: 27%. And I just love the prose that Claude gives its own creators over here. For this skill, it says: "Conciseness: one in four. Extremely verbose with extensive repetition of concepts like craftsmanship and masterpiece throughout. Contains philosophical padding and redundant explanations that Claude doesn't need."
[00:19:35] Guy Podjarny: Very straightforward there, like explaining what the design philosophy is multiple times, repeating the same principles in different sections. So that's a good example of it. And later on it goes to the progressive disclosure: "one in four." And it goes on to say, "Monolithic wall of text with no references to external files for detailed guidance," and so on and so forth.
[00:19:54] Simon Maple: I'm glad it's Claude slamming its creators and not us or anyone else.
[00:19:58] Guy Podjarny: It's all in the family, and I'm sure this is an oversight, but I think what this goes to show is not at all a lack of skill from the creators of Canvas Design. It's just the need for tooling, right?
[00:20:09] Guy Podjarny: It needs to be automated. And it's possible this was initially created before "create skill." And then over time, it didn't get maintained. So anyways, the review evaluations are about that. They're about keeping you well-structured and saying, "How are you applying the best practices?"
[00:20:27] Simon Maple: Mm-hmm.
[00:20:28] Guy Podjarny: So that's review. The task evals go a step further and they actually create scenarios that you would run the agent against and see how well it did. This is actually another one of the best practices that Anthropic recommends. It says create the evaluation scenarios ahead of actually creating the skill; make sure you know what good looks like and capture those, but few people actually do it.
[00:20:53] Guy Podjarny: It requires more effort. And so we have done this for tiles already; we talked about that in some previous episodes. For our documentation tiles, we create a bunch of context, and we create some coding scenarios and run them with and without that documentation to make sure that documentation is good.
[00:21:08] Guy Podjarny: We do the same for skills. I will say task evals for skills are more of a work in progress. There are interesting questions about how do you best extract the evaluation scenarios out of the skill without leaking too much of the guidance itself into the task.
[00:21:24] Simon Maple: Yes.
[00:21:24] Guy Podjarny: And so there's some work there. So I'll share a couple of anecdotes of examples here, but take them with a grain of salt. We've seen maybe two interesting examples of those. One is we created task evals, 10 different coding scenarios for two fairly similar skills. One is called agent-browser from Vercel, and we've seen that the skill was dramatically impactful.
[00:21:46] Guy Podjarny: It took the success rate of the agent as it tries to implement that from about 28% to 71%. So a very, very significant boost. The skill is very valuable for agent-browser. And we saw another skill called browser-use, which conceptually is similar about the use of browsers within agents.
[00:22:07] Guy Podjarny: And with that one, we saw that the baseline average was 85% and in fact, using the skill was minimally negative. It actually dropped that by about three percentage points. And again, a lot of grains of salt, maybe we don't have the right tasks for it, but it's a good demonstration that maybe the skill is not that helpful.
[00:22:23] Guy Podjarny: And we've seen a few of those examples. We've seen cases where, for instance, some of the Anthropic skills were not bad, but were just not very helpful because they were just describing things the model already knows how to do. Which once again comes back to: these right now are evaluations with Sonnet.
[00:22:38] Guy Podjarny: Maybe Haiku needs it, maybe Opus doesn't need it. And you want to build those out. One more interesting anecdote: someone codified a summary of the best practices they recommend for skills, put it together into a skill, and published it.
[00:22:57] Guy Podjarny: And we ran through it and actually found that it had, again, a minimally negative effect on agents. We created 10 scenarios; it did very well in some and worse in others. One example where it did the worst is that it just over-complicated, over-engineered, a simple CSV-related task.
[00:23:14] Guy Podjarny: So all of those are, I think, great indications to say you need to know these answers. You need to know whether your skill works or doesn't. And you need to define what are the scenarios that simulate it. And then you need to be able to test that over time as your skill evolves and as the models evolve.
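As a minimal sketch of what such a with/without task eval could look like: this is not Tessl's actual harness, and `runAgent` and `grade` are hypothetical stand-ins for invoking an agent on a scenario and checking its output against that scenario's success criteria.

```typescript
// A minimal with/without eval loop. `runAgent` and `grade` are
// hypothetical stand-ins, not a real agent SDK.
interface Scenario {
  prompt: string;                      // the coding task to attempt
  grade: (output: string) => boolean;  // did the result meet this scenario's bar?
}

async function evalSkill(
  scenarios: Scenario[],
  skillContext: string,
  runAgent: (prompt: string, context?: string) => Promise<string>,
): Promise<{ baseline: number; withSkill: number }> {
  let baselinePasses = 0;
  let skillPasses = 0;

  for (const s of scenarios) {
    // Same scenario, run twice: once bare, once with the skill in context.
    if (s.grade(await runAgent(s.prompt))) baselinePasses++;
    if (s.grade(await runAgent(s.prompt, skillContext))) skillPasses++;
  }

  // The delta is the signal: a skill that doesn't raise the pass rate
  // (or lowers it) is adding noise, not knowledge.
  return {
    baseline: baselinePasses / scenarios.length,
    withSkill: skillPasses / scenarios.length,
  };
}
```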
[00:23:35] Simon Maple: There are so many juicy takeaways in this. If you are a consumer of a skill, it's really important to know this data so that you can understand the impact that the skills that you are pulling in are gonna make to your agent experience.
[00:23:46] Guy Podjarny: Mm-hmm.
[00:23:47] Simon Maple: The second piece is, as a producer of that skill, as the owner or creator of that skill, it's no use just creating a skill and assuming, again through anecdotal evidence.
[00:23:55] Simon Maple: Saying, "Yeah, this feels like it's working better." Particularly with the scenarios, it's important to say, "Hey, do you know what? This is going to make my chances of doing this thing the way I want it to be done three times, four times more likely to happen."
[00:24:24] Simon Maple: But I think the key thing here is when you mention the scenarios, if you have an eval where it runs across 10 scenarios, you can see what it's good at and what it's not good at. And that provides you with the insight to say these scenarios are working really well, I'm gonna leave that description as is.
[00:24:38] Simon Maple: These are the pieces I need to change. So from a producer point of view, they can actually create the best skill. And the consumer of the skill can then say, "Right, all these skills are available. I'm gonna pick these ones because the data is showing me that these can actually be more impactful." Of course, everyone's environment is different, so there is still gonna be a somewhat anecdotal "yes, this is actually working well" or not. But I think this gives us the start we need.
[00:24:52] Simon Maple: The second piece that you mentioned that I think is so important is that as models change, there's going to be variation across models, agents, and environments. But even if something works for a model in my agent today, as that model grows and as the training data changes, my context needs to change as well.
[00:25:10] Simon Maple: My skills potentially need to be updated. If a skill relies on things that are in old training data versus new training data, agents will act differently. And we may even need different versions of the same skill based on the agent or the model I'm using.
[00:25:30] Simon Maple: So these are important considerations that we as users and creators of skills need to constantly be thinking about.
[00:25:37] Guy Podjarny: Absolutely. And I would say there's an analogy to tests in real-world scenarios here: "works on my machine" is very much a real risk with skills as well.
[00:25:47] Guy Podjarny: As is the fact that many developers don't like writing tests; some love writing tests, but many do not. And so people avoid this hard question of what is correct. At Tessl, we'll help you generate things.
[00:26:08] Guy Podjarny: We'll help you build those out, but over time, the definition of what is correct and what this skill should do is actually more important than the words that you use because the words themselves will change from environment to environment, from model to model. Maybe over time something that wasn't in the training data now is, and so you can shrink it.
[00:26:27] Guy Podjarny: You can actually say less. Maybe at some point you don't need the skill at all. But all of those will be dynamic. What actually would be less dynamic is what do you want the results to be. That one you need to evolve when your actual requirements change or systems change and you need to call an API differently, or your business goals have changed.
[00:26:46] Guy Podjarny: But I think not only are these evaluations critical for you to be able to make the most out of agents today and be able to not regress over time, the definition of what good looks like is actually more important than the context itself.
[00:27:04] Simon Maple: Yeah. I think the other thing that intrigued me with what you were saying is how essentially we're trying to create a set of questions from an answer that we already have. We have a skill that defines the way we want to do things. We're trying to extract from that: "Oh, okay, you're trying to do this, you've got this end goal."
[00:27:23] Guy Podjarny: Yeah.
[00:27:24] Simon Maple: What I want to do is create a set of scenarios whereby we're not trying to feed or guide an agent into hitting that end goal as we would want it, but more what is the general way that people would try and get to such a goal, such an endpoint? And then will it do it in the way that the skill is trying to? So it's very hard to pull the question out of the answer almost.
[00:27:45] Guy Podjarny: Yeah.
[00:27:45] Simon Maple: And evals are hard. A good eval, just like a good test, is hard to do well. And I think it's important that we rely upon that good framework. And that's what, of course, the AI engineering team at Tessl and others are working on.
[00:27:58] Guy Podjarny: And I think it's an important note that right now, again, this is all a journey and at the moment we are talking about creating evaluation scenarios for public skills which are not ours. Yes. And with limited information, because we just have the skill. When we generate, for instance, documentation, we have so much more info.
[00:28:15] Guy Podjarny: We have the code itself, we have potentially other callers. So there's a lot more information to extract correctness from. But the truly correct evaluation scenarios, and that's what we do in Tessl when we engage more with a full-on customer, come when we actually get to inspect reality.
[00:28:33] Guy Podjarny: Upfront, you can extract a lot of that knowledge out of Git history and what the actions have been, and over time you can extract a lot of learning from agent logs and how we build. All of those are sources of correctness. When you're an organization and you're creating skills for yourself, you're trying to optimize and tune them on an ongoing basis.
[00:28:49] Guy Podjarny: And so you get that sort of feedback data from actual agent users on it. It's a little bit harder to do for the third-party skills that we consume. Just like with documentation, today we are creating evaluations for someone else's skills.
[00:29:07] Guy Podjarny: But we encourage and would love the owners of those skills to come to us, and we will pass those evaluations and the ownership of the eval scenarios to those people so they can tune them and correct them. And then we'll just run the evals and publish them.
[00:29:23] Simon Maple: I love it. And just clicking around on the tessl.io registry, where you can see tiles and skills, there are a bunch of featured tiles, but also the top-performing skills and tiles. And what's great is I can click through to a skill and see the eval results, with the different scenarios: I can open up a scenario and see what it's doing.
[00:29:45] Simon Maple: And I can see the results with and without the skill, with and without Tessl. And you can see what it's good at and what it's not.
[00:29:52] Guy Podjarny: And if you dig in a little bit, you'll also see some of the prose. You see some of the text that Claude writes as it assesses it; yes, it's entertaining and that's reason enough to keep it on the website.
[00:30:04] Guy Podjarny: But it actually has some very concrete advice to say, "Hey, this is the problem." You know, like that monolithic wall of text comment was entertaining, but it's also very clear guidance to the owner of that skill to say, "Hey, you should break this up."
[00:30:17] Simon Maple: We should totally have modes of how we want Claude to give us that information.
[00:30:21] Simon Maple: So if it's saying, "Okay, this is the review eval," you can get it to be "Be nice to me, I'm fragile." You can get it to be a little bit of a-
[00:30:30] Guy Podjarny: Yeah, I'm in a tough spot right there. Well, I think the other glimmer of hope is this: shortly, and we do this with customers today, there will be an "optimize" button next to it.
[00:30:42] Guy Podjarny: Hold out hope. You know, we have this "Do you want us to fix this for you?"
[00:30:44] Simon Maple: Nice.
[00:30:45] Guy Podjarny: Click this button and we'll create a new version of it. So that's just-
[00:30:47] Simon Maple: What LLMs and things like that are good at as well, right?
[00:30:49] Guy Podjarny: Yeah. It doesn't mean, and that comes back to, you need to define the evaluation that is correct, right?
[00:30:55] Guy Podjarny: Because if we generate documentation with an LLM and we generate evaluation with an LLM and the two are misaligned, you have to choose: who do you believe?
[00:31:04] Simon Maple: Yeah.
[00:31:04] Guy Podjarny: And what we want to believe is the evaluation. But we need the owner to bless that. Once they do, it's actually not that complicated to create agentic pipelines to optimize the context to achieve the goal.
[00:31:17] Simon Maple: We covered evals quite a lot there. I think evals is actually a really important piece, though, so I'm really pleased we went into depth on that. The two other pieces that you mentioned: package management, or package manager capabilities, and also the lifecycle of skills.
[00:31:31] Simon Maple: Let's talk about those, starting off with package manager.
[00:31:34] Guy Podjarny: Yeah. So I think there's more of a technical element to it, but it really comes out of thinking about skills as software, not as documentation. Until now, when you wanted to install a skill, your option was to just go off and copy it from wherever it lives, or you created a skill and put it somewhere so your team could do the same.
[00:31:54] Guy Podjarny: And remember that skills are designed to be a reusable entity. So really the core element to them is spreading them to your team. And that's inconvenient. We've seen a really handy tool from Vercel called skills.sh. Within a week it took off, and today people do "npx skills i" and provide a GitHub repo.
[00:32:13] Guy Podjarny: And what it does is it copies something from a GitHub repo to their .skills folder. And now they've committed it to their repo and duplicated it. And what happens if the GitHub repo got an update? Well, they're none the wiser. They don't get it. We've even seen worse than that.
[00:32:32] Guy Podjarny: We've seen the fact that because people install skills, but then maybe other people in the team are using different agents, they might even duplicate that skill into both the .claude/skills and the .cursor/skills. We're seeing an entertaining effect of all these different agents going and reading skills out of .claude/skills.
[00:32:50] Simon Maple: Yep.
[00:32:50] Guy Podjarny: It's really a mess and that is just not a good recipe for a happy future. We need something better, and this is a problem that we've already solved in software. We know how to handle reuse in software; we call it a package manager, right? Or a dependency system.
[00:33:06] Guy Podjarny: And it requires a few things. It requires, first of all, some form of manifest file that knows what you've downloaded. Actually, even before that, you need whatever piece of content that you have to have a version and an identifier to say, "Well, this is the thing that I've downloaded," and maybe SemVer indication of the delta and ordering of those versions.
[00:33:27] Guy Podjarny: Once you have a manifest file with those skills that you have, you should be able to run an update, run an install, maybe add some other criteria, separate ones for different environments like dev and production. So all of those things are solved problems. We have a lot of practices that we can bring to bear to make them work well.
[00:33:46] Guy Podjarny: We also know that pretty much every programming language has its own dependency system because each of them has some subtle, important differences in how you want to handle the dependency relationships and installations and environments. And so what we need is a package manager for skills.
[00:34:08] Guy Podjarny: And I will say skills and context, because as we mentioned, skills is not the only type of context. And that's basically what we're building in Tessl. You know, we've had the Tessl registry for a long time. It's always been this sort of concept of "this is a context that is versioned and is consumable" in a way that is like a professional software development team.
[00:34:29] Guy Podjarny: And now today we extended that to skills, which is really, really exciting. So it allows you to “tessl skill install” whatever skill that you want. We're not gonna get in your way if you just want to download something off a repo. But without any further effort, we will remember that you downloaded that in a manifest file and will allow you to update it.
[00:34:51] Guy Podjarny: Over time, we will install it to whatever agent you want, and by default, we will still commit it to your repo. So we don't get in the way; maybe other team members have not installed Tessl yet. But we think the right destination for you is to change that mode and stop vendoring those dependencies.
[00:35:10] Guy Podjarny: Stop committing them and duplicating them, and have just the manifest file. And just like you npm install, or whatever your equivalent is, in your repo, you also tessl install, or "tessl skill install" if it's just the skills that you want. And it will put them in the right agent for the right user for what they're using.
[00:35:30] Guy Podjarny: And so what we generally believe, and it doesn't feel like a complicated or harsh statement, is that skills just need a better distribution system.
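To illustrate the manifest idea: this is a sketch of the concept only, not Tessl's actual file format; the field names and skill identifiers below are assumptions.

```json
{
  "skills": {
    "vercel/agent-browser": "^1.2.0",
    "acme/secure-code-review": "~2.0.1"
  }
}
```

With an identifier and a SemVer range per skill, an install command can materialize each skill into whichever agent directory a teammate actually uses, and an update command can pull in upstream fixes instead of leaving stale copies committed in every repo.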
[00:35:40] Simon Maple: Yeah, and I love the two differences there. Obviously, what we talked about with some solutions is more of a directory, I guess.
[00:35:48] Simon Maple: It doesn't necessarily host the various skills; it more points to other skills. There are benefits and drawbacks to that approach. I think when you have something more like a traditional package manager that hosts the content, it can look after a bunch of things for that content, including the versioning and so forth.
[00:36:10] Simon Maple: And including in the manifest that says, "These are the things that I want to use." So I think that ability to say, "Okay, if I want to push something to Tessl, publish a skill to Tessl," I can have all of that cool stuff, including things like evals and versioning and things like that.
[00:36:27] Simon Maple: However, if I just want to use it lightly, I could actually say, "Actually, I want to point to this GitHub repo." I don't want to necessarily publish anything or own anything within the Tessl registry. And so that will still happen. We'll still add that to the manifest, but of course there will be a few little drawbacks.
[00:36:44] Simon Maple: We wouldn't have the evals; we wouldn't have things like the ability to pick a version, because it's just there on GitHub, it's not necessarily versioned in a registry. So we can't pin it; we more or less point to the latest. But I think these are really, really important things.
[00:36:58] Simon Maple: When we think about skills as a first-class citizen in a developer workflow, these are the capabilities that we need to do professional development, and to do auditing as well. "What skill did I use to create these types of things?" Well, you can look at your manifest and say, "This was the skill I used; these were the processes."
[00:37:16] Guy Podjarny: And I think, I don't want to diss the existing capabilities. I mean, JavaScript thrived before NPM, Java thrived before Maven. They're not necessary to be able to provide value, but just at the pace that everything is happening right now, I think we need to move quickly with this as well.
[00:37:33] Simon Maple: Yeah.
[00:37:33] Guy Podjarny: And over time also, it's fair to expect Tessl to introduce other things that you have in package managers. Like, can we inspect skills to see if they're malicious? We already run these evaluations to give the consumer some indication, to say, "Well, I'm seeing 17 different skills that are trying to do the same thing."
[00:37:53] Guy Podjarny: Give me some indicators to tell me which one might be better for me, or which is worse. And so all of those traits will come along: learning from the past versus trying to ignore it.
[00:38:04] Simon Maple: Absolutely. It's a similar pattern to other things that we've used.
[00:38:08] Simon Maple: So why wouldn't we do things in that? It's hopefully a healthy ecosystem.
[00:38:11] Simon Maple: Move on to lifecycle, because I think we've touched on things like versions and stuff like that.
[00:38:25] Simon Maple: So we have about 14 minutes left in terms of
[00:38:28] Guy Podjarny: I think it's okay. The lifecycle can be short. I think we've covered a lot of the things.
[00:38:31] Guy Podjarny: It's actually mostly it.
In this episode
One skill took coding success from 28% to 71%. Another made things worse.
Guy Podjarny and Simon Maple tested 1000+ agent skills and reveal which ones actually work, which hurt performance, and why anecdotal evidence isn't enough anymore.
Tessl Skills Registry is the first package manager for agent skills with built-in evaluations, versioning, and lifecycle management. Explore tested skills and see real performance data: [https://tessl.io/registry](https://tessl.io/registry)
On the docket:
- Claude roasted Anthropic's own skill with a 27% score ("monolithic wall of text")
- Why some popular skills actually decrease agent performance
- How Tessl is bringing package managers and evals to agent skills
Whether you're creating your first skill or managing them across your dev team, this is your roadmap for making agent skills actually work.
In this episode of the AI Native Dev podcast, co-hosts Simon Maple and Guy Podjarny (CEO of Tessl) dive deep into the world of Agent Skills. As AI agents move from experimental toys to core development teammates, the industry is racing to standardize how we "teach" them new tricks. The duo explores the shift from "it worked once" anecdotes to professional-grade Context Engineering, the emergence of the Anthropic skill standard, and why your skills need a package manager just as much as your JavaScript does. Along the way, they share eye-opening eval data on Anthropic’s own skills and introduce Tessl’s new registry built to professionalize the AI-native lifecycle.
Beyond the Markdown: Defining the "Skill" Standard
For a long time, guiding an agent meant hacking together "system prompts" or messy .cursorrules. Podjarny defines a Skill as a standard unit of context that teaches an agent a specific competency. While the concept isn't brand new, it has reached a "Goldilocks" moment because of the Anthropic Agent Skills standard.
Similar to how the Model Context Protocol (MCP) standardized tool-calling, this format provides a structured SKILL.md file and metadata that allows different agents, from Cursor and Windsurf to Gemini and Claude, to consume the same instruction set.
The Insight: Intelligence (the LLM's reasoning) is different from Knowledge (the context you provide). Skills bridge that gap, using "progressive disclosure" to feed the agent only the information it needs, when it needs it, without bloating the context window.
The Three Buckets of Context Engineering
Not all context is created equal. Podjarny breaks down the "Context Engineering" hierarchy into three distinct categories:

| Bucket | How it loads | Context cost | Best suited for |
| --- | --- | --- | --- |
| **Rules** | Forced into the context window every time (e.g. CLAUDE.md, must-use Cursor rules) | Always consumes space | Short, mandated guidance, often a pointer to more detail |
| **Skills** | A small name/description breadcrumb sits in context; the full skill loads when invoked | Small standing cost per skill | Reusable competencies the agent should pull in at the right moment |
| **Docs** | Purely on demand, via agentic search or a breadcrumb | No standing cost | Reference material in any quantity, as long as it's discoverable |
The challenge is Activation. Just like an MCP tool, a skill is only as good as its name and description. If the agent doesn't know when to pull the lever, the skill is useless.
Evals: Why "It Feels Better" Isn't Good Enough
One of the most provocative segments of the episode covers Evaluations (Evals). Most developers currently treat skills as "vibes-based" software. Simon and Guy argue that if skills are part of your production workflow, they must be tested like software.
Tessl’s new research into review evals revealed some surprising results:
- The "Prose" Problem: Even Anthropic's own "Canvas Design" skill was flagged by Claude for being "extremely verbose" and containing "philosophical padding."
- The Impact Gap: In task-based evals, one Vercel skill boosted agent success from 28% to 71%, while another similar skill actually had a negative impact.
The takeaway? You need systematic evals to know if your context is actually helping or just adding noise that confuses the model.
The "NPM for Skills": Solving the Copy-Paste Crisis
We’ve seen this movie before. In the early days of Java and JavaScript, developers copied libraries manually. Today, "Agent Skills" are in that same messy "copy-paste" phase. People are duplicating .skills folders across repos, leading to version drift and unmaintained instructions.
Podjarny argues that we need a Package Manager for Context:
- Version Control: Using SemVer for skills so teams know what has changed.
- Manifest Files: A single source of truth (like package.json) to track what skills a repo relies on.
- Cross-Agent Distribution: A tool that can install a skill into .claude, .cursor, and .windsurf simultaneously.
Tessl has launched a Skill Registry to act as this central hub, allowing developers to `tessl skill install` proven competencies rather than reinventing the wheel.
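As a sketch of that workflow (only `tessl skill install` is named in the episode; the update step and the skill identifier shown here are assumptions):

```bash
# Install a skill; it gets recorded in the project's manifest
# (skill identifier is illustrative)
tessl skill install vercel/agent-browser

# Later, refresh installed skills from the registry instead of
# re-copying files by hand (command name assumed)
tessl skill update
```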
Key Takeaways
- Skills are Software: Stop treating them as static docs. They need versioning, lifecycles, and ownership.
- Standardize or Die: The Anthropic standard is winning; lean into it to ensure your instructions work across Cursor, Windsurf, and Claude.
- Ruthless Conciseness: LLMs hate "monolithic walls of text." Use progressive disclosure to keep the context window lean.
- Measure the Delta: Use Task Evals to compare agent performance "with vs. without" a skill. If it doesn’t move the needle, delete it.
- Stop Vendoring: Don't manually copy-paste skills into your repos. Use a registry and manifest file to manage dependencies.