
Cisco Principal Engineer's Fix for AI Code Security
Transcript
[00:00:00] Simon Maple: Hello and welcome to the AI Native Dev. My name is Simon Maple, and I'm at Cisco Live today in Amsterdam. Joining me is John Groetzinger. How are you doing, John?
[00:00:09] John Groetzinger: Great, Simon. How are you?
[00:00:10] Simon Maple: I'm doing very well, thank you. Today, in this episode, we're going to be asking the big question: Can we guide coding agents to write secure code out of the box?
[00:00:19] Simon Maple: And can we encourage and teach coding agents to be able to spot vulnerabilities from existing code bases or code changes? We're at Cisco Live today in Amsterdam, and we just did a session earlier on your first Cisco Live. You're the speaker. How does it feel?
[00:00:33] John Groetzinger: It's really empowering. Finally, finally, a Cisco Live session.
[00:00:37] Simon Maple: Tell us a little bit about what the session was today.
[00:00:40] John Groetzinger: Yeah, I mean, the session was kind of my history as a developer trying to wrangle this world of AI coding in the enterprise space. Particularly because it's challenging to write secure code. So it's just kind of walking people through my history and the last two years to how these tools and models have developed.
[00:00:57] Simon Maple: Awesome. Awesome. And tell us a little bit about who John is. You're a principal engineer at Cisco. You're part of the CX engineering team. Tell us a little bit about what you do day to day.
[00:01:06] John Groetzinger: Yeah, my day-to-day is a lot of AI focus, of course, but it's a newer team, like two years old, trying to figure out how we bring more value to our customers with AI.
[00:01:16] John Groetzinger: And it's building agents for our customers, building platforms for our customers. Just really trying to enable the Cisco technology because there's a lot of it. Technology's complicated. Can AI help us and help them? That's kind of what we're trying to do.
[00:01:29] Simon Maple: So we mentioned CodeGuard at the session today, and I think there are a couple more sessions tomorrow and the day after with Omar Santos.
[00:01:39] John Groetzinger: Yeah.
[00:01:40] Simon Maple: And of course Omar is obviously one of the proud owners of CodeGuard. Tell us a little bit about what CodeGuard is really there for and the journey it's been on.
[00:01:49] John Groetzinger: Yeah, so I mean, CodeGuard, you can think of it as security skills for your AI coding agent, right? So we really want our developers in Cisco across all of our orgs to use AI coding.
[00:02:01] John Groetzinger: Because it accelerates software development, right? We want to make more software. But doing that securely and to our standards is very important, right? And so our security and trust org is trying to wrangle all that and figure out how we enable our developers without shooting ourselves in the foot.
[00:02:14] John Groetzinger: And so CodeGuard was kind of the first attempt at that internally. It's just kind of a list of skills. What is secure development? What are the kinds of things we need to look for while we're writing code? And then the real problem was how do we ship that to all our developers, right?
[00:02:27] Simon Maple: Mm-hmm.
[00:02:28] John Groetzinger: We use all kinds of tools in Cisco. It's a company of that size, so we don't all use the same tool. We'd love to, but it's just not the reality, in my mind, right? So how do we ship that into Windsurf, into Cursor, or into Claude Code with low friction right now? We don't need to add more friction for the engineers, so how do we just make that easy?
[00:02:45] John Groetzinger: That's kind of the idea behind CodeGuard. That's what we're trying to do.
[00:02:48] Simon Maple: Awesome. Awesome. And I think it's such an important problem because people who use agents are stung by a number of things. And one of them, of course, is secure code.
[00:03:00] Simon Maple: And the biggest problem, of course, is that agents, the LLMs behind them, have learned all of our bad habits, our worst practices, and they average them out, and we tend to get that back. So it's really, really important to provide that guidance, sharing that knowledge around. Let's talk a little bit about that; we mentioned how many developers there are at Cisco.
[00:03:19] Simon Maple: Obviously a huge organisation, a huge enterprise. How is a company like Cisco embracing AI today across its development organisation?
[00:03:28] John Groetzinger: Yeah. I mean, I certainly can't speak for our whole company; it's very massive, as you mentioned. But from what I'm seeing (I do internal training, so I get some visibility across the organisation of what people are doing).
[00:03:38] John Groetzinger: But there are pockets that I don't. To summarise, our executive leadership really enables this; they want us to embrace this and learn it, and we know it's a new technology, so it's not all figured out, right? There's no roadmap to follow for this. So they let us get creative and try new things, but obviously we have to do that securely and without blowing the budgets, right?
[00:03:58] John Groetzinger: So, I mean, yeah, it's a fine line to balance, but I feel very empowered as a developer at Cisco to try these new tools and figure out productive patterns. Share that back with my organisation or cross-organisation stuff. Just kind of the culture we have here.
[00:04:14] Simon Maple: Yeah. And in the session we had earlier, you talked about the Stone Age and the dawn of civilisation and where we are today, and it's really this progression of learning. What would you say are the things that unlocked a new way of working in your development environment?
[00:04:33] Simon Maple: Were there tools? Was it agents? Was it methodologies? What would you say are the things?
[00:04:39] John Groetzinger: Oh yeah, definitely a combination, but I think the tooling was key because early on when it was just a chat window, right? There were only a few models, and people weren't really using them for code quite yet.
[00:04:51] John Groetzinger: Like I was very much heavily into vibe coding back then, copying and pasting between windows, but it was so clunky, and I was like, "There's got to be a better way to share this." Like, I want this engineer to try this. That stuff took a long time to figure out, and I think it was really with the agentic IDEs that I started to see that, oh, this is what I want.
[00:05:08] John Groetzinger: Like this is magic. I make that plan, that specification. I'm always about that now.
[00:05:13] Simon Maple: Cursor, Windsurf, that type.
[00:05:15] Simon Maple: Yeah.
[00:05:15] John Groetzinger: Yeah, of course. Just specification driven. Even you guys' tool now; it doesn't matter what IDE I use, that tool helps a lot. Yeah. But yeah, that's just the tooling; it was, I think, probably the biggest breakthrough, next to the models, obviously.
[00:05:27] John Groetzinger: But that's kind of a given that we knew the models would get better at this stuff, but I think the tooling is much better. Like if we take the tooling we have today and we used it two years ago, it would probably still be really good, right? Even with the older models, you know?
[00:05:41] Simon Maple: Yeah.
[00:05:41] John Groetzinger: So for me that was the difficult part to navigate, like, how do I do this effectively? Felt like I was getting burned at the fire a lot. Yeah. I was like, "Ooh," shiny new tool, right?
[00:05:50] Simon Maple: Yeah, it's an interesting discussion actually. Because of course there are three things. There are three pieces here.
[00:05:54] Simon Maple: You've got the tooling, you've got the agents, and then you've got the LLM under the covers. And of course, yes, there is a level of LLMs within the agents as well. But it's interesting how sometimes we just think of it as one experience, which I suppose it is, that one workflow. But really there are a number of aspects here that all contribute to a successful developing workflow, and I think people probably think too much about the model sometimes.
[00:06:15] Simon Maple: And I think you're right. A good user experience for a developer, with the IDE or even with the terminal, and a good agent that actually walks you through that and does the planning and all those different types of things, those kinds of things, I don't know about you, but for me, are almost more important than the model.
[00:06:37] Simon Maple: In some cases, that's what allows you to perform that workflow at all. Otherwise it's more of a bot; there's no workflow.
[00:06:44] John Groetzinger: Yeah, it felt that way to me as well. Yeah. Like I'd always argue with people about models, and I'd be like, "This model's better." And I would argue with people about that, but why?
[00:06:52] Simon Maple: Oh, the show me why.
[00:06:52] John Groetzinger: Show me why.
[00:06:53] Simon Maple: The Reddit flame wars. "This model's better." "No, it's not better for me," and blah, blah, blah.
[00:06:58] John Groetzinger: Even the "trust me, bro" benchmark.
[00:07:00] Simon Maple: Yeah.
[00:07:00] John Groetzinger: No, I don't trust them.
[00:07:02] Simon Maple: I like that. Trust me, bro. That's the only type of benchmark I'm listening to now. That's sort of "trust me, bro."
[00:07:06] Simon Maple: Okay, so let's go into a little bit more about CodeGuard now, and we'll talk about, I guess, how that evolved. When you started, I guess you had a ton of best practices and a ton of instructions or guidance. How did you go about turning that into what it is today?
[00:07:22] John Groetzinger: Yeah, so I mean the security and trust org, that's where CodeGuard kind of evolved internally, and I think they took a lot of the OWASP best practices and industry-standard stuff that we already apply in Cisco internally.
[00:07:33] John Groetzinger: But just kind of reducing the context. Because if you've ever tried to go look at the OWASP rules, it's very encompassing, right? And it's not something you can just feed to an agent and expect it to make sense of it, or of your code. So it's just simplifying it, right? Even for humans, for agents, we need the same simplification, and then packaging it, right?
[00:07:50] John Groetzinger: So they wanted it to work everywhere. It is just text, right? I mean, at the end of the day, this is the context we are shipping. Internally, we work on different methods to do that, and I think just a repository of code is a simple way that everyone can understand.
[00:08:05] John Groetzinger: But then there are philosophical debates on "Do I commit this to my repo?" But then what if it changes, right?
[00:08:09] Simon Maple: Hmm.
[00:08:09] John Groetzinger: Wrangling that world is still not solved.
[00:08:13] Simon Maple: It is interesting what you say there about the fact that these are basically OWASP rules, right?
[00:08:20] Simon Maple: And nothing has changed in terms of the security space, of what we are actually looking for in code. Is there anything agent-specific in there, or is it entirely about the code that gets produced? Could you run this against human-written code as well?
[00:08:31] John Groetzinger: Yeah, of course. I mean, it is mainly for just traditional code, right?
[00:08:34] John Groetzinger: AI agents are creating traditional code, but they did also, I think it was two weeks ago or so, add MCP security, right? So that is obviously top of mind as well. We want to give our agents all these tools; are they secure? Are they exfiltrating data, right?
[00:08:48] Simon Maple: Yeah.
[00:08:49] John Groetzinger: That kind of stuff can be really hard to observe when you have an agent that is just going crazy, right?
[00:08:52] John Groetzinger: Yeah. So there is also MCP security built in as well, and I believe skill security is coming as well. So it has evolved beyond just traditional software at this point.
[00:09:03] Simon Maple: And for the distribution of this, now, you obviously want all of your users to consume the rules and the guidance, and the way that you did that was you built it into skills, which you made custom for each of the different agents.
[00:09:16] Simon Maple: So there are dot cursor rules, there are Claude skills, and things like that. What were the kinds of problems, I guess, that were caused when you were trying to satisfy so many different agents and their different structures?
[00:09:30] John Groetzinger: Yeah. I mean, I can really only speak as a user of CodeGuard, and I actually use all those tools.
[00:09:34] John Groetzinger: So I can speak from the experience of trying to install it in each one. I mean, it is really just kind of a package. You unzip it, and if it is for Windsurf, it unzips into that location, and that is fine, but again, maintaining that is very difficult. Yeah. Because, like, I wish we would just standardise on this stuff in terms of, like, agents.md, right?
[00:09:52] John Groetzinger: Yeah.
[00:09:52] Simon Maple: But
[00:09:52] John Groetzinger: Even some IDEs that claim they honor it, they do not, and that is really the problem. And maybe like Windsurf would start to follow that standard, and then they would not do their old rules. So now it is maintaining all that context. Every developer having to do that just does not make sense.
[00:10:07] John Groetzinger: Like it's just too much friction.
[00:10:08] Simon Maple: Yeah.
[00:10:08] John Groetzinger: People just do not do it. Right. Yeah. And it just makes me not want to use CodeGuard because it is too hard to keep up to date, right? And that is not, you know, that is bad. Yeah. I want it to be easy. It should just always be up-to-date and just easy for my agent to pull in.
[00:10:21] John Groetzinger: Mm-hmm. You know, that's what I want.
[00:10:22] Simon Maple: So let's talk about learnings then. So obviously you created this. What were the things that you took away straight away from using it and playing with the structure of the CodeGuard skill?
[00:10:33] John Groetzinger: Yeah. I mean, it was so specific to the tool that you are putting it into, of course.
[00:10:37] John Groetzinger: But, you know, I even had to go learn stuff on how they deployed it.
[00:10:41] Simon Maple: Yeah.
[00:10:41] John Groetzinger: On like, oh, I was doing this wrong when I was using this IDE. Right? So, like, that stuff was great too, like, it is a learning experience, but at the end of the day, it is really just a bunch of, like, markdown files that are intuitively titled for the security practice that you are trying to do.
[00:10:54] John Groetzinger: And not every security practice applies to every piece of code or repository. So being able to kind of pick and choose: oh, I am trying to do this; yes, session management is a big problem, I need this skill, or SQL injection, right? Just take those two skill files. I can read them as a human to understand what they are doing, but I feed them to the agent and have it review my code, and then I can have a conversation about it with my agent.
[00:11:18] John Groetzinger: Because I do not always agree with my agent, of course.
[00:11:20] Simon Maple: Mm-hmm.
[00:11:20] John Groetzinger: And sometimes it makes assumptions that are wrong, and I have to steer it in a direction. But with security, I want that to be easy. No one knows everything about the different ways you might exfiltrate data or, you know, exploit code.
[00:11:35] Simon Maple: And what about, like, the repositories of people who were trying to pull CodeGuard into their agents? How did it affect other people's repositories?
[00:11:45] John Groetzinger: Yeah, I mean, there are a lot of opinions there, and that is kind of the problem. There are, you know, some people who say this should be committed right to your repository.
[00:11:51] John Groetzinger: So it is always there. But if you have a team of developers that use all different tools, you are just bloating your repository with all these dot files for each IDE. Even I myself wrote a script (no, Claude wrote a script) that is like an agents linker. So I would write just one agents.md, and then I would run the script, and it would create all those directories and symlink them back to my single source of truth.
[00:12:10] John Groetzinger: Mm-hmm. Right. But it still bloats my repository with a mess, right? And so I do not agree with the philosophy that stuff like CodeGuard should be committed to the repo, because every time it updates, you have to merge it in. It really complicates PR processes too. It is just noise that your team should not have to deal with.
[00:12:27] John Groetzinger: Because it's not part of the central code base.
[00:12:29] Simon Maple: Yep.
[00:12:29] John Groetzinger: I feel like a repository should really reflect what the repository is about and not external stuff. Just like NPM packages, I do not have the whole NPM package committed. Right. That would be crazy, right? So same concept. I really think it should be treated that way, as you guys do.
[00:12:42] John Groetzinger: I think of it as Tesla.
[00:12:44] Simon Maple: Absolutely. And now when we think about the skill being used by the agent identifying issues in existing code, or trying to find issues or trying to write secure code with the guidelines as the best practices. What were your immediate learnings from how it was working, and how did you know it was as good as it could be?
[00:13:07] John Groetzinger: Well, I mean, I did not. I think it was a lot of "Why did you use the skill this time? Why did you not use it this other time when I wanted it?" Mm-hmm. Sometimes I would have to go out of my way to be like, "Did you even use these skills and CodeGuard for this?" And it would be like, "Oh no, I did not do that. Let me go do that."
[00:13:22] John Groetzinger: And then it would change the code. And it is like, "Ah, it has wasted my time." Right. Yeah. So there is definitely a lot of that with any skill development, not even just with CodeGuard. I go through that a lot.
[00:13:30] Simon Maple: Yep.
[00:13:31] Simon Maple: Typical activation of skills within the agent. Yeah.
[00:13:34] John Groetzinger: It's not always easy.
[00:13:34] John Groetzinger: I mean, of course you can try to set up hooks and stuff in Claude Code and make it trigger off of those, but even that is not perfect, and it can actually be worse.
[00:13:46] Simon Maple: Do you have any tips or best practices about how you can increase the activation? Is it something that is in the skill as a producer of the skill that it is good to do? Or is it something more? As a user, you want to guide; you want to tell the agent, "I need you to do these things and this sort of thing."
[00:14:00] John Groetzinger: Yeah. I mean, when I build a skill, the thing that is top of mind for me, from what I have just personally learned, is: keep the skill lean.
[00:14:07] John Groetzinger: I think it is a best practice anyway on this topic. Yeah, but that is something I have learned; you cannot bloat your skill.md. It is just going to become useless. So I think referencing things in the skill.md is really the way to go.
[00:14:17] Simon Maple: Mm-hmm.
[00:14:18] John Groetzinger: And when the model does not pick up on key things, like when I am having a conversation, actually I just ask it. I just say, "Why did you not pick up this hint in the Skill MD about the security review?" Right?
[00:14:28] Simon Maple: Mm-hmm.
[00:14:28] John Groetzinger: And then it will explain to me, and I will just ask it, "Can you go update your agents.md or the skill.md in a way that next time you go to use this, you will actually pick it up and you do it?”
[00:14:39] Simon Maple: So you tell the model to fix itself, because it is a self-healing kind of flow.
[00:14:40] Simon Maple: Yeah.
[00:14:40] John Groetzinger: Who better to ask than the model itself? Yeah. So I, that is my strategy, and it works quite well. And also it is less thinking for me. So it is just like, "Why did you do this?" Oh, okay. Well, do not do it that way, and change it for yourself.
[00:14:50] Simon Maple: Yeah, yeah.
[00:14:50] John Groetzinger: It is a lazy approach, but it works surprisingly well.
[00:14:54] Simon Maple: And how much can you? You mentioned putting things into the agents.md and things like that. With context like this, obviously there is a huge amount of context, and it is quite easy to blow the context window pretty quickly. Did you play much with the context window of how much you should put in and how much is mandatory for it to read versus just allowing it to be a reference?
[00:15:12] Simon Maple: And you can play with this whenever you feel as an agent you need more information on this? Was there a balance there?
[00:15:19] John Groetzinger: I do not know what the balance is. I am still trying to find out myself.
[00:15:22] Simon Maple: We're all learning.
[00:15:23] John Groetzinger: I have certainly tried, but I do not know that I have anything other than anecdotal, you know, memories.
[00:15:28] John Groetzinger: I do not have a lot of time to spend engineering context management for my agents as much as I want to. I have to write actual code, right? Yeah. Or ship software, not play with AI coding and, you know, babysit it. Actually, yeah. So it is really a trial-and-error learning thing. I do not have time to really engineer it.
[00:15:45] Simon Maple: Mm-hmm.
[00:15:46] John Groetzinger: Which is unfortunate, because I think that is really what is needed to get the optimal performance out of these: really engineering the context and how it is picked up, how different models key into that, and how the different tooling is actually plugged in. It is a very complex, you know, matrix.
[00:16:01] John Groetzinger: And I just do not have the time to really dig into that as much as I would like.
[00:16:05] Simon Maple: Yeah. So what were your learnings then once you had the CodeGuard skill? What were the lessons that you kind of went through in terms of getting that skill as good as it can be?
[00:16:18] John Groetzinger: Yeah. I mean, learning just the basics of model activation and that stuff, not even just with CodeGuard, right? I have written a lot of other skills myself, and I spent a lot of time trying to understand what works well and what does not. Too much context is always a problem.
[00:16:32] Simon Maple: Yeah.
[00:16:32] John Groetzinger: You do not want to bloat that skill.md with just unnecessary information.
[00:16:36] John Groetzinger: I mean, most of my skill.md files are built by my agent, which means they are overly verbose. So I found that problem very early. I think it makes way more sense, and I get way better performance, when I kind of cross-reference things. So whenever I have some large concept that is important, I will say, "Put that in a new file."
[00:16:50] John Groetzinger: Reference that in the skill.md, and that improves a lot. But it does not always pick up from the skill.md to reference that when it should. So I am still learning the best way to kind of do that; each model is very different from that perspective. So it has not been easy.
[00:17:07] Simon Maple: And in terms of how you knew the context was right, before you ran evals, was it anecdotal? Was it a gut feeling that this was correct?
[00:17:17] John Groetzinger: It was largely a vibe feel.
[00:17:19] Simon Maple: Vibe feel. Oh, vibe feeling.
[00:17:21] John Groetzinger: Yeah. Did I get a good vibe that this worked well? Did it do the tasks that I wanted to do efficiently?
[00:17:26] John Groetzinger: Or did it stumble around forever, ten minutes thinking or whatever? No, it is very anecdotal. I still am trying to figure out the best way to really evaluate these more like software is evaluated; that is not easy to do. So yeah, I do not have a good data-backed way to.
[00:17:45] Simon Maple: So what did you do? Was there anything you tried? Did you try using LLMs to judge it or anything before Tessl?
[00:17:53] John Groetzinger: So I would say I shared it with other engineers. Only when it was like, oh, I did it two times myself, and I got a good result out of it, then maybe I would share it with someone else and get their anecdotal feedback.
[00:18:05] John Groetzinger: But it is anecdotal, anecdotal, and anecdotal.
[00:18:07] Simon Maple: Anecdotal at scale.
[00:18:08] John Groetzinger: Yeah. And then of course they would have; they are like, "Oh, I use a different tool in a different model." And then they have a very different experience. I do not have time to go.
[00:18:16] Simon Maple: Which could actually have nothing to do with the skill. It could actually be the agent, the model, and a number of other things. Right.
[00:18:20] Simon Maple: So we ran the Tessl evals: the review eval, and then the task eval. Tell us a little bit about the journey you went on with the review evals first.
[00:18:33] John Groetzinger: Yeah. It was actually fun. It was a good experience, honestly.
[00:18:37] John Groetzinger: I did what you would expect; I just ran the skill review. And then instantly, within just a couple of seconds (I do not think it was very long, and I always assume agents will take a while, so I might have walked away, but it was very quick), I came back, and I have this rubric to review: not just categorised feedback but, more importantly to me, actionable feedback.
[00:18:57] John Groetzinger: These models, when you ask them for feedback on things, tend to just judge you harshly, and they never provide suggestions. I mean, sometimes they do, but they give you too many options. But this is like grounded in, "Oh, well, this part here is wrong for these reasons," or "You might be able to improve with certain things."
[00:19:11] John Groetzinger: I might not agree with those categories, but for me it is an iterative loop always. I want the model to do as much as possible, but I still need to be in the loop because I cannot just let it review itself and keep going because that is a waste of money and I will probably end up getting something worse in the end.
[00:19:27] Simon Maple: Yeah.
[00:19:27] John Groetzinger: So I make myself part of the review process, and I will just kind of look at that rubric and say, "Well, I don't agree with this point, but I really agree with this point, so let's focus on that one right now. Can you suggest some changes to improve that part of the rubric?" And then it would adjust it and reevaluate, and then they would get a better score.
[00:19:44] John Groetzinger: And it is like, "Oh, okay. We can start to see the improvement working live.” And you understand it as well because you are in the loop. So that is kind of how I do it. Has my model improved its own skills? Because at the end of the day, it is the one using it.
[00:19:56] Simon Maple: That is awesome. So you are essentially managing, you are overseeing the updates, and you are recognising and having the time to agree if you want to make that change.
[00:20:04] Simon Maple: But then the model does it, and then of course Tessl, i.e. the model in the background, retests, provides the new feedback, and then you just loop through that until you get to a stage where you are comfortable.
[00:20:16] John Groetzinger: Yeah, exactly.
[00:20:17] Simon Maple: How did that differ from, I guess, the task evals? The task evals for those who did not see or hear the last couple of episodes are much, much more real-use case scenarios.
[00:20:23] Simon Maple: So a number of scenarios that are based on what the skill is likely to be used against. And then the LLM or the agent will run those scenarios in two manners: one with the skill and one without the skill, just the baseline, as we call it.
[00:20:43] Simon Maple: And then you get different percentages; hopefully with the skill, it is a higher percentage success rate based on the evaluation criteria. But CodeGuard did really well in that. You can actually see, I think it was a 1.79 times improvement on the baseline. Were you shocked or surprised at that kind of number?
[00:21:01] John Groetzinger: I definitely was. Yeah, for sure. Because with anything AI, someone is like, "Use my context; it'll improve your skill," and none of that comes with data, other than the "trust me, bro" benchmarks, of course. But I think I was a skeptic on, "Do I really need this? I know security; I can just tell it to go read about security and do this."
[00:21:19] John Groetzinger: Where does this help me? And it is optimised, so it is way faster. And the 1.8 improvement was impressive, especially when you have a baseline to compare it to.
[00:21:28] Simon Maple: Yes.
[00:21:28] John Groetzinger: Then you know, like, "Oh, well, what is it actually better at? I don't believe you." It is like, "Oh, I see that you can go try it yourself, and you see that."
[00:21:34] Simon Maple: And the joy is actually as well, this is a very early stage with the task evals, but the scenarios are there to be reviewed and updated and then sent back and reevaluated. So if the evals are not accurate or are not realistic, they are absolutely changeable. So the skills can be tested against valid scenarios.
[00:21:52] Simon Maple: The baseline was 47%; that is the score against the evaluation criteria that plain Claude Code got on the scenarios. With the skill, agent success was 84%. In some cases, and I am looking at scenario five here, without the context everything is red; everything is Xs on the left-hand side.
[00:22:18] Simon Maple: With the context, four out of the five turned into a hundred percent success; one is still zero percent. What do you do with this data? How would you turn this into an actual valid turnaround for improvements?
[00:22:32] John Groetzinger: Yeah. I mean, it is good. I like the scenario-based approach, especially because not everyone needs the same security. The session fixation one was especially interesting.
[00:22:40] John Groetzinger: I think that is scenario five, because keeping the same session ID after you log in is sometimes a bad practice. How long does that ID live? There are different schools of thought on that. I mean, at Cisco we have a very zero-trust thing, so it is very short-lived.
[00:22:56] John Groetzinger: You need a new one at least once a day. But for maybe more consumer-side stuff, it is annoying to log in so frequently.
[00:23:02] Simon Maple: Mm-hmm.
[00:23:02] John Groetzinger: So maybe you do not care about that. So having a different scenario for enterprise versus a more consumer-based product makes sense.
[00:23:06] Simon Maple: Mm-hmm.
[00:23:07] John Groetzinger: And being able to tell, like, "Is my agent aware of that? Is the base model aware of that?" And if not, "How can I have this skill kind of give it that context?" Because again, perfect security does not exist. There are different levels. Security will slow you down.
[00:23:20] Simon Maple: Yeah.
[00:23:21] John Groetzinger: So you do not want it to slow you down, but you want it to be secure enough for your application.
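To make the session-fixation point concrete: the usual mitigation is to issue a brand-new session ID at login, so any pre-login ID an attacker planted becomes useless. A minimal in-memory sketch, not taken from CodeGuard and with illustrative names throughout:

```python
# Sketch of session-ID rotation at login, the standard defence
# against session fixation: the anonymous (pre-login) ID is
# discarded, so a planted ID cannot be reused after authentication.
import secrets
from typing import Optional

sessions = {}  # session_id -> session data

def new_session(data: Optional[dict] = None) -> str:
    """Create a session under a fresh, unpredictable ID."""
    sid = secrets.token_urlsafe(32)
    sessions[sid] = dict(data or {})
    return sid

def login(old_sid: str, user: str) -> str:
    """Authenticate and rotate: carry the data over, drop the old ID."""
    data = sessions.pop(old_sid, {})  # old ID is now invalid
    data["user"] = user
    return new_session(data)

# Usage: an anonymous session, then login rotates the ID.
anon = new_session({"cart": []})
authed = login(anon, "alice")
assert anon != authed and anon not in sessions
```

How long the post-login ID lives is the policy question John raises (zero-trust enterprise versus consumer convenience); rotating it at the login boundary is the part most schools of thought agree on.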
[00:23:24] Simon Maple: Yeah, absolutely. What would you say were the biggest learnings from the point of view of how useful you found the eval data in certain cases, and what learnings did you then have from CodeGuard itself?
[00:23:40] John Groetzinger: Yeah. I mean, it definitely helped; even some of the scenarios that it helped create made me think, "Oh, there are situations that I didn't think about that this kind of applies to."
[00:23:49] John Groetzinger: So it really helped. I do not want to come up with all the scenarios; I might not be thinking of them. So from that perspective, it is good to see. But really the rubric, how it thought about things and the way that it explained them, is very different from how I would approach it. But it is a better way, right?
[00:24:03] John Groetzinger: It is more scientific, which I like. I feel sometimes I am just exploring, and I am like, "I think this helps," and I do not know. And I really hate that feeling that I am wasting my time and I am not even making improvements. It is the worst to me. So I think it really helps me keep on track and gives me a hill to climb.
[00:24:19] John Groetzinger: That is what I try to do for all my evaluations of my own agents: What are we trying to build? What does good actually look like? I think this gives you an example of what good does look like, or at least an explanation of what good should look like, and then where you are missing the mark.
[00:24:34] John Groetzinger: It makes it easy to just fill in the blank, climb that hill, and get to where you need to go.
[00:24:38] Simon Maple: Mm-hmm.
[00:24:38] John Groetzinger: It just makes it less friction.
[00:24:39] Simon Maple: Yeah.
[00:24:40] John Groetzinger: Way less friction to get there.
[00:24:41] Simon Maple: Yeah. And using something like this with a skill, would you say, for a greenfield project, writing code from scratch, that the skill would maybe be better for those types of scenarios?
[00:24:49] Simon Maple: Would you think brownfield, where you have an existing set of code whereby you actually want to do more code review or slight adjustments to the code? Where would you say a skill like this actually has its most effect?
[00:25:05] John Groetzinger: I mean everywhere.
[00:25:07] Simon Maple: Yeah.
[00:25:07] John Groetzinger: That is the influencer answer, but I mean, that's
[00:25:08] Simon Maple: That is the joy of this type of skill, actually.
[00:25:10] John Groetzinger: Yes, it is. But I would say I do not want it to get in the way too much for simple projects, right? There are some side projects I want to do, and I just want to code, and I do not care about security. If it is for me, I am the only one using it, and it does not matter what the data is. Sure, I am just having fun, but then I do not need this, you know?
[00:25:23] John Groetzinger: But then it is like, "Oh, I made something fun. I want to share this. Now I have to go add this security stuff to this," you know?
[00:25:31] Simon Maple: Yeah.
[00:25:31] John Groetzinger: While I am just having fun, maybe I do not use it. But for anything where I am actually going to share or ship, I think it absolutely has its place always.
[00:25:40] John Groetzinger: Yeah.
[00:25:40] Simon Maple: Amazing. So what advice would you give to our listeners if they wanted to create a skill from scratch today? How would you go about it if you were doing that from scratch today?
[00:25:49] John Groetzinger: Yeah. I mean, I do not know that I have the best approach, and where I am today will probably be different in three months, but it is really about building it with the model together.
[00:25:57] John Groetzinger: I do not write the skill.md myself; obviously I do not have time for that, but also I am asking the model what it thinks, right, in terms of "How do you think we should do this?" And I build it together, right? And not only does it help me understand some of the stuff I did not, but it also helps the model understand what I am doing.
[00:26:14] John Groetzinger: Because oftentimes I will say, "We're going to go do this task," and it makes a lot of assumptions on what is involved in that process, and that is wrong. So oftentimes I do not make the skill first. I will basically tell it I want to make a skill, and then I will say, "You're going to do a dry run with me. We're going to do this together."
[00:26:29] John Groetzinger: And then at the end, you are going to review everything we just did and we are going to make a skill out of that. And then we also have our first evaluation from that as well, right? It is a real-world scenario, with me using it to do what I want it to do.
[00:26:39] Simon Maple: Mm-hmm.
[00:26:40] John Groetzinger: And so then I tell the model to build this, go off with that, and then I adjust and improve from there. That is kind of my approach right now. It is very hands-off, though.
[00:26:48] Simon Maple: I really love that process. So you are essentially having the LLM, while doing the stuff, observe the interactions, I guess. Because if the agent is creating something and you actually say, "Do you know what? That is not how I want it done.
[00:27:02] Simon Maple: "I want it done like this because of blah." Then all of a sudden it has learned: "Ah, okay, my default route was to do this. You did not like that. As a result, I am going to start adding this as part of the skill." Absolutely love that. And also, if it is doing things the right way and you do not comment on it, it does not need to document that, so long as it is deterministic enough to do something similar every single time.
[00:27:19] Simon Maple: I love that. And then evals and things like that. How would you go about running evals? When would you run evals on that kind of a skill? How many iterations would you do before you would be happy with a skill? One eval?
[00:27:42] John Groetzinger: Definitely still figuring that one out. But I am a big proponent of evals, so I say eval always, but I obviously do not have the budget to eval on every line change.
[00:27:50] Simon Maple: Mm-hmm.
[00:27:50] John Groetzinger: So it kind of depends, right? But I think a major refactor of the workflow is where I am changing something in one scenario that could have impacted another, and I am not sure, so I run it.
[00:28:00] John Groetzinger: But again, it is really use-case specific. I do not have a general rule for when I do that, but I would say it is kind of like semantic versioning, right? It would be like the middle version release.
[00:28:13] Simon Maple: Yeah.
[00:28:14] John Groetzinger: You do not need it for every third-number, patch release, but again, it is a cost and performance balance.
[00:28:20] Simon Maple: And how about you mention, obviously, a large organisation such as Cisco? You are going to have different developers that are potentially working in slightly different ways but with similar policies and similar overall team ways of working, but they are potentially using different models and different agents as well.
[00:28:38] Simon Maple: At what stage, if you were building a skill, let's say, for a larger team, would you do it a similar way, or at what stage would you try and bring external thoughts in? Because you do not want it to necessarily be a perfect skill for you but then actually not great for others.
[00:28:53] John Groetzinger: Yeah. I mean, again, it kind of depends. I like to bring the users that are consumers in as early as possible, but that can also be too many opinions too early, too many chefs kind of problem, right? And sometimes it works if someone just hacks it out for a bit, finds what works and what does not, and then kind of takes a first go at it and shares that with others, and then you kind of build from there.
[00:29:13] John Groetzinger: But it needs to be easy to explain how you got to where you are. And if people do not agree, they can try something else. Having a lot of people involved too early might not be good, I would say.
[00:29:21] Simon Maple: Yeah.
[00:29:21] John Groetzinger: It can be a waste of time.
[00:29:23] Simon Maple: So potentially two loops: one where it is you learning with the evals, refining and refining, but there is no point in refining it too much before you get that more external loop of feedback with the wider team, or even extra agents or extra LLMs, to get that extra perspective.
[00:29:40] John Groetzinger: Yeah. And that is one approach. The other approach is just have five people or five models all do it asynchronously, and then we merge the results right together. You know, if I am in a hurry, maybe that, but again, it depends on the skill.
[00:29:52] Simon Maple: Yeah. Awesome. So what is next for CodeGuard then?
[00:29:56] John Groetzinger: Yeah, so we actually donated it to the Coalition for Secure AI, right? So it is kind of like a Linux Foundation but for AI coding security stuff. So yeah, we want to share it with the world. We think we have found a lot of zero days with it.
[00:30:09] John Groetzinger: It is obviously valuable. We just want that to be easy for everyone else to use.
[00:30:13] Simon Maple: And then, of course, it is on the Tessl registry as well. We had a little play with it just to be able to run the task evals and the review evals. So right now it is at cisco/software-security. It may not always live there, but if people wanted to take a look at it and see what we did as a first tile, it is all there for people to have a play with.
[00:30:31] Simon Maple: And of course if we move it about or if the name changes, et cetera, we will let you know.
[00:30:42] John Groetzinger: Yeah, people should give it a try. Pull it down to see if it can find a vulnerability in your repository, right? I mean, why not give it a go?
[00:30:48] Simon Maple: Awesome. John, it has been an absolute pleasure, and let us go and continue enjoying Cisco Live. But thank you very much for the session that, well, I say we gave, but you gave like ninety percent of it.
[00:30:59] John Groetzinger: You kicked the ball into the goal.
[00:31:01] Simon Maple: I tapped it; I was heading it in on the line. That is right. Thank you very much for the session. It's a real pleasure to be here with you, and thanks for the session.
[00:31:09] John Groetzinger: Thanks for having me.
[00:31:10] Simon Maple: Thanks everyone for listening, and tune in next time.
Chapters
In this episode
Your AI coding agent learned from millions of lines of code, including insecure ones. That means by default, it can write vulnerable code too.
So how do you fix that?
John Groetzinger, Principal Engineer at Cisco, built CodeGuard, a security skills layer that teaches coding agents how to write and review code securely. He tested it against real scenarios.
The result:
84% success rate vs 47% baseline. Nearly 2× improvement.
In this episode we get into:
• how CodeGuard works
• why Cisco open sourced it
• the surprisingly simple method that gets agents to fix their own mistakes
Try CodeGuard: cisco/software-security on the Tessl registry.
Connect with us here:
John Groetzinger: https://www.linkedin.com/in/john-groetzinger/
Cisco: https://www.linkedin.com/company/cisco/
Simon Maple: https://www.linkedin.com/in/simonmaple/
Tessl: https://www.linkedin.com/company/tesslio/
How Cisco Built Security Skills for AI Coding Agents
Getting coding agents to write secure code remains one of the harder problems in agentic development. Agents have learned from all of our code, including our worst practices, and they tend to reproduce those patterns unless given specific guidance. In a recent episode of the AI Native Dev podcast recorded at Cisco Live Amsterdam, Simon Maple sat down with John Groetzinger, a principal engineer at Cisco, to explore how the company tackled this challenge with CodeGuard, a set of security skills designed to work across multiple AI coding tools.
The conversation offered practical insights into skill development, context engineering, and the evaluation workflows that help teams know whether their agent guidance is actually working.
Why Security Context Matters for AI Coding Agents
CodeGuard emerged from a straightforward problem: Cisco wanted to accelerate software development with AI coding agents, but doing so securely and to enterprise standards required giving those agents specific security knowledge. As John explained, "You can think of it as security skills for your AI coding agent. We really want our developers to use AI coding because it accelerates software development. But doing that securely and to our standards is very important."
The approach took existing OWASP best practices and internal security standards and distilled them into context that agents could actually consume. The key insight was simplification. Raw security documentation is too dense for effective agent consumption. "If you've ever tried to go look at OWASP's docs, it's very encompassing," John noted. "It's not something you can even feed to an agent that would make sense of that, to make sense of your code."
The resulting skills cover traditional code security concerns like SQL injection and session management, but have also expanded to include MCP security and skill security as the agentic landscape evolves. This reflects a broader pattern: security context needs to grow alongside the capabilities teams give their agents.
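To make the SQL injection case concrete, the pattern such guidance steers agents toward is parameterized queries rather than string-built SQL. A minimal Python sketch (the table, column, and function names are illustrative, not from CodeGuard itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name):
    # Vulnerable: user input is spliced into the SQL text, so a value
    # like "' OR '1'='1" changes the meaning of the query itself.
    return conn.execute(
        f"SELECT role FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(name):
    # Parameterized: the driver passes the value separately from the
    # SQL, so it can never be interpreted as query syntax.
    return conn.execute(
        "SELECT role FROM users WHERE name = ?", (name,)
    ).fetchall()

print(find_user_unsafe("' OR '1'='1"))  # leaks every row
print(find_user_safe("' OR '1'='1"))    # matches nothing
```

The point of distilled skills is that an agent reliably reaches for the second form without needing the full OWASP corpus in context.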
The Distribution Problem Across Multiple Tools
One of the more instructive parts of the conversation centred on distribution. Cisco developers use Cursor, Windsurf, Claude Code, and other tools. Each has different conventions for agent configuration. Getting security context to all developers without adding friction proved surprisingly difficult.
The initial approach involved creating tool-specific packages that would unzip into the right locations for each IDE. John even wrote a script, or rather had Claude write a script, that would create symlinks from a single agents.md file to all the various tool-specific directories. But this still cluttered repositories with dot files and created maintenance overhead whenever the security guidance needed updates.
"Every time it updates, you have to merge it in," John observed. "It really complicates PR processes too. It's just noise that your team shouldn't have to deal with. Because it's not part of the central code base."
The comparison to package managers proves useful here. Just as developers do not commit entire NPM packages to their repositories, security context should be pulled in as needed rather than duplicated across codebases. This points toward a model where context engineering happens at the organisation level, with distribution handled separately from the code itself.
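The symlink workaround John describes, one canonical agents.md linked into each tool's expected location, could look roughly like the sketch below. The per-tool target paths are hypothetical; each tool's real convention differs and changes between versions.

```python
import os
from pathlib import Path

# Canonical guidance file, maintained once at the repository root.
CANONICAL = Path("agents.md")

# Illustrative per-tool locations (hypothetical; check each tool's docs).
TOOL_TARGETS = [
    Path("CLAUDE.md"),
    Path(".cursor/rules/security.md"),
    Path(".windsurf/rules.md"),
]

def link_agent_files():
    """Symlink each tool-specific file back to the canonical agents.md."""
    for target in TOOL_TARGETS:
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.is_symlink() or target.exists():
            target.unlink()  # replace stale copies so updates propagate
        # Relative link, so the repository still works when cloned elsewhere.
        target.symlink_to(os.path.relpath(CANONICAL, target.parent))

if __name__ == "__main__":
    link_agent_files()
```

This keeps one source of truth, but as John notes it still litters the repository with dot files, which is exactly the friction that registry-based distribution removes.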
Evaluating Whether Agent Security Guidance Works
Perhaps the most valuable portion of the conversation addressed evaluation. Before structured evals, John's approach was largely anecdotal: "Did I get a good vibe that this worked well? Did it do the tasks that I wanted to do efficiently?" He would share skills with other engineers, but their feedback was equally anecdotal, complicated by the fact that different tools and models produced different results.
The shift to structured evaluation through task evals provided something more concrete. CodeGuard showed a 1.79x improvement over baseline Claude Code performance across security scenarios. More importantly, the scenario-based approach revealed specific gaps. In one case, baseline performance showed zero percent success across all evaluation criteria, while the skill-equipped agent achieved success on four out of five criteria.
"I was a skeptic on 'Do I really need this?'" John admitted. "I know security; I can just tell it to go read about security and do this. Where does this help me? And it's optimised, so it's way faster. And the 1.8x improvement was impressive, especially when you have a baseline to compare it to."
The evaluation data also surfaced scenarios that John had not considered. Security requirements vary by context. Enterprise applications demand stricter session management than consumer products. Having distinct scenarios for these cases helps teams understand whether their agents are appropriately calibrated for their specific environment.
Building Skills That Agents Actually Use
The conversation surfaced several practical lessons for anyone building agent skills. First, keeping skills lean matters. Bloated skill files become less useful as agents struggle to extract relevant guidance from too much context. John's approach involves putting large concepts in separate files and referencing them from the main skill document, though activation remains inconsistent across models.
Second, building skills collaboratively with the model helps. Rather than writing skill documentation himself, John describes the task, does a dry run with the agent, and then has the agent generate the skill based on that interaction. "At the end, you're going to review everything we just did and we're going to make a skill out of that. And then we also have our first evaluation from that as well."
Third, self-healing workflows can address activation problems. When the agent fails to use relevant context, John asks it directly why it missed the hint, then has it update its own configuration to improve future behaviour. "Who better to ask than the model itself?"
From Internal Tool to Open Standard
CodeGuard has since been donated to the Coalition for Secure AI, reflecting Cisco's view that these security patterns should be broadly available. The skill is also available through public registries for teams that want to experiment with it.
For development organisations wrestling with similar challenges, the CodeGuard journey suggests a path forward: start with existing best practices, simplify them for agent consumption, solve the distribution problem at the organisation level rather than per-repository, and invest in evaluation workflows that provide actual data rather than anecdotal impressions.
The full conversation covers additional ground on model selection, context window management, and the evolving tooling landscape. Worth a listen for teams working to make their AI coding workflows more secure.
