AI Native DevCon 2026 London — all conference sessions as interactive skills
Speaker-label warning. This transcript was captured without per-speaker labels. It is a single-speaker keynote by Guy Podjarny (founder, Tessl), so attribution defaults to him throughout. Exceptions: the closing block starting around "I mean, it is a patch room. We have reached, we are our contacts window is unfortunately full…" is the event MC, not Podjarny. The transcript also contains visible speech-to-text artifacts (e.g. "stilts" for "skills", "Tessl ini" likely for "Tessl in eval(s)", "John Travolta" likely a mis-transcribed name, "gln5 or q26" likely for model names like GPT-5 / Qwen-2.6, "open floor" likely for "open-source"). These are preserved verbatim — do not silently correct them when quoting.
Sharing and learning. And I do love the movie track. I love the talks on it, but I just love the conversation and learnings, and I'd love for you to share some of those with us. I'm here to collaborate as much as talk. And I'm going to talk to you about why skills, I believe, are the new code. So we founded Tessl two plus years ago, and we did it out of the belief that software development is transforming from revolving around code and instructions to revolving around intent. Sorry. Try that again. From revolving around code and implementation to revolving around intent and instructions. And I think that's far less controversial today. Two years ago, we believed that there was a new dev paradigm to be had around it, but we didn't know quite what it would look like. And I think today we're starting to see that new development stack come into view, which I find quite exciting.
And so we're seeing a bit of a layering of the software stack. At the bottom are the new primitives, the models that we're all building on. Clearly they are the new superpower that we've received and that we're trying to build on. On top of those are tools, which I'll talk about some more, which help turn models into agents, giving them arms and legs to be able to affect the world and gather information. There's context that guides those models. And then increasingly there is a harness that constrains the model or packages a lot of things together. I'll talk about these three layers in more depth in a sec. And then harnesses compose into factory lines that combine them all and create these pipelines, and into full factories. So I want to talk a little bit about that stack, and I'll come back to this view later on.
Let's start by talking about tools. So tools are probably the easiest to understand. These are pieces of software. They're utilities that, in part, indeed turn models into agents. They are how the model gets to interact with the world, affect it, gather context, etc. But they're also oftentimes a means to do a thing more cheaply or faster or better than just passing it on to the model. It's not always raw intelligence that you want for every problem. There are a variety of types of tools, but the most common ones are command line interface (CLI) tools, MCP tools that get interacted with directly from the agent, and then APIs or network-related ones. All those are means to be able to perform a task. For instance, grep allows an agent to find information in many, many files efficiently and without needing to load up all those tokens, so more cheaply. Or if you think about ffmpeg, it is a video editor and it can edit videos in a far less error-prone way than a model will. And again, cheaper and faster. And tools are software, so they compose. We know we can pipe one into another. We can write code that composes them together. And so we don't just have generic utilities. We can create custom tools that are fitting our purpose at a given time. So they're very powerful, and we're familiar with them.
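The point about tools composing can be illustrated with a small sketch. It is not taken from the talk: the function names (`grep`, `head`, `find_todos`) and the choice of searching Python files for `TODO` are assumptions made for the example. It shows two generic utilities being combined into a purpose-built tool whose output is a handful of matching lines rather than whole files, which is where the token saving would come from.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator


def grep(paths: Iterable[Path], pattern: str) -> Iterator[tuple[str, int, str]]:
    """Yield (file name, line number, line) for every line matching pattern."""
    regex = re.compile(pattern)
    for path in paths:
        for number, line in enumerate(path.read_text().splitlines(), start=1):
            if regex.search(line):
                yield (path.name, number, line)


def head(items: Iterable, limit: int) -> Iterator:
    """Pass through only the first `limit` items, like piping into `head`."""
    for index, item in enumerate(items):
        if index >= limit:
            break
        yield item


def find_todos(root: Path, limit: int = 5) -> list[str]:
    """A custom tool (hypothetical) composed from the two generic ones above."""
    matches = head(grep(sorted(root.rglob("*.py")), r"TODO"), limit)
    return [f"{name}:{number}: {line.strip()}" for name, number, line in matches]
```

An agent calling `find_todos` would receive at most `limit` short strings, however large the repository is.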
The next layer is new to the AI world, and it is that of context. Now, everything is context in the LLM; it's what goes into that LLM call. However, when you think about context files as we are developing, they end up focusing on information that the agent either doesn't have and can't know, or information that it can figure out, but it's very inefficient or error-prone for it to find out. And we find that it kind of boils down into these three types of things. There's policies and practices, right? This is how we use this framework. This is our security policy. This is our API design. Capture those and inform the agent about them. The second is specs. Specs are definitions of products. So it's information about the product you are building, where you're building something that doesn't exist yet, or about products you are using, especially when those are APIs and things that move quickly. So again, maybe the agent can trial-and-error it, but it's expensive and won't work well. And the third bucket is workflows. This is how you will do incident response. This is how you will review code. Sometimes it's to help the agent succeed, if the agent is just not good enough yet at doing it on its own. And sometimes it's because you want consistency. I don't care that you could handle the incident on your own. I want it done in a repeatable fashion so everybody can relate to it.
The other dimension around context is how you load it. And so over there you've got, it's a bit more technical, but you've got rules that always get shoved into the context window: your CLAUDE.md or your AGENTS.md file gets put there. Clearly you need to be mindful of size. You've got skills that are loaded on demand, either when called or through some hints to the agent. And then you have passive context, just docs like ARCHITECTURE.md or other information that's just available there in the repo. Hopefully the agent can find it with agentic search and the likes. Now I'm going to refer to context as skills mostly for the rest of the presentation, just for simplicity. But there are some subtleties around it. And importantly, skills compose as well. We don't think about it as much. It's not as obvious as with tools, but they do. First of all, skills call tools, and that's I think the common case; in fact many skills exist to explain how to use tools. But skills can call other skills. You can have, for instance, an incident response workflow that calls one skill to collect the logs using Datadog or other tools, another one to do the root cause analysis, and another one to log things into Linear or Jira. And so you can compose those. And that's amazing, because composability is what makes software powerful. We can build on one another. Imagine every time you want to write a web page, you have to write a kernel. That doesn't work well. We want to be able to build on it.
So we have tools and we have skills. What is slightly more nascent in terms of thinking about them as developers is this world of harnesses. Now, a harness is deterministic software that wraps a probabilistic model. It harnesses the model. For instance, Claude Code is a harness, and it controls a bunch of things about how to interact with the model. It knows how to load the rules from CLAUDE.md and how to load the skills. It has a config file that defines things that it will apply in the container to constrain the model, to limit and say what it has access to, to define which tools are available. And you can customize that. Increasingly these are kind of frameworks that are extensible, so you can create plugins into these agents that have your tools or your skills. And they also allow you to create hooks, which again are these deterministic pieces of software. Every time a prompt comes along, run this hook, and it might edit that prompt, or it might disallow it and block it. Every time you call a tool, do that. And so these are deterministic pieces of functionality that again are potentially cheaper, potentially faster, potentially better and definitely more deterministic than just calling the models. So this interaction is important.
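A minimal sketch of the idea of deterministic hooks wrapping a probabilistic model follows. It is an illustration only and does not reproduce the hook API of Claude Code or any other real harness: the `Harness` class, the hook signature (return an edited prompt, or `None` to block) and the two example hooks are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Assumed hook shape: take the prompt, return an edited prompt or None to block.
PromptHook = Callable[[str], Optional[str]]


@dataclass
class Harness:
    """Deterministic wrapper around a model call (hypothetical design)."""
    model: Callable[[str], str]
    allowed_tools: set[str] = field(default_factory=set)
    prompt_hooks: list[PromptHook] = field(default_factory=list)

    def submit(self, prompt: str) -> str:
        # Every hook runs on every prompt, in order, before the model sees it.
        for hook in self.prompt_hooks:
            result = hook(prompt)
            if result is None:
                return "blocked by hook"
            prompt = result
        return self.model(prompt)

    def can_use(self, tool: str) -> bool:
        # The harness, not the model, decides which tools are available.
        return tool in self.allowed_tools


def block_secrets(prompt: str) -> Optional[str]:
    """Example hook: refuse prompts that appear to contain a credential."""
    return None if "API_KEY=" in prompt else prompt


def add_policy(prompt: str) -> Optional[str]:
    """Example hook: always append a rule, so the model does not get to choose."""
    return prompt + "\nFollow the team security policy."


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"model saw: {prompt}"
```

The hooks are ordinary functions, so they behave the same way on every call, which is the property the talk contrasts with leaving the decision to the model.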
And of course there are many types of different harnesses. The popular ones are over here, and interestingly it's become much easier to build your own harness, and people do that as well. So harnesses are important. They help the model and they constrain the model. And harnesses are becoming a core component of how to make agentic development scale. And you see this from a bunch of thought leaders. Just a couple of examples here: Intercom talked about how they, for instance, block creation unless the relevant skill was loaded. So they literally block the action. The agent knew that the skill was there, so why would we code it? Well, this commandeers control from the model. It says the model doesn't get to choose. This will always be loaded. So it gives you some of those controls. And OpenAI, with Ryan Lopopolo, who is actually going to be speaking later on, a great talk that you should definitely catch. This is an earlier post from him. He talked about a much more custom harness that they've built that does a variety of things, including the common practice of not allowing a commit unless test coverage has been included. So again, all of those are examples of deterministic software that wraps the probabilistic model and allows us to create controls so we can make things repeatable. And unsurprisingly, harnesses compose as well. So you can take, say, a product harness that is very good at elaborating on the feature request, and a coding harness that knows how to build a thing, and a security harness that knows how to secure it, and maybe a devops harness that knows how to deploy it, and package them all into this system harness that I call a factory line, which looks like a pipeline. You might get a type of input, in this example a feature request, and you might have a shipped feature on the other side. And so harnesses compose, which again is great because this is the way that we can scale.
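The composition of harnesses into a factory line, with a deterministic gate between stages, might be sketched as below. Everything here is hypothetical: the stage functions are stand-ins that return canned data rather than calling any agent, and the 80% coverage threshold is an arbitrary number chosen for the example.

```python
from typing import Callable

Work = dict
Stage = Callable[[Work], Work]


class GateError(Exception):
    """Raised when a deterministic gate refuses to pass work along."""


def gate(check: Callable[[Work], bool], message: str) -> Stage:
    """Build a stage that blocks the line unless `check` passes."""
    def stage(work: Work) -> Work:
        if not check(work):
            raise GateError(message)
        return work
    return stage


def factory_line(*stages: Stage) -> Stage:
    """Compose stages into one callable: feature request in, result out."""
    def run(work: Work) -> Work:
        for stage in stages:
            work = stage(work)
        return work
    return run


# Stand-in harnesses; real ones would each wrap a model with tools and context.
def product_harness(work: Work) -> Work:
    return {**work, "spec": f"spec for {work['request']}"}


def coding_harness(work: Work) -> Work:
    return {**work, "code": "...", "coverage": 0.85}


def security_harness(work: Work) -> Work:
    return {**work, "security_review": "passed"}


line = factory_line(
    product_harness,
    coding_harness,
    gate(lambda w: w["coverage"] >= 0.8, "coverage below 80%"),
    security_harness,
)
```

Because `factory_line` returns something with the same shape as a single stage, a whole line can itself be used as a stage in a larger one.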
So coming back to the stack, this starts looking like a software stack. Right? It starts looking like these are layers that we are developing. Each of these layers composes the layer below it: tools call the models, context guides the tools that call the models and feeds the model directly, and harnesses package up a bunch of these things. Now, all of these terms are work in progress; harness is probably the most abused at the moment. And I'm sure they will change over time. Still, it's something we can start reasoning about. And I think it's interesting to see that amidst those layers, the tools, the harnesses, the factory lines, the factories are all actually software. They're software that wraps the model. But really the two new compute entities here are the models themselves and the context that is native to the models. It literally goes into the matrices and modifies what is being done. And so my view on this is that really a lot of this is frameworks, but context is the new code. It is the place in which you program the model. And you do this sometimes locally and sometimes reusably.
So when we think about context, again, everything is context. But when we talk about the context you provide as humans, as developers, it ranges from things that are more just-in-time, like prompts, to guide the agent literally in an interactive session; to docs, some information we store alongside our code that is probably project specific; to rules, which are sometimes project specific and sometimes get reused across projects; and into skills. And skills are really designed to be reusable. They're like libraries. They are. If you're only doing something once, you're not going to create a skill for it. You're just going to prompt for it. If you think you're going to do it multiple times, you turn it into stilts. It's designed to be reusable. So reusability is great. You know, it's a means of scaling.
So we're seeing skills explode, and there are many stats on it. On the public ecosystem, we actually just completed an analysis with GitHub on it. There are about 2 million skills on GitHub. This is up from about zero at the beginning of the year. So it's many, many, many, many skills across many repositories. Some of these are open floor and some are not. Plenty of it is low quality, of course. But it's a lot. And on the other side, I don't have numbers for this, but we think there's an explosion of skills within the enterprise. And again going on kind of a similar ramp, which is amazing but also daunting. And to talk about how it feels, I want to give this sort of real world testimonial around this. So I think judge was the head of the time. Over here. And I think explained it very succinctly. This is how most enterprises feel. They've got skills. They're multiplying and they're losing control. They don't know how to handle that.
And so moving on, you know, let's talk a little bit about some of these challenges that we see with skills and with context. The first challenge, unfortunately, is security. So skills are actually executed by the agent and they may or may not be nefarious. And so we're seeing a variety of risky skills come along. We're seeing malicious skills that are literally built to cause harm. In the open floor ecosystem, we've seen over 30% of skills to be malicious. So there is a way to hack in. We're seeing what we call negligent skills. So skills that really urge the user to do something but do not have any safety instructions in them. They do not set any boundaries. So: make sure you update the table correctly, and there should be a line there, but do not drop the table. You know, it's famously happened in places. And so some of these safety instructions are lacking. And there could be vulnerable skills whose guidance just leaves you exposed. The most common example of that is to use API tokens or other secrets as part of something that is visible in the logs and might get out if the agent is manipulated.
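As a rough illustration of what a static check for risky skill content could look for, here is a small pattern-based scanner. It is a sketch under stated assumptions, not how any real skill scanner works: the two rules and their regular expressions are invented for the example, and a heuristic like this will produce both false positives and false negatives.

```python
import re

# Hypothetical rules; real scanners use far richer analysis than two regexes.
RULES = {
    "hard-coded secret": re.compile(
        r"(?i)\b(api[_-]?key|token|secret)\s*[:=]\s*\S{8,}"
    ),
    "pipes download into shell": re.compile(r"curl[^\n|]*\|\s*(ba)?sh"),
}


def scan_skill(text: str) -> list[tuple[int, str]]:
    """Return (line number, rule label) for each suspicious line in a skill."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, regex in RULES.items():
            if regex.search(line):
                findings.append((number, label))
    return findings
```

A skill that tells the agent to put a credential inline, or to run a downloaded script unreviewed, would be flagged before anyone installs it.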
So security is a concern. And if security is a concern, governance becomes a concern. Because if some skills are risky, then you have to ask: which skills am I even using? Do I know, does the risk team know, who's using which skill? And if one of them was malicious or was compromised, would I know that that happened? How do I keep the track record? And so especially for organizations that are mid-sized and up, but really truly for everyone, some element of governance, just knowing what's going on, some supply chain hygiene, comes along. We have to do that. So security and governance is kind of one bucket.
The next bucket of concern is around reuse and collaboration. And I love a story that a developer at a large unicorn that has worked with us told me, which is just very exemplary of how things happen. So, you know, everybody loved skills and everybody was building their own skill. A lot of people did the same skill again and again, whether it's related to the app or a code review or test creation or something; many of us have the same ideas. And so they said, well, that's wasteful. Let's create a shared repository. So they created a shared repository and everybody uploaded their skills in there. And very quickly it became a mess. What happened was, one, all these duplicates just went into the repo. So any new developer coming along says, oh, I don't want to write my own code review skill. I'll use one of those. There are seven over here. Which one should I choose? There's no indication of which one is good. They've seen a works-on-my-machine problem, where one developer using one agent uploads a skill and it turns out that skill doesn't work very well on another agent. Or a collaboration problem, with PRs that don't go anywhere because one person wrote a skill and another helpfully came and proposed a modification. How does the skill owner know whether this change is making it better or worse? There are no tools around it. And so nobody trusted anything in the repo, and eventually everybody went back to writing their own. So not very effective. And we'll come back to this, but I think the big evident gap here is that there was no quality control, no indication of what is of good quality. There was no management of the dependencies.
And so: security and governance, reuse and collaboration. And then the third bucket is really lifecycle. And I think we've come to appreciate and know that software rots. If you write a piece of excellent software today and you do not maintain it over time, it will stop working and eventually it will be harmful in your system. I think we've had enough mileage to know that. The same is true for skills. You can create a skill. It works very well today. And you don't touch it. And then over time, the models change. The APIs around it change. The infrastructure it is deployed on will change. Many things will change. I wrote three months here, but in this world it might be two weeks and the skill is already outdated, and it rots and becomes less effective and eventually harmful, actually guiding the agent to do the wrong thing. So we have to think about how we maintain skills. Nobody likes maintenance, but we have to accept that it is like a piece of software. And there's a bit of a carrot and stick going on over here, because maintenance is the stick: if you do not maintain, it's going to rot. It's going to cause damage. But there is also a carrot. If you do maintain, you can actually turn maintenance into optimization. You can observe agent logs and pull requests and production logs and other information about what has actually happened, and automatically turn that into improvements, and track that along into new skills, modified skills, removed skills. Always good to eliminate code, right? Just try to observe that. And so that's the carrot and the stick, an opportunity and a risk. And both of them boil down to high quality, autonomous, agent-powered maintenance.
So these are the three common challenge areas that people are feeling. And we're hearing it. It's actually amazingly consistent as we talk to different organizations about how they're adopting skills in their dev. And the good news is they sound awfully familiar. All of those are problems that we have seen and addressed, and we've built amazing tools to deal with them, in code, in software development. And so really what we need to do is change our view around skills and not just think of them as words that move around. They're not like your internal docs, right? They're not some variation of Confluence docs. They are like software.
And I want to talk about the five core quality and software development tools that we need: static analysis, dynamic tests, security tools, dependency management and observability. And talk about how we can apply those to skills. Because I think those are the solutions to many of these problems.
So let's start from static analysis. Static analysis is the exercise of inspecting code without actually running it. Right? And we've seen it evolve in depth. We have simple linters that simply, you know, check if your styling follows a certain pattern, or some type checks on typed languages. You can go into deep security analysis, an analyzer that inspects data flows within your application, your code, and finds some deep problems. And today you have agentic review that can even look into the substance of what the application is doing. In all of those cases, it doesn't run the code. It just looks at it. And so it's an easy way to scale. And typically it's one measuring stick for all of your applications. So you will review the same way, regardless of the different types of changes. So this is directly applicable to skills. We can use things like linting. Tessl has one version of that, which is lint. It just looks to see: does the skill have the right fields? Does it have the right format? You can look at quality measures or security measures to be able to repeatedly find best practice flaws. Tessl uses review to match a skill against the Anthropic best practices. Does it have proper progressive disclosure? Is it concise enough? Is the activation declared well enough? And then of course we have the security analysis. We'll talk about that. And then you can get into custom agent review that actually looks at the substance of it, which is pretty much the same as the other ones. Again, at Tessl, we do that through the review. So static analysis is hopefully straightforward, and there's really no reason why we wouldn't do that. It's just about the discipline of stopping and defining what good looks like, what is correct functionality.
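A skill linter of the simplest kind, checking fields and format without running anything, could look roughly like this. It is a sketch and not Tessl's lint or review: it assumes a skill file that starts with a `---` delimited front matter block of `key: value` lines, treats `name` and `description` as the required fields, and uses an assumed 500-line body limit as a stand-in for a conciseness rule.

```python
def parse_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (front matter fields, body).

    Assumes simple `key: value` lines between two `---` delimiters;
    returns no fields if the block is missing or unterminated.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = lines[1:].index("---") + 1  # index of the closing delimiter
    except ValueError:
        return {}, text
    fields = {}
    for line in lines[1:end]:
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return fields, "\n".join(lines[end + 1:])


def lint_skill(text: str, max_body_lines: int = 500) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    fields, body = parse_front_matter(text)
    problems = [
        f"missing field: {required}"
        for required in ("name", "description")
        if not fields.get(required)
    ]
    if len(body.splitlines()) > max_body_lines:
        problems.append("body too long; consider progressive disclosure")
    return problems
```

Checks like these are cheap enough to run on every change, which is what makes them a single measuring stick across all skills.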
The second world is tests. We all love tests. And, you know, I think again with a little bit of mileage, there are probably enough gray beards in the room that we will have experienced how it feels amazing to just ship something and not worry about tests. And then it feels not so amazing the day after, as we try to modify it. So we've kind of come to learn, through visceral learnings, why we need tests to be able to scale. And you don't really test for everything, but you have to have some tests. Otherwise you can't build quality that can evolve. And tests are just a very big world, but all the tests run some piece of your software, and they scale from unit tests that are cheap and easy to run, so you run them in CI and build them out for a unit of software, to integration tests that add a little bit of the dependency world: what happens for this piece of software not in isolation but with a bunch of others. It's a bit more expensive to run. It's a bit harder to maintain, right? But it's useful. All the way to end-to-end tests that are expensive and need a production-like system to run in, but they're very valuable. So we, you know, we invest in them proportionately.
In skills, the equivalent is evals. You define an environment for some skill to run in, you actually, practically run an agent through a task in that environment, and then you judge the result. And you can do all of those in a variety of ways, just like tests, in that you actually invoke it. And you can do it in varying levels of depth. And I think what we're seeing is we're starting at the bottom layer. We're saying the skill is the unit of software. So like unit tests, you're testing the skill in isolation. It's like testing a library. Define what the environment looks like, build from there, run that evaluation. You can do that only for the agent that you care about, or even with cheap open models, just to know that you're not regressing, just to know that however it is that it worked before is still working now. You can go into project evals that test a project, say a repo with multiple skills. Right, I have 20 skills installed. This is a better simulation of production, a closer-to-reality world. And maybe I can take an old pull request and extract the scenario out of that, and load that with my 20 skills and see again what the judgment is: does it get a high score? Right? What is the correct output of it? And so those are great to optimize project context, help you delete things, and help you know whether it's useful or harmful to add a skill to your project. And then you can have more comprehensive tests. You can actually use any of these evals to be able to run model evaluations. But if you really want comprehensive tests to be able to say, well, in my organization, in my development, can I use gln5 or q26, which is much cheaper than open sort of. But I need to move towards for it. Or maybe I continuously failed in this task and this new fancy model came along. Does it help me get past that? And you create something that is more simulated, more extensive, like the end-to-end test.
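The environment, run, judge loop described above can be sketched as a unit-style eval for a single skill. The names here (`Scenario`, `evaluate`, `skill_uplift`) and the idea of scoring the same scenarios with and without the skill are assumptions for illustration, and the agent is a stub rather than a real model call.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Assumed agent shape: (skill text, task) -> output.
Agent = Callable[[str, str], str]


@dataclass
class Scenario:
    task: str
    judge: Callable[[str], float]  # scores an output between 0.0 and 1.0


def evaluate(agent: Agent, skill: str, scenarios: list[Scenario]) -> float:
    """Run the agent on every scenario with the skill loaded; average the scores."""
    return mean(s.judge(agent(skill, s.task)) for s in scenarios)


def skill_uplift(agent: Agent, skill: str, scenarios: list[Scenario]) -> float:
    """Score with the skill minus score without it; negative means it hurts."""
    return evaluate(agent, skill, scenarios) - evaluate(agent, "", scenarios)


def stub_agent(skill: str, task: str) -> str:
    """Stand-in agent whose behaviour depends only on the skill text."""
    if "parameterized" in skill:
        return "db.execute(query, params)  # parameterized"
    return "db.execute('... ' + user_input)"


scenarios = [
    Scenario(
        task="Add a user lookup endpoint",
        judge=lambda output: 1.0 if "parameterized" in output else 0.0,
    )
]
```

Running the same scenarios again after a model or skill change gives the regression signal the talk compares to unit tests.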
And these are not perfect analogies, but we can think about evals as tests and we can write them incrementally. But we have to get into the habit of writing them.
And it's worth noting that, just like tests, scenario quality matters. You can write tests that make you feel good because you've got the coverage, and they do very little for quality. Like, I got the 100%, but, you know, it doesn't find any bugs. And you can write a few tests that are super, super valuable, so you know that you can trust that framework. So you can change faster with more confidence. The same is true for scenarios. You can write some scenarios that are amazing quality barometers, and you can write some that are just a waste of time and tokens. Right? So you have to think about that. Also, we're big believers in eval as a Tessl ini. We've heard a scramble about this for quite a while. And, you know, we have kind of good capabilities around it. We help generate the tests, the scenarios, and we help you run those. And then one sort of side benefit, which we haven't really done well in software, is that once you do static and dynamic analysis, you can actually aggregate a bit of this view and create a quality score, so that when you're consuming a skill, like in that shared repo, you can actually figure out which one is of high quality or not. And quality scores are never going to be perfect. But they are useful. They're a means of still knowing, directionally, that this one is better than the other.
So the third bucket is security testing, and it's actually very directly applicable from what we do in software. We can do static analysis. We should do static analysis. I was very happy to see all the appsec players introduce skill scanning, Snyk being one of them, but many of the others did as well. And I think that's exciting. Having founded Snyk, I'm passionate about security and embedding it into software development. I think this is directly applicable. Tessl specifically has integrated Snyk into our registry, and we will support others. So try those out. There's a more nascent world of dynamic tests. So this is more like red teaming and testing of a skill. It's tricky because these skills are very varied in functionality. So it's hard to identify what is correct, anticipated behavior and what is not. It's an evolving world. And then you need supply chain security. So we need to know, you know, we've all now experienced like 10 lm and a variety of other supply chain attacks that are coming fast and furious. The same is happening, or will happen, for skills. So you have to be able to track what you're using.
Which gets us into the fun world of dependencies. And one of the famous quotes from John Travolta is that dependencies are fun. Well, even he didn't say that. That is not the case. In fact, no one ever said dependency management is fun. Dependency management is a pain and dependencies are a pain. Right. They might break your application when you upgrade. They conflict with one another. If they're stale, you might not be able to use some new system. If they're brand new, they might be compromised. Joy. And there's a reason we invested in package managers. Those exist for a reason. They're widely used for a reason. Because these are real problems. So again, if you think about skills as new code, it becomes fairly evident that we need dependency management for skills. Tessl offers one of those. There are a few nascent open source ones. If you have proper dependency management, you can use the registry to be able to discover skills. You can have version management, so when a problem occurs, you know which version it occurred in, and you can roll out the fix with a new version and see if it fixed it. When you install, like say with Tessl install, you install the skill, or we package them up in plugins. Lo and behold, it remembers what you installed. So you can run a Tessl update and then it will update those versions. And that gives you some supply chain visibility. And then once you have all of those, you can start introducing things like quality and security controls to be able to set a bar for what gets installed or what gets published into your workspace. And then you can introduce some cross-agent compatibility, so, you know, just like npm doesn't care if you're on a Windows or a Linux machine. It just abstracts that for you. Can we do the same for agents?
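The combination of pinned versions, updates and a quality and security bar can be sketched in a few lines. This does not describe Tessl's registry or CLI: the `Release` record, the in-memory lock and registry, and the default threshold of 70 are all hypothetical, chosen only to show the mechanism.

```python
from dataclasses import dataclass

Version = tuple[int, int, int]


@dataclass(frozen=True)
class Release:
    """One published version of a skill in a (hypothetical) registry."""
    name: str
    version: Version
    quality: int           # assumed 0 to 100 quality score
    security_passed: bool  # assumed result of a security scan


def plan_updates(
    lock: dict[str, Version],
    registry: list[Release],
    min_quality: int = 70,
) -> dict[str, Version]:
    """Pick the newest release of each installed skill that clears the bar.

    Skills absent from the lock are ignored: an update never installs
    something new, it only moves pinned versions forward.
    """
    plan: dict[str, Version] = {}
    for release in registry:
        if release.name not in lock:
            continue
        if not release.security_passed or release.quality < min_quality:
            continue
        best_so_far = plan.get(release.name, lock[release.name])
        if release.version > best_so_far:
            plan[release.name] = release.version
    return plan
```

Because the lock records exactly which version is installed, a regression can be tied to a specific release and a fix verified against the next one.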
And then the last bit I'm going to talk about is observability. At the end of the day, we can build really quality software, but there's only so far you can get in the lab. So you can build your software, you can test it, you can secure it, you can optimize it. And then at some point you need to step out. And that comes in the form of observability. For agents, runtime observability is monitoring the coding agents. And when you monitor those, you can use that to improve the quality of what you have, so you can extract real world eval scenarios and problems and use those to update and optimize your skills. And you can mine for gaps. So you can look at agent success as a whole, and extract ideas for new skills and things that we should remove.
And in general, as you think about this whole process, you know, combining all these tools together, you increasingly should see a development life cycle. But it should be a context development life cycle, not a software development life cycle. We should, as humans, live in the context development life cycle and leave the SDLC to the agents, or enable the agents to do the same, in which we generate context, evaluate it, test it, optimize it as we need to, distribute it via good package management, secure it, consume it, observe what has happened. And do that again and again.
So in summary, we have a new software stack for agentic development. It builds on the models that are like our operating systems. We need to make sure what we build is compatible with them. We have tools as utilities. We have context as the new code. We have harnesses that are like frameworks. Not every developer will build their own harness. But probably more organizations would substantially customize harnesses, if not build their own. And those compose into factory lines that feel very much like pipelines. You have a repeatable type of input coming in, successful output coming out. And into whole dev processes, in factories. And within those, skills are the new code. And we should treat them that way and give them the right tools for it. Again, hopefully Tessl will help you do a bunch of those. I'd encourage you to try them out. But the mental model is even more important than that.
And then one last thing before I wrap. At Tessl, you know, we're excited about this. We've been giving builders tools to be able to do agentic development and develop skills. And we've been giving skills to do that. And we think we should give you a harness as well. And so we have a nascent version of the Tessl agent, which is a vertical agent to help you develop context, harnesses, factory lines, factories and more. You can use that locally as you build those things. You can use it in the pipeline as you continue to optimize them, or even from the Tessl control center. So it's really new. If you want early access and you want to be part of this and spend a bit of time trying it out and giving us feedback, stop by booth forex or, if you're on the stream here, you can check out Tessl.cogents to sign up for early access.
So that's it for me. To close off, I just want to say a huge thank you as we get into this conference. We believe indeed that there's a new software development paradigm that is coming into view. And we're excited that we're building it together. We think building a new paradigm is not a vendor's or any company's job. It is a community activity. So as Simon pointed out, I encourage you to engage here, learn, but also share your learnings. That's a big part of what we're doing here. So looking forward to two amazing days. I just want to say a huge thank you. I want to get a selfie with you in this packed room. Thank you.
I mean, it is a patch room. We have reached, we are our contacts window is unfortunately full. We will begin compacting. We're going to be adding some chairs to the back for the next keynote. So don't worry about that. Guy Podjarny mentioned that we've got a live stream. We're not just the 600 people that we are 650 here right now. We also have another 2,000 people watching this track. So it's this track on this day that is watching live on the live stream. So everyone say a big hello after three to people on the live stream. One, two, three. No, I haven't forgotten you. Tell us where you're from. Put it into the comments and we'd love to hear from you. So next up we actually have a 15 minute break so that we can allow people to change rooms. In this room, we have Donaldson here a second ago who is the CTO of Netlify talking about build for humans now agents are here. We also have daily herb from quantum back talking in the latent space. We have Joseph cancel us talking about home security reinventing. So we'll see you in 15 minutes. Thanks very much.
.tessl-plugin
talk-azriel-executable-specs
talk-baker-sadogursky-context-engineering-skills
talk-batey-building-product-teams-age-of-ai
talk-birgitta-closing-keynote
talk-cormack-tests-lie-observability-ai
talk-debois-agent-enablement
talk-douglas-training-ai-on-your-own-code
talk-dubnov-merge-rate-ai-adoption
talk-farley-vibe-coding-best-we-can-do
talk-firtman-web-mcp-agentic-web
talk-foxwell-reinvention-dev-team
talk-groetzinger-skills-everywhere
talk-jones-odevo-ai-native-transformation
talk-jourdan-pipelines-to-prompts
talk-katsioloudes-code-security-ai
talk-kerr-bipolar-disorder-dysregulation-ai
talk-kushwaha-benchmarking-agent-era
talk-lamis-context-engineering-dreaming
talk-lawson-agent-experience
talk-lopopolo-harness-engineering
talk-lubken-embedding-pi-coding-agent
talk-maleix-collective-intelligence
talk-marsden-agent-desktops
talk-martinelli-spec-driven-development
talk-moss-skills-team-workflow
talk-obstbaum-willoughby-vibes-to-metrics
talk-overweg-one-brain-no-filtering
talk-podjarny-skills-are-the-new-code
talk-roberts-ai-native-brownfield
talk-roberts-brownfield-ai-native
talk-ruiz-agents-on-canvas-tldraw
talk-scheire-artificial-intelligence
talk-selajev-docker-sandboxes-agents
talk-sloan-harness-engineering-beyond-code
talk-smith-connecting-context-future-transports
talk-stack-humans-architect-ai-writes-code
talk-syme-agentic-repository-automation
talk-thomas-ai-native-engineering
talk-trieloff-browser-agents
talk-walter-runtime-intelligence-agents
talk-wotherspoon-humans-vs-slop