
ainativedev/latest-aidevcon-speakers-london-2026

AI Native DevCon 2026 London — all conference sessions as interactive skills


Outline — Skills Everywhere

Speaker

John Groetzinger — Principal Engineer, Cisco Customer Experience Engineering. About 14 years at Cisco, having joined via the Sourcefire acquisition (intrusion prevention software, still a core technology in Cisco firewalls today). Spent about 12 years in Cisco TAC (Technical Assistance Center) supporting financial institutions, banks, hospitals, and critical infrastructure under high-pressure conditions. Full-stack developer for 12–14 years; for roughly the last year he has focused on production agentic systems. AWS Certified Solutions Architect; AI-native development practitioner since 2023. Recently shipped a customer-facing AI-native multi-agent platform that reached GA "a couple weeks ago" with "over a thousand users on boarded in just a few weeks".

Abstract (as provided by user)

Agent architectures and dev tools change fast — what's current today is legacy in six months. But as long as you're using general-purpose frontier models, you will always need your context. Your runbooks, your engineering standards, your institutional knowledge — that's the layer that compounds. That's where the investment should go.

Here's the problem: most teams manage that context separately for humans and agents, if they manage it at all. The wiki says one thing, the agent's context says another, and someone files a bug about "hallucinations" when the real issue is stale documentation. Worse, if your engineers can't easily read and understand what the agent knows — without digging through a massive prompt — they'll never trust the system.

In Cisco Customer Experience, we've been building context pipelines that serve both audiences from a single source of truth. The same knowledge base article a support engineer reads is the skill an autonomous agent uses to pre-triage customer cases before a human even opens the message. A complex evaluation framework reached a global dev team in days through a skill install — not onboarding meetings or training sessions. When your humans trust the context and your agents use it correctly, the whole system flows.

This talk walks through the context pipelines we've built — packaging, versioning, and evaluating engineering knowledge as installable skills — and the cultural shift to make "is this a skill?" a reflex. Invest in your context — everything else is going to change anyway.

Thesis (synthesised — distinct from the abstract)

Frontier models are already good enough for business value at the medium tier — the binding constraint is context engineering, not model intelligence. Skills are the right packaging because they are honored across harnesses (Claude Code, Codex/dev, Copilot CLI) and across models. The two patterns that make skills work at enterprise scale are (a) pipelining human-curated knowledge into agent skills with LLM-driven updates gated by change severity and evals, and (b) treating evals as unit tests for agents so that a shared skill can be safely modified like a shared library. Humans should spend their time at the extremes — defining what "good" looks like via evals, and reviewing major changes — never on diffing markdown in the middle.

Section TOC

| # | Section | Summary | Transcript lines |
|---|---------|---------|------------------|
| 1 | MC introduction | MC sets up the talk and welcomes John on stage. | 1–8 |
| 2 | The real bottleneck is context, not models | You don't need a smarter model; you need smarter context. Medium-tier models suffice once skills + evals are in place. | 9–32 |
| 3 | Why skills changed his perspective | Skills are honored across harnesses and models, making context portable. Evals unlock using cheap models deterministically. | 33–54 |
| 4 | Personal background | 14 years at Cisco via Sourcefire, 12 years in TAC, last year on production agentic systems. GA platform with 1000+ users. | 55–72 |
| 5 | What a skill is | Minimum is a skill.md; can include rules (more markdown), scripts, scaffolding. Evals are the key addition. Worked example: a repo-standards skill that forces uv over pip. | 73–98 |
| 6 | Story 1 — Pipelining the Cisco TAC knowledge base into skills | Strong KB culture; MCP-search-the-HTML didn't work. Two engineers built a pipeline that converts curated articles into skills with evals. Change-severity gating routes only major changes to humans. | 99–158 |
| 7 | "Skills are not for you — stop writing them by hand" | The skill is for the agent. Humans should spend their precious time on evals and KB content, not on the markdown in the middle. Let the small model build its own skill. | 159–186 |
| 8 | Story 2 — Shipping an eval framework to 8 distributed teams as a skill | Built a custom evaluation framework (JSONL datasets, environment-aware scripts, observability-driven assertions) and packaged it as an installable skill instead of running onboarding meetings. | 187–262 |
| 9 | The README-to-Confluence sync | Single source of truth in a git repo with paired skill.md (for agents) and README.md (for humans). A deterministic markdown→HTML script syncs to Confluence for managers who don't use git. | 263–294 |
| 10 | The cultural shift: "is this a skill?" | Default reflex when anyone asks for documentation. Avoid 15 engineers building 15 skills for the same thing. Evals + shared skills work like a shared library. | 295–320 |
| 11 | What to do tomorrow | Start small with one concept you keep re-explaining. Add evals. Semantically version your skills (0.0.x → 0.x → 1.0). | 321–354 |
| 12 | Takeaways | Architectures change; context is the durable investment. Maintain in one place, distribute to many systems. | 355–366 |
| 13 | Q&A | Skill.md vs README similarity; orchestrating across many skills (skill explosion); repo layout; review/PR culture; pointer to deeper eval discussion offline. | 367–end |

Terminology glossary (definitions Groetzinger actually gave)

  • Skill — minimum: "you have that skill.md that just, you know, it's like your starting point for your agent". Can ship with "rules, which is essentially just more markdown files. You can put scripts in there, any scaffolding you want."
  • Eval / evaluation — "the unit test for your agents." Defines "the bare minimum that you expect this skill to always help with and always do it well."
  • Agentic fan out — the new pattern where "I asked it to do, you know, this massive prompt, and it spawns 15 sub agents, and that cost is just really increasing." A reason the cost of using top-tier models for everything has become a real enterprise concern.
  • Medium-tier model — his current baseline: "I'm GPT medium reasoning… I challenge my engineers to really try to use that medium tier as their baseline."
  • Router (not orchestrator) — Cisco's term for the central orchestrator in their multi-agent platform: "We don't call an orchestrator because we're Cisco. It's a router. It's a semantic router."
  • Change-severity gating — pipeline classifies KB article changes as "minor change, moderate change, major" and only routes major changes to a human reviewer; minor changes auto-release if evals pass.
  • Semantic versioning of context — 0.0.x = early sharing for feedback; 0.x = stabilising; 1.0 = "battle tested and evaluated", a promise that "it'll work… on the first try with almost no friction".
  • "Is this a skill?" reflex — the cultural default reaction when someone asks where documentation lives: "Go ask, do we have a skill that explains this so I can install it in my agent? That needs to become the default reaction."
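
The change-severity gating entry above can be read as a small routing function. The following is a minimal sketch, not the pipeline Cisco built: every name in it (`Severity`, `Route`, `SkillUpdate`, `route_update`) is hypothetical. Two behaviours are assumptions, because the talk does not specify them: moderate changes are treated like minor ones (inferred from "only routes major changes to a human reviewer"), and failing evals block a release outright.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MINOR = "minor"
    MODERATE = "moderate"
    MAJOR = "major"


class Route(Enum):
    AUTO_RELEASE = "auto_release"
    HUMAN_REVIEW = "human_review"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class SkillUpdate:
    """Hypothetical record produced after an LLM has rewritten a skill."""
    skill_name: str
    severity: Severity          # LLM classification of the KB article change
    evals_passed: bool          # outcome of the skill's eval suite
    is_new_topic: bool = False  # the article covers something no eval exercises yet
    has_new_eval: bool = False  # a new eval was added alongside the change


def route_update(update: SkillUpdate) -> Route:
    """Decide what happens to an updated skill."""
    # From the outline: a new topic requires a new eval before it can ship.
    if update.is_new_topic and not update.has_new_eval:
        return Route.BLOCKED
    # Assumption: failing evals block the update regardless of severity.
    if not update.evals_passed:
        return Route.BLOCKED
    # From the outline: only major changes go to a human reviewer.
    if update.severity is Severity.MAJOR:
        return Route.HUMAN_REVIEW
    # Minor changes auto-release when evals pass. Assumption: moderate
    # changes follow the same path, since the talk leaves them unspecified.
    return Route.AUTO_RELEASE
```

A team that wants a more conservative policy could send `Severity.MODERATE` to `Route.HUMAN_REVIEW` instead; the point of the sketch is only that the routing decision is deterministic code, while the classification and the skill rewrite are done by the LLM.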

Named frameworks / concepts introduced

  1. Context as the durable investment — "our agent architectures, they change every day… your context is your durable investment."
  2. Evals as unit tests for agents — "Evals are the same in this world with agentic. They are the unit test for your agents." Implication: shared skills can be modified PR-style like a shared library, with evals enforcing the contract.
  3. The KB-article-to-skill pipeline (Story 1) — GitHub-action-style pipeline triggered on KB article update → LLM classifies change as minor/moderate/major → LLM updates the corresponding skill → evals run → minor + passing = auto-release; major = human review; new topic = new eval required.
  4. The skill-as-onboarding-vehicle pattern (Story 2) — package a framework as an installable skill, distribute via a registry (he mentions Tessl), let each team's coding agent apply it on their behalf. Avoids onboarding meetings nobody pays attention to.
  5. Paired skill.md (for agents) + README.md (for humans) + deterministic sync to Confluence (for non-git users) — single source of truth in a git repo, "very simple… deterministic conversion" from markdown to Confluence HTML. "Don't have an LLM update Confluence. Many can do that too, but it's kind of token heavy."
  6. JSONL over JSON for eval datasets — "each example for each eval is an individual line. Your coding agent can make more precise edits on the data set."
  7. Environment-aware eval invocation — evals point at an endpoint, not a Python import. Same eval script runs locally, in staging, and in CI. Assertions check tool calls, parameters, token spend, and latency baselines via the observability system (LangSmith).
  8. Semantic versioning maturity ladder — 0.0.x (use yourself, share with a few) → 0.x (becoming stable) → 1.0 (promise of reliability).
  9. Humans at the extremes, not the middle — "Your humans either go far on the left or far on the right, but in between, you should not be wasting your time. Should not be caring about the text. If your engineers are reviewing text diffs on a skill, you are wasting a lot of time."
  10. Let the small model build its own skill — "if you want a smaller model to do this workflow, let the small model build the skill for itself… 'how do you want to organize this information so you can be more efficient with it?'"
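
Items 6 and 7 above (JSONL datasets and environment-aware invocation) can be illustrated together. The talk does not show the real schema or scripts, so everything below is an assumed shape: the field names, the `AGENT_ENDPOINT` variable, the default URL, and the `fake_invoke` stand-in are all hypothetical. In a real setup the stand-in would be an HTTP call to the agent followed by a trace lookup in the observability system.

```python
import json
import os
from typing import Callable

# Hypothetical dataset: one eval example per line, so an edit touches one line.
DATASET_JSONL = """\
{"id": "uv-install", "prompt": "Add the requests package", "expect_tool": "shell", "expect_param_contains": "uv add", "max_tokens": 2000, "max_latency_s": 20}
{"id": "uv-run", "prompt": "Run the test suite", "expect_tool": "shell", "expect_param_contains": "uv run", "max_tokens": 3000, "max_latency_s": 30}
"""


def load_examples(text: str) -> list[dict]:
    """Parse JSONL: each non-empty line is an independent JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_example(example: dict, trace: dict) -> list[str]:
    """Return failure messages; an empty list means the example passed."""
    failures = []
    calls = [c for c in trace["tool_calls"] if c["name"] == example["expect_tool"]]
    if not calls:
        failures.append(f"no call to tool {example['expect_tool']!r}")
    elif not any(example["expect_param_contains"] in c["params"] for c in calls):
        failures.append(f"no call containing {example['expect_param_contains']!r}")
    if trace["total_tokens"] > example["max_tokens"]:
        failures.append("token spend above baseline")
    if trace["latency_s"] > example["max_latency_s"]:
        failures.append("latency above baseline")
    return failures


def run_evals(examples: list[dict],
              invoke: Callable[[str, str], dict],
              endpoint: str) -> dict:
    """Run every example against `endpoint`; map example id to its failures."""
    return {ex["id"]: check_example(ex, invoke(endpoint, ex["prompt"]))
            for ex in examples}


# The target is configuration, not a Python import: local, staging, and CI
# runs differ only in this value. AGENT_ENDPOINT is a made-up variable name.
ENDPOINT = os.environ.get("AGENT_ENDPOINT", "http://localhost:8000/invoke")


def fake_invoke(endpoint: str, prompt: str) -> dict:
    """Stand-in for the agent call; returns a trace in an assumed shape."""
    command = "uv add requests" if "Add" in prompt else "pip install pytest && pytest"
    return {"tool_calls": [{"name": "shell", "params": command}],
            "total_tokens": 1200, "latency_s": 4.2}


results = run_evals(load_examples(DATASET_JSONL), fake_invoke, ENDPOINT)
for example_id, failures in results.items():
    print(example_id, "PASS" if not failures else f"FAIL: {failures}")
```

The stand-in deliberately reaches for pip on the second prompt, so that example fails its tool-parameter assertion while the first passes. Because each example sits on its own line, tightening a single baseline is a one-line diff, which is the property the talk credits JSONL with.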

Open questions / not covered

  • How to evaluate non-text agent outputs (e.g. UI actions, multi-step tool chains beyond tool-call assertions) — only the LangSmith-style observability-based assertion pattern is described.
  • Specific eval framework code or schema — he describes constraints (JSONL, baseline schema, endpoint invocation) but does not show the schema itself. An audience member asks for further reading; he defers to a one-on-one chat.
  • Skill discovery / search at scale — when asked about orchestrating users to the right skill across hundreds of skills, he says "figuring out how to index your skills so they're searchable by an agent generally is the answer; how you do that kind of depends on what your tooling is being used" and mentions Tessl, but does not present a worked solution.
  • Per-project vs. shared skill repository tradeoffs — confirms "each engineering team kind of has their own skills repo… one repo for like all of our team skills" but doesn't dive into mono- vs. multi-repo tradeoffs.
  • Specific model names or vendor comparisons — he names Opus, Haiku, "GPT medium reasoning", and mentions Claude Code, Codex/dev, GitHub Copilot CLI, but does not benchmark or rank them.
  • Security/IP considerations of LLM-edited skills — not addressed.
  • How to write a good eval for a non-deterministic task — acknowledged as harder ("a little more complicated with the non deterministic nature of all LLMs") but no recipe given.
  • The Sourcefire-acquired automation backstory referenced in his bio (industry award over a decade later) is not described in the talk itself.
