AI Native DevCon Day 2: From Agent Demos to Operating Models

Rohan Sharma

3 Jun 2026 · 12 min read

TL;DR

Day 2 of AI Native DevCon shifted from agent capability to operating discipline. The strongest sessions focused on how teams can run AI-native delivery with clearer context pipelines, measurable agent behavior, safer execution boundaries, and better organizational ownership.

The scale showed up in the numbers too. Across the two days, DevCon brought together 650+ in-person registrations, around 2,000 online registrations, and a packed mix of sessions, workshops, hallway conversations, and practical lessons.

Day 2 leaned into workshops. That shift mattered because the second day was less about proving agents can do useful work and more about showing how teams can make that work repeatable.

Hey there, welcome back. Rohan Sharma here again, continuing the DevCon series.

Day 1 gave us the framing, including Guy Podjarny’s core point that skills should be treated like real software assets. Day 2 picked up from there and moved into the operating details. Once agents are inside daily engineering work, platform and product teams need to decide what changes first, who owns those changes, and how the results are measured.

Talks that shaped Day 2

Harness engineering beyond code

Marc Sloan from Tessl focused on the next gap many teams are hitting. Code context is increasingly structured, but product and design context still lives in external systems such as Figma, Notion, and Linear. Pulling that context live can reduce staleness, but it also introduces drift: when the context changes between runs, evals, versioning, and reproducibility all get harder.

The practical lesson was to stop treating external product and design context as random reference material. Teams need a defined layer between the repository and those external systems, with clear versioning so evaluations can be replayed against known context snapshots.

Without that, agents can produce work that looks technically correct while missing the product constraint that actually mattered. That is a very expensive kind of almost-right.
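One way to make "known context snapshots" concrete is to hash whatever was pulled from the external systems and record that hash alongside each eval run. The sketch below is a minimal illustration under that assumption, not something shown in the talk. The `snapshot_id` helper, the source names, and the ticket contents are all hypothetical.

```python
import hashlib
import json


def snapshot_id(context: dict) -> str:
    """Return a stable ID for a bundle of external context.

    Keys are sorted before hashing, so the same content always maps to
    the same ID regardless of the order in which sources were fetched.
    """
    canonical = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical context pulled from design and planning tools.
context_v1 = {
    "figma": {"checkout_button": "primary, 44px min height"},
    "linear": {"TICKET-1": "Guest checkout must not require an account"},
}

# The same context after someone edits the ticket upstream.
context_v2 = {
    **context_v1,
    "linear": {"TICKET-1": "Guest checkout requires email only"},
}

# Store the snapshot ID with the eval run so it can be replayed later
# against exactly the context the agent saw.
eval_run = {"skill": "checkout-flow", "context_snapshot": snapshot_id(context_v1)}
```

If the ticket changes upstream, the snapshot ID changes with it, so a regression in the eval can be traced to a context change rather than blamed on the agent.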

From vibes to metrics

Simon Obstbaum and Rob Willoughby from Tessl tackled a challenge many engineering leaders are currently facing: judging agent behavior by measurement rather than by feel. Their distinction between output evals and trajectory evals is operationally important. A good answer is not enough if the agent used risky tools, skipped required checks, or ignored policy steps.

The useful measurement model came down to activation, trajectory, and outcome. Did the right skill trigger? Did the agent follow the right steps? Was the final result actually useful and correct?

The good part was the emphasis on partial compliance. Pass or fail is too blunt for agent workflows. If a workflow degrades halfway through, teams need to know where it happened, not just that something felt off.
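To show what partial compliance could look like in practice, here is a small sketch that scores a single run on activation, trajectory, and outcome. It is my own illustration rather than the speakers' tooling. The step names, the run format, and the `score_run` function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrajectoryReport:
    activated: bool               # did the expected skill trigger?
    steps_completed: int          # required steps found, in order
    steps_required: int
    first_missing: Optional[str]  # first required step the agent skipped
    outcome_ok: bool              # was the final result accepted?

    @property
    def compliance(self) -> float:
        """Fraction of required steps the agent actually followed."""
        if self.steps_required == 0:
            return 1.0
        return self.steps_completed / self.steps_required


def score_run(expected_skill: str, required_steps: list, run: dict) -> TrajectoryReport:
    trace = run["steps"]
    position = 0
    completed = 0
    first_missing = None
    for step in required_steps:
        try:
            # Required steps must appear in order; extra steps in between are fine.
            found_at = trace.index(step, position)
        except ValueError:
            if first_missing is None:
                first_missing = step
            continue
        position = found_at + 1
        completed += 1
    return TrajectoryReport(
        activated=run["skill"] == expected_skill,
        steps_completed=completed,
        steps_required=len(required_steps),
        first_missing=first_missing,
        outcome_ok=run["outcome_ok"],
    )


# Hypothetical run: the output was fine, but the agent skipped a required check.
required = ["read_spec", "run_tests", "security_scan", "open_pr"]
run = {
    "skill": "release-check",
    "steps": ["read_spec", "edit_code", "run_tests", "open_pr"],
    "outcome_ok": True,
}
report = score_run("release-check", required, run)
```

Here the run would pass an output-only eval, while the report shows 75% compliance and names `security_scan` as the point where the workflow degraded.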

Benchmarking beyond the model

Amit Kushwaha from NVIDIA highlighted why many current benchmarks miss real agent behavior. Agent systems run long traces with tool calls, context accumulation, and latency bottlenecks that one-shot benchmark numbers do not capture.

For teams choosing infrastructure, the warning was clear. Do not optimize only for model speed. Real agent workloads involve tools, memory, caches, retries, and long-running traces.

The better benchmark is closer to production reality, with multi-turn tasks, tool latency, tail latency, and cache behavior over time. Otherwise teams risk picking systems that look great in a chart and struggle in the actual workflow.
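As a rough sketch of why tail latency matters, the example below summarizes multi-turn traces by per-turn latency, with model time and tool time counted together. The trace format and all the numbers are made up for illustration, and the nearest-rank percentile is just one simple choice among several.

```python
import math


def percentile(values: list, pct: float):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(traces: list) -> dict:
    """Summarize latency and cache behavior across all turns in all traces."""
    turns = [turn for trace in traces for turn in trace]
    # A turn's latency is model time plus every tool call made in that turn.
    latencies = [turn["model_ms"] + sum(turn["tool_ms"]) for turn in turns]
    cache_hits = sum(1 for turn in turns if turn["cache_hit"])
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "cache_hit_rate": cache_hits / len(turns),
    }


# Hypothetical traces: two multi-turn tasks, one with a slow tool call.
traces = [
    [
        {"model_ms": 400, "tool_ms": [50], "cache_hit": False},
        {"model_ms": 300, "tool_ms": [], "cache_hit": True},
        {"model_ms": 350, "tool_ms": [2000], "cache_hit": True},
    ],
    [
        {"model_ms": 420, "tool_ms": [60, 40], "cache_hit": False},
        {"model_ms": 310, "tool_ms": [30], "cache_hit": True},
    ],
]
summary = summarize(traces)
```

In this toy data the median turn takes 450 ms, but the p95 turn takes 2,350 ms because of one slow tool call. A model-only speed number would never surface that.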

Safe execution boundaries for agents

Oleg Šelajev from Docker covered a problem every platform team eventually sees. An unconstrained agent can make high-impact changes in the wrong environment. Sandboxing is not optional once agents are allowed to execute.

The practical takeaway was to treat environment policy as part of the harness. Filesystem access, network access, secrets, and permissions all need clear boundaries before agents are given the ability to act.

This is how teams lower blast radius. Not by hoping the agent behaves nicely, but by designing the room it is allowed to move around in.
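A minimal sketch of "environment policy as part of the harness" might look like the check below. It only decides whether a proposed action is allowed. It is not a sandbox, and real isolation still has to come from the runtime. The policy fields, action format, paths, hosts, and secret names are all hypothetical.

```python
import posixpath
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvPolicy:
    writable_roots: tuple       # directories the agent may write under
    allowed_hosts: frozenset    # hosts the agent may call
    allowed_secrets: frozenset  # secret names the agent may read


def is_allowed(policy: EnvPolicy, action: dict) -> bool:
    """Return True only if the policy explicitly permits the action."""
    kind = action["kind"]
    if kind == "write_file":
        # Normalize first so "../" segments cannot escape a writable root.
        path = posixpath.normpath(action["path"])
        return any(
            path == root or path.startswith(root.rstrip("/") + "/")
            for root in policy.writable_roots
        )
    if kind == "http_request":
        return action["host"] in policy.allowed_hosts
    if kind == "read_secret":
        return action["name"] in policy.allowed_secrets
    # Default deny: anything the policy does not name is refused.
    return False


# Hypothetical policy for a CI-style agent run.
policy = EnvPolicy(
    writable_roots=("/workspace",),
    allowed_hosts=frozenset({"pypi.org"}),
    allowed_secrets=frozenset({"CI_READ_TOKEN"}),
)

ok_write = is_allowed(policy, {"kind": "write_file", "path": "/workspace/src/app.py"})
escaped_write = is_allowed(policy, {"kind": "write_file", "path": "/workspace/../etc/passwd"})
prod_secret = is_allowed(policy, {"kind": "read_secret", "name": "PROD_DB_PASSWORD"})
```

The default-deny branch is the important design choice: a new kind of action has to be added to the policy before the agent can perform it.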

Do not write prompts, write software

Baruch Sadogursky and Macey Baker from Tessl reinforced an idea that keeps proving useful in production. Break behavior into modular skills instead of maintaining one giant prompt. This makes agent behavior easier to test, review, and reuse.

The message was not “write a better mega prompt.” It was to turn repeatable behavior into composable skills that match real workflow stages. That gives teams something they can review, test, improve, and share across repos.

If you try one thing from this workshop, use the materials and skill templates as a starting point. Prototype one small skill pipeline in your own environment before trying to scale the pattern across every repo.

What kept coming up across the day

1. Context quality is now a platform responsibility

Marc Sloan, Shaun Smith, and John Groetzinger approached this from different angles, but the operational message was consistent. Context delivery is becoming an engineering system, not documentation hygiene. Teams need predictable context pipelines for both humans and agents.

The next step is ownership. Teams need to know who maintains context sources, how often they refresh, and how changes are versioned. Context also needs observability so teams can trace which inputs shaped an agent decision.

2. Agent performance needs production-grade telemetry

Simon Obstbaum and Rob Willoughby from Tessl, Amit Kushwaha from NVIDIA, and Justin Cormack, former CTO at Docker, made this very concrete in their sessions. Teams need to measure how agents worked, not only what they returned.

Trajectory metrics belong next to existing quality signals. If your dashboards already show test health, release health, or incident trends, agent workflow quality should sit in the same operational view.

The benchmark scenarios should also look like real work: multi-turn, tool-heavy, slightly messy, and full of the same constraints your teams face every day. Justin’s observability point connected neatly here too. Teams need runtime signals that can reveal agent-induced drift before it becomes a bigger production problem.

3. Adoption is an organizational design problem, not a tooling checkbox

Talks from Tammuz Dubnov and Birgitta Böckeler from Thoughtworks showed that adoption succeeds when review structures, ownership boundaries, and team rituals evolve with the tooling.

That means setting explicit contribution boundaries for AI-assisted changes and updating review criteria. The diff still matters, but so does the path the agent took to produce it. Birgitta’s adoption data made this especially grounded by showing where hidden costs appear when speed becomes the only metric, including review load, technical debt, and reduced maintainability.

4. Workshops made the ideas practical

Baruch Sadogursky and Macey Baker from Tessl, along with Alfonso Graziano from Nearform, helped turn the bigger Day 2 ideas into something teams could actually try. The workshop-heavy format made the day feel less like theory and more like practice.

Derek Ashmore’s packed workshop, “The AI Agent Testing Pyramid,” focused on the different levels of testing agent systems need. For those following from home, you can attempt it on your own by following this repo.

Aashrey Tiku from Anthropic worked through a hands-on session on shipping a managed agent. It was a useful bridge between agent concepts and the practical work of packaging, managing, and operating an agent with the right boundaries.

That mattered because AI-native development is still new enough that people need patterns they can test, not just concepts they can nod along to. Alfonso’s spec-driven angle fit well here because prompts become far more useful when they are turned into testable, production-ready specifications.

5. Agent enablement needs real ownership

Ian Thomas from Meta and Katie Roberts from Nearform made the enablement side feel practical. Rollouts work better when platform safeguards are paired with updated team rituals, clear ownership, and realistic guidance for brownfield systems.

Katie’s legacy advice was especially useful. AI should help teams modernize incrementally, not generate another fragile layer on top of systems that are already hard to maintain.

If you missed Day 1, start here

Day 2 was workshop-heavy. If you missed the Day 1 virtual stream, start with these talks before digging into the workshop themes.

Guy Podjarny, Tessl - Skills are the new Code
Dana Lawson, Netlify - Built for Humans. Now Agents Are Here.
James Moss, Tessl - Using skills to pay the bills
Liran Tal, Snyk - Your AI Agent Installed Malware Because a SKILL.md Told It To
Ryan Lopopolo, OpenAI - Harness Engineering
Patrick Debois, Tessl - The Rise of Agent Enablement
Shachar Azriel, Baz - Executable Specs
May Walter, Hud - Runtime Intelligence for Continuous Agentic Performance Optimization
Dave Farley - Vibe Coding: Is this really the best we can do?

That set gives the right foundation for Day 2 across skills, context, verification, security, harnesses, runtime feedback, and team enablement.

AI Native DevCon is not over yet!

We are already working on the next AI DevCon, and yes, we are very excited to say that AI DevCon NYC is officially on the way.

If Day 1 gave the frame and Day 2 showed the operating model, NYC is where the conversation gets even more practical. Expect more on skills, harnesses, agent safety, context systems, benchmarking, product workflows, and what it really takes to make AI-native delivery work inside teams.

Super-early-bird seats are available now. If you want to be in the room for the next round of conversations, this is the time to grab a spot.

In the meantime, sign up for the AI DevCon newsletter. We will share content from the conference, including selected highlights, session clips, notes, slide decks, and workshop materials, as it is published.

Rohan Sharma

Building the AI Native Dev community. DevRel at Tessl. Open Source contributor.
