
What OpenAI, Stripe & ElevenLabs Devs Do Differently Now

with Steve Kaliski, Madison Faulkner, Omar Sanseviero, Boris Starkov, Ryan Lopopolo, Jordan Juritz, Nupur Sharma, Cameron McLoughlin, and Nick Arcolano

Chapters

Trailer
[00:00:00]
AI Native DevCon
[00:01:00]
AI Engineer
[00:01:54]
Steve Kaliski on Stripe's coding agents
[00:02:37]
Madison Faulkner on why CI/CD is dead
[00:08:24]
Omar Sanseviero on Gemma and open models
[00:16:48]
The ElevenLabs phone booth
[00:22:55]
Boris Starkov on voice AI at ElevenLabs
[00:23:17]
Ryan Lopopolo on harness engineering at OpenAI
[00:26:10]
Jordan Juritz on AI in UK government
[00:35:02]
Nupur Sharma on code reviews at Qodo
[00:45:43]
Cameron McLoughlin on Zed IDE
[00:53:23]
Nick Arcolano on engineering metrics at Jellyfish
[00:58:49]
Wrap-up
[01:04:43]

In this episode

How aligned are teams at Google DeepMind, OpenAI, Stripe, and ElevenLabs on what’s changing in software development?


At AI Engineer London, with 100+ speakers and 1000+ engineers in the room, Simon Maple pulls together perspectives from across the ecosystem to understand where AI-native development is heading.


  • why traditional CI/CD “is dead”
  • the growing need for automated code review and guardrails
  • the move from "more context is better" to "right context at the right time"
  • the difference between general-purpose models and specialised domain models

Context Engineering at Scale: Dispatches from AI Engineer London

The gap between what AI can do in theory and what it delivers in practice remains the central challenge for engineering teams. At AI Engineer London, conversations across the expo hall and speaker sessions kept circling back to the same themes: context management at enterprise scale, the evolving role of CI/CD infrastructure, and the practical realities of shipping agent-generated code to production.

The AI Native Dev podcast captured conversations with engineers from Stripe, OpenAI, Google DeepMind, and several startups building the infrastructure layer for agentic development. What emerged was a picture of an industry moving fast but still discovering the hard constraints that determine success or failure.

1,300 Pull Requests Per Week at Stripe

Steve Kaliski, an engineer at Stripe, described their internal coding agent called Minions, built on a fork of Block's Goose harness and using Claude and OpenAI models. The system now generates approximately 1,300 pull requests per week without human assistance during the coding phase. The only human intervention is code review.

The integration points are telling. A Jira ticket arrives, someone adds an emoji reaction in Slack, and the system spins up a complete development environment to attempt resolution. This level of automation required solving the context problem at significant depth.
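The trigger described above can be sketched as a small event handler. This is purely illustrative, not Stripe's implementation: the event shape follows Slack's `reaction_added` event, and the trigger emoji and `start_agent_run` callback are assumptions.

```python
# Illustrative handler: a Slack reaction on a ticket message kicks off an
# agent run. The emoji name and kickoff callback are hypothetical.
TRIGGER_EMOJI = "robot_face"  # assumption; the actual trigger emoji isn't stated

def handle_reaction(event: dict, start_agent_run) -> bool:
    """Return True if this reaction event kicked off an agent run."""
    if event.get("type") != "reaction_added":
        return False
    if event.get("reaction") != TRIGGER_EMOJI:
        return False
    # A real system would resolve the message to its Jira ticket; here we
    # just hand the message reference to the (hypothetical) runner.
    start_agent_run(channel=event["item"]["channel"], ts=event["item"]["ts"])
    return True
```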

"Stripe's been around for 10 plus years," Steve explained. "There's 10 plus years of context that doesn't exclusively live in the code base. Integration with payment methods or card networks or business logic around how we treat funds has legal and compliance contextualization."

Their approach involves pattern matching against known change types. About 80% of engineering at Stripe follows established paths: API changes, documentation updates, currency additions. By recognising which pathway a change follows, they can provide appropriately scoped context rather than dumping everything into the model. Different types of changes need fundamentally different context, and managing that routing is central to reliability.
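The routing idea above can be sketched in a few lines. Everything here is hypothetical, the change types are taken from the examples in the text, and a real classifier would pattern-match far more robustly than keyword checks.

```python
# Hypothetical sketch of routing a change to scoped context rather than
# dumping everything into the model. File names are illustrative.
CONTEXT_BY_CHANGE_TYPE = {
    "api_change": ["api_style_guide.md", "versioning_policy.md"],
    "docs_update": ["docs_conventions.md"],
    "currency_addition": ["currency_checklist.md", "compliance_notes.md"],
}

def classify_change(ticket_title: str) -> str:
    """Naive keyword classifier standing in for real pattern matching."""
    title = ticket_title.lower()
    if "currency" in title:
        return "currency_addition"
    if "docs" in title or "documentation" in title:
        return "docs_update"
    return "api_change"

def scoped_context(ticket_title: str) -> list[str]:
    """Return only the context relevant to this change's pathway."""
    return CONTEXT_BY_CHANGE_TYPE[classify_change(ticket_title)]
```

The point is the routing step itself: a currency addition and a docs update receive fundamentally different context, which is what keeps the model's attention on material that matters for the change at hand.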

CI/CD Infrastructure Hits a Wall

Madison Faulkner, a partner at NEA and former Meta AI researcher, delivered a talk titled "Why CI/CD is Dead," which sparked considerable discussion. The thesis: traditional CI/CD tools were built for humans submitting two or three diffs per week. When agents generate thousands or tens of thousands of changes, the infrastructure breaks.

"You can't have so many different agents taking versions of the same code and merging them," Madison observed. The merge conflict problem alone becomes unmanageable at agent scale.

Her framing suggests that current tools like GitHub Actions and similar systems represent a transitional phase. Companies like Namespace (one of her investments) are building complementary layers that speed up build and deployment times, reducing the bottleneck that emerges when agents are waiting for CI. But longer term, she predicts infrastructure will need to be rebuilt for what she calls "inference native" workflows.

The context engineering implications are significant. If the deployment pipeline cannot keep pace with agent output, either agents slow down or quality gates get bypassed. Neither outcome is acceptable at scale.

Harness Engineering Without Plan Mode

Ryan Lopopolo from OpenAI described an approach that surprised many attendees: he has essentially banned his team from opening their editors. All code production happens through agents, with humans focused on refining output rather than generating it directly.

The key technical insight involves just-in-time context injection. Rather than front-loading all instructions into an agent.md file, his team injects relevant context only when needed, such as remediation instructions when a linter fails. This preserves attention for the actual problem rather than spending it on context that may not be relevant.

"The models are limited on two things, attention and context," Ryan explained. "We want the agents to largely cook with the minimum amount of instructions, giving context just in time in order to efficiently reclaim that context and let it operate over very long token horizons."
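The injection pattern Ryan describes can be sketched as a message builder. This is a minimal sketch under assumed message shapes, not OpenAI's harness: remediation text enters the conversation only after a lint failure, so runs where the linter passes never spend attention on it.

```python
# Minimal sketch of just-in-time context injection. The remediation
# instruction text is illustrative.
LINT_REMEDIATION = (
    "The linter failed. Fix only the reported rules; "
    "do not reformat unrelated code."
)

def build_messages(task: str, lint_failed: bool) -> list[dict]:
    """Assemble the agent's messages, injecting remediation only when needed."""
    messages = [
        {"role": "system", "content": "You are a coding agent."},
        {"role": "user", "content": task},
    ]
    if lint_failed:
        # Injected just in time rather than front-loaded into agent.md.
        messages.append({"role": "user", "content": LINT_REMEDIATION})
    return messages
```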

He explicitly avoids plan mode in most cases. The reasoning: reviewing and approving lengthy plans creates a false sense of control. If those plans encode incorrect instructions, accepting them pushes agents in wrong directions. Better to let agents figure out approaches through iteration, with feedback along the way.

The practical result: agents can execute for six, twelve, even thirty hours autonomously. Ryan mentioned buckling a laptop into his car's backseat to keep it running during commutes while agents work through complex changes.

Gemma 4 and the On-Device Context Challenge

Omar Sanseviero from Google DeepMind discussed the recent Gemma 4 release, which hit 10 million downloads in its first six days. The model family targets on-device deployment, fitting in consumer GPUs or even phones.

The context window limitations become more acute in this environment. Smaller models support 128K to 156K tokens, substantial for on-device but far shorter than what cloud-based Gemini offers. This makes context curation more critical. You cannot simply pass everything; you must be selective.

Omar noted that thinking efficiency matters as well. Models that generate very long chains of thought consume context that could be used for actual work. Getting dense, useful reasoning rather than verbose exploration becomes a design goal when context is constrained.
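The curation constraint above can be made concrete with a greedy packing sketch. This is an illustration, not Gemma tooling: `count_tokens` is a stand-in for the model's real tokenizer, and chunks are assumed to arrive pre-sorted by relevance.

```python
# Illustrative greedy context packing under a fixed on-device token budget:
# keep the most relevant chunks that fit, drop the rest.
def fit_to_budget(chunks: list[str], budget_tokens: int, count_tokens) -> list[str]:
    """Select chunks (assumed sorted by relevance) until the budget is spent."""
    selected, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n <= budget_tokens:
            selected.append(chunk)
            used += n
    return selected
```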

UK Government Digitises Planning with Multimodal Agents

Jordan Juritz leads the applied AI team for the UK government's Incubator for AI. His team has built multimodal agents that convert paper planning records into geospatial data.

The problem is visceral: local council basements contain boxes of paper documents recording what you can and cannot do with land in the UK. Every planning application requires someone to physically locate and read these records. One planning officer told Jordan they hoped the basement would flood to destroy the records so they would not have to keep searching through them.

The agents use Claude and Gemini for structured information extraction, then perform coordinate transforms from old pixel-based maps to modern geospatial formats. Planning officers review and correct the output, and that correction data feeds back into evals for continuous improvement.
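The coordinate-transform step can be sketched with control points. This is a hedged simplification, not the incubator's pipeline: real historical maps need a full affine or projective fit that handles rotation and skew, whereas this version assumes only axis-aligned scale and offset between pixel space and geospatial space.

```python
# Simplified pixel-to-geospatial transform built from two known control
# points (pairs of pixel position and real-world coordinate).
def make_pixel_to_geo(p0, g0, p1, g1):
    """Return a transform assuming axis-aligned scale and offset only."""
    sx = (g1[0] - g0[0]) / (p1[0] - p0[0])  # geo units per pixel, x axis
    sy = (g1[1] - g0[1]) / (p1[1] - p0[1])  # geo units per pixel, y axis

    def transform(x, y):
        return (g0[0] + sx * (x - p0[0]), g0[1] + sy * (y - p0[1]))

    return transform
```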

The Prime Minister committed to digitising the UK's planning system by end of year, meaning hundreds of thousands of documents. A manual process that takes an hour per record now completes in about a minute at roughly 10 pence per document.

The Real State of Adoption

Nick Arcolano from Jellyfish offered a grounding perspective on what he observed at the conference. The gap between social media narratives and actual implementation remains wide.

"Everyone's like, it's harder than everyone on Twitter makes it out to be," Nick noted. His company tracks engineering metrics including token usage, and the emerging question is not just whether teams use more tokens but whether they use them effectively.

The parallel to other metrics is instructive. You would not run an engineering organization based on electricity consumption. Token spend as a success metric makes sense at very early adoption stages but quickly becomes insufficient. Engineering leaders want to understand why outcomes differ when token usage is similar.

Looking Ahead

The conversations at AI Engineer London suggest an industry that has moved past the "will agents work" question into "how do we make them work reliably at scale." The answers involve sophisticated context engineering, rethinking infrastructure assumptions, and accepting that the tooling built for human developers needs substantial evolution.

Multiple speakers emphasised that experimentation remains essential. What worked three months ago may not be optimal now. The pace of change rewards teams that continually reassess their approaches rather than settling into fixed patterns.

The full conversations cover additional ground on small language models, specialised domain models, ACP as an emerging standard, and the UK government's work on AI tutoring benchmarks. Worth listening through for anyone tracking how production teams are actually deploying agents.
