
What AI Engineering Looks Like at Meta, Coinbase, ServiceTitan and ThoughtWorks

with Ian Thomas, David Stein, Wesley Reisz, and Sepehr Khosravi


Chapters

Trailer
[00:00:00]
Introduction
[00:01:13]
Deep Dive into AI Tools and Productivity Tips
[00:03:03]
Exploring Claude Code and Advanced AI Techniques
[00:11:44]
Migrating Legacy Architecture to Data Lake
[00:34:32]
Challenges and Solutions in Code Migration
[00:35:26]
Agent Preparation and Validation
[00:37:34]
Introduction to Ripper Framework
[00:49:52]

In this episode

In this episode, recorded live at QCon AI in New York, host Simon Maple and Coinbase machine learning platform engineer Sepehr Khosravi explore what it takes to maximise developer productivity with AI. Drawing on Sepehr's insights on choosing the right AI copilot, they discuss the cultural shifts, process improvements, and architectural changes needed for effective AI-native development. Key takeaways include adopting a proof-first culture, clarifying the level of AI involvement per task, and making context a priority so that speed converts into meaningful outcomes.

Live from QCon AI in New York, host Simon Maple sits down with Coinbase machine learning platform engineer Sepehr Khosravi—plus contributions from David Stein, Ian Thomas, and Wesley Reisz—to unpack what actually moves the needle on developer productivity with AI. The episode centers on Sepehr’s talk “Choosing Your AI Copilot: Maximising Developer Productivity,” but widens to culture, process, and the architecture shifts needed to sustain AI-native workflows at scale.

Proof Over Opinion: Culture and Process for AI-Native Development

Ian Thomas opens with a cultural truth that underpins successful AI adoption: proof wins arguments. In engineering-led organizations, debates about tools and approaches are best settled by working software and measurable outcomes. That mindset showed up throughout the episode—instrument what you try, show the delta, then scale what works. It’s a useful antidote to hype and a way to move beyond opinion toward reproducible value.

Wesley Reisz adds a crucial framing question when teams say they want “AI in development”: at what level are we talking? Is the goal code completion, agentic task execution, or changes to upstream process and architecture? In his work, they defined a clear, repeatable process (referenced as “Ripper 5”) that starts with a written spec and then pairs developers with LLMs through each step. The emphasis is on clarity of intent, bounded tasks, and fast feedback—so the AI’s output is both checkable and usable.

To keep conversations grounded, the team points to data like Stanford's study of 100,000 employees: AI helped generate 30–40% more code, but 15–25% of that needed rework, for a net productivity gain of roughly 15–20%. The implication is not "AI underdelivers" but "process quality determines the yield." Spec-first work, clear acceptance criteria, and tight review loops convert raw code volume into shipped, maintainable features.

Your AI Copilot Stack: Cursor IDE + Claude Code

Sepehr’s daily environment is Cursor IDE paired with Claude Code in the terminal. He performs 80–90% of tasks in Cursor, then kicks off deep or ambiguous work in Claude Code, where larger context handling and agent-style depth often succeed when general autocomplete doesn’t. His team also tracks AI usage (in a supportive way); differences in token consumption across tools surfaced real utilization patterns and nudged him to try Claude Code more deeply, an experiment that stuck because it worked.

For developers skeptical of AI or new to it, Sepehr recommends starting with Cursor’s Tab AI. It’s low-friction autocomplete that can output 10–20 lines at a time, shaping muscle memory without changing your entire workflow. From there, activate Cursor Agent for bigger changes, then lean on multi-agent mode when you want to evaluate models or approaches side-by-side without derailing your day.

Multi-agent mode is especially useful when new models appear (e.g., comparing “Chat 5.2” against an existing daily driver like “Opus 4.5”). Benchmarks can be noisy or not match your codebase, so shadowing new models in real tasks is key: issue the same prompt to multiple models, compare the code and explanations, and decide based on clarity, correctness, and follow-through. Sepehr often prefers Claude because it explains the “why” behind changes, improving your understanding and future autonomy.
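As a rough illustration of what shadowing can look like in practice, the sketch below sends one spec-first prompt to two models and diffs the responses. The `call_model` function is a placeholder for whatever client or CLI your tooling actually exposes, and the model names are hypothetical.

```python
# Illustrative sketch of shadowing a new model against a daily driver:
# send the same prompt to both, then compare what comes back.
import difflib

def call_model(model: str, prompt: str) -> str:
    # Placeholder: in practice, route the prompt to the named model
    # through your own client or CLI integration.
    return f"[{model}] response to: {prompt.splitlines()[0]}"

PROMPT = """Task: add retry-with-backoff to the payments client.
Constraints: no new dependencies; keep the public API unchanged.
Acceptance: existing tests pass; a new test covers three retries then failure."""

def shadow(baseline: str, candidate: str) -> None:
    base_out = call_model(baseline, PROMPT).splitlines()
    cand_out = call_model(candidate, PROMPT).splitlines()
    diff = difflib.unified_diff(base_out, cand_out,
                                fromfile=baseline, tofile=candidate, lineterm="")
    print("\n".join(diff))

shadow("daily-driver-model", "new-model")
```

Comparing explanations as well as diffs is the point: the winner is the model whose output you can verify, and learn from, fastest.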

Speed Is a Feature: Composer, Latency, and Flow

Speed isn’t just convenience; it’s cognition. Cursor’s Composer model exists for this reason—it generates code quickly. Sepehr cited a page generated by Composer in 24 seconds versus 2 minutes and 30 seconds with another model. That delta is large enough to pull you out of flow, increasing context-switching costs and, ironically, error rates later. The joke that YC backed a “brain rot IDE” with TikTok while you wait for your agent to finish is a tongue-in-cheek signal: latency is now a developer-experience priority.

A practical pattern emerges: use Composer for scaffolding, boilerplate, and shorter single-file edits where speed dominates. When you hit ill-defined problems, cross-file refactors, or tasks with tricky domain invariants, escalate to Claude Code. This bifurcation helps you retain flow—fast where you can be, deep where you must be—rather than forcing every task through the same slow agentic path.

To reinforce flow, timebox agent runs. If an agent doesn’t meaningfully advance the task within 30–45 seconds for small jobs (or a few minutes for complex, multi-file changes), pause, refine the spec, and retry. Latency is feedback: if the model can’t move fast, your prompt may be under-specified, you may be overloading the context window, or you need to decompose the task into smaller steps.
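A minimal version of that timebox, assuming a hypothetical `agent` command-line tool that takes a prompt as its argument, might look like the following; substitute whatever headless interface your agent actually provides.

```python
# Timebox an agent run: if it exceeds the budget, stop and treat the timeout
# as feedback that the spec needs tightening or the task needs decomposing.
import subprocess

def timeboxed_run(prompt: str, budget_seconds: int = 45) -> str | None:
    try:
        result = subprocess.run(
            ["agent", prompt],  # hypothetical CLI; replace with your agent's headless mode
            capture_output=True, text=True, timeout=budget_seconds,
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        print(f"No result within {budget_seconds}s: refine the spec, shrink the task, retry.")
        return None

timeboxed_run("Rename UserSvc to UserService across the payments module.")
```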

Make Context Work: Rules, MCPs, and Spec-First Prompts

Sepehr emphasises that you should treat the AI like a junior engineer: it can be brilliant, but it needs the right context and constraints. Cursor’s Rules are the foundation here. He outlines four useful modes you can mix per project: always-apply rules for global preferences (style, security posture, diff-only edits), context-aware rules that Cursor applies when relevant, file- or directory-scoped rules for module-specific conventions, and manual-only rules for sensitive operations you explicitly opt into. Done well, rules serve as the “house style” and guardrails an onboarded teammate would receive.

Beyond code context, the Model Context Protocol (MCP) lets you wire documentation, APIs, and tools directly into the agent. This solves a big gap: code alone rarely explains domain invariants, data contracts, and “why it’s this way.” A documentation MCP allows the AI to answer questions and fill in missing intent, reducing hallucinations and preventing invasive refactors that violate non-obvious constraints. For many teams, connecting design docs, runbooks, and ADRs is the single highest-leverage improvement after enabling autocomplete.
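As a sketch of what a documentation MCP might look like, the snippet below exposes a single doc-search tool using the official Python MCP SDK's FastMCP helper; the `docs/` directory and the naive substring search are stand-ins for a real index over ADRs, runbooks, and design docs.

```python
# Minimal documentation MCP server (sketch). Requires the `mcp` Python package.
from pathlib import Path
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("team-docs")
DOCS_DIR = Path("docs")  # hypothetical location of design docs, runbooks, ADRs

@mcp.tool()
def search_docs(query: str) -> str:
    """Return markdown snippets that mention the query string."""
    hits = []
    for path in DOCS_DIR.glob("**/*.md"):
        text = path.read_text(encoding="utf-8")
        if query.lower() in text.lower():
            hits.append(f"# {path}\n{text[:500]}")
    return "\n\n".join(hits) or "No matching documentation found."

if __name__ == "__main__":
    mcp.run()  # serves over stdio so the agent can call search_docs
```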

Finally, manage the context window actively. As you near the limit, LLMs may default to terse, low-quality outputs. In Claude, you can use the /compact command or instruct explicitly: “You may compact prior context; produce the best possible answer.” Even better, tell it what you’ll do next so it can jettison irrelevant context. Paired with spec-first prompts (task intent, constraints, acceptance criteria), this keeps responses high quality without bloating tokens or slowing the loop.
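One lightweight way to make spec-first prompting habitual is to template it. The helper below is purely illustrative: it assembles intent, constraints, acceptance criteria, and a note about the next step so the agent knows which context it can safely drop.

```python
def spec_first_prompt(intent: str, constraints: list[str],
                      acceptance: list[str], next_step: str = "") -> str:
    # Assemble a compact, spec-first prompt: intent, constraints, acceptance
    # criteria, and (optionally) what comes next so stale context can be dropped.
    lines = [f"Task: {intent}", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append("Acceptance criteria:")
    lines += [f"- {a}" for a in acceptance]
    if next_step:
        lines.append(f"Next step after this: {next_step} "
                     "(you may compact or drop context irrelevant to it).")
    return "\n".join(lines)

print(spec_first_prompt(
    intent="Add pagination to the /orders endpoint",
    constraints=["diff-only edits", "no schema changes"],
    acceptance=["returns at most 50 items per page", "existing tests still pass"],
    next_step="write integration tests for the new query parameters",
))
```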

From More Code to Better Outcomes: Quality and Architecture

The Stanford numbers (30–40% more code, but 15–25% rework) quantify something teams feel: AI is an accelerant, not a substitute for engineering rigor. To convert speed into outcomes, keep the bar high. Start with a spec and acceptance tests; use multi-agent comparison for risky changes; insist on readable diffs with rationale; and execute code under test harnesses or ephemeral environments before committing. You’ll ship faster than before, but just as importantly, you’ll ship with confidence.
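A concrete version of "spec and acceptance tests first" is to write the failing tests before prompting the agent, then let the harness rather than the diff decide when the change is done. The example below uses pytest-style tests; `paginate_orders` is a hypothetical function the agent is being asked to implement.

```python
# Acceptance tests written before the agent touches the code. They fail until
# the implementation exists, which is exactly the signal the review loop needs.
def paginate_orders(orders: list[dict], page: int, page_size: int = 50) -> list[dict]:
    raise NotImplementedError  # the gap the agent is asked to fill

def test_returns_at_most_page_size_items():
    orders = [{"id": i} for i in range(120)]
    assert len(paginate_orders(orders, page=1)) == 50

def test_last_page_holds_the_remainder():
    orders = [{"id": i} for i in range(120)]
    assert len(paginate_orders(orders, page=3)) == 20

def test_out_of_range_page_is_empty():
    assert paginate_orders([{"id": 1}], page=5) == []
```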

David Stein zooms out to architecture, reminding us that productivity gains hit ceilings when they run into legacy systems. Most large companies have stacks that no longer match how they’d build today. His team is shifting to an off-production analytics architecture: a semantic layer with a query engine that serves metrics and insights without hammering production systems. That’s not just a data win; it’s an AI win. Agents can safely query, aggregate, and reason over business metrics when you give them a consistent semantic contract and a performant, isolated execution path.

This architecture pattern—semantic layer + query engine, off production—untangles operational concerns, improves performance, and creates a safe substrate for agentic analytics, test data generation, and observability. Combined with a culture of proof (prototype, measure, iterate) and a process that respects context (rules, MCPs, spec-first prompts), you get the compounding benefits promised by AI-native development rather than a collection of flashy demos.
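To make the idea concrete, here is a toy sketch of a semantic contract: metrics are defined once against an off-production analytics copy, and agents query the contract rather than raw tables. The table names, the metric, and the query-engine hook are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    sql: str    # defined against the analytics copy, never production tables
    grain: str  # e.g. "daily", "weekly", "per-account"

WEEKLY_ACTIVE_TRADERS = Metric(
    name="weekly_active_traders",
    sql=("SELECT COUNT(DISTINCT account_id) FROM analytics.trades "
         "WHERE trade_ts >= CURRENT_DATE - INTERVAL '7 days'"),
    grain="weekly",
)

def run_metric(metric: Metric, execute) -> int:
    # `execute` is whatever read-only query engine sits over the
    # off-production copy; agents only ever see the semantic contract.
    return execute(metric.sql)
```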

Key Takeaways

Adopt a proof-first culture. Small, instrumented pilots beat debates—measure cycle time, review burden, defect rates, and deploy frequency to decide which tools and patterns stick. Clarify the “level” of AI you want on a task, then choose appropriately: Tab AI for low-friction speed, Composer for fast scaffolding, multi-agent for evaluation, and Claude Code for deep, cross-file work. Use multi-agent shadowing to assess new models in your codebase rather than relying solely on general benchmarks.

Make context a first-class citizen. Codify rules (global, context-aware, file-scoped, manual) so the AI behaves like a well-briefed teammate. Connect documentation via MCP to eliminate domain blind spots. Manage the context window deliberately, using /compact and spec-first prompts to keep outputs crisp and high quality. Finally, remember that sustained gains require modern foundations: consider a semantic layer with an off-production query engine to safely power agentic analytics and developer tooling at scale.
