
Intelligence ≠ Knowledge: Why Context Beats Bigger Models

with Guy Podjarny and Simon Maple

Transcript

Chapters

Introduction
[00:01:08]
Deep Dive into AI Agents and Their Impact
[00:04:55]
Developer Frustrations with Agents
[00:35:31]
Context Management and Agent Enablement
[00:39:36]
Predictions for AI and Agents in 2026
[00:52:19]
Looking Forward to 2026
[00:59:50]

In this episode

In this New Year kickoff episode, host Simon Maple and guest Guy Podjarny delve into the transformative developer-facing shifts of 2025 and their implications for AI development in 2026. They explore the evolution from prompt-centric approaches to agentic systems, emphasising the importance of designing AI-native dev experiences that manage entire pipelines and focus on observability, structured contexts, and chat-first workflows. Key takeaways include the need for end-to-end instrumentation of agents, the benefits of specialised, well-specified tool graphs, and the merging of code generation and review into a cohesive feedback loop.

A whirlwind year turned “prompt engineering” from a party trick into a discipline of context design and agent orchestration. In this New Year kickoff, host Simon Maple is joined by co-host and guest Guy Podjarny (CEO/founder of Tessl) to look back on the biggest developer-facing shifts of 2025 and what they mean for building with AI in 2026. From agent observability and chat-first workflows to the convergence of code-gen and code-review, this episode distills patterns, pitfalls, and product bets developers can apply right now.

2025 In Review: From Prompt Tricks to Agentic Systems

The show itself mirrored the industry’s acceleration: 53 episodes, over 1 million YouTube views, 45,000+ subscribers across platforms, and 190 shorts. The most-watched episode featured Datadog’s Olivier Pomel—proof that leaders at scale are both shaping and responding to AI-driven development. But the bigger story is the thematic shift: early 2025 was still prompt-centric; by year’s end, teams were shipping agentic workflows, tool-augmented models, and context pipelines as first-class infrastructure.

An early episode with Macy Baker captured the wonder and fragility of prompts—deception games, clever incantations, and model-specific quirks. By contrast, late-year discussions focused on structured agent behaviors, context hygiene, and tool use. That evolution marks a maturation: developers stopped optimising for “the right words” and began designing systems that consistently guide models toward the right decisions.

The takeaway: prompts now play a supporting role in a broader system. Building an AI-native dev experience means owning the end-to-end pipeline—retrieval, tools, memory, and evaluation—and accepting you’re managing probabilities and policies, not just strings.

Designing for Two Black Boxes: Models, Agents, and Observability

A central theme this year: developers are now wrangling two black boxes—the LLM and the agent. The LLM’s reasoning is opaque, and the agent’s plan/act loops add another layer of uncertainty: when to search, what files to read, which tools to call, in what order, and how aggressively to iterate. Episodes with Uni (man-in-the-middle tracing) and Max (agents 101) stressed that developers need to reverse-engineer behavior using telemetry, not intuition.

Actionable practices:

- Instrument every turn. Persist a full trace: system prompts, user messages, tool calls, tool results, and model deltas. Label each step with timestamps, token counts, costs, and statuses (success/failure/retry). You can’t improve what you don’t see.

- Constrain degrees of freedom. Use a whitelisted tool registry with typed, schema-rich signatures; define preconditions and cost/time budgets per tool; set retry limits and backoff policies. Fewer, higher-quality tools beat a sprawling toolbox (a registry sketch follows this list).

- Shape context deliberately. Build code-aware context (symbol graphs, repo maps, embeddings) and prioritise relevant chunks. Order matters: source-of-truth docs before user chat, current branch diffs before historic files, etc.

- Make system prompts policy-heavy, not hacky. Instead of “be helpful,” specify objective functions (e.g., reduce diff size, minimise tool calls), selection heuristics (when to search vs. read local files), and escalation policies (when to ask the user).

- Evaluate like a product team. Track task success rate, average steps-to-success, tool-call accuracy, hallucinated tool invocations, and cost-per-solved task. Maintain regression suites of realistic tasks to compare providers, prompts, and tool sets.
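
To make the first two practices concrete, here is a minimal TypeScript sketch of a whitelisted tool registry with per-tool budgets and a per-turn trace record. The names (ToolSpec, TraceStep, ToolRegistry) are illustrative, not from any particular framework.

```typescript
// Minimal sketch of a whitelisted tool registry with budgets plus a per-turn trace record.
// All names here are illustrative assumptions; adapt to your own agent framework.

type ToolResult = { ok: boolean; output: string };

interface ToolSpec {
  name: string;
  description: string;                                          // fed into the system prompt / tool schema
  argsSchema: Record<string, "string" | "number" | "boolean">;  // typed signature
  precondition?: (args: Record<string, unknown>) => boolean;    // refuse bad calls early
  maxCallsPerTask: number;                                      // cost/time budget
  maxRetries: number;
  run: (args: Record<string, unknown>) => Promise<ToolResult>;
}

interface TraceStep {
  timestamp: string;
  tool: string;
  args: Record<string, unknown>;
  status: "success" | "failure" | "retry" | "rejected";
  latencyMs: number;
  costUsd?: number;                                             // fill in from provider usage data if available
}

class ToolRegistry {
  private tools = new Map<string, ToolSpec>();
  private calls = new Map<string, number>();
  readonly trace: TraceStep[] = [];

  register(spec: ToolSpec) {
    this.tools.set(spec.name, spec);
  }

  async call(name: string, args: Record<string, unknown>): Promise<ToolResult> {
    const spec = this.tools.get(name);
    const used = this.calls.get(name) ?? 0;
    // Reject anything outside the whitelist, failing its precondition, or over budget.
    if (!spec || (spec.precondition && !spec.precondition(args)) || used >= spec.maxCallsPerTask) {
      this.trace.push({ timestamp: new Date().toISOString(), tool: name, args, status: "rejected", latencyMs: 0 });
      return { ok: false, output: `tool call rejected: ${name}` };
    }
    this.calls.set(name, used + 1);
    const start = Date.now();
    for (let attempt = 0; attempt <= spec.maxRetries; attempt++) {
      try {
        const result = await spec.run(args);
        this.trace.push({ timestamp: new Date().toISOString(), tool: name, args, status: "success", latencyMs: Date.now() - start });
        return result;
      } catch {
        this.trace.push({ timestamp: new Date().toISOString(), tool: name, args, status: attempt < spec.maxRetries ? "retry" : "failure", latencyMs: Date.now() - start });
      }
    }
    return { ok: false, output: `tool failed after ${spec.maxRetries + 1} attempts: ${name}` };
  }
}
```

Persisting the registry’s trace per task gives you the raw material for the regression suites and cost-per-solved-task metrics mentioned above.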

Uni’s comparative analysis of Claude, OpenAI, and Gemini CLI showed that tool usage frequency, system prompt styles, and planning aggressiveness vary materially across providers. Benchmark your workflows against multiple models; your ideal provider often depends on your tool graph and repo structure, not brand halo.
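
A small harness along these lines makes provider comparisons repeatable rather than anecdotal; runTask is a stand-in for your own agent entry point, and the metrics mirror the ones listed above.

```typescript
// Sketch of a provider regression harness: run the same realistic tasks against multiple
// providers and compare success rate, steps, and cost. `runTask` is a hypothetical stand-in.

interface TaskResult { success: boolean; steps: number; costUsd: number }
type Provider = "claude" | "openai" | "gemini";
type RunTask = (provider: Provider, taskId: string) => Promise<TaskResult>;

async function benchmark(runTask: RunTask, providers: Provider[], taskIds: string[]) {
  for (const provider of providers) {
    const results = await Promise.all(taskIds.map((id) => runTask(provider, id)));
    const solved = results.filter((r) => r.success);
    const successRate = solved.length / results.length;
    const costPerSolved = solved.reduce((sum, r) => sum + r.costUsd, 0) / Math.max(solved.length, 1);
    const avgSteps = solved.reduce((sum, r) => sum + r.steps, 0) / Math.max(solved.length, 1);
    console.log(
      `${provider}: success=${(successRate * 100).toFixed(0)}% avgSteps=${avgSteps.toFixed(1)} cost/solved=$${costPerSolved.toFixed(2)}`
    );
  }
}
```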

Chat-First Development: From IDE Panes to Slack-Native Agents

A standout conversation with Slack’s Samuel Messing underscored a usability truth: if development is increasingly collaborative and multi-turn, chat is a natural interface. The industry started to lean into that—Devin launched with Slack integrations; Claude Code added a Slack app; and many teams now prefer rich chat UIs over terminal-bound “agent shells.”

Design patterns for Slack-native agents:

- Threads as tasks. Each thread represents a scoped mission with its own context and memory, making it easy to revisit, summarise, or hand off.

- Slash commands as typed tools. Expose durable capabilities (/plan, /diff, /test, /deploy) that map 1:1 to your agent tools and accept structured JSON arguments (a handler sketch follows this list).

- Code-aware attachments. Let users drop files, gists, or PR links; your agent ingests them into a task-specific context index and calls code intelligence tools (symbol graphs, LSPs) behind the scenes.

- Ephemeral sandboxes. For safety and speed, spawn short-lived environments where the agent can clone, build, run tests, and produce artifacts. Attach logs and previews back into the thread.

- Identity and permissions. Map Slack users and channels to repo permissions, secrets scopes, and deployment rights. Every tool call should include the acting identity and an auditable reason.

- Bridge to IDEs. Offer “Apply patch in repo” deep-links that open a PR or a local workspace change. Chat is for intent and iteration; the IDE remains best for inspection and fine edits.
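
As a sketch of the slash-commands-as-typed-tools pattern, here is a minimal Slack Bolt handler that parses structured arguments and passes the acting identity through to a hypothetical agent tool. The tool itself (planTool) and the JSON argument format are illustrative assumptions, not a prescribed design.

```typescript
// Minimal sketch: a Slack slash command mapped 1:1 onto a typed agent tool.
import { App } from "@slack/bolt";

interface PlanArgs { repo: string; goal: string }

// Hypothetical typed agent tool the command maps onto.
async function planTool(args: PlanArgs, actingUser: string): Promise<string> {
  return `Plan for ${args.repo}: ${args.goal} (requested by ${actingUser})`;
}

// Parse `/plan {"repo":"org/app","goal":"add billing"}` into structured arguments.
function toPlanArgs(text: string): PlanArgs | null {
  try {
    const parsed = JSON.parse(text);
    return typeof parsed.repo === "string" && typeof parsed.goal === "string" ? parsed : null;
  } catch {
    return null;
  }
}

const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET,
});

app.command("/plan", async ({ command, ack, respond }) => {
  await ack();
  const args = toPlanArgs(command.text);
  if (!args) {
    await respond({ text: 'Usage: /plan {"repo":"org/app","goal":"..."}' });
    return;
  }
  // Include the acting identity so every downstream tool call is auditable.
  const plan = await planTool(args, command.user_id);
  await respond({ response_type: "in_channel", text: plan });
});

(async () => {
  await app.start(Number(process.env.PORT) || 3000);
})();
```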

Stevie’s “UI evolution” framing—chat panes growing from side panels to primary canvases—suggests a boundary shift: code remains source-of-truth, but the main control plane is moving into collaborative chat. If your agent UX still assumes a solo terminal user, you’re likely leaving adoption on the table.

Code Review Meets Code Generation: The Cursor–Graphite Convergence

Podjarny called it early: if a review agent can reliably find issues and propose fixes, why wait until review? The logic—and later, Cursor’s acquisition of Graphite—signals a convergence: code generation, review, and fix-application are becoming a single feedback loop.

How to build for this convergence:

- Review-in-place autofix. When the review agent flags issues, let it propose minimal diffs, run lints/tests locally, and push commits back to the PR with an “Explain your fix” note.

- Policy-driven gates. Use risk scoring (blast radius, dependency changes, security flags) to decide which fixes can auto-apply vs. require human approval. Start with lint/doc/test-only fixes (a gate sketch follows this list).

- Test Impact Analysis. When agents propose diffs, run only impacted tests first to reduce latency. Escalate to full suites on success or high-risk changes.

- Continuous critique. Let the same static analysis, security scanning, and style checks that power review also inform earlier code-gen steps. Fewer surprises at review time.

- Metrics that matter. Track review cycle time, diff acceptance rate, rework rate, flaky test incidence, and “agent-added defects.” Use these to tune when the agent asks for help vs. pushes ahead.
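
Here is one way a policy-driven gate might look as code; the risk signals and thresholds are illustrative assumptions you would tune to your own repos and appetite for automation.

```typescript
// Sketch of a policy-driven gate: score an agent-proposed diff and decide whether it can
// auto-apply or needs human approval. Signals and thresholds are illustrative, not prescriptive.

interface ProposedFix {
  filesChanged: string[];
  linesChanged: number;
  touchesDependencies: boolean;   // e.g. package manifest / lockfile edits
  securityFindings: number;       // from the same scanners that power review
  testsPassed: boolean;
}

type Decision = "auto-apply" | "needs-approval" | "reject";

function gate(fix: ProposedFix): Decision {
  if (!fix.testsPassed || fix.securityFindings > 0) return "needs-approval";

  // Blast radius: small, non-dependency, doc/test-only changes are the safe starting point.
  const safePaths = fix.filesChanged.every(
    (f) => f.endsWith(".md") || f.includes("/test") || f.endsWith(".test.ts")
  );
  const smallDiff = fix.linesChanged <= 30 && fix.filesChanged.length <= 3;

  if (!fix.touchesDependencies && smallDiff && safePaths) return "auto-apply";
  if (fix.linesChanged > 500) return "reject"; // too large to trust an unattended fix
  return "needs-approval";
}
```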

Expect 2026 tooling to ship “closed-loop PRs” where an agent plans, codes, self-reviews, self-fixes, and presents a ready-to-merge change with a crisp audit trail. Teams that wire CI/CD and review automation now will be first to benefit.

Solo Builders, Agent Teams, and Product Scope: Lessons from Base44

The Base44 story—largely built by Maor Shlomo, quickly acquired by Wix, then scaled to millions of users and thousands of daily payers—showed what’s now possible with agentic platforms. Meanwhile, Lovable’s hypergrowth and Tom Hume’s caution about the “single-person unicorn” myth framed a nuanced reality: agents can compress headcount, but not eliminate product scope decisions, quality bars, or go-to-market work.

What developers can emulate:

- Build-with-your-tool. Dogfood relentlessly. If your agents can spec, scaffold, code, and iterate your own product, you’ll converge on the right tool graph and observability fast.

- Choose a narrow, high-context vertical. General agents sprawl and drift; vertical agents exploit structured context (domain schemas, workflows, compliance rules) to deliver reliability.

- Treat agents as a workforce. Model your pipeline explicitly: clarify > plan > retrieve > act > validate > summarise. Use typed tools, budgets, and SLAs for each step (a pipeline sketch follows this list).

- Engineer for cost and speed. Track cost-per-success, cold/warm start times, and tool-call hit rates. Introduce caching (embeddings, search results), and prefer local analysis over web search when possible.

- Design upgrade agility. Keep model/provider abstraction layers thin and swappable. Different tasks (planning vs. refactoring vs. doc generation) may benefit from different providers.
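
As a sketch of modelling the pipeline explicitly, the following treats each stage as a budgeted step; the step names follow the list above, but the budgets and types are illustrative assumptions.

```typescript
// Sketch of an explicit, budgeted agent pipeline: clarify > plan > retrieve > act > validate > summarise.
// Budgets and the string-in/string-out shape are simplifications for illustration.

interface StepBudget { maxSeconds: number; maxCostUsd: number }
type StepFn = (input: string) => Promise<string>;

interface PipelineStep {
  name: "clarify" | "plan" | "retrieve" | "act" | "validate" | "summarise";
  budget: StepBudget;
  run: StepFn;
}

async function runPipeline(steps: PipelineStep[], task: string): Promise<string> {
  let current = task;
  for (const step of steps) {
    const start = Date.now();
    current = await step.run(current);
    const elapsed = (Date.now() - start) / 1000;
    // Treat budget breaches as SLA violations: log them, then decide whether to halt or degrade.
    if (elapsed > step.budget.maxSeconds) {
      console.warn(`${step.name} exceeded its ${step.budget.maxSeconds}s budget (${elapsed.toFixed(1)}s)`);
    }
  }
  return current;
}
```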

The headline isn’t “solo forever,” it’s “ship with a tiny team and an agent workforce.” The teams that scope well, instrument deeply, and iterate fast will outpace bigger orgs still arguing over prompt styles.

Key Takeaways

- Instrument your agents end-to-end. Persist turn-by-turn traces, tool calls, costs, and outcomes. Observability is how you tame both black boxes.

- Constrain and type your tools. A smaller, well-specified tool graph beats a sprawling, ambiguous one. Encode preconditions, budgets, and fallback policies.

- Make chat the control plane. Slack-native workflows (threads-as-tasks, slash-command tools, ephemeral sandboxes) unlock collaboration and adoption far beyond terminal UIs.

- Unify code-gen and review. Let the review agent auto-fix the issues it finds under clear policies. Measure diff acceptance, time-to-merge, and agent-added defects to guide automation levels.

- Dogfood and specialise. Build your product with your own agent stack, pick a tight vertical, and focus on reliability via domain context—not generic breadth.

- Optimise for provider fit, not brand. Claude, OpenAI, and Gemini differ materially in tool behavior and system prompt patterns. Benchmark your workflows and mix providers by task.