
Smaller Context, Bigger Impact

Guy Podjarny, Founder & CEO, Tessl

What Holds Devs Back From Multi-Agent Thinking

with Guy Podjarny

Chapters

  • Trailer [00:00:00]
  • Introduction [00:01:07]
  • Keynote Speaker Introduction [00:02:05]
  • The Evolution of AI in Software Development [00:03:40]
  • Challenges and Solutions in AI Agent Reliability [00:04:44]
  • Context Engineering in Practice [00:11:52]
  • Conclusion: The Future of AI and Context Engineering [00:24:42]

In this episode

In this episode of AI Native Dev, host Guy Podjarny explores the shift from AI-assisted coding to agentic development, where AI systems autonomously handle tasks. He discusses the importance of spec-driven development, emphasising the need to provide AI agents with concise, targeted context to ensure reliability and productivity, and shares practical strategies for integrating this approach into individual, team, and ecosystem-wide workflows.

AI is transforming how we build software, but letting agents “do the coding” introduces a new reliability problem. In this episode of AI Native Dev, Guy Podjarny unpacks spec-driven (or context-driven) development: how to feed AI agents just enough, but not too much, of the right information so they can deliver trustworthy work. Drawing on lessons from Snyk’s developer-first security journey, Guy connects the mindset shift required for AI-native development to practical tactics you can adopt today—moving from single-player use to team workflows and, ultimately, to ecosystem-wide practices.

From Autocomplete to Delegation: The Agentic Shift

The first wave of AI dev tooling boosted individual productivity with autocomplete and inline chat—pioneered by tools like GitHub Copilot and Cursor. The second wave has been agentic: developers now delegate tasks to AI systems that plan, search, edit, and test code. This shift promises step-function productivity gains, but it also surfaces new failure modes. Agents can be “spooky good” one minute and confidently wrong the next, breaking unrelated parts of a codebase or declaring tasks complete when they aren’t.

Guy frames this as the capability–reliability gap. Studies like METR's show developers expect AI to speed them up, yet observed completion times can be slower due to rework and verification. Ignoring agents isn't viable, because the tech is too powerful, but harnessing it requires new operating principles. The question isn't "should we use agents?" but "how do we help agents succeed predictably?"

Beyond Silver Bullets: What Worked, What Didn’t

The industry has cycled through “this will fix it” phases. Fine-tuning can teach brand-new skills but struggles to override entrenched model behaviors like coding style or architectural preferences. Retrieval-augmented generation (RAG) helps when your domain has unique terms and scattered docs, but context gathering is messy—relevant facts are often implicit, non-obvious, or not well-indexed. “Just use huge context windows” also disappoints: as context balloons to millions of tokens, attention dilutes. More input often leads to less focus.

Agentic search is a real step forward—let the agent explore the codebase, docs, and tooling as a developer would. But unconstrained exploration can be slow, expensive, or misdirected. The modern consensus Guy endorses is context engineering: stating the problem with precisely the information an intelligent system needs to plausibly solve it without additional fishing. As Shopify’s Tobias Lütke put it, the craft is in composing context so the task is solvable without mind reading. In practice, that looks a lot like writing specs.

Context Engineering = Spec-Driven Development (Single Player)

Start with the base context: your code. Good file names, clear module boundaries, and up-to-date docs enable an agent to perform agentic search and load the right files automatically. In Guy’s demo, asking an agent to “add an Edit button to each to-do item” prompts it to scan the project, locate UI components and state, and make changes without any handholding. But aesthetics and conventions are rarely encoded in code. The first attempt renders a blue button that clashes with the project’s Jurassic theme.

Enter explicit context—your “spec.” A small agents.md (or claude.md, cursor rules, etc.) that states “use the site’s theme colors” and any other non-code conventions (even “use British spelling”) nudges the agent to the correct outcome. The lesson: keep these specs short and targeted. Overlong guidance dilutes attention; short, well-scoped instructions travel further. Treat context files as code: version them, keep them close to the repo, and prefer concise “rules of the road” over exhaustive style treatises. For task work, add micro-specs in the prompt or PR description, scoped to the change you want.
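
For a sense of scale, a global spec can be a handful of lines. The block below is a hypothetical agents.md for the to-do demo; the theme file path and the specific rules are assumptions, not content from the episode:

```
# agents.md (hypothetical example)
- Use the site's theme colours defined in src/theme.css; never hard-code hex values.
- Use British spelling in UI copy and comments.
- New UI components live under src/components/, one folder per component.
- Ask before adding a new dependency.
- Run the existing test suite before declaring a task complete.
```

A handful of rules like these travel with every task; much beyond that and attention starts to dilute.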

Practically, think in three tiers (a concrete illustration follows the list):

  • Base context: the codebase and in-repo docs the agent can read.
  • Global explicit context: a brief, evergreen spec with org/project conventions.
  • Task context: a small, situational spec attached to the change request.
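
To make the layering concrete, here is how the three tiers might combine for the Edit-button task from the demo (the file names are assumptions):

```
Base context   : src/components/TodoItem.tsx and README.md, found via agentic search
Global context : agents.md ("use the site's theme colours", "use British spelling")
Task context   : the PR description, e.g. "Add an Edit button to each to-do item;
                 clicking it makes the title editable inline; keep it keyboard accessible."
```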

Measure Before You Optimise: Evaluating Agent Work

“You can’t optimise what you can’t measure” isn’t just for ops anymore. Agents are statistical systems; success should be expressed as rates, not absolutes. Guy emphasises building lightweight evaluation harnesses so you can iterate on prompts, specs, and tools with feedback loops.

His team’s experiment illustrates why brevity beats breadth. They asked agents to add session-backed authentication to a small “dodge the blocks” game. Alongside the implementation prompts, they generated a security scorecard rubric (using agents) to grade the result. Three modes were tested: no guidance; a precise ~3KB slice of OWASP authentication guidance; and a longer ~20KB version that subsumed the short one. Results: no guidance scored ~65%; the short OWASP context jumped to ~85%; the long version fell to ~81%—more words, worse focus. They ran this across multiple agents (Claude, Codex, Cursor) and saw the same pattern.

Action this by creating a mini-benchmark for your codebase (a rough harness sketch follows the list):

  • Define realistic tasks (e.g., add feature X, refactor module Y).
  • Pair each task with tests and a rubric covering correctness, security, style, and performance.
  • Run agents multiple times per condition to capture variance.
  • Track success rates, time-to-completion, diff size, and human review effort.
  • Modify specs and constraints, then re-measure. Keep what moves the metrics.
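
A harness does not need to be elaborate. The sketch below assumes a hypothetical my-agent CLI and a pytest suite; it resets the workspace, runs each task under each condition several times, and reports success rates and median wall-clock time:

```python
# Minimal eval-harness sketch. The "my-agent" CLI and its --context flag are
# hypothetical stand-ins for whichever agent you actually run.
import json
import statistics
import subprocess
import time

TASKS = [
    {"name": "add-edit-button", "prompt": "Add an Edit button to each to-do item."},
    {"name": "session-auth", "prompt": "Add session-backed authentication."},
]
CONDITIONS = {
    "no-spec": [],
    "short-spec": ["--context", "agents.md"],
}
RUNS = 5  # agents are statistical; report rates, not single runs

def run_once(task, extra_args):
    """Reset the workspace, run the agent, then score the result with the test suite."""
    subprocess.run(["git", "checkout", "--", "."], check=True)
    start = time.time()
    subprocess.run(["my-agent", "run", task["prompt"], *extra_args], check=False)
    tests = subprocess.run(["pytest", "-q"], capture_output=True)
    return {"passed": tests.returncode == 0, "seconds": round(time.time() - start, 1)}

results = {}
for task in TASKS:
    for condition, args in CONDITIONS.items():
        runs = [run_once(task, args) for _ in range(RUNS)]
        results[f"{task['name']}/{condition}"] = {
            "success_rate": sum(r["passed"] for r in runs) / RUNS,
            "median_seconds": statistics.median(r["seconds"] for r in runs),
        }

print(json.dumps(results, indent=2))
```

Diff size and review effort are harder to score automatically, but logging git diff --stat per run alongside the rates gets you most of the way.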

From Single Player to Teams and Ecosystems

Spec-driven development becomes truly valuable when it scales beyond an individual. For teams, establish a shared, minimal global spec—naming conventions, architectural decisions, security posture, logging standards, and UI tokens. Keep it short and stable. Then augment with module-level specs (e.g., “payments module uses this idempotency pattern; API errors follow RFC7807”), and attach per-task micro-specs in tickets or PRs. Encourage contributions from design (tokens and interaction patterns), product (acceptance criteria), and security (threat models, guardrails) so agents inherit institutional knowledge.

Constrain agentic search to trustworthy sources. Point agents at the repo and a curated docs folder rather than the whole internet. If you add RAG, index only vetted docs. Prefer tool usage that keeps the agent inside the project boundary (e.g., read_file, run_tests) and make external calls explicit and auditable. Small, composable specs also unlock reuse at the ecosystem level: versioned spec packages for security practices, API contracts, or UI systems that can be imported across repos. This enables teams to share conventions without copying walls of prose into every prompt.
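
One way to keep the agent inside the project boundary is to expose only a small, allowlisted tool layer. The sketch below is illustrative rather than any particular framework's API: file reads are confined to the repo, tests run locally, and external fetches must go through an explicit, logged call.

```python
# Illustrative project-scoped tool layer; not tied to a specific agent framework.
import pathlib
import subprocess

REPO_ROOT = pathlib.Path(".").resolve()
AUDIT_LOG: list[str] = []

def read_file(rel_path: str) -> str:
    """Read a file only if it resolves inside the repository."""
    target = (REPO_ROOT / rel_path).resolve()
    if REPO_ROOT != target and REPO_ROOT not in target.parents:
        raise PermissionError(f"{rel_path} is outside the project boundary")
    return target.read_text()

def run_tests() -> bool:
    """Run the project's test suite; the agent only sees pass or fail."""
    return subprocess.run(["pytest", "-q"]).returncode == 0

def fetch_external(url: str) -> str:
    """External access is explicit and auditable; disabled by default in this sketch."""
    AUDIT_LOG.append(url)
    raise PermissionError("external fetches are off; wire in an approved fetcher if needed")
```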

Finally, put specs in the loop. Add a “spec check” to code review: the agent summarises how the diff adheres to the spec and flags divergences. Use CI to fail builds when generated code breaks spec-defined invariants (e.g., missing auth checks, violating logging formats). Measure adherence and outcomes across teams to see which specs improve reliability—and prune the rest.
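
A spec check in CI can start as a very small script. The example below is a rough sketch with made-up invariants (an auth decorator on new routes, no raw print() logging); real checks would be derived from your own spec:

```python
# Rough CI "spec check" sketch; the invariants are illustrative examples only.
import re
import subprocess
import sys

diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout
added = [line[1:] for line in diff.splitlines()
         if line.startswith("+") and not line.startswith("+++")]

violations = []

# Invariant 1 (crude): a newly added route should come with an auth decorator
# somewhere in the same diff.
if (any("@app.route(" in line for line in added)
        and not any("@requires_auth" in line for line in added)):
    violations.append("new route added without a corresponding @requires_auth")

# Invariant 2: no raw print() calls in added application code; the spec mandates
# the structured logger.
violations += [f"print() instead of the structured logger: {line.strip()}"
               for line in added if re.search(r"\bprint\(", line)]

if violations:
    print("Spec check failed:")
    print("\n".join(f"  - {v}" for v in violations))
    sys.exit(1)
print("Spec check passed")
```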

Practical Workflow Patterns Developers Can Adopt Today

A reliable agent workflow blends code context, small specs, and constrained search. Start by investing in code readability: descriptive file names, clear directory structures, and up-to-date READMEs make agentic search effective. Add a repo-level agents.md with 10–15 bullet rules max. For each feature, create a micro-spec in the issue/PR as a checklist of acceptance criteria, edge cases, and constraints. Ask the agent to plan first, then implement, then run tests, then justify how the result satisfies each checklist item.
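
A per-task micro-spec can be as small as a checklist pasted into the issue or PR body. A hypothetical example:

```
Task: add CSV export to the reports page

- [ ] Export respects the user's currently applied filters
- [ ] Large exports stream rather than loading the whole table into memory
- [ ] Dates use ISO 8601; currency values keep two decimal places
- [ ] No new dependencies without sign-off
- [ ] Covered by a test that round-trips a sample report
```

Asking the agent to justify each item once it finishes doubles as a review artefact.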

On the tooling side, lean on agents that support dynamic file reading, test execution, and iterative planning. Use big context windows judiciously—reserve them for complex, tightly scoped tasks where the extra context is truly relevant. Prefer “bring the right 3KB” over “dump 2M tokens.” If you need RAG, keep the corpus tight and the retrieval focused on task-relevant namespaces. When you’re tempted to fine-tune, ask whether a small, well-placed spec would achieve the behavior change faster and more reliably.

Key Takeaways

  • Treat context as a spec: Write down the minimum information an intelligent agent needs to plausibly solve the task—no more, no less.
  • Keep specs short and layered: global repo spec (stable, minimal), module specs (focused), and per-task micro-specs (temporary, precise).
  • Constrain the agent’s world: Prefer curated sources, explicit tool calls, and in-repo docs over unbounded web search or bloated context.
  • Measure everything: Build small evaluation harnesses with tests and rubrics; track success rates, time-to-completion, and review effort.
  • Optimise by subtraction: Shorter, sharper guidance often beats long-form policies; attention dilutes with context bloat.
  • Evolve to team and ecosystem: Version specs, share them across repos, add spec checks in review/CI, and let design/security/product contribute rules of the road.

The bottom line: agentic development works when we pair powerful models with disciplined context engineering. Start small, measure, and scale your specs from single-player wins to team-wide and ecosystem-level reliability.