ainativedev/latest-aidevcon-speakers-london-2026

AI Native DevCon 2026 London — all conference sessions as interactive skills

Outline — Welcome to AI Native DevCon (Spec Reviewer talk)

Speaker

Shachar — Product Manager at Buzz, based in Tel Aviv. Self-described as a "product manager [veteran]" with ~a decade in product management. Mission at Buzz: "building the best and the most precise AI code review agent in the market." Planning to relocate and establish a new site for Buzz in San Francisco.

Host / event framing: Simon Maple (Head of Developer Relations at Tessl, AI Native Dev co-host) — named in the event metadata as the host who introduces the session. The transcript body itself is delivered by Shachar; Simon's words are not separately labelled here.

Audience Q&A: three unnamed audience questioners at the end.

Abstract

Not provided by user. [inferred] A case study from Buzz on building "Spec Reviewer" — an agent that verifies whether implemented features actually match the spec — and the context-engineering lessons learned along the way (planner/verifier split, sub-agent delegation, base-branch grounding, ephemeral sandboxing).

Thesis (synthesis)

Coding agents in 2026 are good at generating code but bad at verifying features against their specs, which is why human code review is still a bottleneck. Solving this is a context-engineering problem, not a model-capability problem: you must (a) split planning from verification, (b) delegate per-requirement verification to parallel sub-agents, (c) ground the verifier in the base branch rather than the diff to avoid solution bias and hallucinated requirements, (d) scope each sub-agent to the relevant layer (frontend vs backend), and (e) sandbox browser sessions when the agent must visit customer URLs. The strategic takeaway: the big coding-agent vendors overlook cracks like this, and that's where small teams can win.
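
Points (c) and (d) amount to a rule about what goes into each verifier's context. The sketch below is one minimal way to express that rule, not Buzz's implementation: the `Requirement` type, the `LAYER_PREFIXES` mapping, the directory names, and `build_verifier_context` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    text: str
    layer: str  # "frontend" or "backend"; labels are assumptions for this sketch


# Hypothetical mapping from a layer to the repository paths that belong to it.
LAYER_PREFIXES = {
    "frontend": ("web/", "ui/"),
    "backend": ("api/", "services/"),
}


def build_verifier_context(requirement, base_branch_files):
    """Assemble the context for one verifier sub-agent.

    base_branch_files maps path -> file contents as of the base branch,
    i.e. before the change under review. The diff is deliberately left out
    so the verifier is not anchored to the engineer's chosen solution.
    Raises KeyError if the requirement names an unknown layer.
    """
    prefixes = LAYER_PREFIXES[requirement.layer]
    scoped = {
        path: source
        for path, source in base_branch_files.items()
        if path.startswith(prefixes)  # str.startswith accepts a tuple of prefixes
    }
    return {"requirement": requirement.text, "files": scoped}
```

A frontend requirement passed through this function would see only the `web/` and `ui/` files from the base branch, which is the noise reduction the talk attributes to scoping.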

Section TOC

| Section | Summary | Approx. lines |
|---|---|---|
| 1. Opening & self-intro | Shachar introduces himself, Buzz, Tel Aviv, plan to move to SF | 1–25 |
| 2. The problem: code review is the bottleneck | Discovery-call findings: PRs piling up, trust deficit, agents ignore spec parts | 25–55 |
| 3. Motivating example: the "continue button" overlap bug | A spec he wrote himself, with a recording, was still implemented wrong | 55–90 |
| 4. Why coding agents fail at verification | They focus on generating, not verifying; code-vs-code isn't the answer; need a deployed feature in staging | 90–120 |
| 5. Spec Reviewer architecture v1 (single agent) | Agent reads tickets/designs/staging, extracts requirements, validates each; Claude warns it's a hard task and the context window explodes | 120–155 |
| 6. Split 1: Planner + Verifier | Two agents: planner extracts requirements, verifier checks them. Sessions stop crashing but requirements get skipped and quality is inconsistent | 155–185 |
| 7. Split 2: Per-requirement sub-agents | Parallel sub-agents, one per requirement, with an orchestrator collecting verdicts | 185–210 |
| 8. Hallucinated requirements & base-branch grounding | Agent invents requirements (e.g. "new responder command", "backward compatibility"); fix: give it the base branch (not the diff) + scope it to the relevant layer | 210–250 |
| 9. Sandboxing untrusted URLs | Customer integrations require visiting arbitrary URLs; use a third-party sandbox (AWS Agent Core), with an ephemeral sandbox per requirement | 250–285 |
| 10. The dream made real | Spec Reviewer running on Buzz's own code; agent sessions navigating dashboards, integrations, Stripe subscription, Google/Tessl onboarding | 285–310 |
| 11. Three key takeaways | (1) Context engineering is still hard in 2026; (2) specs + code is a gold mine; (3) use proven third-party tools in high-risk areas; bonus: find the cracks the big labs overlook | 310–345 |
| 12. Q&A 1: Regression tests | Yes, Spec Reviewer also runs critical-flow regression checks every PR (e.g. subscription) | 345–375 |
| 13. Q&A 2: Why not generate tests instead? | Generated tests cover imaginary scenarios; specs reflect real-life intent; byproduct: teams write better specs | 375–410 |
| 14. Q&A 3: Lighter models for sub-agents? & exploratory testing | Heavy models for extraction, small agents for verification; focus on the 1–5% that matters | 410–460 |
| 15. Q&A 4: How to verify the spec was implemented as written + spaghetti code concern | Cut off by time; question only partially answered | 460–end |

Terminology glossary (speaker's own definitions)

  • Spec Reviewer — "an agent that is called spec reviewer. It's going to have access to specs… designs… and… the feature that is deployed in the staging environment or in a preview environment. And verify it."
  • Planner agent — "Planner's role is to extract the requirements from the spec and understand what are going to be the failure cases that I'm going to verify through the verification process. Only one task to extract requirements."
  • Verification agent — "going to navigate through different files through the UI through the design and understand if the specific specs that were provided by the planner were met in this feature."
  • Sub-agent delegation — "instead of one agent that is checking 10 or 12 or 15 requirements sequentially, I'm going to have 15 or 12 agents that are running in parallel. Each of them is reaching to a specific verdict and in the end there's an orchestrator that collects all the verdicts."
  • Base branch vs diff — "if we give it the diff… it's biased to the specific solution that the engineer choose to implement. But if we give it the base branch before the change, the agent is open-minded to different kinds of approach and is more critical about the solution that was chosen."
  • Scoping — "if I'm reviewing a front end feature, there's no reason to be concerned about backend issues because I'm just going to create noise that are irrelevant for this specific feature."
  • Ephemeral sandbox — "when the agent needs to validate a specific requirement, there will be a sandbox that would be running for that specific requirement with a browser session checking the specific feature."
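
The sub-agent delegation entry describes a fan-out/fan-in shape: one verifier per requirement running in parallel, then an orchestrator gathering the verdicts. The following is a minimal sketch of that shape using `asyncio`, assuming each sub-agent can be modelled as an async function. `Verdict`, `stub_verifier`, and `orchestrate` are hypothetical names; the talk does not show the real agent calls, so the verifier here is a stub.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    requirement: str
    passed: bool
    evidence: str


async def stub_verifier(requirement):
    """Stand-in for one verifier sub-agent.

    A real sub-agent would call a model and inspect the deployed feature;
    this stub only yields control once and reports a pass.
    """
    await asyncio.sleep(0)
    return Verdict(requirement, passed=True, evidence="stubbed check")


async def orchestrate(requirements, verify=stub_verifier):
    """Run one sub-agent per requirement concurrently and collect verdicts.

    asyncio.gather returns results in the order the awaitables were passed,
    so verdicts line up with the input requirements.
    """
    verdicts = await asyncio.gather(*(verify(r) for r in requirements))
    return list(verdicts)
```

Calling `asyncio.run(orchestrate(requirements))` with the planner's output returns one `Verdict` per requirement. Passing the verifier in as a parameter keeps the orchestrator independent of how any single requirement is checked.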

Named frameworks / concepts introduced

  1. Spec Reviewer architecture — Planner → parallel per-requirement Verifier sub-agents → Orchestrator collecting verdicts, with per-requirement ephemeral sandboxes for browser navigation.
  2. The two context-engineering moves: (a) "dividing between planning and execution"; (b) "delegating agentic tasks between multiple sub-agents".
  3. Grounding rule: specs are "a snapshot and what we're trying to achieve" but "the code is how we ground that agent to reality." Give it base branch + scope to the relevant layer.
  4. Risk heuristic: "If you identify… high risk areas… prefer using third party proven tools that you can use instead of exposing yourself to security issues." Concrete instance: AWS Agent Core for sandboxed browser sessions.
  5. Strategic heuristic for builders: "Look for the gaps that the big coding agents are not able to fill and build the product there."
  6. Byproduct effect ("renewable energy") — when teams know specs will be used by a verification agent, they write better specs.
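
The per-requirement ephemeral sandbox in item 1 can be pictured as a resource that is created for one check and always destroyed afterwards. Below is a minimal sketch of that lifecycle using a context manager. `FakeSandbox` is a hypothetical in-memory stand-in and does not reflect the API of AWS Agent Core or any other sandbox provider; `ephemeral_sandbox` and `verify_in_sandbox` are likewise illustrative names.

```python
import uuid
from contextlib import contextmanager


class FakeSandbox:
    """Hypothetical stand-in for a third-party sandboxed browser session."""

    def __init__(self):
        self.session_id = uuid.uuid4().hex
        self.active = True
        self.visited = []

    def visit(self, url):
        if not self.active:
            raise RuntimeError("sandbox already destroyed")
        self.visited.append(url)

    def destroy(self):
        self.active = False


@contextmanager
def ephemeral_sandbox(factory=FakeSandbox):
    """Create a fresh sandbox and tear it down even if the check raises."""
    sandbox = factory()
    try:
        yield sandbox
    finally:
        sandbox.destroy()


def verify_in_sandbox(requirement, url):
    """Check one requirement inside its own short-lived sandbox."""
    with ephemeral_sandbox() as sandbox:
        sandbox.visit(url)  # the untrusted URL is only ever opened in here
        return {"requirement": requirement, "session": sandbox.session_id}
```

Because every call to `verify_in_sandbox` builds a new sandbox, no browser state carries over from one requirement to the next, and a failure inside the `with` block still triggers teardown.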

Open questions / not covered

  • The talk does not give a quantitative evaluation of Spec Reviewer (accuracy, false positive/negative rates, cost per PR) beyond anecdote.
  • The talk does not cover prompt-level details of the planner or verifier (no prompts shown).
  • The talk does not specify which models are used where, beyond "heavy model" for extraction and "small agents" for verification, and a generic mention of "openai and anthropic ones".
  • The final audience question — "how did you make sure that the spec that you gave the agent were actually implemented the way the spec has been written?" and the spaghetti-code follow-up — was cut off by time and not substantively answered ("So I have three seconds. And it's okay.").
  • The talk does not address how Spec Reviewer handles ambiguous or contradictory specs beyond noting that "human beings aren't consistent about how the writer expects."
  • No discussion of how the orchestrator resolves conflicting verdicts between sub-agents.
  • No discussion of how designs (visual assets) are actually ingested or compared.

Speech-to-text artifacts worth knowing

  • "Tessla" → almost certainly "Tessl"
  • "vehic" → likely "veteran"
  • "Asian sessions" → "agent sessions"
  • "Father. Doc" → garbled audience-member intro
  • "twist" → likely "Twitch" or similar
  • "platinum" → "planner" (in the planner/verifier split section)
  • "fiber" → an unrecognised ticketing-system name (possibly "Fibery")
  • "Tel Aviv… slide" → likely "flight"
