
What Developers Can Build Next With AI

with Baruch Sadogursky, Liran Tal, Alex Gavrilescu, and Josh Long


Chapters

Introduction [00:01:21]
Baruch on AI and Code Review [00:03:01]
Behavior-Driven Development (BDD) [00:06:19]
Automating Specifications with AI [00:11:31]
Security Concerns with AI Tools [00:19:04]
Backlog MD and Spec-Driven Development [00:31:27]
Managing Tasks and Collaboration [00:41:19]
Web Interface and AI Integration [00:46:00]
Spring AI Capabilities and Demo [00:51:01]

In this episode

In this episode of AI Native Dev, host Simon Maple and guests Alex Gavrilescu, Baruch Sadogursky, Josh Long, and Liran Tal discuss shifting from code-centric to spec-centric development in the AI era. They explore how human-readable specifications can align developers and stakeholders on intent, allowing deterministic tooling and AI to handle implementation. The panel emphasizes creating a chain of trust from specs to tests to code, ensuring AI assists without compromising quality or control.

In this episode of AI Native Dev, host Simon Maple is joined by Alex Gavrilescu, Baruch Sadogursky, Josh Long, and Liran Tal to explore a provocative shift: moving from code-centric to spec-centric development in the age of AI. The throughline is simple but powerful—developers and stakeholders can align on intent through human-readable specifications, then let deterministic tooling and bounded AI do the heavy lifting. As Baruch quips, you shouldn’t trust a monkey to write Shakespeare—or tests. Instead, build a chain of trust that starts at the spec and ends in code that demonstrably does what it should.

From Code-First to Spec-First: Reducing the “Someone Else’s Code” Tax

The conversation opens with a common anxiety: if AI writes more code, will developers stop looking at code altogether? Baruch reframes the concern. It’s not that developers hate reading code; they hate reading code that isn’t theirs. AI-generated code is, by definition, “someone else’s code,” and the novelty of reviewing LLM output will wear off. That means we need a representation of intent that is easier to agree on than raw code.

Human-readable specs fill this gap. Rather than arguing over diffs in an IDE, teams can encode product intent in a shared language that both technical and non-technical stakeholders understand. The more you formalize intent up front—and make it reviewable—the less you have to rely on brittle, after-the-fact interpretation of what the code “should” do. The hosts argue that this is how developers keep doing deep, meaningful work, even as the mechanics of code production become more automated.

This approach also creates a practical division of labor. Areas you deeply care about get explicit specs and tests. For less-critical paths, you can let the LLM infer implementations, confident that if the output goes off course, the failure points to missing intent rather than a broken engineering process.

Why TDD and BDD Stalled—and How AI Can Unstick Them

The panel gets candid about why TDD didn’t “win.” If developers are the only ones who can write and read tests, the tooling excludes the product and business voices who own intent. Developers, biased toward action, often rush to implementation and backfill tests later. The result: tests validate the code, but they never become the source of truth for the product.

BDD tried to fix this. Frameworks like Cucumber introduced Gherkin’s Given-When-Then structure to describe behavior in semi-structured natural language. The goal was inclusivity—let anyone propose and review behavior. In practice, however, the syntax felt rigid for many product folks, and the coupling between BDD specs and implementation was often loose. Specs drifted like stale PRDs, and teams returned to “the code is the source of truth.”

AI offers a way out without compromising rigor. Instead of asking product managers to handcraft Gherkin, you can feed a requirement doc to an LLM to draft initial scenarios. Yes, the model may hallucinate or miss context. But that’s acceptable if you commit to a human-in-the-loop review cycle and keep the spec readable. The order of operations changes: draft specs with AI, review and correct as a team, and then turn that intent into executable tests deterministically.
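
To make the drafting step concrete, here is a minimal sketch of feeding a requirements doc to an LLM and capturing Gherkin scenarios for human review. It assumes the OpenAI Python client; the model name, file paths, and prompt wording are illustrative choices, not details from the episode.

```python
# Sketch: draft Gherkin scenarios from a requirements doc with an LLM,
# then hand the output to humans for review. Model name, file paths, and
# prompt wording are illustrative assumptions.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

requirements = Path("docs/checkout-prd.md").read_text()

response = client.chat.completions.create(
    model="gpt-4o",       # any capable model; keep temperature low
    temperature=0.2,
    messages=[
        {"role": "system",
         "content": "You write Gherkin (Given-When-Then) scenarios. "
                    "Cover every acceptance criterion. Do not invent features."},
        {"role": "user",
         "content": f"Draft Gherkin scenarios for this requirement:\n\n{requirements}"},
    ],
)

draft = response.choices[0].message.content

# Write the draft where product and engineering can review it before
# anything is compiled into executable tests.
Path("features/drafts").mkdir(parents=True, exist_ok=True)
Path("features/drafts/checkout.feature").write_text(draft)
print(draft)
```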

A Chain of Trust: From Prompt to Spec to Compiled Tests to Code

Baruch lays out a repeatable chain of trust. Start with a prompt, requirements, or PRD-like doc. Use AI to generate Gherkin-style scenarios that capture intent in Given-When-Then form. Because these artifacts are plain language, anyone can review them. Crucially, treat this as iterative—expect to refine the spec until it reflects the product truth.

Next, convert specs into executable tests deterministically using tooling like Cucumber (JVM), SpecFlow (.NET), Behave or pytest-bdd (Python), Godog (Go), or Serenity for richer reporting. The key insight is to avoid nondeterminism at this stage. LLMs are “slightly better than random monkeys typing,” so don’t ask them to write tests. Parsing and wiring Gherkin into step definitions and fixtures should be ruled by algorithms, not generative models. Run the conversion 10 times; get the same result 10 times.
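
As an illustration of the deterministic stage, here is a sketch using pytest-bdd, one of the tools mentioned: a reviewed feature file is bound to plain Python step definitions, with no generative model anywhere in the loop. The feature text, step wording, and fixtures are hypothetical.

```python
# Sketch: deterministic compilation of a reviewed spec into executable tests
# with pytest-bdd. The feature text and step names are hypothetical; no LLM
# is involved at this stage.
#
# features/checkout.feature (path resolved relative to this test module):
#   Feature: Checkout
#     Scenario: Successful payment
#       Given a cart containing 2 items
#       When the customer pays with a valid card
#       Then the order status is "confirmed"

from pytest_bdd import scenarios, given, when, then, parsers

scenarios("features/checkout.feature")  # binds every scenario in the file


@given(parsers.parse("a cart containing {count:d} items"), target_fixture="cart")
def cart_with_items(count):
    return {"items": count, "status": "open"}


@when("the customer pays with a valid card")
def pay(cart):
    cart["status"] = "confirmed"  # stand-in for the real payment call


@then(parsers.parse('the order status is "{status}"'))
def check_status(cart, status):
    assert cart["status"] == status
```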

Finally, unleash the LLM to implement code that makes those tests pass—under guardrails. If the model tries to “cheat” by editing tests to green them, block it. Make test directories read-only, mount them as read-only volumes in containers, or enforce pre-commit hooks and CI policies that reject test modifications. The invariant is simple: tests encode intent; code must conform. If code passes, you can trust it because the tests were compiled from an agreed spec.
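
One way to enforce that invariant locally is a pre-commit hook that rejects staged changes under the test tree. The sketch below assumes a protected features/ and tests/steps/ layout and an explicit override variable for human-approved edits; both are illustrative choices.

```python
#!/usr/bin/env python3
# Sketch of a pre-commit hook (.git/hooks/pre-commit) that refuses commits
# touching the compiled-test tree. Protected paths and the override variable
# are illustrative assumptions.
import os
import subprocess
import sys

PROTECTED = ("features/", "tests/steps/")

# Files staged for this commit.
staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

violations = [f for f in staged if f.startswith(PROTECTED)]

if violations and os.environ.get("ALLOW_TEST_EDITS") != "1":
    print("Blocked: tests encode intent and may not be modified here:")
    for f in violations:
        print(f"  {f}")
    print("Change the spec and recompile the tests, or set ALLOW_TEST_EDITS=1 "
          "for a human-approved edit.")
    sys.exit(1)
```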

Human-in-the-Loop and LLM-as-Judge: Practical Review Patterns

Specs are only useful if people actually review them. Long specs are a real risk—humans skim, miss details, and wave things through. The team proposes a two-tier safety net. First, human review for high-signal sections and critical flows. Second, LLM-as-judge to cross-check coverage and consistency. Use a different model than the one that generated the specs to ask questions like: Do these scenarios cover all acceptance criteria? Are edge cases addressed? Are there contradictions or ambiguous steps?

This “model ensemble” approach turns AI into a verification tool rather than a generative authority. Treat its output as suggestions, not truth. Incorporate lightweight prompts to expose gaps in coverage, such as comparing scenarios against user stories, acceptance criteria, and known non-functional requirements (performance, security, compliance). The result is a pragmatic review loop that scales as specs grow without sacrificing human judgment.
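
A minimal judge loop might look like the sketch below: a second model compares the drafted scenarios against the acceptance criteria and reports gaps for humans to triage. The model choice, file paths, and report shape are assumptions.

```python
# Sketch: use a second model as a judge to cross-check drafted scenarios
# against acceptance criteria. Model name, file paths, and the report shape
# are assumptions; treat the output as suggestions, not truth.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

scenarios_text = Path("features/drafts/checkout.feature").read_text()
criteria = Path("docs/checkout-acceptance-criteria.md").read_text()

judge_prompt = (
    "Compare these Gherkin scenarios against the acceptance criteria.\n"
    "Answer as JSON with keys: missing_criteria, uncovered_edge_cases, "
    "contradictions.\n\n"
    f"ACCEPTANCE CRITERIA:\n{criteria}\n\nSCENARIOS:\n{scenarios_text}"
)

verdict = client.chat.completions.create(
    model="gpt-4.1",  # deliberately a different model than the drafter
    messages=[{"role": "user", "content": judge_prompt}],
)

report = verdict.choices[0].message.content
print(report)  # feed into the human review, not straight into the repo
```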

Over time, you can template this process. Maintain a cookbook of scenario archetypes—CRUD, search, auth, billing, error handling—and let the LLM propose instantiations. Bake in domain-specific step libraries so compiled tests always target stable step definitions. Your review then focuses on scenario correctness, not syntactic ceremony.
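
A shared step library could be as simple as a module of parameterized pytest-bdd steps that every scenario archetype reuses; the step wording and the api_client fixture below are hypothetical.

```python
# Sketch of a shared step library (e.g. tests/steps/common_steps.py) that
# scenario archetypes like CRUD or auth can target. Step wording and the
# api_client fixture are hypothetical; the point is that drafted scenarios
# reuse stable, parameterized steps instead of inventing new ones.
from pytest_bdd import given, when, then, parsers


@given(parsers.parse('a user with role "{role}"'), target_fixture="user")
def user_with_role(role):
    return {"role": role}


@when(parsers.parse('the user requests {method} "{path}"'),
      target_fixture="response")
def user_requests(user, method, path, api_client):
    # api_client is assumed to be a project fixture wrapping the real API.
    return api_client.request(method, path, user=user)


@then(parsers.parse("the response status is {status:d}"))
def response_status(response, status):
    assert response.status_code == status
```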

Tools, Guardrails, and a Rollout Plan Developers Can Adopt Today

The episode closes with actionable guidance to operationalize this approach. Start with a thin slice: pick one critical feature or service boundary. Draft specs from the existing PRD with an LLM, then workshop them with engineering and product. Adopt Cucumber or an equivalent in your language stack, and standardize a minimal set of step definitions for your domain to ensure the spec-to-test path is deterministic.

Enforce guardrails. Protect the test tree with OS permissions, Git attributes, and CI policies. If you generate code in a container, mount the tests directory as read-only. In your CI, require that any changes to tests come with human approval and are not authored by automation accounts. Consider mutation testing or coverage gating on the compiled tests to ensure they actually fail when behavior regresses, not just when code changes.
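
A CI-side version of the same guardrail might check that nothing under the protected test tree was authored by an automation account; the branch name, paths, and bot list in this sketch are placeholders.

```python
# Sketch of a CI check: fail the build if anything under the protected test
# tree changed and any of those commits were authored by an automation
# account. Branch name, protected paths, and the bot list are assumptions.
import subprocess
import sys

PROTECTED = ("features/", "tests/steps/")
BOT_AUTHORS = {"ci-bot@example.com", "codegen-agent@example.com"}


def git(*args):
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout


changed = [f for f in git("diff", "--name-only", "origin/main...HEAD").splitlines()
           if f.startswith(PROTECTED)]

if changed:
    authors = set(git("log", "--format=%ae", "origin/main..HEAD", "--",
                      *changed).splitlines())
    if authors & BOT_AUTHORS:
        print("Test files were modified by an automation account:", changed)
        sys.exit(1)
```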

Tune your LLM usage. Keep model temperature low for code generation to limit variance. Use a separate model for judge/critic tasks. Log prompts and outputs for traceability. Most importantly, be explicit about where you’ll accept inference. For low-risk utility functions, let the LLM implement without exhaustive specs. If something comes out wrong, treat it as an intent gap—write or refine the spec, recompile tests, and rerun. This creates a virtuous loop where missing behavior becomes visible and fixable.
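
Logging can be as lightweight as a wrapper that appends every prompt/response pair to a JSONL file; the field names and log path here are arbitrary.

```python
# Sketch: a thin wrapper that logs every prompt/response pair to a JSONL
# file for traceability. Field names and the log path are assumptions.
import json
import time
from pathlib import Path
from openai import OpenAI

client = OpenAI()
TRACE_LOG = Path("llm-trace.jsonl")


def traced_completion(model, messages, **kwargs):
    response = client.chat.completions.create(
        model=model, messages=messages, **kwargs)
    output = response.choices[0].message.content
    record = {
        "ts": time.time(),
        "model": model,
        "params": kwargs,      # e.g. temperature=0.2 for code generation
        "messages": messages,
        "output": output,
    }
    with TRACE_LOG.open("a") as fh:
        fh.write(json.dumps(record) + "\n")
    return output
```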

Key Takeaways

  • Make specs the source of truth. Capture intent in human-readable Given-When-Then form that anyone can review.
  • Keep test generation deterministic. Use Cucumber/SpecFlow/Behave and stable step libraries; don’t let LLMs write tests.
  • Let AI implement code, but protect the tests. Enforce read-only tests and CI policies so code must conform to intent.
  • Use a second model as a judge. Cross-check scenario coverage and consistency, then finalize through human review.
  • Iterate where it matters. Spec and test the risky, complex flows; allow LLM inference on low-risk areas, and backfill specs when needed.
  • Start small and template. Pilot on one feature, build reusable step libraries and scenario archetypes, and scale with confidence.

This spec-first, AI-assisted workflow keeps developers in control, invites stakeholders into the conversation, and turns LLMs into powerful, bounded tools—so you can move fast without trusting monkeys to write Shakespeare.
