My Coding Agent Lied to Me (And I Have the Screenshots)

30 Jan 2026 · 14 minute read


Baruch Sadogursky

Baruch Sadogursky is a Developer Advocate who helps developers move from vibecoding to spec-driven development, with deep experience from JFrog and now at Tessl.

Viktor Gamov and I have a problem. We can't stop talking about AI coding agents. We do conference talks about them. We argue about them in DMs. We have opinions about prompting strategies that would bore normal humans to tears.

So naturally, we decided to do a live stream where we torture ourselves in public.

The premise was simple: build the same app twice. First, pure vibecoding: tell the agent what we want and let it rip. Second, give it a structured context using Tessl's spec-driven development tile: same goal, same model, different approach.

TLDR: one of these produced an app that silently lied to users. The other one asked us what we actually want to build. Mind-blowing, I know.

(The stream, cleverly titled "Agent Johnson, Special Agent Johnson, No Relation" (wink-wink, Die Hard fans!), is coming soon, so stay tuned. But the results were too good not to write up immediately.)

What We Built (Or Tried To)

The task: an airline loyalty program assistant. You pick Delta or United, ask questions about status tiers, perks, how to earn miles, upgrade rules—that kind of thing. Simple enough that we could build it in a stream. Complex enough that getting the details wrong would be obviously bad.

I fired up Claude Code in an empty directory and typed:

Create an app that will serve as an airline loyalty program assistant for Delta and United, and will answer questions about status, perks, etc.

That's it. Pure vibes. No plan, no requirements, no architecture discussion. Just a prayer and a prompt.

Let's call this Project AirPointsHelper3000 (because we played computer games in the 90s).

AirPointsHelper3000: A Study in Confident Wrongness

The agent got to work immediately. It picked Python and FastAPI (I didn't ask for either). It scaffolded files. It hardcoded—and I cannot stress this enough—HARDCODED—the loyalty program information directly into the source code.

01.png

You know what's in that hardcoded data? Whatever was in the model's pre-training. From years ago. Before Delta restructured its Medallion program. Before United changed its PlusPoints system.
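To make the failure mode concrete, here's a hedged sketch of what hardcoded program data looks like. This is my reconstruction, not the actual generated code; the tier names, numbers, and function are placeholders, not real program rules:

```python
# Hypothetical sketch of the antipattern, not the app's actual code.
# Loyalty rules frozen at whatever the model remembered from pre-training;
# all numbers below are placeholders, not real program terms.
LOYALTY_PROGRAMS = {
    "delta": {
        "platinum": {"mqd_required": 15000, "perks": ["upgrades", "sky_club_discount"]},
    },
    "united": {
        "premier_1k": {"pqp_required": 18000, "perks": ["plus_points", "upgrades"]},
    },
}

def answer_status_question(airline: str, tier: str) -> dict:
    # No freshness check, no source attribution:
    # stale pre-training data served to the user as current fact.
    return LOYALTY_PROGRAMS[airline][tier]
```

The problem isn't the dictionary itself; it's that nothing in the code knows, or can know, when the real programs change.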

The result: the app confidently serves users misinformation about how to earn status.

But wait, it gets better.

When I tried to run it, I got this:

02.png

"Sorry, I encountered an error. Please try again."

Try again? Try what again? What will be different the second time? The third time? (I'm gullible, so I tried three times. Nothing was different.)

I hadn't set the Anthropic API key. Fair enough, my mistake. But the app's error handling strategy was essentially "something's wrong, figure it out yourself, good luck."
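The gap between useless and useful error handling is small. A minimal sketch of both, assuming a Python app like the generated one; the function names, stub, and messages are mine, not the app's:

```python
# Hypothetical sketch; names and messages are mine, not the generated app's.
import os

def call_llm(prompt: str) -> str:
    # Stand-in for the real client call; fails when the key is missing.
    if not os.environ.get("ANTHROPIC_API_KEY"):
        raise RuntimeError("missing credentials")
    return "(model response)"

def chat_vibecoded(prompt: str) -> str:
    try:
        return call_llm(prompt)
    except Exception:
        # What the app shipped: no cause, no remedy, good luck.
        return "Sorry, I encountered an error. Please try again."

def chat_with_real_errors(prompt: str) -> str:
    # Check the precondition up front and say exactly what's wrong.
    if not os.environ.get("ANTHROPIC_API_KEY"):
        raise RuntimeError(
            "ANTHROPIC_API_KEY is not set. Export it before starting the app."
        )
    return call_llm(prompt)
```

One line of precondition checking would have turned "try again" into "set this environment variable."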

So I checked the Claude Code terminal. Maybe there's a helpful error message there?

"The app is running. You can access the Airline Loyalty Assistant at localhost:8000."

"Worked for 42s."

Everything's great! Except the browser is throwing errors. The agent had no idea anything was wrong.

03.png

To set the API key, I had to exit Claude Code and use the shell directly. And when I restarted Claude Code?

It had no idea what was going on. Blank slate.

I typed "restart the app." The response: "Let me first understand what kind of application this is."

Then it searched the entire codebase. Found 100 files. Started re-ingesting everything just to understand what it had just built. Every token spent re-learning your own code is a token you're paying for twice.

04.png

Finally, API key set, app running, I asked about the Delta Platinum status. And the app confidently served me... outdated information from its pre-training data.

05.png

Check out that fine print at the bottom of the UI: "Information may not reflect the most current program terms." The app knows it might be lying. It just warns you in 10-point font and hopes for the best.

This is vibecoding in a nutshell. You get something that looks like an app. Purple gradients, chat interface, the whole AI-generated aesthetic. But underneath? Useless error handling. Silent failures. Context amnesia. Stale data served with confidence. No tests. No specs. No way to verify if it's doing what you actually wanted.

The Model Is Smart. I Gave It Nothing to Work With.

Don't get me wrong, Claude is an amazing model (especially Opus 4.5). I just gave it nothing to work with except vibes.

No requirements. No constraints. No definition of done. No way to verify correctness. I essentially said "surprise me" and then acted shocked when the surprise was bad.

Andrej Karpathy coined "vibecoding" as a joke about his weekend projects. It became Collins Dictionary's Word of the Year. And now 25% of Y Combinator startups reportedly have codebases that are 95% AI-generated.

We're speedrunning technical debt at scale.

The industry is starting to figure this out. Amazon launched Kiro, a full spec-driven IDE. GitHub released Spec Kit. There's an open-source project called OpenSpec gaining traction. Martin Fowler's team at ThoughtWorks is writing about spec-driven development as a pattern.

When Amazon, GitHub, and ThoughtWorks all converge on the same idea independently, it's not a coincidence.

Take Two: Teaching the Agent a Process

For round two, I did something radical: I gave the agent context about how to work, not just what to build.

First, I initialized Tessl in the project:

tessl init

This sets up a local MCP server that exposes "tiles" (structured pieces of context) to whatever agent you're using. Think of it as making context discoverable and consistent.

Then I installed the spec-driven development tile:

tessl install tessl-labs/spec-driven-development

This tile doesn't teach the agent about APIs or libraries. It teaches the agent a methodology. Write specs before code. Gather requirements. Get approval. Then implement. Then test.

Same prompt as before:

Create an app that will serve as an airline loyalty program assistant for Delta and United, and will answer questions about status, perks, etc. Use spec-driven development.

And just like that, our agent (same agent, same model!) had a brain transplant.

The Agent Started Asking Questions

Instead of diving into code, Claude stopped and started interviewing me.

06.png
  • What kind of interface? (Web API? CLI? Web app? Slack bot? Discord bot?)
  • What tech stack do you want? (Programming language? Framework?)
  • Which LLM provider should it use?
  • Where should the knowledge come from?
  • What types of queries should it answer?

Meanwhile, the vibecoded version just assumed "AI slop webchat" and started building.

These questions come directly from the spec-driven development tile's workflow: before writing a spec, clarify requirements.

I answered. It kept asking. By the time we were done, we had actually talked about what we were building.

If you think this demo sounds staged (two DevRel guys on a stream, what do they know about shipping production code), here's someone who's spent almost a decade securing networks at Cisco:

*Claude started asking clarifying questions about intent and design decisions.*

— João Delgado, Technical Leader for Security at Cisco, CCIE

Specs Before Code

Then it did something the vibecoded version never did: it created a spec/ directory.

07.png

Inside: structured specification documents. Overview. Functional requirements. Non-functional requirements. User flows. Even ASCII mockups of the UI.

08.png
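In this style, the spec directory might look something like the following. This layout is illustrative; the file names are my guess at the structure, not the exact files the agent produced:

```
spec/
├── overview.md                    # what the assistant is and who it's for
├── functional-requirements.md     # queries it must answer, per airline
├── non-functional-requirements.md # latency, freshness, error behavior
├── user-flows.md                  # pick airline → ask → get sourced answer
└── ui-mockups.md                  # ASCII sketches of the chat interface
```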

The agent is thinking out loud in a structured format before writing any code. A contract between us and the machine about what "done" means.

And here's the key part: it waited for me to review the specs before implementing anything.

The tile encodes a human-in-the-loop approval step. The agent doesn't just dump specs and plow ahead. It explicitly pauses for "stakeholder approval." In this case, that's me squinting at a Markdown file for a sec and saying, "Yeah, looks good."

The Implementation Actually Worked

Once I approved, the agent started building. But now it had rails.

It used a proper project generator (Quarkus archetypes) rather than ad hoc scaffolding. It discovered it needed a web search for fresh data and found Tavily (Hebrew-speaking readers will chuckle at the name, as I did). It asked about framework preferences and stuck to them.

And then, and this kinda blew my mind, it started installing more tiles automatically.

09.png

The agent identified the tech stack it was using and pulled in the relevant Tessl tiles for those frameworks. Context about Quarkus REST. Context about Quarkus dependency injection. Context about the testing framework.

This is what Tessl calls the Spec Registry—over 10,000 tiles for public libraries, versioned alongside the software they document. It's like npm for agent context. The agent doesn't hallucinate APIs because it has access to the actual documentation, current as of the version you're using.

The result? An app that:

  • Used fresh web search data instead of stale pre-training information
  • Had proper error handling (errors actually displayed)
  • Included tests (the vibecoded version had zero)
  • Followed a consistent architecture
  • Had an airline-themed UI instead of the generic AI purple gradient

10.png

We still hit some bugs: the web search integration needed an API key I had to configure, and United's anti-scraping measures gave us some trouble. But the process was different. When something broke, we knew where to look. The specs told us what the app was supposed to do. The tests told us what it actually did.

Two Types of Context (Actually, More)

Here's how to think about Tessl Tiles:

Tiles are reusable, shareable pieces of engineered context. Anything a context window can contain can be packed, shared, and reused as a tile. We keep discovering new types.

Library/framework tiles teach agents how to use specific tools correctly. The Quarkus tile. The Spring Boot tile. These prevent API hallucination and version confusion. When Spring Framework 7 drops and your model pre-training doesn't know anything about it (obviously), the tile fills that gap.

And in case you think we cherry-picked a toy example:

*I wanted to use LangChain's Agent Middleware concept. Without the tile, Claude just created its own implementation; literally no LangChain code there. With the tile, it saw the light.*

— João Delgado (yes, the same guy! he had a productive day)

Methodology tiles teach agents how to work. The spec-driven development tile doesn't care what language you're using. It cares about the process: gather requirements → write specs → get approval → implement → test.

Skill tiles are coming next. Skills are modular, executable capabilities: they can contain scripts that actually run and slash commands that invoke them. A skill tile is a distribution and installation vehicle for Claude (and other agents') skills. Suddenly, a tile doesn't just tell the agent about a tool; it actually gives the agent the tool!

More types will follow. If you can engineer it into context, you can ship it, and we can distribute it as a tile.

You need all of them working together. Library context without process yields correctly spelled chaos. Process without library context gives you well-organized hallucinations.

What We're Trying Next

The spec-driven development experiment opened a bigger question: how do we verify that what's running in production actually matches the intent we wrote in the specs?

We're calling this the "intent integrity chain"—closing the loop between what we asked for, what the agent built, and what the code actually does. Specs are step one. But we want the whole chain auditable.

Viktor and I are planning more experiments. More streams. More structured suffering in public.

Try It Yourself

The spec-driven development tile is available now. No credit card required. If you've got Cursor or Claude Code (or any MCP-compatible agent), you can install it in about thirty seconds:

tessl init
tessl install tessl-labs/spec-driven-development

Then just mention "use spec-driven development" in your prompt and watch your agent start asking questions instead of assuming answers.

Will it prevent all mistakes? No. Will it make your agent's failures more debuggable and its successes more reproducible? In my experience: yes.

Vibecoding is fun for throwaway weekend projects. For anything you actually care about, give your agent a process.

Or don't. And enjoy explaining to your users why your app is confidently lying to them about airline miles.

Baruch Sadogursky is a Developer Advocate at Tessl, where he helps developers stop vibecoding and start spec-driven development. Previously, he spent years at JFrog convincing people that artifact repositories matter. He was right about that, too.