WebMCP: Making Web Apps Faster and Cheaper for AI Agents

6 May 2026
9 minute read

Maximiliano Firtman

The 5-second click problem

An AI agent takes a screenshot of your web app. It runs the image through a vision model, identifies a button, calculates the coordinates, and triggers a click. Five to ten seconds have passed. In that time your JavaScript has re-rendered, the button has shifted, and the click lands on the wrong element. The agent retries. Your token bill grows.

This is how most coding agents interact with the web today. Most of us have watched it fail at scale, and most of us have seen the inference bill at the end of the month.

The conventional response is to wait for smarter models. I think that misreads the problem. The interface is the issue, not the model. WebMCP, an emerging standard API proposed by the Chrome team, works with a different contract: instead of agents reverse-engineering your interface, your interface tells them what it can do.

I'll be unpacking this live at AI Native DevCon London on June 1, with demos and a deeper look at what's still missing from the spec. If you build for the web and you've felt the cost of agent-driven traffic, the room will be worth your time. The article below is the short version.

Join us at AI Native DevCon (use C0DE30 for 30% discount)

Why I keep coming back to this

I've been building for the web since 1995, and I've spent the last three years writing about how LLMs and browsers fit together. When the first ChatGPT browsing plugin shipped, I started reverse-engineering how it actually rendered websites. That's where this argument starts.

Something has shifted in the last twelve months. Agentic browsers are no longer a research project. ChatGPT Atlas, Perplexity's browser, Microsoft Edge's agent mode, and Chrome's own agent mode all ship the same idea: an LLM that drives the browser as if it were the user. Cloud-based agents from Anthropic, OpenAI and Google do roughly the same thing through Playwright, Puppeteer or the recent Codex in-app browser.

The common assumption is that these agents become reliable by becoming smarter, with better vision models, better DOM parsers, and better selector inference. I keep finding that this story is a step too far. The bottleneck isn't intelligence. It's that the modern web wasn't built to be read by machines.

A typical React page is a tree of generic containers with no semantic meaning. The accessibility tree, which works well on native mobile because screen readers depend on it, performs poorly on the web. So agents fall back to vision on screenshots, which is slow, costly, and unreliable. WebMCP attempts to flip that relationship, and there are three shifts in mental model worth understanding before you decide whether it matters for your stack. I'll cover the headlines below and save the implementation walk-through for DevCon.

Three shifts that change how agents and web apps interact

1. The DOM is not the API

The conventional wisdom is that agents will eventually parse the DOM well enough to drive any page reliably. Smarter models, the argument goes, will close the gap.

I find this incomplete. The DOM is the output of a rendering process optimized for human eyes. In the React era, most pages are layers of generic containers and class names that mean nothing outside the framework. An agent reading that DOM is trying to reverse-engineer intent from layout, which is exactly the problem screenshots have, with extra steps.

What to do instead is stop treating the DOM as a contract and start exposing one explicitly.

A small example. An airline website asks the user to enter origin, destination, and dates. Every airline does it differently. The calendar widget is custom, the autocomplete behaves in its own way, and an agent has to figure out each one from scratch every time. With WebMCP, the page exposes a single search-flights tool with a typed input schema, and the agent calls it directly. The same logic powers the human UI and the agent path. One implementation, two consumers.
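To make that concrete, here's a rough sketch of what registering such a tool could look like. WebMCP is still a draft, so the names here (navigator.modelContext, registerTool, the result shape) are assumptions based on the current explainer drafts and may change, and searchFlights is a hypothetical stand-in for whatever client-side logic already powers the human search form.

```js
// Sketch only: `navigator.modelContext` and `registerTool` are assumed names
// from the draft proposal and may change before WebMCP ships anywhere.
navigator.modelContext?.registerTool({
  name: "search-flights",
  description: "Search for flights between two airports on the given dates.",
  inputSchema: {
    type: "object",
    properties: {
      origin: { type: "string", description: "Origin airport code, e.g. LHR" },
      destination: { type: "string", description: "Destination airport code, e.g. JFK" },
      departureDate: { type: "string", format: "date" },
      returnDate: { type: "string", format: "date" },
    },
    required: ["origin", "destination", "departureDate"],
  },
  async execute({ origin, destination, departureDate, returnDate }) {
    // searchFlights() is hypothetical: the same client-side function the
    // human-facing form already calls, so there is only one implementation.
    const results = await searchFlights({ origin, destination, departureDate, returnDate });
    return { content: [{ type: "text", text: JSON.stringify(results) }] };
  },
});
```

The agent calls search-flights with structured arguments instead of fighting the calendar widget, and the page runs the same code the human path runs.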

2. Some actions belong on the client, not the public API

The conventional wisdom is that if you want to support agents, you expose a REST API or a server-side MCP endpoint and let them skip the browser entirely.

This is incomplete because a lot of useful behavior on a modern web app doesn't live on the server. Shopping carts, drafts, in-progress configurations, partially filled forms, all of these live in the client. Payment and authentication flows like Apple Pay, Google Pay, and WebAuthn are client-side by design. Replicating all of that behind a public API often means rebuilding logic you already have, and sometimes it means breaking the security model you depend on.

What to do instead is think carefully about which actions are state-on-the-client actions and expose those through WebMCP, while keeping server-side actions on a server-side interface. The line between the two is where most of the interesting product decisions sit.

A worked example I'll show at DevCon: a checkout flow where the agent can browse and add to cart, but the page interrupts the agent and asks the human to confirm before any payment is taken. WebMCP includes a primitive for exactly that handoff, and the way it composes with the broader MCP ecosystem is, I think, the most underrated part of the spec.
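Here's a rough sketch of the shape of that handoff, under the same assumptions as the earlier example (draft API, assumed names). The spec's actual handoff primitive may look different; showPaymentConfirmation, getCart, and submitOrder are hypothetical stand-ins for the app's own checkout code.

```js
// Sketch only: assumed WebMCP names, hypothetical app functions.
// Browsing and add-to-cart tools can run autonomously; `place-order`
// hands control back to the human before any money moves.
navigator.modelContext?.registerTool({
  name: "place-order",
  description: "Place an order for the items currently in the cart.",
  inputSchema: { type: "object", properties: {}, required: [] },
  async execute() {
    // showPaymentConfirmation() is the app's own UI: an Apple Pay or
    // Google Pay sheet, or a plain confirmation dialog, shown to the human.
    const approved = await showPaymentConfirmation(getCart());
    if (!approved) {
      return { content: [{ type: "text", text: "The user declined the payment." }] };
    }
    const receipt = await submitOrder(getCart());
    return { content: [{ type: "text", text: `Order placed: ${receipt.id}` }] };
  },
});
```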

3. Your page can finally tell agents apart from humans

The conventional wisdom is that there's no reliable way to know whether your site is being driven by a person or an agent. Bot detection has historically been a fingerprinting arms race.

This was true until recently, but WebMCP changes it. The API fires events when an agent takes control of the page and exposes CSS pseudo-classes you can style on. For the first time, your web app can know it's being used by an AI and respond accordingly, without resorting to behavioral heuristics.
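The final event and pseudo-class names aren't settled, so treat everything in this sketch as a placeholder for whatever the spec ends up shipping; the shape of the idea is what matters.

```js
// Sketch only: the spec describes events that fire when an agent takes over
// and CSS hooks you can style against, but these identifiers are placeholders.
window.addEventListener("agentstarted", () => {
  document.body.classList.add("agent-driven");
  // trackEvent() is hypothetical: whatever your analytics layer exposes.
  trackEvent("agent_session_started");
});

window.addEventListener("agentfinished", () => {
  document.body.classList.remove("agent-driven");
  trackEvent("agent_session_finished");
});

// In CSS you could then target the agent-driven state, e.g.
// .agent-driven .human-only-hint { display: none; }
```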

What to do with that signal is a product decision, and one I think most teams haven't started making yet. You can design different UI for agents where useful, track agent traffic in your analytics, and decide which actions a human should confirm versus which an agent can complete autonomously.

A practical example: X recently changed its publishing flow to detect whether a real finger touched the screen, because it was tired of agent-generated content. With WebMCP you don't need that kind of fingerprinting. The browser tells you. The harder question is what you do once it does, and that's the part of the talk I'm most looking forward to.

What this means for devs

The tl;dr is that most AI agents interact with web apps the way a tourist navigates a foreign city, by squinting at signs and guessing, and WebMCP gives the city a directory.

WebMCP doesn't replace proper UI testing by coding agents, but it can make agents faster and cheaper whenever they use your web app for any other purpose.

The single takeaway: if your site already receives agent traffic, whether that's e-commerce, support, or anything else with forms, the question worth asking is which user actions you want exposed as tools and which you want to keep behind a click. That decision shapes your token bill, your conversion rate, and your trust model.

The implementation is small once you've made the decision. The decision itself is the harder part, and it's the part I'll be working through live at AI Native DevCon London on June 1. If you want to compare notes on WebMCP, agent reliability, or how this fits with context engineering, come find me there.