
Why Every Developer Needs to Know About WebMCP Now
Transcript
[00:00:00] Guy Podjarny: Hello everyone. Welcome back to AI Native Dev. Today we will be jumping into the world of the web, which is actually a fun space because my background in the entrepreneurial world has been in making websites faster, especially with the rise of the mobile web. And that's where I got to meet Max, who had been building a ton along the way but at the time was optimizing.
[00:00:27] Guy Podjarny: For mobile performance, for responsive websites, for many aspects of the rise of the mobile web, and today, in the AI world. So Max, thanks for coming onto the show here.
[00:00:39] Maximiliano Firtman: My pleasure. Thank you.
[00:00:41] Guy Podjarny: So before I begin, I can kind of wax poetic a little bit about your background; it's probably better for you to describe the journey.
[00:00:47] Guy Podjarny: So tell us a little bit, just for people to have the mindset of where your information is coming from, a bit about your experience.
[00:00:55] Maximiliano Firtman: Yeah, sure. So I've been a web developer for 30 years, so that's a lot. [00:01:00] 30 years now. So I started doing websites in 1995, actually, so my first website was created in the editor.
[00:01:14] Maximiliano Firtman: So we are talking in text mode, and then I was getting into Windows 3.1, opening Netscape, and manually opening the HTML file to see how it looked. So that was the idea at the time. And yeah, in my journey, I've seen everything. [00:01:30] So from all the different design patterns and languages like ASP and PHP, then moving to the front end, Web 2.0 stuff, HTML5.
[00:01:43] Maximiliano Firtman: And I focused on mobile. I'm also a mobile developer, native iOS and native Android: Objective-C on iOS, and Java for Android, then Kotlin. And I always liked the merge of the two worlds, the mobile and the web. So I authored a couple of books on the mobile web.
[00:02:04] Maximiliano Firtman: Then, moving into the performance side of that, I authored a couple of books there as well. I also have books on JavaScript. And I actually am authoring right now, even in the AI era, an old-fashioned book. So that's going to be my 15th book.
[00:02:26] Guy Podjarny: How much are you using AI in the writing of this book?
[00:02:28] Maximiliano Firtman: Actually, no. And that's part of the deal. So I'm writing a book on the vanilla web, on how to write websites without libraries. That actually is useful in the AI world because it's better for context and things like that. So it is better for agentic AI tools to code some web apps without a lot of dependencies.
[00:02:55] Maximiliano Firtman: So that's kind of my journey. And of course, as every developer, because I'm a developer, I'm in AI right now. So there is no other way, no other option. So that's why I started doing some courses on working with GPT three years ago.
[00:03:19] Maximiliano Firtman: As soon as the API was released from OpenAI, I started doing articles and content and courses on how to integrate your websites and your apps with OpenAI APIs and how to do prompt engineering for your apps, things like that. And as soon as ChatGPT created the first browsing plugin (I'm saying ChatGPT because it was the first one), I started doing research on how that works.
[00:03:43] Maximiliano Firtman: So how is ChatGPT rendering your website? How does it work? So if you want to optimize your content for that, how do you do that? So I started writing some articles on that. Technology is evolving every day. We know that. So there is a fever right now. So I'm trying to keep updated on every new technology, design pattern, and tool that will help developers to merge into this AI world.
[00:04:14] Guy Podjarny: Yeah, amazing. And yeah, fever is a good word for it. There's definitely some intensity around the pace of change that is exhilarating and daunting at the same time. So what triggered this conversation is your recent work and focus on WebMCP, and I found that super interesting. As I think about the problem at a high level, it resonated with me that it felt like agents keep needing to reverse engineer a webpage.
[00:04:42] Guy Podjarny: They arrive there and try to find signals on it. They're doing it impressively well, but it still feels somewhat inefficient. So it really caught my attention when I saw the title. And I think you had a big tweet that got some big reach around it.
[00:05:03] Guy Podjarny: So maybe let's start the conversation by digging into WebMCP. Tell us a little bit about what it is and what problem it's trying to solve.
[00:05:11] Maximiliano Firtman: Okay, sure. So WebMCP first is an experimental API that is still not stable. It was presented by the Chrome team, so it's right now working in Google Chrome under a flag.
[00:05:25] Maximiliano Firtman: And the idea is to offer AI agents another way to use web app services. Because right now when we're talking about AI agents, we are talking about ChatGPT agent mode, Claude agent mode, or Gemini. But also we are talking about OpenClaw browsing the website, and also agentic browsers like ChatGPT Atlas, the Perplexity browser, even now Google Chrome or Microsoft Edge. They have an AI mode or agent mode that can browse the website for you.
[00:06:04] Maximiliano Firtman: But right now they're browsing the website typically in two ways. The most common way is taking screenshots and then analyzing them with an image model, saying I need to click here in those coordinates.
[00:06:28] Maximiliano Firtman: But what happens if in those [00:06:30] five to ten seconds the JavaScript moves that button? So it doesn't work. Then another screenshot is taken, and the agent says maybe the website scrolled or the content changed. So that is inefficient, both in time and in cost, because that requires tokens.
[00:06:50] Maximiliano Firtman: Okay, we need to pay for that.
[00:06:52] Guy Podjarny: How and why are they doing images versus understanding the webpage itself?
[00:06:58] Maximiliano Firtman: The other way is analyzing the DOM. The thing is, after the React world,
[00:07:13] Maximiliano Firtman: when you look at the DOM that we are shipping to the user, it's not semantic. It's just a list of a hundred divs, and a div is just a generic tag that has no semantics. For actually understanding what's [00:07:30] there, the DOM might not be useful on every website, right?
[00:07:36] Maximiliano Firtman: So, for example, on mobile apps or native mobile apps, if you want an agent to use an iOS application on your iPhone, you can also do screenshots, but they also have some kind of accessibility tree that is actually really useful, right? Because it's the same tree that accessibility tools such as screen readers are using to understand your app.
[00:08:00] Maximiliano Firtman: So I think that's pretty cool. And on the web, you can do something like that, but it seems like in some tests the results are not really good. So that's why most of them are still using the old way of taking screenshots.
[00:08:20] Guy Podjarny: Right. Yeah.
[00:08:21] Maximiliano Firtman: It makes sense. It works, but it's completely inefficient.
[00:08:26] Guy Podjarny: I think that makes sense.
[00:08:27] Guy Podjarny: I guess when you describe it, if you try to build [00:08:30] an application today, it's so dynamic. It changes all the time, but it's also really hard to decipher. Even as an expert developer, opening up that DOM and figuring out what's what without looking at the visual is actually,
[00:08:45] Maximiliano Firtman: Well, it's optimized for a human brain, actually. Right?
[00:08:47] Maximiliano Firtman: So it's optimized for that. The agent can still do that, but with screenshots, and that's where WebMCP appears as a solution. The idea is that WebMCP is kind of an API. So the web app developer will expose an API to the agent and say, "Hey agent, if you want to talk to my website, here are a couple of services that I can provide."
[00:09:16] Maximiliano Firtman: So you register a list of services or tools that you make available to the agent, and it's just JavaScript functions that you expose. So then the agent, when [00:09:30] browsing a WebMCP-capable website, will see all the services available and try to see if one of those services is suitable for the goal that it has.
[00:09:43] Maximiliano Firtman: And instead of browsing the website as a human, it will just execute those functions.
[00:09:53] Guy Podjarny: And in this context, the service provider of these functions is the webpage. It's not necessarily the thing [00:10:00] behind the screen. So services can be anything that local execution can satisfy, like clicking a button or getting some content. What are examples of services that you see people starting with?
[00:10:13] Maximiliano Firtman: The service executes on the client, in JavaScript, on the webpage. But that service can also use normal web APIs to connect to the cloud, to hardware using Bluetooth, or wherever.
[00:10:28] Maximiliano Firtman: So actually, the simplest example is an airline website. You get into an airline website, and you're searching for flights. You need to go to a text field for your origin, then your destination, and then open the calendar or type the date, and every airline is different.
[00:10:48] Maximiliano Firtman: So the calendar is different, so the agent has to understand how it works. Instead of that, you offer a tool, a [00:11:00] function that searches flights. You specify the schema of the input data. You're going to specify origin, destination, two dates, and whether it's one way or round trip. You pass a JavaScript function to the WebMCP API and say, When the agent wants to execute this, this is the function.
[00:11:26] Maximiliano Firtman: That function can do local stuff or go to your backend. You can share the function with your standard UI. There is a flag to check if an agent or user is executing this, in case you want to behave differently.
[00:11:50] Maximiliano Firtman: You can take advantage of your current architecture and just call that. It will return asynchronously. So [00:12:00] the agent will wait. In that way, you can go to the cloud, a hardware sensor, or even ask the user. For example, if you're confirming something with a cost, you can say, Hey agent, I need to ask the user.
[00:12:19] Maximiliano Firtman: The API includes a way to interrupt the agent and call the user, meaning the human user.
[00:12:30] Guy Podjarny: Yeah. You tell the agent to ask the user, or do you literally,
[00:12:36] Maximiliano Firtman: No, you ask the browser. The browser will say, "Agent, you will wait because I will ask the user." It can be a dialogue on the screen asking, "Do you confirm buying this flight?" If the user says yes, then you go back to the agent and confirm the operation.
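To make this concrete, a flight-search tool like the one Max describes might be registered roughly as follows. This is a hedged sketch: the API is experimental and its exact shape may change, and every name here (the tool name, the schema fields, the fetchFlights helper and its /api/flights endpoint) is illustrative rather than taken from a real site.

```javascript
// Illustrative stand-in for the same function the regular search UI calls.
async function fetchFlights(origin, destination, departureDate) {
  const res = await fetch(
    `/api/flights?from=${origin}&to=${destination}&date=${departureDate}`
  );
  return res.json();
}

const searchFlightsTool = {
  name: "search-flights",
  description:
    "Search available flights. Use this instead of filling in the search form.",
  inputSchema: {
    type: "object",
    properties: {
      origin: { type: "string", description: "Origin airport, IATA code" },
      destination: { type: "string", description: "Destination airport, IATA code" },
      departureDate: { type: "string", description: "YYYY-MM-DD" },
      returnDate: { type: "string", description: "YYYY-MM-DD; omit for one way" },
    },
    required: ["origin", "destination", "departureDate"],
  },
  // The agent calls this function; it can reach your backend, use local
  // state, or pause to ask the human user for confirmation.
  async execute({ origin, destination, departureDate }) {
    const flights = await fetchFlights(origin, destination, departureDate);
    // Results go back to the agent as JSON for the LLM to process.
    return { content: [{ type: "text", text: JSON.stringify(flights) }] };
  },
};

// Guarded: only register where the experimental API actually exists.
if (typeof navigator !== "undefined" && "modelContext" in navigator) {
  navigator.modelContext.registerTool(searchFlightsTool);
}
```

The point of the guard is that the same bundle keeps working in browsers without the flag: the tool object is just data plus a function, so the page's normal UI can call the same execute path directly.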
[00:12:57] Guy Podjarny: And this is relevant when the user is actively [00:13:00] behind the browser. So for agentic browsers it works, but if I ask ChatGPT or Claude something and it goes off on its own, with OpenClaw, it's a problem not interacting with me as a user, right?
[00:13:13] Maximiliano Firtman: Yeah. I guess the standard still needs to figure that out. Because this is important, WebMCP, as it is right now, is an experimental feature. It works only on visible browsers and not on headless browsers.
[00:13:33] Guy Podjarny: Got it. So that's an important distinction. For now, it's not the intent of the protocol, just the initial implementation.
[00:13:41] Maximiliano Firtman: Exactly.
[00:13:41] Guy Podjarny: And so how, if it's server-side actions, how is this different from having an API? Why wouldn't I just have an OpenAPI spec that I make available and properly link from my site, and have the agent just programmatically interact with my system?
[00:13:58] Maximiliano Firtman: Yeah. Well, that's actually a very good question, because even the same question happens in the MCP world without the web.
[00:14:05] Maximiliano Firtman: So MCP is another protocol that we use to connect AI tools with servers. Right now, even OpenClaw or other tools, for example, can connect MCP tools, but they can also just go and execute CLI tools in the terminal. So they don't need MCP. Something like that happens here on the web.
[00:14:30] Maximiliano Firtman: So if you have a RESTful API, for example, or another kind of API, or an MCP API or a CLI, then you don't need to use your website. That's a fair question. I think it has to do with the kind of project that you have. For example, some tools need authentication, or they rely on a local database that you have on the client.
[00:14:57] Maximiliano Firtman: So that tool works better in the browser context. From a security perspective, you may not want to open the API to the outside. So it's a different way to do that. But it's important to understand that WebMCP is not working automatically for your website, so you need to go to your JavaScript code and implement the API.
[00:15:22] Maximiliano Firtman: It's not going to be applied automatically.
[00:15:25] Guy Podjarny: Yeah. And I think in that sense it is the same as building an API. But I get that there's increasing functionality. That's why I perceived it initially as something in which the service provider is the page. The WebMCP call, the recipient of it, is a JavaScript function that is called on the page, and then it can do whatever is needed.
[00:15:45] Guy Podjarny: So at the very least, when you have rich functionality on the page and you don't want the agent to do the heavy lifting of uncovering it, you want to give it an agent-friendly programmatic interface to your client side.
[00:16:05] Guy Podjarny: So at the very least for that, this is novel. I guess you could have hoped the agent understands this with some sort of llms.txt or whatever, but this is a standardized way to do that.
[00:16:27] Maximiliano Firtman: Yeah, there is definitely overlap [00:16:30] with server-side APIs. But for example, one quick case where WebMCP might be better is Apple Pay or Google Pay. If you want to finish a payment process and use Apple Pay, it has to happen on the client. You're not sending credit card details over a RESTful API.
[00:16:54] Maximiliano Firtman: So that's one example where you want to take advantage of the [00:17:00] client-side architecture, and you want the agent to access that. That's why you often see shopping carts as examples of where to implement WebMCP. The shopping cart lives in the client, not the server.
[00:17:18] Maximiliano Firtman: So you can work with the client side, make an order, and then that final step goes to the server. The question of why the agent doesn't go directly to the server is fair. Right now, it has to do more with following the standard UX flow.
[00:17:45] Maximiliano Firtman: Instead of designing a web app fully optimized for agents, maybe in the future we will build systems just for agents, and then the architecture will be completely different.
[00:18:02] Guy Podjarny: Yeah, very interesting. You were saying that in the initial iteration it is focused on visible browsers.
[00:18:10] Guy Podjarny: And you're right, that introduces that sort of user interaction. So some of it is client side, some of it is user interaction, and some of it might be convenience, or a preference for what you want to provide via WebMCP versus an open API.
[00:18:27] Guy Podjarny: I'm curious about the path or the jump from there to headless browser. Per your starting point description, headless browsers also do this unpleasant task of trying to grab an image and reverse engineer a webpage in a variety of ways.
[00:18:45] Guy Podjarny: And so they would probably benefit from access to content and things like that. I guess what is your expectation in terms of the primary use cases for a headless browser?
[00:19:02] Guy Podjarny: What's your guess on when that happens?
[00:19:04] Maximiliano Firtman: Most of these agents are using Playwright or Puppeteer.
[00:19:09] Maximiliano Firtman: These tools were originally created for browser automation. So they are creating a headless browser and executing your web app there. ChatGPT initially was using a tool that converted the HTML into Markdown, so it wasn't even executing JavaScript. So the initial plugin was really simple.
[00:19:35] Maximiliano Firtman: And to be honest, if you are browsing, if you're asking ChatGPT, Claude, or Gemini to read a website and summarize it using a URL, in most situations they are still doing that. They're just downloading a markdown version of your HTML. So if your website is 100% client-side rendered, it's not going to work. It will say it can't read this website.
[00:19:59] Maximiliano Firtman: But if you're using an agent, that's the next step. The agent is typically assuming the user's role, and in that case it's using tools like Playwright or Puppeteer to act as a user. In terms of expectation, I'm not sure if the Google Chrome team will think too much about that.
[00:20:27] Maximiliano Firtman: They are targeting how to improve Chrome's agent mode, so they're working on that API. We need more companies, maybe OpenAI, now that OpenAI is the sponsor of OpenClaw, to be on board with the W3C to discuss APIs for those agents. I'm not seeing them doing that right now.
[00:20:53] Maximiliano Firtman: They were not part of the web community, so this is probably something new for them. Anthropic, which created the MCP protocol, may also be interested in getting into WebMCP. I'm not seeing that yet, but I guess that will happen at some point in the next few weeks or months.
[00:21:17] Maximiliano Firtman: And that will trigger some updates in this API.
[00:21:22] Guy Podjarny: Yeah, I think that makes sense. And I think Anthropic is actually an interesting example because they are not that strong around image analysis compared to some of their competitors. That could be interesting for them to bias in favor of doing it correctly.
[00:21:38] Guy Podjarny: You mentioned MCP as well. What's the trigger behind the WebMCP name? What is similar and what is different between WebMCP and regular MCP?
[00:21:50] Maximiliano Firtman: I think what is similar is the concept. If you look at the technical details, it's a completely different protocol.
[00:21:58] Maximiliano Firtman: So yeah. MCP works with JSON in and out. It's the standard protocol for AI agents. They work over different versions and different transports. Originally, they were working on HTTP, but using a technique that was really old, the kind you would find in an Ajax book from years ago.
[00:22:29] Maximiliano Firtman: And they were using that technique because they were not supporting sockets or WebSockets at the time. So it was kind of a long polling technique over HTTP. But they also have a socket version, like a binary socket, for local servers and clients. So when you have a local thing, you can just talk with sockets.
[00:22:50] Maximiliano Firtman: It is a completely different idea. The similar concept is that you expose your services to an LLM. Exposing the services is just typing a name of the service and then a description in English. The LLM will understand that description and the schema that you want. That's a formal schema.
[00:23:15] Maximiliano Firtman: "I really need four arguments: the first one is an integer, the second one is a string," etc. Well, the same idea is here, but you cannot just export your MCP into WebMCP. You need to write it from scratch because it's a completely different architecture. This is JavaScript.
[00:23:32] Maximiliano Firtman: WebMCP is JavaScript-based. It's actually pretty simple from a coding point of view. You just say navigator.modelContext.registerTool, and you pass three arguments: the name, the description in English (like a long description), and the function to execute. And that's all.
[00:23:52] Guy Podjarny: Yeah.
[00:23:53] Maximiliano Firtman: So the agent will actually query all those tools; it will read the descriptions. And you'll say, "Oh, I need to find, I don't know, change a flight reservation."
[00:24:06] Maximiliano Firtman: Okay, it will look into all your services; if it sees any description that matches that behavior or that tool, it will say, "Oh, this one," and it will execute that function, passing the arguments that you requested in the input schema. That's roughly how that works. And your API will return some data.
[00:24:27] Maximiliano Firtman: It can be a boolean, it can be an integer, it can be a message, or it can be an object. That object goes in JSON back to the agent that will process that with the LLM.
[00:24:38] Guy Podjarny: Yeah. I think simplicity is great; to hear that it's simple. I think it's actually probably one of the things that helped MCP also get adoption, because you create something that is very streamlined and slightly capitalizes on the fact that you can communicate all sorts of complexity in that natural language line and rely on the fact that you have a consumer there.
[00:24:58] Guy Podjarny: And same goes for the response, right? You can reply with the whole complexity of what it is that you are returning. It can be simplified because of that. In that sense, it is a lower lift than even REST APIs, which were already, in turn, a simplification.
[00:25:17] Guy Podjarny: But as compared to that, it is still simple. So it just lowers the barrier and allows more of these types of interfaces to be created.
[00:25:28] Maximiliano Firtman: Yeah, sure. Also, there is even a simpler version because you can also use the declarative version. In that case, it's just HTML, so you don't even need to write JavaScript.
[00:25:37] Maximiliano Firtman: If you have forms in your HTML, you add some attributes to the form, and you're specifying to the agent, "Hey, if you want to trigger this action, this is the form, and here you have the fields that you need to fill for triggering this action." So WebMCP also works in HTML without JavaScript. Something interesting is that it's also adding some ideas that we don't have right now.
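A minimal sketch of that declarative variant might look like the following. One hedge: the exact attribute names for the declarative form syntax are still being worked out in the experimental proposal, so the toolname and tooldescription attributes below are illustrative placeholders only.

```html
<!-- Illustrative only: attribute names are placeholders, since the
     declarative WebMCP syntax is still experimental. -->
<form action="/search" method="post"
      toolname="search-flights"
      tooldescription="Search available flights by origin, destination, and date">
  <input name="origin" type="text" placeholder="Origin airport">
  <input name="destination" type="text" placeholder="Destination airport">
  <input name="date" type="date">
  <button type="submit">Search</button>
</form>
```

The appeal of this form-based route is that an existing, working form gains an agent-readable description without any JavaScript changes.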
[00:26:03] Maximiliano Firtman: For example, as a web app developer, there is no way to know if your website is currently being managed or used by an agent. So there is no way. If you want to detect, for whatever reason, even for analytics, that it is actually an agent using your website, there is no way to actually do that.
[00:26:24] Maximiliano Firtman: So on X (Twitter), they changed something now to stop agents from publishing content on Twitter. They say that they are actually verifying if there is an actual finger touching the screen, which on the web, I mean, you cannot actually do that. You can check the touch events to see if there is a touch.
[00:26:48] Maximiliano Firtman: But basically, with any tool like Puppeteer or Playwright, you can emulate the touch, and the website will never know it's an agent. Well, now with WebMCP, we have new events that you can listen to on your website over the window object. This is technical stuff, but there is toolActivated and toolCanceled.
[00:27:15] Maximiliano Firtman: That means that the tool, or the agent (it's named "tool" within the API), is actually controlling the website. So then you can change the UI. There are also some CSS pseudo-classes that will let you change the user interface a little bit when the agent is in charge. So then, if the user is actually seeing the screen, you can tell the user that the agent is in charge, beyond what the browser UI shows. That also makes the agent visible as an entity somehow.
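As a sketch, listening for those events and flipping the UI might look like this. The event names are taken from the conversation (toolActivated and toolCanceled); their final spelling and casing may change while the API is experimental, and the agent-active CSS class is just an illustrative choice.

```javascript
// Track whether an agent is currently driving the page.
let agentInControl = false;

// Toggle any UI state you like when the agent takes over; here we
// flip a hypothetical "agent-active" class on <body>.
function setAgentInControl(active) {
  agentInControl = active;
  if (typeof document !== "undefined") {
    document.body.classList.toggle("agent-active", active);
  }
  return agentInControl;
}

// Event names as mentioned above; experimental and subject to change.
if (typeof window !== "undefined") {
  window.addEventListener("toolActivated", () => setAgentInControl(true));
  window.addEventListener("toolCanceled", () => setAgentInControl(false));
}
```

Beyond UI changes, the same flag is the first reliable signal for analytics that an agent, not a human, is using the page.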
[00:27:50] Guy Podjarny: Right. I think it's interesting to think about the drivers for adoption of people, like why would they bother writing this? I think there's one; maybe I'll go on a tangent a bit, and then I'll come back to that question, which is that there is a broader trend in the world of AI of unifying things that we used to split apart.
[00:28:10] Guy Podjarny: And so the separation of the API from your webpage is something that, in many aspects, is good architecture, right? It is the right way to do it so that the two are not coupled. In a sense here, you're building an API into your webpage or you're intentionally violating it.
[00:28:30] Guy Podjarny: But in agent land, when you start creating multiple entities that the agent has to orchestrate within them, it gets confused. And so, for instance, there's been a big rise of monorepos and pulling everything into the repo because then the agent can just load the repo and it's all within their reach.
[00:28:49] Guy Podjarny: So it's interesting to think about; one of the advantages of a WebMCP is that it puts the API, or the actions, within the webpage, and it's a more contained unit and therefore easier for the agents to interact with. But coming back a bit to my question there: what would you say are the top motivations?
[00:29:14] Guy Podjarny: I think you've alluded to a few for someone to embrace this. Who would be the first cohort to come in when this is no longer experimental? Which, this is AI land, so maybe that happens faster than we think. Who would be the first? What are the strongest motivations to put in the effort?
[00:29:41] Maximiliano Firtman: I think that the strongest motivation will appear on e-commerce websites. Whatever e-commerce website you have, you want to sell products or services. You actually don't care if it's a human or an agent; you want the money from the consumer. And so that's a motivation to offer as many tools as possible, as fast as possible, so they can purchase your products or services quickly.
[00:30:08] Maximiliano Firtman: I think that's the first one. When you look at other kinds of content, say you have a blog or a newspaper, actually, I'm seeing the opposite sometimes. I'm seeing authors actually rejecting agents: "I don't want the agents to come into my website."
[00:30:27] Guy Podjarny: Yeah.
[00:30:27] Maximiliano Firtman: Because of IP, or because the user will not get into my website. They will just get the content, and the authorship gets lost. So there is no credit for that, or not enough credit. You can still apply WebMCP there, but I don't think it will make a lot of sense in a blog or in a newspaper. But if you have anything that has to do with service support...
[00:31:00] Maximiliano Firtman: So if you have a problem with, say, your phone bill, and you are asking your agent to go to your phone company and try to solve the problem... I mean, all that work with support tickets and all that stuff can also be another pretty useful use case.
[00:31:20] Guy Podjarny: Either way, it sounds like it.
[00:31:21] Guy Podjarny: And by definition, or by design, it is oriented at interactive pages. So for pages that are predominantly content, beyond their own agenda or dislike of agents, it's probably also a little bit less useful. What about-
[00:31:36] Maximiliano Firtman: Well, if you look at the API, the declarative API works on forms.
[00:31:42] Maximiliano Firtman: So if your website has forms, then it can be a target of WebMCP. If you don't have any form on your website, it seems like you don't have a use case.
[00:31:53] Guy Podjarny: Yeah, there's no real need. What about web testing? So, the other entity that suffers from the ever-changing visuals of websites is the pursuit of end-to-end testing that doesn't require you to rewrite your tests with every pull request.
[00:32:11] Guy Podjarny: Is there a use case for WebMCP there? It can't do the visual testing, but it can do navigation, to the extent that it's the same form that is now marked as a tool, or the same JavaScript function that is called by both the webpage and the tool?
[00:32:34] Maximiliano Firtman: I don't think it's going to... I mean, right now when you're doing end-to-end testing, we are talking about using tools like Playwright and Puppeteer. Those are the same tools as the agents are using these days to browse the website.
[00:32:49] Maximiliano Firtman: But I think that WebMCP is more for unit testing, because you're actually testing a function and seeing if the function is working properly. It looks more like that than the end-to-end human user testing because they have to do with how everything looks. So I think that you will still need to do that.
[00:33:10] Maximiliano Firtman: That doesn't mean that you cannot start using LLMs and maybe local image models to improve the speed and accuracy of that testing. Because it's also possible that right now, some very small models can actually do web testing pretty well, even locally without consuming cloud tokens.
[00:33:42] Guy Podjarny: Yeah, super interesting. So let's actually shift into this local model's path. Just to wrap up a bit on WebMCP, it sounds like everyone should try to keep up on this sort of content. So if you have elaborate web applications, you should stay on top of it. It's in experimental mode, which, again, I assume it will graduate out of, appropriate to the pace of change that is in AI. That's probably a reasonable bet. It is currently focused on interactive websites and visible browsers.
[00:34:21] Guy Podjarny: And I guess the people that need to be most attuned to it are probably people that are either looking to block bots and things like that, like that post on X, or people that want to enable browser-side functionality like purchasing in e-commerce. Does that sound like a good summary?
[00:34:35] Maximiliano Firtman: Yeah, sounds like a good summary.
[00:34:36] Maximiliano Firtman: If you want to try it now, you need Chrome 146, you need to enable the flag, and you need to install an extension called "Model Context Tool Inspector" that will let you debug your WebMCP code. It's pretty simple, so you don't need to spend like two months of training. It is actually a pretty simple transition.
[00:34:57] Guy Podjarny: Yeah. Probably the harder question is to think about this mindset and say, "What are the actions that you want and what are the user flows, as opposed to the technical complexity of it?"
[00:35:05] Maximiliano Firtman: Yeah, correct.
[00:35:07] Guy Podjarny: So let's move a bit to that model. The other topic that I've seen you discuss and track at the cutting edge of is this whole world of models within the browser.
[00:35:21] Guy Podjarny: We talk about giving an interface now to an external browser that is agentic, but this is literally within the webpage. So first of all, tell us about it. I know there are two modes there, right? Ship your own model versus something that's in the browser. What is this world of browser-built models?
[00:35:42] Maximiliano Firtman: As of today, most, if not all, AI-based APIs or apps powered by AI are using the cloud. I mean, it seems like the normal situation. You're consuming tokens from Gemini, from OpenAI, from Claude, and from whatever cloud provider that is actually deploying your models in the cloud, and you're paying for that.
[00:36:09] Maximiliano Firtman: That's cloud AI; that's the standard today. But there is another way that is growing every year that right now is called Web AI. I'm not sure if all developers are using that name or "client-side AI." And with Web AI, the idea is that you will execute AI models, even LLMs, locally on users' devices.
[00:36:37] Maximiliano Firtman: We are not talking about executing locally on your web server; we're talking about the client, the user's device. And for that, maybe you're thinking, "Well, but we don't have the power of ChatGPT." Yeah, of course. We don't have the power of the latest Opus 4.6. But the thing is that in a lot of cases, you don't need that power.
[00:36:58] Maximiliano Firtman: Mostly when you are integrating AI in a web app, maybe you just want to, for example, categorize a post. You want to check if the user is using any hate in the text or if they're adding insults or whatever; you want to filter that. And you don't need the latest Opus 4.6 to do that. You can use a very small model that works perfectly for that, and it's going to be cheaper for you because you don't need to pay for the cloud.
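A client-side check like that could be sketched as follows, assuming a browser that exposes an on-device model through Chrome's experimental Prompt API (the LanguageModel global, currently behind flags). The API shape may still change, and the prompt wording here is just an illustration.

```javascript
// Hypothetical sketch: categorize a comment with a small on-device model.
// Assumes Chrome's experimental Prompt API (LanguageModel); in any other
// environment the function simply reports the feature as unavailable.
async function moderateComment(text) {
  if (typeof LanguageModel === "undefined") return null; // unsupported

  const session = await LanguageModel.create();
  const verdict = await session.prompt(
    `Answer only "ok" or "flag". Does this comment contain insults or hate speech?\n\n${text}`
  );
  session.destroy();
  return verdict.trim().toLowerCase();
}
```

In a supporting browser this runs entirely on-device, so there are no cloud tokens to pay for; elsewhere you would fall back to a server-side check.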
[00:37:34] Guy Podjarny: Just to clarify, this is about a model in the web page, so it's different. Part of this "don't pay for the cloud, run locally" value is already available by running decent-sized models (not Opus, but decent) on your machine through Ollama or LM Studio.
[00:37:52] Maximiliano Firtman: Yeah.
[00:37:53] Guy Podjarny: But this one is something that's in between? Or what's another variant?
[00:37:59] Maximiliano Firtman: To be honest, it's the same. The difference is that with Ollama or LM Studio, you are downloading the model to your computer, and in this case, the models will be downloaded and executed in the browser for the user without the user even knowing that.
[00:38:14] Maximiliano Firtman: So the user doesn't need to install anything. The user doesn't need to understand models. The model is already on the user's computer, and that's the first mode to execute this: a built-in model that, in the future, will be available in a lot of browsers.
[00:38:37] Today only Chrome supports that feature, built-in AI. So if you have Chrome right now on desktop, and this is desktop-only right now, Android is coming, Chrome can download a small version of Gemini, Gemini Nano, and it's available client-side to infer from any website.
[00:39:03] Maximiliano Firtman: Gemini Nano is even smaller than Gemini Flash, the one that is on the cloud. But for a lot of tasks, for example, you can translate content, and it's a pretty good translation. I think the test is like there are 25 languages where that passes the test, so it's pretty good, and you can translate content completely client-side just by using a JavaScript API that talks to a local model that is already in the user's Chrome browser.
[00:39:34] Guy Podjarny: And you said Chrome can install it; if I'm writing a webpage, can I assume that it's there?
[00:39:40] Maximiliano Firtman: No, because by default when you install Google Chrome, the model is not there. The first website that requests this API will trigger the download and install the model on the user's device, and then it will be available for that website and for any other website.
[00:40:01] Guy Podjarny: How big is it? What is-
[00:40:03] Maximiliano Firtman: I think it's around eight gigabytes, something like that. So that's why it's not coming with Chrome.
[00:40:08] Guy Podjarny: Yeah, it's not a small investment.
[00:40:10] Maximiliano Firtman: Exactly. But once it's downloaded, it's between four and eight gigabytes because, actually, there are three models. Based on what you are requesting, it will download one or the other. So yeah, it's more than Chrome itself; that's why it's not built into Chrome yet. But maybe in the future it's going to be built into the OS, so maybe we are heading towards that future.
[00:40:35] Maximiliano Firtman: Then the browser will just ask the OS, "Hey, do you have a model?" and the OS will say, "Yes, we have these models." And then the website will just execute it locally. I think that we are heading towards that future. And so that's one way to execute models locally: to use built-in APIs. Today that is Chrome-only, which is good but not good enough to say, "I will use this and nothing else."
[00:40:58] Maximiliano Firtman: And the other way is to use libraries. Today there are some low-level APIs in every browser, including Safari and Firefox: WebAssembly to use the CPU, WebGPU to use the GPU, and now, on some operating systems, WebNN (Web Neural Network). If you have a computer with a special AI chip inside, you can actually use it from JavaScript.
[00:41:34] Maximiliano Firtman: On top of these three APIs that are low-level, there are a lot of open-source libraries. There are like eight different libraries now from Chrome and from different providers that you can use that can run models using CPU, GPU, or AI chips like TPUs or NPUs.
[00:41:57] Maximiliano Firtman: And that means that your JavaScript, your web app, can download an open-source model. You can download Llama; you can download Gemma from Google. Now you have versions that are around half a gigabyte. You can download that in JavaScript and then execute it client-side on an iPhone; it works in Safari, right on the website.
[00:42:28] Guy Podjarny: Yeah.
[00:42:28] Maximiliano Firtman: So again, it depends on what you need to do. Maybe it's not for use as a therapist or for teaching users history, because those models are not very good at that. But for summarizing, categorizing, or supporting chatbots, you can create your mini RAG.
[00:42:53] Maximiliano Firtman: RAG is an architecture that lets you connect your data with an LLM, so you can make a support chatbot that talks to your own information with those LLMs. In that case, the inference happens client-side.
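The retrieval half of the mini RAG Max describes can be sketched in a few lines of plain JavaScript. This is an illustrative toy, not a real library: it uses crude keyword overlap where a production setup would use an embedding model, and `buildPrompt` is a made-up helper name.

```javascript
// Toy retrieval step for a client-side RAG support bot: score each
// help-center snippet against the user's question, keep the best
// matches, and build a prompt for whatever local model is available.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Crude keyword-overlap score; a real setup would use embeddings.
function score(query, doc) {
  const queryTokens = new Set(tokenize(query));
  return tokenize(doc).filter((t) => queryTokens.has(t)).length;
}

function retrieve(query, docs, k = 2) {
  return docs
    .map((doc) => ({ doc, s: score(query, doc) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .filter((r) => r.s > 0)
    .map((r) => r.doc);
}

function buildPrompt(query, docs) {
  return `Answer using only this context:\n${retrieve(query, docs).join("\n")}\n\nQuestion: ${query}`;
}
```

The resulting prompt would then be fed to the in-browser model, with a cloud call as the fallback path discussed next.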
[00:43:14] Guy Podjarny: Right.
[00:43:15] Maximiliano Firtman: So it's executed client-side, and you can always fall back to a cloud API. Now Google is offering a Firebase API that will execute client-side and, as a fallback, execute a model server-side.
[00:43:31] Guy Podjarny: If for some reason-
[00:43:33] Maximiliano Firtman: Yeah, if the model cannot be executed client-side.
[00:43:39] Guy Podjarny: I think that's super interesting. As you think about future capabilities and use cases for it, the numbers sound a little bit scary until you compare them to, like, a YouTube video, right? It's not really that massive a deal to download half a gigabyte or something like that, though it's not something you would want every page load to require.
[00:44:05] Maximiliano Firtman: But remember that you can also use APIs to store the model offline. So if it's a recurring user, it's just a one-time download.
[00:44:14] Guy Podjarny: But it's an interesting opportunity. One optimization or value proposition that you mentioned is cost: you might not need to worry about everybody hammering your support bot or your translation bot. The second is probably some element of latency, though I guess that depends; sometimes the cloud servers will be faster.
[00:44:36] Maximiliano Firtman: It depends on the case, actually. But on most machines, the latency is probably comparable. It depends on how much text you're sending. If it's very small, like you are adding a label to something, sometimes the local model works really fast.
[00:44:59] Guy Podjarny: Yeah. And then I guess on top of that, there's also the privacy or the security aspect of it. Some SaaS applications allow certain functionality to happen only on the client side; there are a bunch of these end-to-end encryption setups where the server never sees the data that might get translated, and therefore you cannot use cloud LLMs on it. But if you're running them locally, those might be reasonable, right? Within WhatsApp, exactly.
[00:45:26] Maximiliano Firtman: Yeah. Also, we need to maybe stop thinking of LLMs as the only solution. There are models that are specifically targeting one use case. For example, Apple released an open-source model to do some kind of image detection.
[00:45:45] Maximiliano Firtman: You can point the camera at something, and it will tell you, "Oh, that's Icebreakers Mint Cinnamon." I like those ones. And that model is maybe 200 megabytes, running locally client-side. So now you can have a recognition system that detects objects completely locally.
[00:46:06] Maximiliano Firtman: It opens a lot of opportunities for a lot of apps and web apps that can run even without internet, just offline. That can be very specific for some use cases, and it actually works. You don't need LLMs; you need a model that was specifically created for that purpose.
[00:46:27] Guy Podjarny: Yeah, it's interesting. You probably needed an LLM in the ecosystem to create those models, because they are probably distilled versions of a large-
[00:46:36] Maximiliano Firtman: Yeah, sometimes they are.
[00:46:37] Guy Podjarny: That's been optimized. But, you know, I guess in this case it's neither "L" nor "L," right? It's neither large nor language, necessarily.
[00:46:46] Guy Podjarny: Yeah. It is kind of an image model, which I guess is a language model of sorts.
[00:46:49] Maximiliano Firtman: But on that point, there is a new version, for example, of Qwen, one of the Chinese model families, with 0.5 billion parameters. That one is really flexible, and you can fine-tune it.
[00:47:06] Maximiliano Firtman: Then you can have a very small LLM that is pretty bad with facts, but you can fine-tune that model on your computer. You don't need a very large computer to fine-tune that with your information. And then you have a pretty decent small LLM that knows about your stuff and can be executed locally on every device, even mobile phones.
[00:47:34] Guy Podjarny: Yeah, I think that is important in terms of the trajectory of the industry. Because, I mean, I would say that today you're probably on the cutting edge if you're doing that, and you need to be quite thoughtful. Of all the people listening and engaging with these technologies, probably only a very small fraction have use cases right now where the reward truly matches the effort.
[00:48:01] Guy Podjarny: But I do believe that open models are getting much better, and I don't think I'm unusual in that view. They will keep getting better at "AI speed," right? They tend to lag six-ish months behind the frontier. Fine-tuning might further close the gap if you have a specific enough use case.
[00:48:22] Guy Podjarny: But even without it, I think as a competency to build, if you're building a new business, if you're building a new interaction approach for your application, these are probably core capabilities you should keep in mind: thinking about what needs to go to the server and what goes on the client.
[00:48:39] Guy Podjarny: I guess it's not dissimilar to the thought of responsive websites, but then even more like the progressive web apps that run on the client side, where a lot of server-side functionality has moved to the client.
[00:48:57] Maximiliano Firtman: Yeah, that's right.
[00:48:58] Guy Podjarny: Super interesting. Are there examples of who you've seen out there that are most interesting in terms of using client-side AI right now?
[00:49:09] Maximiliano Firtman: I've seen a lot of interest around customer support bots, because companies found that their token bill can get pretty high. Especially when users are hacking the prompts, at some point they lose control of the money they're spending. And I've seen that they are interested in seeing if they can replace that with client-side models.
[00:49:10] Guy Podjarny: Yeah.
[00:49:11] Maximiliano Firtman: If you want to hack your prompt, go ahead; it's your computer. You're hacking your own model, so I don't care if you do that. I think it's about cost at some point, right? Because everything is fine when you're doing your first project; you release your MVP, maybe it's 50 bucks, and you are okay.
[00:49:42] Maximiliano Firtman: But when you are scaling that, maybe you receive a bill of $100,000 for tokens that you need to pay. And when you look at that bill, you say, "Well, let's see if we can cut this without changing the quality of the service by moving some parts to the client."
[00:50:30] Guy Podjarny: Yeah, it's so interesting. It's like an escalation level, right? You're talking to the frontline support, which is your local model, and then it's like, "No, no, let me talk to your boss. Let me go to the cloud side to get it."
[00:50:48] Guy Podjarny: It's also interesting how support has become a bit of a role model for identifying the trajectory of application use for LLMs. You can probably debate software development, but you see the pattern there as well. In support, you definitely saw that be the first role to, I guess you could say, be displaced.
[00:51:11] Guy Podjarny: In software development, I have this intro slide for Tessl where I talk about agent development traits. One of the ones I have on there is that it's "cheap and expensive": it's initially cheap because a single person can do so much with it.
[00:51:24] Guy Podjarny: And then you get the bill and it's just like, "Do I need to be spending that much? Can I do this a bit more cost-effectively?" But most people are not there yet; most people are still in the discovery phase. It's interesting to think about support being the forerunner there.
[00:51:39] Maximiliano Firtman: This is getting better and better every year.
[00:51:42] Maximiliano Firtman: Still, although we don't have hard data, I think that probably less than 1% of companies are using client-side AI. But it's getting better and better in terms of quality, cost, and performance.
[00:52:11] Maximiliano Firtman: And other browsers may follow. Maybe Safari, even without the new Apple Intelligence contract with Google, will also do something similar and get a local model. If that happens, well, maybe we will see more built-in AI APIs in web apps.
[00:52:25] Guy Podjarny: Yeah, I think that makes perfect sense. It's almost hard to imagine how that doesn't work. I guess one thing we didn't talk about in the context of web AI is the sandboxing of the browser.
[00:52:36] Guy Podjarny: I guess if you operate in this fashion, as compared to an Ollama or whatever it is, you're still working within the sandbox. If you have two tabs, barring the sharing capabilities that browsers support, each of them will have its own copy, its own data set. So one malicious website wouldn't be able to siphon off another tab's information.
[00:53:08] Maximiliano Firtman: Correct. Also, we need to remember that the LLM is still a black box: you send an input and it gives you something back.
[00:53:16] Maximiliano Firtman: In this case, two websites will send different inputs and they're not tied together in any way; it's just two prompts to the same LLM. So the security thing is going to be the same with or without the LLM.
[00:53:33] Guy Podjarny: Yeah. But it's interesting to think about. All of this conversation really revolves around the browser as the AI sandbox: since the browser is the entry point to the world, to your systems, and to your data on so many fronts, how do you get it to interact with agents?
[00:53:44] Guy Podjarny: Increasingly we have the rise of agentic browsers. In the WebMCP land, you create better interaction points, many of which revolve around security, authentication, and purchases; it's about trust delegation and sensitive actions that you might bump up for review.
[00:54:12] Guy Podjarny: I think that's very interesting. And then you can augment all of those with some local LLM action that is part of your browser, which is independent of the other path. The browser will have LLM capabilities itself.
[00:54:30] Guy Podjarny: So maybe in the first layer, it's more about a brokering or an interface layer, but increasingly it might actually embody the engine to perform activities. From a website builder perspective, first you need to think about how agents interact with your site, and then about how you embed agentic functionality literally into your web app.
[00:54:51] Maximiliano Firtman: Yeah. Also, I think this will become more important, because even when some people think, for probably the nth time, that the web is dying, if you think about it, all these new apps that are appearing because AI is coding them are web apps. So actually, the web is growing, not dying.
[00:55:15] Maximiliano Firtman: I think that means that we will have more opportunities than even before to actually start thinking and using this new architecture.
[00:55:25] Guy Podjarny: Yeah. Super interesting. So I guess, you know, maybe to close off, you can kind of take out your crystal ball. We've sort of seen the journey of how responsive web applications, I'm not even sure if I'm using the modern name, moved a little bit away from the web.
[00:55:44] Maximiliano Firtman: Rich Internet Applications (RIA).
[00:55:47] Guy Podjarny: Rich Internet, right. Not the RWAs? Okay. Rich Internet applications have become the norm in many facets of applications; they have become the way to build many apps, and a lot of functionality has moved to the client.
[00:56:05] Guy Podjarny: I guess, if you dare to cast your eyes three years out, even in the world of AI, do you envision there’s going to be applications that really innovate and that this will become the norm to have substantial in-browser, in-website LLM functionality?
[00:56:25] Maximiliano Firtman: So, I think that in the future we will see fewer native apps, at least on mobile devices. Maybe on the desktop there is now, with AI, a new trend where a lot of people are building native apps, which is a security problem.
[00:56:42] Maximiliano Firtman: But anyway, in the mobile world, I think there will be more web apps actually in the future. Maybe without using the term, it's just a link; it's just a QR code; it's just an interface that users are using or agents are using. And maybe we are not going to call them web apps anymore, just "AI apps" or "apps."
[00:57:07] Maximiliano Firtman: Whether you call them "AI-coded apps" or something else, they're going to be web-based. On that note, I think that the price and the quality of local AI will improve a lot in the next few years, which will trigger a lot of new use cases for local AI. Not all of them will be local, and not all of them will be web apps, but I see a lot of them moving into a future that will be faster and probably more performant.
[00:57:39] Guy Podjarny: Yeah. Super interesting. And it's funny how every time the web gets stronger, it raises this debate of, "Why do you even need native? Can I just have it be the web?" I guess there's a claim here that if functionality becomes more agentic, if the interfaces actually get simpler because they get very chat-driven or natural, like API-based, you're picking up that gauntlet again. You're throwing down the glove, mixing my metaphors here, and saying you think this debate will come up again and the web will win.
[00:58:22] Maximiliano Firtman: Yeah, I think so.
[00:58:24] Guy Podjarny: Cool. Max, thanks a lot for coming in. Super interesting conversation. I think it's an exciting new frontier for the web. It sounds like it's still, today, probably for the pioneers and maybe for the e-commerce shops or those who really care about controlling agentic behaviors.
[00:58:44] Guy Podjarny: But again, this is AI pace, so it's probably going to become relevant to many, many more very quickly. So thanks for sharing the great information and coming here on the podcast.
[00:58:55] Maximiliano Firtman: My pleasure. Thank you.
[00:58:57] Guy Podjarny: And thanks everyone for tuning in, and I hope you join us for the next one.
Chapters
In this episode
An agent cannot read your website. And that needs to change.
In this episode of AI Native Dev, Guy Podjarny sits down with Maximiliano Firtman, 30-year web developer and author of 14 books, to talk about what building for the web looks like when traffic comes from agents and humans both.
They get into:
- why AI agents taking screenshots of your website is inefficient and expensive
- what WebMCP is and how it gives agents a direct API into your website
- how a 200MB Apple model running offline is opening up a whole new category of web apps
- why every vibe-coded app is a web app and what that means for the future of the web
Your next visitor might not be human. Are you ready for that?
WebMCP and Client-Side AI: The Browser as Agent Interface
Agents navigating websites today face an awkward problem: they have to reverse-engineer interfaces designed for human eyes. Screenshots get analyzed, coordinates get calculated, and buttons get clicked based on visual inference. It works, but it is expensive, slow, and fragile. When JavaScript moves a button between screenshots, the whole process breaks down.
In a recent episode of the AI Native Dev podcast, Guy Podjarny sat down with Maximiliano Firtman, a web developer with three decades of experience who has been tracking the intersection of web technologies and AI since the early ChatGPT days. The conversation explored two emerging capabilities: WebMCP, which gives agents a programmatic interface to websites, and client-side AI, which runs models directly in the browser.
Why Agents Struggle with Modern Websites
The current approach to agent browsing involves either taking screenshots and using image models to interpret them, or parsing the DOM directly. Neither works particularly well. Screenshot-based approaches suffer from timing issues and consume significant tokens. DOM parsing struggles because modern web development, particularly the React era, has produced pages full of non-semantic divs that offer little meaning to automated systems.
"When you look at the DOM that we are shipping to the user, it's not semantic," Max explained. "It's just a list of a hundred divs. So understanding what's there, the DOM might not be useful on every website."
Accessibility trees offer one alternative, similar to how screen readers navigate applications. But tests have shown inconsistent results, pushing most agent implementations back to the inefficient screenshot approach. WebMCP emerges as a potential solution: let website developers explicitly expose functions that agents can call directly.
How WebMCP Works
WebMCP, currently experimental in Chrome 146 behind a flag, allows web developers to register JavaScript functions as tools that agents can discover and invoke. The API is straightforward: call navigator.modelContext.registerTool with a name, a natural language description, and the function to execute. Agents browsing a WebMCP-enabled site see available services, match them against their goals, and call them directly rather than navigating the visual interface.
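Based on that call, registering a tool might look like the sketch below. Since WebMCP is experimental and behind a flag, the option names here (description, inputSchema, execute) and the feature-detection pattern are assumptions that may differ from the final spec; the fallback Map is purely illustrative.

```javascript
// Hedged sketch of WebMCP tool registration. The real call is
// navigator.modelContext.registerTool (experimental); the option
// shape is assumed, and fallbackTools is our own illustrative stub
// for browsers that don't support WebMCP.
const fallbackTools = new Map();

function registerAgentTool(tool) {
  const ctx = typeof navigator !== "undefined" ? navigator.modelContext : undefined;
  if (ctx && typeof ctx.registerTool === "function") {
    ctx.registerTool(tool); // real WebMCP path (Chrome behind a flag)
    return "webmcp";
  }
  fallbackTools.set(tool.name, tool); // everywhere else: no-op registry
  return "fallback";
}

const where = registerAgentTool({
  name: "add_to_cart",
  description: "Add a product to the shopping cart by SKU.",
  inputSchema: { type: "object", properties: { sku: { type: "string" } } },
  async execute({ sku }) {
    // Runs client-side, so it can touch localStorage, the cart,
    // and the existing authenticated session.
    return { ok: true, sku };
  },
});
```

The natural-language description is what lets a browsing agent match the tool against its goal, so it deserves as much care as the function itself.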
The functions execute client-side, which opens interesting possibilities. They can interact with local storage, trigger payment flows like Apple Pay that must happen in the browser, or maintain user authentication context. There is also a declarative HTML version where forms can be marked up with attributes that expose them as agent-callable actions without writing JavaScript.
This differs from traditional APIs in important ways. A REST endpoint requires the agent to manage authentication, understand the API structure, and make server requests. WebMCP functions live within the authenticated browser session, can access client-side state, and follow existing user flows. For shopping carts that live in the browser rather than the server, or payment processes that require client-side verification, WebMCP provides capabilities that server APIs cannot.
The specification also introduces new events: toolActivated and toolCanceled fire when agents take control, letting developers adjust the UI or notify users. CSS pseudo-classes allow styling changes when agents are operating. This creates transparency that does not exist with current browser automation approaches.
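Listening for those events could look like the sketch below. The event names come from the spec discussion above, but where they fire (window here) and the banner logic are assumptions for illustration; the EventTarget fallback just makes the sketch runnable outside a browser.

```javascript
// Sketch of reacting when an agent takes over, using the
// toolActivated / toolCanceled events. The dispatch target (window)
// is an assumption; the EventTarget fallback is for non-browser runs.
const bus = typeof window !== "undefined" ? window : new EventTarget();

let agentActive = false;

bus.addEventListener("toolActivated", () => {
  agentActive = true; // e.g. show an "agent is acting for you" banner
});
bus.addEventListener("toolCanceled", () => {
  agentActive = false; // restore the normal UI
});

// Simulate an agent session (in the browser, the UA dispatches these).
bus.dispatchEvent(new Event("toolActivated"));
const duringAgent = agentActive;
bus.dispatchEvent(new Event("toolCanceled"));
```

Paired with the agent-aware CSS pseudo-classes, this is what gives users visibility into automated control of the page.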
E-Commerce as the First Adoption Vector
The conversation surfaced e-commerce as the likely first major use case. Retailers want sales regardless of whether the buyer is human or agent. Exposing search, cart management, and checkout as WebMCP tools makes purchases faster and more reliable for agentic shoppers.
"You actually don't care if it's a human or an agent; you want the money from the consumer," Max observed. "That's a motivation to offer as many tools as possible as fast as possible so they can purchase your products or services quickly."
Content sites face different incentives. Publishers worried about agents extracting content without driving traffic may resist making their sites more agent-friendly. But for any site with forms, transactions, or service interactions, the value proposition is clearer. Support ticket systems, reservation platforms, and configuration interfaces all stand to benefit from structured agent access.
Client-Side AI: Models in the Browser
The second major topic was running AI models directly in user browsers rather than making cloud API calls. Chrome now supports built-in AI through Gemini Nano, a smaller model that downloads on first use (around four to eight gigabytes) and remains available for subsequent requests from any website.
Beyond built-in models, open-source libraries can load models like Llama or Gemma using WebAssembly for CPU, WebGPU for graphics processors, or WebNN for dedicated AI chips. These approaches work across browsers, including Safari and Firefox, enabling inference without any cloud dependency.
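Before picking a library, a page can feature-detect which of the three low-level paths the environment exposes. The detection points (WebAssembly, navigator.gpu for WebGPU, navigator.ml for WebNN) are the real entry points; the preference order is just one reasonable choice.

```javascript
// Probe which inference backends this environment exposes:
// WebAssembly (CPU), WebGPU via navigator.gpu, WebNN via navigator.ml.
function detectInferenceBackends() {
  const nav = typeof navigator !== "undefined" ? navigator : {};
  return {
    wasm: typeof WebAssembly !== "undefined", // all modern browsers
    webgpu: "gpu" in nav,                     // Chromium, newer Safari
    webnn: "ml" in nav,                       // behind flags on some platforms
  };
}

// A model loader might prefer WebNN, then WebGPU, then WebAssembly.
const backends = detectInferenceBackends();
```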
The use cases differ from general-purpose chat. "In a lot of cases, you don't need that power," Max noted about frontier models. Translation, content categorization, hate speech detection, and support chatbots can run on much smaller models. A 500-megabyte model running locally can handle specific tasks that would otherwise require cloud tokens.
The cost implications are significant. Support bots in particular have driven interest because token bills can escalate quickly, especially when users discover prompt manipulation. Moving inference client-side eliminates per-query costs entirely. "If you want to hack your prompt, go ahead; it's your computer," as Max put it. "You're hacking your own model, so I don't care if you do that."
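The "local first, cloud as escalation" pattern behind those savings can be sketched as a small router. Both model functions below are illustrative stand-ins rather than real APIs, and the task names and size threshold are arbitrary assumptions.

```javascript
// Sketch of local-first inference routing: try the small in-browser
// model for cheap, bounded tasks; escalate to a cloud API when the
// local model is missing, the task is heavy, or local inference fails.
function makeRouter({ localModel, cloudModel, maxLocalChars = 2000 }) {
  return async function run(task, text) {
    const cheapTask = task === "categorize" || task === "translate";
    if (localModel && cheapTask && text.length <= maxLocalChars) {
      try {
        return { via: "local", result: await localModel(task, text) };
      } catch {
        // Any local failure falls through to the cloud path below.
      }
    }
    return { via: "cloud", result: await cloudModel(task, text) };
  };
}
```

Every request answered via the local path is a request that never hits the token bill, which is exactly the escalation model described in the episode.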
Security and Privacy Advantages
Client-side execution also enables functionality that cloud APIs cannot safely support. End-to-end encrypted applications cannot send user data to external servers for processing. Running models locally keeps sensitive content within the browser sandbox, opening possibilities for AI features in privacy-focused applications.
The browser sandbox provides security boundaries that desktop AI tools lack. Each tab operates independently. Malicious websites cannot access another tab's model interactions. The same isolation that protects web browsing extends to local AI inference.
The Browser as AI Sandbox
The conversation pointed toward a future where browsers serve as comprehensive AI interfaces. WebMCP provides structured ways for external agents to interact with web applications. Client-side AI provides local inference capabilities for web applications to use. Together, they position the browser as a key platform for agentic computing.
For web developers, the practical implications are emerging. E-commerce sites should track WebMCP development and prepare to expose purchasing flows as agent-callable functions. Applications with significant token costs should evaluate whether client-side models can handle specific tasks more economically. Anyone building interactive web applications should consider how agents will navigate and interact with their interfaces.
The technology is experimental today, probably suited for pioneers and teams with specific requirements around cost or agent interaction. But in AI development, experimental tends to become standard faster than traditional timelines suggest. The browser has survived many predicted deaths. As more applications become web-based, partly because agents are building them, the intersection of web technology and AI capabilities seems likely to matter more, not less.
The full conversation covers additional ground on responsive web applications, the trajectory of context engineering for web interfaces, and the technical details of model loading. Worth a listen for anyone building web applications that agents will need to use.