
From IBM Acquisition to AI-Native Observability | Dash0 CEO
Transcript
[00:00:00] Tessl Intro: Before we jump into this episode, I wanted to let you know that this podcast is for developers building with AI at the core. So whether that's exploring the latest tools, the workflows, or the best practices, this podcast's for you. A really quick ask. 90% of people who are listening to this haven't yet subscribed.
[00:00:24] Tessl Intro: So if this content has helped you build smarter, hit that subscribe button and maybe a like. Alright, back to the episode.
[00:00:32] Guy Podjarny: Hello everyone. Welcome back to the AI native Dev. Today we're going to plunge deep into the world of DevOps and observability, and what does that look like in the world of AI?
[00:00:44] Guy Podjarny: And to dig into that, we have Mirko Novakovic, who was the founder of Instana, built an observability platform, and sold it to IBM. That's where we met, Mirko, way back then. And today is the founder and CEO of Dash0, building that out. So Mirko, thanks for coming onto the show.
[00:01:03] Mirko: Yeah, thanks for having me.
[00:01:05] Guy Podjarny: So just to dig in, start by giving us a bit of context of Dash0, the company you have built and are running today. Tell us a few words about it and some of its core.
[00:01:16] Mirko: Yeah. So we are an AI native observability platform. We actually started, when we started, we were not promoting AI native at the beginning.
[00:01:29] Mirko: We were OpenTelemetry native. And that turned out to be a good foundation for AI, by the way, which we can discuss. But the idea was there is a new standard in observability called OpenTelemetry, which standardizes the format of the telemetry data: logs, metrics, traces, and end-user events, and also standardizes the tagging system on that telemetry data, which is called the Semantic Convention.
[00:01:53] Mirko: So the host name is now host.name, not hostName or host_name. And that is the way we have started. So we built a full platform for logs, traces, metrics, and user monitoring, taking only OpenTelemetry data into it, keeping it as OpenTelemetry, and making the most out of the Semantic Convention by creating context.
[00:02:16] Mirko: So if you look at a trace, you see all the logs, you see the metrics, the underlying infrastructure, and everything in context based on the Semantic Convention, and we try to make a very easy onboarding flow, a PLG type of sales motion, and have it easy to use.
[00:02:35] Guy Podjarny: Yeah. So part of it is kind of just a next gen observability platform that sort of gives you all the tools.
[00:02:41] Guy Podjarny: And I know it kind of combines all this sort of broaden the scope of what observability captures today with the traces and logs and all that jazz. But I think that the core of OpenTelemetry, or OTel, is really interesting. I guess maybe you can sort of say a few words about OTel itself and its distinction between what is it when it is a format and what it is when it is the specific fields, right?
[00:03:07] Guy Podjarny: Because I know when OpenTelemetry came around, there was maybe a first set of companies that built on top of it and were overly optimistic about not just the format, but the type of information, like: "I will just connect to the system; if they already have OTel, I will magically have the info I need." And they learned not all the information is always there.
[00:03:22] Guy Podjarny: But some did standardize. So maybe you can say a bit about what have you seen in terms of OpenTelemetry? Where has it standardized just formats, and what data is typical to find in any OTel-based capture?
[00:03:44] Mirko: Yeah. Let us start with why; I think it is also meaningful. Before OTel, every vendor, including myself at Instana, had their own format. Right? So we had an agent, you specified your own format, and you sent it over. This is the old school. This is the tracing agent, not an AI agent.
[00:04:04] Mirko: Exactly. That is the old-school agent that you install on your host, basically, that captures the data, right? The CPU utilization, the logs, and the traces. Exactly. It is not an AI agent. The benefit was that you could bake everything you needed for your platform right into the format. And you could do some things automatically. For example, if the agent was running on the host, you could add the host name because you were running on it and just send it over, right?
[00:04:31] Mirko: And OpenTelemetry actually was the first approach to standardize this format, right? So that you have no proprietary data, and I think the main drivers were the cloud providers. Yeah, because if you are running something like AWS Lambda or any managed service and you want to provide telemetry data of that service, how do you do that?
[00:04:54] Mirko: You either provide it in twenty formats, like Datadog, New Relic, or Dash0, or you have a standardized format that everybody can understand, and that is how it started, right? So I would also say if you look at a lot of vendors, they say, "We support OpenTelemetry." What that means is that you can send them OpenTelemetry data.
[00:05:21] Guy Podjarny: Right.
[00:05:21] Mirko: But I would say still 90% of the vendors, they take that data and convert it into their internal format because that is how the platform is built. So you can send it, but then in the system, essentially the tagging system, et cetera, is gone, right? Because it is now Datadog format or any other vendor, right?
[00:05:36] Mirko: And I think what we have built is something where the data is always OpenTelemetry, it stays OpenTelemetry, right? The naming is still OpenTelemetry in the tool. And, yeah, coming back to your question, I think there is only one tag that is mandatory and that is service name. Okay?
[00:06:03] Mirko: So the only tag, and essentially everything that emits telemetry data, is a service in that sense. And so if you have a service, I do not know, a payment service, and you send a log, you add service.name equals payment service. And if you have a metric of that, you do the same. And now you can correlate everything on a service level by saying, "Give me all the metrics, logs, and traces for the right service name."
[00:06:23] Mirko: Yeah, that is the only thing that is mandatory. And that is also a bit of a problem, I would say, for vendors, because we also see that a lot of the data we get from customers at the beginning does not have all the information that would be needed. For example, if you now ask, "Give me all the logs of that pod or that host," if that log does not have the pod name and the host name as a tag, we cannot do it, right?
[00:06:50] Mirko: Again, going back to proprietary agents. There, we could add that information to the log ourselves. But now we are relying on the customer sending us the right context. And if we do not get that context, we cannot really recreate it. Right. So. There are companies like Olyka that are only focusing on the quality of telemetry data.
[00:07:14] Mirko: They actually look at data, and then they will tell you, "Oh, by the way, here we see it is running Kubernetes, but we are missing the Kubernetes cluster name and the pod name." And so we can give you hints for doing better. Right. By the way, I think there is also a big chance for AI. We could actually look at data and then proactively add configuration data or populate those kinds of fields.
[00:07:41] Guy Podjarny: But this is a bit of an inverted version, where the core of it is that whoever is operating the system has metadata that they store on top of it, instead of having that metadata be extracted or inferred or captured in the observability system, or in an external system otherwise monitoring those services.
[00:08:02] Guy Podjarny: You kind of collaborate with, or push, the customer, help them be more relevant. Maybe AI does that more and populates that information at the source, you know, to sort of decorate the services with the right metadata.
[00:08:18] Mirko: Exactly. We, for example, also built an open-source project, a Kubernetes operator.
[00:08:24] Mirko: And by using that operator, we do all that configuration stuff for you. Yeah. So if you use that operator, we will essentially make sure that all the telemetry data has the right Kubernetes information and the right host information. And we also auto-inject the agents, for example, into your Java or Node.js runtimes on the fly, right? So that all of this works.
[00:08:46] Guy Podjarny: "Agent" is such a confusing word right now. Like again, these are agents. It is; these are not the AI agents; these are the observability agents. The sort of thing. Exactly, exactly. Yeah. Really, like, I do not know how we got ourselves in this trouble, but today in observability you definitely have ambiguity around the word agent.
[00:09:04] Mirko: Or insecurity. Right?
[00:09:06] Guy Podjarny: That's true as well. So I mean, at this point listeners might be a little bit like, "Isn't this an AI startup and AI podcast? You know what is going on here?" So you mentioned that OTel proved actually quite useful when it came to agents. Tell us a bit more about that.
[00:09:22] Guy Podjarny: Like, why? Why was it useful?
[00:09:25] Mirko: It was useful. I think it started when these LLMs came out and we started experimenting with them; we saw that we got really good results with those agents, with those platforms, like Claude, for example. We use Claude internally, giving us really good results by saying, "Hey, analyze this trace." Right.
[00:09:48] Mirko: Put in a trace, which is literally a text format of tags, right? Which is specified by OpenTelemetry. And now the thing is, because it is actually open source, it is openly documented, and it is an open standard. All these models are trained on the data and they literally understand that host.name is a host name, and they can start basically understanding the context, getting what is HTTP status code 404.
[00:10:15] Mirko: You then know that it is actually a problem, right? And now it can really analyze these things. So OpenTelemetry turned out to be really useful because all the models by default understand the format and understand OpenTelemetry and therefore can really work with it similar to what they can do with code, right?
[00:10:34] Mirko: It is like it is a text structure. The text format is very well specified. It has a syntax, it has semantics. And so it can actually really do interesting things and analyze telemetry data. I mean, we were really impressed by the output, right?
[00:10:52] Guy Podjarny: And this is a, this is still format oriented, right?
[00:10:55] Guy Podjarny: Because there isn't, I do not know, at least of any big bodies of telemetry data, even OpenTelemetry data, that are just available for the LLMs to train on. Generally, I think in the world of DevOps, like traces, everything you collect, even in Dash0, it is in your system.
[00:11:14] Guy Podjarny: It is not published anywhere. Unlike, maybe, GitHub for code.
[00:11:17] Mirko: It is not published. There are a few repositories where you can find a large amount of logs and spans as examples, right?
[00:11:26] Guy Podjarny: Yeah.
[00:11:27] Guy Podjarny: People just post them, by the way? Were they custom created, or are they, like, donated logs? You know, from a donation?
[00:11:32] Mirko: It is kind of donated, yes. It is kind of donated. And by the way, a funny story is when we started, there was an OpenTelemetry sample application, also open source. And the sample application has very well-documented problems in it. Right? Different types of errors. And when we started with LLMs, we asked the LLMs about problems in that sample application because as it was running, it always gave us perfect answers.
[00:11:57] Mirko: And we were super excited at the beginning. But it turned out the LLM was also trained with the problems documented on the actual list. So it could do it because it could cheat on the test. It could cheat on the test because it knew the problems upfront. Right? So results were amazing at the beginning, but then we figured out, okay, it is actually not that amazing with other things.
[00:12:16] Mirko: And then we kind of connected the dots that it was already trained on the documentation of the problems. Right?
[00:12:23] Guy Podjarny: I guess drilling into that a little bit: the LLMs were naturally better, you know, that was like a nice benefit of choosing open technologies, that the LLMs came pre-trained, ready to process OpenTelemetry data.
[00:12:35] Guy Podjarny: But are they not very good at analyzing traces? Not very good at understanding time series data? The volume of training data is not there, I guess, on those fronts. Like, OTel or no OTel, that does not terribly matter.
[00:12:56] Guy Podjarny: Right? Or I guess what has been your experience in terms of the native support of just dropping this trace into whatever Claude or ChatGPT and get some result?
[00:13:09] Mirko: I think there are two separate problems, right? One is you have one trace and there is a problem in it, like the erroneous span or something.
[00:13:14] Mirko: If you do drop that trace into, for example, Claude, I think it will definitely come up with an analysis and tell you, "Hey, I see there is a problem with this trace." And depending on the metadata it gets, for example, if it is a database problem and it gets a database status code in there from an Oracle database, it will look up that error code and will give you context on that error.
[00:13:36] Mirko: So they are really good at saying, "Hey, this is actually exhaustion of a connection pool in the Oracle database, and you should do this and that based on the documentation." This is essentially what you would do as a human, right?
[00:13:55] Mirko: You would see that code and then search for it, and it does the job for you. Where it is not really good is if you have thousands or millions of traces and need to figure out anomalies instead of one, because they cannot really handle that large an amount of data, right? The volume. So what you have to do there is provide the agent, and now we are talking about the AI agent, with the right tools to do the analysis, right?
[00:14:18] Mirko: So we have a functionality, for example, called triage. What triage does is compare: you give it a million traces and ask, "Is there any anomaly in the erroneous traces?" And it would look at all the tags and would tell you, "Oh, the ones with the error always have this customer ID as a tag."
[00:14:44] Mirko: And then it will return that result. And now we provide that tool to the agent through an MCP server, right? And the AI agent can now use that tool, that triage tool, and it will use it autonomously. It will say, "Okay, there is a problem. Let's figure out if there are any anomalies.
[00:15:05] Mirko: So let's use that triage feature. "So yes, you have to build your API essentially in a way that it works for the agent and that the agent can use it.
[00:15:16] Guy Podjarny: To be consumed. So I think maybe let's delineate. We want to talk a little bit about the way in which AI meets observability needs and that type of analysis.
[00:15:27] Guy Podjarny: And I think there are sort of three pillars here to talk about. One is what we started talking about here, which is more the AI power. How do you use AI as an agent to be smarter, right? To offload more of that work, provide good functionality.
[00:15:44] Guy Podjarny: The second is about the agent as a consumer. And I think you started now talking about the agent that will consume. So I want to disambiguate a little bit: which agent is that? Is it your agent, or is it like a Claude agent? And then maybe we go a little bit more philosophical, talk about product and talk about scope and the likes.
[00:16:00] Guy Podjarny: So maybe let's start. We started digging a lot into what the LLMs cannot do. So why don't we talk about how you have a bunch of agent-powered observability capabilities, I do not know what you call them; I will let you say that in a second. But under the mantle of Agent Zero, what are the capabilities?
[00:16:18] Guy Podjarny: So tell us about that. What are useful things to do today with AI when it comes to this world?
[00:16:27] Mirko: I mean, when we started, Agent Zero was just one AI agent. But then over time we figured out that there are actually a lot of use cases where AI agents make sense.
[00:16:38] Mirko: And now Agent Zero is just a platform for agents and we have different ones. And I think the most prominent in the whole space, there is also a category, AI SRE agents, is essentially troubleshooting, right? I mean, you get a call, 3:00 AM in the morning. There is an outage, you have a problem and now you want AI to support you figuring out what the problem is, right?
[00:17:01] Mirko: And it turns out they are actually pretty good at it. So that is the SRE agent. We call that agent the Seeker; we give every agent a name. And the Seeker is essentially the troubleshooting agent that helps you with any kind of problem, figuring out in the data what the root cause of the problem is. Right.
[00:17:26] Guy Podjarny: And this is I guess the root of the root cause analysis agent. Is it mostly about discovering the data? Is it more about analysis? I guess what would you say are the core competencies of that agent?
[00:17:44] Mirko: I think it must do both, right?
[00:17:45] Mirko: First, it must understand the underlying system and the dependencies, because the agent literally has to figure out where to look, right?
[00:17:52] Guy Podjarny: Yeah.
[00:17:53] Mirko: And then it is about digging into the data and figuring out if it is a log or a span or a metric that is the root cause of the problem.
[00:18:05] Mirko: Is there any correlation, right? So it is a sequence of steps, right? Normally, you would see the agent figuring out, "Okay, which services are affected by the problem, and what is the underlying infrastructure?" And then it would ask our server, "Give me the spans and logs, the erroneous logs of that service, and give me the CPU time of that underlying Kubernetes pod."
[00:18:25] Mirko: And so it gathers all the data, analyzes it, and tries to narrow it down. Right? So that is essentially how these RCA, root cause analysis, agents work. Right. And again, they work pretty well, right? Yeah, because most of the errors are text-based. And it is easy for an agent to look at the data and figure out what to do next.
[00:18:50] Guy Podjarny: Yeah, I think the domains that are getting the most success are the very text-heavy domains and this one definitely qualifies there. How do you think about like, you build your agent, I am sort of breaking my own ordering there of the pillars of it.
[00:19:04] Guy Podjarny: But, you know, talk about building your own agent. You're describing a sequence of actions. I can imagine Claude Code or any sort of other agent that I have myself going off and using tools to do these things and get the right steering, I guess. How do you think about building your own agent versus integrating into whoever it is that the customer has as an agent? Why have your own, and do you think that is a long-term or short-term reality?
[00:19:33] Mirko: It is something I am not a hundred percent sure of how it will work. So we do both, right? We integrate into agents like Claude Code or Cursor or whatever through our MCP server. So there, the use case, or the idea, is: the developers are inside of the IDE, it is Cursor or something, and now they can ask a question: "Figure out the errors of that service in production and suggest a fix in code," right?
[00:19:54] Mirko: So it will connect to our MCP server, do the analysis that we just discussed, narrow it down, can then use the functionality of Cursor to match it with the right code and see if it could suggest a fix in the code, right?
[00:20:16] Mirko: I think that is a natural kind of use case for a developer. Why would the developer log into a second tool? Why wouldn't you stay inside of the tool? Right? So that is one sort of integration. But then there is the other use case that you get a Slack message or a PagerDuty message and it says, "Hey, we have a higher error rate on your payment service," and you want to click on it and get a full analysis in context of the data, right?
[00:20:44] Mirko: With the dashboards, with the spans, with the logs, you want to analyze it, and there the agent is inside of our tool. That is the Agent Zero then, and it will guide you through the UI. Right. And I am just off a meeting with my product team, because this is something we are working on right now: I do think that the interaction patterns in software will totally change with agents, right?
[00:21:11] Mirko: I think at the moment we are still at the beginning where a lot of the agents are chat interfaces and they are kind of giving you an answer. My view on it is that it will be more like an interactive mode between the user and the agent. Right.
[00:21:26] Guy Podjarny: Right.
[00:21:27] Mirko: As an example, today I am using Google Presentation, and there is this feature which says, "Hey, do you want me to update your slide?"
[00:21:38] Mirko: And I think the result is pretty disappointing, because the result is that it creates an image out of your presentation. Right. And now the slide is an image. It is nice because it looks like it, but now it is not interactable anymore for me as a user. Right. You cannot continue editing it yourself.
[00:21:54] Mirko: Exactly. And I think the way you want it as a user is that I say, "Hey, help me make my presentation slide nicer," and then I can still go in and change the size or the text. Right. And that is the same for us at the moment. The agent will tell you this and that is the root cause.
[00:22:10] Guy Podjarny: Yeah.
[00:22:10] Mirko: But maybe what you really want, I think, is that it will show you: it will actually do the filtering, and everything inside of the tool will pinpoint to it.
[00:22:19] Mirko: And now as a user you can say, "Yeah, but I removed this error; now do the analysis again." Right? Because I know that this problem is not really the problem. So it is more an interactive mode. Right. I think that is something we are really creating right now. So at the moment we are still at this type of integration.
[00:22:38] Mirko: Right. It is nice, but I do not think it is the final way we want to see. More as an interactive mode, right? Which in coding you already have, right? It creates code. You can now change the code.
[00:22:52] Guy Podjarny: Yeah. Although even, yeah, and even though there is kind of the conversation right now, for instance, on the IDE versus the terminal-based.
[00:22:58] Guy Podjarny: You know, what is the preferred path on it? But it sounds like you're dividing it into the maybe the lowest level is the tools. It is just the ability to access different parts of the system. That is probably just your APIs, and you have the MCPs. On top of that, it sounds like a subset of analytical tools to figure out what the next thing might be, or an agentic tool.
[00:23:22] Guy Podjarny: So there might be, as always with software, low, composable layers of tools, and you can call those from anywhere. It is an interface question. Maybe there is the case of a developer in Claude Code at their terminal. Maybe there is a headless case: when an incident got called, something automatically ran.
[00:23:43] Guy Podjarny: And I think for both of those, it probably does not really matter. For the first one, clearly it is wherever the user was. And so if they are in Claude Code, then you should integrate as a tool. The middle layer probably can run either way, right? You can invoke a headless Claude Code to sort of have whatever the relevant kind of MCP tools to do the analysis.
[00:24:04] Guy Podjarny: But really it is your sort of agentic process, so it could just as well run in your system. But the other part is if you are really in a "Hey, I am troubleshooting this" exercise of "How do I collaborate with the AI over here?" And that is the part that we are sort of most shaping at the moment.
[00:24:26] Guy Podjarny: It reminds me a bit of a conversation I had here with Merrill, the CEO and founder of Graphite. And we talked about reviews. To some extent, I think a lot of code review as it is done today would, in my opinion, be a bit nonsensical, in the sense that you want the same reviews and the same questions to happen earlier. But review, as an action of reviewing the results of the AI's coding, is probably more fundamental than ever and will continue to grow in importance.
[00:25:09] Mirko: Absolutely. I also think for us, it's an existential question, right? Because if it turns out that the user will use observability in other tools, I think then we are just a database. Right?
[00:25:24] Guy Podjarny: Right.
[00:25:24] Mirko: And that will mean that it is a race to the bottom in terms of pricing. And so the value then is not generated anymore in our tool; it is generated in SSA or any other SRE agent tool or whatever. So for us it is really existential to see, "Okay, how can we provide more value inside of our tool than you get in a Cursor or in an AI SRE agent?"
[00:25:42] Mirko: Because at the end, people will pay for the value, not for the data. And if the value is generated somewhere else, then we will at the end be a database.
[00:26:01] Mirko: So for me, it is really about, "Okay, if I can create this interaction with the user and can provide real value, that is awesome." And also I have to say, you probably know, but if you look at large organizations, we have hundreds of users registering for observability, but only a fraction of them are really using it.
[00:26:27] Mirko: Right? Because you need to kind of be an expert to do troubleshooting. So I think we have the chance with agentic AI to enable almost a hundred percent of the users to get value out of it, because it can help you follow the right actions inside of the tool, and it can guide you. So I think that is really powerful.
[00:26:48] Mirko: And then you get more value, right? Because a hundred percent of the users can get in and troubleshoot.
[00:26:52] Guy Podjarny: They can self-serve, and they can build those things. And the clearest demonstrations of AI productivity in the entire software development lifecycle, including observability, are cases where you used to have a dependency on another team and you do not anymore, right?
[00:27:05] Guy Podjarny: Or in general where you skip a whole step. Sometimes it is product or support, or someone else, logging a bug, and a background agent figures it out, and you only meet the solution at the pull request. Maybe at some point even that gets auto-resolved. So that is an example.
[00:27:22] Guy Podjarny: Sometimes it is something that needed to go from the backend team to the frontend team and then get deployed. But now one of those teams can take it all the way. So I think the opportunity that you're describing here makes a lot of sense, which is the more self-sufficient and the more sort of single owner can take it all the way through, the more productivity you get.
[00:27:44] Guy Podjarny: And now the question becomes just like, "What is the scope of the product, and what value do you get from which tool?"
[00:27:52] Mirko: Exactly. And there is a second set; we just talked about root cause analysis, but there is a set of other agents, which I describe as mostly removing toil, right?
[00:28:00] Mirko: It is removing tasks that you really do not like. And a simple example is building dashboards for a service. And then you have an update in your service, and you have that service on fifty dashboards, and now you have to update all these fifty dashboards manually to have that new metric on them. Right?
[00:28:19] Mirko: And I think that is where AI is also super powerful, right? It can create and suggest dashboards for a problem or service. It can automatically update those dashboards. Same for alerts, right? You have alert rules. Something changes, and you need to update all these alert rules. Right? And I think that is where we also created agents that help you create and update dashboards and create and update alerts.
[00:28:45] Guy Podjarny: Yeah.
[00:28:45] Mirko: And do those things. Right? I think that makes a ton of sense.
[00:28:49] Guy Podjarny: Yeah.
[00:28:50] Mirko: And another big point is adding more context. That means we are adding other data sources to observability to understand the context of a problem better. As an example, we connect to Linear and Jira, or we connect to your Notion, and by doing that we can actually see: "Oh, well, if I have a database issue with the schema, maybe there was a task that updated the schema, right?"
[00:29:17] Mirko: And I can correlate it to it. Or maybe there is documentation in Notion with a schema update, which I can use to pinpoint the user to it, or I can use the connection to GitHub to pinpoint a change in the code, et cetera. Right. I think adding more context is also very important now, because the agents can use that context to create better answers.
[00:29:42] Guy Podjarny: Yeah. And so this comes back to the root cause analysis. That agent might come along and figure it out, but you have to relate, like give it access and relate it to it. And I guess organizationally, one of the challenges that I hear talking to the enterprise side of it, talking to the companies using these different tools, is sort of this infinite mesh of connectivity between all these different tools.
[00:30:06] Guy Podjarny: Like the number of tools I need to connect to my Linear, I need to connect. And then there is going to be another subset of tools that might want to connect to the Dash0 as a source of information. I want to read something about the logs because I know I am writing code, and I want to get that information doing it and all this sort of cross-pollination of it.
[00:30:23] Guy Podjarny: And so it is interesting; I do not know what your view is. At Tessl we more and more think about context as a core competency that is a bit agent agnostic. And so there are statements about your system, statements about your knowledge, your practices, and how a system should operate, but also like "What should your code do?" and "What are your coding best practices?" and "How do you use a certain library?"
[00:30:42] Guy Podjarny: And all of those things right now are being intelligently AI powered, assessed, and extracted out of systems by so many different agents as we all figure out what is going on, and we kind of store them inside.
[00:30:58] Guy Podjarny: I guess, how do you think about the challenge of, clearly, there is "Where do we go?" right? Are you going to curate context in twenty different spots, or do you want to curate them in some central environment? And so I guess my question back, maybe to the existential thing as well, is what do you think is the core of observability knowledge?
[00:31:25] Guy Podjarny: You know, if the coding side revolves around the code, or maybe the product functionality, then for observability, like you said, without that you are a database. But what is the most important long-term insight, or sort of knowledge, that you think lives in operations and not earlier in the system?
[00:31:42] Guy Podjarny: I don't know if that makes sense.
[00:31:44] Mirko: No, it makes total sense. I think so. I am just not sure if I have a hundred percent of an answer to it right now, but I think the question makes total sense, and I do think there are multiple things, right? One is, as we discussed, the tools that you provide, and that can be a lot of knowledge and context already, right?
[00:32:05] Mirko: How do you figure out what are anomalies and traces? And how do you figure out these tags? I think also there will be a lot of knowledge about how you evaluate models and the evaluations that you have, and how you essentially make sure that the root cause analysis does what you think it should do.
[00:32:31] Mirko: Right. And also giving some guidance. There are more and more of these to-do lists that you give so that the workflow is not different every time, but you have the knowledge of how to do the analysis if there is a certain type of problem, right? And which step you should follow on the wider scope.
[00:32:53] Mirko: I think that is something, and then I think last but not least, it is really the user experience, the way you integrate the agents into a collaborative mode, basically coming from a single-player mode also to a multiplayer mode where at the end of the day, if you troubleshoot, you normally have these war rooms, right?
[00:33:13] Mirko: And multiple people are working on that problem, and agents are now part of the war room, right? You may have one or two or three agents working with you in parallel, investigating things and giving you information. That information is then processed by the user. You give more context to the agent, and that way you can, I think that will be a big part of it, right?
[00:33:37] Mirko: Who figures out the best user experience, the best way of doing it? And to your last question, I like the idea of having that knowledge somewhere persisted about a system, about certain things, and about how it works. I do not know if we would be the system. I also like the idea that it should be centralised. I mean, you had these enterprise architecture management tools before, right?
[00:34:04] Mirko: You had the whole system documented, or you should document your whole landscape and give hints. I think I can see why you would have a lot of that context in a centralized tool where all the agents can access and get some understanding of how things work and how they are configured, rules, et cetera.
[00:34:24] Mirko: I think that makes sense, right? The enterprise architecture management tool for agents that has it in a way that agents can work with it and understand it and understand certain things, I think, makes sense.
[00:34:37] Guy Podjarny: Yeah. I think it is a good analogy. It is like so many things; when you are sufficiently old like us, you know, you sort of see, have seen a bunch of things, you know, a whole bunch of practices that were good practices.
[00:34:50] Guy Podjarny: They were just too hard to maintain at pace and at scale. And now with agents, maybe we do those. And so those sort of central systems of record, well-attuned for consumption, we will touch on consumption by whom in a sec, is an opportunity to build up.
[00:35:07] Mirko: When I started my career, I worked for IBM, and we had a different set of documents we had to maintain in every project.
[00:35:15] Mirko: And one of them, I really loved it, was a document called Architectural Decisions. And essentially whenever you made the decision, "I used this in that framework" or whatever, you had to document it and explain why. Right? Because whenever you have a new team member join the team, they ask why. Why did you use this framework?
[00:35:34] Mirko: Doesn't make sense. Right. You can pinpoint them to the document because there was actually a reason for it. Yeah. And once you read it, you understand. I think that also would be good for agents, right? Yeah. If they come in and say, oh, why did you do this and that? But maybe there is a reason for it that the agent cannot understand. And so you should have some sort of categorized knowledge for those agents, right?
[00:35:59] Guy Podjarny: Yeah. And curated and in a place that you can also choose to change it over time on it. And then on top of that, what we see is, you know, you write it down, but not all agents listen the same.
[00:36:08] Guy Podjarny: And so you need to, I guess as I think about AI native, I think AI native is primarily about delegating work or tasks to AI. And when you think about delegation, with the human analogy, there are really two things you need to do. One: did you equip the entity you're delegating to with enough information that it's plausible they will get it done?
[00:36:35] Guy Podjarny: Right? Like, what is your intent? What is the knowledge around the system? So what information is available to them? And then two is how do you verify their work? And so to me that's basically context and spec and evals. You know, those are like the two. So you have to have some means. The evals are clearly not going to be comprehensive.
[00:36:51] Guy Podjarny: Just like with humans, you know, you're not going to know how to verify every single task that your employee did because then it's sort of pointless to have them. But you do need to sort of spot-check, I guess, and sort of assess it. Yeah. And so you verify, right? Like, here's the information. Did you listen?
[00:37:07] Guy Podjarny: Did you understand? Right. You can force them to listen, but understanding it depends on the model. So, let's maybe talk a little bit about the human side and to start that, you know, if I talk about AI native, this was like a little bit of my definition when we spoke about what an AI native product is.
[00:37:25] Guy Podjarny: A lot of your focus was on this notion of building an agent first. And so how do you think about the future of your product in terms of its primary constituency? You know, is it agents, or is it humans? How do you delineate the two?
[00:37:37] Mirko: Yeah. The way we do the design of our product now is really that we think of everything first from the perspective of an AI agent, right?
[00:37:50] Mirko: And also what an AI agent could probably do with that information and how we can make life easier for the user then, right? So, that's probably the most important thing for us at the moment, to really always say, "Okay, can an agent do that work? How would that look, and how would that information feed back to the user?"
[00:38:13] Mirko: That it's understandable, and also I would call it traceable, right? So, I think for us, essentially in observability, one of the most important things we got from our users is that they want to understand why the AI agent has come to a certain conclusion. Right? And you really want to follow it, right?
[00:38:30] Mirko: And if we say, "Yeah, the problem is here in the database," then often the question is, "Oh, why did you come up with that conclusion?" Right? Yeah. And then you want to follow the steps essentially of how the agent was getting there. So it's always about, "Okay, what can the agent do, and how do we make that understandable to the user?"
[00:38:51] Mirko: And where are the interaction points of the user with that agent? Right? So, it's not anymore that we say we built a tool where the user is the primary interaction point with us. We think that the agent is the primary interaction point now.
[00:39:09] Mirko: It should do most of the work, and the user should just interact with the agent at points where the user has the knowledge to make the agent better. Right?
[00:39:21] Guy Podjarny: So the interface, I guess another way to sort of say this is you think the primary UX for the user will be to interact with the agent. And so there's, like, a layer of functionality that is aimed quite heavily at the agent to achieve this.
[00:39:40] Guy Podjarny: And then for the user, you have to solve for a different problem. And I guess some things go away. I don't know. Is there an example that jumps to mind of how this, I guess, kind of led you to build something a little bit differently or?
[00:39:55] Mirko: Let's think of a very simple thing, a dashboard, right, for a service.
[00:39:59] Mirko: So you have a service, a payment service, and a dashboard. It used to be that the normal thing was you put up the RED metrics, where you show how many calls you have to that service, what the response time was, and how many errors you had. And I think today the primary information we give you on a service is a textual description of the status of the service.
[00:40:20] Mirko: It would tell you, "Hey, your service is operating fine. It is in the range of the performance of the last 30 days, but we found two errors that came up recently that we haven't seen before, which are suspicious." And then you say, "Okay, let's investigate them." And then from there, you're not looking at charts or anything anymore, right?
[00:40:42] Mirko: You're looking at that. And then the next thing is that you interact with the agent. You say, "Okay, investigate those problems for me." And so now it jumps to the trace view and analyzes all the context that is already set, right? It has filtered down to exactly those problems, right?
[00:41:01] Mirko: It'll show you the right information, the context. And now the next thing is it describes, "Look at here. It's only for that customer in that scenario." And here we go. Right? So the steps are now very simple. You get all the context. The product design is not about charts and numbers anymore.
[00:41:21] Guy Podjarny: Yeah.
[00:41:21] Mirko: And those things, because the agent will do the work for you to look at those charts essentially and give you a summary of the things that are important for you to look at.
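The "textual dashboard" Mirko describes could be sketched roughly like this — summarizing a service's RED metrics against a 30-day baseline instead of rendering charts. The field names, thresholds, and output wording here are illustrative assumptions, not Dash0's actual API:

```python
# Sketch: turn raw service metrics into the kind of textual status
# an agent-first dashboard might show instead of charts.
from statistics import mean, stdev

def service_status(name, baseline_latencies_ms, current_latency_ms, new_error_types):
    """Return a human-readable status line for a service."""
    mu, sigma = mean(baseline_latencies_ms), stdev(baseline_latencies_ms)
    in_range = abs(current_latency_ms - mu) <= 2 * sigma  # crude 2-sigma band
    parts = [f"Service '{name}' is operating "
             + ("within" if in_range else "outside")
             + " its 30-day performance range."]
    if new_error_types:
        parts.append(f"Found {len(new_error_types)} new error type(s) "
                     f"not seen before: {', '.join(sorted(new_error_types))}.")
    return " ".join(parts)

print(service_status("payment", [100, 110, 95, 105, 102], 104,
                     {"TimeoutError", "AuthError"}))
```

The point is the shape of the output: a sentence the user can act on, with the chart-reading already done.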
[00:41:30] Guy Podjarny: Yep. And I really like that, and I think it comes back to the first principles of like, why did we have the chart in the first place?
[00:41:37] Guy Podjarny: Well, the task wasn't, "Give me a chart." The task really was, "I want to understand what the status of my service is." And charts are just the way we get used to looking at them.
[00:41:51] Mirko: And by the way, charts are good for users, not good for agents. Right. That's an interesting thing also, right?
[00:41:56] Mirko: We created charts because we, as humans, are really good at looking at the chart and seeing a spike, right?
[00:42:03] Guy Podjarny: Right.
[00:42:04] Mirko: Where the agent will actually look at the underlying data and do a deep analysis of the data. But we cannot look at 5,000 data points and then see this; we don't see the spike in 5,000 data points. Right?
[00:42:15] Mirko: That's not how our brain works. We look at charts, too, but that's a good point, right? An agent doesn't need charts anymore. The charts are just for the user.
[00:42:24] Guy Podjarny: Yeah.
[00:42:24] Mirko: And essentially we created the chart for the user to pinpoint an anomaly.
[00:42:33] Mirko: Yeah. So now when the agent already says, "I found an anomaly. Do you want me to investigate it?" the charts become useless, right? Yeah. Because you don't need them anymore. But this is exactly how I think we have to rethink first principles. Why did we start with it? And we normally built a user experience to the weaknesses of the human brain. Right.
[00:42:49] Mirko: Right.
[00:42:50] Guy Podjarny: Yeah.
[00:42:50] Mirko: And now we can optimize it because the agent can do the heavy work, right? The heavy lifting, and we just follow it. Right?
[00:43:01] Guy Podjarny: I love that. But also, I love how it actually sort of touches both sides of the weaknesses and strengths, right? You're saying one part is there's an opportunity to give a user, a human user, a better answer than a chart.
[00:43:15] Guy Podjarny: You know, here's the actual conclusion of the chart, but also that if you work from outside and you're trying to have the agents just rely on the information that is available to the human, it would actually do a worse job because the agents are not as good at understanding the charts as humans.
[00:43:35] Guy Podjarny: So I love that sort of analogy and example. And I guess the next question will be on, well, who is this for? And how does it maybe change the sort of the profile of the individual? Because you were already alluding to today, there's a certain level of expertise for people that really, truly operate the DevOps dashboards.
[00:43:55] Guy Podjarny: So some of that is also getting down to contextualizing, I guess, the answer for people with different components. Just like as a human, you'd explain, you'd use different words to explain the same scenario to people with different levels of proficiency.
[00:44:10] Mirko: I mean, normally troubleshooting comes down to a few people in an organization who have a very-
[00:44:17] Mirko: For one, they need a very broad understanding of the overall system. Yeah. Because a lot of developers have only a deep understanding of their own service. In a microservice environment with a thousand services, there are only a few people who really understand how everything works together.
[00:44:35] Mirko: Yeah. And normally in troubleshooting scenarios, you need that understanding to narrow down where to look. Right, right. What could be the problem? Because I always say, or joke, if there is a big problem, systems normally look like a Christmas tree because everything is blinking, right? Because everything is somehow connected, so everything is red, and then it doesn't help anymore because if everything is red, you still don't know where it's coming from.
[00:45:00] Mirko: Right? Yeah. And then you need this expert knowledge of people who say, "Yeah, it's probably here because I know there's this database that's connected to everything else," right?
[00:45:09] Guy Podjarny: Yeah.
[00:45:10] Mirko: And I think that's why you only have these power users being really effective at observability, because you need that understanding.
[00:45:19] Mirko: Right. And I think that's where the agent comes in, because agents are good at understanding the wider scope and narrowing it down. And now we can enable everyone to basically have that knowledge about the overall system because the agent can give you that context. Right?
[00:45:35] Guy Podjarny: Right.
[00:45:35] Mirko: Yeah. And, and now everyone can troubleshoot.
[00:45:38] Mirko: I'm not saying that we are already there yet, but I think the ultimate goal would be to enable every developer, every SRE, and everyone who needs to have that knowledge to troubleshoot and understand the system quickly. Right, right. Yeah. Without having that expert knowledge or new people coming in.
[00:45:58] Mirko: Right. Which is also a problem in enterprises now, right? Now, these two people who have the knowledge leave the company.
[00:46:03] Guy Podjarny: Yeah. Yeah.
[00:46:04] Mirko: And as we all know, it's normally not documented in Notion, Confluence, or somewhere; it's in the heads of the people.
[00:46:11] Guy Podjarny: Yeah.
[00:46:11] Mirko: And now that knowledge is gone.
[00:46:14] Guy Podjarny: And so this to me comes back to that sort of context. We used the word context. You know, the human analogy is really the separation between intelligence and knowledge, right? And so, you want to enable intelligence. The agents kind of give you some intelligence.
[00:46:28] Guy Podjarny: You can look at a lot of logs. You can find the spikes. You can go, you can search, and you can find related information. But then there's also knowledge of what the system is, what it is, you know, what the past incidents have been, so that exactly, we know that it's always that sort of finicky database, you know, that's sort of in the corner. How do you build those?
[00:46:48] Guy Podjarny: I think one of the challenges is that people conflate intelligence and knowledge, and, you know, like I sometimes say, it's like, assume intelligence, not mind reading. You know, like there's no, you know, they can be super intelligent, but if you do not equip them with knowledge, it's anywhere between setting them up for failure and just being highly inefficient, like every time they're going to have to learn that.
[00:47:10] Guy Podjarny: I'm curious about the people side. So maybe like, as a bit of a closing, as we kind of run out of time here, so this is all good and well, and it's the future. Maybe bringing it down to today, what is the journey that you're seeing customers go through, or walking customers through, in terms of adopting this approach? Because there's a fair bit of change involved in what you've described, and people don't always like change.
[00:47:36] Guy Podjarny: And also there are limitations of the technology today. So it can't quite do everything amazingly yet.
[00:47:43] Mirko: I would say, first of all, it's really changing the way users work. There was a public statement by the CIO of The Telegraph in London. He made it actually on LinkedIn. He said that Dash0 and Agent0 are changing the way they do incidents.
[00:47:58] Mirko: The playbooks for incident resolution are changing because now the first place is to ask the AI and not anymore to go through the list of steps you do to troubleshoot. And I think so. So the agents are already changing the way we do incident resolution, et cetera, because they are super helpful.
[00:48:17] Mirko: Right. Probably also how coding agents are changing the way we code, right? We probably first ask an agent to do a suggestion, and then we iterate with it, or, like how we do a post on LinkedIn, we first ask Gemini or whatever. Right. So, I think that's already happening for sure.
[00:48:41] Mirko: What I would say today is that it's really a learning curve on both sides, right? I mean, we have a chat interface, and we, for example, also look heavily at what the users are asking our system, right? Because it's the first time that essentially a user can do something with the system that we haven't designed for them, right? Normally you can only use functionality that you have literally designed in the user experience, but now they can ask questions like, "Oh, give me a usage report of the dashboard."
[00:49:05] Mirko: We never thought about this. And actually, it turns out that the agent can answer that question. So we now know, "Oh, actually users are interested in usage of dashboards."
[00:49:14] Guy Podjarny: Right.
[00:49:14] Mirko: So we get feedback through that interface also to understand what kind of functionality we should bake into the tool, right?
[00:49:22] Mirko: Because the tool today is probably how ChatGPT was when they started; they saw people were asking coding questions, and then you could see, "Oh, maybe we build something like the coding agents." Right. It's a use case. And so we see that too. But at the moment I think it's learning on both sides, right?
[00:49:42] Mirko: Right. It's what works and what doesn't work. How does it change my procedures and my playbooks, right? I see customers already changing their playbooks on incident resolution, but I think over time it will become an integral part of observability. I think even today, we can't imagine anymore having observability without AI.
[00:50:03] Guy Podjarny: Yeah, yeah. No, I love that, and I guess it's an extreme version of what we're doing in general. When you're building a product, you should listen to your users. Yeah. But because you're providing users with an interface that is much closer to what they would say to a support person or to your rep, right?
[00:50:20] Guy Podjarny: Or to an engineer on the team, rather than the freeform version of it, you don't have to decipher what they meant from the clicks on the page or the dashboard they created. You're getting a more verbatim version of "this is the question I had. Can you answer it right, or can you promote that?"
[00:50:38] Mirko: Yeah. Especially if you never had this option before. Right? Yeah. They could never click on it. You would never get that answer without talking to the customer.
[00:50:45] Guy Podjarny: Right. Yeah, absolutely. So I think I'm excited to see this evolve. I mean, I find the world of DevOps in general is a bit cautious around AI, and observability feels like the pioneering part of it because it has a lot of data analysis and a lot of toil to take away.
[00:51:07] Guy Podjarny: So it's exciting to sort of see Agent0 and, in general, the AI-native approaches and the changes of the practices. And I'm very curious about closing the loop as the systems become more reliable.
[00:51:22] Guy Podjarny: So, definitely will be closely tracking. Before I let you go here, though, I want to ask one personal question that I'd like to ask many of the guests here. Typically I would ask, if you had a son that was going to university, would you recommend that they go take computer science?
[00:51:37] Guy Podjarny: And when we're talking, it turns out you have a son that has just started studying. So I guess, how do you think about the career path in software engineering today? And I guess it sounds like you still think it's a good idea to go into a computer science degree today, you know?
[00:51:58] Guy Podjarny: How do you think, where do you think this is headed and what's a good thing for someone to do today if they're early in their career?
[00:52:04] Mirko: I mean, I definitely recommended it to my son, and the reason for it is that I think the main thing you get from a computer science or engineering degree is that it trains your brain.
[00:52:17] Mirko: In understanding problems, math, et cetera. Right. I mean, I was a coding geek since I was a child, right? So I learned coding myself, not at university, before I studied computer science. Yeah. But what you learn there is you learn to solve hard mathematical problems, physics, et cetera, and I think that will not go away.
[00:52:36] Mirko: Right. I think having a brain that helps you analyze problems and structure things, I think that's still really relevant. And I just think he should do that to get that basic training and then figure out what he wants to do. And he actually loves hardware, and I think that's a very interesting spot in the future.
[00:52:56] Guy Podjarny: Yes.
[00:52:57] Mirko: Right? Robots, drones, and all these things are also a very interesting combination with AI and software. So, yeah, I think it's still relevant.
[00:53:09] Guy Podjarny: Yeah. I think it's super interesting, and I get different roles on it, and I absolutely agree about dealing with hard problems; you know, like learning to overcome adversity to begin with.
[00:53:21] Guy Podjarny: It's just a hard degree to get through, and there are life lessons in that. Yeah, hopefully you have some good teachers. You learn how to tackle problems. I guess what I would add is this: if you go for a computer science degree and you come out without having done any programming on the side, you're probably not a very good programmer by that time.
[00:53:37] Guy Podjarny: So you learn a lot, but you need to learn alongside the degree. And I think that is probably more important than ever. Because I think where I put very little faith is in the university's ability to adapt anywhere near fast enough to teach the students the knowledge part of what they need from an AI perspective in the AI era.
[00:54:01] Guy Podjarny: So I guess the emphasis there is, you know, do that, but then, alongside the degree, learn to use Dash0 or, you know, get some experience with some sort of agentic development tools to... absolutely. Yeah.
[00:54:14] Mirko: I agree.
[00:54:15] Guy Podjarny: Mirko, thanks a lot for coming onto the show and sharing these insights, and I'm looking forward to seeing the future of observability in this AI era.
[00:54:24] Mirko: Thank you. It was nice being on your show here.
[00:54:26] Guy Podjarny: And thanks everyone for tuning back in, and I hope you join us for the next one.
In this episode
"Charts are good for users, not good for agents. Agents look at the underlying data and do deep analysis."
Mirko Novakovic built Instana, sold it to IBM, and now he's building Dash0, rethinking observability for agents, not humans.
In conversation with Guy Podjarny, he explains:
• why OpenTelemetry turned out to be perfect for AI
• how UX changes when agents are your primary users
• why interactive collaboration beats static chat outputs
• the survival question for observability vendors in the AI era
Only 2-3 people in most companies can truly debug production. That knowledge lives in their heads and disappears when they leave. Mirko's betting agents will change that.
Why Context Engineering Matters for AI-Native Observability
The observability space is undergoing a quiet transformation. As AI agents become central to how developers troubleshoot and monitor production systems, the question of how those agents consume and understand telemetry data has become critical. Context engineering, it turns out, may matter as much in DevOps as it does in coding assistants.
Mirko Novakovic brings a unique perspective to this shift. As the founder of Instana (acquired by IBM) and now CEO of Dash0, he has spent years thinking about how observability platforms should evolve. In a recent conversation on The AI Native Dev podcast, he shared how OpenTelemetry became an unexpected foundation for AI-native observability, and why designing for agents first is reshaping product development.
OpenTelemetry as Context for AI Agents
Before OpenTelemetry (OTel), every observability vendor created proprietary data formats. Instana had its own agent, Datadog had another, and telemetry data lived in silos. OTel changed that by standardizing not just the format of logs, metrics, and traces, but also the tagging system through Semantic Conventions.
What Mirko and his team discovered was surprising: LLMs already understand OpenTelemetry. Because OTel is open source, well-documented, and widely adopted, foundation models have been trained on its documentation and sample data. When you feed a trace into Claude, it recognizes that host.name is a hostname, that HTTP status code 404 indicates a problem, and can reason about the relationships between services.
"OpenTelemetry turned out to be really useful because all the models by default understand the format," Mirko explained. "It is like code. It has a syntax, it has semantics. And so it can actually do interesting things and analyze telemetry data."
This matters for context engineering in observability. The richer and more standardized the context you provide to an AI agent, the better its analysis. A trace decorated with proper Kubernetes metadata, service names, and semantic tags gives an agent everything it needs to pinpoint anomalies. Without that context, the agent is working blind.
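To make the point concrete, here is a minimal sketch of what an OTel-style span with semantic-convention attributes looks like, and the kind of trivial rule standardized keys make possible. The attribute keys follow the OpenTelemetry Semantic Conventions; the values and the rule itself are made up for illustration:

```python
# A trace span rendered as plain data, with Semantic Convention keys.
# Because "host.name" is always "host.name" (not "hostname" or "host_name"),
# any consumer — including an LLM — can reason over it reliably.
span = {
    "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "spanId": "00f067aa0ba902b7",
    "name": "GET /checkout",
    "attributes": {
        "service.name": "payment",           # which service emitted the span
        "host.name": "ip-10-0-1-12",         # standardized host identifier
        "http.request.method": "GET",
        "http.response.status_code": 404,    # a 404 an agent can flag
        "k8s.pod.name": "payment-7d4f9c",
    },
}

def looks_problematic(span):
    """Trivial rule an agent could apply thanks to standardized keys."""
    return span["attributes"].get("http.response.status_code", 200) >= 400

print(looks_problematic(span))
```

The standardization, not the rule, is what matters: the same check works on telemetry from any OTel-instrumented service.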
Building AI Agents for Root Cause Analysis
Dash0's Agent Zero platform includes multiple specialized agents, with "The Seeker" focused specifically on troubleshooting. When a 3 AM incident fires, The Seeker helps SREs identify root causes by navigating system dependencies, correlating logs with traces, and surfacing anomalies across millions of data points.
The key insight is that LLMs excel at different tasks than humans. An AI agent struggles to process a million traces looking for anomalies directly, but give it the right tools and it performs remarkably well. Dash0 built a triage feature that compares traces and identifies patterns, like noticing that all errors share a specific customer ID. This tool is exposed through an MCP server, allowing external agents like Claude Code or Cursor to invoke it.
"We provide that tool to the agent through an MCP server," Mirko noted. "And the AI agent can now use that triage tool autonomously. It will say, okay, there is a problem, let's figure out if there are any anomalies."
This pattern of building specialized analytical tools that agents can invoke represents a shift in how observability platforms should think about their APIs. The question is no longer just "Can a human read this dashboard?" but "Can an agent call this function and reason about the result?"
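The core of such a triage tool can be sketched in a few lines: scan the failing traces for an attribute whose value is shared by (nearly) all of them, like the customer ID in Mirko's example. This is only an illustration of the pattern, not Dash0's implementation, and the data shape and threshold are assumptions:

```python
# Sketch of a "triage" analysis an agent could invoke as a tool:
# find attributes common to most error traces.
from collections import Counter

def triage(error_traces, min_share=0.9):
    """Return (attribute, value) pairs shared by >= min_share of error traces."""
    findings = []
    keys = {k for t in error_traces for k in t}
    for key in keys:
        top_value, count = Counter(t.get(key) for t in error_traces).most_common(1)[0]
        if top_value is not None and count / len(error_traces) >= min_share:
            findings.append((key, top_value))
    return sorted(findings)

errors = [
    {"customer.id": "c-42", "region": "eu", "endpoint": "/checkout"},
    {"customer.id": "c-42", "region": "us", "endpoint": "/checkout"},
    {"customer.id": "c-42", "region": "eu", "endpoint": "/cart"},
]
print(triage(errors))  # → [('customer.id', 'c-42')]
```

Wrapped behind an MCP server, an external agent never sees the million traces — it calls the tool and reasons about the compact result.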
Designing Products Agent-First
Perhaps the most striking insight from the conversation was Dash0's approach to product design. Rather than building for human users and then adding AI features, they now design every feature by first asking: "Can an agent do this work?"
Consider dashboards. Traditionally, a service dashboard displays charts showing request volume, response times, and error rates. Humans are good at visually spotting spikes in charts, which is exactly why we built dashboards this way. But agents do not need charts. They can analyze the underlying 5,000 data points directly and identify anomalies faster than any human scanning a visualization.
"We created charts because we, as humans, are really good at looking at the chart and seeing a spike," Mirko observed. "The agent will actually look at the underlying data and do a deep analysis. So now the charts get useless, because you don't need them anymore."
This does not mean charts disappear entirely, but their purpose changes. In an agent-first world, the primary interface might be a textual summary: "Your service is operating within normal parameters, but two new error types appeared in the last hour." The agent has already done the analysis. The human's role shifts to validation and decision-making.
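The "agents don't need charts" argument can be demonstrated directly: a few lines of statistics find the spike in 5,000 raw data points that a human would need a visualization to see. A minimal sketch, with an illustrative z-score threshold:

```python
# Spike detection over a raw series — the analysis a chart performs
# for human eyes, done numerically instead.
from statistics import mean, stdev

def find_spikes(series, z_threshold=4.0):
    """Return indices whose value deviates strongly from the series mean."""
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(series) if abs(v - mu) / sigma > z_threshold]

# 5,000 flat-ish data points with one injected spike
series = [100.0 + (i % 7) * 0.1 for i in range(5000)]
series[3172] = 250.0
print(find_spikes(series))  # → [3172]
```

A production detector would use rolling baselines and seasonality, but the asymmetry holds: the agent scans the data, and the human gets the conclusion.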
Democratizing Observability Expertise
A persistent challenge in large organizations is that only a handful of experts truly understand system-wide dependencies. When incidents occur, these are the people called into war rooms because they know that when everything turns red, the real culprit is usually that one finicky database connected to everything else.
AI agents appear well-suited to democratize this expertise. By understanding system topology and having access to historical context, an agent can guide any developer through troubleshooting, not just the senior SRE who has memorized the dependency graph.
"Agents are good at understanding the wider scope and narrowing it down," Mirko explained. "Now we can enable everyone to basically have that knowledge about the overall system because the agent can give you that context."
The implications extend beyond incident response. Dash0 also builds agents for removing toil, such as automatically updating dashboards across fifty instances when a service changes, or keeping alert rules synchronized. These are tasks that humans find tedious and error-prone, but that agents handle reliably.
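The dashboard-synchronization kind of toil is mechanical enough to sketch: propagate one change across every dashboard definition that references a service. The config shape and function are hypothetical, purely to show why this is agent-friendly work:

```python
# Sketch: rename a service across many dashboard definitions,
# the kind of repetitive sync work agents handle reliably.
def rename_service(dashboards, old, new):
    """Update every panel query that references the old service name."""
    changed = 0
    for dash in dashboards:
        for panel in dash["panels"]:
            if panel["query"].get("service.name") == old:
                panel["query"]["service.name"] = new
                changed += 1
    return changed

dashboards = [
    {"title": f"team-{i}", "panels": [{"query": {"service.name": "payment"}}]}
    for i in range(50)
]
print(rename_service(dashboards, "payment", "payment-v2"))  # → 50
```

Fifty edits a human would find tedious and error-prone reduce to one deterministic pass, which is exactly the profile of toil worth delegating.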
Where This Is Headed
The conversation surfaced an existential question for observability vendors: if users primarily interact with observability data through external agents in their IDE or terminal, does the observability platform become just a database? The value would shift to wherever the agent lives, triggering a race to the bottom on storage pricing.
Mirko's response is to make the interaction layer compelling enough that users want to work within the observability tool itself. That means building collaborative experiences where humans and agents work together, not just chat interfaces that produce answers. The agent surfaces insights, but the human can filter, adjust, and request re-analysis. It is interactive, not transactional.
For developers building AI-native applications, this conversation highlights a broader pattern. The tools we build need context engineering strategies that make them consumable by AI agents. That might mean adopting standards like OpenTelemetry, exposing MCP servers for tool invocation, or rethinking UX entirely around human-agent collaboration.
The shift is already happening. As Mirko noted, customers are rewriting their incident playbooks to start with asking the AI agent rather than following manual checklists. Worth keeping an eye on as observability continues to evolve.
