
ainativedev/aidevcon-2026-ldn

AI Native DevCon 2026 London — all conference sessions as interactive skills



Quotes -- Benchmarking the Agent Era: Measuring Performance Beyond the LLM

Short excerpts selected from the transcript for grounding answers. Preserve transcript artifacts when quoting.

Agent-era benchmarking

So you're going to be talking about benchmarking the benchmarking the agent era and you're ready to go. So I will leave you to it.

Source: L0007-L0011

Inference performance

three things. One, what does the agentic benchmark look agentic workload looks like? Second, what are the optimizations that already exist to run these agentic workloads efficiently? And third, what

Source: L0032-L0036

Workflow complexity

these applications were not using tools. Now in the agentic workloads what we are seeing is dozens of turns and I'll give you concrete examples of what what I mean. U the LLMs are getting called

Source: L0060-L0064

Tool calls

talk will mostly focus on the coding agent side of things. This workload might diff look different if you're talking about different kinds of agent but here the focus is primarily on the

Source: L0135-L0139

Context length

replicas of the same model that you're using in your agentic workloads. So turn one end up on replica A. If you do a roundroin kind of a setup, the turn two might end up on some other replica, but

Source: L0189-L0193
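The round-robin point can be sketched as follows. This is an illustration, not code from the talk. It assumes that each replica keeps a prefix (KV) cache of the conversation it last served, so only a turn that lands on the same replica as the previous turn can reuse the full prefix. The replica names and the functions `round_robin_router`, `sticky_router` and `full_prefix_hits` are hypothetical.

```python
import hashlib
from itertools import cycle

# Hypothetical replica pool serving the same model.
REPLICAS = ["replica-a", "replica-b", "replica-c"]


def round_robin_router(replicas):
    """Send each request to the next replica in turn, ignoring the session."""
    order = cycle(replicas)

    def route(session_id: str) -> str:
        return next(order)  # session_id is deliberately ignored

    return route


def sticky_router(replicas):
    """Pin every turn of a session to one replica by hashing the session id."""

    def route(session_id: str) -> str:
        digest = hashlib.sha256(session_id.encode()).digest()
        return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

    return route


def full_prefix_hits(route, session_id: str, turns: int) -> int:
    """Count turns that land on the replica that served the previous turn,
    i.e. the only replica assumed to hold the full conversation prefix."""
    previous = None
    hits = 0
    for _ in range(turns):
        replica = route(session_id)
        if replica == previous:
            hits += 1
        previous = replica
    return hits


print(full_prefix_hits(round_robin_router(REPLICAS), "session-1", 6))  # 0
print(full_prefix_hits(sticky_router(REPLICAS), "session-1", 6))       # 5
```

With one session and three replicas, round-robin never sends two consecutive turns to the same replica. The hash-based sticky router keeps every turn on one replica. Real routers see many interleaved sessions and must also balance load, so this isolates only the cache-locality effect.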

Real-world workload measurement

trajectory which is completely different from what we see in agentic workloads on the agentic workloads. Now what we miss there's no fixed shape as I saw the trajectory keeps growing as you are

Source: L0318-L0322
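Here is a minimal sketch of the growing trajectory. It assumes, as in a typical chat-completion agent loop, that every LLM call re-sends the full history plus whatever the last step appended. `prompt_lengths` and the token counts are hypothetical and are chosen only to contrast with a fixed-shape benchmark that sends the same input length on every request.

```python
def prompt_lengths(system_tokens: int, appended_per_turn: list[int]) -> list[int]:
    """Input length of each LLM call when every call re-sends the whole trajectory.

    appended_per_turn[i] is the number of tokens (model output, tool result,
    user message) added to the history just before call i.
    """
    lengths = []
    context = system_tokens
    for added in appended_per_turn:
        context += added          # history only grows; nothing is dropped
        lengths.append(context)
    return lengths


# Hypothetical coding-agent trajectory: a large tool output (e.g. a file read)
# on the fourth turn makes every later prompt longer.
agent = prompt_lengths(2000, [500, 3000, 800, 12000, 400])
# A fixed-shape benchmark would instead send the same input length every time.
fixed = [2500] * len(agent)

print(agent)  # [2500, 5500, 6300, 18300, 18700]
print(fixed)  # [2500, 2500, 2500, 2500, 2500]
```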

Agent-era benchmarking

So all this slide is trying to say in the agentic workload you can't just talk about mean median you have to look at distributions you have to talk about metrics in terms of distributions in

Source: L0379-L0383
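The sketch below makes the point about distributions concrete. It uses hypothetical per-turn latencies, not data from the talk: most turns are short and a minority are long. The mean and median look healthy, while the tail percentiles expose the slow turns. `percentile` uses the nearest-rank definition. `summarize` and the sample are illustrative.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(latencies_s):
    """Report central tendency alongside the tail of the distribution."""
    return {
        "mean": statistics.fmean(latencies_s),
        "median": statistics.median(latencies_s),
        "p90": percentile(latencies_s, 90),
        "p99": percentile(latencies_s, 99),
        "max": max(latencies_s),
    }


# Hypothetical sample: 85 quick turns at 1 s, 15 slow turns at 20 s.
sample = [1.0] * 85 + [20.0] * 15
print(summarize(sample))
```

Here the median is 1.0 s and the mean is about 3.85 s, yet p90 and p99 are both 20.0 s. Reporting only the first two would hide the slow turns entirely.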

Inference performance

representing your workload correctly especially in agentic workloads those gray regions where this stuff is being run on CPU is very important and that kind of helps you getting the right

Source: L0500-L0504
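The following sketch shows why the CPU-side gaps matter. It assumes a trace recorded as an ordered list of `("llm", seconds)` and `("tool", seconds)` spans. The trace format, the values and both helper functions are hypothetical. Replaying only the LLM calls back-to-back would issue them at 0.0, 2.0 and 5.0 s. The traced run issues them at 0.0, 2.5 and 10.0 s because tool execution on the CPU sits between them.

```python
def time_breakdown(trace):
    """Total seconds spent in LLM calls vs. tool execution."""
    totals = {"llm": 0.0, "tool": 0.0}
    for kind, seconds in trace:
        totals[kind] += seconds
    return totals


def llm_arrival_times(trace):
    """Start time of each LLM call, including the tool gaps before it."""
    now = 0.0
    arrivals = []
    for kind, seconds in trace:
        if kind == "llm":
            arrivals.append(now)
        now += seconds
    return arrivals


# Hypothetical agent trace: LLM calls interleaved with CPU-side tool runs.
trace = [("llm", 2.0), ("tool", 0.5), ("llm", 3.0), ("tool", 4.5), ("llm", 1.0)]
print(time_breakdown(trace))     # {'llm': 6.0, 'tool': 5.0}
print(llm_arrival_times(trace))  # [0.0, 2.5, 10.0]
```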
