ainativedev/aidevcon-2026-ldn

AI Native DevCon 2026 London — all conference sessions as interactive skills

talk-obstbaum-willoughby-vibes-to-metrics/quote.md

Quotes -- From Vibes to Metrics: How to Actually Measure What Your AI Agents Do

Short excerpts selected from the transcript for grounding answers. Preserve transcript artifacts when quoting.

Output evals

They're going to be talking about from vibes to metrics and how to actually measure what your agents do. Over to you. >> Cool. So, what we're here to talk to you

Source: L0054-L0058

Trajectory evals

the levels and we correlate the levels with the output measurement that we have uh shown in in the beginning. So just when we look at uh okay why do we trust the four levels? So in terms of

Source: L0302-L0306

Agent instrumentation

different views on how to be assessing those metrics. one kind of top down looking at a correlational studies across kind of a whole bunch a whole big part of the industry, one bottoms up

Source: L0078-L0082

Skill activation

lot of time in thinking how could we even measure output and and subsequently productivity So what we found uh to work is um that we have the engineer, he writes the code and then we have a panel

Source: L0139-L0143

Compliance measurement

we're starting to see it now. So people that know how to orchestrate agents, people that know how to work with AI, they achieve significantly better outcomes.

Source: L0243-L0247

Coverage metrics

duplication goes down, code uh cognitive complexity goes down. So all metrics that we analyzed are actually now improving today with applying AI and and that wasn't always so like in the

Source: L0332-L0336

Output evals

On the instruction following, this is the rubrics, the metrics that are grounded specifically in what the skill is telling the agent how to do. So if you have your own internal design

Source: L0422-L0426

Trajectory evals

that unique flavor of how you as a company want your agents to be operating? So this is kind of a concrete example that we found that I found super interesting. So I use a lot of hugging

Source: L0495-L0499
