
ainativedev/latest-aidevcon-speakers-london-2026

AI Native DevCon 2026 London — all conference sessions as interactive skills



talk-obstbaum-willoughby-evals-hard/outline.md

Outline - Why Evals Are Hard and How We're Solving It

Speaker

Simon Obstbaum and Rob Willoughby

Abstract

[inferred from filename and transcript] "Why Evals Are Hard and How We're Solving It" is an AI Native DevCon session covering AI evals, evaluation design, testing agents, measurement, quality gates, and agent reliability.

Thesis

[inferred] The talk's main contribution is its framing of AI evals, evaluation design, and testing agents for practitioners working in AI-native software development.

Transcript Status

The source is timestamped speech-to-text output. Speaker labels, punctuation, and some technical terms may be imperfect. Use timestamps when citing.

Timeline

| # | Timestamp range | Section | Summary |
|---|---|---|---|
| 1 | 00:00-04:40 | Opening and framing | I mentioned it before, and if you haven't, I've been in the room when I was talking about it. |
| 2 | 04:44-09:35 | Main discussion 1 | The first piece that we covered was Repair By published a piece on AI engineers, found that roughly 10% of the engineers. |
| 3 | 09:40-14:37 | Main discussion 2 | Looking out inside of the teams and looking at the individuals, we see that, you know, and I think everyone here in this room encountered someone. |
| 4 | 14:40-18:57 | Main discussion 3 | yeah, it just it feels like so the tooling and instrumentation is essential to it. |
| 5 | 19:01-23:05 | Main discussion 4 | But the things that you care about, the structure of these skills or changes, how it does it and how well it does it. |
| 6 | 23:11-27:43 | Main discussion 5 | Unique testing is getting structure to what other countries are doing and that different is better. |
| 7 | 27:47-31:57 | Main discussion 6 | So so there is a bunch of data on because under some regulation, no data. |
| 8 | 32:02-36:34 | Closing points | there is a code structure quality that you unlock, or you've come to consensus or there's some, some idea that the quality of the code matters or has an influence in the outcome t… |

Named Concepts / Search Anchors

  • AI Evals - Topic named in the talk metadata and used as a search anchor for transcript Q&A.
  • Evaluation Design - Topic named in the talk metadata and used as a search anchor for transcript Q&A.
  • Testing Agents - Topic named in the talk metadata and used as a search anchor for transcript Q&A.
  • Measurement - Topic named in the talk metadata and used as a search anchor for transcript Q&A.
  • Quality Gates - Topic named in the talk metadata and used as a search anchor for transcript Q&A.
  • Agent Reliability - Topic named in the talk metadata and used as a search anchor for transcript Q&A.

Useful Search Terms

  • AI evals
  • evaluation design
  • testing agents
  • measurement
  • quality gates
  • agent reliability

Open Questions / Limits

  • The outline is generated from timestamped transcript text rather than speaker-provided slides.
  • Some transcript terms may be speech-to-text artifacts.
  • For precise claims, inspect transcript.md around the relevant timestamp.

Duration Marker

Last observed timestamp: 36:34
