AI Evaluations and Testing: How to Know When Your Product Works (or Doesn’t)

In this enlightening mashup episode of AI Native Dev, industry luminaries explore the complexities and strategies of integrating AI into product development. Listen in to learn how leaders like Des Traynor, Rishabh Mehrotra, Tamar Yehoshua, and Simon Last navigate the challenges of AI, offering insights that are crucial for any developer working with AI technologies.

Episode Description

This episode of AI Native Dev, hosted by Simon Maple and Guy Podjarny, features a mashup of conversations with leading figures in the AI industry. Guests include Des Traynor, founder of Intercom, who discusses the paradigm shift generative AI brings to product development. Rishabh Mehrotra, Head of AI at SourceGraph, emphasizes the importance of evaluation processes over model training. Tamar Yehoshua, President of Products and Technology at Glean, shares her experiences in enterprise search and the challenges of using LLMs in data-sensitive environments. Finally, Simon Last, Co-Founder and CTO of Notion, talks about continuous improvement and the company's iterative processes. Each guest provides invaluable insights into the evolving landscape of AI-driven products.

Chapters

1. [00:00:00] Introduction by Simon Maple
2. [00:02:00] Des Traynor on AI Product Development
3. [00:13:00] Rishabh Mehrotra on the Importance of Evaluation
4. [00:21:00] Tamar Yehoshua on Enterprise Search Challenges
5. [00:34:00] Simon Last on Continuous Improvement at Notion
6. [00:49:00] Summary and Closing Remarks

Des Traynor's Perspective on AI Product Development

Des Traynor, founder of Intercom, provides an in-depth look into the complexities of integrating AI into traditional product development. According to Des, the integration of generative AI introduces a paradigm shift in how products are developed. He emphasizes that while developers love to ship code, the real challenge with AI models lies in understanding whether they work effectively in production environments. Des mentions the concept of "torture tests," which are designed to simulate the most demanding scenarios AI models might face in real-world usage. This rigorous testing is crucial to ascertain the performance and reliability of AI models in production.
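
To make the idea concrete, here is a minimal sketch of what a torture-test harness could look like. This is not Intercom's implementation; the scenarios, the `call_model` helper, and the crude pass checks are all hypothetical placeholders for whatever model call and acceptance criteria a team actually uses:

```python
# Hypothetical torture-test harness: hard scenarios with an agreed expected
# behaviour, run against whichever model/prompt combination is under test.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    user_message: str
    passes: Callable[[str], bool]  # predicate over the model's reply

def call_model(prompt: str) -> str:
    """Placeholder for the real inference call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

TORTURE_SCENARIOS = [
    Scenario(
        name="unknown_tier3_question",
        user_message="What is the undocumented rate limit on endpoint X?",
        # Desired behaviour: admit uncertainty or escalate, never invent an answer.
        passes=lambda reply: "i don't know" in reply.lower() or "escalate" in reply.lower(),
    ),
    Scenario(
        name="competitor_recommendation",
        user_message="Should I switch to one of your competitors?",
        # Crude illustrative check; a real harness would use a richer rubric.
        passes=lambda reply: "recommend" not in reply.lower(),
    ),
]

def run_torture_tests(scenarios=TORTURE_SCENARIOS) -> float:
    results = [(s, s.passes(call_model(s.user_message))) for s in scenarios]
    for s, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}  {s.name}")
    return sum(ok for _, ok in results) / len(results)
```

A new model or prompt change would only graduate to a gradual production rollout once a run like this clears an agreed pass rate.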

Des also highlights the non-deterministic nature of AI, which requires ongoing evaluation and adaptation. In the episode, he states, "You have to do so much, whereas in typical boring bread and butter B2B SaaS… you're assuming that the mathematics worked or whatever." This underscores the need for continuous monitoring and testing to ensure AI systems perform as expected and adapt to changing user inputs and environments. He further elaborates on the need to shift from a deterministic mindset to one that embraces the spectrum of possibilities AI presents, which is a significant departure from traditional software development paradigms.
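
Des does not prescribe a specific monitoring technique, but one simple way to make non-determinism visible is to replay a fixed prompt several times and track how often the answers agree with a reference. The helpers below are assumptions, not anything from the episode:

```python
# Hypothetical consistency probe: replay one prompt N times and measure
# how often the replies agree with a known-good reference answer.

def call_model(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the real inference call

def answers_match(a: str, b: str) -> bool:
    # Could be exact match, embedding similarity, or an LLM-graded comparison.
    return a.strip().lower() == b.strip().lower()

def consistency_rate(prompt: str, reference: str, runs: int = 10) -> float:
    hits = sum(answers_match(call_model(prompt), reference) for _ in range(runs))
    return hits / runs
```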

Moreover, Des discusses the importance of understanding the full lifecycle of AI products, from conception to deployment, and the subsequent need for iterative testing and refinement. He underscores the complexity of building AI products that can evolve over time, reflecting real-world data and user interactions. This dynamic approach requires a deep understanding of both the technical capabilities of AI and the practical implications of deploying these technologies at scale.

Guy Podjarny's Insights on Building LLM-Based Products

Guy Podjarny shares his experiences from Tessl and Snyk on developing LLM-based products. He discusses the inherent difficulties in evaluating AI products, particularly the challenges posed by their non-deterministic behavior. Guy emphasizes the importance of adapting CI/CD processes to accommodate the unique requirements of AI development. He notes, "Some of the tools are quite immature," highlighting the need for innovation and improvement in the tools and methodologies used for AI product development.

Guy also stresses the necessity of empowering developers to work in an AI-first environment. He believes that developers must embrace the ambiguity that comes with AI and learn to navigate it effectively. This involves a shift in mindset from traditional deterministic programming to understanding and managing the probabilities and uncertainties inherent in AI systems. Additionally, Guy highlights the role of continuous integration and continuous deployment (CI/CD) frameworks in streamlining AI development, emphasizing the need for robust testing environments that can handle the unpredictability of AI outputs.
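
As a hedged illustration of what adapting CI/CD can mean in practice: rather than per-test pass/fail assertions, the pipeline gates on an aggregate score over an eval set, since individual non-deterministic cases may flake. The threshold, file format, and helper functions here are assumptions for the sketch:

```python
# Hypothetical CI eval gate: fail the build only if the aggregate eval score
# drops below an agreed threshold.
import json
import sys

THRESHOLD = 0.85  # assumed acceptance bar, tuned per team

def call_model(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the real inference call

def score_case(expected: str, actual: str) -> float:
    # Binary containment check here; could be an LLM-graded score instead.
    return 1.0 if expected.strip().lower() in actual.lower() else 0.0

def main(eval_path: str = "evals.jsonl") -> None:
    cases = [json.loads(line) for line in open(eval_path)]
    scores = [score_case(c["expected"], call_model(c["prompt"])) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"eval score: {mean:.2%} over {len(cases)} cases")
    sys.exit(0 if mean >= THRESHOLD else 1)  # non-zero exit fails the CI job

if __name__ == "__main__":
    main()
```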

Furthermore, Guy discusses the challenges of maintaining consistency and reliability in AI products, advocating for a holistic approach that combines both automated testing and human oversight. He stresses the importance of fostering a culture of experimentation and learning within development teams, where developers are encouraged to explore new techniques and methodologies to optimize AI performance and usability.

Rishabh Mehrotra on the Importance of Evaluation

Rishabh Mehrotra, Head of AI at SourceGraph, delves into the critical role of evaluation in AI development. He describes getting the "zero to one" on evaluation: standing up an initial evaluation dataset and metric for each feature before trying to improve the model at all. Rishabh argues that "writing a good evaluation is more important than writing a good model," emphasizing that without robust evaluation metrics, developers cannot accurately assess the impact of model improvements.

Rishabh also discusses the importance of creating feature-aware evaluation datasets, which are tailored to specific use cases and environments. He points out that industry benchmarks may not always reflect real-world usage, and developers need to develop evaluations that align with actual user experiences and expectations. This approach ensures that AI models are tested against scenarios that truly represent the complexities and nuances of their intended applications.
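
To ground the distinction between an industry benchmark and a feature-aware eval, here is a small sketch of per-feature eval sets, each scored with its own metric. The dataset shapes, cases, and checks are illustrative assumptions rather than anything SourceGraph ships:

```python
# Hypothetical feature-aware eval sets: one dataset per feature
# (unit-test generation, chat, ...), each with a metric that reflects
# how that feature is actually used, unlike a generic pass@1 benchmark.

def call_model(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the real inference call

FEATURE_EVALS = {
    "unit_test_generation": {
        "cases": [{"prompt": "Write tests for parse_invoice()", "check": "def test_"}],
        # Crude check: did the output contain test functions at all?
        "metric": lambda out, check: float(check in out),
    },
    "chat": {
        "cases": [{"prompt": "Where is auth handled in this repo?", "check": "auth"}],
        # Stand-in for a human- or LLM-graded relevance score.
        "metric": lambda out, check: float(check in out.lower()),
    },
}

def evaluate_feature(name: str) -> float:
    spec = FEATURE_EVALS[name]
    scores = [spec["metric"](call_model(c["prompt"]), c["check"]) for c in spec["cases"]]
    return sum(scores) / len(scores)
```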

Additionally, Rishabh highlights the significance of iterative evaluation processes, where continuous feedback and data-driven insights are used to refine and enhance AI models. He advocates for a dynamic approach to evaluation, where metrics are continuously updated and aligned with evolving user needs and technological advancements. This ensures that AI models remain relevant and effective in addressing the challenges of modern software development.

Tamar Yehoshua on Enterprise Search Challenges

Tamar Yehoshua, President of Products and Technology at Glean, explains how Glean manages enterprise search across sensitive data sources. She discusses the challenges of using LLMs as a judge and jury for evaluating AI responses, particularly in environments where data sensitivity is paramount. Tamar highlights the difficulty that non-determinism creates in enterprise environments, where users expect consistent and reliable outputs.

Glean addresses these challenges by employing suggestive prompts and structured templates to guide users, thus managing expectations and improving user experience. Tamar shares that Glean has dedicated teams for evaluation and uses LLMs to judge the completeness, groundedness, and factualness of AI responses, providing a nuanced approach to handling LLM outputs. This strategy ensures that AI-generated responses are not only accurate but also contextually relevant and aligned with user expectations.
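
A hedged sketch of what an LLM-as-judge setup along those three axes might look like; the prompt wording, score scale, and jury averaging below are assumptions for illustration, not Glean's implementation:

```python
# Hypothetical LLM-as-judge: a second model grades an answer for
# completeness, groundedness and factualness; a small "jury" of repeated
# judge calls dampens the judge's own non-determinism.
import json

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score each dimension from 0 to 5 and reply as JSON:
{{"completeness": 0, "groundedness": 0, "factualness": 0}}"""

def judge_model(prompt: str) -> str:
    raise NotImplementedError  # stand-in for the judge inference call

def judge_answer(question: str, context: str, answer: str) -> dict:
    raw = judge_model(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)

def jury_verdict(question: str, context: str, answer: str, votes: int = 3) -> dict:
    grades = [judge_answer(question, context, answer) for _ in range(votes)]
    return {k: sum(g[k] for g in grades) / votes for k in grades[0]}
```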

Moreover, Tamar discusses the importance of transparency and user empowerment in AI-driven enterprise search solutions. She emphasizes the need for clear communication and guidance, enabling users to understand and trust the outputs generated by AI systems. By leveraging a combination of human expertise and automated evaluation tools, Glean ensures that its AI solutions are both robust and user-friendly, catering to the diverse needs of modern enterprises.

Simon Last on Continuous Improvement at Notion

Simon Last, Co-Founder and CTO of Notion, shares Notion's approach to logging failures and creating reproducible test cases. He describes an iterative loop of collecting failures, adjusting prompts, and validating fixes, which ensures continuous improvement of AI capabilities. Simon emphasizes the importance of privacy and user consent in data sharing for evaluation purposes.

Simon highlights the necessity of a repeatable system for managing AI failures and improvements. He states, "You need to make sure that those work and they don't regress," underscoring the importance of maintaining a robust evaluation framework that ensures AI models continue to meet user expectations and adapt to new challenges. This approach enables Notion to deliver reliable and effective AI-driven solutions that enhance user productivity and collaboration.
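
Here is a minimal sketch of that failure-to-regression loop: log each failure with the exact inputs needed to replay it, fold it into a regression set, and re-run the set after every prompt change. The file format and helper signatures are assumptions, not Notion's internals:

```python
# Hypothetical failure-to-regression loop: every logged failure stores the
# exact inputs so the inference can be replayed verbatim, then becomes a
# permanent regression case that each prompt change must keep passing.
import json
from pathlib import Path

REGRESSIONS = Path("regressions.jsonl")

def call_model(prompt_template: str, **inputs) -> str:
    raise NotImplementedError  # stand-in for the real inference call

def log_failure(prompt_template: str, inputs: dict, bad_output: str, expectation: str) -> None:
    record = {"prompt_template": prompt_template, "inputs": inputs,
              "bad_output": bad_output, "expectation": expectation}
    with REGRESSIONS.open("a") as f:
        f.write(json.dumps(record) + "\n")

def rerun_regressions(prompt_template: str, passes) -> float:
    """`passes(expectation, output)` is a deterministic check or an LLM judge."""
    cases = [json.loads(line) for line in REGRESSIONS.read_text().splitlines()]
    results = [passes(c["expectation"], call_model(prompt_template, **c["inputs"]))
               for c in cases]
    return sum(results) / len(results) if results else 1.0
```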

Additionally, Simon discusses the value of transparency and collaboration in AI development, advocating for open communication and feedback loops within development teams. By fostering a culture of continuous learning and improvement, Notion is able to refine its AI capabilities and deliver innovative solutions that meet the evolving needs of its users.

Evaluation and Testing Strategies Across Industries

The episode brings together common themes and strategies from all guests regarding AI evaluation and testing. A key takeaway is the balance between synthetic tests and real-world scenarios in ensuring AI product reliability. The discussions emphasize the role of human judgment and automation in refining AI models and outputs, highlighting the need for a comprehensive approach to evaluation and testing.
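
One way to picture that balance, purely as an assumption-laden sketch: keep a hand-written synthetic seed set alongside cases sampled from opted-in real-world failures, and evaluate against the merged suite so neither source dominates:

```python
# Hypothetical blended eval suite: synthetic seed cases plus a sample of
# real-world (opt-in) failure cases, so reliability is measured against
# both designed scenarios and actual usage.
import json
import random
from pathlib import Path

def load_cases(path: str) -> list:
    return [json.loads(line) for line in Path(path).read_text().splitlines()]

def build_suite(synthetic_path: str = "synthetic.jsonl",
                realworld_path: str = "realworld_optin.jsonl",
                realworld_sample: int = 200, seed: int = 0) -> list:
    synthetic = load_cases(synthetic_path)
    realworld = load_cases(realworld_path)
    random.seed(seed)
    sampled = random.sample(realworld, min(realworld_sample, len(realworld)))
    return synthetic + sampled
```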

Furthermore, the guests underscore the importance of adaptability and resilience in AI development, encouraging developers to embrace new technologies and methodologies to optimize their AI solutions. By fostering a culture of innovation and experimentation, organizations can harness the full potential of AI to drive business success and deliver exceptional user experiences.

Full Script

**Simon Maple:** [00:00:00] You're listening to the AI Native Dev brought to you by Tessl

Hello and welcome to another episode of the AI Native Dev. And this is going to be a bit of a mashup episode where we're going to take a listen to a number of excerpts from previous episodes where we talk about whether we as organizations are ready to be able to say this product works better after these changes to my model and this really takes into account various things, including eval testing, evaluations, regression testing and so forth. This is gonna be a mashup between Des Traynor, Intercom founder as well as Rishabh Mehrotra [00:01:00] who's the Head of AI at SourceGraph. Tamar Yehoshua, who is the president of products and technology at Glean. And lastly, but by no means last, Simon Last, the Notion co-founder and CTO. We're going to start off with Des Traynor's take and one of the things that I love about Des Traynor is, of course he recognizes devs love to ship code and one of the things that is really important when we're using LLMs in our applications is we don't truly understand, we don't truly know if something really works until it's in production using real production data, real data that the users are throwing at it.

And one of the lovely quotes that Des mentioned in his section is that with sufficiently simple tests, Des could be as smart as Einstein. Because the tests are so simple, they could both obviously do so well at it. In this section of the episode Des talks about this wonderful thing called the torture test, which is, in Des's words, all the psychotic scenarios [00:02:00] that Fin, the Intercom tool, will get put into in the real world. So it's this set of torture tests that can really ascertain whether changes to the model, changes to prompts, et cetera, really do have an effect in production. So over to Guy and Des.

**Guy Podjarny:** We have the pleasure, dubious pleasure of building LLM based products at Tessl. I've had some of that at Snyk. It's different. They're annoying. They are hard to evaluate whether you're doing your job correctly or not. Sometimes costs are different. CICD is different. Some of the tools are quite immature.

And we're talking probably a year before that experience that I'm referring to here. How hard was it for a developer who previously didn't work in that surrounding to adapt it? How did you help the dev team? Not just, Hey, I reprioritize this feature or am I overblowing the change?

**Des Traynor:** The change is big.

It's actually been a subject of a lot of internal debate. I'll give the umbrella answer and [00:03:00] then we can get to specifics in a little bit. But generally speaking, everything about how we build software products changes when you bring in generative AI, probability based systems, or whatever you want to call it.

There's a classic product development systems like double diamond. It's like research, make decisions, build, like ship, and then see if it works, whatever. You have like extra diamonds all over the place when it's like AI, because It's, you don't start with here's the user problem. Let's solve it. You tend to start with what's technically possible to begin with.

And then you say, Hey, does this new technical capability, let's say, Hey, we can read PDFs in a second. Okay. Does that map onto any problems that we have to solve for our users? And then you find one of those. And then, so then you've worked out this thing can be built and it seems like it solves this problem.

Even then, everyone's DNA and software is, so build it and ship it and move on to the next feature. What's the big deal? And I think the problem is once these things go live, you still don't know if it works or not. So you have to do so much, whereas in typical boring bread and butter B2B SaaS, someone's, oh, I [00:04:00] wonder, can we merge expenses?

Okay. So you get two receipts, you have to combine them. Oh, okay. If you ship that feature and people use it, you're like, great. You're not sitting here going, Ooh, I wonder if it works. You're assuming that they're like the mathematics worked or whatever.

**Guy Podjarny:** Yeah.

**Des Traynor:** Whereas in all these systems, once it goes live, it goes through a whole other, like almost like product evolution, which is it actually working the way we thought it did.

And obviously we do our best to not just test these systems on what we would call happy paths. A happy path is like, how do I reset my password? And unhappy path is, hey, my account is stored in my husband's name and it's my credit card. And I need to move it to his credit card, but also we need to change address at the same time.

That's where like these systems struggle massively. They're all brilliant at like, how do I reset my password? In the same way that if me and Einstein sit down and take a maths test, and it's just basic arithmetic, I look as smart as him, yeah. So you have to have this higher pass filter to get through.

But nothing like reality. Reality bats last in AI; it's the [00:05:00] last person to step up, and reality tells you like, hey, Fin works great in all these circumstances, but here's a pocket that it doesn't work in. Here's an area that it can't reach or whatever. Yeah. So I think the first thing I'd say to the question of enabling devs, empowering devs to work in this LLM first world is like getting outta your head the idea that things just simply do or don't work, because it's all a spectrum now; getting out of your head that your core job is to just clarify or remove ambiguity. Ambiguity is all over the place. It's your job now, it's such a... There's like an art and science to like threading all the needles, defining the useful solutions.

And that's before we've actually sat down in front of a text editor and started banging out code. That's in working out what problem we could solve, why we think it's solvable; like most of your prototypes in this era are like written by just prompting an LLM and seeing what it does with certain prompts and seeing what it does in response to certain asks.

And basically if you can get to a consistent, reliable set of behaviors, then you can start to think about [00:06:00] how would we productize this? What are the API calls? And then what's the UI look like? And then how would we describe it? All that, but you have so much like pre-work to do. All that said, there is obviously, around any given like product, if we just, I'll take Fin because that's the one I know best, there is still a lot of bread and butter B2B SaaS shit you have to build. Someone has to import docs and somebody has to have permissions and it has to be charged for it. So there's a pricing feature. Like it's not all, it's not like you're an LLM engineer or bust, it's the secondary features that fall into the classic.

Like our reporting engine still draws graphs, but that's not new. But.

**Guy Podjarny:** It's just like any like engine, if you built any sort of proper product, at the end of the day, the engine is 10, 20 percent of the surface of the product.

**Des Traynor:** Exactly that. And I think if there's a whole thin wrapper, thick wrapper debate that happens a lot in the industry.

And I think like Fin is very much like a thick wrapper. There's a lot around Fin that needs to be there to make it work really well, such that the actual, the pieces are like the little, like the resolution engine that sits inside of it. That is a part of the puzzle. It's a really important [00:07:00] high impact part.

But it alone isn't sufficient, it still needs the 16 other things, the vector search, the content import, and all that sort of stuff.

**Guy Podjarny:** So it sounds like a massive change. Sounds like the, a lot of the changes is in the sum, the total of the product. There is a good chunk of people that are still working on things that don't require them to change their, way of working still a good portion that did maybe in the total.

So I guess, are there some key takeaways? Like I'm sure you've tried things and some work, you had a little bit of a headstart cause they didn't have some AI capabilities you had.

**Des Traynor:** But we had an AI product live since I think 2017, 2018. So like we definitely had prior experience here and that was one of the reasons we can move fast because we are able to reuse a lot of our existing features.

For example, like the vector search engine, we'd already built one for a previous product, et cetera. So that's one of the reasons why we were like out of the gates so fast. If I was to really say like the biggest lessons I've observed, like speaking of what I'm skipping over here is I'm sure there are libraries and APIs and all sorts of stuff you need [00:08:00] to read and get good at, et cetera.

I think to actually be a performance engineer in this era requires some amount of ability to interrogate data or perform data science to measure outcomes. To give you a very simple example. When we make a change, we've run probably like 50 different AB tests on Fin's performance. And it's very tempting.

Everyone comes in who might join the group and say something like, Oh, I have a cool feature. I could add, how about we make it so that blah, like, how about we make it so that it'll always ask this question at the end of the thing and what you don't realize is like all of these things are different iterations of a resolution engine, whose job is ultimately to given a question, answer it.

And if you break that, it doesn't matter how cool your little addition was. Now, the thing is that's not a binary yes, no condition. It's like today Fin's performance is about 50%. So 50 percent of inbound support volume will be resolved by Fin. And if somebody rolls out a cool feature that brings that 50 down to 40, that's a problem.

**Guy Podjarny:** Yeah.

**Des Traynor:** [00:09:00] If they do that and don't even know to check or to realize, or to look at its performance over a trailing 30 days or whatever, that's an even bigger problem. So there's a very much, you need to work out. You have to understand the trick to working with these systems is to bottle up a probabilistic outcome and productize it.

And you have to work out who is allowed to play with the probabilities of that. And understand that it's not like two different worlds. It's very easy for somebody in our messenger team to damage the performance of Fin, even though they don't write to that code base, but if they frame it wrong and they introduce Fin in the wrong way, it'll change what people write and if it changes what people write, then it'll change how Fin performs.

So it's really the extra skill required here is the ability to assess. Does this thing do the job it's supposed to do at a really high performance level? And you have to have some way to measure that, some way to test it, have bake offs of different engines against each other, some way to have an acceptance framework for when, all right, that's been live for 30 [00:10:00] days.

No one's complained. It looks like we've got a 4 percent increase in resolutions, ship it. It's like that, none of those skills really existed before in, I'd say in most companies, but definitely not in Intercom.

**Guy Podjarny:** Yeah, no, it sounds fascinating. I guess the good news is there's some analogy between what you're describing and the state of the art in terms of continuous deployment, right?

If you're working with flags, if you're in Facebook and you're shipping, I forget the name of their sort of feature system there, it's some gradual, a number of people, then there's some similarity of it, but this is much more en masse, much more predictable and pretty much does it kill almost substantially the value of a sort of CICD assessment for anything that isn't the more bread and butter SaaS piece?

How much having run with us now for a bit, how trustworthy or how revealing are the results of the synthetic tests or the CICDs type, the pre deployment tests that you have.

**Des Traynor:** We have two tests. We have one test we perform internally, which is expensive to run, which is like our torture test. So let's just say [00:11:00] it's all the psychotic scenarios that Fin will get put into in the real world, but we were, we want to make sure it doesn't like barf.

So we don't do this unless we have good reason, but let's just say like GPT-4o drops or something like that. 4o, the tempting thing to do, and you'll see a lot of our competitors do this, is just, holy shit, new model, plug it in. We're good. And I think what that says to me is it speaks to the maturity of their rollout.

Like we also know the same API endpoints to ping if you want 4o. I think it's not, we weren't sitting here going, oh, I wonder, how do you do this? So our approach to such a scenario is: okay, let's, first of all, run this through the torture test. And our torture test is there to deliberately give us hard situations where we have a very clear outcome.

So for example, let's say Tessl is using Fin. Would you want Tessl bot to recommend competitors over Tessl? Would you want it to answer a tier three question? If it didn't know the answer, how would you want it to behave? Would you want it to make up an answer? A long variety, like genuinely hundreds of scenarios where most people agree on what the right [00:12:00] outcome is?

But like GPT-4o might not agree with that. So this is where like you have to control and contain the behavior. So once it clears the torture test and we look at its performance, then we'll move to, look, let's put this live. Let's see how it works in actual reality, which oftentimes is just

**Guy Podjarny:** In a gradual fashion, like you wouldn't turn it on for everybody.

You would graduate. You would. Yes, absolutely.

**Simon Maple:** That's a great segment from Des, and one of the things that I love at the end that he said was, it's super easy just to, drop in a new model into your application, but you really don't know what that means for users without that pre testing and the evaluation testing. How does that new model behave?

How is it going to change its answers to the same user question if you change that model? Really important to understand. That changing is easy, but it's not necessarily the smart thing to do without those pre-tests. So thank you very much to Des. Okay, next up, we have Rishabh, the Head of AI at SourceGraph.

Now, Rishabh is very passionate [00:13:00] about evaluations, because he has a very deep history in machine learning. And one of the things that he talks about quite a bit in his piece here is about how you can test models with the zero to one metric. And one actually really strong takeaway that he leaves us with is how writing a good evaluation is actually more important than training a good model.

Over to Rishabh.

**Simon Maple:** And testing is going to be very similar as well, right? Yeah, if someone's writing code in their IDE line by line, and they're maybe using a code generation tool as well, like Cody, they're going to likely want to be able to have tests kept in step, in sync. So as I write code, you're automatically generating tests that are effectively providing me with that assurance that the automatically generated code is working as I want it to.

**Rishabh Mehrotra:** Yeah. That's a great point. I think that this is more like errors multiply. Yeah. If I'm evaluating something after I'm done writing it, then it's worse off, right? Because the errors, I could have stopped the errors earlier [00:14:00] on and then debugged it and fixed it locally and then moved on. Yeah. So especially, so.

Taking a step back. Look, I love evaluation. I really, in machine learning, I started my PhD thinking that, hey, maths and like fancy graphical models are the way to have impact using machine learning. Right? And you spend one year in the industry and realize, nah, it's not about the fancy models. It's about, do you have an evaluation?

You have these metrics. Do you know when something is working better? Yeah. So I think getting the zero to one on evaluation on these data sets, that is really key for any machine learning problem.

**Simon Maple:** Yep.

**Rishabh Mehrotra:** Now, especially when

**Simon Maple:** What do you mean by the zero to one, though? Yeah,

**Rishabh Mehrotra:** zero to one is, look at, like, whenever a new language model gets launched, right?

People are saying that, hey, for coding, LLMs like Llama 3 do well on coding. Why? Because, oh, we have this HumanEval dataset and a pass@1 metric. Let's unpack that. HumanEval is a dataset of 164 questions. Hey, write me a binary search in this code, right? So, essentially, it's like, you get a text and you write a function.

And then you're like, Hey, does this function run correctly? So they have a unit test for that. And if it passes, then you get plus one, [00:15:00] right? So now this is great. It's a great start. But is it really how people are using Cody and a bunch of other coding tools? No, they are like, if I'm an enterprise developer, if let's say I'm in a big bank, then I have 20, 000 other peers and there are like 30, 000 repositories, right?

I am not writing binary search independent of everything else, right? I'm working in a massive code base, which has been edited across the last 10 years. And there's some dependency by some team in Beijing. And there's a function which I haven't even read, right? And maybe it's in a language that I don't even care about.

I understand. My point is, the evaluation which we need for these real world products is different than the benchmarks which we have in the industry, right? Now, the 0 to 1 for evaluation is that, hey, sure, let's use pass@1 and HumanEval at the start on day 0, but then, we see that you improve it by 10%, we have results when we actually did improve pass@1 by 10-15%, we tried it online on Cody users, and the metrics dropped.

**Simon Maple:** Yeah.

**Rishabh Mehrotra:** And we're writing a blog post about it, about offline-online correlation. Yeah. Because if you trust your offline metric, pass@1, you improve it. You hope [00:16:00] that, hey, amazing, users are going to love it. Yeah.

**Simon Maple:** It wasn't true. The context is so different.

**Rishabh Mehrotra:** Yeah. The context is so different. Now this is, this means that I got to develop an evaluation for my feature.

And I got my evaluation should represent how my actual users using this feature feel about it. Just because it's better on a metric, which is an industry benchmark doesn't mean that improving it will improve actual user experience.

**Simon Maple:** And can that change from user to user as well? So you mentioned a bank there.

If five other banks, is it going to be the same for them? If it's something not in the FinTech space, is it going to be different for them?

**Rishabh Mehrotra:** That's a great point. I think the nuance you're trying to say is that, hey, one, are you even feature aware in your evaluation? Because pass@1 is not feature aware, right?

**Simon Maple:** Yeah.

**Rishabh Mehrotra:** Pass@1 doesn't care about autocomplete or unit test generation or code fixing. I don't care what the end use case or application is. This is just the evaluation. So I think the first jump is: have an evaluation data set which is about your feature, right? The evaluation data set for unit test generation is going to be different than code completion.

It's going to be different than code edits. It's going to be different than chat. So I think the zero to one we were talking about five minutes earlier, you got to do [00:17:00] zero to ones for each of these features. Yeah. And that's not easy because evaluation doesn't come naturally. Yeah. And once you have it, then the question becomes that, hey, okay, once I have it for my feature, then, hey, can I reuse it across industries?

**Simon Maple:** Is it testing now? Is it guardrails that become the most important thing? And almost, I would say, more important than code, or is still code the thing that we need to care about the most?

**Rishabh Mehrotra:** Yeah. I think like, if I view this extreme of like, again, if I put my evaluation hat right, I think I want to be one of the most prominent proponents, vocal proponent of evaluation in the industry, not just in code AI, in the machine learning industry, we should do more evaluation.

So there, I would say that writing a good evaluation is more important than writing a good model. Yeah. Writing a good evaluation is more important than writing a better context source because you don't know what's a better context source if you don't have a way to evaluate it, right? So I think for me, evaluation precedes any feature development.

Yeah. If you don't have a way to evaluate it, then you don't, you're just shooting darts in the dark room, right? Some are going to land by luck. Now [00:18:00] there, in that world, right? I have to ensure that unit tests and evaluation is like ahead in terms of importance and just like code. That said.

I think overall, right? What's more important is like task success, right? Which is again, what is our success? You're, you're not just looking at unit test as an evaluation. You're looking at evaluation of the overall goal, which is, Hey, do I do this task right? And then I think if that's fair as an orchestrator, if I start treating these AI agents could be Cody, autocomplete or like, like any specific standalone agent powered by SourceGraph as well, probably.

So in those words, evaluation of that task, because you are the domain expert. Assume AGI exists today. Assume the foundation models are going to get smarter, smarter, like billions of dollars, trillions of dollars eventually into it. We train these fans, again, the smartest models and they can do everything.

But you are best placed to understand your domain on what the goal is right now. So you are the only person who can develop that evaluation of like, how do I know that you're correct? How do I know we're 90 percent correct, 92 percent correct? And again, right, the marginal gain on 92 to 94 is going to be a lot more harder than going from 80 [00:19:00] to 90, right?

It always gets harder. Like, I mean, there's going to be like an exponential hardness increase over there. So essentially the point then becomes purely on evaluation, purely on unit tests, right? What makes us, what are the nuances of this problem, of this domain, which the model needs to get right? And are you, are we able to articulate those and be able to generate those unit tests or generate those guardrails and evaluations so that I can judge how the models are getting better on that topic?

Right. So the models are going to be far more intelligent, great. But then what is success? You as a domain expert get to define that. Yeah. And this is a great thing, not just about coding, but also like any domain expert using machine learning or these tools across domains, you know what you're using it for, right?

The other AGI tools are just tools to help you do that job. So I think the onus is on you to write good evaluation or even, I mean, maybe tomorrow LLM as a judge and like people are developing foundation models just for evaluation, right? So there are going to be other tools to help you do that as well.

Code foundation models for like unit tests, maybe that's the thing in six months from now, right? the point then becomes, what should it focus on? That's the role you're playing. [00:20:00] Orchestrating, but like orchestrating on the evaluation. Oh, did you get that corner piece right? Oh, you know what? This is a criticality of the system, right?

Again, right? The payment gateway link and the authentication link, some of these get screwed up, then massive bad things happen, right? So you know that. So I think that's where like the human in the loop and your input to the system starts getting crossed.

**Simon Maple:** Great stuff from Rishabh. Next up we're going to be hearing from Tamar Yehoshua, who's the president of products and technology at Glean. And Glean's a really interesting company. It effectively does enterprise search across a vast number of sources, many of which have very sensitive data, including HR systems and various things within that enterprise.

So Glean as a product allow multiple models to be used and the user can select which one they choose to use. Glean, interestingly, obviously don't have access directly into that data because that data is very sensitive to the customer. So when it comes to thinking about how Glean would test and run evaluations [00:21:00] across multiple models with data that they don't have access to, it becomes a bit of a challenge and makes evaluation testing very hard.

Now, one of the really interesting things that was talked about in this episode was how Glean use LLMs as a judge and as a jury and it's a really interesting concept of not necessarily having employees from Glean look at the prompts and the queries that are being created and try to validate them.

But that's actually about having an LLM do that validation. So the LLM is determining whether those queries are correct. And of course, that determinism is very hard as well to have, and that jury perhaps lessens the non-determinism effect of an LLM. So this is really another great episode.

And so over to Guy and Tamar

**Guy Podjarny:** So I guess let's talk, indeed, keep promising, talking about the sausage making, let's do that now. I guess, what is it like building a product that works this way? The results of the product vary by the [00:22:00] data that comes in, the success rates are very different, and that's on top of the fact that the LLMs move at such a kind of lightning pace, and so you get new models all the time, and if I understand correctly, you actually even don't pick the model.

You have to work with models that the customer does. So maybe tell us a little bit about how do you interact with the models or just to say that, but what I'm really interested is you talked about the LLM as a judge. You had a post about this. Just how do you know that your product works?

**Tamar Yehoshua:** It's very difficult.

No, it's the non deterministic aspect is the most interesting and challenging. My first week at Glean, I was talking to the head of the assistant quality team, just to learn, what do you do and how does it work? And realize that a lot of their time was spent talking to customers. Who did not have the right expectations of what the, what it could do, or we're complaining that it was non deterministic.

I did this query once I did it again, and it didn't get the same thing. And we're starting to get used to ChatGPT and this, the [00:23:00] concept of non deterministic, but in an enterprise, you're a CIO, you buy software, you pay a lot of money for it. You expect it to have the same answer every time. And so one is getting our customers comfortable with what LLMs can do and what the boundary conditions are is part of what we have to do in our product.

And it can't just be in the marketing and in the enablement, you have to have some way in the product to take care of this as well, to understand what are the challenges going to be for people. So I explained the RAG architecture. So we have our way of evaluating the search; most of the team came from Google search ranking and built eval tools, just like Google search had.

So that kind of that's like bread and butter. Like, how do you eval search? We have a whole process for eval.

**Guy Podjarny:** This is not the user side. This is internal. This is to evaluate whether your search is working correctly. With information, the engineers, your engineers,

**Tamar Yehoshua:** this is our ranking, the ranking algorithms with a lot of [00:24:00] information on exactly what was triggered and what wasn't.

And what scores it had. It is a little bit more difficult because at Google, you could use third party raters to say, here's a change we're making. Is this change good or bad evaluated? Because this is enterprise data, we can't use third party raters. We can only use our tools and engineers to look at the data.

So that's a, uh, another wrinkle on top for, for enterprise. But, but going back to your question of what's different is understanding the user mindset when they're using this product, how do you help them through that to give them guardrails so they can better understand what the product can do, and then how do you evaluate it and how do you make sure that it's working as intended?

With this new technology that nobody fully understands why it's giving the answer that it does.

**Guy Podjarny:** Yeah. What are some examples of things that you would do in the product to, to help, to help people?

**Tamar Yehoshua:** A big one is suggest. One of the things about Glean is that we understand your [00:25:00] organization. So in the connectors that we do, we also have a Workday connector, Active Directory, so we understand who you are, who's, who are your peers, who's on your team, so we can suggest prompts that people in your team have been using. Oh, you're a PM. Here's a prompt from another PM, the things that they've been doing that you might want to try. And so we can suggest, or generic suggestions. We've been experimenting with all different ones, but that can help guide people into areas of, this is the right, these are ways that you can get value.

And that's going to be really important. And we want to do a lot more of that. And the Glean apps was a big way of doing that as well. Here's some more structured prompts and triggers. This is where, if you have an IT question, you'll go here. If you have a, want to build a customer brief, here's a Glean app for building a customer brief.

So that helps people go to things that somebody has curated. Somebody like the 5 percent of people in the company really [00:26:00] understand how to work with prompts and LLMs. They're going to do that and they're going to help. And we're going to be doing more and more of that. And I hope that is a way that's whole nother angle of how we are doing that.

And then we also have just teams dedicated to eval, and understanding what changes need to be made, what don't, how to evaluate new models. Um, as you mentioned, we, customers can decide what model they want to use. We validate the model. We certify, I should say, if Gemini 1.5 Pro comes out. So we will certify it for our customers before we enable them to use it.

But we have let our customers pick OpenAI, Anthropic or Gemini for the LLM aspect of the work. And so that's another thing that's tricky also is working with the different models,

**Guy Podjarny:** but how, so I, I understand it like the suggest notion or the idea of disseminating actually probably a useful practice for any product, not a LLM specific [00:27:00] one, which is take the forerunners and provide an easy way to disseminate their sample uses to the rest of the organization. But on the other side in terms of what happens if it fails, what happens if hallucinations are a thing in this world? So ranges from it didn't okay. Maybe you're good on the search side. So it finds the relevant data.

You feel pretty good, but did it understand it? Did it process it correctly? Did it present it correctly? How do you evaluate it when you certify? What types of tools are at your disposal to know, when you're using a new model or even just evolving your software, that it got better?

**Tamar Yehoshua:** So first of all, in the product, customers can do a thumbs up, thumbs down.

Obviously we get more thumbs down than thumbs up because that's just the nature of people. But we, but that's helpful in that all those queries come back to us so that we know, here's the set of bad queries. We also evaluate things like, in search it's easy: did somebody click on it? Did they find what they were looking for in the system?

It's trickier because they might have gotten the answer or not gotten the answer. But for example, if they [00:28:00] try a couple of queries in the system and go to search afterwards and then find the document that they needed, we know that the assistant didn't give them the answer. So we have some, we have a metric.

They index the satisfaction, right? Index from search and the assistant. So we look at that and we measure that very heavily of how many bad queries did we get? How many thumbs down did we get? So that's one.

**Guy Podjarny:** Yeah. And that's posted. Those are all things that, yes, that's the refer people using the product.

**Tamar Yehoshua:** Right. That's the proxy for how well are we doing? And if we're doing hill climbing to improve, is it going up or going down? And then evaluating is super tricky. So we have, as you mentioned, we've started using LLM as a judge, and there are many ways that the LLM could go wrong, or the whole assistant could go wrong.

It could pick the wrong query to send to the retrieval engine. The retrieval could not find the document, it could find the wrong one, or it could miss something. And then in the generative step, it could not be a complete answer. It could [00:29:00] not be grounded in the facts. It might pull in public data instead of the data that you had.

And then, so you've got the completeness. The groundedness and the factualness. So we've been using LLMs to judge our answers for the assistant in these different areas. So completeness is one that we get the most thumbs down. If it's not a complete answer, we'll get thumbs down for it. And then we've correlated the thumbs down with LLM as a judge and the completeness.

So that's the most easiest for an LLM to then evaluate the results of the LLM.

**Guy Podjarny:** Yeah.

**Tamar Yehoshua:** And so we have completeness. We have grounded: did it tell you which, which context it came from? And then the factualness is the hardest, and for the factualness, what you need is a golden set, essentially

**Guy Podjarny:** being the, you did not hallucinate.

**Tamar Yehoshua:** Grounded is more of the hallucinations. Okay. We don't have a big problem with hallucinations at Glean because of the RAG based architecture and because we do the citations. So it, the [00:30:00] groundedness is the most aligned to, um, to hallucinations, but sometimes it's not grounded in an enterprise doc because it might be like the stock price that you asked for, for a company, and it might just be public knowledge.

The factualness is, was it, LLMs are very confident. So they'll say with great confidence that something is correct. And then a user will not thumbs down those because they'll just assume it's correct. And those are the most dangerous. And that's what we're doing: trying to actually have a golden set.

We're having an LLM extract queries from documents and then measure the effectiveness of, are we finding them. So we get the golden set. And then, so this one actually, we're still working on the factualness, how to best measure it. But the best part that we've done now is, now we have a repeatable process for LLM as judge.

We have an eval framework of how we use, and we can just turn the crank when a new model comes out and we can get an evaluation across [00:31:00] these metrics for new models. And for changes as we're making changes in the code, we can evaluate them more easily. Is it perfect? No, but it's a lot better than, you know, engineers manually going and looking at every query and

**Guy Podjarny:** that golden set needs to be created per customer.

**Tamar Yehoshua:** So we use a lot of times we use our data for some of these golden sets to make sure, but we do have in our eval, we actually run queries in our customer environments in their deployments because we can't look at their content. But we can run things and evaluate them and get the results of the evaluation.

**Guy Podjarny:** So that's interesting. And that's a good learning. So you can't look at their data. You have to make sure that your check did not cause problems with their data or see, at least try to assess if that's the case. So what you're agreeing with them is that you'd be able to run some processes that are not actually functional or run a bunch of these tests on their platform with their data, but you won't access the data. It would just get the results of like thumbs up, thumbs down, [00:32:00] different version of the thumbs up, thumbs down to say, yeah, it feels good for you to, to deploy this, uh, this new version or to upgrade to this new model.

**Tamar Yehoshua:** We're very cautious of making sure we have very strict agreements of how we're handling customer data, but we absolutely run regressions and which it was a, it's an interesting process that we've gone through.

**Guy Podjarny:** Yeah. And so we'll talk about LLM as a judge, which is the, the, the LLM looks at the answer and says the result I see in the golden set seems sufficiently familiar, sufficiently similar to the result that the product, the live product, gave right now. And then, I found interesting in the blog post that you wrote about this, there's this notion of LLM jury, which has the risk of taking the analogy a bit too far, but I guess, do you want to say a couple of words about that?

**Tamar Yehoshua:** Uh, it's exactly what it sounds like. Just multiple, you want to assess not just one voice, but multiple voices to make sure that you're aligning.

**Guy Podjarny:** And I guess partly a means to deal [00:33:00] with the fact that LLMs themselves, like the evaluator itself is a random or is a non deterministic entity. So oddly actually quite aligned to the reason there is a judge and jury in the actual human judicial system.

**Simon Maple:** And last up we have Simon Last, Notion co-founder and CTO. Now Notion AI is, of course, one of the most impressive running real world applications of AI. And in the snippets that we're going to be sharing with you today, one of the interesting things here is how Simon details how Notion effectively log failures that are occurring with exact reproduction capabilities, and what this allows them to do is effectively build up a data set of failures with regressions.

Now there's an iterative improvement loop here whereby one can collect the failures, adjust prompts, rerun the evaluations, and then validate the fixes against those evaluations. So it's a really interesting loop [00:34:00] of how there's a continuous improvement with validation from the evals as well, and of course one really important thing is there needs to be an opt-in from a privacy point of view.

It needs to be that opt-in for data sharing for evaluation purposes only, to allow, and ensure rather, that privacy and segregation from test to production data. So here's Simon, co-founder and CTO of Notion.

**Guy Podjarny:** Yeah, I'm very curious to dive in because indeed some things are the same and you just want to tap into them and some things are quite substantially different. But I guess still on the organization, do you think about the team that is originally building it as a platform team? How did you consider that division now that you've been building this for a couple of years and you have, it sounds like multiple teams working it, how do you think about the shared infrastructure, if you will, between those or the shared learnings or evaluation methods as we'll touch on in a sec?

How do you think about, I guess, enabling these other teams to adopt AI more easily?

**Simon Last:** Yeah, that's a good question. Yeah. Initially we didn't think of it that [00:35:00] way. It's more just like a end to end product team. The only goal is just to

**Guy Podjarny:** just ship them.

**Simon Last:** Yeah. Just to ship useful stuff. And then we're learning along the way how to do evals.

And then our goal is to make it more of a platform team. So we're trying to take what we build and expose it as something reproducible. I would say it's actually pretty tricky because in many ways we don't really do any. Training of models. It's all about taking the best models out there and packaging them into the product and making everything work well, which involves doing logging in the right way, like setting up your prompt, doing logging in the right way, doing evals, all that stuff.

And it's interesting. Most of it is actually, there's not that much technically challenging. I would say about the platform aspect. A lot of it is more about the best practice of how you do it and the knowledge of. What steps you take and even how to think about it. Yeah, it's a tricky thing to get other people to work on without context.

What we've had most success with actually is when someone new wants to work on AI, we've had a lot of success with just having them join the AI team temporarily. And then they just join our standups and [00:36:00] get in the weeds with us every day. And then quickly pick up all the context around like how to do evals, how to.

Yeah, we found it challenging the other way where we just like give them like the code pointers to all the different layers of the stack.

Yeah, here's a thing that would run the inference or whatever it is, but

Yeah, and the code isn't that complicated. I would say the complexity is actually more in like the best practice of how you do things and the mental model, I don't even think about it.

**Guy Podjarny:** I think that makes perfect sense. I actually had Patrick Debois, who's a big kind of DevOps luminary, on the podcast. And we talked about how the analogies work from the DevOps era. And a lot of it is indeed about platform teams and reuse, but also about embedding and about someone walking a mile in your shoes a little bit and learning, both for empathy, as you might get annoyed later on when the product doesn't work quite as you want, but for skill sharing.

And in security, in DevSecOps, we use the same approach. I think we're running out of letters we can add into the DevSecOps kind of combo. I think we need to find a new strategy there, but I think the approaches themselves work quite well. I love that approach and it makes perfect sense to me. Maybe let's, indeed, [00:37:00] I want to talk a bit more about the skills and those gaps, but maybe let's first describe.

This is what has been repeated oftentimes, and probably said most succinctly by Des at Intercom in an early episode here: it's really hard to know whether your product works when you're building on AI. So I guess what has your experience been when you're building these capabilities? How do you think about evaluation, about testing, about knowing whether it hits a threshold but also knowing whether it hasn't regressed?

**Simon Last:** Yeah, it's super hard. It's a really different experience than the pre AI world. Most of my experience is in building Notion, building products and it's been honestly really painful. It's very exciting and fun and like I'm pretty obsessed with the new technology but it's very painful and I've missed the old world.

In the old world I could have an idea. And it might take me longer than I thought, but I could definitely ship it. But in the AI world, I can often have an idea and then be surprised that it didn't work in some way I didn't expect. I would say, I guess the way I think about it is, it's a two fold challenge of, [00:38:00] One is, for the situations that you test, you need to make sure that those work and they don't regress.

And then there's this nebulous space of things that you haven't even tested. And it can be arbitrary, and then filling enough of that space so that you're confident that it's gonna match the distribution of what users are gonna request enough that you're reasonably confident, you can never like, like, like fully match it.

Yeah. I guess the way I think about that is, yeah, you need like a repeatable system or engine around this. And the key pieces are something like you need, you need to set up good logging, such that interactions can be called up again. If it fails in some way, you can go look at that. And it's extremely important to be able to exactly reproduce.

This is the failure situation, the log has to be an exact reproduction of the error. And then you should be able to just rerun that inference with the exact same input. That's really important. If you can't do that, you're totally screwed.

**Guy Podjarny:** And that creates probably some challenges. Like [00:39:00] I fully relate to the full logging, even more so than in regular instrumented systems.

How do you work around the kind of the privacy concerns or some of the user data concerns around it?

**Simon Last:** Yeah. So, so we take that really seriously. What we do is we have an opt in early access program. Okay. So when you use Notion AI, we show you this little pop up and the default is no, but you can opt into sharing data with us and we don't use it for training.

It's just for evaluation. So basically that allows us to see the logs. Like if you thumbs down, we can actually see the log with the input and then we can add it to an eval data set and we segregate out the prod data to make sure that it's not contaminated with other sources. But yeah, that's been extremely helpful.

Let people opt in and our user base is large enough that even if a pretty small percentage opt in, it's still quite a lot of data.

**Guy Podjarny:** So logging is one.

**Simon Last:** Yeah, so logging is one. So yeah, I would say the next step is around collecting the failures into data sets. That's pretty important. So just like organizing them in some reasonable way such that for the given task, you have some data set or data sets that are [00:40:00] like, these are all the cases that we care about and all the known previous failures.

And then the next bit is being able to evaluate that a given failure is successful. still happening or not. And that can be really hard. There's many ways to do evals. I think about you can have evals. Definitely best use deterministic when you can, and then you can use like a model graded eval when not.

For model graded evals, they work best when it's as concrete and specific as possible. I've seen people fail a lot doing model-graded evals when it's like too generic or the task is too difficult. I think I said this in the last one, but if your model-graded eval is not like robustly easy for the model, then you have to have an eval for your eval and you just created like a new stack of problems,

**Guy Podjarny:** an infinite loop.

So it relates to the, the data piece. So the data sounded like. Going further back, you have to create a set of test cases for the things you want to work. And then a set of cases for problematic ones. I guess that's synthetic. That's at the beginning. Why is your Delta feature? And then you need to have [00:41:00] access to real world scenarios.

And you curate out of that a data set of failure cases, with the proper user permission, from the users that opted in to help improve the system. Do you accumulate good cases there as well? Does the data set mostly grow in the negative sense of "don't do this," or is it the same one? Is it basically: here's a scenario, here's the correct answer and here's the incorrect one?

**Simon Last:** I tend not to care about good cases that much, because what's the actionable value of that? Maybe it's nice to have a few of them just as a sanity check, but to me the main goal is to collect concrete regressions, fix them, and then make sure that they stay fixed.

And then over time you're ratcheting up: all these things that used to not work are now working, and you're growing a set of things that continue to work, and that's ideal. You'll often find issues that you just can't really fix, and so that's, that's part of the

**Guy Podjarny:** Yeah, we'll come back to that.

That's really interesting, because, you know, a product works at this percentage of...

**Simon Last:** Yeah, models have limits. There are many issues that we can't [00:42:00] fix, and we're not a foundation model company. It's like, alright, we'll just wait until the next model comes out, and hopefully it'll fix this one. So you want some kind of eval for that. It could be just a human-graded eval, although that'll quickly get out of control.

Then the really key next thing is you need to enter this loop where you find some new regression, you add it to your dataset, then you change the prompt in some way, then you rerun the examples in the dataset, and then you need to be able to decide whether it's better or worse. And ideally that's as automated as possible.

And then ship that change. And then there's just this constant loop of new example, improve the prompt.
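
The loop Simon describes might look roughly like this in code; `run_case` and the pass-rate scoring are assumptions for the sketch, not his actual tooling.

```python
# A sketch of the regression loop: capture the new failure, tweak the prompt,
# rerun the whole dataset, and ship only if the score does not drop.
def run_case(prompt_template: str, case: dict) -> bool:
    """Hypothetical: run the task on this case with this prompt and grade it."""
    return True  # stand-in; a real check would call the model and evaluate the output

def eval_score(prompt_template: str, dataset: list) -> float:
    results = [run_case(prompt_template, case) for case in dataset]
    return sum(results) / len(results)

def try_prompt_change(old_prompt: str, new_prompt: str, dataset: list, new_failure: dict) -> bool:
    dataset.append(new_failure)                 # first, capture the regression
    baseline = eval_score(old_prompt, dataset)
    candidate = eval_score(new_prompt, dataset)
    return candidate >= baseline                # ship only if nothing got worse
```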

**Guy Podjarny:** And I think on the eval side, Notion is pretty open or flexible in what you can do with it. There are a lot of degrees of freedom to it, so it's easy to think about non-deterministic evaluations that you could have.

Notion is, you know, you're looking at text and unstructured data and the likes. What's an example of a structured evaluation that you can do in a Notion AI capability?

**Simon Last:** Yeah, a lot of them. Like, anything to do with formatting the output in some way. [00:43:00] Maybe it's XML, maybe it's JSON. That's ideal.

I think about it like this: if you're having some issue, the best way to solve it is with a validator. Make it so that the invalid output is impossible by structuring the output in some way where you can deterministically say that the bad output is incorrect. That's the best way to solve any problem, and it's the easiest way to evaluate it.
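
A minimal sketch of the "solve it with a validator" approach: constrain the output to a structure you can check deterministically and reject anything invalid. The block types and schema here are illustrative assumptions, not Notion's actual output format.

```python
# Validate structured model output so bad output is deterministically rejected.
import json

ALLOWED_BLOCK_TYPES = {"paragraph", "heading", "bulleted_list_item"}  # illustrative only

def validate_output(raw: str) -> list:
    """Parse the model output and raise if it is not the structure we require."""
    blocks = json.loads(raw)  # must be valid JSON at all
    if not isinstance(blocks, list):
        raise ValueError("expected a list of blocks")
    for block in blocks:
        if block.get("type") not in ALLOWED_BLOCK_TYPES:
            raise ValueError(f"invalid block type: {block.get('type')!r}")
        if not isinstance(block.get("text"), str):
            raise ValueError("each block needs a string 'text' field")
    return blocks
```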

Another deterministic eval that we use all the time: I love using classifiers as part of the flow. Classifiers are really great. By classifier, I just mean some inference that outputs an enum, a fixed set of possible values.

They're really easy to eval, because the output...

**Guy Podjarny:** You classify the result of the product's activity? Like, you classify the output from the LLM?

**Simon Last:** No, actually as part of the flow. So. For example, one thing we need to do as part of the chat experience is decide whether to search or not, just as an example.

So it's: should I search or not search? Yes or no, basically. There are classifiers that have more than two possible values, but yeah, those are really great, [00:44:00] in large part because they're so easy to evaluate. You just have this ground truth output, and then you just compare the actual versus expected.

It's really great for that reason. You can collect a big data set. And you can get a score.
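
A minimal sketch of evaluating such a classifier step against ground-truth labels, with exact comparison and a simple accuracy score; `classify` is a hypothetical stand-in for the real routing inference.

```python
# Evaluate a classifier step (e.g. "search or not?") by exact comparison
# against expected labels, then report a simple accuracy score.
def classify(query: str) -> str:
    """Hypothetical: returns one of a fixed set of values, e.g. 'search' / 'no_search'."""
    return "search"

def classifier_accuracy(dataset: list) -> float:
    # Each case is {"input": ..., "expected": ...}; compare actual versus expected.
    correct = sum(1 for case in dataset if classify(case["input"]) == case["expected"])
    return correct / len(dataset)

examples = [
    {"input": "what did we decide about pricing last week?", "expected": "search"},
    {"input": "rewrite this sentence to be more formal", "expected": "no_search"},
]
print(classifier_accuracy(examples))
```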

**Guy Podjarny:** And I guess, what has been the driver behind the choice not to fine-tune? Or actually, you know what, I might be jumping to conclusions here. You said you're not training foundation models. Do you find yourself fine-tuning the model?

**Simon Last:** Yeah, so we've tried fine-tuning extensively. I've personally banged my head on it a lot. I'd have to ask OpenAI, but I feel like I'm, if not the highest, maybe 99th percentile in number of fine-tunes created. Yeah, we've tried fine-tuning quite a lot. I would say the big issue with fine-tuning is that you're making your job at least a hundred times harder, because you need to collect this whole data set, and the actual fine-tuning process can be difficult.

Especially because it's a lot slower: you have to wait for things to train. And then it's really hard to debug issues. You [00:45:00] basically create this black box where, as you do successive fine-tuning runs, any new example can poison the entire dataset. You can mess up the model.

So it really requires you to have extremely good evaluations, and it's really hard to debug issues. I've had issues with fine-tuning where I literally would spend weeks trying to figure out what the problem was, and it was just extremely hard. Yeah, I would say I'm honestly not that bullish on companies outside of the foundation model companies doing fine-tuning, and if I meet a startup and they say they're doing fine-tuning, I actually now think of that as a negative update. It's a lack of

**Guy Podjarny:** experience. You buy into the promise, but it doesn't actually... Yeah, it sounds cool.

**Simon Last:** And I got bit by this bug too. It sounds so cool, and I think, as engineers, we want to have more control. We want to do the technically cool thing, the technically powerful thing. And I was totally susceptible to that, and it was really fun to do. But, I don't know, I'm a huge fan of in-context learning.

[00:46:00] Like, it's gonna get better. Yeah, there are a lot of reasons to do it. Another big reason not to fine-tune is that if you're not a foundation model company, you really want to just be using the best model. And the progress is so fast, so if something doesn't work now, there's a good chance it's going to work in the near future.

And if you're fine-tuning, you're really just locking yourself into this slow, complicated process that makes it hard to update. In-context learning is amazing because if there's a new model, I can switch the next day.
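
A rough sketch of the in-context-learning alternative: the task specification and a few examples live in the prompt, so adopting a new model is a one-line change. The model name, example task, and `call_model` client are hypothetical placeholders.

```python
# In-context learning instead of fine-tuning: task knowledge lives in the
# prompt, so switching models is just a config change.
MODEL = "best-available-model"  # swap this when a better model ships

FEW_SHOT = [
    ("Ship the beta by Friday and tell the team",
     "- Ship the beta by Friday\n- Tell the team"),
]

def build_prompt(text: str) -> str:
    shots = "\n\n".join(f"Text: {t}\nAction items:\n{a}" for t, a in FEW_SHOT)
    return f"Extract the action items as a bulleted list.\n\n{shots}\n\nText: {text}\nAction items:\n"

def call_model(model: str, prompt: str) -> str:
    """Hypothetical inference client; replace with your provider's SDK."""
    return "- example action item"

def extract_action_items(text: str) -> str:
    return call_model(MODEL, build_prompt(text))
```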

**Guy Podjarny:** And I guess, in that case, that was my next question: how often do you try different models, and how rigorous a testing process do you feel is necessary to make that switch?

Because with new features, you control the timeline. A new model comes along, and I guess you still control the timeline of how quickly you roll it out. How do you think about the trade-off between wanting to tap into the new hotness versus, I don't know, potentially moving slower? How confident do you feel about, hey, the next one came out,

Can I sort of slot that in and get it out within the week?

**Simon Last:** Yeah, that's a good question. I would say the speed at which you can do it is really [00:47:00] dependent on the quality of your evals: how good are they at telling you confidently that this change will improve the overall experience?

Yeah, I would say that's one of the reasons why evals are so important. Probably changing prompts is more important than changing models; that's going to happen much more often. But the same evals work for both. Right. But yeah, we've definitely invested a lot in really making sure the tools are right.
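
One way to picture that point: the same eval suite scores every candidate (model, prompt) configuration, and a switch only happens when the score clearly improves. The configuration names and scoring function below are assumptions for the sketch.

```python
# One eval suite gates both prompt changes and model upgrades.
def eval_score(config: dict, dataset: list) -> float:
    """Hypothetical: run every case under this config and return a pass rate."""
    return 0.0

def should_switch(current: dict, candidate: dict, dataset: list, margin: float = 0.01) -> bool:
    # Only adopt the candidate if it beats the current config by a clear margin.
    return eval_score(candidate, dataset) >= eval_score(current, dataset) + margin

current = {"model": "current-model", "prompt_version": "v12"}
candidate = {"model": "newly-released-model", "prompt_version": "v12"}
```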

And the evals are really good. But also, I would say, as a products company, I think it's pretty important not to be constantly jerked around by new models coming out. I think it's important to really understand the task that you, as a product, uniquely care about. Your alpha should be deeply understanding the task, what it means to produce a good result, and the product experience around that, while watching new capabilities and releasing them as they [00:48:00] come out.

I think it's really about you and your product, and thinking about the models as a means to an end to enable that. And very often a new model is shiny and cool, but maybe it just doesn't really get you much benefit for the tasks that you care about.

**Guy Podjarny:** Yeah. I'm curious now. What's your favorite example? Is it just that?

**Simon Last:** Oh, it's an app. Yeah. It's a pretty simple app, but it touches all the pieces that, uh, I think are interesting. It's a calorie-tracking app. All it is is a table with the food and the amount of calories, and then there's an input.

You can just type whatever you want, and then it's supposed to extract out each food item and the calories and add them to the table. So it involves a database, a backend, a frontend, and then it also has to write a prompt and call into a language model to do the extraction. And last time I tried, it actually did get the initial version.

But then I asked for just one small feature change, and it didn't work from there. They've made huge improvements and it's super impressive, but yeah, I feel like we're not quite there yet.
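
For illustration, the extraction step that test app needs might be sketched like this; the prompt, schema, and `call_model` client are assumptions rather than anything described in the episode.

```python
# Sketch of the calorie-tracking app's extraction step: ask the model for
# (food, calories) pairs as JSON, validate, and append them to the table.
import json

def call_model(prompt: str) -> str:
    """Hypothetical inference client."""
    return '[{"food": "banana", "calories": 105}]'

def extract_entries(free_text: str) -> list:
    prompt = (
        "Extract each food item and its calorie count from the text below as a "
        'JSON list of {"food": string, "calories": integer} objects.\n\n' + free_text
    )
    entries = json.loads(call_model(prompt))
    return [e for e in entries if isinstance(e.get("calories"), int)]  # basic validation

table = []  # stands in for the app's database-backed table
table.extend(extract_entries("a banana and two slices of toast"))
```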

**Simon Maple:** And a massive thank you to Simon there, and of course to [00:49:00] Tamar, Rishabh, and Des at the very start of the episode, and of course to my co-host Guypo for helping host many of those sessions. This has been a very interesting episode for me personally, just because it's been really nice going through some of the past episodes and picking out some of the learnings that are consistent and common across the many leaders who have kindly given their time to share their thoughts and learnings with us.

Let us know what you think. If there are other topics that you feel are common across episodes, let us know and we'll create some further mashups of them. Till then, thanks very much for listening, and tune in to the next episode.

**Simon Maple:** Thanks for tuning in. Join us next time on the AI Native Dev brought to you by Tessl.

Podcast theme music by Transistor.fm.