
Navigating AI for Testing: Insights on Context and Evaluation with Sourcegraph

In this episode, we dive deep into the world of AI testing with Rishabh Mehrotra, an AI and ML expert from Sourcegraph. Learn about the complexities of AI in code generation, the role of machine learning models, and the critical importance of evaluation and unit testing. This conversation is packed with insights that are crucial for modern developers.


Episode Description

Join us as we explore the intricacies of AI testing in modern development with Rishabh Mehrotra from Sourcegraph. Rishabh shares his journey in AI and ML, discussing the evolution of AI models and their impact on coding. He introduces Cody, Sourcegraph's coding assistant, and explains its features, including code completion, code editing, bug fixing, and unit test generation. The episode emphasizes the importance of effective evaluation metrics and the role of custom commands and Open Context in enhancing AI accuracy. Rishabh also highlights the need for human oversight in guiding AI systems, ensuring they deliver high-quality, reliable code.


Chapters

[00:00:15] Introduction
[00:01:43] The Big Code Problem
[00:04:09] Evolution of AI and ML in Coding
[00:07:11] Features of Cody
[00:13:13] Importance of Evaluation
[00:16:36] Custom Commands and Open Context
[00:20:35] The Future of AI in Development
[00:26:22] Human-in-the-Loop

Full Script

[00:00:15] Simon Maple: On today's episode, we're going to be talking about all things AI testing. So we're going to be dipping into things like the context that models need to be able to create good tests, things like evaluation: what does it mean for generated code? And then we'll look deeper into AI tests and when they should be done.

[00:00:34] Simon Maple: Should they be done early? Should they be done late and more automated? Joining me today is Rishabh Mehrotra from Sourcegraph; some of you may know Sourcegraph better as the tool Cody. Welcome to the session. Tell us a little bit about yourself, first of all.

[00:00:47] Rishabh Mehrotra: Thanks, Simon. This is a really interesting topic for me. I work at Sourcegraph. We've done code search for quite a few years now, and we tackle the big code problem. If you look at enterprises, enterprises have a lot of developers, which is great for them.

[00:01:02] Rishabh Mehrotra: They also have massive code bases. Look at a bank in the US: they would have 20,000 to 30,000 developers, and even more repositories, something like 40,000, right? So this big code problem is a huge pain. What we've done as a company is spend quite a few years tackling code search.

[00:01:17] Rishabh Mehrotra: And then last year we launched Cody, which is a coding assistant. Cody lives in your IDE and tries to make you a more productive developer, and there's a bunch of different features. My role right now at Sourcegraph is to lead the AI effort. I've done a lot of machine learning on the consumer side, looking at Spotify recommendations, music, short video recommendations, and my PhD was on search, which again ties in well because in LLMs you need context, and we're going to talk about a lot of that as well.

[00:01:43] Simon Maple: How many years would you say you've been in AI, or ML generally?

[00:01:45] Rishabh Mehrotra: Yeah. So I think the first research paper I wrote was in 2009, almost 15 years ago. And this was NLP. Again, this was not deep learning NLP, this was more traditional NLP, and I've been in the world where the domain experts handcraft the features. Then embeddings came in and washed all of it away.

[00:02:03] Rishabh Mehrotra: And then these neural models came in and washed the custom topic models away. And then these LLMs came in and washed a bunch of these rankers away. So I've seen some of these waves of: hey, if you create specific models which are very handcrafted, maybe a simpler, larger-scale, generalizable model will wash some of those models away.

[00:02:20] Rishabh Mehrotra: I've seen a few of these waves in the last few years.

[00:02:23] Simon Maple: And way before it was cool.

[00:02:24] Rishabh Mehrotra: Yeah, exactly. I keep saying to a lot of early PhD students: in the 2010s, a lot of PhD students would get their doctorate just by creating latent variable models and doing some Gibbs sampling inference.

[00:02:34] Rishabh Mehrotra: Nobody even knows that 15 years ago, this is what would get you a PhD in machine learning.

[00:02:38] Simon Maple: right? Yeah.

[00:02:38] Rishabh Mehrotra: So again, things have really moved on.

[00:02:40] Simon Maple: So no one can say you weren't there, that you weren't there in the early days, back when you were writing the Gibbs

[00:02:44] Rishabh Mehrotra: Sampler code in C, yeah. We didn't even have TensorFlow, PyTorch, any of these. And I think we've seen this: Cody and a lot of these other gen AI coding tools are trying to make us as developers work at higher and higher abstractions, right?

[00:02:58] Rishabh Mehrotra: I started my research and ML career not writing PyTorch code; I wrote a Gibbs sampler in C, right? And now I don't touch C unless it's really needed. So a lot of these frameworks have come in which have abstracted the complexities away and made my life easier.

[00:03:15] Rishabh Mehrotra: I've seen it as an IC over the last 10 to 15 years, but it's also happening across the industry, right? You don't have to be bothered by low-level primitives; you can tackle higher-level things. And that's exactly what Cody is trying to do: how do I remove toil from a developer's life and let them focus on the interesting pieces, the creative pieces, and really get the architecture right?

[00:03:36] Simon Maple: So let's talk a little bit about some of the features you mentioned, because when we think about a tool like Cody, it's not just code suggestion; there's a ton of different things, and you mentioned some of them. We'll probably be talking a bit more about testing today, but when we think about ML and the use of ML here,

[00:03:55] Simon Maple: how does it differ across the various features that a tool like Cody has to offer? Does it need to do things differently when it's chat versus code completion versus testing?

[00:04:04] Rishabh Mehrotra: Yeah, that's an excellent point. Let me take a step back, right? Let's not even talk about coding assistants. Let's go back to recommendations,

[00:04:09] Rishabh Mehrotra: and let's look at Spotify or Netflix, right? Spotify is famous for: oh, it knows what I want, it suggests nostalgic music for me. So people in the wild love the recommendations from Spotify. I'm not saying this just because I was an employee at Spotify, but also as a user, right?

[00:04:24] Rishabh Mehrotra: I like Spotify recommendations.

[00:04:25] Simon Maple: I have to say Netflix is the one for me. I've never binge-listened to anything on Spotify, right? But Netflix has kept me up till the early hours questioning my life. Yeah.

[00:04:36] Rishabh Mehrotra: I think there's a series there, right? You look at Netflix for videos, you look at Spotify for music, you look at TikTok and Instagram for short videos.

[00:04:42] Rishabh Mehrotra: So the mediums have gone from hours of content to minutes of music to seconds of short video. But the point is, we love these products, and it's not just one ranker. Netflix is not just one ranker, right? It's a bunch of different features, a bunch of surfaces and user touch points.

[00:04:59] Rishabh Mehrotra: And each touch point is powered by a different machine learning model, different thinking, different evaluation, right? So my point is, this is how the industry has evolved over the last 10 to 15 years. Most of these user-centric applications are loved by people and used by hundreds of millions of users, 400 million or whatever, every month.

[00:05:14] Rishabh Mehrotra: They are a mix of different features and each feature needs a design of what's the ML model, what's the right trade offs, what are the right metrics, what's the right science behind it, what's the right evaluation behind it. So we've seen this and now I'm seeing exactly the same at Cody. If you look at Cody, so Cody lives in your IDE.

[00:05:29] Rishabh Mehrotra: So if you're a developer using VS Code or JetBrains and you're writing code, then we can code-complete, right? So autocomplete is an important feature. That's very high volume, right? Why? Because not only do a lot of users use it, but when you're writing code in a file, you will trigger autocomplete maybe hundreds of times.

[00:05:45] Rishabh Mehrotra: Because you're writing one word, you're writing some syntax in a line, and it'll trigger. And then if you like the ghost text that's recommended, you can just hit tab and it's selected, right?

[00:05:55] Simon Maple: Yeah,

[00:05:56] Rishabh Mehrotra: So it helps you write code in a faster way. Now, I would say this is at the extreme of: hey, I have to be very latency sensitive and really be fast, right?

[00:06:05] Rishabh Mehrotra: Because when you're on Google writing a query, Google will do auto-suggestions, right? Can I complete your query? So we've been using some of these features as users for the last decade, on the consumer side.

[00:06:15] Simon Maple: And this is really important as well, because this interaction you're talking about here is a very emotive one.

[00:06:20] Simon Maple: You will piss people off if you don't get it right, because this is supposed to be an efficiency play. If you don't provide them with something that will actually improve their workflow, where they can just tab and say, yeah, this is so cool, because I'm just tabbing and accepting,

[00:06:35] Simon Maple: then they're going to get annoyed to the extent that they will almost reject that kind of tool. Exactly.

[00:06:40] Rishabh Mehrotra: This is where, quality is important, but latency is also important. So now we see that there's a trade off.

[00:06:44] Simon Maple: Yeah.

[00:06:44] Rishabh Mehrotra: That, hey, yes, I want quality, but then if you're doing it like 400 milliseconds later, Hey, I'm already annoyed.

[00:06:50] Rishabh Mehrotra: Why? Because of the perception of the user when they're completing, right? They don't want to wait. If I have to wait, then I'd rather go to code edit or chat.

[00:06:57] Simon Maple: That's a great point. Does latency differ between the types of things that Cody offers? Because I would guess that if a developer is going to wait for something, they'd be likely to wait a similar amount of time across all of them, but are they different?

[00:07:11] Rishabh Mehrotra: Yeah, not really. Let me go back to the features, right? So Cody has autocomplete, which helps you complete code when you're typing. We have code edit and code fix: hey, there's a bug, you can write a command or select something and say, hey Cody, fix it. Now, when you're fixing, people are okay to wait for a second or two as it figures it out, and then they're shown a diff: oh, I like this change and I'm going to accept it.

[00:07:28] Rishabh Mehrotra: Now, this is going to span maybe 3,000 milliseconds, right? Three to four seconds. Versus autocomplete: no, I want everything, tab, complete, selected within 400 to 500 milliseconds of latency. So already the differences are popping up. Now we've talked about autocomplete, code edit, code fix; then we could go to chat as well, right?

[00:07:43] Rishabh Mehrotra: Chat is: hey, I'm typing in a query. It'll take me a few seconds to type in the right query and select the right code, right? So the expectation of 400 milliseconds is not really there in chat, because I'm asking a maybe more complex query. I want you to take your time and give the answer.

[00:07:58] Rishabh Mehrotra: Versus unit test generation, for example, right? Unit test generation is: hey, write the entire code, and make sure that you cover the right corner cases. Unit tests are about great coverage and not missing the important stuff. You're making sure that the unit test is actually quite good.

[00:08:10] Rishabh Mehrotra: Now there, I don't want you to complete in 400 milliseconds, take your time, right? Good code. I'm waiting. I'm willing to wait a long time. Yeah, let's take a step back. What are we looking at? We're looking at a few different features. Now similar, right? Netflix, Spotify, it's not just one recommendation model.

[00:08:24] Rishabh Mehrotra: You could do search, you could do podcasts, you could do: hey, I want this category of content, a bunch of these, right? So similarly here in Cody, you have autocomplete, code edit, code fix, unit test generation. You have a bunch of these commands. You have chat. Chat is an entire nightmare. I can talk for hours about chat; chat is this one-box vision which people can come up with hundreds of intents for.

[00:08:44] Rishabh Mehrotra: And each intent is a nightmare for me as an engineer to say: are we doing well on that? Because for autocomplete, I can develop metrics around it. I can think about it as a specific feature. Chat may be masking hundreds of these features behind natural language.

[00:09:00] Rishabh Mehrotra: So we can talk a little bit more about chat as well over a period of time. But coming back to the original point you mentioned: for autocomplete, people will be latency sensitive. For unit test generation, maybe less. For chat, maybe even less. What that means is the design choices which I have as an ML engineer are different.

[00:09:15] Rishabh Mehrotra: In autocomplete, I'm not going to look at 400-billion-parameter models, right? I want something which is fast. So if you put latency on the x-axis and quality on the y-axis, I don't want to go to the top right. Top right is high latency and high quality. I don't want high latency; I want to be in that 400 to 500 millisecond end-to-end latency space.

[00:09:34] Rishabh Mehrotra: So there, small models kick in, right? And small models we can fine-tune to great effect. We've done some work; we just published a blog post a couple of weeks ago on fine-tuning for Rust. Rust has a lot of nuances which most of these large language models are not able to capture.

[00:09:49] Rishabh Mehrotra: So we can fine-tune a model for Rust and do really well on autocompletion within the latency requirements we have for this feature. So then these trade-offs start emerging, essentially.
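To make that trade-off concrete, here is a minimal sketch of routing a request to a model based on a per-feature latency budget. The feature names, budgets, and model labels are illustrative assumptions, not Cody's actual configuration.

```python
# Hypothetical per-feature latency budgets (milliseconds) and model routing.
# Fast, small (possibly fine-tuned) models serve latency-critical features;
# larger models serve features where users are willing to wait.
LATENCY_BUDGET_MS = {
    "autocomplete": 500,             # must feel instant
    "code_edit": 4000,               # a diff a few seconds later is fine
    "unit_test_generation": 30000,   # quality matters more than speed
}

def pick_model(feature: str) -> str:
    budget = LATENCY_BUDGET_MS[feature]
    if budget <= 500:
        return "small-finetuned-model"   # e.g. a Rust-tuned completion model
    if budget <= 5000:
        return "mid-size-model"
    return "large-model"
```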

[00:09:57] Simon Maple: How does that change if the output that Cody, say, was going to provide the developer would actually be on a larger scale?

[00:10:04] Simon Maple: So when we're talking about autocomplete, we're really talking about a one-liner complete. But what if we were to say: I want you to write this method, or I want you to write this module? Obviously then you don't want the developer to accept an autocompleted module or function and think, this is absolute nonsense.

[00:10:22] Simon Maple: It's giving me nonsense quickly. Presumably then they're willing to wait that much longer.

[00:10:27] Rishabh Mehrotra: Yeah, that's a really good point. Look, people use Cody, and not just Cody, and not just in the coding domain, right? We use copilots across different areas: sales copilots, marketing copilots, finance and risk copilots.

[00:10:37] Rishabh Mehrotra: People are using these agents or assistants and for various different tasks, right? And some of these tasks are like complex and more sophisticated. Some of these tasks are like simpler, right? Yeah. So let me, let me just paint this like picture of how I view this, right? When you pick up a topic to learn, right?

[00:10:53] Rishabh Mehrotra: Be it programming, you don't start with like multi threading. You start with hey, Do I know the syntax? Can I instantiate variables? Can I go if else? Can I do a for loop and can I do switch and then multi threading and then parallelism? Yeah. So when we as humans learn, there is a curriculum we learn on, right?

[00:11:06] Rishabh Mehrotra: We don't directly go to chapter 11. We start with chapter 1. Similarly, the lens through which I view some of these agents and tools is: okay, you're not at chapter 12 yet, but I know that there are simpler tasks you can do, and then there are medium tasks you can do, and then there are complex tasks you can do.

[00:11:21] Rishabh Mehrotra: Now, this is a lens which I've found to be pretty useful, because for autocomplete, for example, my use case is probably not: hey, tackle a chapter 12 complexity problem. No, for that I'll probably have an agentic setup.

[00:11:35] Rishabh Mehrotra: Yeah. So this curriculum is a great way to look at it. The other great way to look at things is, let's just call it left versus right. On the left, we have these tools which live in your IDE, and they're helping you write better code and complete code, and really, you are the main lead.

[00:11:47] Rishabh Mehrotra: You're driving the car; you're just getting some assistance along the way, right? Versus things on the right, which are agentic: hey, I'm going to type in, here is my GitHub issue, bootstrap a PR for this, right? There I want the machine learning models to take control.

[00:12:01] Rishabh Mehrotra: Not full autonomy. Quinn, our CEO, has an amazing blog post on, levels of code AI, right? You start with, level 0 to level 7, and some of these, human initiated, AI initiated, AI led, and that gives us a spectrum to look at autonomy from a coding assistant perspective, which is great.

[00:12:15] Rishabh Mehrotra: I think, everybody should look at it. But, coming back to the question, autocomplete is probably for, I am still the lead driver here. Help me. But, in some of the other cases, I'm stuck. Take your time, but then tackle more complex tasks. Now the context and the model size, the latencies, all of these start differing here, right?

[00:12:34] Rishabh Mehrotra: When you're writing this code, for autocomplete you probably need local context, right? Or maybe, if you're referencing code from some other repository, bring that in as dependency code and then use it in context. If you're looking at new file generation or new function generation, that's: hey, you've got to look at the entire repository and not just make one change over here.

[00:12:52] Rishabh Mehrotra: You have to make an entire file and make changes across three other files, right?

[00:12:55] Rishabh Mehrotra: So even, where is the impact? The impact in autocomplete is local: in this file, in this region, right? And then if you look at the full autonomy case or agentic setups, the impact is: hey, I'm going to make five changes across three files in two repositories.

[00:13:09] Rishabh Mehrotra: Yeah, right? So that's the granularity at which some of these things are starting to operate, essentially.

[00:13:13] Simon Maple: Yeah, and testing is going to be very similar as well, right? If someone's writing code in their IDE line by line, and maybe they're using code generation as well, like Cody, they're likely going to want tests to be kept in step, in sync.

[00:13:28] Simon Maple: So as I write code, you're automatically generating tests that are effectively providing me with that assurance that the automatically generated code is working as I want it to.

[00:13:38] Rishabh Mehrotra: Yeah, that's a great point. Errors multiply. If I'm evaluating something long after writing it, then it's worse off, right?

[00:13:46] Rishabh Mehrotra: Because I could have stopped the errors earlier on, debugged and fixed them locally, and then moved on. So, taking a step back: look, I love evaluation. I started my PhD in machine learning thinking that, hey, maths and fancy graphical models are the way to have impact using machine learning, right?

[00:14:05] Rishabh Mehrotra: And then you spend one year in the industry and you're like, nah, it's not about the fancy models. It's about: do you have an evaluation? Do you have these metrics? Do you know when something is working better? So I think getting the zero to one on evaluation, on these datasets, is really key for any machine learning problem.

[00:14:20] Simon Maple: Yep.

[00:14:21] Rishabh Mehrotra: Now, especially when

[00:14:21] Simon Maple: What do you mean by the zero to one though?

[00:14:23] Rishabh Mehrotra: Yeah, the zero to one is: look at whenever a new language model gets launched, right? People say, hey, for coding LLMs, Llama 3 does well on coding. Why? Because, oh, we have this HumanEval dataset and a pass@1 metric.

[00:14:34] Rishabh Mehrotra: Let's unpack that. HumanEval is a dataset of 164 questions: hey, write me a binary search, that kind of thing. So essentially you get a text description, you write a function, and then: hey, does this function run correctly? They have a unit test for that, and if it passes, you get a plus one, right?
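For a concrete picture of what that metric is, here is a minimal sketch of computing a pass@1-style score over a HumanEval-like dataset. The generate_solution and passes_unit_tests helpers are hypothetical stand-ins for an LLM call and a sandboxed test runner, not the actual benchmark harness.

```python
def pass_at_1(problems, generate_solution, passes_unit_tests):
    """Fraction of problems where a single generated solution passes its reference tests."""
    passed = 0
    for prompt, tests in problems:             # e.g. 164 (prompt, tests) pairs
        candidate = generate_solution(prompt)  # one sample per problem (k = 1)
        if passes_unit_tests(candidate, tests):
            passed += 1
    return passed / len(problems)
```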

[00:14:50] Rishabh Mehrotra: Yeah. So now, this is great. It's a great start, but is it really how people are using Cody and a bunch of other coding tools? No. If I'm an enterprise developer, let's say in a big bank, then I have 20,000 other peers and there are 30,000 repositories. I'm not writing binary search independent of everything else, right?

[00:15:08] Rishabh Mehrotra: I'm working in a massive code base which has been edited over the last 10 years.

[00:15:12] Rishabh Mehrotra: And there's some dependency from some team in Beijing, and there's a function which I haven't even read, right? And maybe it's in a language that I don't even care about or understand. My point is, the evaluation we need for these real-world products is different from the benchmarks we have in the industry, right?

[00:15:28] Rishabh Mehrotra: Now, the zero to one for evaluation is: hey, sure, let's use pass@1 and HumanEval at the start, on day zero. But then say you improve it by 10%. We have results where we actually did improve pass@1 by 10 to 15%, we tried it online on Cody users, and the metrics dropped. We're writing a blog post about it, on offline-online correlation.

[00:15:47] Rishabh Mehrotra: Because if you trust your offline metric, pass@1, and you improve it, you hope that, hey, amazing, users are going to love it. It wasn't true.

[00:15:54] Simon Maple: The context is so different.

[00:15:55] Rishabh Mehrotra: Yeah, the context is so different. Now, this means that I've got to develop an evaluation for my feature, and my evaluation should represent how my actual users using this feature feel about it.

[00:16:06] Rishabh Mehrotra: Just because it's better on a metric that happens to be an industry benchmark doesn't mean that improving it will improve the actual user experience.

[00:16:12] Simon Maple: And can that change from user to user as well? You mentioned a bank there. For five other banks, is it going to be the same for them? If it's a company outside the fintech space, is it going to be different for them?

[00:16:21] Rishabh Mehrotra: That's a great point. I think the nuance you're getting at is: one, are you even feature-aware in your evaluation? Because pass@1 is not feature-aware, right? Pass@1 doesn't care about autocomplete or unit test generation or code fixing. It says: I don't care what the end use case or application is, this is just the evaluation.

[00:16:36] Rishabh Mehrotra: So I think the first jump is have an evaluation dataset which is about your feature, right? The evaluation dataset for unit test generation is going to be different than code completion, it's going to be different than code edits, it's going to be different than chat. So I think the zero to one we were talking about five minutes earlier, you've got to do zero to ones for each of these features.

[00:16:52] Rishabh Mehrotra: Yeah. And that's not easy because evaluation doesn't come naturally. Yeah. and once you have it, then the question becomes that, Hey, okay, once I have it for my feature, then, Hey, can I reuse it across industries?

[00:17:02] Simon Maple: Yeah.

[00:17:02] Rishabh Mehrotra: Can I reuse it across like users? And I think we've seen it.

[00:17:05] Rishabh Mehrotra: I've seen it in the traditional recommendation space. Let's say, most of these apps, again, if they got like seed funding last year or maybe series A, they're at what, like 10,000 daily active users, 5,000 daily active users today. One year from now, they're going to be 100k, 500k daily active users, right?

[00:17:20] Rishabh Mehrotra: Now, how representative is your subset of users today, right? the 5,000 users today are probably early adopters. And if anything, scaling companies in the last 10 years, what it has told us is the early adopters are probably not the representative set of users you'll have once you have a mature adoption.

[00:17:35] Rishabh Mehrotra: What that means is that the metrics which are developed and the learnings I've had from the initial A/B test may not hold one year down the line, or six months down the line, as the users start increasing, right? Now, how does it link to the point you asked? Look, there are heterogeneities across different domains, different industries.

[00:17:50] Rishabh Mehrotra: luckily, there are like homogeneities across language, right? if you're a front end developer, A versus B versus C companies, a lot of the tasks you're trying to do are like similar. And a lot of the tasks which the pre training data set has seen is also similar.

[00:18:03] Rishabh Mehrotra: Because, really, there are cases where you're doing something really novel, but a lot of the junior development workflow is probably more like things which hundreds and thousands of engineers have done before. So the pre-trained models have seen these before, right? So when we fine-tuned the model for Rust, that's not where we saw advantages.

[00:18:18] Rishabh Mehrotra: Because, yeah, you've seen it before, you're going to do it well.

[00:18:21] Rishabh Mehrotra: Coming back to the point I mentioned earlier, there's going to be a curriculum, right? You can do simple things well. You can do harder things in Python well, you can't do harder things in Rust well, you can't do harder things in MATLAB well.

[00:18:30] Rishabh Mehrotra: So my goal in fine-tuning some of these models is: hey, I'm going to show you examples of these hard tasks. Not because I just want to play with it, but because some of our adopters need it, right? We have a lot of enterprise customers using us and paying us for that; I get my salary because of that, essentially.

[00:18:46] Rishabh Mehrotra: So essentially, I want those developers to be productive. They're trying to tackle some complex tasks in Rust, which maybe we haven't paid attention to when training this Llama model or this Anthropic model. So then my goal is: how do I extract those examples and bring them into my training loop, essentially?

[00:19:01] Rishabh Mehrotra: And that's where, right now, if let's say one industry is struggling, we know how the metrics are performing. That's why evaluation is so important. We know where we suck right now, and then we can start collecting public examples and start focusing the models to do well on those, right?

[00:19:17] Rishabh Mehrotra: Yeah. Again, let me bring the exact point I mentioned. I'm going to say it a hundred times. We have done it before. If you spend 20 minutes on TikTok, you're going to look at what, 40 short videos? Yeah. If you spend five minutes on TikTok or Instagram reels, you're going to look at like 10 short videos, right?

[00:19:32] Rishabh Mehrotra: In the first nine short videos, you're going to either skip it or like it or follow a creator, do something, right? So the 11th short video is really personalized, because I've seen what you're doing in the last five minutes and I can do real-time personalization for you. What does that mean in the coding assistant world?

[00:19:45] Rishabh Mehrotra: Look, I know how these models are used in the industry right now, and how our enterprise customers and our community users are using them. Let's look at the completion acceptance rate for autocomplete. Oh, for these languages and these use cases, we get a high acceptance rate. We show our recommendation, people accept the code and move on.

[00:20:00] Rishabh Mehrotra: But in these advanced examples, oh, we're not really doing that. So the question then becomes: oh, this is something which maybe we should train on, or fine-tune on, and this establishes a feedback loop. Look at what's not working, then make the model look at more of those examples, and create that feedback loop which can make the models evolve over a period of time.
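As a sketch of that feedback loop, the snippet below groups autocomplete events by language, computes an acceptance rate per segment, and flags weak segments as candidates for fine-tuning data. The event fields and the threshold are hypothetical, not Cody's actual telemetry schema.

```python
from collections import defaultdict

def weak_segments(events, min_acceptance=0.2):
    """events: iterable of dicts like {"language": "rust", "accepted": True}."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for e in events:
        shown[e["language"]] += 1
        accepted[e["language"]] += int(e["accepted"])
    rates = {lang: accepted[lang] / shown[lang] for lang in shown}
    # Segments below the threshold feed the next round of fine-tuning examples.
    return {lang: rate for lang, rate in rates.items() if rate < min_acceptance}
```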

[00:20:20] Simon Maple: And so I think there's two pieces here if we move a little into what this then means for testing. The more that evaluation, almost like the testing of the model, gets better, the more accurate the suggested code becomes.

[00:20:35] Simon Maple: As a result, you need to rely on your tests slightly less. I'm not saying you shouldn't, but slightly less, because the generated code being suggested is of higher quality. When it then comes to the suggestions of tests afterwards, does that follow the same model, in terms of learning not just from what the user wants to test but also from what is being generated? Is there work we can do there in building that test suite?

[00:21:01] Rishabh Mehrotra: Yeah, I think that's an interesting point. Unit test generation is a self-contained problem in itself, right? One, let's just establish the fact that unit test generation is probably one of the highest-value use cases we've got to get right in the industry.

[00:21:15] Simon Maple: Why is that?

[00:21:15] Rishabh Mehrotra: Around code.

[00:21:16] Rishabh Mehrotra: And that's because, as I mentioned, you will stop using Spotify if I start showing shitty recommendations to you. If I don't learn from my mistakes, you're going to keep skipping short videos or music or podcasts, and you're going to go elsewhere, right? Say I'm out running, I'm jogging.

[00:21:31] Rishabh Mehrotra: I want music to just come in and I don't want to stop my running and hit, like skip. Yeah. Because that's, that's like dissatisfaction, right? Yeah. one of the things which I did at Spotify and like in my previous company was, Like, I really wanted people to focus on dissatisfaction.

[00:21:44] Rishabh Mehrotra: Yeah. Because satisfaction is oh, users are happy. Yeah. That's not where I make more money. I make more money by reducing dissatisfaction. Where are you unhappy? Yeah. And how do I fix that?

[00:21:52] Simon Maple: And if I stop you there, actually, just quickly: I think there are different levels to this as well, in terms of the testing, right?

[00:21:57] Simon Maple: Because at its most basic, it's: does this code work? And then as it goes up, it's: does this code work well? Does this code work fast? Is this really making you happy as a user? So where would you say we are right now in terms of the level of test generation?

[00:22:15] Simon Maple: Are people most concerned with the code that's being generated, or with: when I start creating my tests, how do I validate that my code is correct, in terms of it compiles, it covers the basic use cases I'm asking for, and it's doing the right thing?

[00:22:32] Rishabh Mehrotra: Now this, we can talk an hour about this.

[00:22:33] Rishabh Mehrotra: look, I think this is a long journey.

[00:22:35] Simon Maple: Yeah.

[00:22:36] Rishabh Mehrotra: Getting evaluation, getting unit test generation right across the actual representative use case in the enterprise.

[00:22:42] Rishabh Mehrotra: That's a nightmare of a problem. Look, I can do it right for writing binary search algorithms, right? If you give me a coding task which has nothing to do with a big code repository, with understanding context, with understanding what the 5,000 developers have done, sure, I can attempt it and create a unit test.

[00:22:58] Rishabh Mehrotra: Because this, there's a code and there's a unit test. This lives independently, right? Yeah. They live on an island. They're happily married. Amazing. Everything works. But this is not how people use coding. This is not how like the enterprise or like even pro developer use cases are. The pro developer use cases are about like, Hey, is this working correctly in this like wider context?

[00:23:15] Rishabh Mehrotra: Yeah. Because there's a big code repository and like multiple of them. And that's the wider context where you're actually writing the code and you're actually writing the unit test. Now I would bring in this philosophical argument of, I think unit test generation, I would look at it from an adversarial setting.

[00:23:29] Rishabh Mehrotra: What's the point of having the unit test? It's not just to make yourself or your manager happy that, Oh, I have unit test coverage. Unit tests are probably like a guardrail to keep, bad code from entering your system. Yes. So what is this? This is an adversarial setup. Maybe not intentionally adversarial.

[00:23:44] Rishabh Mehrotra: Adversarial setup is hey, somebody's trying to make bad things happen in your code. And somebody else is stopping that from happening, right? And again, if you start looking at unit test generation from this adversarial setup, that look, this is a good guy, right? the unit test is gonna prevent bad things from happening in future to my code base.

[00:24:01] Rishabh Mehrotra: That's why I need good unit tests. Now, this is the good guy, right? Who are the bad people right now? Up until the last few years, the bad actors, not intentionally bad, but the bad actors in the code base were developers. Now we have AI. Right now, I am, as a developer, writing code, and if I write shitty code, then the unit tests will catch it and I won't be able to merge.

[00:24:19] Simon Maple: If that same developer writes those unit tests. Exactly. We'll get to that, right? I'm yet to see a developer who is not a TDD fan, who absolutely lives for writing tests and building a perfect test suite for the code. Yeah,

[00:24:33] Rishabh Mehrotra: Yeah. Again, there's a reason why test coverage is low across all these repositories, right?

[00:24:39] Rishabh Mehrotra: Beyang, our CTO, loves to say that the goal of Cody is to remove developer toil. How do I give you a much happier job, focusing on the right creative aspects of architecture design or system design, and remove toil from your life, right?

[00:24:53] Rishabh Mehrotra: A bunch of developers, for better or worse, look at unit test generation as maybe not that interesting. Let's unpack that as well. Not all unit tests are boring, right? Writing stupid unit tests for stupid functions, we probably shouldn't even do it, or I'll let machine learning do it, essentially.

[00:25:08] Rishabh Mehrotra: But the nuance is: here's a very critical function. If you screw this up, then maybe the payment system in your application gets screwed and you lose money. And if you don't have observability, then you lose money over a period of time, and you're literally costing the company millions of dollars if you screw up this code, essentially.

[00:25:25] Rishabh Mehrotra: So the point is, not all unit tests are the same, because not all functions are equally important, right? There's going to be a distribution: some of those are really, really important functions, and you've got to get an amazing unit test right for them. If I have a limited budget, say two principal engineers, I would make sure that the unit tests for these really critical components in my software stack are written by these engineers, or, even if they are written by agents or AI solutions, at least they're vetted by those engineers.

[00:25:52] Rishabh Mehrotra: But before we get there, let's just look at the need for unit tests, not just today, but tomorrow. Because right now you have primarily developers writing unit tests, or some early tools. Tomorrow, a lot more AI assistants. I mean, we are building one, right? We are trying to say: hey, we're going to write more and more of your code.

[00:26:09] Rishabh Mehrotra: What that means is, if in the adversarial setup unit tests are protecting your code base, then the potential attacks, not intentional, but bad code, could come in from humans, but also from thousands and millions of AI agents tomorrow.

[00:26:22] Simon Maple: Yeah. And you know what worries me a little bit here as well: when you talked about those levels of autonomy, on the far left it's much more interactive, right? You have developers who are looking at the lines of code that are suggested and looking at the tests that get generated.

[00:26:36] Simon Maple: So it's much more involved for the developer. As soon as you go further right into that more automated world, we're in an environment where larger amounts of content are going to be suggested to that developer. And we go back to the same old story: if you want 100 comments on your pull request in a code review, you write a two-line change.

[00:26:58] Simon Maple: If you want zero, you provide a 500-line change, right? And when we provide that volume, whether it's, hey, I'm going to build you this part of an application, or a module, or a test suite based on some code, how much is a developer actually going to look in detail at every single one of those, right?

[00:27:14] Simon Maple: And I think this kind of comes back to your point about which are the most important parts I need to look at, but it also ties back to what you were saying earlier, whereby tests are becoming more and more important for this kind of thing.

[00:27:29] Simon Maple: And as code gets generated, particularly in volume, where are the guardrails for this? And it's all about tests.

[00:27:37] Rishabh Mehrotra: I love the point you mentioned, right? Look, as more and more code gets written, for a developer to look at every change everywhere just takes more time, more effort, more cognitive load, right?

[00:27:48] Rishabh Mehrotra: Now, couple that with the fact that if you've been using the system for a few months, there's an inherent trust in the system. This is when I get really scared. Look at when I started using Alexa in 2015: it would only get the weather right.

[00:28:02] Rishabh Mehrotra: Google Home, Alexa, they wouldn't do any of the other things right.

[00:28:05] Simon Maple: And who can get weather right?

[00:28:07] Rishabh Mehrotra: Yeah, weather prediction is a hard enough machine learning problem. DeepMind is still working on it, still trying to get it right, and doing it in London. Yeah, I would pay for a service which predicts weather. But the point is, as a society we've used these conversational agents for a decade now.

[00:28:20] Rishabh Mehrotra: And we were just asking crappy questions. Because of trust: I ask you a complex question, you don't have an answer, I forget about you for the next few months. But then we start increasing the complexity of the questions we ask. And that's great, because now Siri and Google Assistant and Alexa were able to tackle these questions, right?

[00:28:36] Rishabh Mehrotra: And then you start trusting them because, hey, I've asked you these questions and you've answered them well, so I trust you to do these tasks well. And again, if you look at people who use Spotify or Netflix and the recommendations in their feed, they have more trust in the system.

[00:28:50] Rishabh Mehrotra: Because most of these applications do provide you a way out, right? If you don't trust recommendations, go to your library. Go do search. Search versus recommendations is that push versus pull paradigm, right? Recommendations, I'm going to push content to you. If you trust us, you're going to consume this, right?

[00:29:05] Rishabh Mehrotra: If you don't trust the system, if you don't trust the recommendations, then you're going to pull content, which is search, right? And there's a distribution of people who don't search at all, right? They live in that high-trust world and they're just going to take the recommendation. Same here, right?

[00:29:19] Rishabh Mehrotra: Google: who goes to the second page of Google, right? Who goes to Google now, sorry. But essentially the point is, once people start trusting these systems, unit test generation becomes a system I start trusting, right? And then it starts tackling more and more complex problems, and then I stop looking at the corner cases. And then that code was committed six months ago, and that unit test is there, and then code on top of it was committed three months ago, and there's probably a unit test which I didn't write, the agent wrote. This is where complexity evolves over a period of time, and maybe there's a generation of unit tests which have been written with less and less of me being involved.

[00:29:56] Rishabh Mehrotra: Yeah. With the levels of code AI, that means your involvement is not at the finer levels; it's higher and higher up. Now, this works well if the foundations are correct and everything is robust and we have good checks in place. But again, whatever can go wrong will go wrong.

[00:30:11] Rishabh Mehrotra: So the point is, in this complex code, where generations of unit tests, generations of code cycles and edits have been made by an agent, things could go horribly wrong.

[00:30:22] Simon Maple: So, what's the solution?

[00:30:23] Rishabh Mehrotra: The solution is: pay more respect to evaluation, right? Look, "you've got to build these guardrails" is just a very harmless way for me to say that unit test generation is important, not just for unit test generation today,

[00:30:36] Rishabh Mehrotra: but for unit test generation and code generation tomorrow. So think about the kind of metrics we need, the kind of evaluation we need, the kind of robust auditing of these systems and of these unit tests. Again, I don't have 10 years of experience in the coding industry, because I've worked on recommendations and user-centric systems before.

[00:30:52] Rishabh Mehrotra: But for me it was like: hey, what is your test coverage in a repository? That's the most common way to look at a repository and how advanced you are in your testing maturity. And that doesn't cut it. Are you covering the corner cases?

[00:31:05] Rishabh Mehrotra: What's your, again, what's the complexity? What's the severity?

[00:31:08] Simon Maple: So do we need automated tests for our tests? Yeah, exactly. Or do we need people, humans, to look at them?

[00:31:13] Rishabh Mehrotra: We need both, right? we need human domain experts. Again, this is, and this is not just a coding problem.

[00:31:18] Rishabh Mehrotra: Yeah. Look, millions of dollars are spent by Anthropic and OpenAI on Scale AI. Scale AI has raised a lot of money recently, billion-dollar valuations, because we need domain experts to tag data. This is also a nightmare. At Spotify, and in search, I have a PhD in search,

[00:31:31] Rishabh Mehrotra: my search work was: I'll show users some results and they're going to tag whether this is correct or not. I can't do this in coding. Crowdsourcing has been a great assistance to machine learning systems for 20 years now, because I can get that feedback from the user. But to get feedback on complex Rust code, where am I going to find those crowdsourced workers on Scale AI or Amazon MTurk, right?

[00:31:52] Rishabh Mehrotra: They don't exist. You're not going to pay them $20 an hour to give feedback; they're like thousand-dollar-an-hour developers, right? We don't even have a community right now of crowdsourced workers for code, because this is domain specific, right? So my point is, this is not all doom and gloom; it's an important problem we've got to get right.

[00:32:14] Rishabh Mehrotra: And it's going to be a long journey. We're going to do evaluations, and we're going to do generations of evaluations. I think the right way to think about it is paying attention to unit tests, but also to the evaluation of unit tests. And, taking a step back, multiple levels of evaluation, right?

[00:32:28] Rishabh Mehrotra: You're going to evaluate: hey, are we able to identify the important functions, the criticality of the code, right? And then look at unit test generation from that lens. Now, immediately, one of the solutions, and I think when we met over coffee we talked briefly about it: let's say I generate 20 unit tests, right?

[00:32:46] Rishabh Mehrotra: Now, I want my principal or some respected, trusted engineer to vet at least some of them. Their day job is not just to write unit tests, right? Their day job is to maintain the system, advance it. So they're probably going to have a limited budget to look at the unit tests I've generated. Now the question becomes: if this week I was able to generate 120 unit tests, and you are a principal engineer on my team, you're not going to look at 120 unit tests and pass them. You maybe have two hours to spare this week. You're going to look at maybe five of these unit tests. Now this becomes an interesting machine learning problem for me. The 120 unit tests I've created:

[00:33:22] Rishabh Mehrotra: what is the subset of five which I need your input on? Now, this is one way to tackle these problems: reduce the uncertainty, right? Machine learning and uncertainty models, we've done that for 20 years in the industry. How certain am I that this is correct? And if I'm certain, then sure, I won't show it to you.

[00:33:39] Rishabh Mehrotra: Maybe I'll show one just to make sure I get feedback: I thought I was certain, did you confirm or did you reject that? And then I'll learn. But then I'm going to go on an information-maximization principle: what do I not know, and can I show that to you? So what that means is, it's a budget-constrained subset selection problem, right?

[00:33:57] Rishabh Mehrotra: I've generated 120 unit tests, you can only look at five of them, and I'm going to pick those five. Now, I can pick the five and show them to you all at once. Or I can pick one, get that feedback, see what I additionally learned from it, and then look at the remaining 119 again and say: knowing what I know just now,

[00:34:15] Simon Maple: what else would I show?

[00:34:16] Rishabh Mehrotra: what is the next one?
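A minimal sketch of that budget-constrained review loop: greedily pick the generated test the model is least certain about, ask the reviewer, and repeat within the budget. The uncertainty scores and the ask_reviewer callback are hypothetical stand-ins; a fuller system would re-estimate the remaining uncertainties (for example with a submodular or Bayesian update) after each answer.

```python
def select_for_review(candidate_tests, uncertainty, ask_reviewer, budget=5):
    """candidate_tests: {test_id: test_code}; uncertainty: {test_id: score}."""
    feedback = {}
    remaining = dict(uncertainty)
    for _ in range(min(budget, len(remaining))):
        # Information-maximization heuristic: show what we know least about.
        test_id = max(remaining, key=remaining.get)
        feedback[test_id] = ask_reviewer(candidate_tests[test_id])
        del remaining[test_id]
        # In a real system, update `remaining` here based on the new feedback.
    return feedback
```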

[00:34:16] Simon Maple: Yeah. And previously you mentioned the most important parts of the code to test, right? Because there could be critical paths, and then other areas where, you know what, if there's a bug it's far less of a problem. Who provides that context?

[00:34:28] Simon Maple: Is that something the LLM can decide, or is that something where, like you said, the principal engineer should spend a little bit of their time saying: these are the main core areas that we need to be bulletproof; other areas, I would rather spend my time on these than those.

[00:34:43] Rishabh Mehrotra: Yeah. I'm probably not the best person to answer who currently provides it. As a machine learning engineer, I see there are signals. If I look at your system, then: where are the bugs raised? What was the severity, right? For each bug, for each incident, there was a severity report.

[00:34:57] Rishabh Mehrotra: What's good? What code was missed? What code wasn't missed? Yeah. What were the unit tests created by people so far, right? So again, I view it from a data observability perspective, like knowing what I know about your code, about your severity, about your issues, about your time to resolve some of these. Then I can develop a model personalized to your code base on what are the core important pieces, right?

[00:35:18] Rishabh Mehrotra: So I can look at just the code: hey, which functions are calling which, where are the dependencies. And we have amazing engineers who are compiler experts in the company, right? One of the reasons I joined Sourcegraph was because, look, I bring in the ML knowledge, but I want to work with domain experts, right?

[00:35:33] Rishabh Mehrotra: And Sourcegraph has hired amazing talent over the last few years, and it compounds, right? So there are these compiler experts internally, Olaf and a few others among them, and they do really precise intelligence on the code and find out the dependency structure and all of that, right?

[00:35:48] Rishabh Mehrotra: And that gives me a content understanding of your code base, right? And then if a function is deemed important from those graph links, essentially how many in-edges and out-edges your function has, and a lot of other code is calling it, then that means there are a lot of downstream dependencies.

[00:36:01] Rishabh Mehrotra: So there's this way of looking at it, right? But this is just pure code, and I also have observational data. Observational data means that I know what the severities were, where the sev zeros and sev ones were caused, where the really critical errors have happened in the last few months, and where the probability of those happening right now is.

[00:36:20] Rishabh Mehrotra: Plus, where are you writing unit tests right now? That gives me an additional layer of information. I've already parsed your code base and understood what's going on; I have that view, but now I also look at the real-world view of the data coming in. Oh, you know what? There was a sev zero issue caused by this piece of code over here.

[00:36:37] Rishabh Mehrotra: Now the question is: can I go back one day before? If I had to predict that one error is going to pop up tomorrow, can I predict which part of the code base that error will be popping up in, right? That is a prediction I can make one day before, right?

[00:36:51] Rishabh Mehrotra: And I can start training these models now. Again, we talked about this earlier, at the start of the podcast: each of these features is a different ML model, right? We just talked about two ML models. One: if I have 120 unit tests, what is the subset of five I show you? That could be an LLM; that could be subset selection.

[00:37:08] Rishabh Mehrotra: Subset selection has known solutions, with theoretical guarantees and performance: submodular subset selection and so on. I've implemented them at 200-million-monthly-active-user scale at Spotify. So we can tackle this problem, but it's a new problem; it's not an LLM solution, right?

[00:37:23] Rishabh Mehrotra: The second one is: can I predict where the critical bug is going to be, then use that to identify critical components, and then use that to chain through and put a unit test in there?
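A hedged sketch of that second model: score each function's criticality by combining call-graph fan-in (how much other code depends on it) with how many past high-severity incidents touched it. The data structures, severity labels, and the weight on incidents are illustrative assumptions, not Sourcegraph's actual model.

```python
def criticality_scores(call_graph, incidents, incident_weight=2.0):
    """call_graph: {function: set_of_callers}; incidents: list of (function, severity)."""
    incident_counts = {}
    for fn, severity in incidents:
        if severity in ("sev0", "sev1"):          # only count high-severity history
            incident_counts[fn] = incident_counts.get(fn, 0) + 1
    scores = {}
    for fn, callers in call_graph.items():
        fan_in = len(callers)                     # downstream dependency pressure
        scores[fn] = fan_in + incident_weight * incident_counts.get(fn, 0)
    # Highest-scoring functions are the first candidates for carefully vetted tests.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```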

[00:37:32] Simon Maple: So it's essentially human context, right? We talk about context a lot when we're talking about actual source files and how we can do code completion against so many parts of our project.

[00:37:43] Simon Maple: But thinking about that almost behavioral context.

[00:37:46] Rishabh Mehrotra: Exactly. And again, I'll be the broken record on this: we've done this before. Look, when you upload a short video, it has zero views right now. I look at: hey, is this high quality? Who's the author?

[00:37:58] Rishabh Mehrotra: Again, right? I look at the content composition. Why? Because zero people have interacted with it so far. Give it an hour, 10 million people would have interacted with that short video. Now I know which people will like, which people will not, right? So if you look at the recommendation life cycle of any podcast, We're going to upload this podcast.

[00:38:13] Rishabh Mehrotra: The life cycle of this podcast will be, there is something about the content of this podcast and there's something about the observational behavioral data of this podcast, which developers, which users liked it and not liked it, skipped it, streamed it, and all that, right? So we have designed recommendation systems, As a combination of like content and behavior, right?

[00:38:31] Rishabh Mehrotra: Same here. When I say unit test generation, it's hey, I can look at your code base, right? And I can make some inferences. Now, on top of that, I have additional view on this data, which is like, what are the errors? And where are you writing unit tests? Where are you devoting time? That gives me additional observation data on top of my content understanding of your code base.

[00:38:47] Rishabh Mehrotra: Combine the two together and like better things will emerge essentially.

[00:38:51] Simon Maple: And we've talked about unit tests quite a bit. In terms of testing in general, there are obviously different layers of testing, and the intent changes heavily, right? The higher you go, when we're talking about integration tests and things like that, the intent, when we go back to context, is about the use cases.

[00:39:10] Simon Maple: It's a lot about the intention of how a user will use that application. And when we think about the areas of the code base which are extremely important, those higher-level integration tests, the flows they take through the application, will show which areas of code are most important as well.

[00:39:31] Simon Maple: In terms of our developer audience, the people we talk to, when we want to provide advice or best practices on how a developer should think about bringing AI into their general testing strategy: what's a good start today, would you say, in terms of introducing these kinds of technologies into people's processes and existing workflows successfully?

[00:39:51] Rishabh Mehrotra: Yeah, I think the simplest answer is: start using Cody. No, I love it. Look, even before I joined Sourcegraph, Cody helped me. When I interviewed at Sourcegraph, we do an interview of: hey, here is a code base, it's open source, look at it, try to make some changes. And I love that. Psychologically I was bought in even before I had an offer, because you're making me do cognitive work on your code repository as part of the interview. You spend one hour, and instead of just chatting, you look at the code and make some changes, right?

[00:40:18] Rishabh Mehrotra: Yeah. My point is, yeah, use Cody, but an interesting point over here is: if you're trying to adopt Cody or any of the other tools for test generation, what are you going to do? You're going to try the off-the-shelf feature, right? It's: hey, generate unit tests, see where it works, where it doesn't work.

[00:40:33] Rishabh Mehrotra: Now, Cody provides something called custom commands. Edit code, unit test generation, these are all commands, right? Cool. So what is a command? What is an LLM feature? Let's take a step back. An LLM feature is: I want to do this task, I need some context. So what I'm going to do is generate an English prompt, right?

[00:40:49] Rishabh Mehrotra: And I'm going to bring in a context strategy: what are the relevant pieces of information I should use? Which is: hey, here's a unit test in the same folder, for example, or here's a dependency you should be aware of. Bring in that context, write an English prompt, and then send it to the LLM, right?

[00:41:01] Rishabh Mehrotra: That's a very naive, simplified way of looking at what an LLM feature is. So Cody provides the option of custom commands. What that means is I can see: hey, this doesn't work as great for me. Why? Because of these nuances. Let me create a custom command. Now, you're a staff engineer at this company, you can create a custom command, and, oh, this is better now, because in an enterprise setting you can share this custom command with all your employees, essentially, right?
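Here is a minimal sketch of that "context strategy plus English prompt" anatomy described above. The function names, file layout, and the call_llm client are hypothetical placeholders, not Cody's actual custom command implementation; this is just the naive shape of an LLM feature.

```python
from pathlib import Path

def gather_context(target_file: str, max_sibling_tests: int = 3) -> list[str]:
    """Naive context strategy: take the target source plus a few nearby test
    files as examples. A real strategy would also rank dependencies, existing
    tests elsewhere in the repo, error logs, and so on."""
    target = Path(target_file)
    snippets = [target.read_text()]
    siblings = sorted(p for p in target.parent.glob("*test*") if p != target)
    for sibling in siblings[:max_sibling_tests]:
        snippets.append(sibling.read_text())
    return snippets

def build_prompt(task: str, snippets: list[str]) -> str:
    """Assemble the English prompt: task instruction plus retrieved context."""
    context_block = "\n\n".join(f"<context>\n{s}\n</context>" for s in snippets)
    return f"{task}\n\nUse the following context:\n{context_block}"

# Hypothetical usage; call_llm stands in for whatever model client you use.
# prompt = build_prompt("Generate unit tests for the code below.",
#                       gather_context("src/payments/gateway.py"))
# response = call_llm(prompt)
```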

[00:41:23] Rishabh Mehrotra: You say, hey, you know what? This is a better way of using unit test generation, because I've created this custom command and everybody can benefit, essentially, right? Now, what makes you write a better custom command? And even if you forget about custom commands in Cody, what makes you get a better output out?

[00:41:37] Rishabh Mehrotra: This is where the zero-to-one evaluation comes in. Like, where are you currently failing? What sort of unit tests are we getting right? What sort are we not getting right? What about this is interesting, right? What about your code base is interesting? Now, the question then becomes: can I provide that as context?

[00:41:53] Rishabh Mehrotra: Or can I track that? Where is it failing? Where is it not failing? And then there are a few interventions you could make, right? You can change the prompt in your custom command, or you can create a new context source. Now, this is a great segue for me to mention one thing, which is Open Context.

[00:42:08] Rishabh Mehrotra: So I think Quinn, our CEO, literally started this work as an IC. One of the other impressive things about Sourcegraph is you look at the GitHub commit histories of the founders: are they running a company, or are they coding up these things?

[00:42:18] Rishabh Mehrotra: And it just blew me away when I first came across that. But essentially Quinn introduced, and then a lot of the teams worked on, something called Open Context. Which is: in an enterprise setting, you have so much context that no single tool can capture it all, right?

[00:42:32] Rishabh Mehrotra: Because plugging in thousands of different heterogeneous context sources is going to help you get a better answer. So Open Context is a protocol designed so that you can add a new context source for yourself. And because it's a protocol, Cody, and a lot of the other agents or tools around, can use it.

[00:42:49] Rishabh Mehrotra: So essentially what that means is, if you're writing unit tests, and if you know that this is where it's not working, you're going to make a change in the prompt, add a custom command, you're going to add some other examples, and then you're like, hey, maybe I should add a context source, because, oh, I have this information, like we talked about, right?

[00:43:02] Rishabh Mehrotra: Where are the errors coming in from? Now, that sev zero data is probably not something you've given Cody access to right now. But because of OpenContext, OpenCtx, you can add that as a context source and make your solutions better, right? What are you doing here? You're doing Applied Machine Learning 101 for your feature, right?
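As an illustration only, here is a minimal sketch of what a custom context source for incident data might look like, written as a plain Python function rather than the real OpenCtx provider interface. The incident-tracker endpoint and its field names are entirely hypothetical.

```python
import json
from urllib.request import urlopen

# Hypothetical incident-tracker endpoint. In OpenCtx this logic would live
# inside a provider implementing the actual protocol, which is not shown here.
INCIDENT_API = "https://incidents.example.com/api/sev0?component={component}"

def incident_context(component: str, limit: int = 5) -> list[str]:
    """Fetch recent sev-0 incidents touching a component and format them as
    context snippets for a test-generation prompt."""
    with urlopen(INCIDENT_API.format(component=component)) as resp:
        incidents = json.load(resp)[:limit]
    return [
        f"Incident {i['id']}: {i['summary']} (root cause: {i['root_cause']})"
        for i in incidents
    ]

# These snippets can be appended to the prompt assembled earlier, so the model
# biases its generated tests toward historically fragile behavior.
```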

[00:43:20] Rishabh Mehrotra: So again, this is exactly where you need that zero-to-one: five examples where it doesn't work right now, and if you make this prompt change or context change, it's going to start working, right? So you've done this mini zero-to-one for your own goal, right?
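A minimal sketch of that mini zero-to-one loop, assuming a handful of hand-picked failing examples and a hypothetical call_llm helper. This illustrates the workflow of comparing prompt variants against a tiny failure set, not Sourcegraph's evaluation tooling.

```python
# Five hand-picked cases where the off-the-shelf prompt currently fails.
FAILING_EXAMPLES = [
    {"file": "src/payments/gateway.py", "expectation": "mocks the card network client"},
    {"file": "src/auth/session.py", "expectation": "covers expired token paths"},
    # ... three more
]

PROMPT_VARIANTS = {
    "baseline": "Generate unit tests for the code below.",
    "with_mocking_hint": "Generate unit tests for the code below. "
                         "Mock all external network clients explicitly.",
}

def passes(expectation: str, generated_tests: str) -> bool:
    """Crude check: does the generated test text mention what we care about?
    A real harness would run the tests and inspect coverage instead."""
    return all(word in generated_tests.lower() for word in expectation.lower().split())

def evaluate(call_llm) -> dict[str, int]:
    """Score each prompt variant against the small failing set."""
    scores = {}
    for name, instruction in PROMPT_VARIANTS.items():
        wins = 0
        for example in FAILING_EXAMPLES:
            prompt = f"{instruction}\n\nFile: {example['file']}"
            if passes(example["expectation"], call_llm(prompt)):
                wins += 1
        scores[name] = wins
    return scores
```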

[00:43:33] Rishabh Mehrotra: And I think there's a meta point over here, which is: forget about coding assistants.

[00:43:39] Rishabh Mehrotra: I think we're all transitioning to that abstraction of working with an ML system, working with an AI system. I hate to use the phrase AI; coming from the machine learning world I'd rather say machine learning, but the audience, some of the audience, probably buys AI more. But essentially, a lot of us are starting to work with these AI systems, and I think the way we interact with them is going to be a bit more orchestrated, right?

[00:44:02] Simon Maple: Yeah.

[00:44:03] Rishabh Mehrotra: Try this. Let's figure out where it's working. Great, I'm going to use it. Let's figure out where it's not working. Okay, cool. Then I'm going to either give that feedback or adjust my workflow a bit and make it work better on those, right? Yeah. So I think we're all starting to be that applied scientist, right?

[00:44:18] Rishabh Mehrotra: In some way or the other. And this is not just like you as an engineer. If you're a domain expert, if you're a risk analyst, you want to create these plots, or if you're a sales assistant using a sales copilot, you are working with an agentic setup of ML, and you want to see where it's working, where it's not,

[00:44:35] Rishabh Mehrotra: and what changes do I want to make? Again, you've done it: when you add a plus or double quotes in Google, you get those exact words, right? What is that? You're adapting, right? You know what's going to work, and you start adding those tricks. And now, because more and more of your daily workflow is going to be around these agents and systems, you start developing these feedback loops yourself.

[00:44:54] Rishabh Mehrotra: So I think what we're trying to do as ML engineers in the product is philosophically similar to what you're trying to do to use these products, essentially. A lot of my friends and other people ask me: hey, what's happening? I'm a domain expert.

[00:45:07] Rishabh Mehrotra: I was on another panel a few days ago on AI versus employees, and the question there was about jobs and all that, right? Again, not to derail the conversation with that, but essentially, if we start acting as orchestrators of these systems, then we start developing intuitions around where it's working, where it's not, and then we start putting in these guardrails.

[00:45:25] Rishabh Mehrotra: And those guardrails are going to help in unit test generation.

[00:45:28] Simon Maple: And I think that's important, because our audience are all going to be somewhere on that journey from it being fully interactive to it being fully automated. People may want to stay at a particular point on that journey, but they will progress from one end to the other.

[00:45:44] Simon Maple: As we get into that more and more automated space, I remember you saying earlier, when we talked about the budget of a human in terms of the amount of time they have, if you have 120 tests, you want to provide them with five. How would you place the importance of a developer's time in the future when things get

[00:46:02] Simon Maple: to that more automated state? How would you weigh a developer focusing on code versus focusing on testing versus focusing on something else? Where is it most valuable to have that developer's eyes?

[00:46:15] Rishabh Mehrotra: Yeah, so I think, let's take a step back, right? You're a developer, I'm a developer, and there's a reason I have a job, right?

[00:46:21] Rishabh Mehrotra: I have a job to do. That job means there's a task to complete, right? The reason why I'm writing this unit test, I don't get paid money just to write a better unit test, right? I get paid money to, again, not me specifically in my job, but essentially as a developer, I get paid money if I can do that task right?

[00:46:35] Rishabh Mehrotra: And if I can at least spend some time making sure that in future my load of doing the task is easier and the system is helping me downstream. With that high-level view, what's happening? What's happening is, rather than focusing on unit test generation or code completion as: oh, I care about this because I care about the silo.

[00:46:55] Rishabh Mehrotra: No, I care about this because I care about the task being completed. If I can do this task 10x quicker, then what's my path from spending five hours doing it today to spending 20 minutes doing it, right? And this is where I mentioned we're all going to be orchestrators, right? Look at a music orchestra, right?

[00:47:11] Rishabh Mehrotra: You have the symphony, there's an orchestra, and you're hand-waving your way into amazing music, right? Art gets created, right? Again, that's the goal, right? Cody wants to make sure we allow users, developers, to start creating art and not just toil. Now, I can say this in English, right?

[00:47:27] Rishabh Mehrotra: But I think a good developer would just imbibe the spirit of it, which is that, look, the sooner I can get to that orchestrator role in my mindset, the more I start using these tools exactly right, rather than being scared of: oh, it's writing code and I won't. Again, as you mentioned, you might want to be somewhere on that spectrum, but technological evolution will march on and we're going to be forced to do some parts of that.

[00:47:51] Simon Maple: And it's not just technology: individuals love writing code, or problem solving, and that will need to change depending on what technology offers us. But I guess, if we push further to the right of that spectrum, what's the highest risk of what we're delivering not being the right solution?

[00:48:15] Simon Maple: Is it testing now? Is it guardrails that become the most important thing, and almost, I would say, more important than code? Or is code still the thing that we need to care about the most?

[00:48:25] Rishabh Mehrotra: Yeah, I think if I put my evaluation hat on, I want to be one of the most prominent, vocal proponents of evaluation in the industry, and not just code AI, the machine learning industry: we should do more evaluation.

[00:48:41] Rishabh Mehrotra: So there I would say that writing a good evaluation is more important than writing a good model. Yeah. Writing a good evaluation is more important than writing a better context source because you don't know what's a better context source if you don't have a way to evaluate it, right? So I think for me, evaluation precedes any feature development.

[00:48:56] Rishabh Mehrotra: Yeah. If you don't have a way to evaluate it, you're just throwing darts in a dark room, right? Some are going to land by luck. So in that world, I have to ensure that unit tests and evaluation sit ahead of, or at least alongside, code in terms of importance. That's it.

[00:49:11] Rishabh Mehrotra: I think overall, what's more important is task success, right? Which is, again, what is success? You're not just looking at unit tests as an evaluation, you're looking at evaluation of the overall goal, which is: hey, did I do this task right? And then, as an orchestrator, that's how I start treating these AI agents, which could be Cody, autocomplete, or probably any specific standalone agent powered by Sourcegraph as well.

[00:49:34] Rishabh Mehrotra: So in those terms, evaluation of that task matters because you are the domain expert. Assume AGI exists today. Assume the foundation models are going to get smarter and smarter, with billions of dollars, trillions of dollars eventually going into them. We train, again, the smartest models and they can do everything, but you are best placed to understand your domain and what the goal is right now, right?

[00:49:55] Rishabh Mehrotra: So you are the only person who can develop that evaluation of: how do I know that you are correct? How do I know whether you're 90 percent correct or 92 percent correct? And again, the marginal gain from 92 to 94 is going to be a lot harder than going from 80 to 90, right?

[00:50:07] Rishabh Mehrotra: It always gets harder; there's going to be an exponential increase in hardness over there. So essentially the point then becomes, purely on evaluation, purely on unit tests: what are the nuances of this problem, of this domain, that the model needs to get right?

[00:50:21] Rishabh Mehrotra: And are we able to articulate those? Will I be able to generate those unit tests, or generate those guardrails and evaluations, so that I can judge: oh, the models are getting better on that topic, right? So the models are going to be far more intelligent, great. But then what is success? You as a domain expert get to define that. Yeah. And this is a great thing, not just about coding: for any domain expert using machine learning or these tools across domains, you know what you're using it for, right?

[00:50:45] Rishabh Mehrotra: The AGI tools are just tools to help you do that job. So I think the onus is on you to write good evaluation, or maybe tomorrow it's LLM-as-a-judge; people are developing foundation models just for evaluation, right? So there are going to be other tools to help you do that as well. Code foundation models for unit tests, maybe that's the thing six months from now, right?
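A minimal sketch of the LLM-as-a-judge idea applied to domain-defined success criteria. The call_llm helper and the rubric contents are hypothetical placeholders, not a specific evaluation product or model.

```python
JUDGE_RUBRIC = """You are reviewing a generated unit test for a payment gateway.
Answer PASS or FAIL for each criterion, one per line:
1. Exercises the failure path when the card network times out.
2. Asserts that no charge is recorded on failure.
3. Does not hit real external services."""

def judge(generated_test: str, call_llm) -> dict[str, bool]:
    """Ask a judge model to score a generated test against domain criteria.

    call_llm is a placeholder for whatever client you use; it should take a
    prompt string and return the model's text response.
    """
    verdict = call_llm(f"{JUDGE_RUBRIC}\n\nGenerated test:\n{generated_test}")
    results = {}
    for line in verdict.splitlines():
        line = line.strip()
        if line and line[0].isdigit():
            idx, _, rest = line.partition(".")
            results[f"criterion_{idx}"] = "PASS" in rest.upper()
    return results
```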

[00:51:06] Rishabh Mehrotra: The point then becomes: what should it focus on? That's the role you're playing. Orchestrating, but orchestrating on the evaluation. Oh, did you get that corner piece right? Or, you know what, this is a critical part of the system, right? The payment gateway link and the authentication link: if some of these get screwed up, then massively bad things happen, right?

[00:51:21] Rishabh Mehrotra: So you know that. So I think that's where the human in the loop and your inputs to the system start coming in.

[00:51:26] Simon Maple: Amazing. Rishabh, we could talk for hours on this. This has been really interesting. I love the deep dive, dipping a little bit below into the ML space as well.

[00:51:35] Simon Maple: I'm sure a lot of our audience will find this very interesting. Thank you so much. Really appreciate you coming on the podcast.

[00:51:39] Rishabh Mehrotra: No, thanks so much. This was great, a fun conversation. Yeah, it could go on for hours. Hopefully the insights are useful, like you said. Thank you.

Podcast theme music by Transistor.fm.