Monthly Roundup: AI Tool Effectiveness, Context, Fin, and AI Autonomy.

Join hosts Simon Maple and Guy Podjarny as they delve into the intricacies of AI tool evaluation, the importance of context in AI code generation, and the evolving landscape of AI development tools.

Episode Description:

In this insightful episode of the AI Native Dev podcast, hosts Simon Maple and Guy Podjarny discuss the key themes and learnings from their recent podcast episodes. From evaluating the effectiveness of AI tools to understanding the critical role of context in AI code generation, Simon and Guy cover a wide range of topics that are crucial for developers working with AI technologies.

Resources:

  1. Snyk - Developer-first security solutions.
  2. Tabnine - AI code completion tool.
  3. Intercom - Customer support and messaging platform.
  4. Dosu - AI tool for open source project maintainers.

Chapters:

  1. [00:00:17] Introduction
  2. [00:01:08] Getting Back into the Rhythm
  3. [00:03:08] Evaluating AI Tools: Promises vs. Reality
  4. [00:09:14] The Consumer Perspective: Trust and Reliability
  5. [00:12:19] The Importance of Context in AI Code Generation
  6. [00:21:08] Handling Hallucinations in AI Responses
  7. [00:23:01] Categories and Scope of AI Development Tools
  8. [00:28:46] The Role of Data in Enhancing AI Tools
  9. [00:39:11] Upcoming Changes and New Formats

Full Script

[00:00:17] Simon Maple: And welcome back to the AI Native Dev. On today's episode, we're doing another special episode. Guy and myself, Simon Maple, are back together again, and we're doing a monthly roundup. So welcome to the roundup, Guy.

[00:00:28] Guy Podjarny: Welcome to you as well, Simon. Our first monthly roundup. We'll see, we'll find the rhythm really for these things: how we talk about all the fun episodes, the learnings from the month, and get an opportunity maybe to voice some opinions.

[00:00:42] Guy Podjarny: We really try to use the time for the opinions of the smart people we're getting onto the podcast, but we do have some opinions of our own.

[00:00:50] Simon Maple: Yeah, absolutely. And talking about getting back into the rhythm, we had that long gap between the previous podcast and this one.

[00:00:57] Simon Maple: How did you find doing quite a few sessions? I think we've recorded close to 10 sessions already, many of which are already out. Some are coming very soon after this episode. So how did you find getting back in?

[00:01:08] Guy Podjarny: It's like riding a bicycle. But actually, it's super fun.

[00:01:11] Guy Podjarny: I always find, what's not to like really about this? It's an opportunity to find interesting people and ask them questions that you find interesting.

[00:01:20] Simon Maple: Yeah.

[00:01:20] Guy Podjarny: And just get to air that and share that with the world. It's super fun, and continuously humbling to see people willing to share their learnings and all that. It's a live world.

[00:01:31] Guy Podjarny: There's a certain amount of knowledge is power, or some advantage to it. So I always find it gratifying. There's an element of gratitude in having people share it. But also, every single conversation, I have a hard time finishing it because I have more things that I want to ask.

[00:01:49] Simon Maple: Do you know what? I find it really fun to just chat to smart people. And I use this podcast as a bit of an excuse, really, because there are a ton of smart people. It just sounds weird, me creeping into their DMs saying, Hey, can we just chat for an hour about cool stuff? But having this podcast is really nice, to be able to say, Hey, I've got a podcast and we're going to talk about this.

[00:02:08] Simon Maple: Would you love to come on? And just.

[00:02:10] Guy Podjarny: It's a good social experiment, it is. Can you do something like that and, like, never air the podcast? No, we're building this podcast. How long can you drag that on? Having someone come along on your podcast, with air quotes, to listen.

[00:02:22] Simon Maple: No one will trust us again, Guy.

[00:02:23] Guy Podjarny: Yeah, that's true. That's true. It might be a short lived strategy.

[00:02:26] Simon Maple: Absolutely. Yeah, we had fun this month. And what we're going to do, the format of this monthly episode, is really to grab some themes from a number of sessions and episodes that have been aired this month, and we'll just talk about some of our learnings by theme. We'll jump into some quotes and talk about how the various people who appeared on the podcast spoke to these themes.

[00:02:49] Simon Maple: And yeah, it should be a little bit of fun and a nice recap, really, of the month. So our first topic, our first theme then, Guy, was really about how well these products work. Obviously there are a ton of different AI tools in various categories, which we'll talk about in just a sec as well.

[00:03:08] Simon Maple: Some promise far too much. Some promise well and actually deliver, with nice roadmaps for how they can improve in the future and so forth. But there are various pieces to this, I guess. One major piece is the evaluation, in terms of understanding whether a model's answers to a prompt work well or not,

[00:03:27] Simon Maple: and in, in various different scenarios, what was your take on that?

[00:03:31] Guy Podjarny: Yeah. I think what shone through very clearly is that it's very hard to know whether these products work. And it's interesting how consistent that was. I think you don't really appreciate that until you've built an LLM-based product. Part of it is that it's very hard to even just know up front whether it worked.

[00:03:50] Guy Podjarny: So even if you know what it is that you're trying to achieve, the answer is not always as robust, whether you're generating code or tests or whatever it is. Did you generate the right thing? Did you understand the right question? All of those are very hard to do. They're even harder to do in breadth.

[00:04:07] Guy Podjarny: And I think what became very clear is that it's also hard to anticipate how they will react to the next one. I think Des said it best when he described the difference between their good cases and what they call the torture test.

[00:04:23] Simon Maple: Everyone should have a set of torture tests in their test suite.

[00:04:26] Guy Podjarny: I think now it's a necessity. We're going to have a torture test in Tessl, just for the name of it. Yeah, but it was interesting, because he was describing how, first of all, it's hard to even define the original set of tests. But those are the good cases, and those, as hard as they are, are the easiest thing to do.

[00:04:42] Guy Podjarny: And also you can do them maybe relatively quickly. The real challenge comes when you run through these torture tests of someone coming along and wanting seven different versions of bits of information, or giving you 20 instructions in one command. Yeah. But even those are not enough, and in practice you, in his words, don't know if your product works until you ship it.

[00:05:03] Guy Podjarny: Yeah, and that's super, super hard. And this came up in really a whole bunch of conversations, including some we haven't aired yet. And it's something that we're feeling at Tessl as well, on the product front, trying to say, okay, if we were to generate some piece of code, if we were to try and be on the journey to generate some piece of code, did we generate it correctly or not?
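
The "good cases versus torture test" distinction Des describes can be sketched as a tiny eval harness. This is purely illustrative, not any vendor's actual setup: `generate` is a hypothetical stand-in for a real LLM call, and the checks are deliberately crude string matches.

```python
# Illustrative eval harness: a "happy path" suite plus a "torture" suite.
# `generate` is a placeholder for whatever LLM call a real tool would make.

def generate(prompt: str) -> str:
    # Stub model: handles the simple request, ignores extra instructions.
    if "add" in prompt:
        return "def add(a, b):\n    return a + b"
    return ""

def run_suite(cases) -> float:
    # Each case is (prompt, [checks]); a case passes if every check accepts
    # the generated output. Returns the fraction of cases that pass.
    passed = 0
    for prompt, checks in cases:
        output = generate(prompt)
        if all(check(output) for check in checks):
            passed += 1
    return passed / len(cases)

HAPPY_CASES = [
    ("Write a Python function add that sums two numbers",
     [lambda out: "def add" in out,
      lambda out: "return" in out]),
]

TORTURE_CASES = [
    # Several instructions crammed into one command.
    ("Write add, rename the arguments to x and y, add type hints, and a docstring",
     [lambda out: "def add" in out,
      lambda out: "x" in out,        # renamed arguments
      lambda out: '"""' in out]),    # docstring present
]

print("happy:", run_suite(HAPPY_CASES))      # the stub passes these
print("torture:", run_suite(TORTURE_CASES))  # but fails these
```

The point of the sketch is the gap: a model that only handles the happy path scores perfectly on the first suite and zero on the second, which is exactly why the good cases alone tell you little about shipping.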

[00:05:26] Simon Maple: Do you feel the sentiment is the same? Do you feel like we're all experiencing the same pain, or are we experiencing it differently? So for example, from what Des was talking about there with the torture tests, really trying to have a set of tests that is most lifelike to production.

[00:05:42] Simon Maple: Is this the same scenario that everyone is really faced with or are there variations?

[00:05:47] Guy Podjarny: I think when you're trying to mimic intelligence, it's very hard to know whether you had an intelligent answer or not.

[00:05:53] Simon Maple: I've been doing this all my career, Guy. Yeah.

[00:05:55] Guy Podjarny: And you've still not managed to mimic intelligence.

[00:05:58] Simon Maple: I think I'm closer than ever.

[00:06:00] Guy Podjarny: Yeah, you fool some of us sometimes. I'm making those attempts myself as well. But when you're trying to mimic intelligence, it's hard to know whether you indeed hit the target, hit the correct chord, right? Did you get it right? And the broader the set of possibilities, the harder it is to get it right.

[00:06:15] Simon Maple: And how close your torture test is to reality as well, trying to almost rerun existing prompts, and what users are actually doing, to create that set of torture tests yourself.

[00:06:23] Guy Podjarny: And you can never really cover it. If you were to contrast this to a WYSIWYG or some form of UI, there is a finite set of possibilities.

[00:06:31] Guy Podjarny: And even so, you're never going to include all of that. The other thing that Des actually phrased well was the notion of change. In this world, it's not just your capabilities that change; it's also the underlying technologies, right? So a new version of an LLM comes along, whatever the state of the art is,

[00:06:48] Guy Podjarny: and, as a tool provider, because of the difficulty in assessing it, you're faced with these two options, and neither of them is terribly exciting. One is you just skew optimistic and throw it in and say, Hey, let's hope this works. And it definitely is a short-term gain,

[00:07:05] Guy Podjarny: But it might be very risky.

[00:07:07] Simon Maple: It's almost a marketing gain, right? It's who's first to market. And I think one of the things that Des said was, when a new model comes out, he sees lots of competitors and lots of others snap into that new model.

[00:07:18] Simon Maple: It's, okay, 4o is out. Let's snap into it. Let's snap into 4o. It's a really easy thing to change, but it's a hard thing to actually assess, without snapping it into production, whether it's actually going to improve, stay the same, or even degrade the behavior or the functionality.

[00:07:35] Guy Podjarny: It's also tempting to just throw it out there. And that makes it hard on the consumer side, to be able to take on those tools, because to begin with, as a consumer, you also don't know if they work. But then if people swap these models under you, just because in some cases they are very good, it makes it very hard.

[00:07:51] Guy Podjarny: We have Amy on our team at Tessl, who leads AI engineering, and she talks a lot about the jagged edge of AI. They're amazing sometimes, and then they're horrible at other times, and that's very felt. But I guess the other option is not that exciting either, which is to go cautious and slowly start rolling it out, because you can't fully trust some test suite that you run,

[00:08:10] Guy Podjarny: even if the team has built a torture test. And even that one is not good enough, but that's also expensive, dollars-wise and time-wise. Yeah. So I think this notion of how hard it is to assess how well your product works is a big deal. And I think Rishabh at Sourcegraph really highlighted that as well, in terms of the testing, the creation of code, and the search and finding the right context.

[00:08:31] Simon Maple: Absolutely. And context. Yeah, context was king really there. But I think, with the breadth of the various tools that both Peter and Rishabh talked about, at Tabnine and Sourcegraph,

[00:08:41] Simon Maple: the evaluation we covered a lot. And of course, Rishabh, is the head of AI and lives in that ML space and so much of his work is really about that evaluation to,to really provide that model like that really is providing that output, which is relevant to, to the input and context is a big piece of this,

[00:08:58] Simon Maple: but we, he was talking about the kind of zero to one on that and how useful that was. but yeah, there are a ton of that and, yeah, very valuable.

[00:09:06] Guy Podjarny: I think it's interesting to take this now back into the consumer side, and I think that very much shone through as well.

[00:09:14] Guy Podjarny: Okay, only a subset of people listening to this might be actively building LLM-based tools, although more and more of us might be in that position later on. But as a consumer of those tools, you get the same experience as well. And even for the people that are at the forefront and are building these LLM tools, it is hard for them to actually take those on and embrace them, for these same reasons.

[00:09:36] Guy Podjarny: It's because they take those tools and they can't really trust that they would work well. And some of that manifests in, once again, we keep referring to Des, he was just brilliant in his quotes, and his talk about being able to embrace tools.

[00:09:50] Guy Podjarny: but also, Amir touched about, touch of that,a fair bit when he talked about the different tools and different domains and, where is it that you can, maybe, be comfortable with, with those tools? And I guess you've been feeling it in your own tools that you've been feeling for this podcast and, for Tessl.

[00:10:05] Simon Maple: For this podcast as well. I mean, we use a ton of AI-powered tooling, both in the editing and in actually building content around this. We did a ton of work trying to identify where the most interesting shorts and quotes could be.

[00:10:23] Simon Maple: To be honest, in some areas, they've been great time savers. In others, they give me an answer, but it's never the right answer. It's like, pick the shorts, or, put in quotes, the 10 most interesting pieces from a podcast. It very rarely picks the 10 that I would pick.

[00:10:38] Simon Maple: And I think, I don't know whether it's just the intent, it doesn't understand the intent. It could be from a prompt perspective. But it's not a time saver for me anymore. So there's still a balance here, between me putting my knowledge and intelligence into this, and then using the AI-assisted tooling to actually assist me and save time where I can be helped.

[00:10:57] Simon Maple: But yeah, I feel like in this space, it's not quite there yet.

[00:11:01] Guy Podjarny: Yeah, and it's frustrating, right? Because we want to lean in, we want to be early adopters, we're believers in the destination. And we've invested up front, right? You spend a lot of time trying to pick tools, even build some tools, write some things,

[00:11:14] Guy Podjarny: and they're so close. They're so tantalizing. There's little moments of amazement, as you run them. But, on average,they really, they don't quite make the cut.

[00:11:24] Simon Maple: No.

[00:11:25] Guy Podjarny: What can you do, really, as a consumer? I think you have to keep trying.

[00:11:29] Guy Podjarny: You have to moderate your use, so don't rely on these tools working, but I think you need to come back to the well again and again, to find the cases in which it works. But fundamentally, I think today the cases in which it works best are when it still works as an assistant.

[00:11:47] Guy Podjarny: So it offers you something. Sometimes it would save you time, but don't build your plan assuming that it would work.

[00:11:55] Simon Maple: Yeah, I'm so with you on that, revisiting the well, because it's amazing how fast this moves. Where we were six months ago, we wouldn't even have made it close to where we are today.

[00:12:03] Simon Maple: We wouldn't even have thought that they could work. Absolutely. So I think it's worth revisiting, making sure that our assumptions from six months ago are actually still true, and seeing where we can actually improve on where we are today.

[00:12:14] Guy Podjarny: So what was your next takeaway from the bulk of the episodes that we had this month?

[00:12:19] Simon Maple: So I think code generation was one of the first categories that we talked about. And one of the really interesting themes that came up again and again was context. And I think context really is king, in terms of being able to understand the intent and then actually provide a valuable answer based on the right context.

[00:12:41] Simon Maple: There's one topic, where I was talking with Peter from Tabnine, where we covered: how much is too much context? What is the right amount of context? And it's interesting that you can give too much context, providing so much information that it's hard for the LLM to actually give the right answer, because it's almost flooded with too much information.

[00:12:59] Simon Maple: But providing the right amount of context will allow it to actually be far more accurate in, in, providing that answer. I think Des was also, he talked about bad context as well as good context, right?

[00:13:09] Guy Podjarny: Yeah, I love that other quote, yet another great phrasing from Des.

[00:13:14] Guy Podjarny: He talked about how a common experience for Intercom customers that try out Fin, their autonomous AI agent, is that they find that Fin mirrors or reflects their knowledge base very well, but lo and behold, the knowledge base itself is out of date. It turns out that there's a bunch of voodoo knowledge, shadow knowledge, within the support team that helps support reps know which information is up to date and which isn't, and Fin might lack that subtlety. So they need to go and fix that. First of all, I think that's interesting. It's very easy to understand, and I think everybody can relate to the fact that the knowledge base is out of date, your docs are out of date, all of those things; as you start building on all of those, that's significant.

[00:13:55] Guy Podjarny: I think a bit more subtle is to think about your code, and, as he says, crap in, crap out. If your code is full of bad practices that you actually don't want to maintain, then lo and behold, if you train on your code, you might be replicating practices

[00:14:13] Guy Podjarny: that are not really what you desire. So it's interesting, this sort of balance of the default sort of perspective, maybe, in the world of LLMs and AI is give me more context. I will be able to understand your needs more. I will be able to personalize and customize it better. But at the same time, there's a certain element of flooding and too much and not being able to separate the important context from the less important one.

[00:14:40] Guy Podjarny: And there's an element of, be careful what you wish for, right? Yeah. If you give me too much history, I might make history repeat itself.
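
The "right amount of context" idea can be sketched in code: score candidate snippets for relevance to the request, then pack only the ones that clear a bar and fit a budget, rather than flooding the prompt. Everything here is a simplified assumption; real tools use embeddings or code search rather than the crude word overlap below.

```python
# Sketch of packing a context budget instead of flooding the prompt.
# Relevance here is crude word overlap; real tools use embeddings/search.

def relevance(query: str, snippet: str) -> int:
    # Count how many words of the query appear in the snippet.
    words = set(query.lower().split())
    return sum(1 for w in words if w in snippet.lower())

def pack_context(query: str, snippets: list, budget_chars: int) -> list:
    # Rank by relevance, keep only snippets that clear a minimal bar
    # (score >= 2) and still fit within the character budget.
    ranked = sorted(snippets, key=lambda s: relevance(query, s), reverse=True)
    packed, used = [], 0
    for s in ranked:
        if relevance(query, s) >= 2 and used + len(s) <= budget_chars:
            packed.append(s)
            used += len(s)
    return packed

snippets = [
    "def parse_config(path): ...  # loads the YAML config",
    "legacy_sort() is deprecated, do not copy this pattern",
    "def load_yaml(path): ...  # YAML helper used by parse_config",
]

context = pack_context("how do we parse the YAML config", snippets, 400)
```

The low-scoring deprecated snippet stays out of the prompt: once more knowledge is available than fits, the problem becomes prioritization, not availability, which is the "be careful what you wish for" point above.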

[00:14:47] Simon Maple: Yeah, absolutely. And actually there are a couple of ways we can see this. When we think about code completion, obviously training needs to happen on existing code, and it's pulling,

[00:14:56] Simon Maple: you know, many, many tools pull from open source libraries. open source libraries, there's going to be good code, there's going to be poor code, there's going to be security problems, and so forth. It's going to be trying to replicate that code that it's seeing. But also, when we think about existing issues in our own code, where we're providing context on top of that training, it's going to effectively see similar, anti patterns, problems, coding issues,security potentially issues that it could replicate because, hey, I saw you did that over there and you want to do something similar, let me use a similar style of code.

[00:15:27] Simon Maple: And yeah, crap in, crap out. It's exactly the same, it's exactly the same theme.

[00:15:31] Guy Podjarny: And it's interesting to think that this happens at multiple layers. There is the notion that these systems have been trained on the world's code, so problems or patterns that repeat in the world's code might repeat in the very models themselves, in the neural network.

[00:15:47] Guy Podjarny: But also, as you go down the rabbit hole and you customize and personalize to your environment, if you train on a customer's code base, you might replicate patterns over there. And I think, actually, one of the latest studies that you did, Simon, at Snyk, showed how, if there's a tab open with a code pattern that happens to be vulnerable, then Copilot is more likely to replicate it.

[00:16:10] Guy Podjarny: So at as local or as global a resolution as you like, these LLMs replicate. They build on data, and so we need to be mindful of that.

[00:16:18] Simon Maple: Yeah, absolutely. And there was one other interesting piece, thinking more about the kind of context that we give an LLM. I was speaking with Devin, who built the Dosu tool, which is amazing for open source maintainers and users. Maybe a user creates an issue against an open source project, and it may take time for that maintainer to even start looking at that issue and provide a response. Maybe it's user error, maybe it's something else, and there's context which allows Dosu to provide an answer to that user within minutes.

[00:16:51] Simon Maple: Now, provided the right context, it can say, Oh, here's the answer, it's in the documentation. Or, in fact, in one of the really interesting pieces, it wasn't in the documentation, but there were actually test cases which were saying, this is what I expect. And so Dosu would look at that and say, okay, this is the expected behavior.

[00:17:08] Simon Maple: I can tell the user, this is what you should be doing. However, there's also bad context, whereby it was inferring behavior from the code, and actually, while that behavior is a possible workaround, it isn't actually the way the maintainer wanted that person to use the project.

[00:17:24] Simon Maple: So there was a case whereby it's actually using the context, but not in the way that the maintainer or the user wants it to be used.

[00:17:31] Guy Podjarny: Yeah, a great example of a case in which, yes, this works today and this is the pattern of what happens today, but I'm trying to change that. I'm trying to get people to do it, the right way.

[00:17:39] Guy Podjarny: So the core principle is: whenever you make more information available, more options available, to anything, then selection and sorting becomes a problem, discovery of the correct, of the better object. It could be, whatever, at Snyk we talked about vulnerabilities.

[00:17:55] Guy Podjarny: Hey, if I tell you about too many vulnerabilities, they might be all right, but there's too much for me to handle. I want to know which ones are right. It might be about discovering Spotify of sort of songs, and you want to pick the songs that are most right for me. And to an extent, if you give me more knowledge about your code, if you give me more knowledge about your system, suddenly the problem is no longer availability of knowledge, but rather prioritization, saying which knowledge is more important.

[00:18:18] Simon Maple: And the input, how a user will actually provide not just the context but their request. That's going to be massively different depending on the tool or where it is actually being run, whether it's in an IDE, whether it's in a PR, or somewhere else. There was a lot of discussion that happened, I think, particularly in the Des chat.

[00:18:34] Simon Maple: Again, we keep going back to that Des session, but around the different types of input: maybe a chat UI as a way to interact, versus an autocomplete, much quicker, line by line, versus something that you actually build up with a chat UI. I found that really interesting.

[00:18:50] Guy Podjarny: Basically, Des makes the argument that, to an extent, completion is a form of chat. Instead of telling me what you want in words, you are telling me by providing me more context as you type. And I found that super interesting. I'm not entirely sure I agree. I'm still digesting it.

[00:19:10] Guy Podjarny: I think, to an extent, maybe the meta-understanding is that any interaction with an LLM is about the choice of what information you are providing me. How much did you invest in giving me context? We keep coming back to that word, context, and alongside context, intent. So how much did you tell me to understand your surroundings, but also how much did you explain what it is that you want, in words or in terms or in other data structures that I can understand? And the better you do that, the better the LLM is able to give you the result that you want. In a chat context, you can be more explicit in what you're asking, but you're providing maybe less context, and the model is such that you're expecting the LLM to give you an answer and give it right away, versus a completion environment, in which your intent is implicit.

[00:20:01] Guy Podjarny: You're typing along, you're providing information, and where before your language and how you phrase it, maybe your prompt engineering skills in the chat, came to shine, if you're auto-completing, it's almost like, how well are you completing that code? Are you writing a comment and then expecting a code completion tool to provide you with that text, which is really quite close to chat at that point?

[00:20:23] Guy Podjarny: Or are you just typing along code and expecting it to infer? So they're basically all just variations on it. And I think what's interesting there is to think about it if you're building a solution that leverages it, because this was all in the context of: is it easier to introduce LLMs into a tool that is already chat based or not? And whether, in the context of code, pull requests, which are maybe a little bit more chatty, asynchronous chat, are more amenable to success than code completions. And he was challenging that legitimately.

[00:20:52] Guy Podjarny: But in the context of that, if you're building some LLM capability, really the question you should ask yourself is: what are the correct ways for a user to provide me with their intent, what it is that they want me to do, along with the context necessary for me to be successful?

[00:21:08] Simon Maple: And I think, in chatting with Rishabh about the various pieces that Cody did, there are differences in how you would use, or how you would request information from, an LLM here. For example, in code completion, you would expect an answer very quickly. However, if it's something beyond that and you're actually getting more back, maybe it's test generation or something like that,

[00:21:31] Simon Maple: It allows the LLM to actually have far more time to generate that answer. And he, there's a really interesting conversation about how fast, what is that latency that you need before,in code completion, if you surpass that latency, it would frustrate a developer. Whereas in other cases, you can actually spend much, much more time and get a much more accurate answer for a bigger task, and that's acceptable.

[00:21:57] Simon Maple: And I think when we think about the chat UI versus autocomplete and things like that, it shows you the differences in how long it can take and what accuracy it can provide you with. Yeah, so many variables. Additionally, when we talked about intent, which is really important, one of the topics that came up was around the intent of a question to ChatGPT versus the intent for a much more specific tool. It's much, much harder when you ask something of ChatGPT, because the breadth of what ChatGPT can expect to get is obviously going to be much wider than if you're asking something in an IDE, where it's going to be a coding-based question, etc.

[00:22:39] Simon Maple: And this really brings us to the kind of scope. of what you would expect, from these AI assistants. And I think one of the things that we talked about is, particularly with Amir, there are really nice categories that we talked about of, it's great for a tool to exist in this AI code assistant, as well as, a documentation tool would be great, as well as a testing tool.

[00:23:01] Simon Maple: But actually some of the vendors that we've been talking to are quite broad in what they offer as well, right?

[00:23:06] Guy Podjarny: It's actually a little bit funny, because you talk to Amir, who is looking at it from a consumer perspective, from an investor perspective, and he looks at the markets and he has really valuable delineations, saying this is a category or a class of capabilities.

[00:23:20] Guy Podjarny: And I think they're very useful and they make sense to me, right? You're going to have your coding assistant, your code completions. You're going to have your AI documentation. You're going to have your AI tests. You're going to have your operations stuff. And then you talk to, I think, pretty much every one of the vendors we spoke to, and they say, Oh, we do all of those.

[00:23:35] Guy Podjarny: They don't straight up say we do all of them, but I think when you come down to the technology, there's so much overlap, and so many open questions around what the right scope of these products is. So whether it's Cody, which indeed started from code search, and they understand that code, but now we're talking about coding assistance and they do testing, a lot of the conversation was around the testing,

[00:24:00] Guy Podjarny: tabnine, the same, they find, they will index the sort of the code, and they have some tests, and they, and you look around, and I think it's, I think it's really interesting. To me, it shows two things. One, the technologies are very intertwined. And once you are able to understand the code, understand what needs to be done, then all sorts of paths open and all of them try to grab it.

[00:24:25] Guy Podjarny: And the second is that it's, there's no set expectations yet from a customer perspective on what do they want to get, where? So there's no, nobody has had the time to demonstrate that they are so dramatically better at this slice of functionality and that is so much more valuable that the fact that they are so much better means, something that is worth a best of breed solutionand a lot of the value is still in that 80/20.

[00:24:53] Guy Podjarny: It says hey I can generate documentations with LLM,maybe I can already get to there. And so I think there's a lot ofopportunity, like a lot of these vendors are just trying to seize a broader, broader scope. And I think time will tell, I don't know exactly where the lines would be drawn, and I think it's still useful for us, as we talk about development, to talk about discrete tools, regardless of whether single vendors will provide them, talk about discrete pieces of functionality.

[00:25:22] Guy Podjarny: But it is, it's quite hard, to, to find vendors now that contain themselves.

[00:25:28] Simon Maple: Consolidation. This is interesting, because even with our own histories, when we think about the tooling companies we've been involved in, consolidation is a really key, important thing. For example, if I were a developer that wanted to do all of these things anyway, and I will want to do them, whether it's testing, whether it's documentation, whether it's the code completion, the AI assistance that helps there, would I want three tools, or would I just want one consolidated tool that does it all for me? And if you have different vendors that are slightly better at different capabilities, is it actually nicer for me to have that one experience that does it all? And I guess the autonomy here also plays a part in this, as to whether I'm using it directly or whether some process is happening in the background for me.

[00:26:14] Guy Podjarny: I think autonomy plays a role. In general, the question is really, how much value do you get from a tool that is better? And I think we just don't know. As consumers, we don't know the answer to that yet, and so, the main question that we ask is, do we want AI to generate, say, documentation for us, right? Or do we want AI to generate tests for us?

[00:26:38] Guy Podjarny: That's the question that we ask. And then if we have a tool that is already available, especially if we've ended up training it on our code base or something like that, then it's available to us and it is convenient for us to use. And because we don't know what good looks like yet, nobody does, we default to thinking, okay, what I'm seeing here as I try to generate tests is now coloring my

[00:27:02] Guy Podjarny: sense of how much can be done. And also, a lot of these capabilities have some anchor of understanding the code that is shared amidst these different capabilities, and beyond that core anchor, I think many of them end up being a relatively shallow layer on top of the core LLMs and that additional understanding. So it's a relatively small lift for these companies to provide them. And so, if I was to wager, I would think that we'll see a lot of that consolidation and all-in-one come out, and I think in some cases we'll find that things are very tightly intertwined.

[00:27:39] Guy Podjarny: I think we heard this a little bit on Cody and, spoiler alert, some of this comes up in the conversation with Itamar from Codium that will air soon. In those conversations, you hear a lot about, for instance, the relationship between testing and code generation. Okay, if you understand the code and you know what is correct, can you generate the right code for it?

[00:27:58] Guy Podjarny: I can buy into that. At the same time, other domains, maybe documentation, will come out as something that has enough sophistication to it, enough unique needs to it, that we can see that a certain subset of tools are better at it. You know what I mean? But we just don't know.

[00:28:14] Guy Podjarny: So I think there will be consolidated and then some will break out. And then we'll have the regular consolidation device, which is okay. Now we have a whole set of tools after a bunch of years. And yeah, some are known as best of breed, but we'll have the regular decisions on what's we do. We would know what good looks like.

[00:28:29] Guy Podjarny: Yeah. And we'll make a conscious decision on whether we care, or whether we prefer fewer vendors.

[00:28:34] Simon Maple: Yeah. And do you know what I think is really going to be important here? I think data is going to be king: the data of understanding how people are behaving, how they're using the application.

[00:28:46] Simon Maple: Let me get into a couple of examples. Let's say you had an AI tool, a monitoring tool for example, that can take production data about which code paths are the hot code paths. Perhaps you've got behavioral data that shows which areas of the code base are being changed most by developers.

[00:29:04] Simon Maple: Perhaps you've got testing data about which areas of the code are actually finding the greatest number of issues, or security data about vulnerabilities. All of this data can really help the other tools recognize where they should focus more. An example: if I have an AI tool, or AI-assisted tool, that looks at monitoring data and understands the code flows, where my users are going through, then it understands the golden paths, the hot paths. That's where my AI testing tool needs to focus. And if my AI testing tool is identifying issues, and code fixes are occurring because of a certain type of flow, my AI code completion assistant needs to make sure it's not suggesting code in that similar vein.

[00:29:56] Simon Maple: So I feel like there's a lot of data here which actually makes the value of that consolidation really useful if that data is being shared.
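The kind of cross-tool data sharing Simon describes could be sketched roughly like this. Everything here is hypothetical for illustration: the function name, the weights, and the data shapes are assumptions, not any vendor's actual API. The idea is simply that production traffic, code churn, and security findings can be combined into one score that tells an AI testing tool where to focus first.

```python
# Hypothetical sketch: rank code paths for AI test generation using
# shared signals from other tools. The weights below are arbitrary
# illustrative choices, not values from any real product.

def prioritize_for_testing(paths, traffic, churn, vulns, top_n=3):
    """Score each code path by how hot and risky it is, highest first."""
    def score(path):
        return (
            2.0 * traffic.get(path, 0)   # production hot paths matter most
            + 1.0 * churn.get(path, 0)   # frequently changed code is risky
            + 3.0 * vulns.get(path, 0)   # known vulnerabilities weigh heaviest
        )
    return sorted(paths, key=score, reverse=True)[:top_n]

paths = ["checkout", "login", "admin/report", "profile"]
traffic = {"checkout": 0.9, "login": 0.7, "profile": 0.2}  # share of requests
churn = {"checkout": 5, "admin/report": 1}                 # commits last month
vulns = {"login": 2}                                       # open findings

print(prioritize_for_testing(paths, traffic, churn, vulns))
# → ['login', 'checkout', 'admin/report']
```

The interesting design question is exactly the consolidation one: each of these dictionaries would today come from a different vendor, and the score is only possible if that data is shared.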

[00:30:04] Guy Podjarny: Yeah, I agree. And, if I was to use startup parlance here, I'll talk about power, from 7 Powers by Hamilton Helmer, where he talks about all sorts of strategic differentiators or advantages.

[00:30:16] Guy Podjarny: And data is one of those. It's really about how much better situated you are to deliver value to your customer. So data is one aspect: do I already have data that is valuable, and therefore I can do something better? Sometimes it's just distribution: okay, I'm already installed on everybody's desktops.

[00:30:38] Guy Podjarny: And maybe that gives me access to data, but it also just gives me an opportunity to offer you that tool. Sometimes it is just a commercial relationship. Sometimes it is an investment that has been made in indexing the code. I guess it's a form of data that is technically available to some other vendor as well,

[00:30:56] Guy Podjarny: but it requires the customer to have invested time in connecting it to all the right places, customizing and tuning it, or defining policies. So all of these things are advantages. And I think at that stage we're coming back to the classic startup game, which is: do the disruptors acquire distribution before the incumbents acquire innovation?

[00:31:19] Guy Podjarny: What happens first? And I think that's interesting. But if we go back to the scope of these AI dev tools, to me the primary conclusion from the conversations, and it was almost the conflict between the very first or second episode that we had, where Amir, in such a great fashion, divided up these different tools, and then how every single vendor you talk to messes this up and is almost unwilling to accept that bundling, goes to show that we need to assess them, but we should expect that the tools, when we go to them, don't quite fall in line until some lines clarify.

[00:31:57] Simon Maple: Yeah, which I think is fine, right? Because we will have needs in these different categories, and there will be tools and vendors that will address multiple of those needs. Exactly. Expecting these tools to only fit in one category is probably a dream, or maybe a hallucination, Guy. Which leads me incredibly nicely, see how smooth that was, Guy?

[00:32:15] Simon Maple: Incredible. I am. I am. Or at least maybe that's the artificial intelligence piece of me. That's a, yeah.

[00:32:23] Guy Podjarny: I'm going to ignore the teleprompter. That was a pure human.

[00:32:29] Simon Maple: Close enough at times. Yeah. So, hallucinations, and how much we are happy to accept that an AI tool will hallucinate and provide us with something that is incorrect.

[00:32:45] Simon Maple: There was a really nice quote, I think, and a very interesting discussion from Des, who you spoke with, about how willing you are to accept hallucinations if they come alongside other value. The pain, of course, is you get some crap from hallucination, which is either meaningless or nonsense or whatever.

[00:33:05] Guy Podjarny: Right, or just straight up deceiving. And it's interesting. In general, I find Intercom is much further along than most providers in building LLM-based tools, and I find a lot of their learnings to be just a step ahead in terms of having faced reality and made it work.

[00:33:23] Guy Podjarny: And not surprisingly, Des was describing how they contained a lot of the hallucinations, or how a lot of their work was to contain hallucination, so that if you're coming to a place and interacting with a support bot, then when you get an answer, you'd be able to trust it. But it was very clear that if you allowed Fin to hallucinate more, you could resolve more cases.

[00:33:45] Guy Podjarny: Yeah, if you can get more creative. And the problem is, how do you get more creative without losing trustworthiness? I think it's really interesting. It got my product mind spinning a little bit: are there ways, are there patterns as an industry that we can find, that help us roll that out or engage the customer in that conversation, right?

[00:34:08] Guy Podjarny: Would customers accept it if they ask Fin, say, a question, and Fin says it doesn't know, or provides an answer that doesn't address their issue, and it asks them: do you want me to try and get more creative? Keep in mind, my answers might become less accurate, or even, sometimes, hallucinated.

[00:34:28] Guy Podjarny: And, I don't know, it's interesting. It doesn't seem entirely obvious to me that the answer is no, that customers would be unwilling to accept it. At the same time, maybe it's too much complexity, and also maybe it's time to switch to a human, and maybe that's okay.

[00:34:44] Simon Maple: And that was really interesting, when you mentioned an AI answer, an answer that comes from an LLM, saying: I don't know.

[00:34:54] Simon Maple: And in my chat with Devin, and by Devin, I mean the founder of Dosu, not the, not...

[00:34:59] Guy Podjarny: It must be so awfully confusing for Devin. If you're confused, and you're listening to this and haven't listened to the episode: Devin is the CEO and founder of Dosu, which is an AI tool that has nothing to do with the other Devin.

[00:35:12] Simon Maple: I didn't just do a podcast with the AI software engineer,

[00:35:15] Guy Podjarny: And yeah, alongside it being one of the most known names in AI engineering, because it was introduced as a tool named Devin that is an AI engineer. Yeah, it must be awfully confusing.

[00:35:23] Simon Maple: So Devin the person, the human intelligence person.

[00:35:25] Simon Maple: The human, yeah, person Devin. Yeah, Dosu Devin. Now, one of the really interesting things that he mentioned is that LLMs don't like answering "I don't know." You have to be really clear in the training with them that there needs to be some limit, some bar, whereby if they don't have a level of confidence that is above that bar, they should say: I don't know, you need to talk to a human, or something like that.
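
The confidence bar Devin describes could be sketched as a simple gate. To be clear, this is an illustrative assumption, not Dosu's actual implementation: the threshold value, the function names, and the idea that the system hands you a ready-made confidence score are all hypothetical (in practice, estimating that confidence reliably is the hard part, e.g. via token log-probabilities or a separate grader).

```python
# Hypothetical sketch of a confidence gate: return the model's answer only
# when its estimated confidence clears a bar; otherwise say "I don't know"
# and hand off to a human instead of making something up.

CONFIDENCE_BAR = 0.75  # assumed threshold; a real system would tune this

def answer_with_gate(model_answer: str, confidence: float) -> str:
    """Gate a candidate answer on its confidence score (0.0 to 1.0)."""
    if confidence >= CONFIDENCE_BAR:
        return model_answer
    return "I don't know. Let me connect you with a human."

# High confidence: the answer passes through.
print(answer_with_gate("Use the settings page to reset your key.", 0.9))
# Low confidence: the gate refuses rather than hallucinating.
print(answer_with_gate("Plan X costs $5 a month.", 0.4))
```

The same gate is also where Guy's "do you want me to get more creative?" idea would plug in: instead of refusing outright below the bar, the system could offer the low-confidence answer explicitly labeled as such.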

[00:35:55] Simon Maple: Now, the interesting piece there is really: if it doesn't know, it tries to make up an answer. It doesn't like saying no if it can't. And actually, there was a really interesting thing that I was doing at Tessl as well, in the blog creation stage, where we're looking at building up content.

[00:36:13] Simon Maple: I thought, wouldn't it be nice if I wrote a little agent in CrewAI, which is one of the tools that we use for our content building. What that agent did is it looked at the Tessl blog and tried to identify additional resources, additional content pieces, and said, oh, here's some related content that you might want to look at.

[00:36:33] Simon Maple: Now, of course, "no" is a perfectly acceptable answer there. There was no related content. But it didn't want to give me the answer no. So what it did, even before we launched the blog, was come back with five or six different blogs that didn't exist. Great titles, by the way, as well. With URLs, saying, here is an amazing piece of related content.

[00:36:52] Simon Maple: It couldn't find an answer, but it wanted to please me. It wanted to delight me with an answer. So it generated all these blogs that didn't exist and said, here is your answer. And I'm sure it was just as confident with that answer as it would have been if it were real, rather than hallucinated and made up.

[00:37:06] Guy Podjarny: I think it's fascinating, learning how to deal with and manage that. It's interesting. One bit there, I think it was also making up some Copilot features, like it was very much inventing new reality, not just new content. And it's interesting sometimes to think, hey, those things that it's hallucinating are things that should exist.

[00:37:24] Guy Podjarny: Yeah. So that was already, interesting.

[00:37:26] Simon Maple: One of my friends, Liran Tal, was asking an LLM about the usage of a specific library, and it suggested, oh yeah, you should use the API here. Of course, that library didn't have an API at all, but it provided all the information: this is what the API looks like.

[00:37:40] Simon Maple: This is how you invoke it. And then there were two things. First of all, Liran was a little bit frustrated: hey, this doesn't exist, you're just hallucinating this up. But the second piece was, maybe it should have that API. Because you know what? This would have been a really valuable part.

[00:37:53] Guy Podjarny: I would have been very happy to have it. So I think that's really interesting. And just to cap that: indeed, in their maturity journey, and I didn't see it in action, but as I was commenting, Fin tells you how confident it is in its answer, and it almost balances this "I don't know" with, if I was to try and guess, here's the information I have. So it shouldn't make up information, but it can at least offer low-confidence information. So there are UX patterns that we can figure out.

[00:38:21] Simon Maple: Yeah, and I think we as humans need to be more comfortable with that as well: understanding, okay, I am talking to an LLM agent in the background, it has provided me with something, and I recognize that if the temperature's up, it will hallucinate more.

[00:38:35] Simon Maple: So I recognize I will get a better interaction with it, but I need to be mindful of the confidence it has.

[00:38:41] Guy Podjarny: And it's a challenge for the UX folks, because if we don't figure it out, then you end up either needing to be entirely accurate, which in turn reduces the usefulness of the LLMs, or we're sometimes deceiving our users, which is clearly not an awesome thing to do.

[00:38:57] Simon Maple: Guy, this has been great. It's been a really strong month of sessions, a really great kickoff for the podcast. A couple of things are going to be changing, actually; we're adjusting based on what we've learned and the feedback that we've had around the podcast.

[00:39:11] Simon Maple: My sessions are going to be changing slightly: I'm going to be releasing two sessions per guest, back to back on the same day. And the Devin session, Devin's session rather, showed this, whereby we have a chat, first of all, about some of the topics that we're covering.

[00:39:28] Simon Maple: And then we're going to do a hands-on session. So we're actually going to go deeper. It's going to be a hands-on screen share, so better with video versus just audio. But this is going to be a much more hands-on, dev-focused session. So keep a lookout for that going forward.

[00:39:42] Simon Maple: You've already recorded another couple of sessions. What can our listeners look forward to coming up?

[00:39:46] Guy Podjarny: First of all, I would say that if you're listening to this on the podcast, the sessions that Simon is describing will be best viewed in a visual format. But we do intend for the conversation that comes before that to be very audio-podcast-friendly as well.

[00:39:59] Guy Podjarny: Absolutely. And we would welcome feedback; we're still at contact@tessl.io if you have thoughts about how you want us to change that. And then, thanks to the joy of recorded podcasts, we did record a bunch of great episodes that are coming up. I will mention Itamar, I've already alluded to that.

[00:40:15] Guy Podjarny: Itamar, the co-founder and CEO of Codium AI. We'll talk about AI testing and more, and Simon will add a real-world, live session to it. And I had a great conversation with Jason Warner, previously the CTO of GitHub, where he was there as they built Actions and Copilot and more. He has done a lot of great things, and right now is running one of the top code-focused foundation model companies, called poolside. We talk a lot about what code generation is, why it works, where it's headed, generic versus code-specific, lots and lots of great insights from Jason. Those and a few more episodes are coming this month. It'll be another great month.

[00:40:54] Guy Podjarny: Tune in.

[00:40:55] Simon Maple: Yeah, absolutely. Guy, a pleasure doing our first monthly roundup.

[00:40:58] Guy Podjarny: Indeed. This was fun. And, I hope you'll join us for all the episodes this coming month.

[00:41:02] Guy Podjarny: Amazing.

Podcast theme music by Transistor.fm.