TDD and Generative AI in Action! A hands-on demo with Bouke Nijhuis

Join us in this insightful episode where we explore the fascinating intersection of AI and test-driven development (TDD) with expert developer Bouke Nijhuis. Discover how AI can automate code generation, saving you time and ensuring your code meets predefined specifications.

Episode Description

In this episode of the AI Native Dev podcast, Simon Maple sits down with Bouke Nijhuis, a seasoned developer known for his innovative approaches in integrating AI into development workflows. Bouke provides a hands-on demonstration of how AI can generate code from TDD-style tests, walking us through the entire process from simple examples to more complex real-world scenarios.

Bouke showcases a tool he developed that automates many of the manual steps involved in this process, significantly enhancing efficiency and productivity. Throughout the episode, Bouke shares his insights on overcoming challenges, using local and cloud-based large language models (LLMs), and integrating with existing libraries and frameworks like Spring Boot.

Whether you're a developer looking to streamline your workflow or simply curious about the future of AI in software development, this episode is packed with valuable insights and practical demonstrations. Don't miss out on learning from one of the leading minds in AI-driven code generation.

Resources

Chapters

  1. [00:00:22] Introduction and Welcome - Simon Maple introduces the episode and guest Bouke Nijhuis.
  2. [00:00:57] Understanding the Workflow - Bouke explains the basic workflow of generating code from TDD-style tests.
  3. [00:05:00] Initial Demonstration - Bouke demonstrates generating code for an odd-even test using ChatGPT.
  4. [00:09:52] Tackling More Complex Scenarios - Bouke showcases generating prime numbers and integrating libraries.
  5. [00:12:19] Automating the Process - Introducing the automated tool for generating code and running tests.
  6. [00:17:28] Improving the Tool with Existing Tests - Bouke discusses using the tool to improve and refactor existing code.
  7. [00:24:24] Adding New Features - Demonstration of adding new features by writing additional tests.
  8. [00:26:06] Overcoming Challenges with Cloud Models - Bouke talks about the benefits and challenges of using cloud-based LLMs for more complex tasks.
  9. [00:32:28] Integrating with Frameworks - Using a Maven plugin to integrate generated code with Spring Boot.
  10. [00:38:01] Conclusion and Next Steps - Final thoughts, key takeaways, and resources for further exploration.

Full Script

[00:00:22] Simon Maple: Hello, everyone, and welcome back to the AI Native Dev. And hot on the heels of our last episode, we're back, chatting, with Bouke Nijhuis. Welcome, Bouke.

[00:00:32] Bouke Nijhuis: Thank you Simon.

[00:00:33] Simon Maple: This is a follow up on the previous chat that we had, and we're actually gonna go hands on.

[00:00:36] Simon Maple: We're gonna screen share and we're gonna show how code can be generated from some TDD-style tests, whereby we're writing tests in advance, generating code from those tests, and then validating that the code is doing what we expect. So effectively, the tests are forming somewhat of a specification for our component or our application.

[00:00:57] Simon Maple: We're going to start fairly simple and then jump into more of a real-world application. But Bouke, tell us a little bit more about the style, the way of working that your process is going to take.

[00:01:11] Bouke Nijhuis: Yes, I will. Currently on the screen is a little flow diagram, and on the left we have the tests.

[00:01:18] Bouke Nijhuis: Those are the things we write as humans. We give those tests to a GenAI, and we ask it to come up with an implementation that will pass the tests. Then we move to the middle box, which says implementation. So the GenAI gave us an implementation. Then we run the tests against that implementation.

[00:01:36] Bouke Nijhuis: If all tests pass, we have production-ready code. As you can see in the schema, there are two loops. Let's start with the small loop on the GenAI. We ask the GenAI to come up with an implementation, but sometimes it doesn't really listen and it comes up with an explanation instead, and we cannot run tests against an explanation. And sometimes the GenAI comes up with non-compiling code.

[00:01:57] Bouke Nijhuis: That's also a problem, so we ask it again until we get a proper implementation that compiles. Then sometimes not all tests pass, and we follow the "no" arrow from "tests pass". We extract the errors from the tests that do not pass, we feed them back to the GenAI, and we ask the GenAI to come up with a better implementation in which all tests will pass.

[00:02:18] Bouke Nijhuis: Both loops get five iterations and normally that's more than enough. But sometimes it's not enough and then it says I cannot find an implementation for this test.
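The two loops Bouke describes can be sketched in plain Java. Everything below is a hypothetical skeleton: `askLlm`, `compiles`, and `runTests` are stand-in stubs for the real tool's LLM call, compiler, and test runner, not its actual implementation.

```java
import java.util.List;

public class GenerationLoop {
    static final int MAX_ITERATIONS = 5; // both loops get five attempts

    // Stub: in the real tool this calls a local or cloud LLM.
    static String askLlm(String prompt) {
        return "class OddEven { boolean isEven(int n) { return n % 2 == 0; } }";
    }

    // Stub: in the real tool this compiles the candidate source.
    static boolean compiles(String source) {
        return source.contains("class"); // placeholder check
    }

    // Stub: returns the failure messages of tests that did not pass.
    static List<String> runTests(String source) {
        return List.of(); // pretend all tests pass
    }

    // Outer "test loop": keep feeding test failures back until all tests pass.
    static String generate(String testSource) {
        String prompt = "Give me an implementation that passes:\n" + testSource;
        for (int testLoop = 0; testLoop < MAX_ITERATIONS; testLoop++) {
            String impl = null;
            // Inner "code loop": retry until the model returns compiling code,
            // not an explanation or a snippet.
            for (int codeLoop = 0; codeLoop < MAX_ITERATIONS; codeLoop++) {
                String candidate = askLlm(prompt);
                if (compiles(candidate)) { impl = candidate; break; }
            }
            if (impl == null) break;
            List<String> failures = runTests(impl);
            if (failures.isEmpty()) return impl; // production-ready code
            prompt += "\nThese tests failed, come up with a better implementation:\n" + failures;
        }
        throw new IllegalStateException("Cannot find an implementation for this test");
    }
}
```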

[00:02:27] Simon Maple: Okay, cool. We're going to be showing this working from a fairly simple odd-even test, going up to more complex examples.

[00:02:34] Simon Maple: For this, tell us a little bit about some of the technologies that we're using. Are you using a local model today, or are you going to be in the cloud?

[00:02:40] Bouke Nijhuis: Yeah, very good question. I always start off with a local model. So I'm using a tool called Ollama. Ollama is like the Docker for local large language models, and it works pretty well.

[00:02:51] Bouke Nijhuis: We're going to use, Llama 3.1. That's the model I'm using recently. And that's the one model that works best for this use case. If the local Llama 3.1 model isn't capable of solving the problem, we will move to a cloud LLM because the cloud LLMs are just better.

[00:03:11] Simon Maple: Yeah. Okay, cool. I guess there's a third flow here. If we are, following a true TDD model, we write certain tests, let it generate, and then iterate on that ourselves. We might write another test, and another test. Is that the way you envisage this to work?

[00:03:29] Simon Maple: And, once you hit done, you go back, and then there's another iteration on this to say, actually, now let's add that next bit of functionality and get some failing tests that we try and get passing with the new implementation.

[00:03:40] Bouke Nijhuis: Yeah, that's how it works. So in the old days, when we didn't do this, when you would like to add functionality, you would just add more code, preferably writing a test first and then adding more code.

[00:03:49] Bouke Nijhuis: But now, in this idea, you're not allowed to touch the implementation anymore. So if you want to add a feature, you add more tests and rerun the tool until we get to the done stage. Then we have a working implementation that will also pass the new tests.

[00:04:02] Simon Maple: Yeah. Okay, cool. And I guess when there's a test failing, we as developers want to have a look at the test and go straight to the code and take a look at that ourselves.

[00:04:10] Simon Maple: Do you still fall into that trap or do you just say, Oh, go on, do it a few more times and, and let's see if you can do it.

[00:04:16] Bouke Nijhuis: Yeah. So in the beginning, I did, also because I wasn't sure if the tool I'm going to show worked properly, right? But nowadays, I just leave it to the AI.

[00:04:24] Bouke Nijhuis: Every time I run the tool, there will be a file called generator.log, and you can see all the steps that the tool takes. You can also see which test failed, and you can see what the new prompt for the GenAI is. So I use it as a debug tool. If something goes wrong with the tool, I can see what happened.

[00:04:40] Bouke Nijhuis: Why didn't it work as intended? But for people just starting with this tool, it's really interesting to see what was happening, what is going wrong, because that still happens. So yeah, it becomes less and less once you get used to this way of working.

[00:04:52] Simon Maple: Yeah, gotcha. Okay. Let's jump in. Why don't we start with the odd-even test,

[00:04:56] Simon Maple: and I think you're going to be using ChatGPT for this to generate the code.

[00:05:00] Bouke Nijhuis: right? Yeah, correct. So I would like to start really simple. I would like to start with an odd even test. this is Java. I use Java because that's the language I'm most familiar with, but for people who do not know Java, no problem.

[00:05:11] Bouke Nijhuis: If you know any other programming language out there, you should be capable of following along. So we have a test, and what we're going to test is odd-even. So we expect an object to be called OddEven, and we expect it to have a method called isEven. Now we all know that if you provide even numbers to an isEven method, it should return true.

[00:05:33] Bouke Nijhuis: If we provide odd numbers, it should return false. So this is the test case that we're going to give to the AI. For now, we're going to do everything manually. I'm going to copy this, go to ChatGPT, and ask it: please provide an implementation that passes this test. And I'm going to paste the code over here and press enter.

[00:06:00] Bouke Nijhuis: The idea is that, ChatGPT comes up with an implementation. Now first you get some interesting text. And then we can just copy the code. Pretty difficult to copy if it moves. So it's copied. Oh, it's in Dutch. No, it doesn't matter. We go over here, we go to the file that should be tested. So the test subject, notice that I have a dummy implementation, which always returns false.

[00:06:18] Bouke Nijhuis: This is incorrect. I'm going to copy. Select everything, paste the code from ChatGPT here, I go back to the test, I run the test, and I hope everything becomes green. As you can see over here, we have a green test. ChatGPT was capable of writing an implementation that passes this test.
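The generated implementation isn't shown verbatim in the transcript, but an `isEven` that passes such a test comes down to a modulo-two check. A minimal sketch, using the class and method names the test expects:

```java
public class OddEven {
    // Even numbers leave no remainder when divided by two.
    public boolean isEven(int number) {
        return number % 2 == 0;
    }
}
```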

[00:06:37] Simon Maple: I wonder, if we were to change the OddEven and the isEven to be something that doesn't refer to the word even, and the same with the odd-even test, do you feel like it would have done the %2, or do you feel like it would have hard-coded those responses?

[00:06:59] Simon Maple: So effectively said, if it's zero, do this. If it's two, do this.

[00:07:03] Bouke Nijhuis: Yeah, very good question. Actually, I tried that. I did the same test and I removed all recognizable words, so this would become an A test, this would become B, this would become C. So there were no human-readable words, no bit of logic in there, no bit of meaning in there.

[00:07:17] Bouke Nijhuis: It was still capable of coming up with a solution. Yeah, but I'm pretty sure it helps. Using the right words gives more input for the AI to come up with the right implementation. Yeah, we can go off script and do a little test right now. I don't know how much time you want. Yeah, let's do it.

[00:07:33] Simon Maple: Let's go off script. Yeah, let's go off script.

[00:07:35] Bouke Nijhuis: Let's see what happens. Nothing bad happens when you go off script, right? So we're going to change this into ATest, and we're going to change this into BTest. Going to change this into C. We can, yeah, let's do everything. So now it's C, and we're going to change this method. So there isn't, oh, and maybe we should change the package as well. Shift F6. What does it ask? Yeah, let's do everything. And then I say: package. Okay. I don't think there is anything recognizable. This looks good. Yeah. So let's see what happens if we do this. So I'm going to copy this and go to ChatGPT.

[00:08:31] Simon Maple: Let's do a clean context. Yeah, I was going to ask about this one, right? So that's this one. Please provide an implementation that passes this. And I'm going to paste the anonymized code, no more references to odd or even, and look at that, it already sees: yep, odd numbers, yep.

[00:08:56] Bouke Nijhuis: So it knows this, and it comes up with a proper solution as well. So yeah, I can run it, but this will work.

[00:09:01] Simon Maple: And of course, it's not that it understands this code, or understands what you're trying to do, but there are patterns of tests and implementations testing for 2, 4, 6, 8 and 1, 3, 5, 7, and that's how it's matching that pattern. But absolutely, if you were to add odd and even in the test as well, it would infer that's what you're trying to do and make it easier to pattern match. But that's a nice test. Should we go back to the script?

[00:09:27] Bouke Nijhuis: Let's go back to the script. First I showed you a really simple odd-even test.

[00:09:31] Bouke Nijhuis: Then we showed you it doesn't need a word like even; it can still generate. But let's take something more complex and do exactly the same, and the thing that I prepared is prime number generation. So we have a test over here and it's pretty simple. So again, a test called underFifty; we expect there to be an object called PrimeNumberGenerator, or a class, I should say.

[00:09:52] Bouke Nijhuis: It has one method called generate. You provide it a number, which is the upper bound of the prime numbers to generate. And if you provide it 50, we expect those prime numbers to be generated, right? So we copy this again, we go to ChatGPT, new context, please provide an implementation passes this test. Again, I'm going to paste it and we're going to wait for the answer.

[00:10:21] Bouke Nijhuis: The answer will be more complex this time, so we're not going to spend too much time on it. We're just going to copy it once again. So let's see when it's finished. It's a bit slow this morning.

[00:10:31] Bouke Nijhuis: There it is. So now I can copy the code and go back to my IDE. Let's go to the test subject. This dummy implementation returns null, which is incorrect.

[00:10:41] Bouke Nijhuis: I'm going to select everything and paste the code. I go back to my test, and let's run it to see what happens. And as you can see, we have a green check mark. So ChatGPT was capable of coming up with a prime number generator by just providing this test. Let's take a look at the code there.

[00:10:59] Bouke Nijhuis: Yeah, we can, that's totally fine.

[00:11:01] Bouke Nijhuis: Yeah, we can really deep dive into this, but I'm not really good at, I don't know how it actually, I don't know. There are all kinds of algorithms to do this, and this is probably not the best one. I don't even really understand it. How's your math?

[00:11:15] Simon Maple: Oh gosh!

[00:11:16] Bouke Nijhuis: Yeah, but that's what we have tests for. And this is also the idea behind this way of working. You don't need to know how it's implemented, if you trust your implementation.

[00:11:25] Simon Maple: Yeah. Yeah, no, that looks right, actually. So it's essentially checking whether every number less than that number, when modded, equals zero, so whether there's no remainder.
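The trial-division idea Simon describes (check each candidate for a smaller divisor that leaves no remainder) can be sketched like this. This is an illustrative version, not the exact code ChatGPT generated in the demo:

```java
import java.util.ArrayList;
import java.util.List;

public class PrimeNumberGenerator {
    // Returns all primes up to and including the given upper bound,
    // matching the test's expectation for generate(50).
    public List<Integer> generate(int upperBound) {
        List<Integer> primes = new ArrayList<>();
        for (int candidate = 2; candidate <= upperBound; candidate++) {
            if (isPrime(candidate)) primes.add(candidate);
        }
        return primes;
    }

    // Trial division: a number is prime if no smaller number >= 2
    // divides it without remainder (n % divisor == 0).
    private boolean isPrime(int n) {
        for (int divisor = 2; divisor < n; divisor++) {
            if (n % divisor == 0) return false;
        }
        return true;
    }
}
```

A faster version would stop trial division at the square root of `n`, but the naive loop mirrors the explanation in the conversation.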

[00:11:37] Bouke Nijhuis: Yeah. But do you ask me, do you go back to the implementation?

[00:11:39] Bouke Nijhuis: No, I don't. Yeah. I want to, I was curious.

[00:11:41] Simon Maple: I was curious

[00:11:44] Bouke Nijhuis: At a given moment. It's okay.

[00:11:45] Simon Maple: Cool.

[00:11:45] Bouke Nijhuis: I'm really happy with this. We have now proven that the concept works, right? So we can write tests ourselves, we can give them to an AI, and the AI will come up with an implementation. So this works. So I'm pretty happy, but there are two big issues with this approach.

[00:11:58] Bouke Nijhuis: And the first one is the fact that there are too many manual steps. I had to copy code from my IDE to my browser, wait till it generated, copy it from the browser back to my IDE, go to the test subject, paste it back to the test, run it. So far too many manual steps. So I decided to automate this. Because that's what we do as developers, right?

[00:12:19] Bouke Nijhuis: If we have to do too many manual things, we become unhappy and we're gonna automate things. So I created a tool, that does this for you. So let's do exactly the same, but now with a tool. And, the tool is just a simple JAR file. It's over here, it's in a different directory, but you can just call it.

[00:12:36] Bouke Nijhuis: And what we have to provide: we have to provide the tool with a path to a test file. Let's take the prime number thingy. So this test, I'm going to copy the path, and I'm going to paste it over here, and I'm going to press enter. So the tool starts up. I told you, we have two loops. We have a test loop to see if all tests pass, and we have a code loop to ask the AI to come up with code.

[00:13:00] Bouke Nijhuis: And, currently it's talking to a local LLM running on my machine, Llama 3.1.

[00:13:06] Bouke Nijhuis: As you can see, there is an implementation. There's a test and all tests passed. Now let's see what it came up with.

[00:13:14] Simon Maple: So the code loop there is really just trying to loop around until it has something which is test worthy, which is essentially can it compile.

[00:13:24] Bouke Nijhuis: So this is the code loop and this is the test loop.

[00:13:26] Simon Maple: Yeah, and then the test loop is really Will the tests pass now it's compiled and running?

[00:13:33] Bouke Nijhuis: As you can see, this is a different implementation. No square root; there is some multiplication in there. So I did exactly the same, but now I just run a tool, grab a cup of coffee, come back, and I have a working implementation.

[00:13:48] Bouke Nijhuis: This is so much nicer. You don't have to do all the manual steps anymore. Now that we have a tool, we can easily provide tests and wait for the AI to come up with solutions. In all the demos I gave up till now, I tried to create something new, but developers have all written hundreds or maybe thousands of tests in their lives.

[00:14:08] Simon Maple: Yeah.

[00:14:08] Bouke Nijhuis: You can also use this tool on existing tests. And that's what I did. Actually, I used the tool to improve the tool. Let me show you a little bit.

[00:14:17] Bouke Nijhuis: So in the tool is something called a code container.

[00:14:19] Bouke Nijhuis: And the code container has multiple features. First of all, it holds the code that is provided by the large language model. Secondly, we do all this in Java, and if you know a little bit about Java, in Java the name of a file should be equal to the name of the class.

[00:14:34] Bouke Nijhuis: This is a class, so this class called PrimeNumberGenerator should be in a file called PrimeNumberGenerator; those should be the same. So one of the features of the code container is to extract the file name from the code. Another feature the code container has: it should be capable of extracting the package. Because in Java, if you use a package, the folder structure on your file system should be equal to the package structure.

[00:15:02] Bouke Nijhuis: So this file should be in org.example.primenumber. And look at that. org.example.primenumber. So this code container has those features. I created, of course, tests for that. So let's start with the first step. The first one, we'll try to extract the file name from code. So we have a little bit of code over here.

[00:15:20] Bouke Nijhuis: Now this is pretty simple. The file name that I expect is HappyFlow.java. Now, I have all kinds of tests here. I put random spaces in different spots. We have classes, we have public classes. And you see all the answers, and also notice that I added messages.

[00:15:37] Bouke Nijhuis: We talked about, does the AI take cues from the wording that you use?

[00:15:43] Bouke Nijhuis: I use those messages specifically to instruct the AI, because if a test fails, I feed back this message and ask for a new implementation. So even without AI, I think it's a good practice to add messages, but this also really helps the AI to come up with a better solution, so it understands better what goes wrong.

[00:16:00] Simon Maple: Yeah, and this helps, otherwise, there's going to be more human interaction on this, right? Because as soon as something fails, you need to then instruct why it's failed. But if you're doing this in a more automated way, it's a nice loop that doesn't need a human interaction.

[00:16:15] Bouke Nijhuis: Now we do something similar for package name extraction.

[00:16:17] Bouke Nijhuis: So you see packages over here, and little classes. So this one should return happyflow, again. We have spaces in different spots. And notice that I have multi-level packages over here, and notice that I have no package over here, because the package is optional. And then again the messages. So this is the test that I use to test my implementation.

[00:16:38] Bouke Nijhuis: And now I have a confession to make. I made a really crappy implementation. If you know a little bit about programming, you know you can easily solve this with regular expressions. But yeah, I'm pretty bad at those. I started out with a really crappy string manipulation solution, and I hoped that, when feeding this to the tool, it would come up with a better implementation than the one I wrote. So let's see what happens. I need this path. I go over here, I remove the last parameter, and I paste the path; see that it ends with CodeContainerTest. And now I'm asking the local model to come up with an implementation that passes the test that you see on top. And this is already a much more complicated problem than the problems we talked about before.

[00:17:28] Bouke Nijhuis: Let's see if it can do it in one shot. Also, while we're waiting, let's introduce the generator.log. This is what I use for debugging. You can see I'm using Llama 3.1. Something happened: there was an implementation, it found two tests, it ran the tests against the implementation, but zero of the tests succeeded.

[00:17:48] Bouke Nijhuis: So now we feed back to the AI, those two tests failed, please come up with a better implementation. And it's retrying.

[00:17:56] Bouke Nijhuis: Now, if we look over here, you see the prompt is pretty elaborate. You're a professional Java developer. Give me a single file, complete Java implementation that will pass this test.

[00:18:04] Bouke Nijhuis: Do not respond with a test. Sometimes that happens. Give me only complete code, no snippets, include imports, and use the right package. And as you can see, we get a timeout. I use a timeout of 30 seconds, and the local model didn't seem to be able to solve it. Now, the first thing that you do, like every developer: you retry, right?

[00:18:26] Simon Maple: Yep.

[00:18:27] Bouke Nijhuis: Because, and this is something we already did, but especially with large language models, they always come up with something different. There's like a fuzziness in there. You retry.

[00:18:36] Bouke Nijhuis: If in the second attempt the local model cannot handle it... notice that we already have something better: one test passes. And when I practice with this, I also use these demos for presentations at conferences.

[00:18:50] Bouke Nijhuis: Normally the local model is capable of solving this. But if it isn't capable of solving it, we will switch to a cloud model. But hey, look at that, it improved. First we had one success, but now we have two successes. So let's see what it came up with. So here's the test. Let's go back to the test subject.

[00:19:11] Bouke Nijhuis: And as you can see, for the people who know Java. Regex, regular expressions. And it came up with two regular expressions that can extract the file name and extract the package name. So you can use this tool to improve your existing code if you already have tests lying around. Any questions about that, Simon?
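The regex approach the model landed on can be sketched as follows. The patterns and helper names below are hypothetical reconstructions, not the code the tool actually generated:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeContainer {
    // Matches "class Foo" (a "public" modifier before it is irrelevant
    // to the match) and captures the class name.
    private static final Pattern CLASS_NAME =
            Pattern.compile("\\bclass\\s+(\\w+)");

    // Matches "package org.example.foo;" and captures the dotted package name.
    private static final Pattern PACKAGE_NAME =
            Pattern.compile("\\bpackage\\s+([\\w.]+)\\s*;");

    // The file name is the class name plus the .java extension,
    // per Java's file-name-equals-class-name rule.
    public static String extractFileName(String code) {
        Matcher m = CLASS_NAME.matcher(code);
        return m.find() ? m.group(1) + ".java" : null;
    }

    // The package is optional in Java, so return an empty string when absent.
    public static String extractPackage(String code) {
        Matcher m = PACKAGE_NAME.matcher(code);
        return m.find() ? m.group(1) : "";
    }
}
```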

[00:19:30] Simon Maple: I find this really interesting because when you think about that first test we did, the simpler test, there are going to be, there are going to be examples of an odd even or even a prime number all over the place, right? Because it's like one of those things that people try just to, just, as they're starting coding.

[00:19:43] Simon Maple: So you'd think it would be easier for AI to be able to find an example of that. With this though, it's slightly different, because this isn't something that you would expect to find everywhere. This is much more specific to what we're trying to do, which I guess is why it took, in this case, three attempts. But it's still very impressive that it looked at its regex and was trying to, I'm talking about it like it's actually doing this, but, understand what was going wrong and how it could actually change it.

[00:20:07] Simon Maple: And it's all, again, it's all correlation and pattern matching, but, and token matching. But, but it's mind blowing that it can do this in just a few attempts.

[00:20:14] Bouke Nijhuis: And then one of the reasons why it's so useful is I think as a developer, you solve different problems every day, but probably 99 percent of problems that you have solved are already solved many times by other developers.

[00:20:25] Simon Maple: Yeah, or similar enough.

[00:20:27] Bouke Nijhuis: Yeah, maybe the domain is different, the variable names are different, but if you boil it down, probably somebody else has already solved it. So most problems should be solvable this way.

[00:20:38] Simon Maple: Now, one other question actually, in terms of when you go through the iterations and you have the, number of different attempts, obviously we can alter temperature of, of various models to increase the chances of a hallucination, which may actually make it more creative and identify more interesting solutions.

[00:20:53] Simon Maple: And actually on the podcast previously,Des Traynor was talking about how, if you increase the temperature of their, of their chatbot Fin, if you increase the temperature, you're more likely to , be successful with a greater number of requests. However, there is a greater chance of it effectively, just talking bullshit and creating things that are just not true.

[00:21:13] Simon Maple: And the question, was, how comfortable are you with that balance? Now, I feel like sometimes if it is continually failing here, increasing the temperature is not going to be too big a deal, because you still have tests which is validating that.

[00:21:26] Simon Maple: And it's much harder to validate some written text than it is something which needs to be syntactically true or false. Do you play with the temperature at all in your models? Or is that something that, makes a difference?

[00:21:38] Bouke Nijhuis: I get this question a lot.

[00:21:39] Bouke Nijhuis: Actually, I don't play with it. I've, I find that, the one they use, by default works pretty well. but I would like to add something. You just said, there's a factor we didn't talk about yet. And that's costs.

[00:21:49] Simon Maple: Yeah,

[00:21:49] Bouke Nijhuis: I'm running this against my local LLM, so the costs are pretty minimal. Of course, my laptop's battery will die sooner.

[00:21:57] Bouke Nijhuis: So my electricity bill will be a little bit higher at the end of the month, but I probably will not notice it. Once you start using cloud large language models, that's a different story. So let me give you a little bit of an example. I created this tool, which involves calling those cloud models quite a lot. I practice with it, and I give this demo at conferences and meetups.

[00:22:17] Bouke Nijhuis: So I've been using this for, let's say, eight months and I spent about 10 euros at ChatGPT. So playing around with it is not that expensive.

[00:22:27] Simon Maple: Yeah.

[00:22:28] Bouke Nijhuis: But if you're going to use this for every test case that you have, I think it will become pretty pricey pretty soon. But on the other hand, the models are getting better, they're getting faster, they're getting cheaper.

[00:22:38] Bouke Nijhuis: So we are in a time, I see a future in which this is feasible.

[00:22:43] Simon Maple: Yeah.

[00:22:44] Bouke Nijhuis: Picture this, you write your test, you run a tool, you get your coffee, you come back. If the tool worked, you save probably hours and, for the cost of a cup of coffee.

[00:22:54] Simon Maple: Yeah.

[00:22:55] Bouke Nijhuis: So yeah, why not try it?

[00:22:57] Simon Maple: And you get a, and you get a coffee.

[00:22:59] Simon Maple: Yeah.

[00:22:59] Bouke Nijhuis: Yeah. And even better: if the tool failed, it cost you a little bit of money, but you still have the tests. Then you do it the old-fashioned way. You still write the implementation yourself. Yeah, you don't lose a lot.

[00:23:09] Bouke Nijhuis: Or you look at where it's going wrong. That makes you think about the implementation. I'm pretty sure afterwards you will write a better implementation. Actually, there are no downsides. Except for the little money part.

[00:23:21] Simon Maple: Yeah. Yeah. Okay. so let's jump forward.

[00:23:24] Bouke Nijhuis: Yeah.

[00:23:24] Bouke Nijhuis: So you asked me, how do you add features? Let's do a little demo about it. We have the prime number generator test, and what I'm going to do is add a test: what I want is that when you provide a negative number over here, it should throw an exception, because there are no negative prime numbers.

[00:23:41] Bouke Nijhuis: So let's do that. So what I'm going to do, I'm just going to add a test over here and I'm going to say public void negativeInput. Now, let me first do something else. We're going to use something from JUnit called assertThrows. It asserts whether something is thrown. So like this: the first thing that you have to provide is what kind of exception you want.

[00:24:08] Bouke Nijhuis: Runtime exception. That's fine to me. And then you provide a lambda, I think like this, and then you say... I'm missing something. I need this. So, thank you. Did you see what happened? That was the one-line code completion from IntelliJ, which is really pretty fantastic.

[00:24:24] Simon Maple: Is that using the IntelliJ assistant or is that using,

[00:24:26] Bouke Nijhuis: No, this is the, I don't pay for the JetBrains AI system. This is just a free one.

[00:24:30] Simon Maple: Okay.

[00:24:31] Bouke Nijhuis: And do you see what's on my screen? This is exactly what I want, but I only typed the "pr", it infers from the context, what I want, which is already pretty great. Let me see, did I do this correctly? I'm not sure if I did this correctly.

[00:24:45] Bouke Nijhuis: Now let's run it. Let's see what happens. Let's first run the test to see if the test runs. Wrong button, sorry for that. Yeah. So you see underFifty still works, but negativeInput doesn't work, because it doesn't throw the exception. That code isn't there yet right now. Now I'm going to run exactly the same command that I did before.

[00:25:05] Bouke Nijhuis: So not the code container, but the prime number generator. So I did not change the prompt. So the only thing that's different is that currently the file contains two test cases instead of one test case.

[00:25:16] Bouke Nijhuis: And we expect it now to update the code so that the second test case also passes. You see the two tests, and it finds a solution.

[00:25:24] Bouke Nijhuis: Let's go to the implementation. And I think this is the new part. And this is how you add features to your implementation. Just write more tests that specify the features that you want. Questions?
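The new test in the demo uses JUnit 5's `assertThrows`. Since the transcript doesn't show the final code, the sketch below mirrors the idea in dependency-free Java: the `generate` stub and the `throwsException` helper are hypothetical stand-ins for the real generator and for JUnit's `Assertions.assertThrows`, so the example runs without JUnit on the classpath.

```java
public class NegativeInputCheck {
    // The new feature the added test demands: reject negative upper bounds,
    // because there are no negative prime numbers.
    static java.util.List<Integer> generate(int upperBound) {
        if (upperBound < 0) {
            throw new RuntimeException("There are no negative prime numbers");
        }
        return java.util.List.of(); // real prime generation elided in this sketch
    }

    // Plain-Java stand-in for JUnit 5's Assertions.assertThrows:
    // run the lambda and report whether the expected exception type was thrown.
    static boolean throwsException(Class<? extends Exception> expected, Runnable code) {
        try {
            code.run();
            return false; // nothing was thrown
        } catch (Exception e) {
            return expected.isInstance(e);
        }
    }
}
```

In the actual JUnit test, the equivalent line would be `assertThrows(RuntimeException.class, () -> generator.generate(-1))`.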

[00:25:42] Simon Maple: No, this looks awesome. And it's really, it's that TDD way. It's that, it will do the minimum it requires to make those tests, pass.

[00:25:48] Simon Maple: And I think your code is as good as your tests here, right? It's,

[00:25:52] Bouke Nijhuis: Yeah, they're really tightly coupled. I'm really happy with this tool, but there is one problem left, one big problem, actually. And that's the fact that it's pretty good at generating plain Java code, and that's what we did up till now.

[00:26:06] Bouke Nijhuis: But at our daily job, we use libraries, we use frameworks, and the tool in its current form cannot use those. It doesn't know about Spring Boot, it doesn't know about Quarkus, and it's pretty difficult, because its only input is a test case. How can it extract from this test case which libraries or frameworks you're using?

[00:26:25] Bouke Nijhuis: So I started thinking about this problem, and actually the solution is pretty simple. We have to do something with your build system. For Java, that is normally Maven or Gradle. I have far more experience with Maven than I have with Gradle. Therefore, I created a Maven plugin.

[00:26:40] Bouke Nijhuis: And if you do that, if you have a Maven plugin, you can easily access all your libraries. And so for the tool, I created a Maven plugin that creates a custom class loader, that loads all the dependencies, and then it can come up with working code when you're using libraries or frameworks. So let me show you how it works.
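The custom class loader idea Bouke describes can be sketched with the JDK's `URLClassLoader`: collect the dependency JARs the build resolves and put them on a class loader, so generated code that references library classes can be loaded and run. This is a minimal illustration of the mechanism under those assumptions, not the plugin's actual code:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.List;

// Minimal illustration (not the plugin's actual code): put the dependency
// JARs resolved by the build on a URLClassLoader so that generated code
// which uses library classes can be loaded and executed.
public class DependencyLoader {

    public static ClassLoader forJars(List<Path> jars) {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < jars.size(); i++) {
            try {
                urls[i] = jars.get(i).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("bad jar path: " + jars.get(i), e);
            }
        }
        // Parent delegation means JDK and already-loaded classes still resolve.
        return new URLClassLoader(urls, DependencyLoader.class.getClassLoader());
    }
}
```

In a Maven plugin, the list of JAR paths would come from the project's resolved dependency artifacts.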

[00:27:03] Bouke Nijhuis: So the compiling part should know about the libraries that it uses.

[00:27:07] Simon Maple: Got it.

[00:27:08] Bouke Nijhuis: The pom.xml, so there's a lot of stuff in the pom.xml, but the most important part is this. I created a new plugin. It's called the Test Driven Generation Maven plugin. And its most important part is the configuration part.
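The plugin section he's pointing at plausibly has this shape. The coordinates below are placeholders (the real ones are in the repo linked in the show notes); the key idea is a configuration element that points at the test file:

```xml
<!-- Hypothetical sketch of the plugin section in pom.xml; groupId,
     artifactId and version are placeholders, not the real coordinates. -->
<plugin>
  <groupId>com.example</groupId>
  <artifactId>test-driven-generation-maven-plugin</artifactId>
  <version>1.0-SNAPSHOT</version>
  <configuration>
    <!-- As with the JAR file, the plugin needs a path to the test file. -->
    <testFile>src/test/java/com/example/EndpointTest.java</testFile>
  </configuration>
</plugin>
```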

[00:27:20] Bouke Nijhuis: And like with the JAR file, we have to provide a path to a test file. Now notice that there is a test file already over here and it says endpoint test. Let me show you the endpoint test. So over here, so this is a really simple Spring Boot Hello World example. For people who are unfamiliar with Spring Boot, this is how you create a Spring Boot test.

[00:27:42] Bouke Nijhuis: It normally works with a random port, which is initialized over here. Not really important, but this is important: we have a REST template. We use this to do a REST call. As you can see over here, we do a GET on the root of localhost at the random port. And if we do this GET, we expect it to say Hello World.

[00:27:59] Bouke Nijhuis: But this is Spring Boot, so it has to come up with a Spring Boot solution, and the tool should be capable of running this Spring Boot solution. And that's why we need the Maven plugin. So did I point out everything I want to point out? I think I did. Let's try it. So let's go. We need to use, we need to use Maven.

[00:28:19] Bouke Nijhuis: You do this like this: test driven generation, generate. So this will run this plugin I talked about before. And of course this plugin just reuses the JAR file. Now you recognize this output; you've seen this before when we were using the JAR file. And now it's talking to the local LLM, and it asks the LLM to come up with an implementation.

[00:28:47] Bouke Nijhuis: That passes this test. So we want a Spring Boot application that returns Hello World when you do a GET on the root URL. As you can see, things are going wrong, as expected. So we're feeding back the errors to the large language model, and we ask it to come up with better implementations. There is an implementation here, but the test didn't succeed.

[00:29:08] Bouke Nijhuis: So we keep waiting and wait.

[00:29:10] Simon Maple: Again, what's interesting here is that it's the first time it's looping around the code loop now, and that's because it's a much more complex Spring application that needs to be built to actually satisfy this.

[00:29:22] Bouke Nijhuis: Yeah. So this is much more difficult. It looks, it's running.

[00:29:25] Bouke Nijhuis: It's good. Everybody, yeah. Build success. Wow. Nice. Let's see what it created. Probably, let me update that. Yes. Yeah. There it is. Endpoint application. So it's a Spring Boot application. Yeah, people know Spring Boot. This is default.

[00:29:40] Bouke Nijhuis: It is a REST controller. A GET mapping on the root returns Hello World.

[00:29:45] Bouke Nijhuis: This is exactly what I would write when people would ask me to write this.

[00:29:49] Bouke Nijhuis: Yeah. So this is pretty cool. Yeah. Cool. So let's see if we can make it a little bit more complex, but do you have questions first?

[00:29:55] Simon Maple: Yeah, I do have a question.

[00:29:56] Simon Maple: In terms of, so obviously, we're getting to that state now where the GPT is effectively going to have a greater number of attempts at the code creation. The more complex we get, the greater those numbers of attempts are going to be, and we're going to be more likely to hit our timeouts or a retry limit.

[00:30:16] Simon Maple: When you iterate through, for example, in fact, let's use that previous example, when you added that feature to the prime number test, have you tried using the existing code as context, to effectively say, look, this is the code that I had that ran my previous test.

[00:30:33] Simon Maple: Check out my updated tests. You need to update my existing code so that it effectively adds the functionality required to pass the failing test, essentially. So I guess it comes down to a little bit more of a TDD style, where it might say, let's use the existing code, let's run my tests, understand what is failing, and then write some additional code to fix that failing test.

[00:30:55] Simon Maple: is that something you've done or considered?

[00:30:57] Bouke Nijhuis: So your question is, does the second run know about the first run, right?

[00:31:01] Simon Maple: know about the, or even know about the code in the previous iteration.

[00:31:05] Simon Maple: Yeah. That could even be before you've run this, the previous time you ran it.

[00:31:11] Bouke Nijhuis: Oh, sorry. Ah, okay. So your question, yeah, but, in all my examples right now, there is no, we always start from scratch, right?

[00:31:18] Simon Maple: The prime number one, we just added that new functionality, right? So we added the RuntimeException throw. And I think we're going to build upon the Spring application as well.

[00:31:25] Simon Maple: So rather than go through those errors again, and actually have more chance of hitting that retry limit, could we actually say, look, this is the previous code that passed the previous tests. This is the change I'm making. Can you almost adapt or iterate from this starting point?

[00:31:41] Bouke Nijhuis: Yeah. No, I did not do that. It is on my to-do list, because I think the tool would become so much better if it knows more context; the more context you give it, the better its results will be. But on the other hand, what is there is: if you have multiple loops, it knows about the previous loops and it knows what it did before.

[00:32:00] Bouke Nijhuis: So there is a little bit of context in there, but not the entire context over here.

[00:32:05] Simon Maple: Yeah.

[00:32:05] Bouke Nijhuis: That would be like the next generation, right? Nowadays, the token windows of the large language models become bigger and bigger, so you can just put your entire project in there, and I think it will really improve the hit rate. It will really make the tool even more useful.

[00:32:22] Simon Maple: Cool.

[00:32:23] Bouke Nijhuis: More questions?

[00:32:24] Simon Maple: No, all good. Let's move on, yeah. I think we're going to update the Spring test.

[00:32:28] Bouke Nijhuis: Yeah, we'll go to the last example and maybe we can go off script afterwards. Haha. Something harder. But the thing that I prepared is the following one.

[00:32:36] Bouke Nijhuis: So I have an endpoint age test. And the idea here is, again, a Spring Boot program: we provide it with a birth date and it should calculate the age. Let me walk you through the examples. If you were born on the 1st of January this year, you're 0; 1st of January 2020, you're 4; 1st of January 2000, you're 24; but if you were born on the last day of 2000, you're 23. So this is the difficult part, right?

[00:33:01] Simon Maple: That's depressing, isn't it, Bouke? A birth date in 2000, and the age is 23, 24. You feel old, that's what you mean. It makes me feel old, yeah. It makes me feel like I need to go and have a nap, maybe, yeah.

[00:33:15] Bouke Nijhuis: I totally hear it. This is what we expect it to do: again on localhost, a random port again, and then the birth date should be part of the URL. So we expect slash age slash birth date, and then we expect it to return the age. So let's see what happens if we feed this.

[00:33:33] Bouke Nijhuis: Actually, I played with this and it couldn't solve the problem. It couldn't solve the problem because I made a mistake in the input. I switched those two numbers.

[00:33:42] Simon Maple: I was going to ask that question, actually, back when we were trying to trick the LLM at the very, very start by changing the method names and the class names, et cetera, from odd-even. I wonder, if you were to actually change, maybe instead of 2, have 1 there instead, whether it would have looked at it and thought, do you know what, there are enough tests here that a mod two would pass, or it looks close enough to an odd-even. But yeah, another conversation perhaps.

[00:34:09] Bouke Nijhuis: Yeah. Yeah. So let's try this and let's move from there. So I need to, no, let's go to the pom file. So instead of an endpoint test, this one is called endpoint age test, and I'm going to rerun it to see what happens. And I think this is a fairly complex problem. I think if I had to code it myself, it would take me at least 50 minutes, maybe more. Let's see if the local large language models are capable of solving this. Again, let's take a look at the log.

[00:34:40] Bouke Nijhuis: We're still using the local one, Llama 3.1, and the timeout is 30 seconds. You see, there is the second code attempt, so the first attempt delivered no code, or at least no code that could compile. Let's see if we can find something interesting in here. Let me reload it, let me see what happens. So we see this is what I provided.

[00:34:58] Bouke Nijhuis: It came up with something. Apparently it doesn't compile. I'm not sure why this didn't compile. then it came up with a second implementation. But that didn't pass the test. So we ask it again, come up with something that passes the test. If I look at this, it looks pretty okay, I'm not sure what's wrong here.

[00:35:19] Simon Maple: Age in years is a long, and it looks like it wants to, I know, String.valueOf, yeah.

[00:35:26] Bouke Nijhuis: This is also wrong, it copied the test in here.

[00:35:28] Simon Maple: Ah, yeah.

[00:35:29] Bouke Nijhuis: This always adds, so now we're looking into the depths of what it's doing. Actually, you don't want to do that. You just want to pass the test, right?

[00:35:37] Simon Maple: Yeah. yeah. We'd be drinking our coffee somewhere.

[00:35:39] Bouke Nijhuis: Yeah. That's it. Yeah, now I should get us some coffee, but that probably wouldn't be so interesting for the users.

[00:35:44] Simon Maple: Yeah.

[00:35:44] Bouke Nijhuis: Now, it's just, chugging along.

[00:35:46] Simon Maple: Yeah.

[00:35:46] Bouke Nijhuis: And now you also understand why I'm using 30 seconds because you don't want to wait too long, especially when you're doing demos.

[00:35:51] Simon Maple: Yeah.

[00:35:51] Bouke Nijhuis: But I think this is a pretty, pretty hard one. I would say, in the beginning of the year, 5 percent of the times that I did this test it was capable of solving it with a local LLM. Normally I had to go to a cloud one. Nowadays, with Llama 3.1, I would say it's capable of solving this problem 50 percent of the time.

[00:36:11] Bouke Nijhuis: This is, this is taking me too long. I, I'm a little bit impatient, so I'm going to prepare for running against the cloud.

[00:36:18] Simon Maple: Okay.

[00:36:19] Bouke Nijhuis: Let's go there. And what we're going to do, because you can see there are all kinds of properties over here, let's use Anthropic. You can use ChatGPT or you can use Anthropic right now.

[00:36:30] Bouke Nijhuis: I'm planning on adding more models. As you can see, something went wrong: no solution found. So let's uncomment this and let's rerun it. So instead of going to my local LLM, it now goes to the cloud LLM. Let me show you that it's true. As you can see, now it goes to Anthropic.

[00:36:50] Bouke Nijhuis: And it's using Claude 3.5 Sonnet. now you see directly the difference, something is happening. Yeah, wow,

[00:36:55] Simon Maple: straight away, look at that.

[00:36:56] Bouke Nijhuis: Yeah, so the first time, is that the first time? Yeah. So the local ones are pretty impressive, but the cloud ones are so much better. So let's see what it created.

[00:37:05] Bouke Nijhuis: Let's see, where did it put it? So it's an endpoint age calculator. I don't see it. Should be over here, right? Ah,

[00:37:13] Simon Maple: There it is.

[00:37:14] Bouke Nijhuis: Spring Boot application, REST controller, starts with Spring Boot application. This is what I told you it should use, a slash age, and then the birth date. It uses a path variable for this thing, which is correct.

[00:37:25] Bouke Nijhuis: It parses it as an ISO date, which is correct. It uses today. It uses the difference between today and the provided parsed date, and it extracts the years. This is fantastic. This is what I would write.
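The date arithmetic Bouke is describing — parse the path segment as an ISO date, take the period between it and today, extract the years — can be sketched in plain Java. The Spring REST layer is left out here, and the class and method names are illustrative, not the ones the model generated:

```java
import java.time.LocalDate;
import java.time.Period;

// Plain-Java sketch of the age logic the generated controller uses. The REST
// layer (@RestController, @PathVariable) is omitted; names are illustrative.
public class AgeCalculator {

    // "Today" is passed in explicitly so the rule is easy to test with fixed dates.
    public static int age(LocalDate birthDate, LocalDate today) {
        // Period.between yields whole years, so a birthday later in the year
        // has not been counted yet: exactly the tricky last-day-of-2000 case.
        return Period.between(birthDate, today).getYears();
    }

    // What the endpoint would do with the path variable: ISO-8601 parse, then compute.
    public static int ageFromIso(String isoBirthDate, LocalDate today) {
        return age(LocalDate.parse(isoBirthDate), today);
    }
}
```

With a fixed "today" in mid-2024, this reproduces all four expectations from the test: 2024-01-01 gives 0, 2020-01-01 gives 4, 2000-01-01 gives 24, and 2000-12-31 gives 23.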

[00:37:39] Bouke Nijhuis: And it did it in five seconds. Yeah, I think I need a more complex one.

[00:37:45] Simon Maple: And what's, what I love about this is the fact that you are effectively starting with the validation.

[00:37:50] Simon Maple: So as this gets more complex, yeah, we are able to rely on this code much, much more, because we start with the tests and we know the assertions are validated by the end of it. Yeah, this is awesome.

[00:38:01] Bouke Nijhuis: And I think the validation part is really the interesting part of this approach.

[00:38:05] Simon Maple: Yeah.

[00:38:05] Bouke Nijhuis: And so you have a way to validate the output of your large language model. And the Air Canada example showed you how important that is.

[00:38:12] Bouke Nijhuis: It's going to become more and more important in the future, because people will keep making this mistake over and over and get burned. It will cost a lot of money.

[00:38:20] Bouke Nijhuis: But now for coding, we have a way to check, which I think is really cool.

[00:38:24] Simon Maple: Yeah. I know this is the last demo. Can folks have a play with this themselves?

[00:38:29] Bouke Nijhuis: Yeah, they can. After this, I will give to you, Simon, four links. The first link will be this project, called the AI Native Dev Example, so people can play with that.

[00:38:38] Bouke Nijhuis: I will send a link to the repo that contains, the code that creates the JAR file, then a link to the repo that creates the Maven plugin, so I will send those to you, and if you put them somewhere

[00:38:49] Simon Maple: Yeah, we'll put that in the description and the show notes as well, so absolutely, yeah, absolutely perfect.

[00:38:53] Simon Maple: Thank you. We're pretty much at time. Bouke, this has been super insightful, awesome to chat, and awesome to throw some off-script challenges at you as well. So thank you. Thank you very much, and appreciate the session.

[00:39:05] Bouke Nijhuis: Yeah, thank you for having me. Thank you for the fact that I could show off my proof of concept in this area. I loved talking to you, and I also really liked the off-script ones. Those are the best. Yeah,

[00:39:17] Simon Maple: Wonderful. Thanks very much and for everyone listening, please tune in to the next session. Thank you.

[00:39:22] Bouke Nijhuis: Bye bye.

Podcast theme music by Transistor.fm.