TDD and Generative AI - A perfect developer pairing with Bouke Nijhuis

In this episode, Bouke Nijhuis, CTO of CINQ, shares his journey from a Java developer to a leading technologist, exploring the fascinating intersection of Test-Driven Development (TDD) and AI. Discover how AI tools can revolutionize your coding practices and what the future holds for AI-assisted development.

Episode Description

Join Simon Maple as he sits down with Bouke Nijhuis, the Chief Technology Officer of CINQ, a consultancy based in Amsterdam. Bouke, an experienced Java developer and international speaker, delves into the world of Test-Driven Development (TDD) and the role of AI in modern software development. Throughout the episode, Bouke discusses his journey in the tech industry, his experiences with various AI coding assistants, and the innovative concept of generating code from human-written tests using AI. Learn about Bouke's practical implementations, the challenges faced, and his vision for a future where tests are the primary artifacts in software development. Whether you're a seasoned developer or just starting out, this episode offers valuable insights into the evolving landscape of AI-assisted coding.

Chapters

  1. [00:00:21] Introduction - Simon Maple introduces the episode and guest Bouke Nijhuis.
  2. [00:00:50] Bouke's Background - Bouke discusses his journey from Java developer to CTO at CINQ.
  3. [00:02:07] Introduction to TDD - Bouke shares his experience with Test-Driven Development.
  4. [00:03:07] AI Coding Assistants - Discussion on various AI coding assistants and their integration into Bouke's workflow.
  5. [00:04:37] Generating Code from Tests - Bouke explores the concept of generating code from human-written tests using AI.
  6. [00:06:08] Validating AI-Generated Tests - How to manually review AI-generated test cases for accuracy and comprehensiveness.
  7. [00:08:51] Practical Implementations - Step-by-step description of Bouke's tool for generating code from tests, including demonstrations.
  8. [00:13:23] Challenges and Solutions - Addressing challenges in AI-generated code and solutions through feedback loops.
  9. [00:15:04] Future of AI and TDD - Bouke's vision for the future where tests drive code generation.
  10. [00:19:22] Bouke's Experiences with AI Models - Insights into different AI models used by Bouke and their performance.

Full Script

[00:00:21] Simon Maple: On today's episode, we're going to be very much talking about TDD and how you can write tests manually and then have code generated from those tests. Joining me today, Bouke Nijhuis. How are you doing, Bouke?

[00:00:35] Bouke Nijhuis: I'm totally fine. How are you Simon?

[00:00:37] Simon Maple: I'm doing very well, thank you. Really happy to have you join our podcast. Now, Bouke, similar to me, you've been a Java developer for many years.

[00:00:46] Simon Maple: Tell us a little bit about your role and what you do for CINQ.

[00:00:50] Bouke Nijhuis: Very good, thank you. I'm working at CINQ. CINQ is a consultancy company based in Amsterdam in the Netherlands. We have three areas in which we specialize. First there's the data area: those are people who work with Splunk and Cribl, and data engineers. Then we have the DevOps people.

[00:01:05] Bouke Nijhuis: They work with Kubernetes, Terraform, cloud providers, stuff like that. And lastly, we have the development unit, and those specialize in Java, Kotlin, Angular, React, stuff like that. I started out as a Java developer at CINQ about 10 years ago. Actually, in 10 days it will be exactly 10 years.

[00:01:24] Bouke Nijhuis: And I worked for a long time at a big bank in the Netherlands. Then I became unit manager of the development unit, and two and a half years ago I became the CTO of the company. Also noteworthy is that I've been an international speaker since 2019. I like to be on stage at conferences to tell people things, talk to people, and have discussions afterwards.

[00:01:42] Bouke Nijhuis: I love it. And it also gives you a chance to be on podcasts, because I assume you saw one of my talks online.

[00:01:48] Simon Maple: Absolutely, yeah. And actually, one of the sessions, similar to some of the topics that we're going to be talking about today, was one that I saw you give at Devoxx and at JBCNConf as well. And it's a really interesting topic when we think about TDD and how that plays a part in AI development going forward.

[00:02:07] Simon Maple: First of all, TDD: is that something that you have always done? Is that something that you do today in your normal development, pre AI assistants?

[00:02:15] Bouke Nijhuis: A very good question. It's not something I started out with. I learned about TDD, I would say, about 15 years ago, and I think it's pretty hard to do it the official way. I do something a little bit in the middle: I write a little bit of tests, write a little bit of code, go back to the tests. So I'm not a diehard TDDer, but I like it a lot.

[00:02:35] Bouke Nijhuis: It helps me to develop faster and develop better.

[00:02:39] Simon Maple: Yeah. And from the AI point of view, looking at the various AI assistants that are in use today: first of all, are there any AI assistant tools that you find particularly useful in your daily workflows? And secondly, when you think about how AI tools or AI-powered dev tools can really help you in your daily development with providing tests, where do you find the most value in a tool?

[00:03:07] Simon Maple: Is it creating the test? Is it identifying where to test? Where does that lie for you?

[00:03:12] Bouke Nijhuis: Yeah, very good question again. There are several things I use AI coding assistants for. I prefer them to be in my IDE, because then they know the context of the things I'm talking about. And I use them to just generate code for me. If I just press enter it makes a proposal, and I like that, but I like it even better that you can chat with it, and then you can reason about the code and ask it to generate something and to improve upon it.

[00:03:36] Bouke Nijhuis: I ask it to generate test cases, but that's only when I already have an implementation. What we're going to talk about today is the other way around. And I also like to use it to find bugs. Sometimes it's really fun, and sometimes it comes up with really interesting suggestions, like: oh, while reasoning about the code, there were corner cases I didn't think about.

[00:03:54] Bouke Nijhuis: So yeah, there are various things I use it for.

[00:03:56] Simon Maple: It's providing you with things you might miss. You may be thinking too close to the golden path, and it's thinking more broadly, and as a result focusing on areas that you wouldn't necessarily pay as much attention to, perhaps.

[00:04:08] Bouke Nijhuis: And actually, now thinking about it, if you would summarize it, I use it like a virtual pair programmer.

[00:04:12] Simon Maple: Yeah, interesting. Which is, when we think about these tools as AI assistants, exactly the kind of role that we want them to play, right? And of course, let's introduce something else that you're working on now as a project. You've been building up a project that is really trying to generate code from the TDD-style tests that you've created earlier.

[00:04:37] Bouke Nijhuis: Yeah, let me also introduce a little bit how I came upon this idea, where it came from. Last year I had a talk that I gave at several conferences and meetups called the Battle of the AI Coding Assistants. In that talk I compared different AI coding assistants like GitHub Copilot and JetBrains AI Assistant, but also ChatGPT.

[00:04:56] Bouke Nijhuis: And I showed off the features, the things they could do. And one of the features that I showed off is the fact that they are capable of generating test cases based upon your implementation. So you give the implementation, ask for test cases, and they come up with test cases. And it works pretty well. At the end of one of those conferences a discussion started, which happens most of the time. And then somebody asked: can you also do it the other way around? You as a human, can you write a test, and then can you ask the AI to come up with an implementation? And I said, that's really interesting. I don't know, but I'm going to investigate. So I started investigating, and this led to some research.

[00:05:29] Bouke Nijhuis: And that led to my new talk, TDD and Gen AI: A Perfect Pairing, in which I investigate whether it's possible to write tests as a human, give them to the AI, ask for an implementation, and then use the same tests to see if the AI came up with a proper implementation. And that's the thing that I'm nowadays talking about at conferences, and also on this podcast.

[00:05:52] Simon Maple: Yeah, it's interesting. So there are really both styles: you create the code and then generate the tests, and then vice versa, you write the tests and have AI generate the code. Let's talk about the first case first, and then we'll spend most of our time on the TDD side.

[00:06:08] Simon Maple: On the first case, when we think about it, tests are really that kind of last line of defense. We want to make sure our code is bulletproof, or as bulletproof as we can get it. And we rely on those tests to really validate the assertions that we want on our projects or the code.

[00:06:22] Simon Maple: If AI is generating that and saying, yes, this passes or this fails, how do we validate that the AI-generated tests are actually being generated correctly, or covering the areas of our projects that are the most important? Is this a manual kind of review of that output from the AI, or is there a better way of doing that?

[00:06:42] Bouke Nijhuis: Right now I'm actually doing it manually, because, as we all know, AI tends to come up with things, and sometimes they come up with things that are untrue or incorrect. Therefore, you should always check the output of a large language model, of an AI. So in the first case, when you write the implementation and AI writes the test, always check the test.

[00:07:00] Bouke Nijhuis: I noticed that it sometimes comes up with corner cases I didn't think about, so that helps a lot. But in general, I would say it always comes up with the happy flow, and in most cases also with edge cases. And if it doesn't, you can just ask for more tests, and it will help you, it will inspire you.

[00:07:16] Bouke Nijhuis: And like I said before, it's like a pair programmer that helps you to achieve a higher level of software.

[00:07:21] Simon Maple: Presumably you use the chat functionality in just the same way, but against test code versus development code, and you can talk about the coverage or talk about what cases it's trying to catch there as well. And if we think about it the other way around, when you're basically doing TDD, because you're the one writing the tests, I guess you can ask AI to validate some of your work there.

[00:07:40] Simon Maple: But realistically, those tests are effectively validating the generated code. So there's less of a need to then manually go over the code, because you can make certain assumptions based on the quality of your coverage or the tests that you're writing.

[00:07:56] Simon Maple: Is that fair?

[00:07:57] Bouke Nijhuis: Yeah, so I would say the other way around is a better way, because if you provide the test, you provide a specification because tests, of course, are a specification of what you want. But you also provide a way to check the output of an AI. And that's something that actually every area would benefit from, but not every area has something to check the output of large language models.

[00:08:16] Bouke Nijhuis: Luckily, we software developers have a way to check the correctness of our implementation: the unit test. And therefore the second way, from tests to implementation, is, I would say, a better way.

[00:08:28] Simon Maple: Yeah, that makes a lot of sense. And you mentioned specifications there, which I think is very interesting, because you've also mentioned chat, which is more of a short-term prompt, whereas a specification you'd think is a longer document which describes the project.

[00:08:42] Simon Maple: Would you say tests are enough to be able to define your specification, or do you feel like there are other things as well that you need to add in to be able to define that project?

[00:08:51] Bouke Nijhuis: Good question. Actually, as homework, I listened to a few of your podcasts before. And I

[00:08:55] Simon Maple: all right.

[00:08:56] Bouke Nijhuis: One with James Ward. And you also discussed this topic, and I liked what was said there: tests are not enough, but I think we should extend what should be part of the test to make sure that the AI has enough to come up with an implementation.

[00:09:08] Bouke Nijhuis: And we have enough to check the validity of that implementation. At the current state, with how we are testing nowadays, let's say JUnit 5 tests, there are probably some areas that are missing, let's say security, let's say performance, but I think in the near future we should be able to add that in a test as well.

[00:09:26] Bouke Nijhuis: Actually, I already have ideas about how you can do this: if you add a performance test to your test suite and also run that against the implementation from the AI, you can already cover the performance as well.

[00:09:35] Simon Maple: Oh, wow. Tell us a little bit more about that. That sounds interesting.

[00:09:38] Bouke Nijhuis: Yeah, so I'm going to show off a proof-of-concept tool that I created; I'm going to show that in the next episode. For now it just uses JUnit 5. But in JUnit 5 you can already time things, and you can say in an assertion that this test should run faster than, say, 20 milliseconds.

[00:09:56] Bouke Nijhuis: And then you already have some kind of performance measurement, so you could work around it. You could also run a pipeline and use security scanners to see if there are no security bugs in there. So everything is already there; it just has to become one package. We're working towards, I think, a future in which all those aspects can be fitted into unit tests, or at least tests.
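For reference, here is a minimal sketch of the kind of JUnit 5 timing assertion Bouke describes. The assertTimeout API is real JUnit 5; the isPrime helper is just a hypothetical stand-in for whatever implementation the AI would generate.

```java
import static org.junit.jupiter.api.Assertions.assertTimeout;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.time.Duration;
import java.util.stream.IntStream;
import org.junit.jupiter.api.Test;

class PerformanceAssertionTest {

    // Stand-in for the generated implementation; here a simple primality check.
    static boolean isPrime(int n) {
        return n > 1 && IntStream.rangeClosed(2, (int) Math.sqrt(n)).allMatch(d -> n % d != 0);
    }

    @Test
    void primeCheckRunsFasterThanTwentyMilliseconds() {
        // assertTimeout fails the test if the block takes longer than the given duration,
        // turning an ordinary unit test into a rough performance requirement.
        assertTimeout(Duration.ofMillis(20), () -> assertTrue(isPrime(104_729)));
    }
}
```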

[00:10:18] Simon Maple: Now, you mentioned that you watched, or listened to, the James Ward episode. I'm going to ask a similar question, actually. I'd love to hear your opinion on this, which is really about projecting ourselves into the future and thinking, okay, what's the source of truth, or the most important artifact.

[00:10:33] Simon Maple: When we think about this idea of us writing tests, a TDD style whereby the code gets generated: if the code is continually being generated and the thing we're changing is the tests, what's going to be more important in the future? Is it going to be the tests, or is it going to be the generated code, do you think?

[00:10:51] Bouke Nijhuis: Yeah, I think the answer is simple. If we trust the generated code from the AI, and that means if we trust our tests, then the tests will obviously be more important. We don't care about the implementation anymore. Actually, sometimes when I'm playing with this tool, or using the tool, it comes up with implementations that are hard to read for me.

[00:11:08] Bouke Nijhuis: That doesn't really matter, because I don't have to read the code anymore. At least in the future, if we really can get this way of working going.

[00:11:17] Simon Maple: And I suppose you could always ask AI to describe the code for you. I think we're finding ourselves in a far nicer place, where AI is better at understanding and able to describe code in a far clearer way to developers. And I think that will help with maintainability through using AI as well.

[00:11:33] Simon Maple: Yeah, that's really interesting. And I know there's one other topic that we should talk about, which is really about the purposes of tests. Now, I know you mentioned offline that there are two purposes, input and validation. Talk us through what you mean by that.

[00:11:48] Bouke Nijhuis: Yeah, I think we've already talked a lot about using tests as input: we as a human write the test, we give it to the AI, and it comes up with an implementation that passes this test. And that's where the second purpose of tests comes in. We're going to use the tests to see if the provided implementation is a correct implementation.

[00:12:06] Bouke Nijhuis: And this is really where the value of this method lies. I said it before: sometimes AIs hallucinate, come up with things that are untrue. But with these tests, you can see if they came up with a proper solution. And I think this is a really important part of this mechanism.

[00:12:24] Simon Maple: And what's the way that you identify whether a proper solution has been generated?

[00:12:29] Bouke Nijhuis: Once I get the implementation from the AI assistant, I run the tests against it. And if all tests are green, if they all pass, we have a perfect implementation. And if you trust your tests, you actually have production-ready code. So that's the idea. And let me share a story that really highlights this problem of large language models, that they make things up. I think half a year ago there was a story in the news about Air Canada, which is an airline. They have a website, and this website had a chatbot, and they decided to make this chatbot more human-like by using a large language model.

[00:13:07] Bouke Nijhuis: Which is by itself a good idea. So they implemented that, and a customer started talking with the chatbot, and they were talking about refund policies. As you can probably imagine where this is going, the chatbot started to make things up about this refund policy. The customer had the foresight to make screenshots.

[00:13:23] Bouke Nijhuis: A few weeks later, he decided to make use of this refund policy. So he called the helpdesk, the actual humans behind it, and he said, I would like to refund my ticket. And the people at the helpdesk said, no, that's not possible, the chatbot made a mistake. The customer disagreed. He said, come on, look at these screenshots.

[00:13:41] Bouke Nijhuis: I want a refund. He didn't get his refund and he went to court. And the judge agreed with the customer. The judge said, I don't care if it's a large language model or a human that puts stuff on your website. If it's on your website, it's true. Therefore, you have to honor the agreement. And Air Canada had to refund the ticket, and the next day the chatbot was offline.

[00:14:01] Bouke Nijhuis: I really think this story highlights the problem we have nowadays with chatbots. You cannot trust them for the full 100%. They're really useful, but there are these last few percent in which they come up with false information.

[00:14:15] Simon Maple: Yeah, and that's a really interesting story. I think there's actually a lot more to think about there, just in terms of the contracts between consumers and vendors. But if we relate that back to tests and think, okay, if those errors or those hallucinations are actually appearing in our code, let's say the tests don't come back green.

[00:14:35] Simon Maple: What's the next step? Do we need to alter our tests? Do we need to alter our code? How do we know which to change? In this project that you were creating, walk us through that flow.

[00:14:46] Bouke Nijhuis: Yeah, let me first do the happy scenario, and then we'll talk about when things go wrong. In the happy path, you as a human write tests, you give them to the AI, you ask for an implementation. This is the happy path, so we get a perfect implementation, production-ready. We run the tests, and remember, happy path, all tests are green, so we have an implementation and we can ship it to production.

[00:15:04] Bouke Nijhuis: So this is the ideal world. Unfortunately for us, we don't live in an ideal world. So there are two loops in this mechanism. First, when we ask the LLM for an implementation, it sometimes comes up with an explanation. That's not good enough, we want an implementation, so we ask it again: come on, we want an implementation.

[00:15:21] Bouke Nijhuis: Sometimes it comes up with code that doesn't compile. That's also not good enough. We need compiling code, otherwise we cannot run the tests. This is the first loop. Then there's a second loop. Sometimes not all tests pass. If that's the case, we extract the errors from the tests and we feed them back to the large language model.

[00:15:37] Bouke Nijhuis: And we say, hey, this failed, come up with a better implementation that solves the problems of the current version. So you get a feedback loop from the tests to the large language model: you feed the errors back, and you keep doing that until you get a perfect solution.

[00:15:52] Bouke Nijhuis: of course, some things are insolvable. So there are also a max amount of retries in there. But normally it's capable of solving, problems. and I'm going to show you in the second part of this episode or the next episode, how we do that.
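As an illustration only, here is a rough sketch of the generate-and-verify loop described above. This is not Bouke's actual tool: the LlmClient and TestRunner interfaces are hypothetical placeholders, and the two retry budgets he mentions are collapsed into a single loop for brevity.

```java
import java.util.List;
import java.util.Optional;

public class TestDrivenGenerator {

    // Hypothetical abstraction over a local or cloud large language model.
    interface LlmClient {
        String requestImplementation(String testSource, List<String> previousErrors);
    }

    // Hypothetical abstraction that compiles a candidate implementation and runs the
    // human-written tests against it, returning an empty list when everything passes.
    interface TestRunner {
        List<String> compileAndRun(String testSource, String implementationSource);
    }

    private static final int MAX_RETRIES = 5;

    private final LlmClient llm;
    private final TestRunner runner;

    public TestDrivenGenerator(LlmClient llm, TestRunner runner) {
        this.llm = llm;
        this.runner = runner;
    }

    /** Returns a passing implementation, or empty if the retry budget is exhausted. */
    public Optional<String> generate(String testSource) {
        List<String> errors = List.of();
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            // Ask the model for an implementation, feeding back any earlier failures.
            String candidate = llm.requestImplementation(testSource, errors);

            // Run the original tests against the candidate; all green means we are done.
            errors = runner.compileAndRun(testSource, candidate);
            if (errors.isEmpty()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // some problems stay unsolved within the retry budget
    }
}
```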

[00:16:04] Simon Maple: Yeah, absolutely. And first of all, presumably you tell it which tests are failing, as well as the failures, so it can recognize which flows it needs to concentrate on. In your experience, how many iterations do you typically see it requiring before it can go from, let's say, one failing test to all green?

[00:16:23] Bouke Nijhuis: Yeah, that's a very good question, but it really depends, like we always say in our,

[00:16:27] Simon Maple: Yeah.

[00:16:28] Bouke Nijhuis: But the more difficult the problem, the more loops it takes. In the demo, you will see some things will be one-shot solutions. Also an interesting thing: a lot of language models are not deterministic, so sometimes they can fix a problem in one shot, and sometimes they cannot solve it again.

[00:16:41] Bouke Nijhuis: Then I rerun it, and it solves it again in one shot. The tool that I created stops at five retries: five retries for generating code, and five retries for running the tests.

[00:16:50] Bouke Nijhuis: And the majority of the time

[00:16:51] Simon Maple: it succeeds?

[00:16:54] Bouke Nijhuis: Yeah. But that really depends on the problem, on how complex it is.

[00:16:56] Bouke Nijhuis: In the demo after this, we're going to start simple, and then we're going to ramp up the complexity.

[00:17:00] Simon Maple: So talk us through a little bit before we close this episode. For our listeners, this episode will be on all the usual podcast platforms as well as YouTube. For the second part, you'll really need YouTube to watch it. We're going to do a screen share, and we're going to show creating tests and then generating code based on those tests, and then effectively run the tests against the code and see how many iterations we need before we actually get a valid working scenario.

[00:17:29] Simon Maple: Tell us about those scenarios that we're going to be doing. How complex are these problems that we're going to be trying to solve?

[00:17:35] Bouke Nijhuis: So we're going to start really simple, we're going to start manually. The first test that we're going to use is called the odd-even test. Every developer knows about odd and even, it's a pretty simple problem. We're going to copy the code to ChatGPT. We're going to ask for an implementation. We copy

[00:17:47] Bouke Nijhuis: this implementation back to my IDE, and then we're going to run the test. Ah, this will work, of course. Then I'm going to make it a little bit more difficult: I'm going to use prime number generation and do exactly the same thing. Then, doing this stuff manually becomes boring really fast, so we need something to automate this.
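To make the starting point concrete, a human-written odd-even test might look something like the sketch below. The OddEven class name and isEven method are assumptions rather than details from the demo, since producing that class is exactly what the AI is asked to do.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class OddEvenTest {

    @Test
    void evenNumbersAreRecognised() {
        // Happy flow plus a couple of edge cases: zero and a negative even number.
        assertTrue(OddEven.isEven(0));
        assertTrue(OddEven.isEven(2));
        assertTrue(OddEven.isEven(-4));
    }

    @Test
    void oddNumbersAreRecognised() {
        assertFalse(OddEven.isEven(1));
        assertFalse(OddEven.isEven(-7));
    }
}
```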

[00:18:02] Bouke Nijhuis: So I'm going to introduce a tool which can automate those tasks for you: you provide the tool with a test case, and it just talks to the large language model, grabs the feedback, does the loops, the things I talked about, until all tests pass and we have a working implementation. So the first problem I had with this approach was too much manual work; that's why I created the tool. And then the second problem I had was the fact that the tool is pretty good at generating plain Java code, but that's not what we do at our workplace, right?

[00:18:29] Bouke Nijhuis: We all use frameworks and libraries. So I created a Maven plugin that can read your POM file, so it knows which libraries and which frameworks you're using. And then I'm going to show off a simple Spring Boot Hello World, and as the last one, I'm going to show you a more difficult Spring Boot program where there's an endpoint: you give it a date, a birth date, and it comes up with your age.
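For a flavour of what the generated code for that last exercise could look like, here is a rough Spring Boot sketch. The /age path, the birthDate parameter name, and the controller name are assumptions, not details taken from the demo.

```java
import java.time.LocalDate;
import java.time.Period;

import org.springframework.format.annotation.DateTimeFormat;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
class AgeController {

    // GET /age?birthDate=1990-05-01 returns the whole years between that date and today.
    @GetMapping("/age")
    int age(@RequestParam("birthDate")
            @DateTimeFormat(iso = DateTimeFormat.ISO.DATE) LocalDate birthDate) {
        return Period.between(birthDate, LocalDate.now()).getYears();
    }
}
```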

[00:18:52] Simon Maple: Amazing. Using real-world scenarios with a proper Spring application: not super complex, but certainly not trivial applications for it.

[00:19:01] Bouke Nijhuis: But one more thing about it, which is really interesting. I started developing this thing at the beginning of the year, and I do most of the things with local large language models, so running on my laptop. At the beginning of the year it really struggled doing all of those, so halfway through the presentation I had to switch to cloud large language models.

[00:19:15] Bouke Nijhuis: But nowadays, most of the exercises are doable with the local LLM.

[00:19:22] Simon Maple: Have you switched around with different models and have you found big variations between them? Are there some that are maybe better at generating code or understanding code versus others?

[00:19:32] Simon Maple: What did you land on?

[00:19:34] Bouke Nijhuis: Let's start with the local ones. There are a lot of local models out there. I played a lot with them, but in the end the Llama 3 models are the best ones for my use case. A few months back I switched to Llama 3, and nowadays I'm using Llama 3.1, and that works really well. I think it's one of the best local ones out there.

[00:19:52] Bouke Nijhuis: And when I have to switch to the cloud ones, I used to use ChatGPT, which worked really well. Nowadays I also use Claude 3.5 Sonnet, and that also works really well. I haven't found a big difference between those two yet, but there are people out there who say that the Claude one is better.

[00:20:08] Simon Maple: Yeah, I've heard a mixture, to be honest. I always hear people raving about Claude models, and a lot of the graphics I see show both Claude and GPT-4o being very close together, but there then being a gap between those two and other models.

[00:20:24] Simon Maple: So yeah, interesting to see the differences there, but great to hear your take. That's the end of our chat. We're going to go into a deep dive now for the next session, for those who want to follow that as well. And I very much recommend you watch, because I've seen this before and it's amazing to see it in action: we're going to be writing tests and generating code from those tests.

[00:20:42] Simon Maple: So if you want to follow along with that, do check us out on the Tessl podcast page, or the YouTube page directly. Join us soon. And thanks very much, Bouke. Speak to you soon.

[00:20:52] Bouke Nijhuis: Thank you, Simon, for inviting me, and see you soon. Bye bye.

Podcast theme music by Transistor.fm.