From the AI Native Dev: Navigating AI for Testing: Insights on Context and Evaluation with Sourcegraph
Introduction
In the latest episode of our podcast, we delved into the fascinating world of AI testing with Rishabh Mehrotra, an expert from Sourcegraph. The conversation explored the complexities of AI in code generation, the role of machine learning models, and the importance of evaluation and unit testing. This blog post will unpack the key points discussed, providing a comprehensive understanding of AI testing in modern development.
The Big Code Problem
Rishabh Mehrotra opened the discussion by addressing the "big code problem" faced by enterprises. He highlighted that large organizations, such as banks, often have tens of thousands of developers and even more code repositories. This massive code base presents significant challenges in code search and management. Rishabh shared that Sourcegraph has been tackling these issues for years, culminating in the launch of Cody, a coding assistant designed to enhance developer productivity within IDEs.
"Enterprises have a lot of developers which is great for them. They also have massive code bases. Look at a bank in the US; they would have like 20-30,000 developers and more than 40,000 repositories." - Rishabh Mehrotra
The Evolution of AI and ML in Coding
Rishabh's journey in AI and machine learning began in 2009, focusing on traditional NLP before transitioning to deep learning models. He emphasized the rapid evolution of AI models, noting how large language models (LLMs) have revolutionized the field by simplifying complex tasks.
"I've seen waves of specific models being washed away by more generalizable, larger-scale models." - Rishabh Mehrotra
He explained that modern AI tools like Cody let developers work at higher levels of abstraction, without having to wrestle with low-level details themselves.
Features of Cody
Cody is more than just a code suggestion tool. It encompasses a variety of features, including code completion, code editing, bug fixing, and unit test generation. Each of these features requires different machine learning models and evaluation metrics.
"Cody lives in your IDE. If you're a developer using VS Code or JetBrains, it helps you with code completion, code fixing, and more." - Rishabh Mehrotra
Code Completion
Code completion is a high-volume feature with the tightest latency budget. Developers expect suggestions almost instantly, so the models behind it must be both fast and accurate.
"For autocomplete, people will be latency sensitive. They want everything tab-complete within 400-500 milliseconds." - Rishabh Mehrotra
Code Editing and Fixing
Code editing and fixing are less latency-sensitive but require high accuracy. Developers are willing to wait a bit longer for these features as they tackle more complex tasks.
"For code edit and fix, people are okay to wait for a second or two as it figures out the right solution." - Rishabh Mehrotra
Chat and Unit Test Generation
Chat and unit test generation are even less time-sensitive, allowing the models to take their time to provide accurate and comprehensive responses.
"For chat, I'm typing a query, and for unit test generation, I want the entire code to be correct. Take your time and get it right." - Rishabh Mehrotra
The Importance of Evaluation
Rishabh stressed the critical role of evaluation in machine learning. Effective evaluation metrics are essential for understanding the performance of AI models and ensuring they meet user expectations.
"It's not about fancy models; it's about having a solid evaluation. Do you know when something is working better?" - Rishabh Mehrotra
He introduced the concept of "zero to one" in evaluation, emphasizing the need for initial benchmarks that evolve to reflect real-world usage.
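To make the "zero to one" idea concrete, an initial evaluation can be nothing more than a small harness that runs candidate generations against a handful of checks and reports a pass rate. The sketch below is purely illustrative: the `EvalCase` type, the `generateCode` callback, and the toy check are hypothetical and are not part of Cody or Sourcegraph's tooling.

```typescript
// Minimal evaluation harness sketch (hypothetical; not Sourcegraph/Cody code).
// Each eval case pairs a prompt with a check that decides whether a generation is acceptable.
interface EvalCase {
  prompt: string;
  check: (generated: string) => boolean; // e.g. compile the output, run unit tests, or match a pattern
}

// `generateCode` stands in for whatever model or assistant is being evaluated.
async function runEval(
  cases: EvalCase[],
  generateCode: (prompt: string) => Promise<string>
): Promise<number> {
  let passed = 0;
  for (const c of cases) {
    const output = await generateCode(c.prompt);
    if (c.check(output)) passed++;
  }
  return passed / cases.length; // fraction of cases the model handled correctly
}

// Usage: a single toy case checking that a generated function guards against failure.
const cases: EvalCase[] = [
  {
    prompt: "Write a TypeScript function that safely parses JSON and returns null on failure.",
    check: (code) => code.includes("try") && code.includes("null"),
  },
];
// runEval(cases, myModel).then((score) => console.log(`pass rate: ${score}`));
```

The point of starting this small is that the benchmark can then evolve toward real-world usage: the toy checks become real unit tests, and the cases are drawn from the codebases developers actually work in.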
The Pass@1 Metric
The pass@1 metric (spoken as "pass at one"), widely used to benchmark code generation models, measures whether a model's single generated sample passes a set of reference tests. Rishabh pointed out its limitations: a one-shot benchmark score doesn't capture how developers actually use coding tools day to day.
"Pass at One is a great start, but it's not how people are using coding tools in the real world." - Rishabh Mehrotra
Custom Commands and Open Context
Sourcegraph's Cody allows developers to create custom commands to tailor the tool to their specific needs. This flexibility is crucial for adapting to different enterprise environments.
"Cody provides custom commands, allowing you to create tailored solutions and share them with your team." - Rishabh Mehrotra
The introduction of Open Context (OpenCtx) further extends Cody's capabilities by letting developers plug in additional context sources, improving the accuracy and relevance of AI-generated code.
"Open Context is a protocol designed to add new context sources, making Cody's solutions more accurate and relevant." - Rishabh Mehrotra
The Future of AI in Development
As AI tools become more integrated into development workflows, the focus will shift towards evaluation and unit testing. Developers will need to act as orchestrators, guiding AI systems to ensure they meet the desired outcomes.
"You are the domain expert. You get to define what success looks like and evaluate whether the models meet those standards." - Rishabh Mehrotra
Human-in-the-Loop
Despite the advancements in AI, human oversight remains crucial. Developers are best positioned to understand the nuances of their code and provide the necessary context for AI systems.
"The onus is on you to write good evaluations and ensure that the AI systems are meeting the required standards." - Rishabh Mehrotra
Summary
In this episode, we explored the complexities of AI testing in modern development. Key takeaways include the importance of evaluation, the role of custom commands and Open Context in improving AI accuracy, and the need for human oversight. As AI continues to evolve, developers will play a crucial role in guiding these systems to ensure they deliver high-quality, reliable code.
Key Takeaways
- Big Code Problem: Enterprises face challenges managing massive code bases.
- Evolution of AI: AI models have evolved rapidly, simplifying complex tasks.
- Features of Cody: Cody offers various features, each with different latency and accuracy requirements.
- Importance of Evaluation: Effective evaluation metrics are crucial for understanding AI performance.
- Custom Commands and Open Context: These features enhance Cody's flexibility and accuracy.
- Human-in-the-Loop: Despite AI advancements, human oversight remains essential.
As we move forward, embracing these AI tools and understanding their limitations will be key to unlocking their full potential in the development process.