From the AI Native Dev: Navigating AI for Testing: Insights on Context and Evaluation with Sourcegraph

Simon Maple

Introduction

In the latest episode of our podcast, we delved into the fascinating world of AI testing with Rishabh Mehrotra, an expert from Sourcegraph. The conversation explored the complexities of AI in code generation, the role of machine learning models, and the importance of evaluation and unit testing. This blog post will unpack the key points discussed, providing a comprehensive understanding of AI testing in modern development.

The Big Code Problem

Rishabh Mehrotra opened the discussion by addressing the "big code problem" faced by enterprises. He highlighted that large organizations, such as banks, often have tens of thousands of developers and even more code repositories. This massive code base presents significant challenges in code search and management. Rishabh shared that Sourcegraph has been tackling these issues for years, culminating in the launch of Cody, a coding assistant designed to enhance developer productivity within IDEs.

"Enterprises have a lot of developers which is great for them. They also have massive code bases. Look at a bank in the US; they would have like 20-30,000 developers and more than 40,000 repositories." - Rishabh Mehrotra

The Evolution of AI and ML in Coding

Rishabh's journey in AI and machine learning began in 2009, focusing on traditional NLP before transitioning to deep learning models. He emphasized the rapid evolution of AI models, noting how large language models (LLMs) have revolutionized the field by simplifying complex tasks.

"I've seen waves of specific models being washed away by more generalizable, larger-scale models." - Rishabh Mehrotra

He explained that modern AI tools like Cody allow developers to work at higher abstractions, removing the need to deal with low-level complexities.

Features of Cody

Cody is more than just a code suggestion tool. It encompasses a variety of features, including code completion, code editing, bug fixing, and unit test generation. Each of these features requires different machine learning models and evaluation metrics.

"Cody lives in your IDE. If you're a developer using VS Code or JetBrains, it helps you with code completion, code fixing, and more." - Rishabh Mehrotra

Code Completion

Code completion is a high-volume feature where latency matters enormously. Developers expect suggestions almost instantly, so the models behind it have to be both fast and accurate.

"For autocomplete, people will be latency sensitive. They want everything tab-complete within 400-500 milliseconds." - Rishabh Mehrotra

Code Editing and Fixing

Code editing and fixing are less latency-sensitive but require high accuracy. Developers are willing to wait a bit longer for these features as they tackle more complex tasks.

"For code edit and fix, people are okay to wait for a second or two as it figures out the right solution." - Rishabh Mehrotra

Chat and Unit Test Generation

Chat and unit test generation are even less time-sensitive, allowing the models to take their time to provide accurate and comprehensive responses.

"For chat, I'm typing a query, and for unit test generation, I want the entire code to be correct. Take your time and get it right." - Rishabh Mehrotra

The Importance of Evaluation

Rishabh stressed the critical role of evaluation in machine learning. Effective evaluation metrics are essential for understanding the performance of AI models and ensuring they meet user expectations.

"It's not about fancy models; it's about having a solid evaluation. Do you know when something is working better?" - Rishabh Mehrotra

He introduced the concept of "zero to one" in evaluation, emphasizing the need for initial benchmarks that evolve to reflect real-world usage.
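
To make the "zero to one" idea concrete, here is a minimal sketch of what a first benchmark might look like: a handful of prompts, each paired with a programmatic check, run against whatever generation backend you are evaluating. The generate_code function is a placeholder, not a Sourcegraph or Cody API.

```python
from typing import Callable

def _passes(code: str, assertion: str) -> bool:
    """Run generated code in a scratch namespace and evaluate one assertion."""
    namespace: dict = {}
    try:
        exec(code, namespace)                  # execute the generated definition
        return bool(eval(assertion, namespace))
    except Exception:
        return False

# A tiny "zero to one" benchmark: each prompt is paired with a check that
# decides whether the generated code behaves as intended.
BENCHMARK: list[tuple[str, Callable[[str], bool]]] = [
    ("Write a function add(a, b) that returns their sum.",
     lambda code: _passes(code, "add(2, 3) == 5")),
    ("Write a function is_even(n) returning True for even numbers.",
     lambda code: _passes(code, "is_even(4) and not is_even(7)")),
]

def generate_code(prompt: str) -> str:
    # Placeholder for whatever model or tool is under evaluation.
    return "def add(a, b):\n    return a + b"

def run_benchmark() -> float:
    passed = sum(check(generate_code(prompt)) for prompt, check in BENCHMARK)
    return passed / len(BENCHMARK)

if __name__ == "__main__":
    print(f"pass rate: {run_benchmark():.0%}")
```

The point of a harness this small is simply to get from zero to one: once it exists, the cases can be replaced with examples drawn from how developers actually use the tool.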

Pass at One Metric

The "Pass at One" metric, commonly used in the industry, evaluates whether generated code functions correctly. However, Rishabh pointed out its limitations, as it doesn't always reflect real-world scenarios.

"Pass at One is a great start, but it's not how people are using coding tools in the real world." - Rishabh Mehrotra

Custom Commands and Open Context

Sourcegraph's Cody allows developers to create custom commands to tailor the tool to their specific needs. This flexibility is crucial for adapting to different enterprise environments.

"Cody provides custom commands, allowing you to create tailored solutions and share them with your team." - Rishabh Mehrotra

The introduction of Open Context further enhances Cody's capabilities by allowing the integration of various context sources, thereby improving the accuracy and relevance of AI-generated code.

"Open Context is a protocol designed to add new context sources, making Cody's solutions more accurate and relevant." - Rishabh Mehrotra

The Future of AI in Development

As AI tools become more integrated into development workflows, the focus will shift towards evaluation and unit testing. Developers will need to act as orchestrators, guiding AI systems to ensure they meet the desired outcomes.

"You are the domain expert. You get to define what success looks like and evaluate whether the models meet those standards." - Rishabh Mehrotra

Human-in-the-Loop

Despite the advancements in AI, human oversight remains crucial. Developers are best positioned to understand the nuances of their code and provide the necessary context for AI systems.

"The onus is on you to write good evaluations and ensure that the AI systems are meeting the required standards." - Rishabh Mehrotra

Summary

In this episode, we explored the complexities of AI testing in modern development. Key takeaways include the importance of evaluation, the role of custom commands and Open Context in improving AI accuracy, and the need for human oversight. As AI continues to evolve, developers will play a crucial role in guiding these systems to ensure they deliver high-quality, reliable code.

Key Takeaways

  • Big Code Problem: Enterprises face challenges managing massive code bases.
  • Evolution of AI: AI models have evolved rapidly, simplifying complex tasks.
  • Features of Cody: Cody offers various features, each with different latency and accuracy requirements.
  • Importance of Evaluation: Effective evaluation metrics are crucial for understanding AI performance.
  • Custom Commands and Open Context: These features enhance Cody's flexibility and accuracy.
  • Human-in-the-Loop: Despite AI advancements, human oversight remains essential.

As we move forward, embracing these AI tools and understanding their limitations will be key to unlocking their full potential in the development process.