If agents use your tool, you need evals

21 Jan 2026 · 6 minute read

Macey Baker

Can agents understand your favourite stack?

Coding agents have learned how to write code by training on the internet. That means they’re very good at common patterns, popular frameworks, and well-trodden stacks. It also means they’re often fuzzy on the details that matter most: recent changes, edge cases, and the way a library is actually meant to be used.

For developers, that creates a new question. If an agent is writing code on your behalf, how well does it really understand the tools it’s using? And as a library or framework maintainer, the question cuts even deeper: can agents actually use your software?

That’s the context in which Vercel recently published a small evaluation suite for Next.js. The suite outlines around 50 scenarios with accompanying scoring criteria. Agents are tested on their ability to accomplish the goal described in the scenario, and on the correctness of the Next.js code they produce along the way.
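
To make that concrete, here’s a rough sketch of what a scenario and its scoring criteria might look like. This is a hypothetical shape invented for illustration, not Vercel’s actual schema; the field names, prompt, and criteria are made up, though the Next.js conventions they reference (the App Router, Server Components, notFound()) are real.

```ts
// Hypothetical shape for an agent eval scenario — not Vercel's actual schema.
interface EvalScenario {
  id: string;
  // The task handed to the agent, phrased the way a developer would ask it.
  prompt: string;
  // What a correct solution must do, checked by a human or an LLM judge.
  criteria: string[];
}

const scenario: EvalScenario = {
  id: "app-router-dynamic-route",
  prompt:
    "Add a blog post page at /blog/[slug] that fetches the post on the server " +
    "and returns a 404 for unknown slugs.",
  criteria: [
    "Uses the App Router (app/blog/[slug]/page.tsx), not the Pages Router",
    "Fetches data in a Server Component rather than with client-side useEffect",
    "Calls notFound() for unknown slugs instead of rendering an empty page",
  ],
};
```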

Publishing a suite like that makes Vercel a bit of a trendsetter. Or at least, I hope it does.

Agent eval suites for libraries and frameworks, even widely used ones, aren’t very common yet. When they do exist, they’re often written by third parties, independent developers, or sometimes by agent providers themselves. Rarely are they written, maintained, and endorsed by the people who actually build the library or framework.

Why this matters for developers and library maintainers

So official evaluation suites for popular tools are hard to find. But we can still use those tools, and we still have perfectly up-to-date documentation for them (right? right…?). Do evals actually matter? Betteridge’s law says no. I disagree.

I am absolutely sick of writing the following sentence, so I’m going to make it an image instead:

[Image: “A New Age”]

Agents are writing a huge amount of our code. This has been true for a while now.

If you’ve written open-source software, or you maintain a framework that others build on, that means agents already represent a significant portion of your users - and that number is only going up.

For developers, this is mostly good news. The agent takes care of the details, while you can focus on the bigger picture. But for maintainers, it’s more complicated.

Agents are backed by LLMs, and unless your software hasn’t changed since <insert model cutoff date here>, the agent is almost certainly working with incomplete or outdated information. In many cases, it may not have been trained on your library at all. So when an agent tries to accomplish something using your software, it’s likely to fail the first time. Often the second time too.

To the human behind the agent, that failure just looks like friction: things “don’t work.” The agent struggles, the developer blames the tool. Code is ripped out and replaced with something familiar, whether or not it’s actually a better fit. Over time, agents and developers alike learn to prefer the path of least resistance, and libraries that are harder to use, or simply harder to understand, quietly get avoided.

Evals are the new unit tests

This is where evals start to matter.

As a maintainer, you’re in the best position to define what “using this library well” actually means. You know the sharp edges, the intended workflows, the usual failure modes. Eval scenarios let you encode that knowledge directly, and then observe how agents behave when they try to apply it.
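
That observation step doesn’t need heavy infrastructure. Below is a minimal sketch of the loop, reusing the EvalScenario shape from the earlier sketch; runAgent and judge are hypothetical stand-ins for whatever agent call and grading step you actually use (human review, an LLM judge, or an automated test harness).

```ts
// Minimal sketch of an eval loop: run each scenario through an agent, then
// score what it produced against the maintainer-written criteria.
// runAgent and judge are hypothetical stand-ins, not a real library's API.

interface ScenarioResult {
  id: string;
  passed: boolean;
  notes: string;
}

declare function runAgent(prompt: string): Promise<string>;
declare function judge(
  output: string,
  criteria: string[],
): Promise<{ passed: boolean; notes: string }>;

// EvalScenario is the scenario shape sketched earlier in this post.
async function runSuite(scenarios: EvalScenario[]): Promise<ScenarioResult[]> {
  const results: ScenarioResult[] = [];
  for (const scenario of scenarios) {
    const output = await runAgent(scenario.prompt);         // agent attempts the task
    const verdict = await judge(output, scenario.criteria); // grade against the rubric
    results.push({ id: scenario.id, ...verdict });
  }
  return results;
}
```

Even a dozen scenarios run this way will surface the sharp edges long before a frustrated developer does.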

That feedback loop is useful in both directions. It helps you see where agents struggle, and it gives humans clearer signals about what the library is capable of and how it’s meant to be used.

Over time, I suspect eval scenarios may even replace unit tests as the first thing newcomers look at. As we abstract further away from low-level code, the question becomes less about how software works, and more about how we can build on top of it.

From vibes to engineering practices

In 2026, I expect we’ll see more maintainers publishing eval results, and eventually full eval suites, alongside their releases. Even without a comprehensive suite, a small set of well-chosen scenarios and accompanying scoring rubrics can go a long way. They give agents something to anchor on, and they give humans a faster way to understand whether a tool is likely to fit their needs.

If agents are going to be regular collaborators in writing software, then evals aren’t a nice-to-have. They’re part of how you communicate what your software is and why it’s useful. We’ve built up myriad ways to answer the question “how does it work?”; evals answer the question that now matters most: “so... what can I do with it?”
