Fixing API Misuse: How Tessl Improves Agent Accuracy by up to 3.3X

22 Jan 2026 · 6 minute read

Macey Baker

Macey Baker is Tessl’s Founding Community Engineer, helping shape the future of AI-native development. Formerly at Intercom and London ML startups, she’s now an AI code generation obsessive.

Developers who use coding agents are all too familiar with their failure modes. Whether it’s poor adherence to instructions, or spiralling from a single test failure, agents getting things not quite right can just feel like the cost of doing business.

Luckily, these behaviours can be measured, and if they can be measured, they can be improved. Anthropic is right: evals are foundational to measuring agent behaviour. At Tessl, we’ve been evaluating coding agents on a specific problem that comes up often in real day-to-day work: correct use of public library APIs.

Across 300+ open-source libraries, Tessl tiles (our structured, versioned context explaining a library’s APIs and idioms) improve correct API usage by up to 3.3X, and by 1.17X on average. The evals and their results are now published in the Tessl Registry.

What’s in an eval?

Agent evals are distinct from standard model evals. Unlike models, agents are highly empowered LLM loops. They operate across many turns, use tools, react to their own state, and decide when and how to modify it.

When evaluating a model’s performance, you are typically grading a single response to a prompt. For an agent, the evaluable output is much more complex: it covers not only the ultimate outcome of the task, but also the path the agent took to get there. Furthermore, evals must be run multiple times to account for stochasticity (buzzword alert!), which amps up the complexity even further.
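To make that concrete, here is a minimal sketch of how repeated runs might be aggregated. It is illustrative only, not Tessl’s harness, and `runScenario` is a hypothetical stand-in for an eval harness that returns a normalised score:

```typescript
// Illustrative only: one common way to handle agent stochasticity is to repeat
// each scenario several times and report the mean score alongside its spread.
// `runScenario` is a hypothetical stand-in for an eval harness.
async function runScenario(scenarioId: string): Promise<number> {
  // ...run the agent in an isolated environment and return a 0..1 score
  return Math.random(); // placeholder
}

async function evaluateScenario(
  scenarioId: string,
  repeats = 5
): Promise<{ mean: number; stdDev: number }> {
  const scores: number[] = [];
  for (let i = 0; i < repeats; i++) {
    scores.push(await runScenario(scenarioId));
  }
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
  return { mean, stdDev: Math.sqrt(variance) };
}
```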

What’s useful to measure?

Because the surface area of agent output is so broad, there are many metrics to choose from. Most existing benchmarks focus on the correctness of the final solution. However, "correctness" often misses a common failure mode: API misuse.

Perhaps the LLM’s training data predates the library, or only covers an older version. Maybe the library is niche, or it’s well known but being used in a specific edge case not represented in training data. When this happens, agents tend to guess. This may result in code that compiles, but the implementation is brittle and hard to maintain.

At Tessl, we measure this directly as Abstraction Adherence: whether an agent uses the library’s intended public interfaces and patterns instead of inventing its own.
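As a concrete illustration (a toy example of ours, not one of the published scenarios), consider an agent asked to convert units with mathjs, one of the npm packages covered below. Both versions compile; only the second uses the library’s intended abstraction:

```typescript
import { unit } from "mathjs";

// Low adherence: the agent guesses and hand-rolls conversion factors.
// It works for the happy path, but silently breaks for any unit it doesn't know.
const CM_PER: Record<string, number> = { inch: 2.54, foot: 30.48 };
function toCentimetersGuessed(value: number, from: string): number {
  return value * CM_PER[from]; // NaN for unknown units
}

// High adherence: the agent uses mathjs's public `unit` API as intended.
function toCentimeters(value: number, from: string): number {
  return unit(value, from).toNumber("cm");
}

console.log(toCentimetersGuessed(3, "inch")); // 7.62
console.log(toCentimeters(3, "inch")); // 7.62
```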

How well do agents listen? Measuring abstraction adherence

To measure abstraction adherence, we ran various eval scenarios on 300+ open-source packages across npm and PyPI.

Each scenario requires correct use of the public API with clear success criteria. Tasks run in isolated containers with a dedicated solve environment and a separate grading environment. Our grading system uses task-specific rubrics and points-based scoring across correctness, public API usage, idiomatic patterns, and configuration hygiene.
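To give a feel for points-based rubric grading, here is a purely illustrative sketch; the field names and weights are ours, not Tessl’s actual schema. A scenario’s score is a sum of awarded points normalised against the rubric’s maximum:

```typescript
// Purely illustrative sketch of points-based rubric grading; the shape and
// weights here are invented for this example, not Tessl's actual schema.
interface RubricItem {
  criterion: string; // what the grader is checking
  points: number;    // maximum points for this criterion
  awarded: number;   // points the grader actually gave
}

function scoreScenario(rubric: RubricItem[]): number {
  const max = rubric.reduce((sum, item) => sum + item.points, 0);
  const got = rubric.reduce((sum, item) => sum + item.awarded, 0);
  return max === 0 ? 0 : got / max; // normalised 0..1 score per scenario
}

// Example: a scenario graded on correctness, public API usage, idioms, and config hygiene.
const example: RubricItem[] = [
  { criterion: "solution passes the scenario's functional checks", points: 4, awarded: 4 },
  { criterion: "uses the library's public API instead of reimplementing it", points: 3, awarded: 2 },
  { criterion: "follows the library's documented idioms", points: 2, awarded: 2 },
  { criterion: "configuration is minimal and correct", points: 1, awarded: 1 },
];
console.log(scoreScenario(example)); // 0.9
```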

The Results

We found that tiles improve abstraction adherence significantly:

  • Up to 3.3X improvement.
  • 1.2X improvement overall.

The reason for this is simple: when an agent doesn’t know what it’s doing, Tessl tiles provide a tailored "how-to" guide. Instead of burning multiple turns guessing, or fetching hundreds of tokens’ worth of raw documentation, the agent accesses neatly organised information at the exact moment it’s needed.

The Breakdown

For packages that are unseen or poorly represented in an agent’s training data, where baseline scores are low, Tessl provides agents with the context they need to succeed. And for packages where agents already perform well, Tessl still provides a boost.

We’ve quantified the improvement in abstraction adherence across different bands of baseline agent success.

Agent Performance Band (baseline success score)    Average Tessl Improvement (per Tile)
All Tiles                                          1.2x
0-60%                                              1.4x
60-80%                                             1.2x
80-100%                                            1x

See for yourself!

We’ve published these evals across ~300 public tiles in the Registry, covering packages such as asyncstdlib from PyPI and mathjs from npm. We’re sharing them so teams can track the impact of changes over time, including model upgrades, version updates, and updates to the tiles themselves. Plus, we’ll be adding more public evals over time.

The Tessl Registry is free to explore and use. Get started with our quickstart here.

Want to benchmark your own stack?

These evals aren't just for open source libraries. When you create private tiles, Tessl will generate and run evals over them for you. Define your own scenarios, integrate your internal libraries, and see exactly how well your agents perform within your personalised context.
