35% Higher Abstraction Adherence

Maria Gorinova
Member of Technical Staff, Tessl

How New Libraries Saw a 50% Improvement

with Maria Gorinova

Chapters

Trailer [00:00:00]
Introduction [00:01:08]
Challenges and Solutions in AI Agent Coding [00:04:03]
Understanding Documentation vs. Rules [00:21:10]
Exploring Abstraction Adherence [00:23:30]
Baseline and Scenario Testing [00:25:22]
Key Results and Future Directions [00:28:22]

In this episode

In this episode of AI Native Dev, host Simon Maple and guest Maria Gorinova from Tessl explore a groundbreaking research report on coding agents' ability to effectively use existing libraries and abstractions. They discuss the importance of evaluating agents not just on functional correctness but on their adeptness at leveraging real-world APIs, highlighting how developers can improve agent performance by providing curated context and integrating library usage into workflows. Tune in to learn how these insights can help teams create more efficient and maintainable code with AI assistance.

In this episode of AI Native Dev, host Simon Maple sits down with Maria Gorinova, Member of Technical Staff on the AI engineering team at Tessl, to unpack a new research report on how coding agents perform when they must use existing libraries and abstractions—both with and without extra context. Instead of asking agents to implement algorithms from scratch, Maria’s team evaluates whether agents can correctly and efficiently use real-world APIs. The discussion dives into why evals matter, how the benchmark was built, and what developers can do today to make agents more library-savvy in everyday workflows.

Abstractions, Not Reinvention: The Real Test for Coding Agents

Maria frames the core challenge: software is built on layers of abstractions, and productive engineers don’t reinvent low-level logic when solid libraries exist. Yet, many coding agents default to reimplementing functionality from scratch or use libraries clumsily, requiring human micromanagement. That’s inefficient, costly, and leaves teams with code that’s harder to maintain and reason about.

In modern development, agents must integrate into existing codebases and follow the patterns, dependencies, and APIs teams already trust. Using known libraries yields performance gains, cost savings, predictable behavior, and shared understanding across teammates. Since humans still collaborate closely with agents, the code produced needs to be idiomatic and familiar—not bespoke stovepipes that diverge from the project’s conventions or dependencies.

The report zeroes in on this gap by benchmarking an agent’s ability to use libraries properly. Rather than “solve the algorithm,” the tasks say, “use the library to solve the problem.” Think: “use Pydantic to define a validated data model,” not “implement your own validators.” That shift brings the benchmark closer to everyday engineering and exposes how agents handle real abstractions.
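
To make that shift concrete, here is a minimal sketch of what a "use the library" solution looks like for a Pydantic-style task; the `User` model and its constraints are illustrative, not taken from the benchmark itself.

```python
# Hedged sketch: a task like "use Pydantic to define a validated data model"
# rewards declarative, library-enforced constraints over hand-written checks.
from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    name: str = Field(min_length=1)   # library-enforced constraint
    age: int = Field(ge=0, le=150)    # no custom validator needed


try:
    User(name="", age=-1)             # both constraints violated
except ValidationError as exc:
    print(exc)                        # Pydantic reports every failed field
```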

Why Evals Trump Anecdotes in Agent Engineering

Simon and Maria underline a key reality of AI systems: they’re probabilistic. One great or terrible interaction with a model is just an anecdote. Without evaluating agents across many examples, teams risk making product decisions and workflow changes on misleading impressions. Good evals quantify both what an agent can do and how often it does it reliably.

Traditional coding benchmarks—like those focused on functional correctness—are useful but incomplete. They typically test whether the final behavior matches expected outputs (often via unit tests). Maria’s team wanted to capture a different dimension: whether the agent uses the right abstractions to achieve that behavior. A solution that passes tests but recreates a JSON parser by hand instead of using the project’s standard utility is still problematic in real projects.
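
As a hypothetical illustration of that failure mode, both functions below could satisfy a naive unit test on flat, string-only input, but only the first uses the standard abstraction; the second is the kind of hand-rolled reinvention a library-adherence rubric should flag.

```python
import json


def load_settings(text: str) -> dict:
    """Idiomatic: delegate parsing to the standard json module
    (or the project's own wrapper, if one exists)."""
    return json.loads(text)


def load_settings_reinvented(text: str) -> dict:
    """Passes a naive test on input like '{"a": "1"}', yet it ignores
    nesting, escaping, and non-string values: a solved problem, re-solved badly."""
    body = text.strip().strip("{}")
    pairs = (item.split(":", 1) for item in body.split(","))
    return {k.strip(' "'): v.strip(' "') for k, v in pairs}
```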

Evals also reveal how other levers—prompt design, context quality, and model choice—affect outcomes. A stronger model may help, but poorly structured or missing context can still derail library usage. By measuring performance statistically across many tasks, the team can isolate what truly moves the needle.

Building a Benchmark for Library Use

Tessl’s evaluation dataset is built from pairs of “question + evaluation criteria,” grounded in real open-source libraries. An agent first analyzes a library’s API surface and documentation, then generates coding questions that demand the correct use of that API. For example, in Python: “Use Pydantic to define a model with specific validation rules,” rather than “write validation logic from scratch.”
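
A benchmark entry might look roughly like the sketch below; the field names and criteria are illustrative, not Tessl's actual schema.

```python
# Hypothetical shape of one "question + evaluation criteria" pair.
benchmark_item = {
    "library": "pydantic",
    "question": (
        "Use Pydantic to define an Order model where quantity must be a "
        "positive integer and status must be one of a fixed set of values."
    ),
    "evaluation_criteria": [
        "Defines a class inheriting from pydantic.BaseModel",
        "Uses Field or library validators rather than manual if/raise checks",
        "Does not reimplement validation logic the library already provides",
    ],
}
```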

Each question comes with explicit evaluation criteria keyed to the library’s abstractions. The scoring rubric emphasises “API adherence”: Did the solution call the correct functions and options? Did it rely on the library’s intended patterns and types? Is the solution idiomatic to the library and aligned with its guarantees? This pushes beyond pass/fail outputs to assess whether the approach leverages the right abstraction.

To scale evaluation, the team uses an agent-as-judge to score the generated code against the criteria. This judge is separate from the solving agent and is guided by the rubric to focus on library use, not just end results. While agent-as-judge introduces its own considerations, it enables rapid iteration on large datasets. The result is a benchmark tailored to the real challenge developers face: getting agents to use libraries effectively.
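
A minimal sketch of such a judging step might look like the following; `call_model` is a placeholder for whichever LLM client a team uses, not a real API.

```python
# Agent-as-judge sketch: a separate model scores a candidate solution
# against the rubric, focusing on library use rather than end behaviour.
JUDGE_PROMPT = """You are reviewing code for library adherence, not style.
Question: {question}
Evaluation criteria:
{criteria}
Candidate solution:
{code}
For each criterion, answer PASS or FAIL with a one-line justification,
then give an overall adherence score from 0 to 1."""


def judge(question: str, criteria: list[str], code: str, call_model) -> str:
    prompt = JUDGE_PROMPT.format(
        question=question,
        criteria="\n".join(f"- {c}" for c in criteria),
        code=code,
    )
    return call_model(prompt)  # a different agent from the one that solved the task
```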

Context as a Multiplier: With vs. Without Support

A major part of the study probes how context impacts agent performance. Agents were tested with and without different forms of support—library docs, API signatures, examples, or repository-specific guidance. If you ask an agent to use Pydantic but don’t provide its docs or code references, you’re forcing the model to recall details from pretraining or guess. With well-targeted context, you dramatically narrow the search space and nudge the agent toward the library’s “happy path.”

For developers, this suggests a practical recipe. Package and retrieve the most relevant context: minimal API docs, type signatures, short examples, and project-local usage samples. Compact context beats a massive dump—curate snippets that clearly demonstrate canonical usage and constraints. When possible, add an explicit rubric to the prompt: “Prefer library calls over custom code; do not reimplement X; follow these patterns.” This aligns the agent with the evaluation criteria.
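
Purely as a sketch of that recipe, a curated context block can be as small as a helper that packs signatures and one or two canonical examples; the function and the snippets it packs are assumptions, not a prescribed format.

```python
# Illustrative only: assemble a compact context block for the agent
# instead of dumping the library's full documentation.
def build_context(signatures: list[str], examples: list[str]) -> str:
    parts = ["## Relevant API"]
    parts += signatures
    parts += ["", "## Canonical usage"]
    parts += examples
    return "\n".join(parts)


context = build_context(
    signatures=[
        "pydantic.BaseModel",
        "pydantic.Field(ge=..., le=..., min_length=...)",
    ],
    examples=[
        "class User(BaseModel):\n    age: int = Field(ge=0)",
    ],
)
```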

Beyond raw context, prompt structure matters. A system message that states “You must use the project’s existing libraries” and enumerates the preferred import paths sets expectations. You can also include “banned patterns” (e.g., “Do not write your own JSON schema validator”) and ask the agent to self-check for violations before finalizing an answer. These guardrails often pay bigger dividends than swapping models.
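
A hedged sketch of that kind of system message follows; the preferred imports (including the hypothetical `app.http` wrapper) and the banned patterns are examples, not a fixed template.

```python
# Guardrail-style system message: preferred imports, banned patterns,
# and an explicit self-check before the agent finalizes its answer.
SYSTEM_MESSAGE = """You must use the project's existing libraries.
Preferred imports:
- from app.http import client        # instead of calling requests directly
- from pydantic import BaseModel, Field
Banned patterns:
- Do not write your own JSON schema validator.
- Do not reimplement validation that Pydantic already provides.
Before finalizing, re-read your answer and confirm none of the banned
patterns appear; revise it if they do."""
```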

A Developer Playbook to Make Agents Library-Savvy

If you’re integrating coding agents into your team’s workflow, start by making library usage an explicit goal. Bake it into your prompts, your tooling, and your acceptance criteria. For day-to-day tasks, supply curated context: the library’s README snippet, key function signatures, one or two canonical examples, and any project-specific wrappers or utility functions the team expects. Keep the context fresh and easy to maintain.

Adopt a lightweight eval loop for your own codebase. Create a small suite of tasks that represent your common patterns—“create a DB migration using our ORM,” “add a Pydantic model for this payload,” “call our HTTP client wrapper instead of requests directly.” Pair each with a short rubric: which APIs must be used, which anti-patterns to avoid. Run your agent against this suite periodically and whenever you change models, prompts, or context retrieval strategies.
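
Such a suite can be as simple as a list of tasks with "must use" and "must avoid" entries, as in this illustrative sketch; the `agent` and `judge` callables, the library names, and the project wrapper are placeholders.

```python
# Hypothetical project-local eval suite and a tiny loop to report a pass rate.
EVAL_SUITE = [
    {
        "task": "Add a Pydantic model for the /orders payload.",
        "must_use": ["pydantic.BaseModel", "pydantic.Field"],
        "must_avoid": ["hand-written validation functions"],
    },
    {
        "task": "Fetch the user profile from the API.",
        "must_use": ["app.http.client"],      # project wrapper (illustrative)
        "must_avoid": ["import requests"],    # bypassing the wrapper
    },
]


def run_suite(agent, judge) -> float:
    """Return the fraction of tasks whose output satisfies its rubric."""
    passed = 0
    for item in EVAL_SUITE:
        code = agent(item["task"])
        if judge(code, item["must_use"], item["must_avoid"]):
            passed += 1
    return passed / len(EVAL_SUITE)
```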

Add verification layers. Simple AST or regex checks can catch telltale reinvention (e.g., hand-rolled parsing). Unit tests should verify behavior, and a rubric-based judge (human or agent) should verify library adherence. Consider sampling multiple candidates at a low temperature and selecting the one that passes tests and rubric checks. In prompts, explicitly instruct the model to self-critique: “Verify you used the specified API,” then revise before finalizing. These practices make agents more predictable and their outputs more maintainable.
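
For the static-check layer, a few lines of AST inspection go a long way. The sketch below flags direct `requests` imports when a project expects its own HTTP wrapper; the banned names are assumptions for illustration.

```python
import ast

# Flag imports the project has declared off-limits (illustrative list).
BANNED_IMPORTS = {"requests"}


def find_banned_imports(source: str) -> list[str]:
    """Return top-level module names from banned import statements."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names if a.name.split(".")[0] in BANNED_IMPORTS]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in BANNED_IMPORTS:
                hits.append(node.module)
    return hits


print(find_banned_imports("import requests\nfrom app.http import client"))
# ['requests']
```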

Looking Ahead: From Benchmarks to Better Agents

Maria hints that more research is coming—this is an evolving space. As benchmarks mature, we can expect more nuanced scoring (e.g., degrees of idiomatic use, performance-aware choices, and maintainability metrics) and wider coverage of languages, frameworks, and domains. Future iterations may assess multi-step planning: reading docs, proposing an approach, implementing with the chosen APIs, and self-checking against the rubric.

On the product side, better retrieval, richer tool use, and tighter library adapters will reduce the need for micromanagement. However, even as models improve, the underlying engineering truth remains: good software depends on good abstractions. Teams that articulate, expose, and test those abstractions—through docs, examples, and evals—will get the most from coding agents.

Ultimately, this report reframes the question from “Can an agent code?” to “Can an agent code like our team codes?” That’s the standard that matters in production.

Key Takeaways

  • Treat abstractions as first-class: instruct agents to use existing libraries and project utilities instead of reimplementing functionality.
  • Evals beat anecdotes: measure performance across many tasks and report rates (e.g., library-API adherence), not one-off wins.
  • Build library-use benchmarks: pair realistic tasks (“use Pydantic to model X”) with explicit rubrics that reward correct API usage.
  • Context is leverage: provide minimal, curated API docs, signatures, and examples; avoid dumping entire documentation sets.
  • Add guardrails: specify “preferred imports,” “banned patterns,” and a self-check step requiring the agent to confirm API adherence.
  • Verify on multiple axes: combine functional tests, rubric-based judging (human or agent), and simple static checks for reinvention.
  • Operationalise your evals: run them whenever you change models, prompts, or context retrieval, and track trends over time.
  • Optimise for maintainability: prioritise outputs that are idiomatic, predictable, and easy for teammates to review and extend.