How to Evaluate AI Agents: An Introduction to Harbor

10 Feb 2026 · 13 minute read

Rotem Tamir

Rotem Tamir is a Go developer, ex-CTO and co-founder of atlasgo.io, and a consultant at honeybadge labs.

The Shift

For decades, software engineering rested on a simple assumption: same input, same output. You write a function, you write a test, the test passes or fails. Deterministic. Reproducible. Done.

That assumption no longer holds.

When you configure an AI agent - tweak a system prompt, add an MCP server, write a skill - you're working with a system that's non-deterministic in two ways.

First, the models change under you. Like SaaS services that push updates without warning, model providers ship new versions constantly. The agent that solved your bug yesterday might not solve it today - not because your configuration changed, but because the model did.

Second, even on the same model version, inference is stochastic. Temperature, sampling, context window variations - run the same task twice, get different results. "It worked when I tried it" tells you almost nothing.

This is a fundamental change in how software behaves. And it requires a fundamental change in how we evaluate it.

Why Traditional Testing Fails

Consider how you’d normally verify a code change. You write tests. Tests pass. Ship it.

But what does “tests pass” mean when your system is stochastic? You run your agent on a task, it succeeds. Is your configuration good? Maybe. Or maybe you got lucky on that particular run. Maybe it fails 40% of the time and you happened to see a success.

Traditional testing assumes determinism. When outputs vary, you need something else: statistical evaluation. Not "did it pass?" but "what's the pass rate over N trials?" Not "does it work?" but "how reliably does it work, and under what conditions?"

If you've worked on a large codebase, this might sound familiar. Flaky tests - the ones you rerun three times before trusting the result - are already a form of dealing with non-determinism. The difference with agents is that everything is flaky. It's not an occasional nuisance. It's the default behavior.

You're no longer checking correctness. You're measuring reliability. Predictability.
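To make that shift concrete, here is a minimal sketch in Python of measuring a pass rate over N trials. Everything in it is hypothetical: run_agent_once is a stand-in for however you invoke your agent, and the 70% success rate is simulated.

import math
import random

def run_agent_once(task: str) -> bool:
    """Hypothetical stand-in for a single agent run; swap in a real harness call."""
    return random.random() < 0.7  # simulate an agent that succeeds ~70% of the time

def pass_rate(task: str, trials: int = 30) -> tuple[float, float]:
    """Estimate the pass rate over N trials, with a rough 95% margin of error."""
    successes = sum(run_agent_once(task) for _ in range(trials))
    rate = successes / trials
    margin = 1.96 * math.sqrt(rate * (1 - rate) / trials)  # normal approximation
    return rate, margin

rate, margin = pass_rate("hello-world")
print(f"pass rate ~ {rate:.0%} +/- {margin:.0%} over 30 trials")

The answer you get back is not "pass" or "fail" but something like "passes 73% of the time, give or take 16 points over 30 trials" - which is the kind of statement you can actually act on.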

Benchmarks: Not New, Just Inaccessible

Measuring the performance of complex systems through standardized benchmarks isn't something we need to invent. AI researchers have been doing it for years. ImageNet gave us modern computer vision - a standardized dataset with clear metrics that let thousands of researchers iterate toward better models. SWE-Bench did the same for coding agents - real GitHub issues, verifiable fixes, comparable scores.

The pattern works. Define a task. Specify success criteria. Run controlled trials. Measure. Iterate.

Most teams building with agents fall back on intuition. Run it a few times, seems to work, ship it. Not because they don't know better - but because rigorous evaluation has been researcher territory. You needed infrastructure. Expertise in evaluation methodology. Time to build custom harnesses. The approach existed; the access didn't.

Benchmarks as an Engine of Progress

If you've ever run a test suite before merging a PR, you already understand benchmarks. Same idea, different context. In traditional software, you call them tests. In ML research, they're called benchmarks. In the agent world, people say "evals." The mechanics are identical: define expected behavior, run the system, check results.

The difference is scope. A unit test checks one function. An eval checks an agent's ability to complete a task end-to-end — like asking a junior developer to fix a bug and verifying they actually fixed it.

Consider how progress happens in this space. Frontier model developers (Anthropic, OpenAI, Google) compete on benchmark leaderboards. Agent harness developers (Cursor, Windsurf, Claude Code) compete on the same leaderboards. Each new benchmark sets a bar; competition drives everyone to clear it; the bar rises. SWE-Bench started at single-digit pass rates. Now leading agents clear 70%+. That improvement came from having an agreed-upon challenge that everyone could target.

Benchmarks serve multiple purposes in this ecosystem:

Evaluation gates - Same as running CI before merging. Run your agent against a benchmark, get a score, decide if you're ready to ship. Your team can build internal benchmarks around your specific use cases, or use industry standards — the way you'd use both custom tests and a linter.

Regression detection - You refactored your prompt the way you'd refactor a module. Did anything break? Benchmarks tell you whether you're moving forward or backward. Not "does it feel better?" but "did the pass rate go up or down?"

Training signal - This is where it gets interesting. A technique called RLVR (Reinforcement Learning with Verifiable Rewards) uses benchmark-style task runs to improve models themselves. Think of it like this: run agents against hard problems, identify the sessions that succeeded, feed those successful trajectories back into model training. The benchmark becomes training data. (See DeepSeek-R1 for more on this approach.)
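A rough sketch of that loop, with every name in it (agent.attempt, task.verify, trainer.update) hypothetical, just to make the flow concrete:

def rlvr_round(tasks, agent, trainer):
    """One RLVR-style round: keep only the trajectories a verifier scores as successful."""
    accepted = []
    for task in tasks:
        trajectory = agent.attempt(task)           # full session: prompts, tool calls, edits
        reward = task.verify(trajectory)           # verifiable check, e.g. tests pass -> 1.0
        if reward == 1.0:
            accepted.append((trajectory, reward))  # verified wins become training data
    trainer.update(accepted)                       # fine-tune / RL update on those trajectories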

The Standardization Problem

Here's the issue: benchmarks exist, but there's no standard. Every group developing evals works in isolation - their own format for defining tasks, specifying environments, and running agents. Sometimes by accident, sometimes by design. Evals can be competitive advantage, and teams guard them.

Say you want to run your agent against existing benchmarks. You'll need to figure out each benchmark's setup - different task formats, different container configurations, different ways of reporting results. Want to evaluate multiple agents on your internal benchmarks? You'll need to build custom infrastructure, or run half-assed scripts that don't scale.

This lack of standardization hurts everyone. Benchmark authors can't share their work easily. Agent developers can't compare across benchmarks consistently. Teams building internal evals are reinventing the same infrastructure over and over.

Harbor: A Standard for Agent Evals

Harbor is a framework for evaluating agents in containerized environments. The project grew out of Terminal-Bench, a collaborative benchmark by Laude Institute and Stanford researchers that went from idea to industry standard in 126 days.

Harbor emerged when the team noticed people weren’t just running Terminal-Bench as a benchmark - they were using it as CI/CD for agents, for RL with synthetic tasks, for prompt optimization. So they rebuilt the harness from scratch for reliability, observability, and scale.

Why does this need to be a standard, and why open-source? Think about who needs eval data to flow between them: agent developers, model trainers, data vendors, enterprises. Standards let different parts of an ecosystem streamline workloads - like containers did for apps. When the standard is open, it fosters collaboration even between competitors - like CNCF did for Kubernetes. That’s why Harbor is open source.

As Alex Shaw, Harbor co-lead, puts it: “The more data can flow, the quicker the industry can accelerate.” And there’s a bonus: the more people create benchmarks, the harder it becomes for anyone to game them. “If you make it easier to create and consume evals, there will be more evals coming out faster. Then you don’t run the risk of overfitting.”

Your First Benchmark

Enough theory. Let’s build something.

We're going to create a simple benchmark task: ask an agent to write a Python script, then verify it produced the correct output. Along the way you'll see how Harbor tasks are structured, how success is measured, and how to run the same task against different agents.

Prerequisites

  • Python 3.12+
  • Docker
  • uv (a fast Python package manager - install here)

The Anatomy of a Task

A Harbor task is a directory with this structure:

hello-world/
├── instruction.md      # What the agent sees
├── tests/test.sh       # How success is measured
├── task.toml           # Timeouts and resource limits
├── environment/        # Dockerfile for the container
└── solution/           # Optional oracle for sanity-checking

Two pieces matter: the instruction (what you’re asking) and the test (how you verify success). Everything else is scaffolding.

A Minimal Example

instruction.md - the prompt the agent receives:

# Task: Hello World

Create a Python script at `/workspace/hello.py` that prints exactly:
  
  Hello, World!
  
The script must output exactly `Hello, World!` when run - no extra whitespace.

tests/test.sh - writes a reward (0 or 1) based on success:

#!/bin/bash
set -e
cd /workspace

if [ ! -f "hello.py" ]; then
    echo "0" > /logs/verifier/reward.txt
    exit 0
fi

OUTPUT=$(python hello.py 2>&1 || true)  # don't let set -e abort before a reward is written
if [ "$OUTPUT" = "Hello, World!" ]; then
    echo "1" > /logs/verifier/reward.txt
else
    echo "0" > /logs/verifier/reward.txt
fi

The test doesn’t just check pass/fail - it writes a numerical reward. This matters when you’re running 30 trials and computing statistics. The remaining files set timeouts, prepare the container, and provide a known-good solution for validation. See the complete hello-world task for reference.
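For this task, the known-good solution is as small as solutions get - a script that prints the exact string the test checks for (how solution/ packages and applies it follows Harbor's conventions; see the linked task for the actual layout):

# hello.py - reference solution; must match the test's expected output exactly
print("Hello, World!")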

Running It

$ git clone https://github.com/rotemtam-tessl/hello-harbor.git
$ cd hello-harbor

First, verify your task works by running the oracle (a script that executes your solution):

$ uv run harbor run --agent oracle --path ./tasks/hello-world
# Mean: 1.000 ✓

A mean of 1.0 confirms the task is solvable and the test correctly validates success. Now run it against a real agent:

$ export ANTHROPIC_API_KEY="your-key"
$ uv run harbor run --agent claude-code --path ./tasks/hello-world
# Mean: 1.000 ✓

That’s it. You’ve defined a task, specified success criteria, and measured an agent’s performance. The mechanics are simple. The power comes from what you do with them.

From Anecdote to Evidence

A single run tells you if your setup works. But to get statistical significance, you need to run the same task many times across different configurations. That’s where infrastructure matters.

Harbor runs anywhere Docker runs - your laptop for quick iterations, or cloud sandboxes like Daytona and E2B for parallel execution at scale. Want to compare two prompt variants across 50 trials each? Running sequentially on your laptop takes hours. With cloud sandboxes, you can spin up 100 containers in parallel and get results in minutes.

This flexibility is part of the value: you’re not locked into one execution model. Iterate locally when you’re exploring, scale out when you need statistical confidence.
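For example, once you have per-trial 0/1 rewards for two prompt variants (however you export them from your runs), choosing between them comes down to comparing two pass rates. A minimal sketch, with illustrative numbers:

import math

def compare(rewards_a: list[int], rewards_b: list[int]) -> None:
    """Compare two variants' pass rates with a rough two-proportion z-score."""
    n_a, n_b = len(rewards_a), len(rewards_b)
    p_a, p_b = sum(rewards_a) / n_a, sum(rewards_b) / n_b
    pooled = (sum(rewards_a) + sum(rewards_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se else 0.0
    print(f"A: {p_a:.0%}  B: {p_b:.0%}  z = {z:.2f}")
    # |z| above roughly 2 suggests the gap is unlikely to be noise at these sample sizes.

compare(rewards_a=[1] * 41 + [0] * 9, rewards_b=[1] * 33 + [0] * 17)  # 50 trials each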

To see what real benchmark runs look like at scale - with variance, pass rates across agents, and detailed breakdowns - check out the Terminal-Bench 2 results:

tbench.ai

Where to Go From Here

This was the “hello world” - proof that the machinery works. The real value comes when you apply it to questions you actually care about:

  • Does adding this MCP server improve reliability or just add latency?
  • Is my custom system prompt helping or hurting on edge cases?
  • Which of these two skill implementations performs better?

These are empirical questions. You can form hypotheses. Run controlled experiments. Get data instead of guesses.

In Part 2, we’ll do exactly that: take a real question about Claude Code’s skill system and design an experiment to answer it. We’ll see how the data surprised us, how we iterated based on evidence, and what we learned about the scientific method applied to agent configuration.

The code for everything in this post is at github.com/rotemtam-tessl/hello-harbor.
