Framework for AI agent evaluation in containerized environments. Use when: (1) Running agent evaluations with `harbor run` against benchmarks (SWE-Bench, Terminal-Bench, Aider Polyglot, etc.), (2) Creating custom benchmark tasks with Dockerfile, instruction.md, solution, and tests, (3) Building adapters to convert existing benchmarks to Harbor format, (4) Implementing custom agents extending BaseAgent or BaseInstalledAgent, (5) Scaling evaluations to cloud providers (Daytona, Modal, E2B), (6) Exporting traces for RL/SFT training, (7) Debugging Harbor runs or inspecting package internals.
Framework for evaluating and optimizing AI agents in containerized Docker environments. From the creators of Terminal-Bench.
| Concept | Description |
|---|---|
| Tasks | Benchmark challenges with Dockerfile, instruction.md, solution/, tests/ |
| Adapters | Bridges translating existing benchmarks into Harbor format |
| Agents | Entities performing actions (claude-code, openhands, aider, swe-agent, etc.) |
| Environments | Execution backends: docker (local), daytona, modal, e2b, runloop, gke |
| Jobs | Collections of trials (agent × task × attempts) |
| ATIF | Agent Trajectory Interchange Format for standardized trace logging |
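The Tasks row above implies a fixed directory layout. A minimal skeleton can be scaffolded as below — "my-task" is a placeholder name, and including task.toml is an assumption based on its mention elsewhere in this doc for timeout settings:

```shell
# Scaffold a minimal Harbor task directory (file names per the Tasks row above)
mkdir -p my-task/solution my-task/tests
touch my-task/Dockerfile my-task/instruction.md my-task/task.toml
ls my-task
```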
```shell
# Install
uv tool install harbor

# Run a benchmark (primary command)
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 -n 4

# List available datasets
harbor datasets list

# Run a local task or dataset
harbor run -p path/to/task -a claude-code -m anthropic/claude-opus-4-5-20251101
```

| Flag | Description |
|---|---|
| -d, --dataset | Dataset name@version from registry |
| -p, --path | Local task or dataset path |
| -a, --agent | Agent name |
| -m, --model | Model (format: provider/model) |
| -n, --n-concurrent | Parallel trials (default: 1) |
| -e, --env | Environment: docker, daytona, modal, e2b |
| -k, --n-attempts | Attempts per task |
| --ak | Agent kwargs (key=value) |
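The -m value packs provider and model into one string. As a quick illustration of the provider/model format (plain POSIX parameter expansion, not a Harbor command):

```shell
# -m takes "provider/model"; the prefix before the first "/" is the provider
model="anthropic/claude-sonnet-4-5-20250514"
echo "${model%%/*}"   # provider -> anthropic
echo "${model#*/}"    # model id -> claude-sonnet-4-5-20250514
```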
Available agents: oracle (default), claude-code, openhands, aider, codex, swe-agent, mini-swe-agent, cursor-cli, gemini-cli, goose, opencode, qwen-coder, cline-cli, terminus-2
```shell
# Scale to 100 parallel trials on Daytona
harbor run -d swebench-verified@1.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 \
  --env daytona -n 100
```

```shell
harbor --help
harbor run --help
harbor datasets list
harbor jobs --help
harbor traces --help
harbor sweeps --help
```

Debugging tips:
- Inspect jobs/<job-name>/<trial>/ for stdout/stderr
- Rebuild the task image locally with docker build
- Raise timeouts with --timeout-multiplier 2 or the task.toml [agent] timeout_sec setting
- Resume an interrupted job with harbor jobs resume -p jobs/<job-dir>
- Add --debug for verbose output

| Topic | When to Read |
|---|---|
| Creating Tasks | Creating custom benchmark tasks |
| Creating Adapters | Integrating existing benchmarks |
| Custom Agents | Implementing agents, --ak kwargs, Claude Code specifics |
| Commands | Complete CLI reference with all flags |
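The per-task timeout override mentioned in the debugging tips lives in task.toml. A hedged sketch — only the [agent] table and timeout_sec key are referenced in this doc; the value and any other keys are illustrative assumptions:

```toml
# task.toml — per-task agent timeout (key as referenced in the debugging tips)
[agent]
timeout_sec = 1800  # illustrative value, not a documented default
```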
```shell
# Package location
uv run --with harbor python -c "import harbor; print(harbor.__file__)"

# Agent __init__ signature for --ak kwargs
uv run --with harbor python -c "import inspect; from harbor.agents.installed.claude_code import ClaudeCode; print(inspect.signature(ClaudeCode.__init__))"
```

Install with the Tessl CLI:

```shell
npx tessl i honeybadge/harbor@0.1.0
```