Framework for AI agent evaluation in containerized environments. Use when: (1) Running agent evaluations with `harbor run` against benchmarks (SWE-Bench, Terminal-Bench, Aider Polyglot, etc.), (2) Creating custom benchmark tasks with Dockerfile, instruction.md, solution, and tests, (3) Building adapters to convert existing benchmarks to Harbor format, (4) Implementing custom agents extending BaseAgent or BaseInstalledAgent, (5) Scaling evaluations to cloud providers (Daytona, Modal, E2B), (6) Exporting traces for RL/SFT training, (7) Debugging Harbor runs or inspecting package internals.
Framework for evaluating and optimizing AI agents in containerized Docker environments. From the creators of Terminal-Bench.
| Concept | Description |
|---|---|
| Tasks | Benchmark challenges with Dockerfile, instruction.md, solution/, tests/ |
| Adapters | Bridges translating existing benchmarks into Harbor format |
| Agents | Entities performing actions (claude-code, openhands, aider, swe-agent, etc.) |
| Environments | Execution backends: docker (local), daytona, modal, e2b, runloop, gke |
| Jobs | Collections of trials (agent × task × attempts) |
| ATIF | Agent Trajectory Interchange Format for standardized trace logging |
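The Tasks row above implies a fixed directory layout. A minimal skeleton can be scaffolded as below — "my-task" is a placeholder name, and including task.toml is an assumption based on its mention elsewhere in this doc for timeout settings:

```shell
# Scaffold a minimal Harbor task directory (file names per the Tasks row above)
mkdir -p my-task/solution my-task/tests
touch my-task/Dockerfile my-task/instruction.md my-task/task.toml
ls my-task
```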
```shell
# Install
uv tool install harbor

# Run a benchmark (primary command)
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 -n 4

# List available datasets
harbor datasets list

# Run a local task or dataset
harbor run -p path/to/task -a claude-code -m anthropic/claude-opus-4-5-20251101
```

| Flag | Description |
|---|---|
| -d, --dataset | Dataset name@version from registry |
| -p, --path | Local task or dataset path |
| -a, --agent | Agent name |
| -m, --model | Model (format: provider/model) |
| -n, --n-concurrent | Parallel trials (default: 1) |
| -e, --env | Environment: docker, daytona, modal, e2b |
| -k, --n-attempts | Attempts per task |
| --ak | Agent kwargs (key=value) |
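The -m value packs provider and model into one string. As a quick illustration of the provider/model format (plain POSIX parameter expansion, not a Harbor command):

```shell
# -m takes "provider/model"; the prefix before the first "/" is the provider
model="anthropic/claude-sonnet-4-5-20250514"
echo "${model%%/*}"   # provider -> anthropic
echo "${model#*/}"    # model id -> claude-sonnet-4-5-20250514
```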
Available agents: oracle (default), claude-code, openhands, aider, codex, swe-agent, mini-swe-agent, cursor-cli, gemini-cli, goose, opencode, qwen-coder, cline-cli, terminus-2
```shell
# Scale to 100 parallel trials on Daytona
harbor run -d swebench-verified@1.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 \
  --env daytona -n 100
```

```shell
harbor --help
harbor run --help
harbor datasets list
harbor jobs --help
harbor traces --help
harbor sweeps --help
```

Debugging tips:
- Inspect jobs/<job-name>/<trial>/ for stdout/stderr
- Rebuild the task image locally with docker build
- Raise timeouts with --timeout-multiplier 2 or the task.toml [agent] timeout_sec setting
- Resume an interrupted job with harbor jobs resume -p jobs/<job-dir>
- Add --debug for verbose output

| Topic | When to Read |
|---|---|
| Creating Tasks | Creating custom benchmark tasks |
| Creating Adapters | Integrating existing benchmarks |
| Custom Agents | Implementing agents, --ak kwargs, Claude Code specifics |
| Commands | Complete CLI reference with all flags |
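The per-task timeout override mentioned in the debugging tips lives in task.toml. A hedged sketch — only the [agent] table and timeout_sec key are referenced in this doc; the value and any other keys are illustrative assumptions:

```toml
# task.toml — per-task agent timeout (key as referenced in the debugging tips)
[agent]
timeout_sec = 1800  # illustrative value, not a documented default
```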
```shell
# Package location
uv run --with harbor python -c "import harbor; print(harbor.__file__)"

# Agent __init__ signature for --ak kwargs
uv run --with harbor python -c "import inspect; from harbor.agents.installed.claude_code import ClaudeCode; print(inspect.signature(ClaudeCode.__init__))"
```

Install with the Tessl CLI:

```shell
npx tessl i honeybadge/harbor@0.1.0
```