honeybadge/harbor



SKILL.md

---
name: harbor
license: Apache-2.0
description: >-
  Framework for AI agent evaluation in containerized environments. Use when:
  (1) Running agent evaluations with `harbor run` against benchmarks (SWE-Bench,
  Terminal-Bench, Aider Polyglot, etc.), (2) Creating custom benchmark tasks with
  Dockerfile, instruction.md, solution, and tests, (3) Building adapters to convert
  existing benchmarks to Harbor format, (4) Implementing custom agents extending
  BaseAgent or BaseInstalledAgent, (5) Scaling evaluations to cloud providers
  (Daytona, Modal, E2B), (6) Exporting traces for RL/SFT training, (7) Debugging
  Harbor runs or inspecting package internals.
---

# Harbor Framework

Framework for evaluating and optimizing AI agents in containerized Docker environments. From the creators of Terminal-Bench.

## Core Concepts

| Concept | Description |
|---|---|
| Tasks | Benchmark challenges with `Dockerfile`, `instruction.md`, `solution/`, `tests/` |
| Adapters | Bridges translating existing benchmarks into Harbor format |
| Agents | Entities performing actions (claude-code, openhands, aider, swe-agent, etc.) |
| Environments | Execution backends: docker (local), daytona, modal, e2b, runloop, gke |
| Jobs | Collections of trials (agent × task × attempts) |
| ATIF | Agent Trajectory Interchange Format for standardized trace logging |
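
The Tasks row above implies a conventional directory layout. An illustrative sketch — only `Dockerfile`, `instruction.md`, `solution/`, `tests/`, and `task.toml` are named in this document; the comments are assumptions:

```
my-task/
├── Dockerfile        # image the agent runs inside
├── instruction.md    # the prompt/challenge given to the agent
├── task.toml         # per-task config (e.g. [agent] timeout_sec)
├── solution/         # reference solution
└── tests/            # checks that grade the agent's work
```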

## Quick Start

```bash
# Install
uv tool install harbor

# Run a benchmark (primary command)
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 -n 4

# List available datasets
harbor datasets list

# Run a local task or dataset
harbor run -p path/to/task -a claude-code -m anthropic/claude-opus-4-5-20251101
```

## Common Flags

| Flag | Description |
|---|---|
| `-d, --dataset` | Dataset `name@version` from registry |
| `-p, --path` | Local task or dataset path |
| `-a, --agent` | Agent name |
| `-m, --model` | Model (format: `provider/model`) |
| `-n, --n-concurrent` | Parallel trials (default: 1) |
| `-e, --env` | Environment: docker, daytona, modal, e2b |
| `-k, --n-attempts` | Attempts per task |
| `--ak` | Agent kwargs (`key=value`) |
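
Combining the flags above, a run against a local task with multiple attempts on a cloud backend might look like this sketch (the task path is illustrative, not from this document):

```bash
# 3 attempts per task, 8 trials in parallel, on the Modal backend
harbor run -p ./my-task -a claude-code -m anthropic/claude-sonnet-4-5-20250514 \
    -e modal -k 3 -n 8
```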

## Agents

`oracle` (default), `claude-code`, `openhands`, `aider`, `codex`, `swe-agent`, `mini-swe-agent`, `cursor-cli`, `gemini-cli`, `goose`, `opencode`, `qwen-coder`, `cline-cli`, `terminus-2`

## Cloud Scaling

```bash
# Scale to 100 parallel trials on Daytona
harbor run -d swebench-verified@1.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 \
    --env daytona -n 100
```

## CLI Discovery

```bash
harbor --help
harbor run --help
harbor datasets list
harbor jobs --help
harbor traces --help
harbor sweeps --help
```

## Troubleshooting Workflow

1. Run fails → Check logs in `jobs/<job-name>/<trial>/` for stdout/stderr
2. Docker build fails → Verify `Dockerfile` syntax, run `docker build` locally
3. Agent timeout → Increase with `--timeout-multiplier 2` or `task.toml` `[agent] timeout_sec`
4. Resume a failed job → `harbor jobs resume -p jobs/<job-dir>`
5. Debug mode → Re-run with `--debug` for verbose output
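
The config-file override in step 3 can be sketched as a `task.toml` fragment. A minimal sketch assuming standard TOML: only the `[agent]` table and the `timeout_sec` key are named in this document; the value is illustrative:

```toml
# task.toml — give the agent up to an hour per trial
[agent]
timeout_sec = 3600
```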

## Detailed Guides

| Topic | When to Read |
|---|---|
| Creating Tasks | Creating custom benchmark tasks |
| Creating Adapters | Integrating existing benchmarks |
| Custom Agents | Implementing agents, `--ak` kwargs, Claude Code specifics |
| Commands | Complete CLI reference with all flags |

## Inspecting Source

```bash
# Package location
uv run --with harbor python -c "import harbor; print(harbor.__file__)"

# Agent __init__ signature for --ak kwargs
uv run --with harbor python -c "import inspect; from harbor.agents.installed.claude_code import ClaudeCode; print(inspect.signature(ClaudeCode.__init__))"
```
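
The same inspection trick generalizes: the stdlib `pkgutil.iter_modules` can enumerate the modules under a package, which is one way to discover which installed-agent modules ship with Harbor. A hedged sketch — `harbor.agents.installed` is taken from the import above, and the stdlib `email` package stands in here so the snippet runs even without Harbor installed:

```python
import pkgutil

def list_modules(pkg):
    """Return the names of all modules directly under a package."""
    return sorted(m.name for m in pkgutil.iter_modules(pkg.__path__))

# Demonstrated on the stdlib `email` package; with Harbor installed you would
# `import harbor.agents.installed as pkg` and call list_modules(pkg) instead.
import email
print(list_modules(email))
```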

## Resources

- harborframework.com
- Documentation
- Registry
- Discord
- GitHub

## Install with Tessl CLI

```bash
npx tessl i honeybadge/harbor@0.1.0
```
