Framework for AI agent evaluation in containerized environments. Use when: (1) Running agent evaluations with `harbor run` against benchmarks (SWE-Bench, Terminal-Bench, Aider Polyglot, etc.), (2) Creating custom benchmark tasks with Dockerfile, instruction.md, solution, and tests, (3) Building adapters to convert existing benchmarks to Harbor format, (4) Implementing custom agents extending BaseAgent or BaseInstalledAgent, (5) Scaling evaluations to cloud providers (Daytona, Modal, E2B), (6) Exporting traces for RL/SFT training, (7) Debugging Harbor runs or inspecting package internals.
## `harbor run`

Alias for `harbor jobs start`. The main way to run evaluations.

```shell
harbor run -d <dataset@version> -a <agent> -m <model>
harbor run -p <path/to/task-or-dataset> -a <agent> -m <model>
```

**Dataset selection**

| Flag | Description |
|---|---|
| `-d, --dataset` | Dataset name@version from registry |
| `-p, --path` | Local task or dataset path |
| `-t, --task-name` | Include specific tasks (glob patterns) |
| `-x, --exclude-task-name` | Exclude tasks (glob patterns) |
| `-l, --n-tasks` | Limit number of tasks |
| `--registry-url` | Custom registry URL |
| `--registry-path` | Local `registry.json` path |
**Agent**

| Flag | Description |
|---|---|
| `-a, --agent` | Agent name (default: `oracle`) |
| `-m, --model` | Model (format: `provider/model`) |
| `--agent-import-path` | Custom agent import path |
| `--ak, --agent-kwarg` | Agent kwargs (`key=value`, repeatable) |
**Environment**

| Flag | Description |
|---|---|
| `-e, --env` | Environment: `docker`, `daytona`, `modal`, `e2b`, `runloop`, `gke` |
| `--force-build/--no-force-build` | Force rebuild environment |
| `--delete/--no-delete` | Delete environment after run |
| `--override-cpus` | Override CPU count |
| `--override-memory-mb` | Override memory (MB) |
| `--override-storage-mb` | Override storage (MB) |
| `--override-gpus` | Override GPU count |
| `--ek, --environment-kwarg` | Environment kwargs |
**Execution**

| Flag | Description |
|---|---|
| `-n, --n-concurrent` | Parallel trials (default: 1) |
| `-k, --n-attempts` | Attempts per task (default: 1) |
| `-o, --jobs-dir` | Output directory (default: `jobs/`) |
| `--job-name` | Custom job name |
| `--timeout-multiplier` | Multiply task timeouts |
| `-q, --quiet` | Suppress progress display |
| `--debug` | Enable debug logging |
| `-c, --config` | Job config file (YAML/JSON) |
**Retries**

| Flag | Description |
|---|---|
| `-r, --max-retries` | Max retry attempts |
| `--retry-include` | Exception types to retry |
| `--retry-exclude` | Exception types to skip |
**Trace export**

| Flag | Description |
|---|---|
| `--export-traces` | Export traces after the job |
| `--export-sharegpt` | Include ShareGPT format |
| `--export-episodes` | Episodes to export: `all` or `last` |
| `--export-push` | Push to HuggingFace Hub |
| `--export-repo` | HF repo id (`org/name`) |
## `harbor datasets`

```shell
harbor datasets list   # List registry datasets
```

## `harbor jobs`

```shell
harbor jobs start ...             # Same as harbor run
harbor jobs resume -p <job-dir>   # Resume failed job
harbor jobs summarize <job-dir>   # AI-summarize failures
```

## `harbor traces`

```shell
# Export traces from job directory
harbor traces export -p <path> --recursive

# Filter by result
harbor traces export -p <path> --filter success
harbor traces export -p <path> --filter failure

# Push to HuggingFace
harbor traces export -p <path> --push --repo org/name
```

## `harbor sweeps`

Run successive sweeps, dropping tasks with successes:

```shell
harbor sweeps run -c config.yaml --max-sweeps 3 --trials-per-task 2
harbor sweeps run -c config.yaml --push --export-repo org/name
```

## `harbor adapters`

```shell
harbor adapters init <name>       # Create new adapter
harbor adapters validate <path>   # Validate adapter
```

## Other commands

```shell
harbor cache clear   # Clear Harbor cache
harbor view          # Start web UI for browsing results
harbor --version     # Show version
```

## Job config file

```yaml
# config.yaml
job_name: my-evaluation
jobs_dir: ./jobs
n_attempts: 2
orchestrator:
  n_concurrent_trials: 8
agents:
  - name: claude-code
    model_name: anthropic/claude-sonnet-4-5-20250514
    kwargs:
      max_thinking_tokens: 16000
environment:
  type: daytona
  force_build: false
  delete: true
datasets:
  - name: terminal-bench
    version: "2.0"
    n_tasks: 10
```

## Examples

```shell
# Run Terminal-Bench with Claude Opus 4.5
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4-5-20251101 -n 4

# Run SWE-Bench on Daytona cloud with Sonnet 4.5
harbor run -d swebench-verified@1.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 \
  --env daytona -n 100

# Run local task with OpenAI o1
harbor run -p ./my-task -a openhands -m openai/o1 --debug

# Run with GPT-4o
harbor run -p ./my-task -a aider -m openai/gpt-4o

# Custom agent with kwargs
harbor run -p datasets/task \
  --agent-import-path my_agents:MyAgent \
  --ak custom_param=value \
  -m anthropic/claude-sonnet-4-5-20250514

# Export traces to HuggingFace
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4-5-20251101 \
  --export-traces --export-push --export-repo myorg/traces

# Resume a failed job
harbor jobs resume -p jobs/my-job-2024-01-15
```

## Environment variables

| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `DAYTONA_API_KEY` | Daytona cloud API key |
| `E2B_API_KEY` | E2B sandbox API key |
| `MODAL_TOKEN_ID` | Modal cloud credentials (token ID) |
| `MODAL_TOKEN_SECRET` | Modal cloud credentials (token secret) |
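Set the key for your model provider, plus the key for your cloud environment if you run with `-e daytona`, `modal`, etc. A minimal sketch (the values below are placeholders, not real credentials):

```shell
# Placeholder credentials for illustration only — substitute your own keys.
export ANTHROPIC_API_KEY="sk-ant-placeholder"
export DAYTONA_API_KEY="daytona-placeholder"
```

Harbor and the agents it launches read these from the environment, so export them in the shell (or CI job) that invokes `harbor run`.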
## Installation

Install with the Tessl CLI:

```shell
npx tessl i honeybadge/harbor@0.1.0
```