Framework for AI agent evaluation in containerized environments. Use when: (1) Running agent evaluations with `harbor run` against benchmarks (SWE-Bench, Terminal-Bench, Aider Polyglot, etc.), (2) Creating custom benchmark tasks with Dockerfile, instruction.md, solution, and tests, (3) Building adapters to convert existing benchmarks to Harbor format, (4) Implementing custom agents extending BaseAgent or BaseInstalledAgent, (5) Scaling evaluations to cloud providers (Daytona, Modal, E2B), (6) Exporting traces for RL/SFT training, (7) Debugging Harbor runs or inspecting package internals.
## `harbor run`

Alias for `harbor jobs start`. The main way to run evaluations.

```shell
harbor run -d <dataset@version> -a <agent> -m <model>
harbor run -p <path/to/task-or-dataset> -a <agent> -m <model>
```

**Dataset selection**

| Flag | Description |
|---|---|
| `-d, --dataset` | Dataset name@version from registry |
| `-p, --path` | Local task or dataset path |
| `-t, --task-name` | Include specific tasks (glob patterns) |
| `-x, --exclude-task-name` | Exclude tasks (glob patterns) |
| `-l, --n-tasks` | Limit number of tasks |
| `--registry-url` | Custom registry URL |
| `--registry-path` | Local `registry.json` path |
**Agent**

| Flag | Description |
|---|---|
| `-a, --agent` | Agent name (default: `oracle`) |
| `-m, --model` | Model (format: `provider/model`) |
| `--agent-import-path` | Custom agent import path |
| `--ak, --agent-kwarg` | Agent kwargs (`key=value`, repeatable) |
**Environment**

| Flag | Description |
|---|---|
| `-e, --env` | Environment: `docker`, `daytona`, `modal`, `e2b`, `runloop`, `gke` |
| `--force-build/--no-force-build` | Force rebuild environment |
| `--delete/--no-delete` | Delete environment after run |
| `--override-cpus` | Override CPU count |
| `--override-memory-mb` | Override memory (MB) |
| `--override-storage-mb` | Override storage (MB) |
| `--override-gpus` | Override GPU count |
| `--ek, --environment-kwarg` | Environment kwargs |
**Execution**

| Flag | Description |
|---|---|
| `-n, --n-concurrent` | Parallel trials (default: 1) |
| `-k, --n-attempts` | Attempts per task (default: 1) |
| `-o, --jobs-dir` | Output directory (default: `jobs/`) |
| `--job-name` | Custom job name |
| `--timeout-multiplier` | Multiply task timeouts |
| `-q, --quiet` | Suppress progress display |
| `--debug` | Enable debug logging |
| `-c, --config` | Job config file (YAML/JSON) |
**Retries**

| Flag | Description |
|---|---|
| `-r, --max-retries` | Max retry attempts |
| `--retry-include` | Exception types to retry |
| `--retry-exclude` | Exception types to skip |
**Trace export**

| Flag | Description |
|---|---|
| `--export-traces` | Export traces after the job |
| `--export-sharegpt` | Include ShareGPT format |
| `--export-episodes` | Episodes to export: `all` or `last` |
| `--export-push` | Push to HuggingFace Hub |
| `--export-repo` | HF repo id (`org/name`) |
## `harbor datasets`

```shell
harbor datasets list   # List registry datasets
```

## `harbor jobs`

```shell
harbor jobs start ...             # Same as harbor run
harbor jobs resume -p <job-dir>   # Resume failed job
harbor jobs summarize <job-dir>   # AI-summarize failures
```

## `harbor traces`

```shell
# Export traces from job directory
harbor traces export -p <path> --recursive

# Filter by result
harbor traces export -p <path> --filter success
harbor traces export -p <path> --filter failure

# Push to HuggingFace
harbor traces export -p <path> --push --repo org/name
```

## `harbor sweeps`

Run successive sweeps, dropping tasks with successes:

```shell
harbor sweeps run -c config.yaml --max-sweeps 3 --trials-per-task 2
harbor sweeps run -c config.yaml --push --export-repo org/name
```

## `harbor adapters`

```shell
harbor adapters init <name>       # Create new adapter
harbor adapters validate <path>   # Validate adapter
```

## Other commands

```shell
harbor cache clear   # Clear Harbor cache
harbor view          # Start web UI for browsing results
harbor --version     # Show version
```

## Job config file

```yaml
# config.yaml
job_name: my-evaluation
jobs_dir: ./jobs
n_attempts: 2
orchestrator:
  n_concurrent_trials: 8
agents:
  - name: claude-code
    model_name: anthropic/claude-sonnet-4-5-20250514
    kwargs:
      max_thinking_tokens: 16000
environment:
  type: daytona
  force_build: false
  delete: true
datasets:
  - name: terminal-bench
    version: "2.0"
    n_tasks: 10
```

## Examples

```shell
# Run Terminal-Bench with Claude Opus 4.5
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4-5-20251101 -n 4

# Run SWE-Bench on Daytona cloud with Sonnet 4.5
harbor run -d swebench-verified@1.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 \
  --env daytona -n 100

# Run local task with OpenAI o1
harbor run -p ./my-task -a openhands -m openai/o1 --debug

# Run with GPT-4o
harbor run -p ./my-task -a aider -m openai/gpt-4o

# Custom agent with kwargs
harbor run -p datasets/task \
  --agent-import-path my_agents:MyAgent \
  --ak custom_param=value \
  -m anthropic/claude-sonnet-4-5-20250514

# Export traces to HuggingFace
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4-5-20251101 \
  --export-traces --export-push --export-repo myorg/traces

# Resume a failed job
harbor jobs resume -p jobs/my-job-2024-01-15
```

## Environment variables

| Variable | Description |
|---|---|
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `DAYTONA_API_KEY` | Daytona cloud API key |
| `E2B_API_KEY` | E2B sandbox API key |
| `MODAL_TOKEN_ID` | Modal cloud credentials (token ID) |
| `MODAL_TOKEN_SECRET` | Modal cloud credentials (token secret) |
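Set the key for your model provider, plus the key for your cloud environment if you run with `-e daytona`, `modal`, etc. A minimal sketch (the values below are placeholders, not real credentials):

```shell
# Placeholder credentials for illustration only — substitute your own keys.
export ANTHROPIC_API_KEY="sk-ant-placeholder"
export DAYTONA_API_KEY="daytona-placeholder"
```

Harbor and the agents it launches read these from the environment, so export them in the shell (or CI job) that invokes `harbor run`.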
## Installation

Install with the Tessl CLI:

```shell
npx tessl i honeybadge/harbor@0.1.0
```