
Framework for AI agent evaluation in containerized environments. Use when:

1. Running agent evaluations with `harbor run` against benchmarks (SWE-Bench, Terminal-Bench, Aider Polyglot, etc.)
2. Creating custom benchmark tasks with Dockerfile, instruction.md, solution, and tests
3. Building adapters to convert existing benchmarks to Harbor format
4. Implementing custom agents extending BaseAgent or BaseInstalledAgent
5. Scaling evaluations to cloud providers (Daytona, Modal, E2B)
6. Exporting traces for RL/SFT training
7. Debugging Harbor runs or inspecting package internals

# Harbor Command Reference

## Primary Command: `harbor run`

An alias for `harbor jobs start`, and the main way to run evaluations.

```shell
harbor run -d <dataset@version> -a <agent> -m <model>
harbor run -p <path/to/task-or-dataset> -a <agent> -m <model>
```

### Dataset/Task Options

| Flag | Description |
|------|-------------|
| `-d, --dataset` | Dataset name@version from the registry |
| `-p, --path` | Local task or dataset path |
| `-t, --task-name` | Include specific tasks (glob patterns) |
| `-x, --exclude-task-name` | Exclude tasks (glob patterns) |
| `-l, --n-tasks` | Limit the number of tasks |
| `--registry-url` | Custom registry URL |
| `--registry-path` | Local registry.json path |
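
The include, exclude, and limit filters can be combined. A preview sketch, composed as a string and echoed rather than executed (the dataset and task names are illustrative; keep the quotes around globs so the shell does not expand them):

```shell
# Preview a filtered run: include tasks matching a glob, exclude one
# task, and cap the total at 5. Names here are illustrative.
cmd="harbor run -d terminal-bench@2.0 -a claude-code \
  -m anthropic/claude-sonnet-4-5-20250514 \
  -t 'network-*' -x 'network-proxy' -l 5"
echo "$cmd"
```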

### Agent Options

| Flag | Description |
|------|-------------|
| `-a, --agent` | Agent name (default: oracle) |
| `-m, --model` | Model (format: provider/model) |
| `--agent-import-path` | Custom agent import path |
| `--ak, --agent-kwarg` | Agent kwargs (key=value, repeatable) |
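
Since `--ak` is repeatable, a long kwarg list can be collected in a variable. A preview sketch, echoed rather than executed (the kwarg names are illustrative, not Harbor built-ins; check your agent's accepted kwargs):

```shell
# Collect repeatable agent kwargs, then preview the full command.
# temperature/max_steps are illustrative kwarg names.
AGENT_KWARGS="--ak temperature=0.2 --ak max_steps=50"
cmd="harbor run -p ./my-task -a claude-code \
  -m anthropic/claude-sonnet-4-5-20250514 $AGENT_KWARGS"
echo "$cmd"
```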

### Environment Options

| Flag | Description |
|------|-------------|
| `-e, --env` | Environment: docker, daytona, modal, e2b, runloop, gke |
| `--force-build/--no-force-build` | Force rebuild of the environment |
| `--delete/--no-delete` | Delete the environment after the run |
| `--override-cpus` | Override CPU count |
| `--override-memory-mb` | Override memory (MB) |
| `--override-storage-mb` | Override storage (MB) |
| `--override-gpus` | Override GPU count |
| `--ek, --environment-kwarg` | Environment kwargs |
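
The resource overrides compose with any environment choice. A preview sketch, echoed rather than executed (the resource sizes are illustrative):

```shell
# Preview a cloud run with larger containers. Sizes are illustrative.
cmd="harbor run -d swebench-verified@1.0 -a claude-code \
  -m anthropic/claude-sonnet-4-5-20250514 \
  -e daytona --override-cpus 4 --override-memory-mb 8192"
echo "$cmd"
```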

### Job Options

| Flag | Description |
|------|-------------|
| `-n, --n-concurrent` | Parallel trials (default: 1) |
| `-k, --n-attempts` | Attempts per task (default: 1) |
| `-o, --jobs-dir` | Output directory (default: jobs/) |
| `--job-name` | Custom job name |
| `--timeout-multiplier` | Multiply task timeouts |
| `-q, --quiet` | Suppress the progress display |
| `--debug` | Enable debug logging |
| `-c, --config` | Job config file (yaml/json) |

### Retry Options

| Flag | Description |
|------|-------------|
| `-r, --max-retries` | Maximum retry attempts |
| `--retry-include` | Exception types to retry |
| `--retry-exclude` | Exception types to skip |
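
Retries can be scoped to specific exception types. A preview sketch, echoed rather than executed (the exception type name is illustrative; inspect your job's failures for the real type names):

```shell
# Preview a run that retries failures up to 3 times, but only for
# timeout-type exceptions. TimeoutError is an illustrative name.
cmd="harbor run -d terminal-bench@2.0 -a claude-code \
  -m anthropic/claude-opus-4-5-20251101 \
  -r 3 --retry-include TimeoutError"
echo "$cmd"
```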

### Trace Export Options

| Flag | Description |
|------|-------------|
| `--export-traces` | Export traces after the job |
| `--export-sharegpt` | Include ShareGPT format |
| `--export-episodes` | `all` or `last` |
| `--export-push` | Push to the HuggingFace Hub |
| `--export-repo` | HF repo id (org/name) |
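
The export flags attach to an ordinary run. A preview sketch, echoed rather than executed, that would export ShareGPT-format traces for all episodes:

```shell
# Preview a run that exports traces in ShareGPT format for all episodes.
cmd="harbor run -d terminal-bench@2.0 -a claude-code \
  -m anthropic/claude-opus-4-5-20251101 \
  --export-traces --export-sharegpt --export-episodes all"
echo "$cmd"
```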

## Dataset Commands

```shell
harbor datasets list    # List registry datasets
```

## Job Commands

```shell
harbor jobs start ...           # Same as harbor run
harbor jobs resume -p <job-dir> # Resume a failed job
harbor jobs summarize <job-dir> # AI-summarize failures
```

## Trace Commands

```shell
# Export traces from a job directory
harbor traces export -p <path> --recursive

# Filter by result
harbor traces export -p <path> --filter success
harbor traces export -p <path> --filter failure

# Push to HuggingFace
harbor traces export -p <path> --push --repo org/name
```

## Sweep Commands

Run successive sweeps, dropping tasks that have already succeeded:

```shell
harbor sweeps run -c config.yaml --max-sweeps 3 --trials-per-task 2
harbor sweeps run -c config.yaml --push --export-repo org/name
```

## Adapter Commands

```shell
harbor adapters init <name>     # Create a new adapter
harbor adapters validate <path> # Validate an adapter
```

## Other Commands

```shell
harbor cache clear    # Clear the Harbor cache
harbor view           # Start the web UI for browsing results
harbor --version      # Show version
```

## Configuration Files

```yaml
# config.yaml
job_name: my-evaluation
jobs_dir: ./jobs
n_attempts: 2

orchestrator:
  n_concurrent_trials: 8

agents:
  - name: claude-code
    model_name: anthropic/claude-sonnet-4-5-20250514
    kwargs:
      max_thinking_tokens: 16000

environment:
  type: daytona
  force_build: false
  delete: true

datasets:
  - name: terminal-bench
    version: "2.0"
    n_tasks: 10
```
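
A config like the one above is passed to `harbor run` via `-c`. A preview sketch, echoed rather than executed (that CLI flags layered on top override config values is an assumption worth verifying):

```shell
# Preview starting a job from a saved config, with one CLI flag on top.
cmd="harbor run -c config.yaml -n 4"
echo "$cmd"
```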

## Examples

```shell
# Run Terminal-Bench with Claude Opus 4.5
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4-5-20251101 -n 4

# Run SWE-Bench on Daytona cloud with Sonnet 4.5
harbor run -d swebench-verified@1.0 -a claude-code -m anthropic/claude-sonnet-4-5-20250514 \
    --env daytona -n 100

# Run a local task with OpenAI o1
harbor run -p ./my-task -a openhands -m openai/o1 --debug

# Run with GPT-4o
harbor run -p ./my-task -a aider -m openai/gpt-4o

# Custom agent with kwargs
harbor run -p datasets/task \
    --agent-import-path my_agents:MyAgent \
    --ak custom_param=value \
    -m anthropic/claude-sonnet-4-5-20250514

# Export traces to HuggingFace
harbor run -d terminal-bench@2.0 -a claude-code -m anthropic/claude-opus-4-5-20251101 \
    --export-traces --export-push --export-repo myorg/traces

# Resume a failed job
harbor jobs resume -p jobs/my-job-2024-01-15
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `ANTHROPIC_API_KEY` | Anthropic API key |
| `OPENAI_API_KEY` | OpenAI API key |
| `DAYTONA_API_KEY` | Daytona cloud API key |
| `E2B_API_KEY` | E2B sandbox API key |
| `MODAL_TOKEN_ID` | Modal cloud credentials (token id) |
| `MODAL_TOKEN_SECRET` | Modal cloud credentials (token secret) |

## Install with Tessl CLI

```shell
npx tessl i honeybadge/harbor@0.1.0
```
