Framework for AI agent evaluation in containerized environments. Use when: (1) Running agent evaluations with `harbor run` against benchmarks (SWE-Bench, Terminal-Bench, Aider Polyglot, etc.), (2) Creating custom benchmark tasks with Dockerfile, instruction.md, solution, and tests, (3) Building adapters to convert existing benchmarks to Harbor format, (4) Implementing custom agents extending BaseAgent or BaseInstalledAgent, (5) Scaling evaluations to cloud providers (Daytona, Modal, E2B), (6) Exporting traces for RL/SFT training, (7) Debugging Harbor runs or inspecting package internals.
| Agent | Description |
|---|---|
| oracle | Runs reference solution (default) |
| nop | No-op for testing |
| claude-code | Claude Code CLI |
| openhands | OpenHands agent |
| aider | Aider coding assistant |
| codex | OpenAI Codex CLI |
| swe-agent | Full SWE-agent |
| mini-swe-agent | Minimal SWE-agent |
| cursor-cli | Cursor CLI |
| gemini-cli | Gemini CLI |
| goose | Goose coding agent |
| opencode | OpenCode agent |
| qwen-coder | Qwen Coder |
| cline-cli | Cline CLI |
| terminus-2 | Terminus 2 agent |
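
An agent name from the table is passed to `harbor run` with `-a`. As a rough sketch, the command line can also be assembled in Python before handing it to a process runner. The helper name `harbor_run_argv` is hypothetical and not part of Harbor; it only uses flags that appear in this document (`-p`, `-a`, `-m`, `--ak`).

```python
import shlex


def harbor_run_argv(task_path, agent, model=None, agent_kwargs=None):
    """Build an argv list for `harbor run` (hypothetical helper, not a Harbor API)."""
    argv = ["harbor", "run", "-p", task_path, "-a", agent]
    if model is not None:
        argv += ["-m", model]
    # Each agent kwarg becomes its own `--ak key=value` pair
    for key, value in (agent_kwargs or {}).items():
        argv += ["--ak", f"{key}={value}"]
    return argv


argv = harbor_run_argv(
    "datasets/task",
    "claude-code",
    model="anthropic/claude-opus-4-5-20251101",
    agent_kwargs={"max_thinking_tokens": 32000},
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv)` as-is, which avoids shell quoting issues for values that contain spaces.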
# Latest Sonnet 4.5
harbor run -p datasets/task -a claude-code -m anthropic/claude-sonnet-4-5-20250929
# Latest Opus 4.5 with extended thinking
harbor run -p datasets/task -a claude-code -m anthropic/claude-opus-4-5-20251101 \
--ak max_thinking_tokens=32000

Environment variables: ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_BASE_URL

Allowed tools: Bash, Edit, Write, Read, Glob, Grep, LS, WebFetch, NotebookEdit, NotebookRead, TodoRead, TodoWrite, Agent, Skill, SlashCommand, Task, WebSearch
Models use LiteLLM format: provider/model-name
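
To illustrate the `provider/model-name` convention, here is a minimal sketch that splits a model string on its first slash. The function `split_model` is a hypothetical helper for illustration, not something Harbor or LiteLLM exposes under that name; it assumes everything after the first slash belongs to the model name, which is how nested identifiers such as the OpenRouter one would be kept intact.

```python
def split_model(model: str) -> tuple[str, str]:
    """Split 'provider/model-name' on the first slash (illustrative helper)."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, name


print(split_model("anthropic/claude-opus-4-5-20251101"))
print(split_model("openrouter/anthropic/claude-sonnet-4.5"))
```

A bare name without a provider prefix raises `ValueError` in this sketch, which makes a missing prefix visible before a run starts.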
# Anthropic - Claude 4.5 (latest)
-m anthropic/claude-opus-4-5-20251101
-m anthropic/claude-sonnet-4-5-20250929
# Anthropic - Claude 4
-m anthropic/claude-opus-4-20250514
-m anthropic/claude-sonnet-4-20250514
# OpenAI - GPT-4o series
-m openai/gpt-4o
-m openai/gpt-4o-mini
# OpenAI - o-series (reasoning)
-m openai/o1
-m openai/o1-mini
-m openai/o3-mini
# OpenAI - Legacy
-m openai/gpt-4-turbo
# Google
-m google/gemini-2.0-flash
-m google/gemini-1.5-pro
# OpenRouter (via ANTHROPIC_BASE_URL)
-m openrouter/anthropic/claude-sonnet-4.5

Agent kwargs are passed with `--ak`:

# Custom version
harbor run -p datasets/task -a claude-code --ak version=1.0.23
# Extended thinking for Opus 4.5
harbor run -p datasets/task -a claude-code -m anthropic/claude-opus-4-5-20251101 \
--ak max_thinking_tokens=32000
# Custom prompt template
harbor run -p datasets/task -a claude-code \
--ak prompt_template_path=/path/to/template.md

from pathlib import Path

from harbor.agents.base import BaseAgent
from harbor.environments.base import BaseEnvironment
from harbor.models.agent.context import AgentContext


class SimpleAgent(BaseAgent):
    SUPPORTS_ATIF: bool = False  # Set True if outputting ATIF trajectories

    @staticmethod
    def name() -> str:
        return "simple-agent"

    def __init__(self, logs_dir: Path, model_name: str | None = None,
                 custom_param: str = "default", **kwargs):
        super().__init__(logs_dir, model_name=model_name, **kwargs)
        self.custom_param = custom_param

    def version(self) -> str | None:
        return "1.0.0"

    async def setup(self, environment: BaseEnvironment) -> None:
        # Install the agent's dependencies inside the environment
        await environment.exec("pip install requests")

    async def run(self, instruction: str, environment: BaseEnvironment,
                  context: AgentContext) -> None:
        result = await environment.exec(
            command="python solve.py",
            env={"CUSTOM_PARAM": self.custom_param},
        )
        # Placeholder token counts; report real usage here
        context.n_input_tokens = 100
        context.n_output_tokens = 50

import shlex
from pathlib import Path

from harbor.agents.installed.base import BaseInstalledAgent, ExecInput
from harbor.models.agent.context import AgentContext

class MyCLIAgent(BaseInstalledAgent):
    SUPPORTS_ATIF: bool = True

    @staticmethod
    def name() -> str:
        return "my-cli-agent"

    @property
    def _install_agent_template_path(self) -> Path:
        return Path(__file__).parent / "install-my-agent.sh.j2"

    def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
        # shlex.quote keeps the instruction a single shell argument
        return [ExecInput(
            command=f"my-agent solve {shlex.quote(instruction)}",
            env={"MY_API_KEY": self.api_key},
        )]

    def populate_context_post_run(self, context: AgentContext) -> None:
        # Parse output and populate context
        pass

harbor run -p datasets/task \
--agent-import-path my_agents.simple_agent:SimpleAgent \
-m anthropic/claude-sonnet-4-5-20250929 \
--ak custom_param=my_value

ATIF is a standardized JSON format for agent traces. Set SUPPORTS_ATIF = True to enable trajectory export.
Schema version: ATIF-v1.2
Trajectories include:
# Get agent __init__ signature for --ak kwargs
uv run --with harbor python -c "
import inspect
from harbor.agents.installed.claude_code import ClaudeCode
print(inspect.signature(ClaudeCode.__init__))
"
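
Building on the signature check above, the following sketch compares a set of `--ak` pairs against an agent's `__init__` parameters to catch misspelled keys before a run. `StandInAgent`, `parse_ak`, and `unknown_kwargs` are hypothetical names used only for illustration; the stand-in's parameters are not ClaudeCode's real signature, and this is not how Harbor itself parses `--ak`.

```python
import inspect


class StandInAgent:
    """Hypothetical agent; its parameters are illustrative only."""

    def __init__(self, logs_dir, model_name=None, custom_param="default"):
        self.logs_dir = logs_dir
        self.model_name = model_name
        self.custom_param = custom_param


def parse_ak(pairs):
    """Turn ["key=value", ...] into a dict, splitting on the first '='."""
    return dict(pair.split("=", 1) for pair in pairs)


def unknown_kwargs(agent_cls, kwargs):
    """Return kwarg names that agent_cls.__init__ does not declare."""
    params = inspect.signature(agent_cls.__init__).parameters
    # A **kwargs parameter accepts any name, so nothing can be flagged
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return sorted(k for k in kwargs if k not in params)


kwargs = parse_ak(["custom_param=my_value", "versoin=1.0.23"])
print(unknown_kwargs(StandInAgent, kwargs))
```

Agents whose `__init__` takes `**kwargs` accept arbitrary names, so the check can only flag typos for agents with a closed signature.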
# List installed agents
uv run --with harbor python -c "
import os, harbor
path = os.path.join(os.path.dirname(harbor.__file__), 'agents', 'installed')
for f in sorted(os.listdir(path)):
    if f.endswith('.py'): print(f)
"

# Execute command
result = await environment.exec(
    command="python solve.py",
    cwd="/workspace",
    env={"API_KEY": "secret"},
    timeout_sec=300,
)
print(result.return_code, result.stdout, result.stderr)

# Upload files
await environment.upload_file(Path("./script.py"), "/workspace/script.py")
await environment.upload_dir(Path("./solution"), "/workspace/solution")

Install with Tessl CLI
npx tessl i honeybadge/harbor@0.1.0
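
When developing a custom agent, it can help to exercise the `environment.exec` call shape shown earlier without starting a container. The sketch below is a local stand-in under stated assumptions: `LocalEnvironment` and `ExecResult` are hypothetical classes, not Harbor's `BaseEnvironment`, and they only mirror the keyword arguments (`command`, `cwd`, `env`, `timeout_sec`) and result fields (`return_code`, `stdout`, `stderr`) used in this document. It assumes a POSIX shell is available on the host.

```python
import asyncio
import os
from dataclasses import dataclass


@dataclass
class ExecResult:
    return_code: int
    stdout: str
    stderr: str


class LocalEnvironment:
    """Hypothetical stand-in that runs commands on the host, not in a container."""

    async def exec(self, command, cwd=None, env=None, timeout_sec=None):
        proc = await asyncio.create_subprocess_shell(
            command,
            cwd=cwd,
            env={**os.environ, **(env or {})},  # extend the host environment
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        try:
            out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout_sec)
        except asyncio.TimeoutError:
            # Stop the child process before propagating the timeout
            proc.kill()
            await proc.wait()
            raise
        return ExecResult(proc.returncode, out.decode(), err.decode())


async def demo():
    environment = LocalEnvironment()
    return await environment.exec(
        command="echo $GREETING",
        env={"GREETING": "hello"},
        timeout_sec=30,
    )


result = asyncio.run(demo())
print(result.return_code, result.stdout.strip())
```

Because the stand-in runs on the host, it is only suitable for checking an agent's control flow, not for reproducing benchmark results.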