
honeybadge/harbor

Framework for AI agent evaluation in containerized environments. Use when: (1) Running agent evaluations with `harbor run` against benchmarks (SWE-Bench, Terminal-Bench, Aider Polyglot, etc.), (2) Creating custom benchmark tasks with Dockerfile, instruction.md, solution, and tests, (3) Building adapters to convert existing benchmarks to Harbor format, (4) Implementing custom agents extending BaseAgent or BaseInstalledAgent, (5) Scaling evaluations to cloud providers (Daytona, Modal, E2B), (6) Exporting traces for RL/SFT training, (7) Debugging Harbor runs or inspecting package internals.


Custom Agents in Harbor

Built-in Agents

| Agent | Description |
| --- | --- |
| oracle | Runs reference solution (default) |
| nop | No-op for testing |
| claude-code | Claude Code CLI |
| openhands | OpenHands agent |
| aider | Aider coding assistant |
| codex | OpenAI Codex CLI |
| swe-agent | Full SWE-agent |
| mini-swe-agent | Minimal SWE-agent |
| cursor-cli | Cursor CLI |
| gemini-cli | Gemini CLI |
| goose | Goose coding agent |
| opencode | OpenCode agent |
| qwen-coder | Qwen Coder |
| cline-cli | Cline CLI |
| terminus-2 | Terminus 2 agent |

Claude Code Agent

# Latest Sonnet 4.5
harbor run -p datasets/task -a claude-code -m anthropic/claude-sonnet-4-5-20250929

# Latest Opus 4.5 with extended thinking
harbor run -p datasets/task -a claude-code -m anthropic/claude-opus-4-5-20251101 \
    --ak max_thinking_tokens=32000

Environment variables: ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_BASE_URL

Allowed tools: Bash, Edit, Write, Read, Glob, Grep, LS, WebFetch, NotebookEdit, NotebookRead, TodoRead, TodoWrite, Agent, Skill, SlashCommand, Task, WebSearch

Model Format

Models use LiteLLM format: provider/model-name

# Anthropic - Claude 4.5 (latest)
-m anthropic/claude-opus-4-5-20251101
-m anthropic/claude-sonnet-4-5-20250929

# Anthropic - Claude 4
-m anthropic/claude-opus-4-20250514
-m anthropic/claude-sonnet-4-20250514

# OpenAI - GPT-4o series
-m openai/gpt-4o
-m openai/gpt-4o-mini

# OpenAI - o-series (reasoning)
-m openai/o1
-m openai/o1-mini
-m openai/o3-mini

# OpenAI - Legacy
-m openai/gpt-4-turbo

# Google
-m google/gemini-2.0-flash
-m google/gemini-1.5-pro

# OpenRouter (via ANTHROPIC_BASE_URL)
-m openrouter/anthropic/claude-sonnet-4.5
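
As a rough illustration of the provider/model-name convention, the sketch below splits a model string at the first slash. The helper `split_model_id` is hypothetical and is not part of Harbor or LiteLLM; it only shows why routed names such as the OpenRouter example keep their inner slash as part of the model name.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider/model-name' at the first slash (hypothetical helper)."""
    provider, sep, model_name = model_id.partition("/")
    if not sep or not provider or not model_name:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return provider, model_name


# Only the first slash separates the provider from the model name.
print(split_model_id("openai/gpt-4o"))
print(split_model_id("openrouter/anthropic/claude-sonnet-4.5"))
```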

Passing Options via --ak

# Custom version
harbor run -p datasets/task -a claude-code --ak version=1.0.23

# Extended thinking for Opus 4.5
harbor run -p datasets/task -a claude-code -m anthropic/claude-opus-4-5-20251101 \
    --ak max_thinking_tokens=32000

# Custom prompt template
harbor run -p datasets/task -a claude-code \
    --ak prompt_template_path=/path/to/template.md
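
Judging from the custom agent example in this document, each `--ak key=value` pair appears to reach the agent constructor as a keyword argument. The sketch below models that mapping under that assumption. `parse_agent_kwargs` is a hypothetical helper, values are kept as plain strings, and Harbor's real parser may coerce types or behave differently.

```python
def parse_agent_kwargs(pairs: list[str]) -> dict[str, str]:
    """Turn ['key=value', ...] into a kwargs dict (hypothetical helper)."""
    kwargs: dict[str, str] = {}
    for pair in pairs:
        # Split at the first '=' only, so values may themselves contain '='.
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected 'key=value', got {pair!r}")
        kwargs[key] = value
    return kwargs


print(parse_agent_kwargs(["version=1.0.23", "max_thinking_tokens=32000"]))
```

Because the sketch keeps every value as a string, an agent written against it would need to convert numeric options such as `max_thinking_tokens` itself.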

Implementing Custom Agents

Option 1: Extend BaseAgent

from pathlib import Path

from harbor.agents.base import BaseAgent
from harbor.environments.base import BaseEnvironment
from harbor.models.agent.context import AgentContext

class SimpleAgent(BaseAgent):
    SUPPORTS_ATIF: bool = False  # Set True if outputting ATIF trajectories

    @staticmethod
    def name() -> str:
        return "simple-agent"

    def __init__(self, logs_dir: Path, model_name: str | None = None,
                 custom_param: str = "default", **kwargs):
        super().__init__(logs_dir, model_name=model_name, **kwargs)
        # Extra constructor arguments such as custom_param can be supplied via --ak.
        self.custom_param = custom_param

    def version(self) -> str | None:
        return "1.0.0"

    async def setup(self, environment: BaseEnvironment) -> None:
        # Install dependencies inside the task environment before the run.
        await environment.exec("pip install requests")

    async def run(self, instruction: str, environment: BaseEnvironment,
                  context: AgentContext) -> None:
        # Run the solver in the environment, passing the custom option as an env var.
        result = await environment.exec(
            command="python solve.py",
            env={"CUSTOM_PARAM": self.custom_param},
        )
        # Placeholder token counts; replace with the real usage from your model client.
        context.n_input_tokens = 100
        context.n_output_tokens = 50

Option 2: Extend BaseInstalledAgent (CLI Tools)

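import os
import shlex
from pathlib import Path

from harbor.agents.installed.base import BaseInstalledAgent, ExecInput
from harbor.models.agent.context import AgentContext

class MyCLIAgent(BaseInstalledAgent):
    SUPPORTS_ATIF: bool = True

    @staticmethod
    def name() -> str:
        return "my-cli-agent"

    @property
    def _install_agent_template_path(self) -> Path:
        # Jinja2 install script template that lives next to this module.
        return Path(__file__).parent / "install-my-agent.sh.j2"

    def create_run_agent_commands(self, instruction: str) -> list[ExecInput]:
        return [ExecInput(
            # Quote the instruction so the shell passes it as a single argument.
            command=f"my-agent solve {shlex.quote(instruction)}",
            # Assumption: the key is read from the host environment.
            env={"MY_API_KEY": os.environ.get("MY_API_KEY", "")},
        )]

    def populate_context_post_run(self, context: AgentContext) -> None:
        # Parse output and populate context
        pass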
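
The `shlex.quote` call in `create_run_agent_commands` matters because task instructions are free text and may contain spaces, semicolons, or quotes. A minimal sketch using only the standard library shows that a quoted instruction survives shell-style splitting as a single argument. The `my-agent` command name is the same placeholder used in the example above.

```python
import shlex

# Free-text instruction with characters a shell would otherwise interpret.
instruction = "Fix the bug in app.py; don't touch tests/"

# Build the command the same way as create_run_agent_commands does.
command = f"my-agent solve {shlex.quote(instruction)}"

# Splitting the command the way a POSIX shell would recovers the
# instruction as one argument instead of several.
argv = shlex.split(command)
print(argv)
```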

Using Custom Agents

harbor run -p datasets/task \
    --agent-import-path my_agents.simple_agent:SimpleAgent \
    -m anthropic/claude-sonnet-4-5-20250929 \
    --ak custom_param=my_value
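
The `--agent-import-path` value follows the common `module.path:ClassName` convention. The sketch below shows one way such a string can be resolved with `importlib`. `load_class` is a hypothetical helper, and Harbor's own loader may work differently. A standard-library class stands in for the agent so the example runs without Harbor installed.

```python
import importlib


def load_class(import_path: str) -> type:
    """Resolve 'package.module:ClassName' to a class (hypothetical helper)."""
    module_name, sep, class_name = import_path.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {import_path!r}")
    # The module must be importable from the current Python path.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Stand-in for something like "my_agents.simple_agent:SimpleAgent".
cls = load_class("collections:OrderedDict")
print(cls.__name__)
```

With a loader of this kind, `my_agents.simple_agent` would have to be importable from where the command runs, for example by running it from the project root or installing the package.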

ATIF (Agent Trajectory Interchange Format)

Standardized JSON format for agent traces. Set SUPPORTS_ATIF = True to enable trajectory export.

Schema version: ATIF-v1.2

Trajectories include:

  • Session metadata
  • Agent info (name, version, model)
  • Steps (messages, tool calls, observations)
  • Final metrics (tokens, cost)
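
To make the list above concrete, the sketch below builds a small trajectory-shaped dictionary and serializes it to JSON. Every field name here is an illustrative assumption chosen to mirror the four bullet points. It is not the ATIF-v1.2 schema, which should be taken from the Harbor source or documentation.

```python
import json

# Illustrative only: field names are assumptions, not the ATIF-v1.2 schema.
trajectory = {
    "schema_version": "ATIF-v1.2",
    "session_id": "example-session",  # session metadata
    "agent": {  # agent info (name, version, model)
        "name": "simple-agent",
        "version": "1.0.0",
        "model_name": "anthropic/claude-opus-4-5-20251101",
    },
    "steps": [  # messages, tool calls, observations
        {"step_id": 1, "source": "user", "message": "Fix the failing test."},
        {
            "step_id": 2,
            "source": "agent",
            "message": "Running the test suite.",
            "tool_calls": [{"name": "Bash", "arguments": {"command": "pytest -q"}}],
            "observation": "1 failed",
        },
    ],
    "final_metrics": {"input_tokens": 100, "output_tokens": 50, "cost_usd": 0.0},
}

print(json.dumps(trajectory, indent=2))
```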

Inspecting Agent Source

# Get agent __init__ signature for --ak kwargs
uv run --with harbor python -c "
import inspect
from harbor.agents.installed.claude_code import ClaudeCode
print(inspect.signature(ClaudeCode.__init__))
"

# List installed agents
uv run --with harbor python -c "
import os, harbor
path = os.path.join(os.path.dirname(harbor.__file__), 'agents', 'installed')
for f in sorted(os.listdir(path)):
    if f.endswith('.py'): print(f)
"
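
The same `inspect.signature` approach can be tried without Harbor installed. The sketch below applies it to a stand-in class and lists the constructor parameters that have defaults, which are the likely candidates for `--ak`. `ExampleAgent` and `keyword_options` are hypothetical, and the parameter names are borrowed from the `--ak` examples in this document rather than from the real `ClaudeCode` signature.

```python
import inspect


class ExampleAgent:
    """Stand-in class for illustration; not a Harbor agent."""

    def __init__(self, logs_dir, model_name=None, version=None,
                 max_thinking_tokens=None, **kwargs):
        self.logs_dir = logs_dir


def keyword_options(cls: type) -> dict[str, object]:
    """Map each __init__ parameter that has a default to that default."""
    params = inspect.signature(cls.__init__).parameters
    return {
        name: param.default
        for name, param in params.items()
        if param.default is not inspect.Parameter.empty
    }


print(keyword_options(ExampleAgent))
```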

Environment Interface

# Execute command
result = await environment.exec(
    command="python solve.py",
    cwd="/workspace",
    env={"API_KEY": "secret"},
    timeout_sec=300,
)
print(result.return_code, result.stdout, result.stderr)

# Upload files
await environment.upload_file(Path("./script.py"), "/workspace/script.py")
await environment.upload_dir(Path("./solution"), "/workspace/solution")
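
When developing an agent it can help to exercise its logic without starting a container. The sketch below uses a stand-in environment that records `exec` calls and returns a result with the `return_code`, `stdout`, and `stderr` attributes shown above. `FakeEnvironment`, `FakeExecResult`, and `run_solver` are hypothetical names for illustration and are not Harbor classes.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeExecResult:
    """Mimics the attributes read from an exec result in this document."""
    return_code: int = 0
    stdout: str = ""
    stderr: str = ""


@dataclass
class FakeEnvironment:
    """Records exec calls instead of running them (stand-in, not Harbor)."""
    calls: list[dict] = field(default_factory=list)

    async def exec(self, command, cwd=None, env=None, timeout_sec=None):
        self.calls.append(
            {"command": command, "cwd": cwd, "env": env, "timeout_sec": timeout_sec}
        )
        return FakeExecResult(stdout="ok\n")


async def run_solver(environment) -> bool:
    """Example agent logic: run the solver and report whether it succeeded."""
    result = await environment.exec(
        command="python solve.py", cwd="/workspace", timeout_sec=300
    )
    return result.return_code == 0


env = FakeEnvironment()
succeeded = asyncio.run(run_solver(env))
print(succeeded, env.calls[0]["command"])
```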

Install with Tessl CLI

npx tessl i honeybadge/harbor@0.1.0
