Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).
Design the tool set the agent will use. Tool granularity should match model capability.
| Model tier | Strategy | Example |
|---|---|---|
| Frontier (Opus 4+, GPT-5) | Coarse — bash + file editor | Claude Code minimal scaffold |
| Strong general (Sonnet 4, GPT-4o) | Medium — 5-10 curated tools | SWE-agent ACI |
| Mid-tier (Haiku, GPT-4o-mini) | Fine-grained with guardrails | Structured tools with validation |
| Small/open-source (<70B) | Maximum structure, atomic operations | Strictly typed, no CodeAct |
Keep each agent context to 8-12 tools. Accuracy drops from ~95% with 4 tools to ~71% with 46 tools. Keep tool definitions under 20% of the context budget. For broader needs, use dynamic tool loading or sub-agent architectures.
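The tool-count and context-share limits above can be checked mechanically before an agent starts. The following is a minimal sketch, not a definitive implementation: `ToolSpec`, `estimate_tokens`, and `check_tool_budget` are hypothetical names, and the four-characters-per-token estimate is an assumption that should be replaced with the tokenizer of the model actually in use.

```python
import json
from dataclasses import dataclass

MAX_TOOLS = 12               # upper end of the 8-12 range
MAX_DEFINITION_SHARE = 0.20  # tool definitions must stay under 20% of context


@dataclass
class ToolSpec:
    """Hypothetical tool definition: name, description, JSON-schema-like parameters."""
    name: str
    description: str
    parameters: dict


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def check_tool_budget(tools: list[ToolSpec], context_window_tokens: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the set fits."""
    problems = []
    if len(tools) > MAX_TOOLS:
        problems.append(f"{len(tools)} tools exceeds the limit of {MAX_TOOLS}")

    # Estimate the size of the serialized definitions the model will see.
    definition_tokens = sum(
        estimate_tokens(json.dumps(
            {"name": t.name, "description": t.description, "parameters": t.parameters}
        ))
        for t in tools
    )
    limit = MAX_DEFINITION_SHARE * context_window_tokens
    if definition_tokens >= limit:
        problems.append(
            f"tool definitions use ~{definition_tokens} tokens, "
            f"at or above the {limit:.0f}-token budget"
        )
    return problems
```

A non-empty result is a signal to split the tool set across sub-agents or load tools dynamically rather than to raise the limits.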
| Condition | Strategy |
|---|---|
| File < 200 lines | str_replace with exact unique match |
| File > 200 lines, change < 20 lines | str_replace with expanded context |
| Change > 20 lines, AST available | Semantic editing via fully qualified name (FQN) or AST node |
| Change > 20 lines, no AST | Whole-file replacement with syntax/lint validation |
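The edit-strategy table can be read as a small decision function, and its first row as a replace operation that refuses ambiguous matches. The sketch below is illustrative only: the function names and returned labels are hypothetical, and because the table does not say what to do at exactly 200 or 20 lines, the boundary handling here (treating them as the smaller case) is an assumption.

```python
def choose_edit_strategy(file_lines: int, change_lines: int, ast_available: bool) -> str:
    """Map the table's conditions to a strategy label (labels are hypothetical)."""
    if change_lines > 20:
        # Large changes: prefer structure-aware edits, else rewrite and validate.
        return "semantic_edit" if ast_available else "whole_file_replace"
    if file_lines > 200:
        # Small change in a large file: include extra surrounding lines in the match.
        return "str_replace_expanded_context"
    return "str_replace_exact"


def str_replace_unique(source: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `source`."""
    if not old:
        raise ValueError("the text to replace must not be empty")
    count = source.count(old)
    if count != 1:
        # Zero matches means stale context; several means the edit is ambiguous.
        raise ValueError(f"expected exactly 1 match, found {count}")
    return source.replace(old, new, 1)
```

Raising on zero or multiple matches, instead of editing the first hit, gives the agent an error it can act on, for example by retrying with more surrounding context.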
| Context | Strategy | Trade-off |
|---|---|---|
| Trusted code, internal team | Hardened Docker + seccomp + AppArmor | Low overhead |
| Untrusted code, single-tenant | gVisor containers | 10-30% I/O overhead, strong isolation |
| Multi-tenant, untrusted | Firecracker microVMs | ~125ms boot, <5 MiB/VM, hardware isolation |
| Regulated, zero-trust | Kata Containers on Kubernetes | ~200ms boot |
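As a rough illustration of the first row, the sketch below builds a `docker run` command line with a seccomp profile and an AppArmor profile attached. The function name, profile path, and profile name are hypothetical. The extra flags (dropped capabilities, no network, read-only root filesystem, `no-new-privileges`) are one assumed reading of "hardened" and are not prescribed by the table.

```python
def hardened_docker_args(image: str, seccomp_profile: str, apparmor_profile: str) -> list[str]:
    """Build an argv list for a locked-down container (a sketch, not a vetted policy)."""
    return [
        "docker", "run", "--rm",
        "--security-opt", f"seccomp={seccomp_profile}",    # syscall filter
        "--security-opt", f"apparmor={apparmor_profile}",  # mandatory access control
        "--security-opt", "no-new-privileges",             # assumption: block privilege gain
        "--cap-drop", "ALL",                               # assumption: drop all capabilities
        "--network", "none",                               # assumption: no network access
        "--read-only",                                     # assumption: immutable root fs
        image,
    ]


if __name__ == "__main__":
    # Hypothetical image and profile names, printed rather than executed.
    print(" ".join(hardened_docker_args(
        "agent-sandbox:latest", "/etc/agent/seccomp.json", "agent-profile"
    )))
```

Returning an argv list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting.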
Separate observation tools (read-only) from mutation tools (write). Support an observation-only mode, graduated permission escalation through an ask/allow/deny model, and audit trails.
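One way to picture the read/write split and the ask/allow/deny model is a small gate that every tool call passes through. This is a minimal sketch under stated assumptions: `PermissionGate`, `Decision`, and the tool names in the usage note are hypothetical, and a real harness would persist the audit log and route "ask" decisions to a human or policy engine.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"
    DENY = "deny"


@dataclass
class PermissionGate:
    read_only_tools: frozenset
    write_tools: frozenset
    observation_only: bool = False
    approved: set = field(default_factory=set)     # write tools escalated to ALLOW
    audit_log: list = field(default_factory=list)  # (tool, decision) for every request

    def decide(self, tool: str) -> Decision:
        if tool in self.read_only_tools:
            decision = Decision.ALLOW   # observation is always permitted
        elif tool not in self.write_tools or self.observation_only:
            decision = Decision.DENY    # unknown tool, or writes are switched off
        elif tool in self.approved:
            decision = Decision.ALLOW   # previously escalated
        else:
            decision = Decision.ASK     # first use of a write tool needs approval
        self.audit_log.append((tool, decision.value))
        return decision

    def approve(self, tool: str) -> None:
        """Escalate a known write tool from ASK to ALLOW."""
        if tool in self.write_tools:
            self.approved.add(tool)
```

For example, with `read_file` registered as read-only and `write_file` as a write tool, `decide("write_file")` returns `Decision.ASK` until `approve("write_file")` is called, and returns `Decision.DENY` whenever `observation_only` is set.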