Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).
100
100%
Does it follow best practices?
Impact
100%
1.23xAverage score across 4 eval scenarios
Passed
No known issues
Observation formatting is a direct lever on task completion rates. 30-60% of tokens sent to models add no value.
| Output size | Strategy |
|---|---|
| < 2K tokens | Pass through unchanged |
| 2K-10K tokens, high signal | Per-tool summarization |
| 2K-10K tokens, low signal | Head-tail truncation |
| 10K-25K tokens | Offload to disk + preview |
| > 25K tokens | Offload + sub-agent delegation |
| Error output (any size) | Preserve verbatim up to 5K tokens |
Use token-based limits, not line-based. Approximate with num_bytes / 4 when no tokenizer is available. 25K tokens (Claude Code default) is a well-tested ceiling.
| Duration | Strategy |
|---|---|
| < 30 min | No management needed |
| 30 min - 3 hours | FIC (Frequent Intentional Compaction) at phase boundaries; target 40-60% utilization |
| 3+ hours | FIC with sub-agents (essential) or one-session-per-task resets |
| Automated loops | One-session-per-task with external state persistence |
| Parallel exploration | Sub-agent context isolation |
Best to worst quality:
When producing observation-context-spec.md, include this hierarchy by name and
in this order. State the default rule as:
Keep raw context when possible; if space is needed, clear bulky tool results first, then mask low-signal observations, then use anchored structured summaries. Use free-form summarization only as a last resort.
The spec must distinguish these approaches instead of saying "compact old context" generically, and it must explicitly preserve failed attempts, error traces, and rejected strategies across every compaction step.
Maintain stable prefixes. Append-only context modifications. Deterministic JSON serialization (sorted keys). Never remove/reorder tools mid-session (10x cost increase from cache invalidation: $0.30 vs. $3.00/MTok).
evals
scenario-1
scenario-2
scenario-3
scenario-4
references