Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).
A platform engineering team is building an agent that performs large-scale codebase refactoring — migrating a legacy Python 2 monorepo to Python 3, updating APIs, fixing type errors, and running a test suite after each batch of changes. Tasks routinely take 3-6 hours to complete. The agent uses Claude Sonnet as its model.
The team is hitting two interrelated problems. First, tool outputs are bloating the context window fast: test runners produce thousands of lines, search results return entire files, and the agent receives the full text of every output without any processing. By the midpoint of a task, context is saturated and quality degrades. Second, when they've tried to compact old context to recover space, the agent starts repeating mistakes it already made in earlier iterations — as if it forgot what didn't work.
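One way to picture the first problem's fix is a size-tiered formatter that sits between each tool and the context window. The sketch below is illustrative only: the function name `format_observation` and the thresholds (200, 40, 120 lines) are assumptions, not values taken from the scenario. It passes small outputs through untouched and keeps only the head and tail of large ones, since test runners usually print their failure summary at the end.

```python
def format_observation(output: str, max_lines: int = 200,
                       head: int = 40, tail: int = 120) -> str:
    """Return tool output unchanged if small, else head + marker + tail.

    All thresholds are hypothetical defaults; head + tail must not
    exceed max_lines so the omitted count is always positive.
    """
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output  # small outputs enter context verbatim

    omitted = len(lines) - head - tail
    # Tell the model explicitly that content was dropped and how much.
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + [marker] + lines[-tail:])


if __name__ == "__main__":
    big = "\n".join(f"line {i}" for i in range(1000))
    print(format_observation(big).splitlines()[40])
```

The explicit marker matters as much as the truncation: it lets the agent know it is looking at a partial view and can re-run a narrower query if it needs the missing lines.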
The team knows they need a systematic approach to both how tool outputs are formatted before entering context and how context is managed over the multi-hour session. They want to handle the full range of output sizes cleanly, preserve the right information across compaction events, and ensure the cache costs stay reasonable as the agent runs over many turns.
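For the compaction side, a minimal sketch of "preserving the right information" is to carry a ledger of failed approaches forward whenever older turns are dropped. Everything here is hypothetical: the `Turn` type, its `failed_attempt` field, the `"summary"` role, and the `keep_recent` default are placeholders for whatever the real harness records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str
    text: str
    # Hypothetical field: a one-line note on what was tried and why it failed.
    failed_attempt: Optional[str] = None


def compact(turns: List[Turn], keep_recent: int = 10) -> List[Turn]:
    """Replace older turns with one summary turn that keeps failure notes."""
    if len(turns) <= keep_recent:
        return list(turns)

    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]

    # Failure notes survive compaction even though the turns themselves do not.
    failures = [t.failed_attempt for t in old if t.failed_attempt]
    summary = [f"[compacted {len(old)} earlier turns]"]
    if failures:
        summary.append("Approaches already tried that did NOT work:")
        summary.extend(f"- {note}" for note in failures)

    return [Turn("summary", "\n".join(summary))] + recent
```

Placing the summary as a single turn at the start of the retained history keeps the compacted prefix stable between compaction events, which is generally what prompt caching needs to stay effective.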
Produce a technical specification document saved as observation-context-spec.md that covers:
The document should be detailed enough that an engineer could implement it directly. Be specific — include concrete thresholds and numerical targets where they exist.