Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).
Separating generation from evaluation is the single most impactful harness design decision. Self-evaluation bias inflates scores by 4-9% across all tested models.
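One way to picture this separation is a loop in which the generator and the evaluator are distinct callables, and the evaluator sees only the task and the candidate output, never the generator's own assessment. The sketch below is illustrative only: `Verdict`, `LoopResult`, `run_with_separate_evaluator`, and the `max_rounds` parameter are hypothetical names, not part of any particular harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    """Result returned by the independent evaluator (hypothetical type)."""
    passed: bool
    feedback: str


@dataclass(frozen=True)
class LoopResult:
    """Final candidate, whether it passed, and how many rounds ran."""
    candidate: str
    passed: bool
    rounds: int


# Assumed signatures: the generator receives the task plus evaluator feedback;
# the evaluator receives only the task and the candidate.
Generator = Callable[[str, str], str]
Evaluator = Callable[[str, str], Verdict]


def run_with_separate_evaluator(
    task: str,
    generate: Generator,
    evaluate: Evaluator,
    max_rounds: int = 3,
) -> LoopResult:
    """Alternate generation and independent evaluation until a pass or the round limit."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    candidate = ""
    for round_number in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        # The verdict comes from a separate component, not from the generator.
        verdict = evaluate(task, candidate)
        if verdict.passed:
            return LoopResult(candidate, True, round_number)
        feedback = verdict.feedback
    return LoopResult(candidate, False, max_rounds)
```

In practice `generate` and `evaluate` would wrap separate model calls with separate prompts; the point of the sketch is only that the pass/fail decision never comes from the component that produced the candidate.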
When evaluators surface candidate issues to developers, every false positive costs trust. Tooling that achieves a high catch rate by also flagging noise gets abandoned: at ~9 false positives per real bug, teams disable the tool entirely.
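To make the ratio concrete: 9 false positives per real bug means only 1 in 10 flagged items is real, a precision of 10%. The sketch below computes that figure and shows one possible mitigation, surfacing only findings above a confidence cutoff. The `Finding` type, the `confidence` field, and the `min_confidence` parameter are assumptions made for illustration; the source does not prescribe a specific filtering mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A candidate issue reported by an evaluator (hypothetical shape)."""
    message: str
    confidence: float  # assumed to be in [0.0, 1.0]


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged items that are real issues."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("precision is undefined when nothing was flagged")
    return true_positives / flagged


def surface(findings: list[Finding], min_confidence: float) -> list[Finding]:
    """Keep only findings at or above the confidence cutoff."""
    return [f for f in findings if f.confidence >= min_confidence]


# 1 real bug alongside 9 false positives: 1 / (1 + 9)
noisy_precision = precision(true_positives=1, false_positives=9)
```

Raising the cutoff trades catch rate for precision. Where to set it is a judgment call that depends on how much noise a given team will tolerate.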