Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).
Does it follow best practices? 100%
Impact: 100% (1.23x average score across 4 eval scenarios)
Passed: No known issues
A software tools company is launching an automated code review agent that analyzes pull requests and produces inline comments, summary feedback, and a go/no-go recommendation. The agent targets teams that ship frequently and need fast, consistent feedback — particularly for catching logic errors, security issues, and style deviations before human review.
The team has a working generator but no real evaluation infrastructure. Their current approach is a self-check: after the agent produces its review, a second call to the same model is made with the instruction "check your own work." During dogfooding they found that this produces consistently over-optimistic assessments: the agent almost never recommends changes to its own output. A developer who compared 50 agent outputs against the ground-truth human reviews estimated that about 30% of the agent's "approved" outputs actually had significant issues.
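The 30% figure above is a false-approval rate: among outputs the self-check approved, the share that a human reviewer found to have significant issues. A minimal sketch of that calculation follows. The `LabeledReview` schema, its field names, and the sample data are hypothetical and purely illustrative; they are not the team's actual records.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LabeledReview:
    """One agent review paired with a human judgment (hypothetical schema)."""
    pr_id: str
    self_check_approved: bool            # verdict of the "check your own work" call
    human_found_significant_issue: bool  # taken from the ground-truth human review


def false_approval_rate(samples: List[LabeledReview]) -> float:
    """Fraction of self-approved outputs that a human flagged as having significant issues."""
    approved = [s for s in samples if s.self_check_approved]
    if not approved:
        return 0.0  # nothing was approved, so there are no false approvals
    flagged = sum(1 for s in approved if s.human_found_significant_issue)
    return flagged / len(approved)


# Synthetic data: 50 approved outputs, 15 of which a human flagged.
samples = [
    LabeledReview(pr_id=f"pr-{i}", self_check_approved=True,
                  human_found_significant_issue=(i < 15))
    for i in range(50)
]
rate = false_approval_rate(samples)
print(f"false-approval rate: {rate:.0%}")
```

Tracking this number on a fixed labeled set gives the team a baseline against which any redesigned evaluator can be compared.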
The team wants to redesign the evaluation system end-to-end. They have access to the same frontier models as the generator, standard CI tooling (linters, type checkers, test runners), and the git history for each PR. The reviews are developer-facing — engineers will read them directly and act on them.
Produce a document evaluation-design.md that specifies:
Be specific — include the ordering of evaluation stages, numerical limits, and concrete recommendations the team can implement.