sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).

Guardrails

Name: sharaf/agentic-harness-architect
Rating: 100 (1 review)
Author: sharaf

Must-not-do rules that apply across all phases. Consult before finalizing a design.

Architecture

Do not reach for multi-agent before a single well-contextualized agent has been tried. For most real-world applications today, multi-agent systems are fragile and overrated compared to single, well-contextualized agents.
Do not allow self-evaluation. Separate generation from evaluation architecturally — this is structural, not fixable through prompting.
Do not parallelize write operations across agents. Read operations (research, analysis, testing) parallelize well; write operations require serialization through a single generator.
Do not exceed 4 agents without strong empirical justification. Performance saturates; coordination overhead consumes gains.
Do not use natural language at agent boundaries. Every boundary must enforce machine-checkable contracts via typed schemas.
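
A minimal sketch of a machine-checkable boundary contract, assuming messages cross the boundary as JSON. The `EvalResult` type, its two fields, and `parse_eval_result` are hypothetical names chosen for illustration; the point is only that free-form text is rejected at the boundary instead of being interpreted downstream.

```python
import json
from dataclasses import dataclass

# Assumption: the evaluator may only answer with one of these verdicts.
ALLOWED_VERDICTS = {"pass", "fail"}


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical contract for an evaluator -> generator handoff."""
    verdict: str
    failed_checks: tuple[str, ...]


def parse_eval_result(raw: str) -> EvalResult:
    """Validate a boundary message; raise ValueError on any contract violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"boundary message is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("boundary message must be a JSON object")

    expected_keys = {"verdict", "failed_checks"}
    if set(data) != expected_keys:
        raise ValueError(f"expected keys {sorted(expected_keys)}, got {sorted(data)}")

    verdict = data["verdict"]
    if not isinstance(verdict, str) or verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(ALLOWED_VERDICTS)}")

    checks = data["failed_checks"]
    if not isinstance(checks, list) or not all(isinstance(c, str) for c in checks):
        raise ValueError("failed_checks must be a list of strings")

    return EvalResult(verdict=verdict, failed_checks=tuple(checks))
```

A reply such as "Looks good to me!" fails at `json.loads` and never reaches the receiving agent; the sender gets a `ValueError` naming the violated constraint.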

Action Space

Do not exceed 12 tools per agent context. If more are needed, use dynamic tool loading or sub-agents instead.
Do not use line-number-based file editing. Line numbers go stale when the file changes between the read and edit steps.
Do not use relative filepaths. Require absolute paths as a poka-yoke.
Do not return vague error messages. Include expected format, constraints, and correction examples.
Do not use CodeAct with models below GPT-4 class. Performance drops from 74.4% to 13.4%.
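
One way the file-editing and error-message rules above could look in a single tool, as a sketch only. The `edit_file` name, the exact-string-match strategy (used here as one alternative to line numbers), and the example path in the error text are assumptions, not part of the original guardrails.

```python
from pathlib import Path


def edit_file(path: str, old: str, new: str) -> str:
    """Hypothetical edit tool: replace one exact, unique occurrence of `old`.

    Returns "OK" on success, or an error string that states the expected
    format and how to correct the call.
    """
    target = Path(path)
    if not target.is_absolute():
        return (
            f"ERROR: 'path' must be absolute, got {path!r}. "
            "Expected format: '/workspace/src/app.py'."
        )
    if not target.is_file():
        return f"ERROR: no file at {path!r}. List the parent directory to confirm the path."
    if not old:
        return "ERROR: 'old' must be a non-empty string copied from the file."

    text = target.read_text(encoding="utf-8")
    matches = text.count(old)
    if matches != 1:
        return (
            f"ERROR: 'old' must match exactly once but matched {matches} times. "
            "If 0, re-read the file and copy the text verbatim; "
            "if more than 1, include surrounding lines to make it unique."
        )

    # Exactly one match, so the edit cannot land on the wrong occurrence.
    target.write_text(text.replace(old, new, 1), encoding="utf-8")
    return "OK"
```

Matching on content rather than position means a stale read produces an explicit zero-match error instead of a silent edit to the wrong line.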

Observation & Context

Do not dump raw tool output into context. Apply per-tool-type summarization; "success is silent, failure is verbose."
Do not use line-based truncation limits. Use token-based limits (approximate with num_bytes / 4).
Do not truncate without recovery hints. Always tell the model what was removed and how to retrieve it.
Do not pass full coordinator history to sub-agents. Pass only specific instructions; let sub-agents gather their own context.
Do not modify prior messages mid-session or add timestamps to system prompts. Both invalidate the KV-cache (a 10x cost increase).
Do not strip error history during compaction. Failed attempts serve as implicit negative examples that prevent doom loops.
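
The truncation rules above can be sketched as follows, using the bytes / 4 token approximation. The function names, the head-plus-tail split, and the 500-token default are assumptions made for illustration; a real harness would tune these per tool type.

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count as UTF-8 bytes / 4."""
    return len(text.encode("utf-8")) // 4


def truncate_observation(output: str, max_tokens: int, recovery_hint: str) -> str:
    """Keep the head and tail within a token budget and say what was dropped."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    total = estimate_tokens(output)
    if total <= max_tokens:
        return output

    raw = output.encode("utf-8")
    half = max_tokens * 2  # bytes kept on each side: (max_tokens * 4) / 2
    # errors="ignore" drops any multi-byte character cut in half by the slice.
    head = raw[:half].decode("utf-8", errors="ignore")
    tail = raw[-half:].decode("utf-8", errors="ignore")
    marker = f"[... ~{total - max_tokens} tokens omitted. {recovery_hint} ...]"
    return f"{head}\n{marker}\n{tail}"


def format_command_result(command: str, exit_code: int, output: str,
                          max_tokens: int = 500) -> str:
    """Success is silent, failure is verbose (within the token budget)."""
    if exit_code == 0:
        return f"`{command}` succeeded (exit 0); output suppressed."
    hint = f"Re-run `{command}` and filter the output to see the rest."
    body = truncate_observation(output, max_tokens, hint)
    return f"`{command}` failed (exit {exit_code}):\n{body}"
```

The marker line carries both the size of the gap and a retrieval instruction, so the model can decide whether the omitted middle is worth another tool call.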

Evaluation

Do not deploy evaluators without empirical calibration. Systematic biases differ by task type. Target a leniency score within ±0.1.
Do not run uncapped evaluation loops. Hard limit of 3-5 iterations. Returns diminish sharply after initial iterations.
Do not use pointwise rubric evaluation (PRE). Present complete rubrics holistically (CRE) to avoid excessive strictness.
Do not optimize for catch rate alone. Excessive noise (9 false positives per real bug) causes tool abandonment.
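
A minimal sketch of a capped generate/evaluate loop, assuming the generator and evaluator are supplied as separate callables (in practice, separate model calls with separate prompts). The `refine` name, the callable signatures, and the default cap of 3 are illustrative assumptions; the cap sits at the low end of the 3-5 range stated above.

```python
from typing import Callable

# Assumed signatures: generate(task, feedback) -> draft
#                     evaluate(task, draft) -> (passed, feedback)
Generator = Callable[[str, str], str]
Evaluator = Callable[[str, str], tuple[bool, str]]


def refine(task: str, generate: Generator, evaluate: Evaluator,
           max_iterations: int = 3) -> tuple[str, int, bool]:
    """Run at most `max_iterations` generate/evaluate rounds.

    Returns (last_draft, rounds_used, passed). The loop never asks the
    generator to judge its own draft; only `evaluate` decides `passed`.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    draft, feedback = "", ""
    for round_number in range(1, max_iterations + 1):
        draft = generate(task, feedback)
        passed, feedback = evaluate(task, draft)
        if passed:
            return draft, round_number, True

    # Hard cap reached: hand back the last draft and let the caller escalate.
    return draft, max_iterations, False
```

Returning `passed=False` instead of looping further makes the cap visible to the caller, which can then escalate to a human or change strategy.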

Error Handling

Do not rely on the agent to detect its own loops. Loop detection must be architectural (LoopGuard), not instructional.
Do not retry permanent errors. Distinguish transient (429, 5xx) from permanent (logic errors, wrong approach). Permanent errors require strategy change.
Do not duplicate retry logic across multiple layers. Keep a single retry policy in one layer; all other layers delegate to it.
Do not allow agents to self-assess completion. Use external verification or structured checklists.
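
A sketch of two of the rules above: an architectural loop detector and a single retry policy that separates transient from permanent failures. The guardrails name LoopGuard but do not specify it, so the repeat-counting design here is an assumption, as are `call_with_retry`, the exception names, and the choice to treat every non-429 4xx status as permanent.

```python
import hashlib
import json
import time
from collections import Counter
from typing import Callable


class LoopGuard:
    """Hypothetical detector: trips when an identical tool call keeps repeating."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter[str] = Counter()

    def record(self, tool: str, args: dict) -> bool:
        """Record a call; return True once it has been seen max_repeats times."""
        payload = json.dumps([tool, args], sort_keys=True, default=str)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self._seen[key] += 1
        return self._seen[key] >= self.max_repeats


class PermanentError(Exception):
    """Retrying will not help; the caller must change strategy."""


class RetriesExhausted(Exception):
    """A transient failure persisted through every allowed attempt."""


def is_transient(status: int) -> bool:
    """429 and 5xx are transient; everything else is treated as permanent."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(send: Callable[[], tuple[int, str]], max_attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """The only place retries happen; other layers call this and never retry."""
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status < 400:
            return body
        if not is_transient(status):
            raise PermanentError(f"status {status}: {body}")
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RetriesExhausted(f"still failing after {max_attempts} attempts")
```

The harness, not the model, calls `LoopGuard.record` on every tool invocation and interrupts the run when it returns `True`, so detection does not depend on the agent noticing its own repetition.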

Simplification

Do not add components without passing the Complexity Justification Matrix (all four criteria).
Do not skip ablation after model upgrades. Run the full protocol; yesterday's scaffolding may be today's bloat.
Do not solve reasoning problems with infrastructure. Below GPT-4-class capability, no harness compensates for insufficient reasoning.
Do not build harness components as permanent architecture. Build to delete — treat them as temporary patches on model limitations.