
sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).

Quality: 100% (Does it follow best practices?)
Impact: 100%, 1.23x (average score across 4 eval scenarios)
Security by Snyk: Passed, no known issues


references/guardrails.md

Guardrails

Must-not-do rules that apply across all phases. Consult before finalizing a design.

Architecture

  • Do not reach for multi-agent before a single well-contextualized agent has been tried. For most real-world applications today, multi-agent systems are fragile and overrated compared to single, well-contextualized agents.
  • Do not allow self-evaluation. Separate generation from evaluation architecturally — this is structural, not fixable through prompting.
  • Do not parallelize write operations across agents. Read operations (research, analysis, testing) parallelize well; write operations require serialization through a single generator.
  • Do not exceed 4 agents without strong empirical justification. Performance saturates; coordination overhead consumes gains.
  • Do not use natural language at agent boundaries. Every boundary must enforce machine-checkable contracts via typed schemas.
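
A machine-checkable boundary contract can be sketched as follows. This is a minimal illustration, not a prescribed schema: `SubtaskResult`, its fields, and the status vocabulary are all hypothetical, and a real harness might use a schema library instead of hand-written checks.

```python
import json
from dataclasses import dataclass, fields

ALLOWED_STATUSES = {"done", "failed", "blocked"}  # hypothetical status vocabulary


@dataclass(frozen=True)
class SubtaskResult:
    """Hypothetical contract for a sub-agent's reply to its coordinator."""
    task_id: str
    status: str
    files_changed: list
    summary: str


def parse_subtask_result(raw: str) -> SubtaskResult:
    """Validate a boundary message; raise ValueError on any contract violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")

    # Field names and types come straight from the dataclass definition.
    expected = {f.name: f.type for f in fields(SubtaskResult)}
    missing = sorted(expected.keys() - data.keys())
    extra = sorted(data.keys() - expected.keys())
    if missing or extra:
        raise ValueError(f"missing fields: {missing}; unexpected fields: {extra}")

    for name, expected_type in expected.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"{name!r} must be {expected_type.__name__}")
    if data["status"] not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    if not all(isinstance(p, str) for p in data["files_changed"]):
        raise ValueError("files_changed must be a list of strings")
    return SubtaskResult(**data)
```

A free-text reply such as "I finished the task" is rejected at the boundary, so the coordinator never has to interpret prose.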

Action Space

  • Do not exceed 12 tools per agent context. Add dynamic tool loading or sub-agents instead.
  • Do not use line-number-based file editing. Line numbers break between read and edit steps.
  • Do not use relative filepaths. Require absolute paths as a poka-yoke.
  • Do not return vague error messages. Include expected format, constraints, and correction examples.
  • Do not use CodeAct with models below GPT-4 class. Performance drops from 74.4% to 13.4%.
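
The editing, path, and error-message rules above can be combined in one tool. The sketch below assumes a string-replacement edit as the alternative to line numbers; `replace_in_file`, `ToolError`, and the example path in the error text are hypothetical names chosen for illustration.

```python
import os


class ToolError(Exception):
    """Error whose message is written for the model to read and act on."""


def replace_in_file(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new` in the file at `path`."""
    if not os.path.isabs(path):
        raise ToolError(
            f"path must be absolute, got {path!r}. "
            "Expected something like '/repo/src/app.py'."
        )
    if not old:
        raise ToolError("old must be a non-empty string copied verbatim from the file.")

    with open(path, encoding="utf-8") as fh:
        text = fh.read()

    # Matching on content instead of line numbers survives edits made
    # between the read step and the edit step.
    count = text.count(old)
    if count == 0:
        raise ToolError(
            "old was not found in the file. Re-read the file and copy the "
            "target text exactly, including whitespace."
        )
    if count > 1:
        raise ToolError(
            f"old matches {count} places; it must match exactly one. "
            "Include more surrounding lines to make it unique."
        )

    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text.replace(old, new, 1))
    return "ok"  # short success message; detail is reserved for failures
```

Each error states what was expected and how to correct the call, so the model can repair its next attempt without guessing.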

Observation & Context

  • Do not dump raw tool output into context. Apply per-tool-type summarization; "success is silent, failure is verbose."
  • Do not use line-based truncation limits. Use token-based limits (approximate with num_bytes / 4).
  • Do not truncate without recovery hints. Always tell the model what was removed and how to retrieve it.
  • Do not pass full coordinator history to sub-agents. Pass only specific instructions; let sub-agents gather their own context.
  • Do not modify prior messages mid-session or add timestamps to system prompts. Either change invalidates the KV-cache (10x cost increase).
  • Do not strip error history during compaction. Failed attempts serve as implicit negative examples that prevent doom loops.
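
Token-based truncation with a recovery hint might look like the sketch below. It uses the `num_bytes / 4` approximation from the list above; the function names and the wording of the truncation notice are assumptions, and the hint text is supplied by whichever tool produced the output.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: UTF-8 byte count divided by 4."""
    return len(text.encode("utf-8")) // 4


def truncate_with_hint(text: str, max_tokens: int, hint: str) -> str:
    """Keep the head of `text` within a token budget and say what was dropped."""
    data = text.encode("utf-8")
    budget_bytes = max_tokens * 4
    if len(data) <= budget_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    kept = data[:budget_bytes].decode("utf-8", errors="ignore")
    removed_tokens = (len(data) - budget_bytes) // 4
    return f"{kept}\n[truncated: ~{removed_tokens} tokens omitted. {hint}]"
```

For example, `truncate_with_hint(log, 2000, "Re-run with a narrower filter to see the rest.")` leaves short output untouched and tells the model how to retrieve the remainder of long output.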

Evaluation

  • Do not deploy evaluators without empirical calibration. Systematic biases differ by task type. Target leniency score within ±0.1.
  • Do not run uncapped evaluation loops. Hard limit of 3-5 iterations. Returns diminish sharply after initial iterations.
  • Do not use pointwise rubric evaluation (PRE). Present complete rubrics holistically (CRE) to avoid excessive strictness.
  • Do not optimize for catch rate alone. Excessive noise (9 false positives per real bug) causes tool abandonment.
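
A capped evaluation loop, with generation and evaluation kept as separate callables, can be sketched as below. The function names and the default cap of 3 are illustrative assumptions within the 3-5 range given above. The small `precision` helper makes the noise figure concrete: one real bug among nine false positives is a precision of 0.1.

```python
from typing import Callable, Tuple


def refine(
    generate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 3,
) -> Tuple[str, int, bool]:
    """Run generate/evaluate rounds until the evaluator passes or the cap is hit.

    Returns (last_draft, iterations_used, passed).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    draft, feedback = "", ""
    for iteration in range(1, max_iterations + 1):
        draft = generate(feedback)           # generator sees only evaluator feedback
        passed, feedback = evaluate(draft)   # evaluator is a separate component
        if passed:
            return draft, iteration, True
    return draft, max_iterations, False      # hard stop: no uncapped looping


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged issues that are real."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```

When the cap is reached the caller receives `passed=False` and can escalate or stop, instead of spending further iterations on diminishing returns.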

Error Handling

  • Do not rely on the agent to detect its own loops. Loop detection must be architectural (LoopGuard), not instructional.
  • Do not retry permanent errors. Distinguish transient (429, 5xx) from permanent (logic errors, wrong approach). Permanent errors require strategy change.
  • Do not duplicate retry logic across multiple layers. Keep a single retry policy in one layer; all other layers delegate to it.
  • Do not allow agents to self-assess completion. Use external verification or structured checklists.
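
The transient/permanent split and architectural loop detection can be sketched together. This is an illustration under stated assumptions: the guide names LoopGuard but does not describe it, so `RepeatGuard` below is a hypothetical stand-in that only counts identical tool calls. `call_with_retry` assumes a request callable that returns a `(status, body)` pair.

```python
import json
import time
from collections import Counter
from typing import Callable, Tuple


class PermanentError(Exception):
    """Repeating the request will not help; the caller must change strategy."""


class RetriesExhausted(Exception):
    """A transient failure persisted through every allowed attempt."""


def is_transient(status: int) -> bool:
    return status == 429 or 500 <= status <= 599


def call_with_retry(
    request: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """The single retry policy; other layers call this instead of retrying."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = request()
        if status < 400:
            return body
        if not is_transient(status):
            raise PermanentError(f"status {status}: {body}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RetriesExhausted(f"still failing after {max_attempts} attempts")


class RepeatGuard:
    """Harness-side check that flags an identical tool call repeated too often."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def should_halt(self, tool: str, args: dict) -> bool:
        key = (tool, json.dumps(args, sort_keys=True))
        self._seen[key] += 1
        return self._seen[key] >= self.max_repeats
```

The harness calls `should_halt` before dispatching each tool call, so the loop is interrupted even if the model never notices it is repeating itself.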

Simplification

  • Do not add a component unless it passes all four criteria of the Complexity Justification Matrix.
  • Do not skip ablation after model upgrades. Run the full protocol; yesterday's scaffolding may be today's bloat.
  • Do not solve reasoning problems with infrastructure. Below GPT-4-class capability, no harness compensates for insufficient reasoning.
  • Do not build harness components as permanent architecture. Build to delete — treat them as temporary patches on model limitations.
