sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).

Phase 5: Observation Formatting & Context Management

Name: sharaf/agentic-harness-architect
Rating: 100 (1 reviews)
Author: sharaf

Observation formatting is a direct lever on task completion rates: typically 30-60% of the tokens sent to a model add no value, so each formatting decision either frees context or wastes it.

Observation formatting pipeline

Raw capture: stdout, stderr, exit code, metadata
Per-tool transformation: line numbers for files, exit-code-first for commands, ranked snippets for search
Size assessment: measure against token budget
Compression: summarize, truncate, or offload based on thresholds
Metadata injection: truncation markers, recovery hints, file references
Position optimization: highest-signal content at start/end (U-shaped attention curve)
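The per-tool transformation stage can be illustrated with a minimal Python sketch. The function names `number_lines` and `rank_snippets`, the `N | text` layout, and the `(score, snippet)` input shape are assumptions made for illustration; the source only says that files get line numbers and search results get ranked snippets.

```python
def number_lines(text: str, start: int = 1) -> str:
    """Prefix each line of a file excerpt with its right-aligned line number."""
    lines = text.splitlines()
    if not lines:
        return ""
    # Width of the largest line number, so the separator column stays aligned.
    width = len(str(start + len(lines) - 1))
    return "\n".join(
        f"{n:>{width}} | {line}" for n, line in enumerate(lines, start)
    )


def rank_snippets(matches: list[tuple[float, str]], top_k: int = 5) -> list[str]:
    """Return the top_k search snippets, highest relevance score first.

    The scoring itself is out of scope here; scores are assumed to be
    supplied by whatever search tool produced the matches.
    """
    ordered = sorted(matches, key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in ordered[:top_k]]


print(number_lines("def f():\n    return 1"))
```

Line numbers let the model refer to exact locations in later edit calls, and capping search results at `top_k` keeps low-relevance hits out of the context.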

The cardinal rule — "success is silent, failure is verbose"

Passing tests: count only
Failing tests: full output
Successful commands: confirmation line
Failed commands: full error with context
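One way to apply this rule in a formatter is sketched below. The `CommandResult` type, the function names, and the exact wording of the confirmation and failure lines are hypothetical; only the behavior (one line on success, full output on failure) comes from the rule above.

```python
from dataclasses import dataclass


@dataclass
class CommandResult:
    # Hypothetical container for a raw command capture.
    command: str
    exit_code: int
    stdout: str
    stderr: str


def format_command_observation(result: CommandResult) -> str:
    """Success collapses to one confirmation line; failure keeps everything."""
    if result.exit_code == 0:
        return f"OK (exit 0): {result.command}"
    parts = [f"FAILED (exit {result.exit_code}): {result.command}"]
    if result.stderr.strip():
        parts.append("--- stderr ---\n" + result.stderr.rstrip())
    if result.stdout.strip():
        parts.append("--- stdout ---\n" + result.stdout.rstrip())
    return "\n".join(parts)


def format_test_observation(passed: int, failures: list[str]) -> str:
    """Passing tests are reported as a count; failing tests in full."""
    if not failures:
        return f"{passed} tests passed"
    header = f"{passed} passed, {len(failures)} failed"
    return "\n\n".join([header, *failures])
```

For example, a successful `make` run with thousands of lines of stdout becomes the single line `OK (exit 0): make`, while a failing run keeps its stderr and stdout verbatim.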

Truncation strategy by output size

Output size	Strategy
< 2K tokens	Pass through unchanged
2K-10K tokens, high signal	Per-tool summarization
2K-10K tokens, low signal	Head-tail truncation
10K-25K tokens	Offload to disk + preview
> 25K tokens	Offload + sub-agent delegation
Error output (any size)	Preserve verbatim up to 5K tokens
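The table can be read as a dispatch function. The sketch below is one possible encoding, under several assumptions: "2K" is taken as 2,000 tokens, the `high_signal` flag is supplied by the caller, boundary values fall into the smaller bucket, and error output above 5K tokens falls back to the size-based rules (the table does not say what happens in that case). The head-tail helper uses the rough 4-characters-per-token approximation.

```python
from enum import Enum


class Strategy(Enum):
    PASS_THROUGH = "pass_through"
    SUMMARIZE = "per_tool_summarization"
    HEAD_TAIL = "head_tail_truncation"
    OFFLOAD_PREVIEW = "offload_to_disk_with_preview"
    OFFLOAD_DELEGATE = "offload_with_subagent_delegation"
    PRESERVE_ERROR = "preserve_verbatim"


def choose_strategy(
    tokens: int, is_error: bool = False, high_signal: bool = False
) -> Strategy:
    """Map an output's token count to a strategy from the table."""
    if is_error and tokens <= 5_000:
        return Strategy.PRESERVE_ERROR
    if tokens < 2_000:
        return Strategy.PASS_THROUGH
    if tokens <= 10_000:
        return Strategy.SUMMARIZE if high_signal else Strategy.HEAD_TAIL
    if tokens <= 25_000:
        return Strategy.OFFLOAD_PREVIEW
    return Strategy.OFFLOAD_DELEGATE


def head_tail_truncate(text: str, max_tokens: int = 2_000) -> str:
    """Keep the start and end of the text and mark what was dropped."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    budget_chars = max_tokens * 4  # rough estimate: 4 characters per token
    if len(text) <= budget_chars:
        return text
    half = budget_chars // 2
    omitted = len(text) - 2 * half
    marker = f"\n[... {omitted} characters truncated ...]\n"
    return text[:half] + marker + text[-half:]
```

The explicit truncation marker doubles as the metadata-injection step: the model can see that content was removed and how much.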

Token limits

Use token-based limits, not line-based. Approximate with num_bytes / 4 when no tokenizer is available. 25K tokens (Claude Code default) is a well-tested ceiling.
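A minimal sketch of the byte-based approximation follows. The names `estimate_tokens`, `exceeds_ceiling`, and `MAX_OBSERVATION_TOKENS` are illustrative, and rounding up is an assumption chosen so that non-empty text never estimates to zero tokens.

```python
MAX_OBSERVATION_TOKENS = 25_000  # the ceiling mentioned above


def estimate_tokens(text: str) -> int:
    """Approximate the token count as UTF-8 byte length divided by 4."""
    num_bytes = len(text.encode("utf-8"))
    return (num_bytes + 3) // 4  # integer ceiling of num_bytes / 4


def exceeds_ceiling(text: str, limit: int = MAX_OBSERVATION_TOKENS) -> bool:
    """True when the estimated size is above the per-observation ceiling."""
    return estimate_tokens(text) > limit
```

Measuring bytes rather than lines avoids the failure mode where a single minified or log line of many kilobytes passes a line-count check.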

Context management strategy by task duration

Duration	Strategy
< 30 min	No management needed
30 min - 3 hours	FIC (Frequent Intentional Compaction) at phase boundaries; target 40-60% utilization
3+ hours	FIC with sub-agents (essential) or one-session-per-task resets
Automated loops	One-session-per-task with external state persistence
Parallel exploration	Sub-agent context isolation

Context quality zones

0-40%: High quality
40-60%: Optimal (FIC target)
60-80%: Degrading — activate per-tool summarizers
80-85%: Replace older tool results with ~15-token reference pointers
85-90%: Remove entire older turns
90%+: Emergency compaction or context reset
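The zones translate naturally into a threshold lookup. In the sketch below the action labels are hypothetical names, and treating each lower bound as inclusive is an assumption, since the ranges above share their endpoints.

```python
# (lower bound of utilization, action label), checked from highest to lowest.
ZONES = [
    (0.90, "emergency_compaction_or_reset"),
    (0.85, "remove_older_turns"),
    (0.80, "replace_old_tool_results_with_pointers"),
    (0.60, "activate_per_tool_summarizers"),
    (0.40, "optimal_fic_target"),
    (0.00, "none"),
]


def compaction_action(used_tokens: int, window_tokens: int) -> str:
    """Return the action for the zone that current utilization falls into."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    utilization = used_tokens / window_tokens
    for lower_bound, action in ZONES:
        if utilization >= lower_bound:
            return action
    return "none"  # only reached for negative used_tokens


print(compaction_action(140_000, 200_000))  # 70% utilization
```

A harness would call this once per turn, after measuring the current context size, and escalate only when the returned action changes.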

Compaction preference hierarchy

Best to worst quality:

Raw context (keep original)
Tool result clearing (API-level)
Observation masking (52% cheaper than summarization, +2.6% solve rate)
Structured summarization with anchored sections
Free-form summarization (last resort)

When producing observation-context-spec.md, include this hierarchy by name and in this order. State the default rule as:

Keep raw context when possible; if space is needed, clear bulky tool results first, then mask low-signal observations, then use anchored structured summaries. Use free-form summarization only as a last resort.

The spec must distinguish these approaches instead of saying "compact old context" generically, and it must explicitly preserve failed attempts, error traces, and rejected strategies across every compaction step.
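As an illustration of the masking step and the preservation requirement, the sketch below replaces older tool outputs with a short placeholder while leaving error outputs and the most recent results intact. The `Message` shape, the `is_error` flag, and the placeholder wording are all assumptions; a real harness would use whatever message format its model API defines.

```python
from typing import TypedDict


class Message(TypedDict):
    # Hypothetical message shape used only for this sketch.
    role: str  # "user", "assistant", or "tool"
    content: str
    is_error: bool  # True for failed attempts and error traces


def mask_old_observations(
    messages: list[Message], keep_recent: int = 3
) -> list[Message]:
    """Mask older non-error tool outputs; never touch errors or recent results."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    recent = set(tool_indices[-keep_recent:]) if keep_recent > 0 else set()
    masked: list[Message] = []
    for i, message in enumerate(messages):
        maskable = (
            message["role"] == "tool"
            and i not in recent
            and not message["is_error"]
        )
        if maskable:
            placeholder = (
                f"[tool output masked: {len(message['content'])} chars omitted]"
            )
            masked.append({**message, "content": placeholder})
        else:
            masked.append(message)
    return masked
```

The function returns a new list and leaves the input untouched, so the unmasked history stays available for offloading or later inspection.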

KV-cache preservation

Maintain stable prefixes. Make context modifications append-only. Serialize JSON deterministically (sorted keys). Never remove or reorder tools mid-session: cache invalidation raises input cost roughly 10x ($0.30/MTok for cached input vs. $3.00/MTok uncached).
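Two of these rules are easy to check in code. The sketch below uses Python's standard `json` module for deterministic serialization and a simple prefix comparison for the append-only property; the function names and the tool-definition shape are illustrative assumptions.

```python
import json


def serialize_tools(tools: list[dict]) -> str:
    """Serialize tool definitions so identical content yields identical bytes.

    Keys inside each object are sorted; the order of the tool list itself
    is preserved, since reordering tools would change the prefix.
    """
    return json.dumps(
        tools, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def is_append_only(previous: list[str], current: list[str]) -> bool:
    """True when the new context only adds entries after the old ones."""
    return current[: len(previous)] == previous


a = [{"name": "read_file", "description": "Read a file"}]
b = [{"description": "Read a file", "name": "read_file"}]
print(serialize_tools(a) == serialize_tools(b))
```

Running `is_append_only` as an assertion before each request is a cheap way to catch accidental edits to earlier turns during development.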