sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).

100

1.23x

Quality

100%

Does it follow best practices?

Impact

100%

1.23x

Average score across 4 eval scenarios

Securityby

Passed

No known issues

Phase 8: Prompt Engineering

Name: sharaf/agentic-harness-architect
Rating: 100 (1 reviews)
Author: sharaf

System prompt design principles

Start minimal, add targeted instructions only when eval data shows failure (Right Altitude Framework)
Every token must earn its place; reasoning degrades around ~3,000 prompt tokens
Use three-tier boundary system: always do / ask-first / never do
Reframe negatives as positive directives; reserve "never" for genuine safety constraints
Include concrete code examples, not prose descriptions of code style

Positive framing — rewrite EVERY restriction as a directive

Every constraint that can be expressed as an action the agent should take must be written that way. The ONLY negative-framed lines that may appear in the final prompt are inside the "Never Do" tier of the three-tier permission structure (safety-critical only — typically ≤3 items).

❌ Negative phrasing	✅ Positive directive
"Do not use line-number-based patches"	"Use `str_replace` with exact unique strings to edit files"
"Don't return vague error messages"	"Return errors with expected format, constraints, and a correction example"
"Avoid relative paths"	"Use absolute filepaths for every file operation"
"Do not modify the user's git config"	"Limit git operations to commits and reads of the working tree"
"Never edit files outside the repo"	"Confine all edits to paths inside the configured repository root"
"Don't write to stdout in tool outputs"	"Route all tool output through the structured logger"
"Avoid hardcoding API keys"	"Read every secret from the configured env vars or vault path"

Apply this rewrite pass once after the first draft, before shipping the prompt.

Just-in-time steering — inject guidance at decision points

Don't front-load every instruction in the system prompt. Instead, inject contextual guidance immediately before the relevant decision or tool call. Achieves 100% accuracy across 600 runs vs. 82.5% for upfront-only instructions.

When the deliverable includes system-prompt.md, prompt-architecture.md, or another prompt architecture artifact, include a visible ## Just-in-Time Steering Protocol section. This section must say:

The base system prompt contains only stable identity, permissions, output format, and safety boundaries.
Decision-specific guidance is injected just before the relevant tool call, file edit, permission decision, verification step, or recovery step.
The injected guidance is removed after that decision unless a failure requires one post-result hint on the next turn.

Use the words "just-in-time" or "contextual steering" explicitly, and include a small trigger table like this:

Trigger	Injected guidance	Removed when
Editing a file over 500 lines	Require exact string replacement with 3+ lines of context	Edit succeeds or one retry fails
Touching an ask-first path	Require explicit user approval before write	Approval is granted or denied
Verification fails	Preserve the error verbatim and retry only after root-cause analysis	Next correction is attempted

Implementation pattern — three layers:

System prompt — stable identity, three-tier permissions, output format. ~500-1500 tokens, never changes mid-session.
Pre-tool guidance — small (~50-150 token) blocks injected immediately before specific tool calls, conditional on context.
Post-result hints — micro-corrections after tool failures, injected into the next turn only.

Example — file-edit tool

[System prompt — always present]
You edit files using str_replace with absolute paths.

[Injected before any str_replace call when target_file > 500 lines]
This file is 1,247 lines. Include >=3 lines of context on each side
of your old_string to guarantee uniqueness. If your match fails,
re-read the relevant region before retrying.

[Injected after a failed str_replace]
Your previous str_replace failed: old_string was not unique.
Add more surrounding context and retry once before considering
a different edit strategy.

The system prompt stays minimal. The just-in-time blocks fire only when relevant — so the prompt the agent sees at each decision point is laser-targeted.

Prompt architecture by task type

Task type	Architecture
Single-turn completable	Minimal prompt via Right Altitude
Multi-session	Differentiated prompts (initializer + continuation) with JSON feature tracking
Multi-step within session	Technical Design Spec with sprint contracts
Subjective quality	Generator-Evaluator with separated rubric prompts

Specification granularity — what to specify vs. leave implicit

Requirement type	Specify?	Rationale
Format requirements	Usually no	70.7% guessed correctly; specifying adds bloat
Conditional logic	Always	Only 22.9% guessed correctly; high regression risk
Error handling patterns	Yes	Model-dependent; regresses across updates
Code style	Yes, with examples	Actual code, not prose
Implementation details	No	Over-specification causes cascade failures

When producing a ready-to-use system prompt, include a compact ## Conditional Rules section with explicit branches. Use if / then / otherwise wording, not implicit prose. Include at least three rules covering permission decisions, large-file edits, verification, or recovery:

## Conditional Rules

- If a target file is larger than 500 lines, then edit with exact string
  replacement and at least 3 lines of context; otherwise use the standard edit
  path.
- If a change touches an ask-first path, then request approval before writing;
  otherwise proceed within the Always Do permissions.
- If verification fails, then preserve the error verbatim, identify root cause,
  and retry once; otherwise summarize the passing check count only.

Code style guidance requires concrete code blocks

When the system prompt specifies code style, include 2-3 verbatim code blocks showing the desired style. Prose descriptions ("use modern idioms", "keep functions short") produce inconsistent results because models infer different styles from the same words. A code block is unambiguous.

Include code blocks for: naming conventions, error-handling pattern, async/concurrency pattern, module/file structure, comment style. Even one well-chosen example anchors the model better than 10 lines of prose.

# In your produced system-prompt.md, include sections like:

## Code style

Use type hints on every public function. Prefer explicit returns
over implicit None. Use f-strings for formatting.

# Example of the style we expect:

def parse_config(path: Path) -> Config:
    """Load a Config from a YAML file."""
    if not path.exists():
        raise FileNotFoundError(f"Config not found at {path}")
    raw = yaml.safe_load(path.read_text())
    return Config.model_validate(raw)

The actual code blocks become part of the system prompt — the agent sees them at every step. Always include at least two examples covering different aspects of the style.

Evaluator prompts deserve equal investment — and aspirational framing

The evaluator's opening line activates a quality register that propagates through the entire evaluation. Utilitarian openers ("Check the work for issues") produce permissive, surface-level reviews. Aspirational openers produce rigorous, standards-driven reviews.

Open every evaluator prompt with at least one of these aspirational phrases (or equivalents in your domain):

"You are an expert {domain} reviewer..."
"Evaluate against the standard of an exceptional {profession}..."
"Score against the highest standard of {quality dimension}..."
"Review with the rigor of a museum-quality assessment..."

Verbatim example evaluator opener (use as a template, swap the domain):

You are an expert code migration reviewer with the standards of a museum
curator examining a Renaissance painting — every fragment must earn its
place on the wall. The migration you are evaluating must be:

- idiomatically modern (no remnants of the legacy style)
- conceptually clean (avoid AI slop, copy-paste cargo cult, defensive
  pseudo-coverage)
- worthy of a senior engineer's signature on the PR

Score the migration against the rubric below. For each criterion, cite
the specific lines or patterns that earn or fail the score.

Two stylistic levers in that example: (1) aspirational phrases like "expert", "exceptional", "highest standard", "museum-quality" — these activate the quality register; (2) anti-pattern callouts like "AI slop", "cargo cult", "defensive pseudo-coverage" — these steer more effectively than positive-only instructions because they name the failure modes directly.

Always include both registers in the evaluator opener — aspirational ceiling + anti-pattern floor.