sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).

Phase 6: Evaluation Design

Name: sharaf/agentic-harness-architect
Rating: 100 (1 review)
Author: sharaf

Separating generation from evaluation is the single most impactful harness design decision. Self-evaluation bias inflates scores by 4-9% across all tested models.

Evaluator architecture decision tree

Is the output objectively testable?
- Yes, comprehensive test suite exists → deterministic evaluation only, no LLM needed
- Yes, incomplete test suite → LLM-generated tests, then deterministic evaluation
- No → LLM-based evaluation required
If LLM-based:
- Same model as generator? → HIGH RISK (bias 0.520); mitigate via different model, 3-4 model ensemble, or perplexity-aware weighting
- Different model → Lower bias; still calibrate
- Consistent domain → Fixed calibrated rubric
- Variable domain → Adaptive rubric generation (AdaRubric)
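
The decision tree above can be sketched as a small routing function. This is a minimal illustration, not part of the skill itself: the names `SuiteCoverage`, `EvaluatorPlan`, and `choose_evaluator`, and the strategy strings, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class SuiteCoverage(Enum):
    COMPREHENSIVE = "comprehensive"  # objectively testable, full suite exists
    INCOMPLETE = "incomplete"        # objectively testable, suite has gaps
    NOT_TESTABLE = "not_testable"    # no objective test is possible


@dataclass(frozen=True)
class EvaluatorPlan:
    strategy: str
    rubric: Optional[str] = None
    notes: Tuple[str, ...] = ()


def choose_evaluator(coverage: SuiteCoverage,
                     same_model_as_generator: bool = False,
                     consistent_domain: bool = True) -> EvaluatorPlan:
    """Walk the decision tree top to bottom and return a plan."""
    if coverage is SuiteCoverage.COMPREHENSIVE:
        return EvaluatorPlan("deterministic_only")
    if coverage is SuiteCoverage.INCOMPLETE:
        return EvaluatorPlan("llm_generated_tests_then_deterministic")

    # Not objectively testable: an LLM judge is required.
    if same_model_as_generator:
        notes = ("high bias risk: switch model, use a 3-4 model ensemble, "
                 "or apply perplexity-aware weighting",)
    else:
        notes = ("lower bias: still calibrate",)
    rubric = "fixed_calibrated" if consistent_domain else "adaptive"
    return EvaluatorPlan("llm_judge", rubric, notes)
```

Returning a plan object rather than building the evaluator directly keeps the routing decision easy to log and audit.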

Evaluation ordering — cheap deterministic checks first

1. Compilation / syntax check
2. Static analysis / linting
3. Type checking
4. Test execution (FAIL_TO_PASS + PASS_TO_PASS)
5. Mutation analysis (test quality)
6. LLM-based holistic quality assessment (only if needed)
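
One way to enforce this ordering is a short-circuiting pipeline that stops at the first failing stage, so the expensive stages only run on candidates that survive the cheap ones. The sketch below assumes each check is a zero-argument callable returning `(passed, detail)`; `run_stages` and `demo_stages` are hypothetical names, and the demo checks are placeholders for real compiler, linter, and test-runner invocations.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A check returns (passed, detail). Real checks would shell out to a compiler,
# linter, type checker, or test runner; the ones below are placeholders.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class StageResult:
    name: str
    passed: bool
    detail: str


def run_stages(stages: List[Tuple[str, Check]]) -> List[StageResult]:
    """Run stages in order and stop at the first failure."""
    results: List[StageResult] = []
    for name, check in stages:
        passed, detail = check()
        results.append(StageResult(name, passed, detail))
        if not passed:
            break  # later, more expensive stages never run
    return results


def demo_stages(calls: List[str]) -> List[Tuple[str, Check]]:
    """Build placeholder stages that record the order in which they ran."""
    def stage(name: str, ok: bool) -> Tuple[str, Check]:
        def check() -> Tuple[bool, str]:
            calls.append(name)
            return ok, ("ok" if ok else "failed")
        return name, check

    return [
        stage("syntax", True),
        stage("lint", True),
        stage("types", False),   # simulated type error
        stage("tests", True),
        stage("mutation", True),
        stage("llm_review", True),
    ]
```

With the simulated type error, `run_stages(demo_stages(calls))` leaves `calls` holding only the first three stage names: test execution, mutation analysis, and the LLM review never start.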

Rubric design

- Present complete rubrics holistically (CRE), not criterion-by-criterion (PRE)
- CRE achieves a leniency score of 0.082 (well-calibrated) vs. PRE's -0.329 (excessively strict)
- CRE Pearson correlation with human evaluators: 0.912
- Calibrate rubrics by comparing evaluator logs to human expectations over several rounds
- Weight dimensions where models underperform (design quality, originality) more heavily
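
As a rough illustration of holistic presentation and dimension weighting, the sketch below builds a single prompt containing the entire rubric and combines per-dimension scores with a weighted mean. The rubric contents, the weights, the 1 to 5 scale, and the function names are assumptions made for the example; the source does not specify them.

```python
from typing import Dict, Tuple

# dimension -> (weight, description). Weights and wording are illustrative;
# design quality and originality are weighted more heavily, per the guidance.
RUBRIC: Dict[str, Tuple[float, str]] = {
    "correctness": (1.0, "Behaves as specified on the stated inputs."),
    "design_quality": (2.0, "Clear structure, sensible abstractions."),
    "originality": (2.0, "Goes beyond the most generic solution."),
}


def build_holistic_prompt(rubric: Dict[str, Tuple[float, str]],
                          submission: str) -> str:
    """Put every criterion in ONE prompt so the judge sees the whole rubric."""
    lines = [
        "Score the submission from 1 to 5 on each dimension below.",
        "Read the full rubric before scoring anything.",
        "",
    ]
    for name, (weight, description) in rubric.items():
        lines.append(f"- {name} (weight {weight}): {description}")
    lines += ["", "Submission:", submission]
    return "\n".join(lines)


def weighted_score(rubric: Dict[str, Tuple[float, str]],
                   scores: Dict[str, float]) -> float:
    """Weighted mean of per-dimension scores."""
    total_weight = sum(weight for weight, _ in rubric.values())
    return sum(rubric[name][0] * scores[name] for name in rubric) / total_weight
```

A criterion-by-criterion variant would instead issue one prompt per dimension; the single prompt here is the holistic style the list above recommends.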

Iteration limits

- Simple (single function): 1-2 iterations
- Multi-file feature: 2-3 iterations
- Full-stack sprint (UI + API + DB): 3-5 iterations
- Returns diminish sharply after initial iterations; cap generator-critic loops at 3
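
A bounded generator-critic loop along these lines might look like the following. It is a sketch under assumptions: the cap table uses the upper end of each range above, `hard_cap` is a hypothetical optional override (for example, 3 for generator-critic loops), and `generate` and `critique` stand in for whatever model calls the harness makes.

```python
from typing import Callable, Optional, Tuple

# Upper end of each range from the list above (an assumption of this sketch).
ITERATION_CAPS = {
    "single_function": 2,
    "multi_file_feature": 3,
    "full_stack_sprint": 5,
}


def refine(generate: Callable[[Optional[str]], str],
           critique: Callable[[str], Tuple[bool, str]],
           task_kind: str,
           hard_cap: Optional[int] = None) -> Tuple[str, int, bool]:
    """Alternate generate and critique until accepted or the cap is reached.

    Returns (last_candidate, iterations_used, accepted).
    """
    cap = ITERATION_CAPS[task_kind]
    if hard_cap is not None:
        cap = min(cap, hard_cap)

    feedback: Optional[str] = None  # first generation has no critic feedback
    candidate = ""
    for iteration in range(1, cap + 1):
        candidate = generate(feedback)
        accepted, feedback = critique(candidate)
        if accepted:
            return candidate, iteration, True
    return candidate, cap, False
```

Returning the `accepted` flag alongside the candidate lets the caller distinguish a passing result from one that merely ran out of budget.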

Optimize for precision, not catch rate

When evaluators surface candidate issues to developers, every false positive costs trust. Tooling that achieves a high catch rate by also flagging noise gets abandoned: at ~9 false positives per real bug, teams disable the tool entirely.

- Tune evaluators to MAXIMIZE precision (fraction of flagged items that are real), not recall (fraction of real issues flagged).
- Set a precision floor (e.g., ≥ 80% of flagged comments must be actionable) before any other tuning.
- It is acceptable to MISS real issues if the alternative is shipping noise to developers: humans will catch what the evaluator misses, but they will not tolerate a tool that cries wolf.
- Measure both precision and recall on a labeled golden set every release; treat precision regressions as launch-blocking.
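
The release check described above can be sketched as a comparison of flagged items against a labeled golden set. Item identifiers, function names, and the choice to report precision as 1.0 when nothing is flagged are assumptions of this example. For reference, 9 false positives per real bug corresponds to a precision of 1/10 = 0.10.

```python
from typing import Set, Tuple


def precision_recall(flagged: Set[str], real: Set[str]) -> Tuple[float, float]:
    """Compute precision and recall of flagged items against golden labels.

    Convention (an assumption): an empty flagged set has precision 1.0,
    since no noise was shipped; an empty golden set has recall 1.0.
    """
    true_positives = len(flagged & real)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(real) if real else 1.0
    return precision, recall


def passes_precision_floor(flagged: Set[str], real: Set[str],
                           floor: float = 0.80) -> bool:
    """Launch gate: block the release when precision falls below the floor."""
    precision, _ = precision_recall(flagged, real)
    return precision >= floor
```

Recall is still computed and should be tracked release over release, but only precision gates the launch in this sketch.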

Filtering for developer-facing output

- Use a two-stage judge agent filter (HubSpot pattern): primary agent generates comments, separate judge evaluates for succinctness, accuracy, and actionability
- Result: 80%+ approval rate, sub-3% negative feedback
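
The two-stage shape can be sketched as follows. This is not HubSpot's implementation: `Verdict`, `filter_comments`, and `stub_judge` are hypothetical, and the stub uses toy string heuristics where a real system would call a separate judge model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Verdict:
    succinct: bool
    accurate: bool
    actionable: bool

    @property
    def approved(self) -> bool:
        # A comment must clear all three criteria to reach developers.
        return self.succinct and self.accurate and self.actionable


Judge = Callable[[str], Verdict]


def filter_comments(candidates: List[str], judge: Judge) -> List[str]:
    """Stage 2: keep only the primary agent's comments the judge approves."""
    return [comment for comment in candidates if judge(comment).approved]


def stub_judge(comment: str) -> Verdict:
    """Toy heuristic standing in for a separate LLM judge."""
    lowered = comment.lower()
    return Verdict(
        succinct=len(comment) <= 120,
        accurate="maybe" not in lowered,
        actionable=lowered.startswith(("fix", "rename", "add", "remove")),
    )
```

Keeping the judge behind a plain callable makes it easy to swap the stub for a model-backed judge and to replay the same candidates against both when calibrating.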