
sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).


Phase 6: Evaluation Design

Separating generation from evaluation is the single most impactful harness design decision. Self-evaluation bias inflates scores by 4-9% across all tested models.

Evaluator architecture decision tree

  1. Is the output objectively testable?
    • Yes, comprehensive test suite exists → deterministic evaluation only, no LLM needed
    • Yes, incomplete test suite → LLM-generated tests, then deterministic evaluation
    • No → LLM-based evaluation required
  2. If LLM-based:
    • Same model as generator? → HIGH RISK (bias 0.520); mitigate via different model, 3-4 model ensemble, or perplexity-aware weighting
    • Different model → Lower bias; still calibrate
  3. Which rubric?
    • Consistent domain → Fixed calibrated rubric
    • Variable domain → Adaptive rubric generation (AdaRubric)
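
The decision tree above can be sketched as a small function. This is only an illustration: the `EvalContext` fields, the `choose_evaluator` name, and the step strings are assumptions made for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class EvalContext:
    """Hypothetical inputs to the decision tree (field names are assumptions)."""
    objectively_testable: bool
    test_suite_complete: bool
    judge_is_generator_model: bool
    consistent_domain: bool


def choose_evaluator(ctx: EvalContext) -> list[str]:
    """Return an ordered evaluation plan that follows the decision tree."""
    if ctx.objectively_testable:
        if ctx.test_suite_complete:
            # Comprehensive tests exist: no LLM judge is needed.
            return ["deterministic tests"]
        # Incomplete suite: fill the gaps first, then evaluate deterministically.
        return ["LLM-generated tests", "deterministic tests"]

    plan = []
    if ctx.judge_is_generator_model:
        # High-risk branch: the judge shares the generator's biases.
        plan.append("mitigate self-evaluation bias")
    else:
        # A different model lowers bias but still needs calibration.
        plan.append("calibrate judge")
    plan.append(
        "fixed calibrated rubric" if ctx.consistent_domain
        else "adaptive rubric generation"
    )
    plan.append("LLM-based evaluation")
    return plan
```

Returning a plan rather than running anything keeps the decision itself easy to unit-test and log.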

Evaluation ordering — cheap deterministic checks first

  1. Compilation / syntax check
  2. Static analysis / linting
  3. Type checking
  4. Test execution (FAIL_TO_PASS + PASS_TO_PASS)
  5. Mutation analysis (test quality)
  6. LLM-based holistic quality assessment (only if needed)
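
A minimal sketch of this ordering, assuming each stage is a callable that returns `(passed, detail)`. The runner stops at the first failure, so later and more expensive stages never run on a candidate that an earlier stage already rejected. Only the syntax stage is implemented here (with Python's built-in `compile`); `run_pipeline`, `StageResult`, and the stage names are hypothetical.

```python
from typing import Callable, NamedTuple


class StageResult(NamedTuple):
    stage: str
    passed: bool
    detail: str


# A check takes candidate source code and returns (passed, detail).
Check = Callable[[str], tuple[bool, str]]


def syntax_check(source: str) -> tuple[bool, str]:
    """Stage 1: cheapest possible check, using the built-in compiler."""
    try:
        compile(source, "<candidate>", "exec")
        return True, "ok"
    except SyntaxError as exc:
        return False, f"line {exc.lineno}: {exc.msg}"


def run_pipeline(source: str, stages: list[tuple[str, Check]]) -> list[StageResult]:
    """Run stages in the given order and stop at the first failure."""
    results = []
    for name, check in stages:
        passed, detail = check(source)
        results.append(StageResult(name, passed, detail))
        if not passed:
            break  # skip every later (more expensive) stage
    return results
```

In use, the stage list would mirror the six steps above, with the LLM-based assessment registered last.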

Rubric design

  • Present complete rubrics holistically (CRE), not criterion-by-criterion (PRE)
  • CRE achieves leniency score of 0.082 (well-calibrated) vs. PRE's -0.329 (excessively strict)
  • CRE Pearson correlation with human evaluators: 0.912
  • Calibrate rubrics by comparing evaluator logs to human expectations over several rounds
  • Weight dimensions where models underperform (design quality, originality) more heavily
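
As a rough illustration of holistic presentation and dimension weighting, the sketch below puts the whole rubric into a single judge prompt and combines per-criterion scores with a weighted mean. The criteria, descriptions, weights, and the 1-5 scale are all placeholders chosen for the example, not values from the source.

```python
# criterion -> (description, weight); every value here is illustrative
RUBRIC = {
    "correctness": ("Meets the stated requirements", 1.0),
    "design quality": ("Clear structure and sensible abstractions", 2.0),
    "originality": ("Avoids generic, boilerplate solutions", 2.0),
}


def build_holistic_prompt(rubric: dict, submission: str) -> str:
    """Present the complete rubric in one prompt instead of one call per criterion."""
    lines = [
        "Read the full rubric first, then score every criterion from 1 to 5.",
        "",
    ]
    for name, (description, weight) in rubric.items():
        lines.append(f"- {name} (weight {weight}): {description}")
    lines += ["", "Submission:", submission]
    return "\n".join(lines)


def weighted_score(rubric: dict, scores: dict) -> float:
    """Weighted mean of per-criterion scores; heavier weights count for more."""
    total_weight = sum(weight for _, weight in rubric.values())
    weighted_sum = sum(scores[name] * weight for name, (_, weight) in rubric.items())
    return weighted_sum / total_weight
```

Doubling the weight on design quality and originality means a submission that is correct but generic scores noticeably lower than its raw average would suggest.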

Iteration limits

  • Simple (single function): 1-2 iterations
  • Multi-file feature: 2-3 iterations
  • Full-stack sprint (UI + API + DB): 3-5 iterations
  • Returns diminish sharply after initial iterations; cap generator-critic loops at 3
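
One way to enforce these limits is a bounded generator-critic loop. In this sketch the per-task budgets take the upper end of each range listed above, and an unknown task kind falls back to the cap of 3; `refine`, `generate`, `critique`, and the task-kind keys are hypothetical names.

```python
from typing import Callable, Optional

# Upper end of each range from the list above; keys are illustrative.
ITERATION_BUDGET = {"simple": 2, "multi_file": 3, "full_stack": 5}
DEFAULT_CAP = 3  # fallback cap for generator-critic loops


def refine(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str], tuple[bool, str]],
    task_kind: str = "",
) -> tuple[str, int, bool]:
    """Run generate/critique rounds until accepted or the budget is spent.

    Returns (last_draft, iterations_used, accepted).
    """
    budget = ITERATION_BUDGET.get(task_kind, DEFAULT_CAP)
    draft = ""
    feedback: Optional[str] = None
    for iteration in range(1, budget + 1):
        draft = generate(task, feedback)      # first round has no feedback
        accepted, feedback = critique(draft)  # critic returns a verdict and notes
        if accepted:
            return draft, iteration, True
    return draft, budget, False  # budget exhausted: hand off rather than loop on
```

Returning the `accepted` flag lets the caller escalate an unaccepted draft to a human instead of silently shipping it.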

Optimize for precision, not catch rate

When evaluators surface candidate issues to developers, every false positive costs trust. Tooling that achieves a high catch rate by also flagging noise gets abandoned: at ~9 false positives per real bug, teams disable the tool entirely.

  • Tune evaluators to MAXIMIZE precision (fraction of flagged items that are real), not recall (fraction of real issues flagged).
  • Set a precision floor (e.g., ≥ 80% of flagged comments must be actionable) before any other tuning.
  • It is acceptable to MISS real issues if the alternative is shipping noise to developers — humans will catch what the evaluator misses; humans will not tolerate a tool that cries wolf.
  • Measure both precision and recall on a labeled golden set every release; treat precision regressions as launch-blocking.
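
A small sketch of the golden-set measurement, assuming flagged items and labeled real issues are identified by string IDs. The `release_gate` helper and its default floor of 0.80 mirror the example threshold above; treating an empty flag set as precision 1.0 is a convention chosen here, not something the source specifies.

```python
def precision_recall(flagged: set[str], real_issues: set[str]) -> tuple[float, float]:
    """Precision: share of flagged items that are real.
    Recall: share of real issues that were flagged."""
    true_positives = len(flagged & real_issues)
    # Assumption: with nothing flagged (or nothing to find), report 1.0.
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(real_issues) if real_issues else 1.0
    return precision, recall


def release_gate(flagged: set[str], real_issues: set[str],
                 precision_floor: float = 0.80) -> dict:
    """Block the release when precision falls below the floor; recall is only reported."""
    precision, recall = precision_recall(flagged, real_issues)
    return {
        "precision": precision,
        "recall": recall,
        "blocked": precision < precision_floor,
    }
```

For scale: an evaluator that raises 9 false positives for every real bug has a precision of 1/10, far below an 80% floor.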

Filtering for developer-facing output

  • Use a two-stage judge-agent filter (HubSpot pattern): a primary agent generates comments, and a separate judge agent evaluates each one for succinctness, accuracy, and actionability
  • Reported result: approval rate above 80%, negative feedback below 3%
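
The two-stage filter can be sketched as follows. The `judge` callable stands in for the separate judge agent (in practice an LLM call), and requiring all three criteria to pass is an assumption of this example; the source does not say how the judge's verdicts are combined. `Verdict`, `filter_comments`, and `stub_judge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    succinct: bool
    accurate: bool
    actionable: bool


def filter_comments(comments: list[str],
                    judge: Callable[[str], Verdict]) -> list[str]:
    """Stage 2: keep only comments the judge approves on every criterion."""
    kept = []
    for comment in comments:
        verdict = judge(comment)
        # Assumption: a comment must pass all three checks to reach developers.
        if verdict.succinct and verdict.accurate and verdict.actionable:
            kept.append(comment)
    return kept


def stub_judge(comment: str) -> Verdict:
    """Placeholder heuristics standing in for a real judge model."""
    return Verdict(
        succinct=len(comment) <= 80,
        accurate="maybe" not in comment.lower(),
        actionable=comment.startswith(("Use", "Fix", "Remove")),
    )
```

Keeping the judge behind a plain callable makes it easy to swap the stub for a real model and to replay logged comments through the filter when tuning precision.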
