
sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).


evals/scenario-4/task.md

Evaluation Pipeline for an Automated Code Review Agent

Background

A software tools company is launching an automated code review agent that analyzes pull requests and produces inline comments, summary feedback, and a go/no-go recommendation. The agent targets teams that ship frequently and need fast, consistent feedback — particularly for catching logic errors, security issues, and style deviations before human review.

The team has a working generator but no real evaluation infrastructure. Their current approach is a self-check: after the agent produces its review, a second call is made to the same model with the instruction "check your own work." During dogfooding they found that this produces consistently over-optimistic assessments, and the agent almost never recommends changes to its own output. A developer who compared 50 agent outputs against the ground-truth human reviews estimated that about 30% of the agent's "approved" outputs actually had significant issues.
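The developer's estimate above is a false-approval rate: among outputs the self-check approved, the fraction where the human review found significant issues. A minimal sketch of that calculation follows. The names `ReviewedOutput`, `self_check_approved`, `human_found_issues`, and `false_approval_rate` are hypothetical, and the sample data is illustrative only, not the team's actual measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReviewedOutput:
    """One agent review paired with its ground-truth human review (hypothetical schema)."""
    self_check_approved: bool   # verdict from the "check your own work" call
    human_found_issues: bool    # True if the human review found significant issues


def false_approval_rate(samples: Iterable[ReviewedOutput]) -> float:
    """Fraction of self-approved outputs that humans flagged as having significant issues."""
    approved = [s for s in samples if s.self_check_approved]
    if not approved:
        return 0.0  # nothing was approved, so there is nothing to be wrong about
    flagged = sum(1 for s in approved if s.human_found_issues)
    return flagged / len(approved)


# Illustrative data only: 10 approved outputs, 3 of which humans flagged,
# plus 2 outputs the self-check rejected (these do not affect the rate).
samples = (
    [ReviewedOutput(True, True)] * 3
    + [ReviewedOutput(True, False)] * 7
    + [ReviewedOutput(False, True)] * 2
)
rate = false_approval_rate(samples)
print(f"false-approval rate: {rate:.0%}")
```

Tracking this one number over time, on a fixed human-labeled sample, is one way to tell whether a redesigned evaluator is actually less optimistic than the self-check it replaces.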

The team wants to redesign the evaluation system end-to-end. They have access to the same frontier models as the generator, standard CI tooling (linters, type checkers, test runners), and the git history for each PR. The reviews are developer-facing — engineers will read them directly and act on them.

Output Specification

Produce a document evaluation-design.md that specifies:

  1. The architecture of the evaluation pipeline — what stages run, in what order, and what each stage checks
  2. How the LLM-based evaluator is structured (model selection, rubric design, how the rubric is presented to the evaluator)
  3. How iteration between generator and evaluator is managed, including how many refinement cycles are allowed
  4. How the evaluation is calibrated and how to detect when the evaluator has drifted from human expectations
  5. Any special handling needed because the output goes directly to developers

Be specific — include the ordering of evaluation stages, numerical limits, and concrete recommendations the team can implement.
