sharaf/agentic-harness-architect

Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system — covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery (greenfield design or audit mode).

Quality: 100% (Does it follow best practices?)

Impact: 100%, 1.23x (average score across 4 eval scenarios)

Security by Snyk: Passed, no known issues

evals/scenario-4/criteria.json

{
  "context": "Tests whether the agent designs an evaluation pipeline following the skill's evaluation architecture guidelines: correct evaluation ordering (syntax→linting→type→test→mutation→LLM), separate evaluator model from generator, holistic rubric presentation (CRE not PRE), calibration approach, iteration limits by task scope, two-stage judge for developer-facing output.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Evaluation stage ordering",
      "description": "evaluation-design.md lists evaluation stages in the correct order: syntax checking before linting, linting before type checking, type checking before test execution, tests before mutation testing or LLM assessment — does NOT invert this sequence",
      "max_score": 14
    },
    {
      "name": "LLM evaluation stage last",
      "description": "evaluation-design.md explicitly places LLM-based assessment after all deterministic stages (syntax, linting, type checks, tests) — not as the first or only evaluation step",
      "max_score": 10
    },
    {
      "name": "Separate evaluator model",
      "description": "evaluation-design.md specifies using a different model (or an ensemble) for the LLM evaluator than the generator — does NOT instruct the generator to self-evaluate",
      "max_score": 12
    },
    {
      "name": "Holistic rubric presentation (CRE)",
      "description": "evaluation-design.md specifies presenting the rubric holistically (all criteria at once, or consolidated) rather than criterion-by-criterion — does NOT describe scoring each criterion independently in sequence",
      "max_score": 10
    },
    {
      "name": "Calibration approach specified",
      "description": "evaluation-design.md describes a calibration mechanism: comparing evaluator outputs to human-labeled examples, or logging evaluator decisions for human review",
      "max_score": 8
    },
    {
      "name": "Precision over catch rate",
      "description": "evaluation-design.md explicitly states that the evaluation should optimize for precision (avoiding false positives) rather than maximizing catch rate",
      "max_score": 8
    },
    {
      "name": "Iteration limit stated",
      "description": "evaluation-design.md specifies a numeric upper bound on generator-evaluator refinement iterations (e.g. 2-3 cycles for feature-level output) — not an open-ended loop",
      "max_score": 10
    },
    {
      "name": "Two-stage judge for developer-facing output",
      "description": "evaluation-design.md includes a two-stage or secondary judge/filter step specifically for developer-facing output (the review comments), distinct from the primary quality assessment",
      "max_score": 12
    },
    {
      "name": "Self-evaluation bias addressed",
      "description": "evaluation-design.md acknowledges or addresses the self-evaluation bias problem (the generator evaluating its own output), either by naming it explicitly or by structural separation",
      "max_score": 8
    },
    {
      "name": "Deterministic tests before LLM",
      "description": "evaluation-design.md specifies that deterministic checks (linting, type checking, test runners) are run before LLM assessment — early termination is possible if deterministic checks fail",
      "max_score": 8
    }
  ]
}
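
The file declares a `weighted_checklist` but does not say how per-item scores are combined. The sketch below shows one plausible aggregation, under stated assumptions: each item is awarded between 0 and its `max_score`, and the result is the share of available points earned. The `score_checklist` helper and the `awarded` mapping are hypothetical names used for illustration, not part of the registry's tooling, and the embedded JSON is a two-item excerpt of the checklist above. In the full file the ten `max_score` values sum to 100, so under this assumption the earned total would already read as a percentage.

```python
import json

# Two-item excerpt of the checklist above, trimmed to the fields used here.
CRITERIA_JSON = """
{
  "type": "weighted_checklist",
  "checklist": [
    {"name": "Evaluation stage ordering", "max_score": 14},
    {"name": "LLM evaluation stage last", "max_score": 10}
  ]
}
"""


def score_checklist(criteria: dict, awarded: dict) -> float:
    """Return the percentage of available points earned.

    Assumption: items are scored independently, each award is clamped to
    the range [0, max_score], and missing items count as 0.
    """
    total = 0.0
    earned = 0.0
    for item in criteria["checklist"]:
        cap = item["max_score"]
        total += cap
        raw = awarded.get(item["name"], 0)
        earned += min(max(raw, 0), cap)  # clamp to [0, cap]
    if total == 0:
        raise ValueError("checklist has no scorable items")
    return 100.0 * earned / total


criteria = json.loads(CRITERIA_JSON)
partial = score_checklist(
    criteria,
    {"Evaluation stage ordering": 14, "LLM evaluation stage last": 5},
)
print(f"{partial:.1f}% of available points")
```

Clamping keeps a grader that over-awards a single item from inflating the total, which matters if the awards come from an LLM judge rather than a deterministic check.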
