Design, build, or audit a coding agent, agentic loop, tool-use harness, or autonomous coding system, covering loop architecture, action space, context strategy, observation formatting, evaluation, error handling, prompt engineering, and task decomposition. Use when the user wants to design an agent, build a coding agent, scaffold an agentic system, architect a tool-use loop, review an existing agent harness for improvements, fix context bloat or compaction problems, tune observation formatting or tool output handling, debug agent loop or termination issues, design a system prompt or evaluator prompt for an agent, set up or redesign an agent evaluation pipeline, plan multi-agent orchestration, or specify how an agent should manage context, tools, prompts, evaluation, or recovery. Applies to both greenfield design and audits of existing systems.
100
100%
Does it follow best practices?
Impact
100%
1.23x average score across 4 eval scenarios
Passed
No known issues
{
  "context": "Tests whether the agent produces a correct observation formatting and context management specification for a long-running coding agent, applying the skill's specific rules for truncation strategy by size band, token-based limits, error output preservation, success-silent/failure-verbose, FIC strategy for 3h+ tasks, error history preservation, KV-cache preservation, compaction preference hierarchy, and observation pipeline structure.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Success silent, failure verbose",
      "description": "observation-context-spec.md specifies that passing tests produce only a count (not full output), while failing tests include full output, applying different verbosity rules based on success or failure",
      "max_score": 12
    },
    {
      "name": "Token-based truncation limits",
      "description": "observation-context-spec.md specifies token-based limits for truncation (not line-based), and includes an approximation method (e.g. num_bytes / 4) or references token counts directly",
      "max_score": 10
    },
    {
      "name": "Truncation bands by output size",
      "description": "observation-context-spec.md defines different handling strategies for at least 3 distinct output size bands (e.g. small pass-through, medium summarize, large offload to disk)",
      "max_score": 10
    },
    {
      "name": "Error output preserved verbatim",
      "description": "observation-context-spec.md specifies that error output is preserved verbatim regardless of size (up to a stated limit, e.g. ~5K tokens), rather than being truncated like other output",
      "max_score": 10
    },
    {
      "name": "FIC strategy for 3h+ tasks",
      "description": "observation-context-spec.md specifies context compaction at phase boundaries (Frequent Intentional Compaction) as the strategy for 3h+ sessions, with a target utilization range (e.g. 40-60%)",
      "max_score": 12
    },
    {
      "name": "Error history preserved during compaction",
      "description": "observation-context-spec.md explicitly states that failed attempts and error traces must NOT be removed during compaction or summarization",
      "max_score": 10
    },
    {
      "name": "KV-cache preservation rules",
      "description": "observation-context-spec.md includes at least one KV-cache preservation rule: stable prefixes, append-only modifications, no tool reordering mid-session, or avoidance of timestamps in system prompts",
      "max_score": 10
    },
    {
      "name": "Compaction preference hierarchy",
      "description": "observation-context-spec.md ranks compaction approaches in quality order, distinguishing between raw context retention, tool result clearing, observation masking, and summarization, with summarization recommended only as a last resort",
      "max_score": 8
    },
    {
      "name": "Observation pipeline steps",
      "description": "observation-context-spec.md describes a multi-step processing pipeline for tool outputs (e.g. capture → transform → size check → compress → inject metadata → position) rather than treating all output the same way",
      "max_score": 8
    },
    {
      "name": "Recovery hints on truncation",
      "description": "observation-context-spec.md specifies that when output is truncated or offloaded, the agent receives metadata indicating what was removed and how to retrieve it (recovery hints)",
      "max_score": 10
    }
  ]
}
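
The ten `max_score` weights in the checklist add up to 100, so a scenario score reads directly as a percentage. The page does not say how the eval harness aggregates per-item scores; the sketch below assumes a plain sum, with each item clamped to its own maximum. The names `CHECKLIST` and `weighted_score` are hypothetical and exist only for illustration.

```python
from typing import Mapping

# (name, max_score) pairs copied from the checklist above.
CHECKLIST = [
    ("Success silent, failure verbose", 12),
    ("Token-based truncation limits", 10),
    ("Truncation bands by output size", 10),
    ("Error output preserved verbatim", 10),
    ("FIC strategy for 3h+ tasks", 12),
    ("Error history preserved during compaction", 10),
    ("KV-cache preservation rules", 10),
    ("Compaction preference hierarchy", 8),
    ("Observation pipeline steps", 8),
    ("Recovery hints on truncation", 10),
]


def weighted_score(awarded: Mapping[str, float]) -> float:
    """Return the percentage of available points earned.

    Assumption: the aggregate is a simple sum, and each awarded value is
    clamped to the range [0, max_score] for its item. Missing items score 0.
    """
    total = sum(max_score for _, max_score in CHECKLIST)
    earned = sum(
        min(max(awarded.get(name, 0), 0), max_score)
        for name, max_score in CHECKLIST
    )
    return 100 * earned / total
```

Under this assumption, a spec that satisfies every item except "FIC strategy for 3h+ tasks" would score 88%.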
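
Several checklist items describe concrete observation-handling rules: success-silent/failure-verbose test reporting, a bytes/4 token estimate, size bands, verbatim errors up to a cap, and recovery hints. A minimal sketch of how these might fit together follows. Every threshold and name here is a hypothetical choice for illustration; only the ~5K-token error cap and the `num_bytes / 4` heuristic echo examples given in the checklist. The medium band uses a head/tail excerpt as a stand-in for the summarization step the checklist mentions.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Hypothetical thresholds (in estimated tokens), not values from the skill.
SMALL_MAX_TOKENS = 500
MEDIUM_MAX_TOKENS = 5_000
ERROR_MAX_TOKENS = 5_000  # mirrors the checklist's "~5K tokens" example
EXCERPT_CHARS = 400


def estimate_tokens(text: str) -> int:
    """Approximate the token count as UTF-8 bytes divided by 4."""
    return len(text.encode("utf-8")) // 4


@dataclass
class Observation:
    text: str
    recovery_hint: Optional[str] = None  # set whenever content was dropped


def format_test_result(passed: int, failed: int, full_output: str) -> str:
    """Success-silent, failure-verbose: a count on success, full output on failure."""
    if failed == 0:
        return f"{passed} tests passed"
    return f"{failed} failed, {passed} passed\n{full_output}"


def _offload(output: str, spill_dir: Path) -> Path:
    """Write the full output to disk so the agent can re-read it later."""
    spill_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(
        "w", dir=spill_dir, suffix=".log", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(output)
    return Path(fh.name)


def format_observation(output: str, is_error: bool, spill_dir: Path) -> Observation:
    tokens = estimate_tokens(output)
    # Errors stay verbatim up to the cap; small output always passes through.
    if tokens <= (ERROR_MAX_TOKENS if is_error else SMALL_MAX_TOKENS):
        return Observation(output)
    path = _offload(output, spill_dir)
    hint = f"full output (~{tokens} tokens) saved to {path}; read that file to recover it"
    if is_error:
        # Over the cap: keep roughly the first ERROR_MAX_TOKENS worth of characters.
        return Observation(output[: ERROR_MAX_TOKENS * 4], hint)
    if tokens <= MEDIUM_MAX_TOKENS:
        # Medium band: head and tail excerpts around an elision marker.
        head, tail = output[:EXCERPT_CHARS], output[-EXCERPT_CHARS:]
        return Observation(f"{head}\n[... middle elided ...]\n{tail}", hint)
    # Large band: inject only the pointer, never the content.
    return Observation("[output offloaded to disk]", hint)
```

Limits are expressed in estimated tokens rather than lines, so one very long line cannot slip past the check. Any path that drops content also returns a hint telling the agent where the full output lives.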