Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".
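As a rough illustration of the ledger shape described above, the following minimal sketch renders the six parts as Markdown, using the section headings that the eval rubric on this page checks for. The `DebugLedger` class, its field names, and the `render` method are hypothetical and not part of the skill itself; the 1-3 limit on next experiments and the worked/failed/inconclusive labels are taken from the rubric.

```python
from dataclasses import dataclass, field

# Labels the rubric expects on each attempt.
ATTEMPT_LABELS = ("worked", "failed", "inconclusive")


@dataclass
class DebugLedger:
    """Hypothetical container for one compacted debugging session."""

    target: str  # one-sentence description of what is being debugged
    evidence: list[str] = field(default_factory=list)
    attempts: list[tuple[str, str]] = field(default_factory=list)  # (description, label)
    ruled_out: list[str] = field(default_factory=list)
    still_plausible: list[str] = field(default_factory=list)
    next_experiments: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the ledger as Markdown, rejecting obviously malformed input."""
        for description, label in self.attempts:
            if label not in ATTEMPT_LABELS:
                raise ValueError(f"unknown attempt label {label!r} for {description!r}")
        if not 1 <= len(self.next_experiments) <= 3:
            raise ValueError("next_experiments must contain between 1 and 3 items")

        def bullets(items: list[str]) -> list[str]:
            return [f"- {item}" for item in items]

        lines = [
            "### Debug Target", self.target, "",
            "### Evidence", *bullets(self.evidence), "",
            "### Attempts", *[f"- {d} ({label})" for d, label in self.attempts], "",
            "### Ruled Out", *bullets(self.ruled_out), "",
            "### Still Plausible", *bullets(self.still_plausible), "",
            "### Next Experiments",
            *[f"{i}. {item}" for i, item in enumerate(self.next_experiments, 1)],
        ]
        return "\n".join(lines)
```

Calling `render()` on a ledger with four or more next experiments raises `ValueError`, which forces the author to pick the few steps that most reduce uncertainty instead of copying a whole brainstorm list.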
99
100%
Does it follow best practices?
Impact
99%
3.66x
Average score across 8 eval scenarios
Passed
No known issues
{
  "context": "Tests whether the agent limits the Next Experiments section to 1-3 items that most reduce uncertainty, rather than listing every possible action from a long brainstorm list.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Next Experiments section present",
      "description": "Output contains a '### Next Experiments' section",
      "max_score": 10
    },
    {
      "name": "Experiments count limit",
      "description": "The Next Experiments section contains between 1 and 3 items (inclusive) — NOT 4 or more",
      "max_score": 20
    },
    {
      "name": "Experiments reduce uncertainty",
      "description": "The listed experiments are investigation steps that would resolve remaining uncertainty (e.g. merging/validating the health-check fix, checking for the same pattern elsewhere) — not administrative tasks like writing docs or updating dashboards",
      "max_score": 15
    },
    {
      "name": "Debug Target section",
      "description": "Output contains a '### Debug Target' section with a one-sentence description",
      "max_score": 8
    },
    {
      "name": "Evidence section",
      "description": "Output contains an '### Evidence' section as a bullet list",
      "max_score": 8
    },
    {
      "name": "Attempts section with labels",
      "description": "Output contains an '### Attempts' section where each attempt has a worked/failed/inconclusive label",
      "max_score": 10
    },
    {
      "name": "Ruled Out section",
      "description": "Output contains a '### Ruled Out' section",
      "max_score": 8
    },
    {
      "name": "Still Plausible section",
      "description": "Output contains a '### Still Plausible' section",
      "max_score": 8
    },
    {
      "name": "File created",
      "description": "A file named ci_debug.md exists in the workspace",
      "max_score": 5
    },
    {
      "name": "No dead detail",
      "description": "Output does NOT include unrelated items (post-mortems, doc updates, dashboards) that don't affect investigation progress",
      "max_score": 8
    }
  ]
}
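
The structural items in the checklist above can be approximated mechanically. The sketch below is an assumption about how such a check might look, not the evaluator actually used for this skill: it awards each section's `max_score` when the heading is present and awards the count-limit points when the Next Experiments section lists 1 to 3 items. It simplifies on purpose: it only tests that headings exist (it does not verify attempt labels, bullet formatting, or the one-sentence target), and it skips the criteria that need judgment ("Experiments reduce uncertainty", "No dead detail") as well as the file-existence check. The function names are hypothetical.

```python
import re

# Points copied from the rubric's max_score values for the structural items.
SECTION_WEIGHTS = {
    "Next Experiments": 10,
    "Debug Target": 8,
    "Evidence": 8,
    "Attempts": 10,
    "Ruled Out": 8,
    "Still Plausible": 8,
}
EXPERIMENT_COUNT_WEIGHT = 20  # "Experiments count limit"


def split_sections(markdown: str) -> dict[str, list[str]]:
    """Map each '### Heading' to the non-blank lines that follow it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^###\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def count_items(lines: list[str]) -> int:
    """Count lines that start a bulleted or numbered list item."""
    return sum(1 for line in lines if re.match(r"^(?:[-*]|\d+\.)\s+", line))


def score_structure(markdown: str) -> int:
    """Return the structural score (at most 72 of the rubric's 100 points)."""
    sections = split_sections(markdown)
    score = sum(weight for name, weight in SECTION_WEIGHTS.items() if name in sections)
    if 1 <= count_items(sections.get("Next Experiments", [])) <= 3:
        score += EXPERIMENT_COUNT_WEIGHT
    return score
```

A ledger with all six headings and two next experiments scores 72 under this sketch; the remaining 28 points in the rubric depend on checks that are not automated here.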