Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".
{
  "context": "Tests whether the agent correctly labels each attempt in the Attempts section with 'worked', 'failed', or 'inconclusive', as required by the skill's output format.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Attempts section present",
      "description": "Output contains an '### Attempts' section",
      "max_score": 8
    },
    {
      "name": "All attempts labeled",
      "description": "Every item listed under Attempts includes one of the outcome labels: 'worked', 'failed', or 'inconclusive' (not just described narratively)",
      "max_score": 15
    },
    {
      "name": "Correct worked labels",
      "description": "Attempts that had a positive effect (LRU cache fix, heap snapshot collection, per-endpoint tracking) are labeled 'worked' or 'inconclusive' — NOT 'failed'",
      "max_score": 12
    },
    {
      "name": "Correct failed labels",
      "description": "Attempts that had no effect (npm audit scan, log buffer flush) are labeled 'failed'",
      "max_score": 12
    },
    {
      "name": "Ruled Out section",
      "description": "Output contains a '### Ruled Out' section that includes the disproved hypotheses (session middleware, elastic-client connection leak)",
      "max_score": 10
    },
    {
      "name": "Still Plausible section",
      "description": "Output contains a '### Still Plausible' section that includes the remaining viable hypotheses (EventEmitter listener leak not yet fully confirmed in prod)",
      "max_score": 10
    },
    {
      "name": "Next Experiments section",
      "description": "Output contains a '### Next Experiments' section",
      "max_score": 8
    },
    {
      "name": "Next experiments count",
      "description": "Between 1 and 3 next experiments are listed (not 0, not more than 3)",
      "max_score": 10
    },
    {
      "name": "Evidence section",
      "description": "Output contains an '### Evidence' section with bullet-list facts from the investigation",
      "max_score": 7
    },
    {
      "name": "Debug Target section",
      "description": "Output contains a '### Debug Target' section with a single-sentence description of the bug",
      "max_score": 8
    }
  ]
}
}
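
The structural parts of the checklist above (section presence, attempt labels, and the 1 to 3 limit on next experiments) can be sketched as a small Python check. This is an illustrative sketch only, not the evaluator the scenarios actually use: the function names, the `SAMPLE` ledger, and the assumption that list items start with `-`, `*`, or `1.` are all hypothetical. It does not judge whether a label is the correct one for a given attempt, which needs the scenario's context.

```python
import re

# Section headings and outcome labels as named in the checklist descriptions.
REQUIRED_SECTIONS = [
    "Debug Target", "Evidence", "Attempts",
    "Ruled Out", "Still Plausible", "Next Experiments",
]
LABEL_RE = re.compile(r"\b(worked|failed|inconclusive)\b", re.IGNORECASE)
ITEM_RE = re.compile(r"^\s*(?:[-*]|\d+\.)\s+\S")  # assumed list-item syntax


def split_sections(ledger: str) -> dict[str, list[str]]:
    """Map each '### Heading' to the list items found beneath it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in ledger.splitlines():
        heading = re.match(r"^###\s+(.*\S)\s*$", line)
        if heading:
            current = heading.group(1)
            sections[current] = []
        elif current is not None and ITEM_RE.match(line):
            sections[current].append(line.strip())
    return sections


def check_ledger(ledger: str) -> dict[str, bool]:
    """Return a pass/fail flag for each structural check."""
    sections = split_sections(ledger)
    results = {f"{name} section present": name in sections
               for name in REQUIRED_SECTIONS}
    attempts = sections.get("Attempts", [])
    # Every attempt must carry an explicit outcome label.
    results["All attempts labeled"] = bool(attempts) and all(
        LABEL_RE.search(item) for item in attempts
    )
    experiments = sections.get("Next Experiments", [])
    results["Next experiments count"] = 1 <= len(experiments) <= 3
    return results


# Hypothetical ledger; the attempt and hypothesis names are borrowed from the
# checklist descriptions, and the parenthesised items are placeholders.
SAMPLE = """\
### Debug Target
- (one-sentence description of the bug)

### Evidence
- (fact observed during the investigation)

### Attempts
- LRU cache fix: worked
- npm audit scan: failed
- heap snapshot collection: inconclusive

### Ruled Out
- session middleware

### Still Plausible
- EventEmitter listener leak

### Next Experiments
1. (hypothetical next experiment)
"""

if __name__ == "__main__":
    for name, passed in check_ledger(SAMPLE).items():
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Matching labels with a word-boundary regex keeps the check tolerant of formats such as `worked:` or `(failed)`, at the cost of also accepting a label that appears inside a narrative sentence.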