Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".
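To make the shape of such a ledger concrete, here is a minimal Python sketch of a container that renders the six parts as markdown. The class name `DebugLedger`, its field names, and the `render` method are hypothetical, introduced only for illustration; the section headers and the `worked`/`failed`/`inconclusive` status labels are taken from the eval checklist for this skill.

```python
from dataclasses import dataclass, field

# Status labels named in the eval checklist for the Attempts section.
STATUS_LABELS = ("worked", "failed", "inconclusive")


@dataclass
class DebugLedger:
    """Hypothetical container for a compacted debugging session."""

    target: str  # one sentence naming what is being debugged
    evidence: list[str] = field(default_factory=list)
    attempts: list[tuple[str, str]] = field(default_factory=list)  # (attempt, status)
    ruled_out: list[str] = field(default_factory=list)
    still_plausible: list[str] = field(default_factory=list)
    next_experiments: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject attempts whose status is not one of the three known labels.
        for attempt, status in self.attempts:
            if status not in STATUS_LABELS:
                raise ValueError(f"unknown status {status!r} for attempt {attempt!r}")

    def render(self) -> str:
        """Return the ledger as markdown, organized by state rather than by time."""
        sections = [
            ("Debug Target", [self.target]),
            ("Evidence", [f"- {e}" for e in self.evidence]),
            ("Attempts", [f"- {a}: {s}" for a, s in self.attempts]),
            ("Ruled Out", [f"- {h}" for h in self.ruled_out]),
            ("Still Plausible", [f"- {h}" for h in self.still_plausible]),
            ("Next Experiments", [f"- {x}" for x in self.next_experiments]),
        ]
        blocks = ["\n".join([f"### {title}", *body]) for title, body in sections]
        return "\n\n".join(blocks) + "\n"
```

Keeping attempts as `(attempt, status)` pairs means every rendered attempt line carries a status label, and a mislabeled attempt fails early instead of producing a malformed ledger.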
Impact: 99% (3.66x), average score across 8 eval scenarios
{
"context": "Tests whether the agent produces a debug ledger with all required sections in the correct format, including Debug Target, Evidence, Attempts, Ruled Out, Still Plausible, and Next Experiments sections.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Debug Target section",
"description": "Output contains a '### Debug Target' section header",
"max_score": 8
},
{
"name": "Debug Target one sentence",
"description": "The content under Debug Target is exactly one sentence, not a paragraph or bullet list",
"max_score": 8
},
{
"name": "Evidence section",
"description": "Output contains an '### Evidence' section header",
"max_score": 8
},
{
"name": "Evidence as bullet facts",
"description": "The content under Evidence is formatted as a bullet list of facts (lines starting with '- ')",
"max_score": 8
},
{
"name": "Attempts section",
"description": "Output contains an '### Attempts' section header",
"max_score": 8
},
{
"name": "Attempt status labels",
"description": "Each item under Attempts includes one of the labels 'worked', 'failed', or 'inconclusive' (e.g. '- <attempt>: worked')",
"max_score": 10
},
{
"name": "Ruled Out section",
"description": "Output contains a '### Ruled Out' section header",
"max_score": 8
},
{
"name": "Still Plausible section",
"description": "Output contains a '### Still Plausible' section header",
"max_score": 8
},
{
"name": "Next Experiments section",
"description": "Output contains a '### Next Experiments' section header",
"max_score": 8
},
{
"name": "No off-topic detail",
"description": "Output does NOT include conversational filler (e.g. coffee break mention, 'I joined late', routine social exchanges) that has no bearing on the investigation",
"max_score": 9
},
{
"name": "Evidence over chronology",
"description": "Output is organized by investigation state (Evidence, Attempts, Hypotheses) rather than a timestamped chronological replay of events",
"max_score": 9
},
{
"name": "Output saved to file",
"description": "A file named debug_ledger.md exists in the workspace",
"max_score": 8
}
]
}
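The structural items in this checklist can be approximated mechanically. The following Python sketch checks a ledger for the six required headers, bullet-formatted evidence, and a status label on each attempt. It is not the evaluator used for these scenarios: `split_sections` and `check_ledger` are hypothetical names, and the judgment-based items (one-sentence target, no off-topic detail, evidence over chronology) are left out because they need a reader rather than a regex.

```python
import re

# Section headers and status labels as named in the checklist above.
REQUIRED_HEADERS = [
    "Debug Target",
    "Evidence",
    "Attempts",
    "Ruled Out",
    "Still Plausible",
    "Next Experiments",
]
STATUS_LABELS = ("worked", "failed", "inconclusive")


def split_sections(text: str) -> dict[str, list[str]]:
    """Map each '### Header' to the non-blank, stripped lines beneath it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        match = re.match(r"^###\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def check_ledger(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    sections = split_sections(text)
    problems = [f"missing section: {h}" for h in REQUIRED_HEADERS if h not in sections]
    for line in sections.get("Evidence", []):
        if not line.startswith("- "):
            problems.append(f"evidence line is not a bullet: {line}")
    for line in sections.get("Attempts", []):
        if not any(re.search(rf"\b{label}\b", line) for label in STATUS_LABELS):
            problems.append(f"attempt has no status label: {line}")
    return problems
```

To cover the last checklist item, a caller could read `debug_ledger.md` from the workspace and pass its contents to `check_ledger`, treating a missing file as a failure.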