Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".
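The six parts of the ledger map naturally onto a small data structure. The sketch below is one hypothetical way to model them in Python and render the section headings this skill's eval rubric looks for (`### Debug Target`, `### Evidence`, and so on). The names `DebugLedger`, `Attempt`, and `to_markdown`, along with the sample values, are assumptions made for illustration and are not part of the skill itself.

```python
from dataclasses import dataclass, field
from typing import Iterable, List

# Labels an attempt may carry, per the "worked/failed/inconclusive" convention.
ALLOWED_OUTCOMES = {"worked", "failed", "inconclusive"}


@dataclass
class Attempt:
    description: str
    outcome: str  # must be one of ALLOWED_OUTCOMES

    def __post_init__(self) -> None:
        if self.outcome not in ALLOWED_OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")


@dataclass
class DebugLedger:
    target: str  # a single sentence naming what is being debugged
    evidence: List[str] = field(default_factory=list)
    attempts: List[Attempt] = field(default_factory=list)
    ruled_out: List[str] = field(default_factory=list)
    still_plausible: List[str] = field(default_factory=list)
    next_experiments: List[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the ledger by section, not by time of day."""

        def bullets(items: Iterable[str]) -> List[str]:
            return [f"- {item}" for item in items] or ["- (none recorded)"]

        attempt_lines = (f"{a.description} ({a.outcome})" for a in self.attempts)
        lines = ["### Debug Target", self.target, ""]
        lines += ["### Evidence", *bullets(self.evidence), ""]
        lines += ["### Attempts", *bullets(attempt_lines), ""]
        lines += ["### Ruled Out", *bullets(self.ruled_out), ""]
        lines += ["### Still Plausible", *bullets(self.still_plausible), ""]
        lines += ["### Next Experiments", *bullets(self.next_experiments)]
        return "\n".join(lines) + "\n"


# Illustrative values only; the attempt and the experiment are hypothetical.
ledger = DebugLedger(
    target="The batch job slowed down after a pandas upgrade.",
    evidence=["The slowdown began after the pandas upgrade."],
    attempts=[Attempt("Pin the previous pandas version", "inconclusive")],
    ruled_out=["NFS packet loss"],
    still_plausible=["Append-loop overhead"],
    next_experiments=["Profile the append loop in isolation"],
)
print(ledger.to_markdown())
```

Validating the outcome label at construction time keeps every attempt tagged, and rendering by section rather than by timestamp means off-topic log entries have no natural place to land.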
{
  "context": "Tests whether the agent organizes the output around evidence and hypotheses rather than reproducing the chronological log structure, and avoids carrying over irrelevant personal detail.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "No timestamp structure",
      "description": "Output is NOT organized with timestamps or time-ordered headings (e.g. '09:00', '10:00') — it uses the Debug Target / Evidence / Attempts / Ruled Out / Still Plausible / Next Experiments structure instead",
      "max_score": 18
    },
    {
      "name": "Evidence section present",
      "description": "Output contains an '### Evidence' section with bullet-point facts (not a timeline)",
      "max_score": 10
    },
    {
      "name": "Evidence captures key facts",
      "description": "Evidence section includes at least 2 of: pandas upgrade as trigger, append-loop overhead, file I/O bottleneck percentage, NFS context",
      "max_score": 10
    },
    {
      "name": "No irrelevant detail",
      "description": "Output does NOT mention the lunch break, the manager 1:1, or other personal/off-topic entries from the log",
      "max_score": 15
    },
    {
      "name": "Attempts section with labels",
      "description": "Output contains an '### Attempts' section where each attempt includes a worked/failed/inconclusive label",
      "max_score": 10
    },
    {
      "name": "Ruled Out section",
      "description": "Output contains a '### Ruled Out' section that includes NFS packet loss as a ruled-out hypothesis",
      "max_score": 8
    },
    {
      "name": "Still Plausible section",
      "description": "Output contains a '### Still Plausible' section",
      "max_score": 8
    },
    {
      "name": "Next Experiments section",
      "description": "Output contains a '### Next Experiments' section with 1-3 items",
      "max_score": 8
    },
    {
      "name": "Debug Target one sentence",
      "description": "Output contains a '### Debug Target' section with exactly one sentence",
      "max_score": 8
    },
    {
      "name": "File saved",
      "description": "A file named perf_investigation.md exists in the workspace",
      "max_score": 5
    }
  ]
}
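In a `weighted_checklist` like the one above, each check carries a `max_score`; here the ten values sum to 100, so awarded points read directly as a percentage. The sketch below shows one plausible way a grader could combine per-check results. The `score_checklist` function and the `awarded` mapping are hypothetical, and the actual evaluation harness may aggregate differently.

```python
from typing import Dict

# max_score values copied from the checklist above, keyed by check name.
MAX_SCORES: Dict[str, int] = {
    "No timestamp structure": 18,
    "Evidence section present": 10,
    "Evidence captures key facts": 10,
    "No irrelevant detail": 15,
    "Attempts section with labels": 10,
    "Ruled Out section": 8,
    "Still Plausible section": 8,
    "Next Experiments section": 8,
    "Debug Target one sentence": 8,
    "File saved": 5,
}


def score_checklist(max_scores: Dict[str, int], awarded: Dict[str, int]) -> float:
    """Return the fraction of available points earned (0.0 to 1.0).

    Checks missing from `awarded` count as zero, and points are clamped
    to the range [0, max_score] for each check.
    """
    total_max = sum(max_scores.values())
    if total_max == 0:
        raise ValueError("checklist has no points to award")
    earned = 0
    for name, max_score in max_scores.items():
        points = awarded.get(name, 0)
        earned += max(0, min(points, max_score))
    return earned / total_max


# Hypothetical run: every check passes except the file was not saved.
awarded = {name: points for name, points in MAX_SCORES.items() if name != "File saved"}
print(f"{score_checklist(MAX_SCORES, awarded):.0%}")
```

Because the heaviest weights sit on "No timestamp structure" (18) and "No irrelevant detail" (15), an output that keeps the chronological layout or carries over off-topic entries loses more than one that merely omits a section.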