Use when a debugging thread needs to be compressed into a reusable investigation ledger. Capture the target, evidence, attempted fixes, ruled-out hypotheses, viable hypotheses, and next experiments. Good triggers include "compact this debugging session", "summarize what we've tried", and "turn this into a debugging ledger".
99
100%
Does it follow best practices?
Impact
99%
3.66x Average score across 8 eval scenarios
Passed
No known issues
{
  "context": "Tests whether the agent produces a Debug Target section that is exactly one sentence, even when the incident involves multiple interacting systems that could tempt a verbose multi-sentence description.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Debug Target section present",
      "description": "Output contains a '### Debug Target' section header",
      "max_score": 8
    },
    {
      "name": "Debug Target is one sentence",
      "description": "The content under Debug Target is exactly one sentence: it ends with a single period (or equivalent) and does not contain multiple sentences separated by periods, semicolons acting as sentence breaks, or bullet points",
      "max_score": 20
    },
    {
      "name": "Debug Target captures essence",
      "description": "The single-sentence Debug Target meaningfully describes the core incident (e.g. references the cascading failure, the ES/SendGrid combination, or the user-visible symptoms) rather than being a generic placeholder",
      "max_score": 12
    },
    {
      "name": "Evidence section",
      "description": "Output contains an '### Evidence' section as a bullet list",
      "max_score": 8
    },
    {
      "name": "Attempts section",
      "description": "Output contains an '### Attempts' section where items have worked/failed/inconclusive labels",
      "max_score": 10
    },
    {
      "name": "Ruled Out section",
      "description": "Output contains a '### Ruled Out' section",
      "max_score": 8
    },
    {
      "name": "Still Plausible section",
      "description": "Output contains a '### Still Plausible' section",
      "max_score": 8
    },
    {
      "name": "Next Experiments section",
      "description": "Output contains a '### Next Experiments' section with 1-3 items",
      "max_score": 8
    },
    {
      "name": "No chronological replay",
      "description": "Output is NOT structured as a timestamped chronological replay of the war room; it uses the Evidence/Attempts/Hypotheses structure instead",
      "max_score": 10
    },
    {
      "name": "File saved",
      "description": "A file named incident_record.md exists in the workspace",
      "max_score": 8
    }
  ]
}