Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when the user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
68
81%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
A discipline for hard bugs. Skip phases only when explicitly justified.
When exploring the codebase, use CONTEXT.md to ground your mental model of the relevant modules and check docs/adr/ for decisions in the area you're touching.
Before anything else, write down what the user actually reported. The wrong-bug failure mode (debugging a nearby symptom rather than the reported one) starts here.
Capture, even if only in conversation:
If any of these are missing and the user is reachable, ask. If AFK, note the gap explicitly and proceed with the strongest assumption stated.
Only after the symptom is captured does Phase 1 begin.
This is the skill. Everything else is mechanical. With a fast, deterministic, agent-runnable pass/fail signal you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. Without one, no amount of staring at code will save you.
Spend disproportionate effort here. Be aggressive. Be creative. Refuse to give up.
git bisect run it.Right loop, bug 90% fixed.
Treat the loop as a product. Once you have a loop, ask:
A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
Goal isn't a clean repro — it's a higher reproduction rate. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. 50%-flake = debuggable. 1% = not — keep raising the rate.
Stop and say so explicitly. List what you tried. Ask for: (a) access to the environment that reproduces, (b) a captured artifact (HAR, log dump, core dump, screen recording with timestamps), or (c) permission for temporary production instrumentation. Do not hypothesise without a loop.
Don't proceed to Phase 2 until you have a loop you believe in.
Run the loop. Watch the bug appear. Confirm:
Don't proceed until you reproduce.
Generate 3–5 ranked hypotheses before testing any. Single-hypothesis generation anchors on the first plausible idea.
Each hypothesis must be falsifiable:
Format: "If is the cause, then will make the bug disappear / will make it worse."
Can't state the prediction → it's a vibe. Discard or sharpen.
Show the ranked list to the user before testing. They often re-rank instantly with domain knowledge ("we just shipped a change to #3"), or know what's already been ruled out. Cheap checkpoint, big time saver. Don't block — proceed with your ranking if AFK.
Each probe must map to a specific prediction from Phase 3. Change one variable at a time.
Tool preference:
Tag every debug log with a unique prefix, e.g. [DEBUG-a4f2]. Cleanup at the end is a single grep. Untagged logs survive; tagged logs die.
Perf branch. For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, performance.now(), profiler, query plan), then bisect. Measure first, fix second.
Write the regression test before the fix — but only if there's a correct seam for it.
A correct seam exercises the real bug pattern as it occurs at the call site. Too-shallow seam (single-caller test for a multi-caller bug, unit test that can't replicate the chain) gives false confidence.
No correct seam exists? That's itself the finding. Note it. The architecture is preventing the bug from being locked down. Flag for Phase 6.
If a correct seam exists:
Required before declaring done:
[DEBUG-...] instrumentation removed (grep the prefix)Then ask: what would have prevented this bug? If the answer is architectural (no good test seam, tangled callers, hidden coupling) hand off to improve-codebase-architecture with specifics. Make the recommendation after the fix is in — you have more information now than when you started.
be88d6c
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.