pantheon-ai/troubleshoot

Search-first troubleshooting with a diagnostic phase — use when an error, bug, or unexpected behaviour is reported.

Quality

75%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

Securityby

Advisory

Suggest reviewing before use

name:: troubleshoot
description:: Use when user reports an error, bug, or something not working. Search-first troubleshooting with diagnostic phase. Triggers: debug, error, broken, not working, failing, crash, exception.
allowed-tools:: WebSearch, WebFetch, AskUserQuestion, Read, Glob, Grep
model:: opus
context:: main
argument-hint:: <error or symptom description>
cynefin-domain:: complicated
cynefin-verb:: analyze

Troubleshoot

Name: pantheon-ai/troubleshoot
Rating: 75 (1 reviews)
Author: pantheon-ai

Search-first diagnostic workflow. Human executes commands.

⚠️ AskUserQuestion Guard

CRITICAL: After EVERY AskUserQuestion call, check if answers are empty/blank. Known Claude Code bug: outside Plan Mode, AskUserQuestion silently returns empty answers without showing UI.

If answers are empty: DO NOT proceed with assumptions. Instead:

Output: "⚠️ Questions didn't display (known Claude Code bug outside Plan Mode)."
Present the options as a numbered text list and ask user to reply with their choice number.
WAIT for user reply before continuing.

Workflow

0.Load → 1.Search → 2.Qualify → 3.Diagnose → 4.Investigate → 5.Persist → 6.Learn

0. Load Learnings

If a project-level learnings file exists (e.g., learnings.yaml in the project root or a known location), read it and apply known patterns before searching.

1. Search (do first)

80% of bugs solved online.

WebSearch: [error] [stack] [framework] on SO, GitHub, Docs, Reddit
Solution found → skip to 5.Learn

2. Qualify (2-3 questions)

AskUserQuestion:

Stack? (language, framework, runtime)
Environment? (local, container, cloud)
Changed recently? (deploy, config, dependency)

3. Diagnose

See references/protocols/diagnose.md for details.

Mental models: Check learnings file → WebSearch pattern → reason with 5 Whys/Fishbone
Isolation: Wolf Fence (binary search), swap one variable, minimal repro
Root cause drill: 5 Whys, Fishbone 6 M's

Pattern matches → suggest fix, skip OODA

4. Investigate (OODA)

Only if diagnosis inconclusive.

Observe: User runs command, pastes output
Orient: Analyze, update hypothesis
Decide: Next command or confirm cause
Act: Suggest fix (user executes)

Exit when root cause confirmed and fix verified.

5. Persist Thinking Artifact ⚠️ MANDATORY

MUST execute after root cause confirmed and fix verified. DO NOT skip. DO NOT wait for user to ask.

If a persistent thinking store is configured (e.g., via an environment variable pointing to a notes directory), write the diagnostic session there. Skip silently if no store is configured.

Content (required sections):

Symptoms (as reported)
Hypotheses tested (ordered list with result: confirmed/eliminated)
OODA loops (if Phase 4 was entered)
Root cause (as confirmed)
Resolution (fix applied)
Cynefin transition (if problem re-classified during diagnosis)

This is the active thinking trail — distinct from any learnings file which captures distilled, reusable conclusions.

6. Learn

After resolution, AskUserQuestion: "Save this learning?"

Global → append to a shared learnings store
Project → append with scope: project:<name>
Skip

Philosophy

Reproduce before you theorize — a bug that cannot be reproduced is a hypothesis, not a confirmed problem.
Source location is the exit condition — every investigation ends with a specific file and line, not a module or service name.
Distinguish symptom from cause — the first observable failure is rarely the root cause; follow the chain.
Domain determines method — Complicated problems have known solutions; Complex problems need probes; do not mix the strategies.

When to Use

When a user reports an error, exception, or crash with no clear cause
When a system exhibits unexpected behavior that cannot be explained by recent changes
When a bug is intermittent or flaky and cannot be reproduced consistently
When log output or stack traces point to a symptom but not a root cause
When a fix has been applied but the problem recurs or manifests differently

When Not to Use

When the bug is already reproduced and the fix is obvious — apply the fix directly
When the request is a feature request framed as a bug report — redirect to planning
When the issue is a known, documented limitation with no workaround — communicate clearly
When the environment is completely unavailable (no logs, no reproduction path, no user access) — gather access first
When a postmortem or incident review is needed instead of a live diagnostic — use a dedicated retrospective workflow

Anti-Patterns

NEVER jump to a fix without reproducing the issue — Untested fixes introduce regressions. Why: Reproduction confirms you're solving the actual problem, not a symptom.
NEVER ignore the source code location step — All investigations must end with a specific file and line. Why: Vague "somewhere in the auth module" diagnoses cannot be verified or code-reviewed.
NEVER treat correlation as causation — Two events occurring together don't mean one causes the other. Why: Fixing the correlated symptom leaves the root cause intact.
NEVER skip the search step — Searching external sources first avoids reinventing known solutions. Why: 80% of bugs have documented answers; skipping search wastes diagnostic cycles.
NEVER persist a thinking artifact before root cause is confirmed — Premature writes capture incomplete hypotheses as conclusions. Why: Misleading learnings files cause the same misdiagnosis in future sessions.

Usage Examples

Investigating a silent API failure:

# Skill prompts: reproduce → isolate → check logs → locate source
# Output: "Root cause at src/api/auth.ts:142 — token expiry not caught"

Triaging a flaky test:

# Skill prompts: classify domain (Clear/Complicated/Complex)
# Routes to probe skill if Complex domain detected

Diagnosing a performance regression after a deploy:

# Skill prompts: qualify stack/environment/recent changes → Wolf Fence bisect
# Output: "Regression introduced in commit abc123 — N+1 query in UserService.list()"

✅ Completion Checklist

Before responding to user, verify:

Artifact written (or skipped if no store configured)

Refs

references/protocols/diagnose.md - Mental models, bisect strategies
references/protocols/search-multi-source.md - Multi-source search workflow
references/reference.md - Isolation techniques and root cause methods

References

Reference — isolation techniques (Wolf Fence, Swap One Variable, Minimal Repro), root cause drilling (5 Whys, Fishbone), OODA loop
Diagnose Protocol — mental model matching, isolation sequence, root cause drill, OODA handoff conditions
Search Multi-Source Protocol — query construction, source priority (SO, GitHub Issues, Docs, Reddit), no-result fallback

references

SKILL.md

tile.json