CtrlK
BlogDocsLog inGet started
Tessl Logo

root-cause-analysis

Performs systematic root cause analysis to identify the true source of bugs, errors, and unexpected behavior through structured investigation phases — not just treating symptoms. Use when a user reports a bug, crash, error, or broken behavior and needs to debug, troubleshoot, or investigate why something is not working; especially for complex or intermittent issues across multiple components. Applies the Five Whys method, hypothesis-driven testing, stack trace analysis, git blame/log evidence gathering, and causal chain documentation to isolate and confirm root causes before applying any fix.

Install with Tessl CLI

npx tessl i github:rohitg00/skillkit --skill root-cause-analysis
What are skills?

93

Does it follow best practices?

Validation for skill structure

SKILL.md
Review
Evals

Root Cause Analysis

You are performing systematic root cause analysis to find the true source of a bug. Do not apply fixes until you understand WHY the bug exists.

Core Principle

Never fix a symptom. Always find and fix the root cause.

The Five Whys Method

Ask "Why?" repeatedly to drill down to the root cause:

  1. Why did the API return an error? → The database query failed
  2. Why did the database query fail? → The connection pool was exhausted
  3. Why was the pool exhausted? → ROOT CAUSE: Missing finally block to close connections

Investigation Phases

Phase 1: Reproduce the Bug

Before investigating:

  1. Reproduce consistently - If you can't reproduce it, you can't verify a fix
  2. Document reproduction steps - Exact sequence of actions
  3. Note environment details - OS, versions, configuration
  4. Identify minimal reproduction - Smallest case that shows the bug

Questions to answer:

  • Does it happen every time or intermittently?
  • Does it happen in all environments?
  • When did it start happening? (recent changes)

Phase 2: Gather Evidence

Collect information before forming theories:

  • Error messages and stack traces
  • Log files (application, system, database)
  • Recent code changes (git log, blame)
  • User reports and reproduction steps
  • Monitoring data (metrics, APM)
  • Related issues (search issue tracker)

Do NOT:

  • Make changes while gathering evidence
  • Assume you know the cause without evidence
  • Ignore related symptoms

Phase 3: Form Hypotheses

Based on evidence, create ranked hypotheses:

PriorityHypothesisEvidenceTest Plan
1Connection leak in UserServiceStack trace shows connection poolAdd logging, check usage
2Query timeout too shortOccurs under loadTest with longer timeout
3Database server overloadCorrelates with peak hoursCheck DB metrics

For each hypothesis:

  • What evidence supports it?
  • What evidence contradicts it?
  • How can we test it?

Phase 4: Test Hypotheses

Test each hypothesis systematically:

  1. Start with highest probability
  2. Design a definitive test - Should clearly confirm or reject
  3. Make ONE change at a time
  4. Document results

If hypothesis is rejected:

  • Cross it off the list
  • Re-evaluate remaining hypotheses
  • Consider if new evidence suggests new hypotheses

Phase 5: Verify Root Cause

Before declaring root cause found:

  • Can you explain the full causal chain?
  • Does fixing it consistently prevent the bug?
  • Does it explain ALL observed symptoms?
  • Is there nothing earlier in the chain that could be fixed?

Common Root Cause Categories

  • Code Defects: logic errors, boundary conditions, race conditions, resource leaks, null/undefined handling
  • Design Issues: missing error handling, inadequate validation, poor state management, coupling
  • Environment: configuration errors, resource constraints, version mismatches, network issues
  • Data Issues: invalid input, data corruption, schema mismatches, encoding problems

Evidence Collection Commands

# Recent changes to relevant files
git log --oneline -20 -- path/to/file

# Who changed this line
git blame path/to/file

# Changes since last working version
git diff v1.2.3..HEAD -- src/

# Search for related error handling
grep -r "catch\|error\|throw" --include="*.ts" src/

Red Flags - You Haven't Found Root Cause

  • "I'm not sure why, but this fix works"
  • "The bug went away after I restarted"
  • "I added a check to prevent this case"
  • "It's probably a race condition somewhere"

These suggest symptom treatment, not root cause resolution.

Documentation Template

When root cause is found, document:

## Bug: [Description]

### Root Cause
[Clear explanation of why the bug occurred]

### Evidence
- [Evidence 1]
- [Evidence 2]

### Causal Chain
1. [Initial trigger]
2. [Intermediate cause]
3. [Root cause]
4. [Observed symptom]

### Fix
[Description of the fix and why it addresses root cause]

### Prevention
[How to prevent similar issues in the future]

Integration with Other Skills

After finding root cause:

  • Use testing/red-green-refactor to write a test that exposes the bug
  • Use planning/verification-gates to validate the fix
  • Consider collaboration/structured-review for complex fixes
Repository
rohitg00/skillkit
Last updated
Created

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.