Name: kopai/root-cause-analysis
Rating: 1 (1 review)
Author: kopai

Analyze telemetry data for root cause analysis using Kopai CLI. Use when debugging errors, investigating latency issues, tracing request flows across services, or correlating logs with traces. Also use when users report production issues like "why is my API slow", "getting 500 errors", "service is down", "requests are timing out", or any symptom that needs telemetry-based investigation — even if they don't mention traces or observability explicitly.

title	impact	tags
Step 1: Find Error Traces	CRITICAL	workflow, errors, step1

Step 1: Find Error Traces

Impact: CRITICAL

The first step in the RCA workflow: locate errors across both traces and logs.

1a. Find Error Traces

npx @kopai/cli traces search --status-code ERROR --limit 20 --json

1b. Find Error Logs by Severity Number

Use --severity-min 17 to catch all logs at error level or above, regardless of text casing. This is the preferred approach because SeverityText is inconsistent across languages and frameworks (e.g. ERROR, error, Error, or empty), whereas SeverityNumber >= 17 always indicates an erroneous situation (ERROR or FATAL) per the OTel Log Data Model.

npx @kopai/cli logs search --severity-min 17 --limit 20 --json

SeverityNumber	Level
1-4	TRACE
5-8	DEBUG
9-12	INFO
13-16	WARN
17-20	ERROR
21-24	FATAL
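
When reading raw results, it can help to turn a SeverityNumber back into its level name. The following is a minimal sketch of the table above as a shell function; severity_level is a hypothetical helper name, not part of the Kopai CLI, and 0 (or any out-of-range value) is reported as UNSPECIFIED.

```shell
# Hypothetical helper (not part of the Kopai CLI): map an OTel
# SeverityNumber to its level name, following the table above.
severity_level() {
  n=$1
  if   [ "$n" -ge 1 ]  && [ "$n" -le 4 ];  then echo TRACE
  elif [ "$n" -ge 5 ]  && [ "$n" -le 8 ];  then echo DEBUG
  elif [ "$n" -ge 9 ]  && [ "$n" -le 12 ]; then echo INFO
  elif [ "$n" -ge 13 ] && [ "$n" -le 16 ]; then echo WARN
  elif [ "$n" -ge 17 ] && [ "$n" -le 20 ]; then echo ERROR
  elif [ "$n" -ge 21 ] && [ "$n" -le 24 ]; then echo FATAL
  else echo UNSPECIFIED
  fi
}
```

For example, severity_level 18 prints ERROR, which is why a minimum of 17 covers both the ERROR and FATAL ranges.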

1c. Find Hidden Errors (fallback)

Some services log errors at INFO level or with no severity set. Search log body and attributes as a fallback.

npx @kopai/cli logs search --body "error" --limit 20 --json
npx @kopai/cli logs search --body "exception" --limit 20 --json
npx @kopai/cli logs search --body "failed" --limit 20 --json
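
The three body searches above can be wrapped in a loop so none is forgotten. This is only a sketch: search_hidden_errors and the KOPAI variable are hypothetical names introduced here, and the flags are exactly the ones shown above. Overriding KOPAI (for instance with "echo kopai") gives a dry run that prints the commands instead of executing them.

```shell
# Hypothetical wrapper: KOPAI holds the command prefix and can be
# overridden, e.g. KOPAI="echo kopai" for a dry run.
KOPAI="${KOPAI:-npx @kopai/cli}"

# Run one body search per keyword; the optional first argument is the limit.
search_hidden_errors() {
  for term in error exception failed; do
    # $KOPAI is intentionally unquoted so "npx @kopai/cli" splits into words.
    $KOPAI logs search --body "$term" --limit "${1:-20}" --json
  done
}
```

Calling search_hidden_errors 50 would then run the same three searches with a higher limit.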

Handling limit saturation

If a search returns exactly --limit results, there are likely more errors hidden beyond the limit. Do NOT stop — continue exploring:

Group by service to see if one noisy service dominates results
Work around noisy services by re-running the query per service for each of the remaining services
Increase the limit or paginate to ensure you're not missing app-level errors
Always run the hidden error searches (1c) even if 1a/1b return results — real app errors are often logged at INFO severity or only appear in the body text

A single noisy service (e.g. otel collector infrastructure errors) can fill the entire result set and hide critical application errors.
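
The saturation check itself is simple arithmetic and could be sketched as below. Both function names are hypothetical, and doubling the limit is just one illustrative way to widen the search, not something the CLI prescribes. How the result count is obtained depends on the shape of the --json output, which this document does not specify, so the count is passed in as an argument.

```shell
# Hypothetical helper: succeed (exit 0) when a search returned at least
# as many results as the limit, which suggests more exist beyond it.
is_saturated() {
  count=$1
  limit=$2
  [ "$count" -ge "$limit" ]
}

# Hypothetical helper: suggest the next limit to try by doubling.
next_limit() {
  echo $(( $1 * 2 ))
}
```

For example, if a search with --limit 20 returns 20 results, is_saturated 20 20 succeeds and next_limit 20 prints 40 as a candidate for the re-run.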

Filter by Service

npx @kopai/cli traces search --status-code ERROR --service payment-api --json
npx @kopai/cli logs search --severity-min 17 --service payment-api --json

Filter by Time Range

# Timestamp in nanoseconds
npx @kopai/cli traces search --status-code ERROR --timestamp-min 1700000000000000000 --json
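
Nanosecond timestamps are awkward to write by hand. As a small sketch, they can be derived from Unix epoch seconds in the shell; to_nanos and nanos_ago are hypothetical helper names, and the arithmetic assumes a shell with 64-bit integers (typical for bash and dash on current systems).

```shell
# Hypothetical helper: convert Unix epoch seconds to nanoseconds.
to_nanos() {
  echo $(( $1 * 1000000000 ))
}

# Hypothetical helper: nanosecond timestamp for N seconds before now.
nanos_ago() {
  to_nanos $(( $(date +%s) - $1 ))
}
```

Here to_nanos 1700000000 prints 1700000000000000000, the value used in the command above, and passing --timestamp-min "$(nanos_ago 3600)" would restrict a search to roughly the last hour.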

Reference

See references/trace-filters.md and references/log-filters.md for all filter options.