observability-rca

Use this skill when performing root cause analysis on incidents detected by Elastic Observability. Activate when the user reports a production issue, outage, degraded performance, or asks to investigate alerts.


Elastic Observability Root Cause Analysis

Name: observability-rca
Rating: 54.4 (1 review)
Author: elastic

Investigation Framework

1. Assess Scope

Start with high-level health checks:

elastic es cluster health
elastic slos list

Check for widespread vs isolated issues by querying error rates across services:

elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 1 HOUR AND log.level == "error" | STATS errors = COUNT(*) BY service.name | SORT errors DESC | LIMIT 20'
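Once the per-service error counts are back, the widespread-versus-isolated call can be made mechanically. The sketch below is illustrative only: `classify_scope` and the 0.8 dominance threshold are assumptions, not part of the skill, and the input is assumed to be the `service.name` to `errors` mapping parsed from the query output.

```python
from typing import Mapping


def classify_scope(errors_by_service: Mapping[str, int], dominance: float = 0.8) -> str:
    """Classify an incident as 'none', 'isolated', or 'widespread'.

    `dominance` is a hypothetical threshold: if a single service accounts
    for at least this share of all errors, treat the issue as isolated.
    """
    total = sum(errors_by_service.values())
    if total == 0:
        return "none"
    top = max(errors_by_service.values())
    return "isolated" if top / total >= dominance else "widespread"
```

For example, a result where one service produces 900 of 1,000 errors would be classified as isolated, while three services with similar counts would be classified as widespread.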

2. Timeline Reconstruction

Establish when the issue started:

elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 2 HOURS AND service.name == "<affected-service>" AND log.level == "error" | STATS errors = COUNT(*) BY bucket = BUCKET(@timestamp, 1 minute) | SORT bucket | LIMIT 120'
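One way to turn the per-minute counts into a start time is to compare each bucket against a baseline taken from the earliest buckets. This is a minimal sketch under stated assumptions: `find_onset`, `baseline_n`, `factor`, and `min_errors` are hypothetical names and defaults, and the input is assumed to be the time-sorted `(bucket, errors)` rows from the query above.

```python
from statistics import median
from typing import Optional, Sequence, Tuple


def find_onset(
    buckets: Sequence[Tuple[str, int]],
    baseline_n: int = 10,
    factor: float = 3.0,
    min_errors: int = 5,
) -> Optional[str]:
    """Return the timestamp of the first bucket that looks anomalous.

    The first `baseline_n` buckets are assumed to predate the incident;
    their median error count is the baseline. A later bucket is flagged
    when its count reaches max(factor * baseline, min_errors).
    """
    baseline_rows = buckets[:baseline_n]
    if not baseline_rows:
        return None
    baseline = median([count for _, count in baseline_rows])
    threshold = max(factor * baseline, min_errors)
    for timestamp, count in buckets[baseline_n:]:
        if count >= threshold:
            return timestamp
    return None
```

If the incident had already begun at the start of the window, the baseline will be inflated and the function may return `None`; widening the query window is the simplest remedy in that case.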

Look for deployment or config changes around that time:

elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 4 HOURS AND (message LIKE "*deploy*" OR message LIKE "*restart*" OR message LIKE "*config*") | SORT @timestamp DESC | LIMIT 20'

3. Correlation Analysis

Cross-service dependencies:

elastic es query 'FROM traces-* | WHERE @timestamp > NOW() - 1 HOUR AND service.name == "<service>" | STATS avg_duration = AVG(span.duration.us), failed_traces = COUNT_DISTINCT(CASE(event.outcome == "failure", trace.id)) BY service.target.name | SORT avg_duration DESC'

Infrastructure metrics:

elastic es query 'FROM metrics-system.cpu-* | WHERE @timestamp > NOW() - 2 HOURS | STATS cpu = AVG(system.cpu.total.norm.pct) BY host.name, bucket = BUCKET(@timestamp, 5 minute) | WHERE cpu > 0.8 | SORT bucket'

Network connectivity:

elastic es query 'FROM metrics-system.network-* | WHERE @timestamp > NOW() - 1 HOUR | STATS dropped = SUM(system.network.in.dropped), errors = SUM(system.network.in.errors) BY host.name | WHERE dropped > 0 OR errors > 0'
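Several queries in steps 2 and 3 contain a placeholder such as `<service>` that has to be filled in before running. The sketch below shows one cautious way to do that from a script: it validates the service name against a conservative pattern instead of attempting to escape it, and it returns an argument list so no shell quoting is involved. `build_query_argv` and the allowed-character pattern are assumptions for illustration; only the `elastic es query` invocation comes from this skill.

```python
import re
from typing import List

# Hypothetical allow-list: letters, digits, dot, underscore, hyphen.
_SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def build_query_argv(template: str, service: str, placeholder: str = "<service>") -> List[str]:
    """Fill `placeholder` in an ES|QL template and return an argv list."""
    if not _SAFE_NAME.fullmatch(service):
        raise ValueError(f"unsafe service name: {service!r}")
    if placeholder not in template:
        raise ValueError(f"placeholder {placeholder!r} not found in template")
    query = template.replace(placeholder, service)
    return ["elastic", "es", "query", query]
```

The returned list can be passed to `subprocess.run(argv, check=True)`, which avoids the nested single and double quotes needed on an interactive command line.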

4. Common Root Causes

Symptom	Check	Likely Cause
High latency across services	CPU/memory metrics	Resource exhaustion
Intermittent 5xx errors	Dependency health	Downstream service failure
Connection timeouts	Network metrics	Network partition or DNS issue
Gradual degradation	Disk/memory trends	Resource leak
Sudden spike then recovery	Deploy logs	Bad deployment (auto-rolled-back)
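The table above can also be kept as a small lookup so that a script or agent can suggest the next check from a symptom keyword. This is only a sketch: the rows are copied from the table, while `Hypothesis`, `ROOT_CAUSE_TABLE`, and `suggest` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Hypothesis(NamedTuple):
    check: str
    likely_cause: str


# Rows copied from the "Common Root Causes" table.
ROOT_CAUSE_TABLE: Dict[str, Hypothesis] = {
    "High latency across services": Hypothesis("CPU/memory metrics", "Resource exhaustion"),
    "Intermittent 5xx errors": Hypothesis("Dependency health", "Downstream service failure"),
    "Connection timeouts": Hypothesis("Network metrics", "Network partition or DNS issue"),
    "Gradual degradation": Hypothesis("Disk/memory trends", "Resource leak"),
    "Sudden spike then recovery": Hypothesis("Deploy logs", "Bad deployment (auto-rolled-back)"),
}


def suggest(keyword: str) -> List[Hypothesis]:
    """Return hypotheses whose symptom contains `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [h for symptom, h in ROOT_CAUSE_TABLE.items() if needle in symptom.lower()]
```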

5. Resolution Documentation

After identifying the root cause, document:

What happened: observable symptoms
When: timeline with key events
Root cause: the underlying issue
Impact: affected services, users, SLOs
Remediation: what was done to fix it
Prevention: how to prevent recurrence
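To keep write-ups consistent, the six fields above can be captured in a small structure that refuses to render until every field is filled in. This is a minimal sketch; `RcaReport` and `render` are hypothetical names, and only the field labels come from the list above.

```python
from dataclasses import dataclass


@dataclass
class RcaReport:
    what_happened: str
    when: str
    root_cause: str
    impact: str
    remediation: str
    prevention: str

    def render(self) -> str:
        """Return the report as labeled lines, one per field."""
        fields = [
            ("What happened", self.what_happened),
            ("When", self.when),
            ("Root cause", self.root_cause),
            ("Impact", self.impact),
            ("Remediation", self.remediation),
            ("Prevention", self.prevention),
        ]
        missing = [label for label, value in fields if not value.strip()]
        if missing:
            raise ValueError("missing fields: " + ", ".join(missing))
        return "\n".join(f"{label}: {value}" for label, value in fields)
```

Raising on empty fields is a design choice: it makes an incomplete analysis visible at write-up time rather than after the report has been shared.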

Repository: elastic/elastic-ramen
Commit: 2e200ec

Last updated: 23 days ago
Created: 23 days ago
