observability-rca

Use this skill when performing root cause analysis on incidents detected by Elastic Observability. Activate when the user reports a production issue, outage, degraded performance, or asks to investigate alerts.

Elastic Observability Root Cause Analysis

Investigation Framework

1. Assess Scope

Start with high-level health checks:

elastic es cluster health
elastic slos list

Determine whether the issue is widespread or isolated by comparing error counts across services over the last hour:

elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 1 HOUR AND log.level == "error" | STATS errors = COUNT(*) BY service.name | SORT errors DESC | LIMIT 20'
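
Once the per-service counts are back, the widespread-versus-isolated call can be made explicit. The sketch below is illustrative only: the function name `classify_scope`, the 0.8 dominance threshold, and the input shape (a mapping of service name to error count, parsed from the query output by whatever means you have) are assumptions, not part of the `elastic` CLI.

```python
def classify_scope(errors_by_service, dominance=0.8):
    """Label an incident 'isolated' when one service owns most errors.

    errors_by_service: dict of service.name -> error count (hypothetical
    shape; parse it from the query output however suits your tooling).
    dominance: assumed threshold for the top service's share of errors.
    """
    total = sum(errors_by_service.values())
    if total == 0:
        return "no-errors", None
    top_service, top_count = max(errors_by_service.items(), key=lambda kv: kv[1])
    share = top_count / total
    label = "isolated" if share >= dominance else "widespread"
    return label, top_service


# Made-up counts for illustration only.
print(classify_scope({"checkout": 950, "cart": 30, "search": 20}))
print(classify_scope({"checkout": 400, "cart": 350, "search": 300}))
```

A single dominant service suggests starting the investigation there; an even spread points toward shared infrastructure or a common dependency.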

2. Timeline Reconstruction

Establish when the issue started:

elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 2 HOURS AND service.name == "<affected-service>" AND log.level == "error" | STATS errors = COUNT(*) BY bucket = BUCKET(@timestamp, 1 minute) | SORT bucket | LIMIT 120'
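
Reading the onset off a list of per-minute buckets is easier with a simple rule. The following is a minimal sketch, assuming the buckets have already been parsed into `(timestamp, error_count)` pairs; `find_onset`, `factor`, and `min_errors` are hypothetical names and tuning values, not something the skill or the CLI defines.

```python
from statistics import median


def find_onset(buckets, factor=3.0, min_errors=5):
    """Return the timestamp of the first bucket that breaks from baseline.

    buckets: list of (timestamp, error_count) sorted by time (hypothetical
    shape, e.g. parsed from the per-minute query output).
    factor, min_errors: assumed tuning knobs, not values from the skill.
    """
    if not buckets:
        return None
    # The median is used as a baseline because it resists short spikes.
    baseline = median(count for _, count in buckets)
    threshold = max(baseline * factor, min_errors)
    for timestamp, count in buckets:
        if count >= threshold:
            return timestamp
    return None


# Made-up buckets for illustration only.
sample = [("10:00", 2), ("10:01", 1), ("10:02", 3), ("10:03", 40), ("10:04", 55)]
print(find_onset(sample))
```

The median baseline only works when the incident covers less than half of the queried window; if errors have been elevated for most of it, widen the window before trusting the result.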

Look for deployment or config changes around that time:

elastic es query 'FROM logs-* | WHERE @timestamp > NOW() - 4 HOURS AND (message LIKE "*deploy*" OR message LIKE "*restart*" OR message LIKE "*config*") | SORT @timestamp DESC | LIMIT 20'
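
To tie change events to the incident, it helps to keep only those that landed shortly before the onset. This is a sketch under stated assumptions: `changes_before_onset` is a hypothetical helper, events are assumed to be `(ISO 8601 timestamp, message)` pairs without a timezone suffix, and the 30-minute look-back is an arbitrary default.

```python
from datetime import datetime, timedelta


def changes_before_onset(events, onset, window_minutes=30):
    """Keep change events that landed shortly before the onset.

    events: list of (iso_timestamp, message) tuples (hypothetical shape).
    onset: ISO 8601 timestamp of the first bad bucket.
    window_minutes: assumed look-back; tune to your deploy cadence.
    """
    onset_dt = datetime.fromisoformat(onset)
    earliest = onset_dt - timedelta(minutes=window_minutes)
    hits = [
        (ts, msg)
        for ts, msg in events
        if earliest <= datetime.fromisoformat(ts) <= onset_dt
    ]
    # Most recent change first: it is usually the first one to examine.
    return sorted(hits, key=lambda e: datetime.fromisoformat(e[0]), reverse=True)


# Made-up events for illustration only.
events = [
    ("2024-05-01T09:00:00", "deploy api v1"),
    ("2024-05-01T09:50:00", "config reload"),
    ("2024-05-01T10:01:00", "deploy api v2"),
    ("2024-05-01T10:10:00", "restart worker"),
]
for ts, msg in changes_before_onset(events, "2024-05-01T10:03:00"):
    print(ts, msg)
```

Events after the onset are dropped on purpose: restarts that follow the first errors are more likely a symptom or a remediation than a cause.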

3. Correlation Analysis

Cross-service dependencies:

elastic es query 'FROM traces-* | WHERE @timestamp > NOW() - 1 HOUR AND service.name == "<service>" | STATS avg_duration = AVG(span.duration.us), failed_traces = COUNT_DISTINCT(CASE(event.outcome == "failure", trace.id)) BY service.target.name | SORT avg_duration DESC'

Infrastructure metrics:

elastic es query 'FROM metrics-system.cpu-* | WHERE @timestamp > NOW() - 2 HOURS | STATS cpu = AVG(system.cpu.total.norm.pct) BY host.name, bucket = BUCKET(@timestamp, 5 minute) | WHERE cpu > 0.8 | SORT bucket'

Network connectivity:

elastic es query 'FROM metrics-system.network-* | WHERE @timestamp > NOW() - 1 HOUR | STATS dropped = SUM(system.network.in.dropped), errors = SUM(system.network.in.errors) BY host.name | WHERE dropped > 0 OR errors > 0'
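
Interface statistics such as `system.network.in.dropped` are often reported as cumulative counters since boot. If that holds in your deployment (an assumption worth verifying against your field mappings), a `SUM` over raw samples overstates the problem, and the increase over the window is the more useful figure. The sketch below shows one way to compute that increase from readings in time order; `counter_increase` is a hypothetical helper.

```python
def counter_increase(samples):
    """Sum the increase of a cumulative counter, tolerating resets.

    samples: counter readings in time order (hypothetical input).
    A drop between readings is treated as a reset to zero, e.g. a reboot.
    """
    increase = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # counter restarted from zero
    return increase


# Made-up readings for illustration only.
print(counter_increase([100, 100, 100]))    # flat counter, no new drops
print(counter_increase([100, 120, 5, 10]))  # includes one reset
```

A host whose counter is large but flat has old drops, not current ones, and can usually be ruled out for the incident window.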

4. Common Root Causes

| Symptom | Check | Likely Cause |
| --- | --- | --- |
| High latency across services | CPU/memory metrics | Resource exhaustion |
| Intermittent 5xx errors | Dependency health | Downstream service failure |
| Connection timeouts | Network metrics | Network partition or DNS issue |
| Gradual degradation | Disk/memory trends | Resource leak |
| Sudden spike then recovery | Deploy logs | Bad deployment (auto-rolled-back) |
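
The symptom table can also be kept as data so that a script or agent can look up the next check. This is only a sketch: `RUNBOOK` and `suggest` are hypothetical names, and the entries simply restate the rows above.

```python
# Symptom -> (what to check, likely cause), restating the table above.
RUNBOOK = {
    "high latency across services": ("CPU/memory metrics", "Resource exhaustion"),
    "intermittent 5xx errors": ("Dependency health", "Downstream service failure"),
    "connection timeouts": ("Network metrics", "Network partition or DNS issue"),
    "gradual degradation": ("Disk/memory trends", "Resource leak"),
    "sudden spike then recovery": ("Deploy logs", "Bad deployment (auto-rolled-back)"),
}


def suggest(symptom):
    """Return (check, likely_cause) for a known symptom, else None."""
    return RUNBOOK.get(symptom.strip().lower())


print(suggest("Connection timeouts"))
print(suggest("Unknown symptom"))
```

The likely cause is a starting hypothesis, not a conclusion; confirm it with the queries from the earlier steps.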

5. Resolution Documentation

After identifying the root cause, document:

  • What happened: observable symptoms
  • When: timeline with key events
  • Root cause: the underlying issue
  • Impact: affected services, users, SLOs
  • Remediation: what was done to fix it
  • Prevention: how to prevent recurrence
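
The six items above can be captured in a small structure so that no field is forgotten. The sketch below is one possible shape, not a format the skill prescribes: `IncidentReport` and `render` are hypothetical names, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentReport:
    what_happened: str
    when: str
    root_cause: str
    impact: str
    remediation: str
    prevention: str

    def render(self):
        """Render one 'Label: value' line per field, in declaration order."""
        lines = []
        for f in fields(self):
            label = f.name.replace("_", " ").capitalize()
            lines.append(f"{label}: {getattr(self, f.name)}")
        return "\n".join(lines)


# Invented example values.
report = IncidentReport(
    what_happened="Elevated 5xx responses on checkout",
    when="Errors began at 10:03, recovered at 10:25",
    root_cause="Connection pool exhausted after a config change",
    impact="checkout service; availability SLO burn",
    remediation="Rolled back the config change",
    prevention="Add a pool saturation alert",
)
print(report.render())
```

Because every field is required, constructing the report fails loudly if one of the six sections is left out.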

Repository: elastic/elastic-ramen