tessl install github:ahmedasmar/devops-claude-skills --skill monitoring-observability
github.com/ahmedasmar/devops-claude-skills
Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.
This skill provides comprehensive guidance for monitoring and observability workflows including metrics design, log aggregation, distributed tracing, alerting strategies, SLO/SLA management, and tool selection.
When to use this skill:
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
   ├─ YES → Go to "9. Troubleshooting & Analysis"
   └─ NO → Are you improving existing monitoring?
      ├─ Alerts → Go to "3. Alert Design"
      ├─ Dashboards → Go to "4. Dashboard & Visualization"
      ├─ SLOs → Go to "5. SLO & Error Budgets"
      ├─ Tool selection → Read references/tool_comparison.md
      └─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
Every service should monitor the Four Golden Signals: latency, traffic, errors, and saturation.
For request-driven services, use the RED Method: Rate, Errors, and Duration.
For infrastructure resources, use the USE Method: Utilization, Saturation, and Errors.
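As a rough sketch, USE-style spot checks for a host can be evaluated through the Prometheus HTTP API; the endpoint below is an assumption, and the metric names are standard node_exporter metrics:

```python
# USE-method spot checks for a host via the Prometheus HTTP API.
# PROM endpoint is an assumption; adjust to your environment.
import requests

PROM = "http://localhost:9090"

USE_QUERIES = {
    # Utilization: % of CPU time not spent idle
    "cpu_utilization_pct": '100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))',
    # Utilization: % of memory in use
    "memory_utilization_pct": '100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)',
    # Saturation: weighted time spent waiting on disk I/O
    "disk_io_saturation": 'rate(node_disk_io_time_weighted_seconds_total[5m])',
    # Errors: NIC receive + transmit errors per second
    "network_errors_per_s": 'rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])',
}

for name, query in USE_QUERIES.items():
    data = requests.get(f"{PROM}/api/v1/query", params={"query": query}).json()
    for series in data["data"]["result"]:
        print(name, series["metric"].get("instance", ""), series["value"][1])
```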
Quick Start - Web Application Example:
# Rate (requests/sec)
sum(rate(http_requests_total[5m]))
# Errors (error rate %)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Duration (p95 latency)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
For comprehensive metric design guidance:
→ Read: references/metrics_design.md
Detect anomalies and trends in your metrics:
# Analyze Prometheus metrics for anomalies
python3 scripts/analyze_metrics.py prometheus \
--endpoint http://localhost:9090 \
--query 'rate(http_requests_total[5m])' \
--hours 24
# Analyze CloudWatch metrics
python3 scripts/analyze_metrics.py cloudwatch \
--namespace AWS/EC2 \
--metric CPUUtilization \
--dimensions InstanceId=i-1234567890abcdef0 \
--hours 48
→ Script: scripts/analyze_metrics.py
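As a conceptual illustration only (not the algorithm scripts/analyze_metrics.py actually uses), a simple z-score check over a Prometheus range query looks like this:

```python
# Flag samples more than 3 standard deviations from the series mean.
# Endpoint and query are assumptions; adjust to your environment.
import statistics
import time

import requests

PROM = "http://localhost:9090"
QUERY = 'rate(http_requests_total[5m])'

end = time.time()
params = {"query": QUERY, "start": end - 24 * 3600, "end": end, "step": "300"}
result = requests.get(f"{PROM}/api/v1/query_range", params=params).json()

for series in result["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1e-9  # avoid division by zero on flat series
    anomalies = [(ts, float(v)) for ts, v in series["values"]
                 if abs(float(v) - mean) > 3 * stdev]
    print(series["metric"], f"-> {len(anomalies)} anomalous samples")
```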
Every log entry should include a timestamp, severity level, message, service name, and a request/correlation ID, plus relevant context such as user, order, error type, and duration.
Example structured log (JSON):
{
"timestamp": "2024-10-28T14:32:15Z",
"level": "error",
"message": "Payment processing failed",
"service": "payment-service",
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user123",
"order_id": "ORD-456",
"error_type": "GatewayTimeout",
"duration_ms": 5000
}
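A minimal sketch of emitting logs in this shape with Python's standard logging module; the formatter and field names mirror the example above and are otherwise arbitrary:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",
        }
        # Merge contextual fields passed via logging's `extra` mechanism.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed", extra={"context": {
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "order_id": "ORD-456",
    "error_type": "GatewayTimeout",
}})
```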
Common log aggregation backends include the ELK Stack (Elasticsearch, Logstash, Kibana), Grafana Loki, and CloudWatch Logs.
Analyze logs for errors, patterns, and anomalies:
# Analyze log file for patterns
python3 scripts/log_analyzer.py application.log
# Show error lines with context
python3 scripts/log_analyzer.py application.log --show-errors
# Extract stack traces
python3 scripts/log_analyzer.py application.log --show-traces
→ Script: scripts/log_analyzer.py
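As a conceptual sketch of this kind of analysis (not the script's actual implementation), here is a short pass that counts the most frequent error messages in a JSON-lines log:

```python
# Count the most frequent error messages in a JSON-lines log file.
# Field names ("level", "message") follow the structured-log example above.
import json
import sys
from collections import Counter

errors = Counter()
with open(sys.argv[1]) as fh:
    for line in fh:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if entry.get("level") == "error":
            errors[entry.get("message", "<no message>")] += 1

for message, count in errors.most_common(10):
    print(f"{count:6d}  {message}")
```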
For comprehensive logging guidance:
→ Read: references/logging_guide.md
| Severity | Response Time | Example |
|---|---|---|
| Critical | Page immediately | Service down, SLO violation |
| Warning | Ticket, review in hours | Elevated error rate, resource warning |
| Info | Log for awareness | Deployment completed, scaling event |
Alert when error budget is consumed too quickly:
# Fast burn (1h window) - Critical
- alert: ErrorBudgetFastBurn
  expr: |
    (error_rate / 0.001) > 14.4  # 99.9% SLO
  for: 2m
  labels:
    severity: critical

# Slow burn (6h window) - Warning
- alert: ErrorBudgetSlowBurn
  expr: |
    (error_rate / 0.001) > 6  # 99.9% SLO
  for: 30m
  labels:
    severity: warning
Audit your alert rules against best practices:
# Check single file
python3 scripts/alert_quality_checker.py alerts.yml
# Check all rules in directory
python3 scripts/alert_quality_checker.py /path/to/prometheus/rules/
→ Script: scripts/alert_quality_checker.py
Production-ready alert rule templates:
→ Templates: assets/templates/prometheus-alerts/webapp-alerts.yml, assets/templates/prometheus-alerts/kubernetes-alerts.yml
For comprehensive alerting guidance:
→ Read: references/alerting_best_practices.md
Create comprehensive runbooks for your alerts:
→ Template: assets/templates/runbooks/incident-runbook-template.md
┌─────────────────────────────────────┐
│ Overall Health (Single Stats) │
│ [Requests/s] [Error%] [P95 Latency]│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Request Rate & Errors (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Latency Distribution (Graphs) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Resource Usage (Graphs) │
└─────────────────────────────────────┘
Automatically generate dashboards from templates:
# Web application dashboard
python3 scripts/dashboard_generator.py webapp \
--title "My API Dashboard" \
--service my_api \
--output dashboard.json
# Kubernetes dashboard
python3 scripts/dashboard_generator.py kubernetes \
--title "K8s Production" \
--namespace production \
--output k8s-dashboard.json
# Database dashboard
python3 scripts/dashboard_generator.py database \
--title "PostgreSQL" \
--db-type postgres \
--instance db.example.com:5432 \
--output db-dashboard.json
Supported dashboard types: web application, Kubernetes, and database.
→ Script: scripts/dashboard_generator.py
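The generator emits Grafana dashboard JSON. As a rough, abbreviated sketch of what such output can look like (field set trimmed; the panel titles and queries are illustrative, not the script's actual templates):

```python
# Build a minimal Grafana dashboard JSON document in Python.
import json

def timeseries_panel(panel_id: int, title: str, expr: str, y: int) -> dict:
    return {
        "id": panel_id,
        "title": title,
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": y},
        "targets": [{"expr": expr, "refId": "A"}],
    }

dashboard = {
    "title": "My API Dashboard",
    "time": {"from": "now-6h", "to": "now"},
    "panels": [
        timeseries_panel(1, "Request rate",
                         'sum(rate(http_requests_total{service="my_api"}[5m]))', 0),
        timeseries_panel(2, "Error rate %",
                         'sum(rate(http_requests_total{service="my_api",status=~"5.."}[5m]))'
                         ' / sum(rate(http_requests_total{service="my_api"}[5m])) * 100', 8),
    ],
}

with open("dashboard.json", "w") as f:
    json.dump(dashboard, f, indent=2)
```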
SLI (Service Level Indicator): Measurement of service quality
SLO (Service Level Objective): Target value for an SLI
Error Budget: Allowed failure amount = (100% - SLO)
| Availability | Downtime/Month | Use Case |
|---|---|---|
| 99% | 7.2 hours | Internal tools |
| 99.9% | 43.2 minutes | Standard production |
| 99.95% | 21.6 minutes | Critical services |
| 99.99% | 4.3 minutes | High availability |
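As a quick worked example of the arithmetic (the numbers match the calculator invocation below):

```python
# Error budget math for a 99.9% availability SLO over 30 days.
slo = 0.999
total_requests = 1_000_000
failed_requests = 1_500

allowed_failures = (1 - slo) * total_requests          # 1,000 failed requests allowed
budget_consumed = failed_requests / allowed_failures   # 1.5 -> 150% of budget spent

# Time-based view: 30 days * 24 h * 60 min * 0.001 = 43.2 minutes of downtime budget
downtime_budget_minutes = 30 * 24 * 60 * (1 - slo)

print(f"allowed failures: {allowed_failures:.0f}")
print(f"budget consumed:  {budget_consumed:.0%}")
print(f"downtime budget:  {downtime_budget_minutes:.1f} min/month")
```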
Calculate compliance, error budgets, and burn rates:
# Show SLO reference table
python3 scripts/slo_calculator.py --table
# Calculate availability SLO
python3 scripts/slo_calculator.py availability \
--slo 99.9 \
--total-requests 1000000 \
--failed-requests 1500 \
--period-days 30
# Calculate burn rate
python3 scripts/slo_calculator.py burn-rate \
--slo 99.9 \
--errors 50 \
--requests 10000 \
--window-hours 1
→ Script: scripts/slo_calculator.py
For comprehensive SLO/SLA guidance:
→ Read: references/slo_sla_guide.md
Use distributed tracing when you need to follow a single request across multiple services, pinpoint where latency is introduced, or understand how errors propagate between services.
Python example:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("process_order")
def process_order(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    try:
        result = payment_service.charge(order_id)
        span.set_attribute("payment.status", "success")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        span.record_exception(e)
        raise
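The example above assumes a tracer provider is already configured. A minimal setup sketch, assuming the opentelemetry-sdk package, the OTLP gRPC exporter, and a collector listening on localhost:4317:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Name the service so backends can group its spans.
resource = Resource.create({"service.name": "payment-service"})
provider = TracerProvider(resource=resource)

# Batch spans and export them to an OpenTelemetry Collector.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```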
Error-based sampling (always sample errors, 1% of successes):
from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult

class ErrorSampler(Sampler):
    def should_sample(self, parent_context, trace_id, name,
                      kind=None, attributes=None, links=None, trace_state=None):
        attributes = attributes or {}
        if attributes.get("error", False):  # always keep error spans
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        if trace_id & 0xFF < 3:  # keep ~1% (3/256) of the rest
            return SamplingResult(Decision.RECORD_AND_SAMPLE, attributes)
        return SamplingResult(Decision.DROP)

    def get_description(self):
        return "ErrorSampler"
Production-ready OpenTelemetry Collector configuration:
→ Template: assets/templates/otel-config/collector-config.yaml
For comprehensive tracing guidance:
→ Read: references/tracing_guide.md
If your Datadog bill is growing out of control, start by identifying waste.
Automatically analyze your Datadog usage and find cost optimization opportunities:
# Analyze Datadog usage (requires API key and APP key)
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY
# Show detailed breakdown by category
python3 scripts/datadog_cost_analyzer.py \
--api-key $DD_API_KEY \
--app-key $DD_APP_KEY \
--show-details
→ Script: scripts/datadog_cost_analyzer.py
The main optimization levers and their typical savings:
1. Custom metrics optimization: 20-40%
2. Log management: 30-50%
3. APM optimization: 15-25%
4. Infrastructure monitoring: 10-20%
If you're considering migrating to a more cost-effective open-source stack, the typical mapping is Datadog metrics → Prometheus, logs → Loki, traces → Tempo, and dashboards → Grafana.
Estimated cost savings: 60-77% ($49.8k-61.8k/year for a 100-host environment).
A phased migration keeps risk low:
Phase 1: Run in parallel (months 1-2)
Phase 2: Migrate dashboards & alerts (months 2-3)
Phase 3: Migrate logs & traces (months 3-4)
Phase 4: Decommission Datadog (months 4-5)
When migrating dashboards and alerts, you'll need to translate Datadog queries to PromQL:
Quick examples:
# Average CPU
Datadog: avg:system.cpu.user{*}
Prometheus: avg(rate(node_cpu_seconds_total{mode="user"}[5m])) * 100
# Request rate
Datadog: sum:requests.count{*}.as_rate()
Prometheus: sum(rate(http_requests_total[5m]))
# P95 latency
Datadog: p95:request.duration{*}
Prometheus: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate percentage
Datadog: (sum:requests.errors{*}.as_rate() / sum:requests.count{*}.as_rate()) * 100
Prometheus: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
→ Full Translation Guide: references/dql_promql_translation.md
Example: 100-host infrastructure
| Component | Datadog (Annual) | Open Source (Annual) | Savings |
|---|---|---|---|
| Infrastructure | $18,000 | $10,000 (self-hosted infra) | $8,000 |
| Custom Metrics | $600 | Included | $600 |
| Logs | $24,000 | $3,000 (storage) | $21,000 |
| APM/Traces | $37,200 | $5,000 (storage) | $32,200 |
| Total | $79,800 | $18,000 | $61,800 (77%) |
For comprehensive migration guidance:
→ Read: references/datadog_migration.md
Choosing between Prometheus + Grafana, Datadog, the Grafana Stack (LGTM), the ELK Stack, and cloud-native options (CloudWatch, etc.) comes down to cost, setup effort, and ongoing operational burden; the comparison below and references/tool_comparison.md cover the decision criteria in detail.
| Solution | Monthly Cost | Setup | Ops Burden |
|---|---|---|---|
| Prometheus + Loki + Tempo | $1,500 | Medium | Medium |
| Grafana Cloud | $3,000 | Low | Low |
| Datadog | $8,000 | Low | None |
| ELK Stack | $4,000 | High | High |
| CloudWatch | $2,000 | Low | Low |
For a comprehensive tool comparison:
→ Read: references/tool_comparison.md
Validate health check endpoints against best practices:
# Check single endpoint
python3 scripts/health_check_validator.py https://api.example.com/health
# Check multiple endpoints
python3 scripts/health_check_validator.py \
https://api.example.com/health \
https://api.example.com/readiness \
--verbose
→ Script: scripts/health_check_validator.py
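As a sketch of the kind of health endpoint such a validator can probe (FastAPI assumed; the response fields and dependency checks are illustrative, not requirements of the script):

```python
import time

from fastapi import FastAPI, Response

app = FastAPI()
START_TIME = time.time()

def check_database() -> bool:
    # Placeholder dependency check; replace with a real connectivity probe.
    return True

@app.get("/health")
def health(response: Response):
    checks = {"database": check_database()}
    healthy = all(checks.values())
    # Return 503 when a dependency is down so load balancers stop routing traffic.
    response.status_code = 200 if healthy else 503
    return {
        "status": "ok" if healthy else "degraded",
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "checks": checks,
    }
```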
High latency, high error rate, and service-down investigations all start from the same building blocks: the PromQL queries, kubectl commands, and log queries below.
# Request rate
sum(rate(http_requests_total[5m]))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Check pod status
kubectl get pods -n <namespace>
# View pod logs
kubectl logs -f <pod-name> -n <namespace>
# Check pod resources
kubectl top pods -n <namespace>
# Describe pod for events
kubectl describe pod <pod-name> -n <namespace>
# Check recent deployments
kubectl rollout history deployment/<name> -n <namespace>
Elasticsearch:
GET /logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "level": "error" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
}
}
Loki (LogQL):
{job="app", level="error"} |= "error" | jsonCloudWatch Insights:
fields @timestamp, level, message
| filter level = "error"
| filter @timestamp > ago(1h)
Scripts:
- analyze_metrics.py - Detect anomalies in Prometheus/CloudWatch metrics
- alert_quality_checker.py - Audit alert rules against best practices
- slo_calculator.py - Calculate SLO compliance and error budgets
- log_analyzer.py - Parse logs for errors and patterns
- dashboard_generator.py - Generate Grafana dashboards from templates
- health_check_validator.py - Validate health check endpoints
- datadog_cost_analyzer.py - Analyze Datadog usage and find cost waste
References:
- metrics_design.md - Four Golden Signals, RED/USE methods, metric types
- alerting_best_practices.md - Alert design, runbooks, on-call practices
- logging_guide.md - Structured logging, aggregation patterns
- tracing_guide.md - OpenTelemetry, distributed tracing
- slo_sla_guide.md - SLI/SLO/SLA definitions, error budgets
- tool_comparison.md - Comprehensive comparison of monitoring tools
- datadog_migration.md - Complete guide for migrating from Datadog to OSS stack
- dql_promql_translation.md - Datadog Query Language to PromQL translation reference
Templates:
- prometheus-alerts/webapp-alerts.yml - Production-ready web app alerts
- prometheus-alerts/kubernetes-alerts.yml - Kubernetes monitoring alerts
- otel-config/collector-config.yaml - OpenTelemetry Collector configuration
- runbooks/incident-runbook-template.md - Incident response template