Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage, data loss | 15 min | Production down |
| SEV2 | Major degradation | 30 min | Critical feature broken |
| SEV3 | Minor impact | 2 hours | Non-critical bug |
| SEV4 | Minimal impact | Next business day | Cosmetic issue |
## Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

## Templates

### Template 1: Service Outage Runbook

# [Service Name] Outage Runbook
## Overview
**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall
## Impact Assessment
- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted? (see the query sketch below)
- [ ] Are there financial implications?
- [ ] What's the blast radius?
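To put a number on the traffic question above, query the request metrics directly. This is a sketch that reuses the Prometheus endpoint from the triage section and assumes `http_requests_total` carries `service` and `status` labels (label names may differ in your setup):

```bash
# Sketch: percentage of payment traffic currently failing.
PROM=http://prometheus:9090/api/v1/query
TOTAL=$(curl -s "$PROM" --data-urlencode \
  'query=sum(rate(http_requests_total{service="payment-service"}[5m]))' \
  | jq -r '.data.result[0].value[1]')
ERRORS=$(curl -s "$PROM" --data-urlencode \
  'query=sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m]))' \
  | jq -r '.data.result[0].value[1]')
awk -v e="$ERRORS" -v t="$TOTAL" 'BEGIN { printf "Impacted traffic: %.1f%%\n", 100*e/t }'
```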
## Detection
### Alerts
- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)
### Dashboards
- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)
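When paged, it can be faster to pull the currently firing alerts straight from Prometheus than to scan dashboards; a minimal sketch using the alerts API:

```bash
# List firing alerts with their summary annotations
curl -s http://prometheus:9090/api/v1/alerts \
  | jq -r '.data.alerts[] | select(.state=="firing") | "\(.labels.alertname): \(.annotations.summary // "n/a")"'
```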
## Initial Triage (First 5 Minutes)
### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service
# Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Check error rates
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

### 2. Quick Health Check

```bash
curl -I https://api.company.com/payments/health
```

### 3. Identify Failure Mode

| Symptom | Likely Cause | Go To Section |
|---|---|---|
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |
## Mitigation Steps

### 4.1 Service Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments
# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100
# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments
# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments
# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10
# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

### 4.2 High Latency (Database / Dependency)

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
curl localhost:8080/metrics | grep db_pool
# Step 2: Check slow queries (if DB issue)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND duration > interval '5 seconds'
ORDER BY duration DESC;"
# Step 3: Kill long-running queries if needed (filter by duration; verify before terminating)
psql -h $DB_HOST -U $DB_USER -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes';"
# Step 4: Check external dependency latency
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health
# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

### 4.3 Partial Failures (Code Bug)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
grep -i error | sort | uniq -c | sort -rn | head -20
# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments
# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
-d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'
# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
```

### 4.4 Traffic Surge

```bash
# Step 1: Check pod load (kubectl top shows CPU/memory, a proxy for request pressure)
kubectl top pods -n payments
# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20
# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
RATE_LIMIT_ENABLED=true \
RATE_LIMIT_RPS=1000 -n payments
# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 192.168.1.0/24 # Suspicious range
EOF
```

## Verification

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq
# Verify error rate is back to normal
curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'
# Verify latency is acceptable
curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq
# Smoke test critical flows
./scripts/smoke-test-payments.sh
```
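The smoke-test script is repo-specific; a minimal sketch of what it might assert (the charge endpoint and test token here are hypothetical):

```bash
#!/usr/bin/env bash
set -euo pipefail
BASE=https://api.company.com/payments

# Health endpoint must return 200
[ "$(curl -s -o /dev/null -w '%{http_code}' "$BASE/health")" = "200" ]

# A zero-amount, uncaptured charge with a test token should succeed
curl -sf -X POST "$BASE/v1/charges" \
  -d '{"amount": 0, "source": "tok_test", "capture": false}' > /dev/null

echo "Payments smoke test passed"
```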
## Rollback

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments
# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION
# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
-d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'| Condition | Escalate To | Contact |
|---|---|---|
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
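Escalation can also be triggered programmatically; a sketch using PagerDuty's Events API v2 (the routing key is a placeholder):

```bash
curl -s -X POST https://events.pagerduty.com/v2/enqueue \
  -H 'Content-Type: application/json' \
  -d '{
        "routing_key": "YOUR_PAGERDUTY_ROUTING_KEY",
        "event_action": "trigger",
        "payload": {
          "summary": "SEV1: payment-service outage, unresolved > 15 min",
          "source": "payments-runbook",
          "severity": "critical"
        }
      }'
```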
## Communication Templates

### Initial Notification

```
🚨 INCIDENT: Payment Service Degradation
Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]
Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards
Updates in #payments-incidents
```

### Status Update

```
📊 UPDATE: Payment Service Incident
Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes
Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas
Next Steps:
- Continuing to monitor
- Root cause analysis in progress
ETA to Resolution: ~15 minutes
```

### Resolution Notice

```
✅ RESOLVED: Payment Service Incident
Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4
Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully
Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```

### Template 2: Database Incident Runbook
# Database Incident Runbook
## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(<pid>);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |
## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;
-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < now() - interval '10 minutes';
```
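If exhaustion recurs, check whether the connection math even works: the app's pool size times replica count must stay under the server's `max_connections`. A quick comparison, run against the primary:

```bash
# Compare the configured ceiling with current usage
psql -h $DB_HOST -U $DB_USER -Atc \
  "SELECT current_setting('max_connections')::int AS max_conn,
          count(*) AS in_use
   FROM pg_stat_activity;"
```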
## Replication Lag

```sql
-- Check lag on replica
SELECT
CASE
WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
END AS lag_seconds;
-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```
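While mitigating, a simple loop makes it obvious whether lag is trending down (assumes psql access to the replica; the host variable is a placeholder):

```bash
# Print replica lag in seconds every 10s
while true; do
  psql -h "$REPLICA_HOST" -U "$DB_USER" -Atc \
    "SELECT coalesce(extract(epoch from now() - pg_last_xact_replay_timestamp()), 0)::int;"
  sleep 10
done
```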
## Disk Space Alerts

```bash
# Check disk usage
df -h /var/lib/postgresql/data
# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"
# VACUUM FULL to reclaim space (WARNING: takes an exclusive lock; prefer plain VACUUM mid-incident)
psql -c "VACUUM FULL large_table;"
# If emergency, delete old data or expand disk
```
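If the database runs on Kubernetes with a storage class that allows volume expansion, the disk can often be grown online; a sketch (PVC name, namespace, and target size are assumptions):

```bash
# Request a larger volume; the storage class must support expansion
kubectl patch pvc postgres-data -n database --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"500Gi"}}}}'
```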
## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress
### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident
## Troubleshooting
### Runbook steps work in staging but fail during a real incident
Steps often assume preconditions that are true in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:
```bash
# Step: Check pod status
kubectl get pods -n payments
# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
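# Generic guard (sketch): fail fast when kubectl points at the wrong cluster
kubectl config current-context | grep -q 'prod' || { echo "Wrong cluster"; exit 1; }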
# Expected output: pods in Running state
```

### Responders lose track of progress under stress

Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:

```markdown
## Quick Checklist
- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
```

### Runbooks go stale between incidents

Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all curl endpoints and kubectl context names are still valid:

```markdown
## Runbook Metadata
| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
```
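The CI check mentioned above can be as small as probing every URL and kubectl context the runbook references; a sketch (the file name and context name are assumptions):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Probe every https:// URL referenced in the runbook
grep -oE 'https://[^ )"`]+' runbook.md | sort -u | while read -r url; do
  curl -sf -o /dev/null --max-time 10 "$url" || echo "DEAD LINK: $url"
done

# Confirm the kubectl context the runbook assumes still exists
kubectl config get-contexts -o name | grep -qx 'prod-cluster' \
  || echo "MISSING CONTEXT: prod-cluster"
```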
### Stakeholders are left guessing during an incident

Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda in the communication template:

```
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
```

### Destructive commands get run without a safety check

Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:

```sql
-- WARNING: This terminates active connections. Verify count first.
-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
```

## Related Skills

- **postmortem-writing** - After resolving an incident, use postmortem templates to capture root cause and preventive actions
- **on-call-handoff-patterns** - Structure shift handoffs so the incoming responder has full context on active incidents