Defines service level objectives, creates error budget policies, designs incident response procedures, develops capacity models, and produces monitoring configurations and automation scripts for production systems. Use when defining SLIs/SLOs, managing error budgets, building reliable systems at scale, incident management, chaos engineering, toil reduction, or capacity planning.
98
100%
Does it follow best practices?
Impact
96%
1.07xAverage score across 6 eval scenarios
Passed
No known issues
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| SLO/SLI | references/slo-sli-management.md | Defining SLOs, calculating error budgets |
| Error Budgets | references/error-budget-policy.md | Managing budgets, burn rates, policies |
| Monitoring | references/monitoring-alerting.md | Golden signals, alert design, dashboards |
| Automation | references/automation-toil.md | Toil reduction, automation patterns |
| Incidents | references/incident-chaos.md | Incident response, chaos engineering |
When implementing SRE practices, provide:
# 99.9% availability SLO over a 30-day window
# Allowed downtime: (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes/month
# Error budget (request-based): 0.001 * total_requests
# Example: 10M requests/month → 10,000 error budget requests
# If 5,000 errors consumed in week 1 → 50% budget burned in 25% of window
# → Trigger error budget policy: freeze non-critical releasesgroups:
- name: slo_availability
rules:
# Fast burn: 2% budget in 1h (14.4x burn rate)
- alert: HighErrorBudgetBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > 0.014400
and
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > 0.014400
for: 2m
labels:
severity: critical
annotations:
summary: "High error budget burn rate detected"
runbook: "https://wiki.internal/runbooks/high-error-burn"
# Slow burn: 5% budget in 6h (1x burn rate sustained)
- alert: SlowErrorBudgetBurn
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > 0.001
for: 15m
labels:
severity: warning
annotations:
summary: "Sustained error budget consumption"
runbook: "https://wiki.internal/runbooks/slow-error-burn"# Latency — 99th percentile request duration
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
# Traffic — requests per second by service
sum(rate(http_requests_total[5m])) by (service)
# Errors — error rate ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
# Saturation — CPU throttling ratio
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)#!/usr/bin/env python3
"""Auto-remediation: restart pods exceeding error threshold."""
import subprocess, sys, json
ERROR_THRESHOLD = 0.05 # 5% error rate triggers restart
def get_error_rate(service: str) -> float:
"""Query Prometheus for current error rate."""
import urllib.request
query = f'sum(rate(http_requests_total{{status=~"5..",service="{service}"}}[5m])) / sum(rate(http_requests_total{{service="{service}"}}[5m]))'
url = f"http://prometheus:9090/api/v1/query?query={urllib.request.quote(query)}"
with urllib.request.urlopen(url) as resp:
data = json.load(resp)
results = data["data"]["result"]
return float(results[0]["value"][1]) if results else 0.0
def restart_deployment(namespace: str, deployment: str) -> None:
subprocess.run(
["kubectl", "rollout", "restart", f"deployment/{deployment}", "-n", namespace],
check=True
)
print(f"Restarted {namespace}/{deployment}")
if __name__ == "__main__":
service, namespace, deployment = sys.argv[1], sys.argv[2], sys.argv[3]
rate = get_error_rate(service)
print(f"Error rate for {service}: {rate:.2%}")
if rate > ERROR_THRESHOLD:
restart_deployment(namespace, deployment)
else:
print("Within SLO threshold — no action required")5b76101
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.