Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Set up or improve observability for a service or platform component.
When invoked with no arguments, ask before proceeding:
Q1 — Mode?
What do you need?
1. instrument — add structured logs, Prometheus metrics, and OTel tracing to a service
2. dashboard — create a Grafana RED/USE dashboard for a service
3. alert — write Prometheus alerting rules for a service
4. slo — define SLIs, error budgets, and SLO burn-rate alerts
5. loadtest — write and run a k6 load test
6. capacity — estimate resource requirements and HPA configuration
Enter 1–6 or mode name:
Q2 — Context (after mode selected):
instrument: What language/framework and which metrics/tracing backend (Prometheus, Datadog, Jaeger, Tempo)?
dashboard: Service name and which signal to lead with — request-based (RED) or resource-based (USE)?
alert: Service name and what SLIs matter most — error rate, latency, availability?
slo: Service name, expected availability target (e.g. 99.9%), and current p95 latency baseline:
loadtest: Target endpoint, expected peak RPS, and SLO thresholds (p95 latency, max error rate):
capacity: Expected peak RPS, measured p99 latency at current load, and memory per pod:

Add the three pillars — logs, metrics, traces — to a service.
Steps:
http_requests_total counter, http_request_duration_seconds histogram, per route and status
/metrics scrape endpoint (Prometheus) or configure push exporter
/healthz and /readyz health check endpoints
Reference: references/observability.md → Structured Logging, Prometheus Metrics, OpenTelemetry Tracing
→ Next: Run /platform-skills:observability alert to write alerting rules for the metrics just added, then /platform-skills:observability slo to wrap them in an error budget.
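For the structured-logging pillar, a minimal sketch using only the Python standard library might look like the following. The field names (`route`, `status`, `duration_s`, `trace_id`) and the helper `build_logger` are illustrative assumptions, not names taken from references/observability.md.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    # Hypothetical request-scoped fields; adjust to the service's own schema.
    EXTRA_FIELDS = ("route", "status", "duration_s", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("checkout").info("request served", extra={"route": "/cart", "status": 200, "duration_s": 0.042})` would then emit one JSON line carrying those fields. Keeping `trace_id` in the log line rather than in a metric label lets logs and traces be correlated without raising metric cardinality.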
Create a Grafana dashboard for a service.
Steps:
Reference: references/observability.md → Grafana Dashboards
→ Next: Run /platform-skills:observability slo to add SLO burn-rate alerts and an error budget panel to this dashboard.
Write Prometheus alerting rules for a service.
Steps:
rate() over 5m windows
for: duration ≥ 1m to suppress transient noise
severity label (critical / warning) and runbook annotation to every alert
Alert design rules:
Reference: references/observability.md → Alerting Rules
→ Next: Run /platform-skills:observability slo to promote these symptom alerts to proper SLO burn-rate alerts backed by an error budget.
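The conventions in the alert steps (a `for:` of at least 1m, a `severity` label, a `runbook` annotation) can be checked mechanically. The sketch below is one hypothetical way to lint a rule that has already been loaded into a dict; `lint_rule` and `duration_seconds` are illustrative names, and the parser deliberately handles only single-unit durations such as `90s` or `2m`.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_seconds(value: str) -> int:
    """Parse a single-unit duration such as '90s' or '2m' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def lint_rule(rule: dict) -> list[str]:
    """Return a list of convention violations for one alerting rule."""
    problems = []
    hold = rule.get("for")
    if hold is None or duration_seconds(hold) < 60:
        problems.append("for: must be at least 1m")
    if rule.get("labels", {}).get("severity") not in ("critical", "warning"):
        problems.append("severity label must be critical or warning")
    if not rule.get("annotations", {}).get("runbook"):
        problems.append("runbook annotation is missing")
    return problems
```

Running this in CI against every rule file would catch an alert that pages on a single bad scrape before it reaches production.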
Write and run a k6 load test.
Steps:
thresholds matching the SLO
check() assertions on status code and response time
k6 run --out json=results.json load-test.js
Reference: references/observability.md → Load Testing
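Once `results.json` exists, the p95 latency can be recomputed offline and compared with the SLO threshold. The sketch below assumes the newline-delimited shape that k6's JSON output is documented to use (one object per line, with `type`, `metric`, and `data.value`, and `http_req_duration` in milliseconds); verify that shape against your k6 version. The function name is hypothetical.

```python
import json
import math


def p95_from_k6_json(lines, metric="http_req_duration"):
    """Nearest-rank p95 of one metric from k6 JSON output lines."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Only sample points carry values; "Metric" lines are declarations.
        if entry.get("type") == "Point" and entry.get("metric") == metric:
            values.append(entry["data"]["value"])
    if not values:
        raise ValueError(f"no samples found for {metric}")
    values.sort()
    rank = math.ceil(95 * len(values) / 100)  # nearest-rank percentile
    return values[rank - 1]
```

For example, `with open("results.json") as fh: print(p95_from_k6_json(fh))` prints the p95 in the unit k6 recorded. The nearest-rank method is a simplification, so small differences from k6's own summary are expected.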
Estimate resource requirements and HPA configuration for a service.
Steps:
replicas = ceil((peak_rps × avg_latency_s) / target_concurrency_per_pod)
Reference: references/observability.md → Capacity Planning
→ Next: After sizing, run /platform-skills:observability loadtest to validate the HPA triggers correctly under synthetic load.
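The replica formula above is an application of Little's Law (requests in flight = arrival rate × time in system). A small worked sketch, with a hypothetical function name and example numbers chosen only for illustration:

```python
import math


def replicas_needed(peak_rps: float, avg_latency_s: float,
                    target_concurrency_per_pod: float) -> int:
    """replicas = ceil((peak_rps × avg_latency_s) / target_concurrency_per_pod)"""
    if target_concurrency_per_pod <= 0:
        raise ValueError("target_concurrency_per_pod must be positive")
    in_flight = peak_rps * avg_latency_s  # concurrent requests at peak
    return math.ceil(in_flight / target_concurrency_per_pod)


# 450 RPS at 0.25 s average latency keeps 112.5 requests in flight;
# at 8 concurrent requests per pod that rounds up to 15 replicas.
print(replicas_needed(450, 0.25, 8))
```

The result is a floor for `minReplicas` at peak, not a complete HPA configuration; headroom for failover and rollout surge would be added on top.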
Define SLIs, set error budgets, and generate SLO burn-rate alerts from first principles.
Steps:
Define SLIs — identify what "good" looks like for this service:
sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) < 0.3
Set the SLO target — start conservative, tighten over time:
SLO: 99.9% availability over a 30-day rolling window
Error budget: 0.1% = 43.2 minutes per 30 days
Generate burn-rate alerts — multiwindow, multi-burn-rate (Google SRE Workbook approach):
# Fast burn: consumes 2% of the 30-day budget in 1h → page immediately
# Aggregate before dividing: a bare rate() / rate() matches each 5xx series
# only against itself, so the ratio would always be 1.
- alert: SLOFastBurn
  expr: |
    (
      sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
      / sum by (service) (rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)  # 14.4× burn rate exhausts budget in ~2 days
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Fast error budget burn on {{ $labels.service }}"
    runbook: "https://runbooks.internal/slo-fast-burn"
# Slow burn: consumes 5% of the 30-day budget in 6h → ticket
- alert: SLOSlowBurn
  expr: |
    (
      sum by (service) (rate(http_requests_total{status=~"5.."}[6h]))
      / sum by (service) (rate(http_requests_total[6h]))
    ) > (6 * 0.001)  # 6× burn rate exhausts budget in ~5 days
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Slow error budget burn on {{ $labels.service }}"
    runbook: "https://runbooks.internal/slo-slow-burn"
Track remaining error budget — add to the Grafana dashboard:
# Error budget remaining (%) over 30d
100 * (
  1 - (
    sum(increase(http_requests_total{status=~"5.."}[30d]))
    / sum(increase(http_requests_total[30d]))
  ) / 0.001  # divide by error budget fraction (1 - SLO target)
)
Error budget policy — state in writing what happens when the budget is depleted:
50% remaining → normal feature velocity
Key rules:
for: on fast burn should be short (1–2m); slow burn needs longer (10–15m)
→ Next: Run /platform-skills:observability alert to add symptom-based alerts alongside SLO burn-rate alerts.
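The error-budget and burn-rate figures used in the slo steps follow from simple arithmetic. The sketch below works them through for a 99.9% target over 30 days; the function names are illustrative, not part of any library.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the whole budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


def days_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Days until the budget is gone at a constant burn rate."""
    return window_days / burn_rate


print(round(error_budget_minutes(0.999), 1))  # 43.2 minutes
print(round(budget_consumed(14.4, 1), 3))     # 0.02: fast burn spends 2% in 1h
print(round(budget_consumed(6, 6), 3))        # 0.05: slow burn spends 5% in 6h
print(round(days_to_exhaustion(14.4), 2))     # 2.08 days
print(round(days_to_exhaustion(6), 2))        # 5.0 days
```

The alert threshold is then `burn_rate × (1 - SLO target)`, which is where `14.4 * 0.001` and `6 * 0.001` come from.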
for: duration — without it, a single bad scrape triggers a page. Minimum 1m for slow alerts, 2m for fast SLO burn
slo mode), then write alerts
histogram_quantile interpolates inaccurately. Set le values at 0.1, 0.25, 0.3, 0.5, 1.0, 2.5
user_id, session_id, or request_id as Prometheus label values. Cardinality explodes memory usage.