nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.

GameDay Runbook Template

Author: nitinjain999

Use this template for every planned chaos experiment. Fill in the planning sections before injecting faults and the results sections once the experiment has finished.

System is healthy when:

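| Probe | Type | Condition |
| --- | --- | --- |
| `http://my-service/healthz` returns 200 | HTTP | Must hold throughout experiment |
| `rate(http_errors[1m]) < 0.01` | Prometheus | Fewer than 0.01 errors per second (`rate()` yields a per-second rate; divide by the request rate to threshold on a 1% error ratio) |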

Acceptable degradation: [e.g., p99 latency may spike up to 2s for up to 10s]
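
The two probes above can be checked from a shell before, during, and after the fault. The following is a minimal sketch, not part of the original template: `http_probe_ok` and `below_threshold` are hypothetical helper names, and `PROM_URL` in the commented usage is an assumed variable pointing at your Prometheus server.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the steady-state probes (names are illustrative).

# Succeeds only when the URL answers with HTTP 200 within 5 seconds.
http_probe_ok() {
  local url="$1" code
  code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")" || return 1
  [ "$code" = "200" ]
}

# Succeeds only when a numeric sample is strictly below a threshold.
# awk does the comparison because shell arithmetic is integer-only.
# An empty sample (for example, a query with no result) counts as a failure.
below_threshold() {
  [ -n "$1" ] || return 1
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 < t + 0) }'
}

# Example usage (assumes curl and jq are installed and PROM_URL is set):
#   http_probe_ok "http://my-service/healthz" || echo "HTTP probe failed"
#   sample="$(curl -s "$PROM_URL/api/v1/query" \
#     --data-urlencode 'query=rate(http_errors[1m])' \
#     | jq -r '.data.result[0].value[1]')"
#   below_threshold "$sample" 0.01 || echo "Prometheus probe failed"
```

Treating an empty sample as a failure is a deliberate choice here: a probe that returns no data should not silently pass the steady-state check.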

| Parameter | Value |
| --- | --- |
| Namespace | `my-namespace` |
| Label selector | `app=my-service` |
| Fault type | `pod-delete` / `network-loss` / `cpu-stress` |
| Affected pods | 50% (first run) |
| Duration | 60s |
| Environment | staging |
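
Before the first run it can help to translate the percentage into a concrete pod count. The sketch below assumes the count is rounded up; the chaos tool you use may round differently, so treat the result as an estimate. `affected_pods` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
# Estimate how many pods a percentage-based fault would hit.
# Assumption: the count is rounded up (integer ceiling of total * percent / 100).
affected_pods() {
  local total="$1" percent="$2"
  echo $(( (total * percent + 99) / 100 ))
}

# Example usage (requires cluster access):
#   total="$(kubectl get pods -n my-namespace -l app=my-service --no-headers | wc -l)"
#   affected_pods "$total" 50
```

With 5 matching pods and the 50% setting from the table, this estimate gives 3 pods.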

kubectl apply -f examples/chaos/pod-delete-experiment.yaml
kubectl get chaosengine pod-delete-engine -n my-namespace -w

Root cause (if FAIL):

| Metric | Observation |
| --- | --- |
| Change failure rate | Did this fault class appear in past incidents? |
| MTTR | Time from T+0 (fault) to steady-state restored |
| Recommendation | [e.g., add PDB, increase HPA min replicas, add circuit breaker] |
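
The MTTR row can be filled in from two recorded timestamps. This is a minimal sketch, assuming you capture Unix epoch seconds at fault injection (T+0) and again when the steady-state probes pass; `mttr_seconds` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Compute recovery time in seconds from two epoch timestamps.
# Fails if the recovery timestamp precedes the fault timestamp.
mttr_seconds() {
  local fault_at="$1" restored_at="$2"
  [ "$restored_at" -ge "$fault_at" ] || return 1
  echo $(( restored_at - fault_at ))
}

# Example usage:
#   fault_at="$(date -u +%s)"      # record just before injecting the fault
#   ...run the experiment and wait for the probes to pass...
#   restored_at="$(date -u +%s)"   # record when steady state is restored
#   mttr_seconds "$fault_at" "$restored_at"
```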