
nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.


examples/chaos/gameday-runbook.md

GameDay Runbook Template

Use this template for every planned chaos experiment. Fill in each section before injecting faults.


1. Steady-State Hypothesis

System is healthy when:

| Probe | Type | Condition |
| --- | --- | --- |
| `http://my-service/healthz` returns 200 | HTTP | Must hold throughout experiment |
| `rate(http_errors[1m]) < 0.01` | Prometheus | Error rate below 1% |

Acceptable degradation: [e.g., p99 latency may spike up to 2s for up to 10s]
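
The two probes in the table can be folded into one pass/fail check. The following is a minimal sketch, assuming the HTTP status code and the Prometheus query result have already been collected by whatever tooling runs the experiment. The function name `steady_state_holds` and its parameters are hypothetical and are not part of the repository.

```python
def steady_state_holds(http_status: int, error_rate: float,
                       max_error_rate: float = 0.01) -> bool:
    """Return True when both steady-state probes from the table pass.

    http_status: status code returned by the /healthz probe.
    error_rate:  value returned by the Prometheus probe query.
    max_error_rate: threshold from the table (0.01); the comparison is strict.
    """
    healthz_ok = http_status == 200
    errors_ok = error_rate < max_error_rate
    return healthz_ok and errors_ok


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    print(steady_state_holds(200, 0.002))
    print(steady_state_holds(503, 0.002))
```

Evaluating this check at a fixed interval for the whole fault window is one way to test the "must hold throughout experiment" condition.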


2. Blast Radius

| Parameter | Value |
| --- | --- |
| Namespace | `my-namespace` |
| Label selector | `app=my-service` |
| Fault type | pod-delete / network-loss / cpu-stress |
| Affected pods | 50% (first run) |
| Duration | 60s |
| Environment | staging |
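
Before the run, it can help to turn the "Affected pods" percentage into a concrete pod count so the blast radius is explicit. The sketch below is an illustration only: the helper `pods_to_target` is hypothetical, and its rounding (round up, at least one pod) is an assumption that may not match how a given chaos tool rounds.

```python
def pods_to_target(matching_pods: int, affected_percent: int) -> int:
    """Number of pods a percentage-based blast radius would hit.

    matching_pods:    pods matched by the label selector (e.g. app=my-service).
    affected_percent: the "Affected pods" value from the table, 1 to 100.

    Assumption: round up and target at least one pod. Check the rounding
    rule of your chaos tool before relying on this number.
    """
    if matching_pods <= 0:
        return 0
    if not 0 < affected_percent <= 100:
        raise ValueError("affected_percent must be between 1 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (matching_pods * affected_percent + 99) // 100)


if __name__ == "__main__":
    # Hypothetical deployment with 4 replicas and the 50% first-run setting.
    print(pods_to_target(4, 50))
```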

3. Experiment

kubectl apply -f examples/chaos/pod-delete-experiment.yaml
kubectl get chaosengine pod-delete-engine -n my-namespace -w

4. Observation

| Time | Observation |
| --- | --- |
| T+0s | Fault injected |
| T+?s | Alert fired (if applicable) |
| T+?s | Probe failure first seen |
| T+60s | Fault terminated |
| T+?s | Steady-state probe restored |

5. Verdict

  • PASS — steady-state probe held throughout; service recovered within SLO
  • FAIL — probe failed OR recovery exceeded SLO OR alert did not fire
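
The PASS and FAIL rules above can be written as a small decision function. This is a sketch under stated assumptions: `evaluate_verdict` and its parameters are hypothetical names, and the alert condition is only checked when an alert was expected, mirroring the "if applicable" note in the observation timeline.

```python
from __future__ import annotations


def evaluate_verdict(probe_held: bool, recovery_seconds: float,
                     slo_seconds: float, alert_expected: bool = False,
                     alert_fired: bool = False) -> tuple[str, list[str]]:
    """Apply the runbook's verdict rules.

    Returns ("PASS", []) or ("FAIL", [reasons...]). The reasons are a
    starting point for the "Root cause" field, not a diagnosis.
    """
    reasons: list[str] = []
    if not probe_held:
        reasons.append("steady-state probe failed")
    if recovery_seconds > slo_seconds:
        reasons.append("recovery exceeded SLO")
    # Assumption: a missing alert only counts when one was expected.
    if alert_expected and not alert_fired:
        reasons.append("alert did not fire")
    return ("FAIL", reasons) if reasons else ("PASS", [])


if __name__ == "__main__":
    # Hypothetical run: probe held, recovered in 20s against a 30s SLO.
    print(evaluate_verdict(True, 20, 30))
```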

Root cause (if FAIL):


6. DORA Impact

| Metric | Observation |
| --- | --- |
| Change failure rate | Did this fault class appear in past incidents? |
| MTTR | Time from T+0 (fault) to steady-state restored |
| Recommendation | [e.g., add PDB, increase HPA min replicas, add circuit breaker] |
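
The MTTR row can be filled in from the observation timestamps. The sketch below assumes the fault-injection and steady-state-restored times were recorded as `datetime` values; the function names and the sample timestamps are hypothetical.

```python
from datetime import datetime
from statistics import mean


def time_to_restore(fault_injected: datetime,
                    steady_state_restored: datetime) -> float:
    """Seconds from T+0 (fault injected) to steady-state restored."""
    seconds = (steady_state_restored - fault_injected).total_seconds()
    if seconds < 0:
        raise ValueError("restore time precedes fault injection")
    return seconds


def mean_time_to_restore(samples: list) -> float:
    """Average restore time, in seconds, over several GameDay runs."""
    return mean(samples)


if __name__ == "__main__":
    # Hypothetical timestamps for a single run.
    t0 = datetime(2024, 1, 1, 10, 0, 0)
    restored = datetime(2024, 1, 1, 10, 1, 25)
    print(time_to_restore(t0, restored))
```

Tracking the per-run value across repeated experiments of the same fault class shows whether a recommendation, once applied, actually shortened recovery.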
