Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Design, run, and debug Chaos Engineering experiments on Kubernetes.
When invoked with no arguments, ask before proceeding:
Q1 — Mode?
What do you need?
1. install — install Litmus Chaos or Chaos Mesh via Helm
2. experiment — design a fault injection experiment
3. schedule — wrap an experiment in a recurring schedule
4. gameday — run a structured GameDay experiment
5. debug — diagnose a failed or stuck experiment
6. report — summarize results after an experiment completes
Enter 1–6 or mode name:
After collecting the mode, ask one follow-up:
install: Which tool? Litmus Chaos (recommended default) or Chaos Mesh (network/IO faults)?
experiment: Describe the workload to target (name, namespace, and fault type: pod-delete / network-loss / cpu-stress / node-drain):
schedule: Provide the experiment YAML or describe the experiment to wrap in a schedule:
gameday: State the steady-state hypothesis. What metric or probe proves the system is healthy?
debug: Describe the symptom or paste the ChaosEngine/ChaosResult status:
report: Paste the ChaosResult output or describe the experiment that ran:
Then proceed into the relevant mode below.
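The menu accepts either a number or a mode name, so the answer has to be normalized before dispatching. A minimal sketch of that mapping follows; the function name `normalize_mode` is hypothetical and not part of the skill itself.

```shell
# Map a menu answer (1-6 or a mode name) to a canonical mode name.
# Prints the mode on stdout; returns 1 for anything unrecognized.
normalize_mode() {
  case "$1" in
    1|install)    echo install ;;
    2|experiment) echo experiment ;;
    3|schedule)   echo schedule ;;
    4|gameday)    echo gameday ;;
    5|debug)      echo debug ;;
    6|report)     echo report ;;
    *)            return 1 ;;
  esac
}
```

For example, `normalize_mode 4` prints `gameday`, and an unknown answer returns a non-zero status so the caller can ask again.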
Install Litmus Chaos or Chaos Mesh via Helm.
Steps:
Choose the tool:
Install Litmus Chaos:
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm upgrade --install chaos litmuschaos/litmus \
--namespace litmus \
--create-namespace \
--version 3.9.0 \
-f examples/chaos/litmus-install-values.yaml
Install Chaos Mesh:
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
--namespace chaos-mesh \
--create-namespace \
--version 2.7.0 \
-f examples/chaos/chaos-mesh-install-values.yaml
Verify:
# Litmus
kubectl get pods -n litmus
# Chaos Mesh
kubectl get pods -n chaos-mesh
Expected: all pods Running
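The "all pods Running" check can be scripted rather than read by eye. The sketch below assumes the default `kubectl get pods --no-headers` column layout, where STATUS is the third column. The helper name `all_pods_running` is hypothetical. Note that a Running pod is not necessarily Ready, so this is a coarse check.

```shell
# Succeeds only when the namespace has at least one pod and every pod
# reports STATUS "Running" (column 3 of `kubectl get pods --no-headers`).
all_pods_running() {
  ns="$1"
  kubectl get pods -n "$ns" --no-headers |
    awk 'BEGIN { n = 0; bad = 0 }
         { n++; if ($3 != "Running") bad++ }
         END { exit ((n == 0 || bad > 0) ? 1 : 0) }'
}
```

Usage could look like `all_pods_running litmus && echo "Litmus is up"`; swap in `chaos-mesh` for a Chaos Mesh install.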
Reference: references/chaos.md → Installation, Decision matrix
Generate a fault experiment from a description.
Steps:
Set duration (Chaos Mesh) or TOTAL_CHAOS_DURATION (Litmus); default 60s.
Set PODS_AFFECTED_PERC: "50" for initial runs (safe); increase to 100 for a full resilience test.
Include the kubectl apply command and the expected ChaosResult verdict.
See examples/chaos/pod-delete-experiment.yaml and examples/chaos/network-loss-experiment.yaml for complete examples.
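To make the duration and blast-radius settings concrete, here is a hedged sketch that prints a Litmus pod-delete ChaosEngine with both values filled in. The field layout mirrors the `engineTemplateSpec` used in the schedule example in this document. The function name `render_pod_delete_engine` and the engine name `pod-delete-engine` are assumptions for illustration; the repository's own example files remain the reference.

```shell
# Print a Litmus pod-delete ChaosEngine manifest on stdout.
# Args: namespace, app label, duration in seconds (default 60),
#       percentage of pods to affect (default 50).
# The engine name below is a hypothetical placeholder.
render_pod_delete_engine() {
  ns="$1"; label="$2"; duration="${3:-60}"; perc="${4:-50}"
  cat <<EOF
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-engine
  namespace: ${ns}
spec:
  engineState: active
  appinfo:
    appns: ${ns}
    applabel: "${label}"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "${duration}"
            - name: PODS_AFFECTED_PERC
              value: "${perc}"
EOF
}
```

One way to apply it is `render_pod_delete_engine my-namespace app=my-service | kubectl apply -f -`. Passing `60 100` as the third and fourth arguments would target all matching pods.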
Reference: references/chaos.md → Fault taxonomy, Steady-state hypothesis, Litmus ChaosEngine structure, Chaos Mesh NetworkChaos structure
Wrap an experiment in a recurring schedule.
Steps:
Litmus ChaosSchedule:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosSchedule
metadata:
  name: pod-delete-weekly
  namespace: my-namespace
spec:
  schedule:
    repeat:
      properties:
        minChaosInterval: "168h"
  engineTemplateSpec:
    appinfo:
      appns: my-namespace
      applabel: "app=my-service"
      appkind: deployment
    chaosServiceAccount: litmus-admin
    experiments:
      - name: pod-delete
        spec:
          components:
            env:
              - name: TOTAL_CHAOS_DURATION
                value: "60"
Chaos Mesh Schedule:
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: network-loss-weekly
  namespace: my-namespace
spec:
  schedule: "0 2 * * 1"
  type: NetworkChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  networkChaosTemplate:
    spec:
      action: loss
      mode: all
      selector:
        namespaces: [my-namespace]
        labelSelectors:
          app: my-service
      loss:
        loss: "20"
      duration: 60s
Recommend staging only; never schedule experiments in production without a change window.
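The staging-only recommendation can be backed by a small guard that runs before a schedule is applied. This sketch assumes production kube contexts contain the substring `prod` in their name. That naming convention, the function name `require_non_prod_context`, and the file name in the usage line are all assumptions; adapt them to how your clusters are actually named.

```shell
# Hypothetical guard: refuse to continue when the current kube context
# name contains "prod". The naming convention is an assumption.
require_non_prod_context() {
  ctx="$(kubectl config current-context)" || return 1
  case "$ctx" in
    *prod*)
      echo "refusing to schedule chaos in context: $ctx" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

A possible use is `require_non_prod_context && kubectl apply -f schedule.yaml`, where `schedule.yaml` stands in for whichever schedule manifest you saved.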
Reference: references/chaos.md → GitOps integration
Run a structured GameDay experiment.
Steps:
Define steady-state hypothesis:
Scope the blast radius:
Namespace: <namespace>
Label selector: app=<service-name>
Inject the fault (reference experiment from experiment mode output)
Observe:
Record the verdict:
Output DORA impact summary:
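The GameDay steps above follow a probe, inject, probe pattern, which can be sketched as a small wrapper. The steady-state probe and the fault injection are passed in as command strings. The function name and the verdict strings are illustrative and are not Litmus or Chaos Mesh output.

```shell
# Run one GameDay round: verify steady state, inject the fault, verify again.
# $1 = command that exits 0 when the system is healthy (steady-state probe)
# $2 = command that injects the fault and waits for it to finish
# Returns 0 on Pass, 1 on Fail, 2 when the round is aborted.
run_gameday_round() {
  probe="$1"; inject="$2"
  if ! sh -c "$probe"; then
    echo "Aborted: steady state not met before injection"
    return 2
  fi
  if ! sh -c "$inject"; then
    echo "Aborted: fault injection failed"
    return 2
  fi
  if sh -c "$probe"; then
    echo "Pass: steady state held"
  else
    echo "Fail: steady state violated"
    return 1
  fi
}
```

Checking the probe before injecting matters: if the system is already unhealthy, the experiment cannot say anything about resilience, so the round is aborted rather than recorded as a failure.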
Reference: references/chaos.md → Steady-state hypothesis, Blast radius scoping, DORA feedback loop
Diagnose a failed or stuck experiment.
Checklist:
Is the controller running?
kubectl get pods -n litmus # Litmus
kubectl get pods -n chaos-mesh # Chaos Mesh
Is the ChaosEngine/fault stuck?
kubectl describe chaosengine <name> -n <namespace> # Litmus
kubectl describe podchaos <name> -n <namespace> # Chaos Mesh
Check the ChaosResult (Litmus only):
kubectl get chaosresult -n <namespace>
kubectl describe chaosresult <engine-name>-<experiment-name> -n <namespace>
Look for status.experimentStatus.verdict and probe failure messages.
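When only the verdict is needed, a JSONPath query is quicker than reading the full `describe` output. The sketch below builds the ChaosResult name from the `<engine-name>-<experiment-name>` pattern given above and reads `status.experimentStatus.verdict`. The helper name `chaos_verdict` is hypothetical.

```shell
# Print the verdict recorded on a Litmus ChaosResult.
# Args: engine name, experiment name, namespace.
chaos_verdict() {
  engine="$1"; experiment="$2"; ns="$3"
  kubectl get chaosresult "${engine}-${experiment}" -n "$ns" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}
```

For example, `chaos_verdict my-engine pod-delete my-namespace` prints the verdict string for that run, where `my-engine` is a placeholder engine name.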
Common causes:
| Symptom | Cause | Fix |
|---|---|---|
| Stuck Initialized | Missing RBAC | kubectl get clusterrolebinding -l app.kubernetes.io/component=operator |
| No pods targeted | Label selector wrong | kubectl get pods -n <ns> -l <selector> |
| Probe failing | Wrong URL or PromQL | Test probe URL with curl from inside the namespace |
| Experiment won't end | Missing duration | Add duration: 60s (Chaos Mesh) or TOTAL_CHAOS_DURATION (Litmus) |
| Controller crashlooping | Version mismatch | Re-install matching chart version |
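The "No pods targeted" row is the easiest one to check mechanically. A minimal sketch, assuming a hypothetical helper named `count_targets`, counts the pods a selector matches; zero means the experiment has nothing to act on.

```shell
# Count pods matched by a label selector in a namespace.
# Prints 0 when nothing matches or the query fails.
count_targets() {
  ns="$1"; selector="$2"
  kubectl get pods -n "$ns" -l "$selector" --no-headers 2>/dev/null |
    wc -l | tr -d ' '
}
```

Running `count_targets my-namespace app=my-service` before applying an experiment confirms that the selector in the manifest matches at least one pod.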
Reference: references/chaos.md → Troubleshooting
Summarize experiment results after completion.
Steps:
Collect ChaosResult:
kubectl get chaosresult -n <namespace> -o yaml
Report:
status.experimentStatus.passedRuns) to probe passing again
DORA impact:
benchmark mode: /platform-skills:dora benchmark
Reference: references/chaos.md → DORA feedback loop
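Once the verdict and run counts have been read from the ChaosResult, a one-line summary is easy to produce. The sketch below takes those values as arguments rather than querying the cluster, so it makes no assumption about where the counts sit in the resource. The function name and the output format are illustrative.

```shell
# Print a one-line summary from a verdict and run counts.
# $1 = verdict, $2 = passed runs, $3 = failed runs
# The pass rate uses integer division, so it rounds down.
summarize_runs() {
  verdict="$1"; passed="$2"; failed="$3"
  total=$((passed + failed))
  if [ "$total" -eq 0 ]; then
    echo "verdict=$verdict runs=0 pass_rate=n/a"
    return 0
  fi
  echo "verdict=$verdict runs=$total pass_rate=$((passed * 100 / total))%"
}
```

For instance, `summarize_runs Pass 3 1` prints `verdict=Pass runs=4 pass_rate=75%`.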
After completing any chaos mode, log findings while context is fresh:
ERR in .learnings/ERRORS.md
LRN in .learnings/LEARNINGS.md
Use /platform-skills:self-improve log for each entry. Do not defer to end of session.