Designs chaos experiments, creates failure injection frameworks, and facilitates game day exercises for distributed systems — producing runbooks, experiment manifests, rollback procedures, and post-mortem templates. Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems, fault injection, Chaos Monkey, Litmus Chaos.
72
88%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Advisory
Suggest reviewing before use
Load detailed guidance based on context:
| Topic | Reference | Load When |
|---|---|---|
| Experiments | references/experiment-design.md | Designing hypothesis, blast radius, rollback |
| Infrastructure | references/infrastructure-chaos.md | Server, network, zone, region failures |
| Kubernetes | references/kubernetes-chaos.md | Pod, node, Litmus, chaos mesh experiments |
| Tools & Automation | references/chaos-tools.md | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | references/game-days.md | Planning, executing, learning from game days |
Non-obvious constraints that must be enforced on every experiment:
When implementing chaos engineering, provide:
The following shows a complete experiment — from hypothesis to rollback — using Litmus Chaos on Kubernetes.
# Verify baseline: p99 latency < 200ms, error rate < 0.1%
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service# chaos-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: my-service-pod-delete
namespace: production
spec:
appinfo:
appns: production
applabel: "app=my-service"
appkind: deployment
# Limit blast radius: only 1 replica at a time
engineState: active
chaosServiceAccount: litmus-admin
experiments:
- name: pod-delete
spec:
components:
env:
- name: TOTAL_CHAOS_DURATION
value: "60" # seconds
- name: CHAOS_INTERVAL
value: "20" # delete one pod every 20s
- name: FORCE
value: "false"
- name: PODS_AFFECTED_PERC
value: "33" # max 33% of replicas affected# Apply the experiment
kubectl apply -f chaos-pod-delete.yaml
# Watch experiment status
kubectl describe chaosengine my-service-pod-delete -n production
kubectl get chaosresult my-service-pod-delete-pod-delete -n production -w# Tail application logs for errors
kubectl logs -l app=my-service -n production --since=2m -f
# Check ChaosResult verdict when complete
kubectl get chaosresult my-service-pod-delete-pod-delete \
-n production -o jsonpath='{.status.experimentStatus.verdict}'# Immediately stop the experiment
kubectl patch chaosengine my-service-pod-delete \
-n production --type merge -p '{"spec":{"engineState":"stop"}}'
# Confirm all pods are healthy
kubectl rollout status deployment/my-service -n production# Install toxiproxy CLI
brew install toxiproxy # macOS; use the binary release on Linux
# Start toxiproxy server (runs alongside your service)
toxiproxy-server &
# Create a proxy for your downstream dependency
toxiproxy-cli create -l 0.0.0.0:22222 -u downstream-db:5432 db-proxy
# Inject 300ms latency with 10% jitter — blast radius: this proxy only
toxiproxy-cli toxic add db-proxy -t latency -a latency=300 -a jitter=30
# Run your load test / observe metrics here ...
# Remove the toxic to restore normal behaviour
toxiproxy-cli toxic remove db-proxy -n latency_downstream# chaos-monkey-config.yml — restrict to a single ASG
deployment:
enabled: true
regionIndependence: false
chaos:
enabled: true
meanTimeBetweenKillsInWorkDays: 2
minTimeBetweenKillsInWorkDays: 1
grouping: APP # kill one instance per app, not per cluster
exceptions:
- account: production
region: us-east-1
detail: "*-canary" # never kill canary instances
# Apply and trigger a manual kill for testing
chaos-monkey --app my-service --account staging --dry-run falsee8be415
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.