Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Use this template for every planned chaos experiment. Fill in each section before injecting faults.
System is healthy when:
| Probe | Type | Condition |
|---|---|---|
| `http://my-service/healthz` returns 200 | HTTP | Must hold throughout experiment |
| `rate(http_errors[1m]) < 0.01` | Prometheus | Error rate below 1% |
Acceptable degradation: [e.g., p99 latency may spike up to 2s for up to 10s]
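
The two probes in the table can be combined into a single pass/fail check. The following is a minimal sketch in Python, assuming the health-check status code and the error-rate value have already been collected by whatever tooling drives the experiment. The function name `steady_state_ok` and its parameters are hypothetical and not part of this template.

```python
def steady_state_ok(healthz_status: int, error_rate: float,
                    max_error_rate: float = 0.01) -> bool:
    """Return True when both steady-state probes hold.

    healthz_status: HTTP status observed from the health endpoint.
    error_rate: value observed for the Prometheus error-rate query.
    max_error_rate: threshold taken from the probe table (assumed 0.01).
    """
    http_probe_ok = healthz_status == 200
    prometheus_probe_ok = error_rate < max_error_rate
    return http_probe_ok and prometheus_probe_ok
```

Evaluating this before injection, during the fault, and after termination gives the observations needed for the timeline.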
| Parameter | Value |
|---|---|
| Namespace | my-namespace |
| Label selector | app=my-service |
| Fault type | pod-delete / network-loss / cpu-stress |
| Affected pods | 50% (first run) |
| Duration | 60s |
| Environment | staging |
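
To make the "Affected pods" percentage concrete before a run, it can help to translate it into a pod count for the current replica count. The sketch below is illustrative only: the helper name `pods_to_affect` is hypothetical, and rounding up is an assumption, so check how your fault-injection tool actually rounds.

```python
import math


def pods_to_affect(replicas: int, percent: float) -> int:
    """Translate a blast-radius percentage into a pod count.

    Rounds up (an assumption), so any non-zero percentage of a
    non-empty workload targets at least one pod.
    """
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return math.ceil(replicas * percent / 100)
```

With 4 replicas and the 50% first-run value this gives 2 pods; with 5 replicas it gives 3.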

    kubectl apply -f examples/chaos/pod-delete-experiment.yaml
    kubectl get chaosengine pod-delete-engine -n my-namespace -w

| Time | Observation |
|---|---|
| T+0s | Fault injected |
| T+?s | Alert fired (if applicable) |
| T+?s | Probe failure first seen |
| T+60s | Fault terminated |
| T+?s | Steady-state probe restored |
Root cause (if FAIL):
| Metric | Observation |
|---|---|
| Change failure rate | Did this fault class appear in past incidents? |
| MTTR | Time from T+0 (fault) to steady-state restored |
| Recommendation | [e.g., add PDB, increase HPA min replicas, add circuit breaker] |
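
The MTTR row is defined as the time from T+0 (fault) to steady-state restored, so it can be computed directly from two timeline timestamps. This is a minimal sketch, assuming both timestamps were recorded as `datetime` values in the same time zone; the function name `recovery_seconds` is hypothetical.

```python
from datetime import datetime


def recovery_seconds(fault_injected_at: datetime,
                     steady_state_restored_at: datetime) -> float:
    """Seconds between fault injection (T+0) and steady-state restoration."""
    delta = steady_state_restored_at - fault_injected_at
    if delta.total_seconds() < 0:
        raise ValueError("restoration time precedes fault injection")
    return delta.total_seconds()
```

For example, a fault injected at 12:00:00 with the steady-state probe restored at 12:01:35 yields 95.0 seconds.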