Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
When invoked with no arguments, ask before troubleshooting:
Q1 — What is the symptom?
Describe what's broken — paste the error message, command output, or describe
the observable behaviour (e.g. "pods stuck in Pending", "HelmRelease not reconciling",
"403 on IAM role assumption"):
Use the response as the symptom for all subsequent steps. Do not ask for the layer; infer it from the symptom description and show your classification in step 1.
You are a senior platform engineer performing structured troubleshooting.
The user reports: $ARGUMENTS
Follow this exact structure:
1. Identify which layer owns this problem:
2. List the exact commands the user should run to gather diagnostic data before any fix is attempted. Be specific: include namespace flags, resource names from the description, and output filters.
3. Based on the symptom, state the most likely root cause. Explain why this layer and this cause. If multiple causes are plausible, rank them.
4. Provide the exact configuration change, command, or patch. Show before and after where relevant. Do not suggest a fix that requires evidence not yet collected.
5. Commands to confirm the fix worked.
6. How to safely undo the change if validation fails.
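The fix, validation, and rollback steps above can be sketched for one common case, a change to a Kubernetes Deployment. This is a minimal illustration, not the handbook's prescribed procedure: the function name `fix_with_rollback`, its arguments, and the 120-second timeout are assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Sketch: apply a manifest to a Deployment, validate the rollout, undo on failure.
# fix_with_rollback and its arguments are hypothetical names for illustration.
fix_with_rollback() {
  local ns="$1" deploy="$2" manifest="$3"

  # Fix: apply the proposed change.
  kubectl apply -n "$ns" -f "$manifest" || return 1

  # Validation: wait for the rollout to complete within a bounded time.
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s; then
    echo "validated: deployment/$deploy rolled out"
    return 0
  fi

  # Rollback: return to the previous revision when validation fails.
  echo "validation failed: rolling back deployment/$deploy" >&2
  kubectl rollout undo "deployment/$deploy" -n "$ns"
  return 1
}
```

A call might look like `fix_with_rollback payments payment-api fix.yaml`, where the namespace, Deployment name, and file are placeholders. The sketch assumes the Deployment already has a previous revision to return to.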
Reconstruct what happened in a cluster in the last N minutes. Use when you know something broke but don't know when or what triggered it.
Steps:
Collect events across all namespaces, sorted by time:
kubectl get events -A --sort-by='.lastTimestamp' | tail -50
# Warnings only
kubectl get events -A --sort-by='.lastTimestamp' \
  --field-selector type=Warning | tail -30
Check recent pod state changes:
# Pods that are neither Running nor Completed
kubectl get pods -A | grep -v Running | grep -v Completed
# Highest restart counts (first container of each pod)
kubectl get pods -A --no-headers -o custom-columns=\
'NS:.metadata.namespace,NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount' \
  | sort -k3 -rn | head -20
Controller-level changes (what Kubernetes itself did):
kubectl get events -A --sort-by='.lastTimestamp' \
  --field-selector reason=ScalingReplicaSet
kubectl get events -A --sort-by='.lastTimestamp' \
  --field-selector reason=FailedScheduling
Recent deployments and rollouts:
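# Rollout history for every Deployment, one namespace at a time
kubectl get deployments -A --no-headers \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name' \
  | while read -r ns name; do kubectl rollout history "deployment/$name" -n "$ns"; done
# Check when each ReplicaSet (one per rollout) was created
kubectl get replicasets -A --sort-by='.metadata.creationTimestamp' | tail -10
Node-level events (pressure, cordoning, OOM):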
kubectl describe nodes | grep -A5 "Conditions:\|Events:"
kubectl get events -A --field-selector involvedObject.kind=Node \
  --sort-by='.lastTimestamp' | tail -20
Flux / GitOps reconciliation timeline (if applicable):
flux get all -A | grep -v "True"
kubectl get events -n flux-system --sort-by='.lastTimestamp' | tail -20
Produce a timeline, ordering all findings chronologically:
HH:MM [Node] node-3 reports MemoryPressure
HH:MM [Scheduler] pod/payment-api-7d9f unable to schedule (Insufficient memory)
HH:MM [ReplicaSet] payment-api scaled down from 5 → 3 replicas
HH:MM [HPA] payment-api HPA unable to compute desired replica count
HH:MM [Alert] ErrorRateCritical fires on payment-api
→ Next: Once the timeline is established, run /platform-skills:debug with the root-cause symptom for structured fix guidance, or /platform-skills:product postmortem to convert the timeline into a post-mortem.
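The chronological ordering in the final step can be sketched as a small filter that merges findings from several sources and prints them in the `HH:MM [Source] message` shape used above. The function name `build_timeline` and the pipe-delimited input format are assumptions made for this illustration; messages that themselves contain `|` would be cut short.

```shell
#!/usr/bin/env bash
# Sketch: merge findings from several sources into one chronological timeline.
# Input (stdin): one finding per line as "<RFC 3339 UTC timestamp>|<source>|<message>".
# build_timeline and this input format are hypothetical, for illustration only.
#
# One possible way to feed it Kubernetes events:
#   kubectl get events -A -o jsonpath='{range .items[*]}{.lastTimestamp}{"|"}{.involvedObject.kind}{"|"}{.message}{"\n"}{end}' \
#     | build_timeline
build_timeline() {
  # RFC 3339 UTC timestamps sort correctly as plain strings.
  sort -t'|' -k1,1 |
    awk -F'|' '{
      # Keep only HH:MM from a timestamp such as 2024-05-01T10:07:12Z.
      hhmm = substr($1, 12, 5)
      printf "%s [%s] %s\n", hhmm, $2, $3
    }'
}
```

Because only `HH:MM` is printed, a timeline that crosses midnight loses its date. Keeping the full timestamp in the output would avoid that.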
Running kubectl delete pod before reading logs loses the crash context permanently. Always gather evidence before acting.
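One way to follow this rule is to save a pod's state to disk before touching it. The sketch below is a minimal illustration; the function name `capture_pod_evidence`, the default `evidence` directory, and the file layout are assumptions rather than part of the handbook.

```shell
#!/usr/bin/env bash
# Sketch: save a pod's crash context to disk before the pod is deleted.
# capture_pod_evidence and the output layout are hypothetical, for illustration only.
capture_pod_evidence() {
  local ns="$1" pod="$2" out="${3:-evidence}/$1-$2"
  mkdir -p "$out" || return 1

  # Full object state, including lastState and exit codes.
  kubectl get pod "$pod" -n "$ns" -o yaml > "$out/pod.yaml"
  # Human-readable summary with recent events.
  kubectl describe pod "$pod" -n "$ns" > "$out/describe.txt"
  # Logs from the current containers.
  kubectl logs "$pod" -n "$ns" --all-containers > "$out/logs-current.txt"
  # Logs from the previous (crashed) containers; empty if nothing restarted.
  kubectl logs "$pod" -n "$ns" --all-containers --previous \
    > "$out/logs-previous.txt" 2>/dev/null
  # Events that reference this pod.
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod" \
    --sort-by='.lastTimestamp' > "$out/events.txt"

  # Print the evidence directory so callers can reference it.
  echo "$out"
}
```

For example, `capture_pod_evidence payments payment-api-7d9f` would write the files under `evidence/payments-payment-api-7d9f`. Only after that directory has been populated would a delete or restart be considered.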