Comprehensive Kubernetes toolkit for YAML generation, validation, and cluster debugging
Mental Model: Kubernetes debugging follows a layered approach—start broad (cluster health), narrow to affected components (pods, services), then drill into specific failures (logs, events, resource constraints).
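The broad first pass of that layered approach can be sketched offline. The filter below runs over text captured from `kubectl get pods -A` (the pod names and namespaces here are made up for illustration), surfacing anything whose STATUS is neither Running nor Completed:

```bash
#!/bin/sh
# Sample text in the shape `kubectl get pods -A` prints
# (hypothetical pods, for illustration only).
pods='NAMESPACE   NAME        READY   STATUS             RESTARTS   AGE
prod        api-6f9c    1/1     Running            0          3d
prod        worker-x1   0/1     CrashLoopBackOff   12         3d
kube-system coredns-a   1/1     Running            0          9d
batch       job-77      0/1     Completed          0          1h'

# Keep the header plus any pod whose STATUS column ($4) is suspect.
suspects=$(printf '%s\n' "$pods" | awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")')
printf '%s\n' "$suspects"
```

Against a live cluster the same idea is one command: `kubectl get pods -A --field-selector=status.phase!=Running`.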
Decision Framework:
Start with `kubectl get events` and `kubectl describe` on the affected resource first.

When to use this skill: debugging failing pods, broken service connectivity, resource pressure, or overall cluster health.

Pod failures:

```bash
# Quick assessment
kubectl get pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>  # Events section is key

# Detailed diagnostics
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>

# Check previous logs if CrashLoopBackOff
kubectl logs <pod-name> -n <namespace> --previous
```

Service connectivity:

```bash
# Verify service endpoints match pod IPs
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

# Network diagnostics
./scripts/network_debug.sh <namespace> <pod-name>

# Test from a debug pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
```

Resource pressure:

```bash
# Resource usage by container
kubectl top pods -n <namespace> --containers

# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 lastState

# Review recent logs
kubectl logs <pod-name> -n <namespace> --tail=100 --timestamps
```

Cluster health:

```bash
# Comprehensive health report
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
```

Focus on non-obvious flags and patterns most useful during debugging:
```bash
# Cross-namespace pod overview
kubectl get pods -A -o wide --field-selector=status.phase!=Running

# Previous container logs (post-crash)
kubectl logs <pod-name> -n <namespace> --previous

# Multi-container pod: target a specific container
kubectl logs <pod-name> -n <namespace> -c <container>

# Stream logs with timestamps
kubectl logs <pod-name> -n <namespace> -f --timestamps

# Describe for the Events section, the most useful first stop
kubectl describe pod <pod-name> -n <namespace>

# Full pod YAML including status conditions
kubectl get pod <pod-name> -n <namespace> -o yaml

# Confirm endpoint IPs match running pod IPs (a label selector mismatch shows an empty list)
kubectl get endpoints <service-name> -n <namespace>

# Test DNS from within the cluster
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# Sort events by time to find recent failures
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Per-container resource usage (reveals which container is the culprit)
kubectl top pod <pod-name> -n <namespace> --containers

# Resource quota consumption vs. limits
kubectl describe resourcequota -n <namespace>
```

⚠️ These commands are destructive or disruptive. Follow the verification steps before and after each operation.
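One way to honor that warning is to gate the disruptive command behind a check. The sketch below is a hypothetical `safe_restart` helper, not part of this toolkit's scripts; `rollout_ok` inspects the text that `kubectl rollout status` prints (assumed to contain "successfully rolled out" on success), taking it as an argument so the check itself can be exercised without a cluster:

```bash
#!/bin/sh
# Return success only when the rollout-status text reports completion.
rollout_ok() {
  printf '%s\n' "$1" | grep -q 'successfully rolled out'
}

# Hypothetical wrapper: verify state, restart, then verify again.
safe_restart() {
  name=$1 ns=$2
  status=$(kubectl rollout status "deployment/$name" -n "$ns" 2>&1)
  if ! rollout_ok "$status"; then
    echo "refusing to restart: deployment/$name is not in a clean state" >&2
    return 1
  fi
  kubectl rollout restart "deployment/$name" -n "$ns" &&
    kubectl rollout status "deployment/$name" -n "$ns" --timeout=120s
}
```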
Restarting a deployment:

```bash
# Verify current rollout state first
kubectl rollout status deployment/<name> -n <namespace>

# Restart
kubectl rollout restart deployment/<name> -n <namespace>

# Verify the rollout completes successfully
kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
```

Rolling back a deployment:

```bash
# Check rollout history to pick the correct revision
kubectl rollout history deployment/<name> -n <namespace>

# Rollback to the previous revision
kubectl rollout undo deployment/<name> -n <namespace>

# Confirm rollback success and pods are running
kubectl rollout status deployment/<name> -n <namespace>
kubectl get pods -n <namespace> -l app=<name>
```

Force-deleting a stuck pod:

```bash
# Confirm the pod is genuinely stuck (not just slow to terminate)
kubectl get pod <pod-name> -n <namespace> -w  # Watch for 60s before proceeding

# Force delete only if the pod remains Terminating with no progress
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Verify the pod is gone and not rescheduled in an error state
kubectl get pod <pod-name> -n <namespace>
```

Draining a node:

```bash
# Review what will be evicted before draining
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Cordon first to prevent new scheduling
kubectl cordon <node-name>

# Drain (evicts pods gracefully)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Verify the node is drained and no workloads remain
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# After maintenance, uncordon to restore scheduling
kubectl uncordon <node-name>
kubectl get node <node-name>  # Confirm status returns to Ready
```

Common mistakes, with better alternatives:

BAD:
```bash
# Immediately restarting without investigation
kubectl rollout restart deployment/app
```

GOOD:

```bash
# Gather diagnostic context first
kubectl describe deployment/app
kubectl get events --sort-by='.lastTimestamp' | tail -20
kubectl logs -l app=app --tail=50
# THEN decide whether a restart is appropriate
```

BAD:
```bash
# Assuming the default namespace
kubectl get pods
kubectl logs my-pod
```

GOOD:

```bash
# Always specify the namespace explicitly
kubectl get pods -n production
kubectl logs my-pod -n production

# Or use -A to search all namespaces
kubectl get pods -A | grep my-pod
```

BAD:
```bash
# Immediate force delete
kubectl delete pod stuck-pod --force --grace-period=0
```

GOOD:

```bash
# Investigate why the pod is stuck first
kubectl describe pod stuck-pod
kubectl get pod stuck-pod -o yaml | grep -A 10 finalizers

# Try a normal delete first
kubectl delete pod stuck-pod

# Wait 60s and watch for termination;
# force delete only once the pod is confirmed stuck
```

BAD:
```bash
# Exec into a production pod and make changes
kubectl exec -it prod-pod -- /bin/bash
```

GOOD:

```bash
# Create a debug copy for investigation
kubectl debug prod-pod --copy-to=debug-pod --share-processes

# Or use an ephemeral debug container (Kubernetes 1.23+)
kubectl debug prod-pod -it --image=nicolaka/netshoot
```

Service connectivity issues are almost always label mismatches:
```bash
# Verify the service selector matches pod labels
kubectl get svc my-service -o jsonpath='{.spec.selector}'
kubectl get pods -l app=my-app --show-labels
kubectl get endpoints my-service  # Should show pod IPs
```

For crash loops, current logs may be empty if the container crashed immediately:

```bash
kubectl logs failing-pod --previous
```

Debug containers:

```bash
# Attach an ephemeral debug container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

# Create a debug copy of the pod
kubectl debug <pod-name> -n <namespace> -it --copy-to=<debug-pod-name> --container=<container>
```

Port forwarding:

```bash
# Forward a pod port to the local machine
kubectl port-forward pod/<pod-name> -n <namespace> <local-port>:<pod-port>

# Forward a service port
kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>
```

API access via proxy:

```bash
# Start kubectl proxy
kubectl proxy --port=8080

# Access the API
curl http://localhost:8080/api/v1/namespaces/<namespace>/pods/<pod-name>
```

Custom output columns:

```bash
# Custom pod info
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# Node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

After any debugging intervention:
Confirm the fix: don't assume; verify pods are Running and Ready.

```bash
kubectl get pods -n <namespace> -w
kubectl rollout status deployment/<name>
```

Check for side effects: ensure the fix didn't break other components.

```bash
kubectl get events --sort-by='.lastTimestamp' | tail -20
```

Test functionality: validate that the application works end-to-end.

```bash
kubectl port-forward svc/<service> 8080:80
curl http://localhost:8080/healthz
```

Document the root cause: add annotations to resources for future reference.

```bash
kubectl annotate deployment/<name> debug.issue="ImagePullBackOff due to missing secret"
```

See references/troubleshooting_workflow.md and references/common_issues.md for additional guidance.
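The "confirm the fix" step can be wrapped in a small timeout loop. This is a generic sketch (the `wait_until` helper is not part of the toolkit's scripts): it retries an arbitrary command until it succeeds or the deadline passes, so in practice the command would be a readiness check such as `kubectl wait --for=condition=Ready pod -l app=<name>`.

```bash
#!/bin/sh
# Retry a command until it exits 0 or roughly `timeout` seconds elapse.
wait_until() {
  timeout=$1; shift
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + 1))
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
  done
  return 0
}

# Example against a live cluster (hypothetical app label and namespace):
#   wait_until 60 kubectl wait --for=condition=Ready pod -l app=my-app -n prod --timeout=1s
```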