Comprehensive Kubernetes toolkit for YAML generation, validation, and cluster debugging
Mental Model: Kubernetes debugging follows a layered approach—start broad (cluster health), narrow to affected components (pods, services), then drill into specific failures (logs, events, resource constraints).
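The broad first pass of that layered approach can be sketched offline. The filter below runs over text captured from `kubectl get pods -A` (the pod names and namespaces here are made up for illustration), surfacing anything whose STATUS is neither Running nor Completed:

```bash
#!/bin/sh
# Sample text in the shape `kubectl get pods -A` prints
# (hypothetical pods, for illustration only).
pods='NAMESPACE   NAME        READY   STATUS             RESTARTS   AGE
prod        api-6f9c    1/1     Running            0          3d
prod        worker-x1   0/1     CrashLoopBackOff   12         3d
kube-system coredns-a   1/1     Running            0          9d
batch       job-77      0/1     Completed          0          1h'

# Keep the header plus any pod whose STATUS column ($4) is suspect.
suspects=$(printf '%s\n' "$pods" | awk 'NR == 1 || ($4 != "Running" && $4 != "Completed")')
printf '%s\n' "$suspects"
```

Against a live cluster the same idea is one command: `kubectl get pods -A --field-selector=status.phase!=Running`.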
Decision Framework:
Start with `kubectl get events` and `kubectl describe` on the affected resource first.

When to use this skill: debugging failing pods, broken service connectivity, resource pressure, or overall cluster health.

Pod failures:

```bash
# Quick assessment
kubectl get pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>  # Events section is key

# Detailed diagnostics
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>

# Check previous logs if CrashLoopBackOff
kubectl logs <pod-name> -n <namespace> --previous
```

Service connectivity:

```bash
# Verify service endpoints match pod IPs
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

# Network diagnostics
./scripts/network_debug.sh <namespace> <pod-name>

# Test from a debug pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
```

Resource pressure:

```bash
# Resource usage by container
kubectl top pods -n <namespace> --containers

# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 lastState

# Review recent logs
kubectl logs <pod-name> -n <namespace> --tail=100 --timestamps
```

Cluster health:

```bash
# Comprehensive health report
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
```

Focus on non-obvious flags and patterns most useful during debugging:
```bash
# Cross-namespace pod overview
kubectl get pods -A -o wide --field-selector=status.phase!=Running

# Previous container logs (post-crash)
kubectl logs <pod-name> -n <namespace> --previous

# Multi-container pod: target a specific container
kubectl logs <pod-name> -n <namespace> -c <container>

# Stream logs with timestamps
kubectl logs <pod-name> -n <namespace> -f --timestamps

# Describe for the Events section, the most useful first stop
kubectl describe pod <pod-name> -n <namespace>

# Full pod YAML including status conditions
kubectl get pod <pod-name> -n <namespace> -o yaml

# Confirm endpoint IPs match running pod IPs (a label selector mismatch shows an empty list)
kubectl get endpoints <service-name> -n <namespace>

# Test DNS from within the cluster
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default

# Sort events by time to find recent failures
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Per-container resource usage (reveals which container is the culprit)
kubectl top pod <pod-name> -n <namespace> --containers

# Resource quota consumption vs. limits
kubectl describe resourcequota -n <namespace>
```

⚠️ These commands are destructive or disruptive. Follow the verification steps before and after each operation.
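One way to honor that warning is to gate the disruptive command behind a check. The sketch below is a hypothetical `safe_restart` helper, not part of this toolkit's scripts; `rollout_ok` inspects the text that `kubectl rollout status` prints (assumed to contain "successfully rolled out" on success), taking it as an argument so the check itself can be exercised without a cluster:

```bash
#!/bin/sh
# Return success only when the rollout-status text reports completion.
rollout_ok() {
  printf '%s\n' "$1" | grep -q 'successfully rolled out'
}

# Hypothetical wrapper: verify state, restart, then verify again.
safe_restart() {
  name=$1 ns=$2
  status=$(kubectl rollout status "deployment/$name" -n "$ns" 2>&1)
  if ! rollout_ok "$status"; then
    echo "refusing to restart: deployment/$name is not in a clean state" >&2
    return 1
  fi
  kubectl rollout restart "deployment/$name" -n "$ns" &&
    kubectl rollout status "deployment/$name" -n "$ns" --timeout=120s
}
```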
Restarting a deployment:

```bash
# Verify current rollout state first
kubectl rollout status deployment/<name> -n <namespace>

# Restart
kubectl rollout restart deployment/<name> -n <namespace>

# Verify the rollout completes successfully
kubectl rollout status deployment/<name> -n <namespace> --timeout=120s
```

Rolling back a deployment:

```bash
# Check rollout history to pick the correct revision
kubectl rollout history deployment/<name> -n <namespace>

# Rollback to the previous revision
kubectl rollout undo deployment/<name> -n <namespace>

# Confirm rollback success and pods are running
kubectl rollout status deployment/<name> -n <namespace>
kubectl get pods -n <namespace> -l app=<name>
```

Force-deleting a stuck pod:

```bash
# Confirm the pod is genuinely stuck (not just slow to terminate)
kubectl get pod <pod-name> -n <namespace> -w  # Watch for 60s before proceeding

# Force delete only if the pod remains Terminating with no progress
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0

# Verify the pod is gone and not rescheduled in an error state
kubectl get pod <pod-name> -n <namespace>
```

Draining a node:

```bash
# Review what will be evicted before draining
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Cordon first to prevent new scheduling
kubectl cordon <node-name>

# Drain (evicts pods gracefully)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Verify the node is drained and no workloads remain
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# After maintenance, uncordon to restore scheduling
kubectl uncordon <node-name>
kubectl get node <node-name>  # Confirm status returns to Ready
```

Common mistakes, with better alternatives:

BAD:
```bash
# Immediately restarting without investigation
kubectl rollout restart deployment/app
```

GOOD:

```bash
# Gather diagnostic context first
kubectl describe deployment/app
kubectl get events --sort-by='.lastTimestamp' | tail -20
kubectl logs -l app=app --tail=50
# THEN decide whether a restart is appropriate
```

BAD:
```bash
# Assuming the default namespace
kubectl get pods
kubectl logs my-pod
```

GOOD:

```bash
# Always specify the namespace explicitly
kubectl get pods -n production
kubectl logs my-pod -n production

# Or use -A to search all namespaces
kubectl get pods -A | grep my-pod
```

BAD:
```bash
# Immediate force delete
kubectl delete pod stuck-pod --force --grace-period=0
```

GOOD:

```bash
# Investigate why the pod is stuck first
kubectl describe pod stuck-pod
kubectl get pod stuck-pod -o yaml | grep -A 10 finalizers

# Try a normal delete first
kubectl delete pod stuck-pod

# Wait 60s and watch for termination;
# force delete only once the pod is confirmed stuck
```

BAD:
```bash
# Exec into a production pod and make changes
kubectl exec -it prod-pod -- /bin/bash
```

GOOD:

```bash
# Create a debug copy for investigation
kubectl debug prod-pod --copy-to=debug-pod --share-processes

# Or use an ephemeral debug container (Kubernetes 1.23+)
kubectl debug prod-pod -it --image=nicolaka/netshoot
```

Service connectivity issues are almost always label mismatches:
```bash
# Verify the service selector matches pod labels
kubectl get svc my-service -o jsonpath='{.spec.selector}'
kubectl get pods -l app=my-app --show-labels
kubectl get endpoints my-service  # Should show pod IPs
```

For crash loops, current logs may be empty if the container crashed immediately:

```bash
kubectl logs failing-pod --previous
```

Debug containers:

```bash
# Attach an ephemeral debug container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot

# Create a debug copy of the pod
kubectl debug <pod-name> -n <namespace> -it --copy-to=<debug-pod-name> --container=<container>
```

Port forwarding:

```bash
# Forward a pod port to the local machine
kubectl port-forward pod/<pod-name> -n <namespace> <local-port>:<pod-port>

# Forward a service port
kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>
```

API access via proxy:

```bash
# Start kubectl proxy
kubectl proxy --port=8080

# Access the API
curl http://localhost:8080/api/v1/namespaces/<namespace>/pods/<pod-name>
```

Custom output columns:

```bash
# Custom pod info
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP

# Node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```

After any debugging intervention:
Confirm the fix: don't assume; verify pods are Running and Ready.

```bash
kubectl get pods -n <namespace> -w
kubectl rollout status deployment/<name>
```

Check for side effects: ensure the fix didn't break other components.

```bash
kubectl get events --sort-by='.lastTimestamp' | tail -20
```

Test functionality: validate that the application works end-to-end.

```bash
kubectl port-forward svc/<service> 8080:80
curl http://localhost:8080/healthz
```

Document the root cause: add annotations to resources for future reference.

```bash
kubectl annotate deployment/<name> debug.issue="ImagePullBackOff due to missing secret"
```

See references/troubleshooting_workflow.md and references/common_issues.md for additional guidance.
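The "confirm the fix" step can be wrapped in a small timeout loop. This is a generic sketch (the `wait_until` helper is not part of the toolkit's scripts): it retries an arbitrary command until it succeeds or the deadline passes, so in practice the command would be a readiness check such as `kubectl wait --for=condition=Ready pod -l app=<name>`.

```bash
#!/bin/sh
# Retry a command until it exits 0 or roughly `timeout` seconds elapse.
wait_until() {
  timeout=$1; shift
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + 1))
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
  done
  return 0
}

# Example against a live cluster (hypothetical app label and namespace):
#   wait_until 60 kubectl wait --for=condition=Ready pod -l app=my-app -n prod --timeout=1s
```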