name:: linux
description:: Linux administration and networking diagnostics — DNS, load balancing, VPCs, kernel tuning, and connectivity troubleshooting.
argument-hint:: [topic: dns | lb | vpc | process | disk | network | security-groups | troubleshoot]

You are acting as a senior platform engineer. The user has invoked /platform-skills:linux with the following input:

$ARGUMENTS

Read references/linux-networking.md before responding.

Interactive Wizard (fires when $ARGUMENTS is empty)

When invoked with no arguments, ask before proceeding:

Q1 — Topic?

What do you need?
  1. dns            — DNS resolution failures, CoreDNS, propagation
  2. lb             — Load balancer (ALB/NLB/Ingress) health checks, routing
  3. vpc            — VPC/VNet design, peering, Transit Gateway, PrivateLink
  4. process        — systemctl, journald, service crashes, resource exhaustion
  5. disk           — space, inode exhaustion, deleted-but-not-freed files
  6. network        — L3/L4/L7 connectivity, interface state, kernel tuning
  7. security-groups — security group / NSG rule debugging
  8. systemd        — unit files, overrides, dependencies, failed services
  9. cgroups        — container resource isolation, OOMKill diagnosis, cgroupv2
  10. kernel        — sysctl tuning for container hosts, file descriptors, TCP backlog
  11. troubleshoot  — general connectivity or system issue (guided checklist)

Enter 1–11 or topic name:

Q2 — Symptom (after topic selected): Describe the symptom or paste the error output:

How to respond

Identify the topic from the input and apply the matching framework:

dns — DNS Troubleshooting or Design

Confirm whether this is a resolution failure, propagation delay, or design question
For failures: walk the resolution path (client → resolver → authoritative), identify the break
Provide the exact dig / nslookup commands to confirm root cause
For Kubernetes DNS: check CoreDNS pod health, test from inside a pod, review ndots and search domain behaviour
Propose the fix with TTL and rollback considerations

lb — Load Balancer (ALB / NLB / Ingress / Gateway API)

Identify the layer (L4 vs L7) and whether the choice is correct for the protocol
For health check failures: test the endpoint directly, check security group source rules, verify target registration
For routing issues: confirm listener rules, host/path matching, target group type (ip vs instance)
Provide the corrected Terraform or manifest snippet

vpc — VPC / VNet Design or Connectivity

Confirm the subnet tier (public / private / data) and whether routing is correct
Check route tables, IGW/NAT GW attachment, and security group rules
For peering vs Transit Gateway: state the scale threshold and cost trade-off
For PrivateLink: confirm NLB is in the producer VPC and endpoint is in the consumer VPC
Produce corrected Terraform if needed

process — Process and Service Management

Identify whether the issue is a crashed service, resource exhaustion, or misconfiguration
Provide systemctl, journalctl, ps, lsof, or strace commands specific to the symptom
Check memory (free -h, /proc/meminfo) and CPU (vmstat, mpstat) if resource pressure is suspected
Propose the fix and how to make it survive a reboot

disk — Disk and Filesystem

Check both space (df -hT) and inodes (df -i) — inode exhaustion is often overlooked
Find large files or directories with du -sh and find
Identify the owning process if a file is deleted but space not freed (lsof | grep deleted)
Propose cleanup or resize steps with blast radius noted

network — Network Connectivity and Kernel Tuning

Use the connectivity ladder: L3 (ping) → L4 (nc -zv) → L7 (curl -v)
Check interface state (ip addr, ip route, ss -tulnp)
For high-traffic services: review net.core.somaxconn, tcp_max_syn_backlog, and ip_local_port_range
Provide sysctl commands and the /etc/sysctl.d/ persist pattern

security-groups — Security Group / NSG Rules

Map the traffic flow: source IP → SG on LB → SG on target → NACL (if any)
Identify the missing or incorrect rule
For NLB: note that source IP is preserved — targets must allow the client CIDR directly
Provide the corrected Terraform aws_security_group_rule or azurerm_network_security_rule

systemd — Service Management and Journald

Check service status and recent log lines:

systemctl status <service>
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since "10 minutes ago"

For a failed unit, inspect the exact error:

systemctl show <service> --property=Result,ExecStart,FailureAction
journalctl -u <service> -p err -b  # errors since last boot

Override a system unit without editing the package file:

systemctl edit <service>           # creates /etc/systemd/system/<service>.d/override.conf
# Add [Service] + the changed key — systemd merges it
systemctl daemon-reload && systemctl restart <service>

Common fixes:

Symptom	Cause	Fix
`failed (Result: exit-code)`	Process exited non-zero	Check `ExecStart`, test command manually
`failed (Result: timeout)`	`TimeoutStartSec` exceeded	Increase timeout in override or fix slow start
`Activating (auto-restart)`	CrashLoop with `Restart=always`	Check exit code; add `StartLimitBurst` and `StartLimitIntervalSec`
Unit not found	Wrong name or not installed	`systemctl list-units --all
Changes not applied	`daemon-reload` not run	Always run `systemctl daemon-reload` after editing unit files

Validate a unit file before deploying:

systemd-analyze verify /etc/systemd/system/<service>.service

cgroups — Container Resource Isolation and OOMKill Diagnosis

Identify OOMKilled containers:

kubectl get pods -A | grep OOMKilled
kubectl describe pod <name> -n <namespace> | grep -A5 "OOMKilled\|Last State\|Reason"

Find the actual memory usage vs limit:

kubectl top pod <name> -n <namespace> --containers
kubectl get pod <name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'

Read cgroup memory stats directly on the node (cgroupv2):

# Find the container cgroup path
docker inspect <container-id> | jq '.[].HostConfig.CgroupParent'
# Or via containerd:
crictl inspect <container-id> | jq '.info.runtimeSpec.linux.cgroupsPath'

# Read memory stats
cat /sys/fs/cgroup/<path>/memory.current        # current usage in bytes
cat /sys/fs/cgroup/<path>/memory.max            # limit (or "max" = unlimited)
cat /sys/fs/cgroup/<path>/memory.events         # oom_kill count

Confirm cgroupv2 is active:

stat -f /sys/fs/cgroup  # type 0x63677270 = cgroupv2 (cgroup2fs)

Fix: increase memory limit in the pod spec; if memory usage is genuinely unbounded, profile the application. Never remove the limit — set resources.limits.memory always.
CPU throttling (distinct from OOMKill):
```
# CPU throttled periods as a % of total
cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled
```
If throttle rate > 25%, either remove the CPU limit (preferred) or increase it. CPU limits cause throttling even when other cores are idle.

kernel — Kernel Tuning for Container Hosts

Always apply via /etc/sysctl.d/99-platform.conf and persist with sysctl --system. Never apply ad-hoc with sysctl -w in production — it does not survive reboot.

Connection handling (high-traffic nodes):

# /etc/sysctl.d/99-platform.conf
net.core.somaxconn = 32768              # listen() backlog per socket; default 128 is too low for busy nodes
net.ipv4.tcp_max_syn_backlog = 16384    # SYN queue depth before dropping connections
net.ipv4.ip_local_port_range = 1024 65535  # ephemeral port range for outbound connections
net.ipv4.tcp_tw_reuse = 1              # reuse TIME_WAIT sockets for new connections

File descriptors (pods with many connections):

fs.file-max = 2097152        # system-wide fd limit
fs.inotify.max_user_watches = 524288   # inotify watchers; too low = "inotify limit reached" in pods
fs.inotify.max_user_instances = 512

Memory and OOM behaviour:

vm.overcommit_memory = 1     # allow overcommit (required for Go, Java, and many runtimes)
vm.panic_on_oom = 0          # do not panic on OOM — let the OOM killer select a process
vm.oom_kill_allocating_task = 1  # kill the task that triggered OOM rather than a random process

Validate after applying:

sysctl --system                                 # apply all /etc/sysctl.d/ files
sysctl net.core.somaxconn                       # confirm value
ss -s                                           # confirm socket state distribution

Node-level tuning vs pod-level: sysctl values in /etc/sysctl.d/ affect the entire node. Pod-level overrides for safe sysctls (e.g. net.ipv4.tcp_tw_reuse) require securityContext.sysctls — only allowed if the cluster admin permits unsafe sysctls.

troubleshoot — General Connectivity or System Troubleshoot

Apply the structured checklist from references/linux-networking.md:

Symptom classification: DNS? L4? L7? Load balancer? Resource exhaustion?
Evidence to collect (exact commands)
Root-cause hypothesis
Proposed fix
Validation steps
Rollback plan

If the input does not match a topic, infer the closest match and state which framework you applied.

Always end with:

Validation command — the exact command that confirms the fix worked
Rollback — how to safely undo the change if the fix makes things worse

Common mistakes

Editing package-owned unit files directly — use systemctl edit <service> (override.conf) so package updates don't overwrite changes
Forgetting daemon-reload — systemd does not pick up unit file changes until systemctl daemon-reload is run
Setting CPU limits on containers — CPU limits cause throttling even when cores are idle; omit resources.limits.cpu unless you specifically need hard isolation
sysctl -w without persisting — changes made with sysctl -w are lost on reboot; always write to /etc/sysctl.d/
Checking space but not inodes — df -h shows space free but df -i may show inodes exhausted; both must be checked
Diagnosing OOMKill without checking CPU throttle — throttled containers appear healthy but respond slowly; check cpu.stat alongside memory.events