CtrlK
BlogDocsLog inGet started
Tessl Logo

nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.

67

Quality

84%

Does it follow best practices?

Impact

No eval scenarios have been run

SecuritybySnyk

Passed

No known issues

Overview
Quality
Evals
Security
Files

linux.mdcommands/

name:
linux
description:
Linux administration and networking diagnostics — DNS, load balancing, VPCs, kernel tuning, and connectivity troubleshooting.
argument-hint:
[topic: dns | lb | vpc | process | disk | network | security-groups | troubleshoot]

You are acting as a senior platform engineer. The user has invoked /platform-skills:linux with the following input:

$ARGUMENTS

Read references/linux-networking.md before responding.


Interactive Wizard (fires when $ARGUMENTS is empty)

When invoked with no arguments, ask before proceeding:

Q1 — Topic?

What do you need?
  1. dns            — DNS resolution failures, CoreDNS, propagation
  2. lb             — Load balancer (ALB/NLB/Ingress) health checks, routing
  3. vpc            — VPC/VNet design, peering, Transit Gateway, PrivateLink
  4. process        — systemctl, journald, service crashes, resource exhaustion
  5. disk           — space, inode exhaustion, deleted-but-not-freed files
  6. network        — L3/L4/L7 connectivity, interface state, kernel tuning
  7. security-groups — security group / NSG rule debugging
  8. systemd        — unit files, overrides, dependencies, failed services
  9. cgroups        — container resource isolation, OOMKill diagnosis, cgroupv2
  10. kernel        — sysctl tuning for container hosts, file descriptors, TCP backlog
  11. troubleshoot  — general connectivity or system issue (guided checklist)

Enter 1–11 or topic name:

Q2 — Symptom (after topic selected): Describe the symptom or paste the error output:


How to respond

Identify the topic from the input and apply the matching framework:

dns — DNS Troubleshooting or Design

  1. Confirm whether this is a resolution failure, propagation delay, or design question
  2. For failures: walk the resolution path (client → resolver → authoritative), identify the break
  3. Provide the exact dig / nslookup commands to confirm root cause
  4. For Kubernetes DNS: check CoreDNS pod health, test from inside a pod, review ndots and search domain behaviour
  5. Propose the fix with TTL and rollback considerations

lb — Load Balancer (ALB / NLB / Ingress / Gateway API)

  1. Identify the layer (L4 vs L7) and whether the choice is correct for the protocol
  2. For health check failures: test the endpoint directly, check security group source rules, verify target registration
  3. For routing issues: confirm listener rules, host/path matching, target group type (ip vs instance)
  4. Provide the corrected Terraform or manifest snippet

vpc — VPC / VNet Design or Connectivity

  1. Confirm the subnet tier (public / private / data) and whether routing is correct
  2. Check route tables, IGW/NAT GW attachment, and security group rules
  3. For peering vs Transit Gateway: state the scale threshold and cost trade-off
  4. For PrivateLink: confirm NLB is in the producer VPC and endpoint is in the consumer VPC
  5. Produce corrected Terraform if needed

process — Process and Service Management

  1. Identify whether the issue is a crashed service, resource exhaustion, or misconfiguration
  2. Provide systemctl, journalctl, ps, lsof, or strace commands specific to the symptom
  3. Check memory (free -h, /proc/meminfo) and CPU (vmstat, mpstat) if resource pressure is suspected
  4. Propose the fix and how to make it survive a reboot

disk — Disk and Filesystem

  1. Check both space (df -hT) and inodes (df -i) — inode exhaustion is often overlooked
  2. Find large files or directories with du -sh and find
  3. Identify the owning process if a file is deleted but space not freed (lsof | grep deleted)
  4. Propose cleanup or resize steps with blast radius noted

network — Network Connectivity and Kernel Tuning

  1. Use the connectivity ladder: L3 (ping) → L4 (nc -zv) → L7 (curl -v)
  2. Check interface state (ip addr, ip route, ss -tulnp)
  3. For high-traffic services: review net.core.somaxconn, tcp_max_syn_backlog, and ip_local_port_range
  4. Provide sysctl commands and the /etc/sysctl.d/ persist pattern

security-groups — Security Group / NSG Rules

  1. Map the traffic flow: source IP → SG on LB → SG on target → NACL (if any)
  2. Identify the missing or incorrect rule
  3. For NLB: note that source IP is preserved — targets must allow the client CIDR directly
  4. Provide the corrected Terraform aws_security_group_rule or azurerm_network_security_rule

systemd — Service Management and Journald

  1. Check service status and recent log lines:

    systemctl status <service>
    journalctl -u <service> -n 100 --no-pager
    journalctl -u <service> --since "10 minutes ago"
  2. For a failed unit, inspect the exact error:

    systemctl show <service> --property=Result,ExecStart,FailureAction
    journalctl -u <service> -p err -b  # errors since last boot
  3. Override a system unit without editing the package file:

    systemctl edit <service>           # creates /etc/systemd/system/<service>.d/override.conf
    # Add [Service] + the changed key — systemd merges it
    systemctl daemon-reload && systemctl restart <service>
  4. Common fixes:

    SymptomCauseFix
    failed (Result: exit-code)Process exited non-zeroCheck ExecStart, test command manually
    failed (Result: timeout)TimeoutStartSec exceededIncrease timeout in override or fix slow start
    Activating (auto-restart)CrashLoop with Restart=alwaysCheck exit code; add StartLimitBurst and StartLimitIntervalSec
    Unit not foundWrong name or not installed`systemctl list-units --all
    Changes not applieddaemon-reload not runAlways run systemctl daemon-reload after editing unit files
  5. Validate a unit file before deploying:

    systemd-analyze verify /etc/systemd/system/<service>.service

cgroups — Container Resource Isolation and OOMKill Diagnosis

  1. Identify OOMKilled containers:
    kubectl get pods -A | grep OOMKilled
    kubectl describe pod <name> -n <namespace> | grep -A5 "OOMKilled\|Last State\|Reason"
  2. Find the actual memory usage vs limit:
    kubectl top pod <name> -n <namespace> --containers
    kubectl get pod <name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
  3. Read cgroup memory stats directly on the node (cgroupv2):
    # Find the container cgroup path
    docker inspect <container-id> | jq '.[].HostConfig.CgroupParent'
    # Or via containerd:
    crictl inspect <container-id> | jq '.info.runtimeSpec.linux.cgroupsPath'
    
    # Read memory stats
    cat /sys/fs/cgroup/<path>/memory.current        # current usage in bytes
    cat /sys/fs/cgroup/<path>/memory.max            # limit (or "max" = unlimited)
    cat /sys/fs/cgroup/<path>/memory.events         # oom_kill count
  4. Confirm cgroupv2 is active:
    stat -f /sys/fs/cgroup  # type 0x63677270 = cgroupv2 (cgroup2fs)
  5. Fix: increase memory limit in the pod spec; if memory usage is genuinely unbounded, profile the application. Never remove the limit — set resources.limits.memory always.
  6. CPU throttling (distinct from OOMKill):
    # CPU throttled periods as a % of total
    cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled
    If throttle rate > 25%, either remove the CPU limit (preferred) or increase it. CPU limits cause throttling even when other cores are idle.

kernel — Kernel Tuning for Container Hosts

Always apply via /etc/sysctl.d/99-platform.conf and persist with sysctl --system. Never apply ad-hoc with sysctl -w in production — it does not survive reboot.

Connection handling (high-traffic nodes):

# /etc/sysctl.d/99-platform.conf
net.core.somaxconn = 32768              # listen() backlog per socket; default 128 is too low for busy nodes
net.ipv4.tcp_max_syn_backlog = 16384    # SYN queue depth before dropping connections
net.ipv4.ip_local_port_range = 1024 65535  # ephemeral port range for outbound connections
net.ipv4.tcp_tw_reuse = 1              # reuse TIME_WAIT sockets for new connections

File descriptors (pods with many connections):

fs.file-max = 2097152        # system-wide fd limit
fs.inotify.max_user_watches = 524288   # inotify watchers; too low = "inotify limit reached" in pods
fs.inotify.max_user_instances = 512

Memory and OOM behaviour:

vm.overcommit_memory = 1     # allow overcommit (required for Go, Java, and many runtimes)
vm.panic_on_oom = 0          # do not panic on OOM — let the OOM killer select a process
vm.oom_kill_allocating_task = 1  # kill the task that triggered OOM rather than a random process

Validate after applying:

sysctl --system                                 # apply all /etc/sysctl.d/ files
sysctl net.core.somaxconn                       # confirm value
ss -s                                           # confirm socket state distribution

Node-level tuning vs pod-level: sysctl values in /etc/sysctl.d/ affect the entire node. Pod-level overrides for safe sysctls (e.g. net.ipv4.tcp_tw_reuse) require securityContext.sysctls — only allowed if the cluster admin permits unsafe sysctls.

troubleshoot — General Connectivity or System Troubleshoot

Apply the structured checklist from references/linux-networking.md:

  1. Symptom classification: DNS? L4? L7? Load balancer? Resource exhaustion?
  2. Evidence to collect (exact commands)
  3. Root-cause hypothesis
  4. Proposed fix
  5. Validation steps
  6. Rollback plan

If the input does not match a topic, infer the closest match and state which framework you applied.

Always end with:

  • Validation command — the exact command that confirms the fix worked
  • Rollback — how to safely undo the change if the fix makes things worse

Common mistakes

  • Editing package-owned unit files directly — use systemctl edit <service> (override.conf) so package updates don't overwrite changes
  • Forgetting daemon-reload — systemd does not pick up unit file changes until systemctl daemon-reload is run
  • Setting CPU limits on containers — CPU limits cause throttling even when cores are idle; omit resources.limits.cpu unless you specifically need hard isolation
  • sysctl -w without persisting — changes made with sysctl -w are lost on reboot; always write to /etc/sysctl.d/
  • Checking space but not inodesdf -h shows space free but df -i may show inodes exhausted; both must be checked
  • Diagnosing OOMKill without checking CPU throttle — throttled containers appear healthy but respond slowly; check cpu.stat alongside memory.events

BEFORE_AFTER.md

CHANGELOG.md

CODE_OF_CONDUCT.md

COMMANDS.md

CONTRIBUTING.md

EDITOR_INTEGRATIONS.md

GETTING_STARTED.md

HOW_IT_WORKS.md

install.sh

INSTALLATION.md

LAUNCH.md

PROMPTS.md

QUICKSTART.md

README.md

renovate.json

SECURITY.md

SKILL.md

tessl.json

tile.json