Operate the joelclaw Kubernetes cluster — Talos Linux on Colima (Mac Mini). Deploy services, check health, debug pods, recover from restarts, add ports, manage Helm releases, inspect logs, fix networking. Triggers on: 'kubectl', 'pods', 'deploy to k8s', 'cluster health', 'restart pod', 'helm install', 'talosctl', 'colima', 'nodeport', 'flannel', 'port mapping', 'k8s down', 'cluster not working', 'add a port', 'PVC', 'storage', any k8s/Talos/Colima infrastructure task. Also triggers on service-specific deploy: 'deploy redis', 'redeploy inngest', 'livekit helm', 'pds not responding'.
Mac Mini (localhost ports)
└─ Lima SSH mux (~/.colima/_lima/colima/ssh.sock) ← NEVER KILL
   └─ Colima VM (8 CPU, 16 GiB, 100 GiB, VZ framework, aarch64)
      └─ Docker 29.x + buildx (joelclaw-builder, docker-container driver)
         └─ Talos v1.12.4 container (joelclaw-controlplane-1)
            └─ k8s v1.35.0 (single node, Flannel CNI)
               └─ joelclaw namespace (privileged PSA)

⚠️ Talos has NO shell. No bash, no /bin/sh, nothing. You cannot `docker exec` into the Talos container. Use `talosctl` for node operations and the Colima VM (`ssh lima-colima`) for host-level operations like `modprobe`.
| Setting | Value | Reason |
|---|---|---|
| CPU | 8 | Match k8s workload requests (~2.8 CPU, 72%) |
| Memory | 16 GiB | 32GB causes macOS memory pressure → VM kill |
| nestedVirtualization | OFF by default | Crashes VM under load (image builds, heavy scheduling). Toggle ON only for Firecracker testing |
| vmType | vz | Required for Apple Silicon |
| mountType | virtiofs | Fastest option with VZ |
nestedVirtualization: true is unstable on M4 Pro under load. It causes the Colima VM to silently crash during Docker builds/pushes. Each crash:
- the docker CLI on macOS disconnects

Recovery from a Colima crash-loop:
- `colima stop && colima start` — basic restart
- `redis-check-aof --fix` (see Redis AOF Recovery below)
- `ssh -L /tmp/docker.sock:/var/run/docker.sock` — re-tunnel the Docker socket

Docker image builds should use the buildx container builder (`docker buildx build --builder joelclaw-builder`) to isolate build IO from k8s workloads.
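The recovery choice above can be sketched as a small decision function. This is a hypothetical helper, not part of any shipped script — the probe results are passed in as arguments, where the real checks would be `colima status` and `docker info` over the Lima SSH config:

```shell
# Hypothetical triage helper: decide the recovery action from two probe
# results. status_ok = did `colima status` report Running; docker_ok = did
# `docker info` succeed over the Lima SSH config.
triage() {
  local status_ok="$1" docker_ok="$2"
  if [ "$status_ok" = yes ] && [ "$docker_ok" = no ]; then
    echo "zombie VM: colima restart"
  elif [ "$status_ok" = no ]; then
    echo "VM down: colima stop && colima start"
  else
    echo "healthy"
  fi
}

triage yes no   # → zombie VM: colima restart
triage no no    # → VM down: colima stop && colima start
```

This mirrors the zombie-VM gotcha listed later: `colima status` passing while the docker socket is dead means only `colima restart` recovers.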
If Redis crash-loops after a VM restart with Bad file format reading the append only file:
# 1. Scale down Redis (or use a temp pod if StatefulSet can't mount PVC concurrently)
kubectl -n joelclaw apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: redis-fix
  namespace: joelclaw
spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
  containers:
    - name: fix
      image: redis:7-alpine
      command: ["sh", "-c", "cd /data/appendonlydir && echo y | redis-check-aof --fix *.incr.aof && redis-check-aof *.incr.aof"]
      volumeMounts:
        - name: data
          mountPath: /data
  restartPolicy: Never
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-redis-0
EOF
# 2. Wait, check logs, then clean up
kubectl -n joelclaw logs redis-fix
kubectl -n joelclaw delete pod redis-fix --force
# 3. Restart Redis
kubectl -n joelclaw delete pod redis-0

For port mappings, recovery procedures, and cluster recreation steps, read references/operations.md.
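The corruption signature can be spotted in pod logs before applying the fix. A sketch with the log line hardcoded for illustration — a live check would read `kubectl -n joelclaw logs redis-0` instead:

```shell
# Detect the AOF corruption signature. Sample log line is illustrative;
# live input would come from the redis-0 pod logs.
logline='# Bad file format reading the append only file appendonly.aof.1.incr.aof'
if printf '%s\n' "$logline" | grep -q 'Bad file format reading the append only file'; then
  echo "AOF corrupt"   # → AOF corrupt
fi
```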
Docker port mappings for k8s API (6443) and Talos API (50000) are not pinned — they use random host ports assigned at container creation. All service ports (3111, 8288, 6379, etc.) ARE pinned 1:1.
When the Colima VM or Talos container restarts, Docker may reassign different random ports for 6443/50000. Kubeconfig goes stale, kubectl fails, and everything that depends on it (joelclaw CLI, health checks, pod inspection) breaks silently.
Symptoms: kubectl returns tls: internal error or connection refused. All pods are actually running — only the kubeconfig routing is wrong.
Fix:
# 1. Regenerate kubeconfig from talosctl (which has the correct port)
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force
# 2. Switch to the new context
kubectl config use-context "$(kubectl config get-contexts -o name | grep joelclaw | head -1)"
# 3. Clean stale contexts (optional)
kubectl config delete-context admin@joelclaw  # if stale entry exists

Self-heal: health.sh now auto-detects and fixes this before running checks.
Root cause: Container was created without pinning these ports. To permanently fix, recreate the container with explicit port bindings for 6443:6443 and 50000:50000. This requires cluster recreation — a bigger operation.
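Until the container is recreated with pinned ports, the currently assigned random host port can be read back from Docker. A sketch of the extraction with sample `docker port` output hardcoded — the live command would be `docker port joelclaw-controlplane-1`, and 55021 is a made-up example value:

```shell
# Extract the host port from `docker port`-style output.
sample='6443/tcp -> 0.0.0.0:55021'   # illustrative; real value varies per restart
host_port=$(printf '%s\n' "$sample" | sed -n 's/.*:\([0-9][0-9]*\)$/\1/p')
echo "$host_port"   # → 55021
```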
kubectl get pods -n joelclaw # all pods
curl -s localhost:3111/api/inngest # system-bus-worker → 200
curl -s localhost:7880/ # LiveKit → "OK"
curl -s localhost:8108/health # Typesense → {"ok":true}
curl -s localhost:8288/health # Inngest → {"status":200}
curl -s localhost:9070/deployments # Restate admin → deployments list
curl -s localhost:9627/xrpc/_health # PDS → {"version":"..."}
kubectl exec -n joelclaw redis-0 -- redis-cli ping # → PONG
joelclaw restate cron status                          # Dkron scheduler → healthy via temporary CLI tunnel

| Service | Type | Pod | Ports (Mac→NodePort) | Helm? |
|---|---|---|---|---|
| Redis | StatefulSet | redis-0 | 6379→6379 | No |
| Typesense | StatefulSet | typesense-0 | 8108→8108 | No |
| Inngest | StatefulSet | inngest-0 | 8288→8288, 8289→8289 | No |
| Restate | StatefulSet | restate-0 | 8080→8080, 9070→9070, 9071→9071 | No |
| system-bus-worker | Deployment | system-bus-worker-* | 3111→3111 | No |
| restate-worker | Deployment | restate-worker-* | in-cluster only (restate-worker:9080) | No |
| docs-api | Deployment | docs-api-* | 3838→3838 | No |
| LiveKit | Deployment | livekit-server-* | 7880→7880, 7881→7881 | Yes (livekit/livekit-server 1.9.0) |
| PDS | Deployment | bluesky-pds-* | 9627→3000 | Yes (nerkho/bluesky-pds 0.4.2) |
| MinIO | StatefulSet | minio-0 | 30900→30900, 30901→30901 | No |
| Dkron | StatefulSet | dkron-0 | in-cluster only (dkron-svc:8080) | No |
| AIStor Operator (aistor ns) | Deployments | adminjob-operator, object-store-operator | n/a | Yes (minio/aistor-operator) |
| AIStor ObjectStore (aistor ns) | StatefulSet | aistor-s3-pool-0-0 | 31000 (S3 TLS), 31001 (console) | Yes (minio/aistor-objectstore) |
- `deployment/restate-worker` is intentionally privileged and mounts `/dev/kvm` (hostPath type `""` — optional).
- PVC `firecracker-images` at `/tmp/firecracker-test` stores kernel, rootfs, and snapshot artifacts.
- When `nestedVirtualization` is OFF: `/dev/kvm` is absent and the microvm DAG handler fails, but shell/infer/noop handlers work normally.
- When `nestedVirtualization` is ON: Firecracker one-shot exec works (create workspace ext4 → write command → boot VM → guest executes → poweroff → read results).
- To reset Restate state: scale down, `rm -rf /restate-data/*`, scale back up, re-register the worker:
  `curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' -d '{"uri":"http://restate-worker:9080"}'`

⚠️ PDS port trap: Docker maps 9627→3000 (host→container). NodePort must be 3000 to match the container-side port. If set to 9627, traffic won't route.
Rule: NodePort value = Docker's container-side port, not host-side.
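A sketch of the rule using PDS's mapping as the worked example — pure string handling here; a live mapping would come from `docker ps` output:

```shell
# host:container mapping as Docker reports it for PDS
mapping='9627:3000'
container_port=${mapping#*:}   # keep everything after the colon (container side)
echo "nodePort must be $container_port"   # → nodePort must be 3000
```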
Status: local sandbox remains the default/live path; the k8s backend is now code-landed and opt-in, but still needs supervised rollout before calling it earned runtime.
The agent runner executes sandboxed story runs as isolated k8s Jobs. Jobs are created dynamically via @joelclaw/agent-execution/job-spec — no static manifests.
See k8s/agent-runner.yaml for the full specification.
Required components:
- an agent CLI in the image (claude and/or other installed CLIs)
- a /workspace working directory
- the runner entrypoint at /app/packages/agent-execution/src/job-runner.ts

Configuration via environment variables:
- `WORKFLOW_ID`, `REQUEST_ID`, `STORY_ID`, `SANDBOX_PROFILE`, `BASE_SHA`, `EXECUTION_BACKEND`, `JOB_NAME`, `JOB_NAMESPACE`
- `REPO_URL`, `REPO_BRANCH`, optional `HOST_REQUESTED_CWD`
- `AGENT_NAME`, `AGENT_MODEL`, `AGENT_VARIANT`, `AGENT_PROGRAM`
- `SESSION_ID`, `TIMEOUT_SECONDS`
- `TASK_PROMPT_B64` (base64-encoded)
- `VERIFICATION_COMMANDS_B64` (base64-encoded JSON array)
- `RESULT_CALLBACK_URL`, `RESULT_CALLBACK_TOKEN`

Expected behavior:
- decode the task from `TASK_PROMPT_B64`
- materialize `REPO_URL` / `REPO_BRANCH` at `BASE_SHA`
- execute the configured `AGENT_PROGRAM`
- emit `SandboxExecutionResult` markers to stdout and POST the same result to `/internal/agent-result`

Current truthful limit:
pi remains local-backend only for now; do not pretend the pod runner can execute pi story runs yet.

import { generateJobSpec, generateJobDeletion } from "@joelclaw/agent-execution";
// 1. Generate Job spec
const spec = generateJobSpec(request, {
runtime: {
image: "ghcr.io/joelhooks/agent-runner:latest",
imagePullPolicy: "Always",
command: ["bun", "run", "/app/packages/agent-execution/src/job-runner.ts"],
},
namespace: "joelclaw",
imagePullSecret: "ghcr-pull",
resultCallbackUrl: "http://host.docker.internal:3111/internal/agent-result",
resultCallbackToken: process.env.OTEL_EMIT_TOKEN,
});
// 2. Apply to cluster (via kubectl or k8s client library)
// 3. Job runs → Pod materializes repo, executes agent, posts SandboxExecutionResult callback
// 4. Host worker can recover the same terminal result from log markers if callback delivery fails
// 5. Job auto-deletes after TTL (default: 5 minutes)
// Cancel a running Job
const deletion = generateJobDeletion("req-xyz");
// kubectl delete job ${deletion.name} -n ${deletion.namespace}

Defaults:

- CPU: 500m request, 2 limit
- Memory: 1Gi request, 4Gi limit
- Active deadline: 1 hour
- TTL after completion: 5 minutes
- backoffLimit: 0 (no retries)

# List agent runner Jobs
kubectl get jobs -n joelclaw -l app.kubernetes.io/name=agent-runner
# Check Job status
kubectl describe job <job-name> -n joelclaw
# View logs
kubectl logs job/<job-name> -n joelclaw
# Check for stale Jobs (should be auto-deleted by TTL)
kubectl get jobs -n joelclaw

Source layout:

- Job spec generator (packages/agent-execution/src/job-spec.ts)
- Static runner manifest (k8s/agent-runner.yaml)
- Spec tests (packages/agent-execution/__tests__/job-spec.test.ts)

k8s pods can mount NAS storage over NFS via a LAN route through the Colima bridge.
k8s pod → Talos container (10.5.0.x) → Docker NAT → Colima VM
→ ip route 192.168.1.0/24 via 192.168.64.1 dev col0
→ macOS host (IP forwarding enabled) → LAN → NAS (192.168.1.163)

Root cause of prior failures: VZ framework's shared networking on eth0 doesn't properly forward LAN-bound traffic. The fix routes LAN traffic through col0 (Colima bridge → macOS host) instead.
The LAN route is set in two places for reliability:
- provisioning script in the Colima config (~/.colima/default/colima.yaml) — runs on colima start (cold boot)
- tunnel script (~/.local/bin/colima-tunnel) — runs on tunnel restart (covers warm resume)

Both execute: `ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0`
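Presence of the route can be verified by grepping the VM's routing table. A sketch with sample `ip route` output hardcoded (the default-route line is illustrative); the live check is `colima ssh -- ip route | grep 192.168.1.0`:

```shell
# Sample routing table; only the second line matters for NAS reachability.
routes='default via 192.168.5.2 dev eth0
192.168.1.0/24 via 192.168.64.1 dev col0'
if printf '%s\n' "$routes" | grep -q '^192.168.1.0/24 via 192.168.64.1 dev col0'; then
  echo "LAN route present"   # → LAN route present
fi
```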
| PV | NFS Path | Capacity | Access | Use |
|---|---|---|---|---|
| nas-nvme | 192.168.1.163:/volume2/data | 1.5TB | RWX | NVMe RAID1: backups, snapshots, models, sessions |
| nas-hdd | 192.168.1.163:/volume1/joelclaw | 50TB | RWX | HDD RAID5: books, docs-artifacts, archives, otel |
| minio-nfs-pv | 192.168.1.163:/volume1/joelclaw | 1TB | RWO | HDD tier: MinIO object storage (same export) |
volumes:
  - name: nas
    persistentVolumeClaim:
      claimName: nas-nvme
containers:
  - volumeMounts:
      - name: nas
        mountPath: /nas
        # Optional: subPath for a specific dir
        subPath: typesense

Mount notes:

- PVs use nfsvers=3,tcp,resvport,noatime mount options. NFSv4 has issues with Asustor ADM.
- Use the soft mount option — it returns errors instead of hanging pods.
- Verify the LAN route: `colima ssh -- ip route | grep 192.168.1.0`

# From Colima VM
colima ssh -- timeout 2 bash -c "echo > /dev/tcp/192.168.1.163/2049" && echo "NFS OK"
# From k8s pod
kubectl run nfs-test --image=busybox --restart=Never -n joelclaw \
--overrides='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"t","image":"busybox","command":["sh","-c","ls /nas && echo OK"],"volumeMounts":[{"name":"n","mountPath":"/nas"}]}],"volumes":[{"name":"n","persistentVolumeClaim":{"claimName":"nas-nvme"}}]}}'
kubectl logs nfs-test -n joelclaw && kubectl delete pod nfs-test -n joelclaw --force

# Manifests (redis, typesense, inngest, dkron)
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/
# Restate runtime
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/restate.yaml
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/firecracker-pvc.yaml
kubectl rollout status statefulset/restate -n joelclaw
~/Code/joelhooks/joelclaw/k8s/publish-restate-worker.sh
curl -fsS http://localhost:9070/deployments
# Dkron phase-1 scheduler (ClusterIP API + CLI-managed short-lived tunnel access)
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/dkron.yaml
kubectl rollout status statefulset/dkron -n joelclaw
joelclaw restate cron status
joelclaw restate cron sync-tier1 # seed/update ADR-0216 tier-1 jobs
# system-bus worker (build + push GHCR + apply + rollout wait)
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
# LiveKit (Helm + reconcile patches)
~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh joelclaw
# AIStor (Helm operator + objectstore)
# Defaults to isolated `aistor` namespace to avoid service-name collisions with legacy `joelclaw/minio`.
# Cutover override (explicit only): AISTOR_OBJECTSTORE_NAMESPACE=joelclaw AISTOR_ALLOW_JOELCLAW_NAMESPACE=true
~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh
# PDS (Helm) — always patch NodePort to 3000
# (export current values first if the release already exists)
helm get values bluesky-pds -n joelclaw > /tmp/pds-values-live.yaml 2>/dev/null || true
helm upgrade --install bluesky-pds nerkho/bluesky-pds \
-n joelclaw -f /tmp/pds-values-live.yaml
kubectl patch svc bluesky-pds -n joelclaw --type='json' \
  -p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":3000}]'

GitHub Actions deploy:

- Workflow: .github/workflows/system-bus-worker-deploy.yml
- Trigger: push to main touching packages/system-bus/** or worker deploy files
- Image: ghcr.io/joelhooks/system-bus-worker:${GITHUB_SHA} + :latest
- Runs on: self-hosted runner — ensure it is online on the Mac Mini.

GHCR push auth failure — Cause: GITHUB_TOKEN (default Actions token) does not have packages:write scope for this repo. A dedicated PAT is required.
Fix already applied: Workflow uses secrets.GHCR_PAT (not secrets.GITHUB_TOKEN) for the GHCR login step. The PAT is stored in:
- GitHub repo secret GHCR_PAT (set via GitHub UI)
- local secrets store as ghcr_pat (`secrets lease ghcr_pat`)

If this breaks again: the PAT may have expired. Regenerate at github.com → Settings → Developer settings → PATs, and update both stores.
Local fallback (bypass GHA entirely):
DOCKER_CONFIG_DIR=$(mktemp -d)
echo '{"credsStore":""}' > "$DOCKER_CONFIG_DIR/config.json"
export DOCKER_CONFIG="$DOCKER_CONFIG_DIR"
secrets lease ghcr_pat | docker login ghcr.io -u joelhooks --password-stdin
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh

Note: publish-system-bus-worker.sh uses gh auth token internally — if gh auth is stale, use the Docker login above before running the script, or patch it to use `secrets lease ghcr_pat` directly.
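Whether the isolated login actually landed can be checked without Docker by looking for an `auths` entry in the throwaway config. Sketch with a sample post-login config — no real login is performed here, and the auth value is a placeholder, not a real token:

```shell
# Sample of what the isolated config.json looks like after a successful
# `docker login ghcr.io` (auth value is a placeholder).
cfg='{"credsStore":"","auths":{"ghcr.io":{"auth":"PLACEHOLDER"}}}'
case "$cfg" in
  *'"ghcr.io"'*) echo "ghcr.io login recorded" ;;
  *)             echo "no ghcr.io login" ;;
esac
```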
- Do not use `kubectl port-forward` for persistent service exposure. All long-lived operator surfaces MUST use NodePort + Docker port mappings. The narrow exception is a CLI-managed, short-lived tunnel for an otherwise in-cluster-only control surface (for example `joelclaw restate cron *` tunneling to dkron-svc). Port-forwards silently die on idle/restart/pod changes, so do not leave them running.
- PVC `firecracker-images` is stateful runtime data. Treat it like a real runtime PVC: kernel, rootfs, and snapshot loss will break the microVM path.
- Watch VM disk usage with `colima ssh -- df -h /`. Alert at >80%.
- launchd jobs don't inherit `/opt/homebrew/bin`. Colima shells to limactl, kubectl/talosctl live in homebrew. launchd's default PATH is `/usr/bin:/bin:/usr/sbin:/sbin` — no homebrew. The canonical PATH for infra plists is: `/opt/homebrew/bin:/Users/joel/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin`. Discovered Feb 2026: missing PATH caused 6 days of silent recovery failures. Safer still: add `export PATH="/opt/homebrew/bin:..."` to the script itself.
- After a Colima restart (`colima stop && start`), the SSH port changes but the mux socket (~/.colima/_lima/colima/ssh.sock) caches the old connection. Symptoms: `kubectl port-forward` fails with "tls: internal error", `kubectl get nodes` may intermittently work then fail. Fix: `rm -f ~/.colima/_lima/colima/ssh.sock && pkill -f "ssh.*colima"`, then re-establish tunnels with `ssh -o ControlPath=none`. Always verify the SSH port with `colima ssh-config | grep Port` after restart.
- Adding service ports requires a hostconfig.json edit. See references/operations.md for the procedure.
- Zombie VM: `colima status` reports "Running" but the docker socket / SSH tunnels are dead. All k8s ports unresponsive. `colima start` is a no-op. Only `colima restart` recovers. Detect with: `ssh -F ~/.colima/_lima/colima/ssh.config lima-colima "docker info"` — if that fails while `colima status` passes, it's a zombie. The heal script handles this automatically.
- The Talos container has no shell — you cannot `docker exec` into it.
- Kernel modules like br_netfilter must be loaded at the Colima VM level: `ssh lima-colima "sudo modprobe br_netfilter"`.
- If the AIStor objectstore is deployed into joelclaw, it can claim svc/minio and break legacy MinIO assumptions. Keep the AIStor objectstore in an isolated namespace (aistor) unless intentionally cutting over.
- AIStor `helm upgrade` can fail on a MutatingWebhookConfiguration caBundle ownership conflict. Current mitigation in this cluster: set `operators.object-store.webhook.enabled=false` in k8s/aistor-operator-values.yaml.
- minio/minio:RELEASE.2025-10-15T17-29-55Z is not available on Docker Hub in this environment (ErrImagePull). Legacy fallback currently relies on minio/minio:latest.
- restate-worker privilege is intentional. Do not "harden" away /dev/kvm, `privileged: true`, or the unconfined seccomp profile unless you are simultaneously changing the Firecracker runtime contract.
- Do not name the Dkron service svc/dkron. Kubernetes injects DKRON_* env vars into pods, which collides with Dkron's own config parsing. Use dkron-peer and dkron-svc.
- dkron/dkron:latest currently needs root on the local-path PVC. Non-root hardening caused permission denied under /data/raft/snapshots/permTest and CrashLoopBackOff.

| Path | What |
|---|---|
| ~/Code/joelhooks/joelclaw/k8s/*.yaml | Service manifests |
| ~/Code/joelhooks/joelclaw/k8s/livekit-values.yaml | LiveKit Helm values (source controlled) |
| ~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh | LiveKit Helm deploy + post-upgrade reconcile |
| ~/Code/joelhooks/joelclaw/k8s/aistor-operator-values.yaml | AIStor operator Helm values |
| ~/Code/joelhooks/joelclaw/k8s/aistor-objectstore-values.yaml | AIStor objectstore Helm values |
| ~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh | AIStor deploy + upgrade reconcile script |
| ~/Code/joelhooks/joelclaw/k8s/dkron.yaml | Dkron scheduler StatefulSet + services |
| ~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh | Build/push/deploy system-bus worker to k8s |
| ~/Code/joelhooks/joelclaw/infra/k8s-reboot-heal.sh | Reboot auto-heal script for Colima/Talos/taint/flannel |
| ~/Code/joelhooks/joelclaw/infra/launchd/com.joel.k8s-reboot-heal.plist | launchd timer for reboot auto-heal |
| ~/Code/joelhooks/joelclaw/skills/k8s/references/operations.md | Cluster operations + recovery notes |
| ~/.talos/config | Talos client config |
| ~/.kube/config | Kubeconfig (context: admin@joelclaw-1) |
| ~/.colima/default/colima.yaml | Colima VM config |
| ~/.local/bin/colima-tunnel | Persistent SSH tunnel + NAS route (launchd: com.joel.colima-tunnel) |
| ~/.local/caddy/Caddyfile | Caddy HTTPS proxy (Tailscale) |
| ~/Code/joelhooks/joelclaw/k8s/nas-nvme-pv.yaml | NAS NVMe NFS PV/PVC (1.5TB) |
| ~/Code/joelhooks/joelclaw/k8s/nas-hdd-pv.yaml | NAS HDD NFS PV/PVC (50TB) |
Read references/operations.md for port mappings, recovery procedures, and cluster recreation steps.