Operate the joelclaw Kubernetes cluster — Talos Linux on Colima (Mac Mini). Deploy services, check health, debug pods, recover from restarts, add ports, manage Helm releases, inspect logs, fix networking. Triggers on: 'kubectl', 'pods', 'deploy to k8s', 'cluster health', 'restart pod', 'helm install', 'talosctl', 'colima', 'nodeport', 'flannel', 'port mapping', 'k8s down', 'cluster not working', 'add a port', 'PVC', 'storage', any k8s/Talos/Colima infrastructure task. Also triggers on service-specific deploy: 'deploy redis', 'redeploy inngest', 'livekit helm', 'pds not responding'.
Mac Mini (localhost ports)
└─ Lima SSH mux (~/.colima/_lima/colima/ssh.sock) ← NEVER KILL
└─ Colima VM (8 CPU, 16 GiB, 100 GiB, VZ framework, aarch64)
└─ Docker 29.x + buildx (joelclaw-builder, docker-container driver)
└─ Talos v1.12.4 container (joelclaw-controlplane-1)
└─ k8s v1.35.0 (single node, Flannel CNI)
└─ joelclaw namespace (privileged PSA)

⚠️ Talos has NO shell. No bash, no /bin/sh, nothing. You cannot docker exec into the Talos container. Use talosctl for node operations and the Colima VM (ssh lima-colima) for host-level operations like modprobe.
| Setting | Value | Reason |
|---|---|---|
| CPU | 8 | Match k8s workload requests (~2.8 CPU, 72%) |
| Memory | 16 GiB | 32 GiB causes macOS memory pressure → VM kill |
| nestedVirtualization | OFF by default | Crashes VM under load (image builds, heavy scheduling). Toggle ON only for Firecracker testing |
| vmType | vz | Required for Apple Silicon |
| mountType | virtiofs | Fastest option with VZ |
nestedVirtualization: true is unstable on M4 Pro under load. It causes the Colima VM to silently crash during Docker builds/pushes. Each crash:

- disconnects the docker CLI on macOS

Recovery from Colima crash-loop:

- colima stop && colima start for a basic restart
- redis-check-aof --fix if the Redis AOF was corrupted (see Redis AOF Recovery below)
- ssh -L /tmp/docker.sock:/var/run/docker.sock to re-tunnel the Docker socket if the CLI stays disconnected

Docker image builds should use the buildx container builder (docker buildx build --builder joelclaw-builder) to isolate build IO from k8s workloads.
If Redis crash-loops after a VM restart with Bad file format reading the append only file:
# 1. Scale down Redis (or use a temp pod if StatefulSet can't mount PVC concurrently)
kubectl -n joelclaw apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
name: redis-fix
namespace: joelclaw
spec:
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
containers:
- name: fix
image: redis:7-alpine
command: ["sh", "-c", "cd /data/appendonlydir && echo y | redis-check-aof --fix *.incr.aof && redis-check-aof *.incr.aof"]
volumeMounts:
- name: data
mountPath: /data
restartPolicy: Never
volumes:
- name: data
persistentVolumeClaim:
claimName: data-redis-0
EOF
# 2. Wait, check logs, then clean up
kubectl -n joelclaw logs redis-fix
kubectl -n joelclaw delete pod redis-fix --force
# 3. Restart Redis
kubectl -n joelclaw delete pod redis-0

For port mappings, recovery procedures, and cluster recreation steps, read references/operations.md.
infra/k8s-reboot-heal.sh runs under launchd as a fresh process every interval. Any recovery marker that only lives in shell memory dies at the end of that tick.
That means flannel/event-healing state must be persisted on disk. Canonical path:
~/.local/state/k8s-reboot-heal.env

Persist at least:
- COLIMA_START_EPOCH
- RECOVERY_START_EPOCH
- LAST_FLANNEL_RESTART_EPOCH
- COLIMA_UNHEALTHY_STREAK
- LAST_COLIMA_UNHEALTHY_EPOCH
- LAST_COLIMA_FORCE_CYCLE_EPOCH
- LAST_COLIMA_FAILED_RECOVERY_EPOCH

Why this matters: kubelet FailedCreatePodSandBox events mentioning a missing subnet.env can stay recent for minutes after the first repair. If the healer forgets that it already restarted flannel, the next launchd tick can bounce flannel again and knock healthy services like Typesense back into 503 warmup for no good reason. The extra failed-recovery marker also stops the system from counting a one-minute green flash as success and then force-cycling Colima again when the control path collapses.
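The persistence pattern above can be sketched as a load/decide/save cycle. The marker names follow this doc; the 300-second flannel cooldown and the temp-dir fallback for STATE are illustrative assumptions so the sketch is self-contained:

```shell
# Sketch only: STATE falls back to a temp dir here; the real path is
# ~/.local/state/k8s-reboot-heal.env. The 300s cooldown is an assumption.
STATE="${STATE_DIR:-$(mktemp -d)}/k8s-reboot-heal.env"

# Load markers persisted by the previous launchd tick, if any.
[ -f "$STATE" ] && . "$STATE"

now=$(date +%s)
# Only consider restarting flannel if we have not done so within the
# cooldown window, regardless of how many fresh ticks fired since.
if [ $(( now - ${LAST_FLANNEL_RESTART_EPOCH:-0} )) -ge 300 ]; then
  # real action elided: restart the flannel pod here
  LAST_FLANNEL_RESTART_EPOCH=$now
fi

# Write every marker back to disk so the next fresh process remembers it.
cat > "$STATE" <<EOF
LAST_FLANNEL_RESTART_EPOCH=${LAST_FLANNEL_RESTART_EPOCH}
COLIMA_UNHEALTHY_STREAK=${COLIMA_UNHEALTHY_STREAK:-0}
EOF
```

Because the decision reads the file first, a second tick within the cooldown becomes a no-op instead of a second flannel bounce.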
Docker port mappings for k8s API (6443) and Talos API (50000) are not pinned — they use random host ports assigned at container creation. All service ports (3111, 8288, 6379, etc.) ARE pinned 1:1.
When the Colima VM or Talos container restarts, Docker may reassign different random ports for 6443/50000. Kubeconfig goes stale, kubectl fails, and everything that depends on it (joelclaw CLI, health checks, pod inspection) breaks silently.
Symptoms: kubectl returns tls: internal error or connection refused. All pods are actually running — only the kubeconfig routing is wrong.
Fix:
# 1. Regenerate kubeconfig from talosctl (which has the correct port)
talosctl --talosconfig ~/.talos/config --nodes 127.0.0.1 kubeconfig --force
# 2. Switch to the new context
kubectl config use-context "$(kubectl config get-contexts -o name | grep joelclaw | head -1)"
# 3. Clean stale contexts (optional)
kubectl config delete-context admin@joelclaw # if stale entry exists

Self-heal: health.sh now auto-detects and fixes this before running checks.
Root cause: Container was created without pinning these ports. To permanently fix, recreate the container with explicit port bindings for 6443:6443 and 50000:50000. This requires cluster recreation — a bigger operation.
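As a sketch, the drifted port can be read back from Docker before regenerating anything. The docker port output line is stubbed here so the parsing stands alone; on the real host it would come from docker port joelclaw-controlplane-1 6443:

```shell
# Stand-in for: docker port joelclaw-controlplane-1 6443
# (on the real host this prints the randomly assigned host binding)
published="0.0.0.0:55123"

# The host port is everything after the last colon.
host_port="${published##*:}"
echo "k8s API currently published on host port $host_port"
```

A kubeconfig pointing at any other port is stale and must be regenerated via talosctl kubeconfig --force as shown above.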
A Colima restart is not recovery.
After any colima start / force-cycle, the system only counts recovery as real if a post-restart stability window stays healthy across repeated passes for:
If those regress during the verification window, classify the event as a failed recovery, capture proof artifacts, and stop repeated force-cycles for the configured hold period. The point is durability, not healer theatre.
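The verification-window rule can be sketched as a consecutive-pass counter. probe stands in for the real checks (kubectl, curl endpoints), and the pass count and spacing are illustrative assumptions:

```shell
# probe is a stub for the real health checks; here it always passes.
probe() { return 0; }

required_passes=5
streak=0
for _ in $(seq 1 "$required_passes"); do
  if probe; then
    streak=$((streak + 1))
  else
    streak=0   # any regression resets the window: not a recovery
    break
  fi
  # sleep 30   # real spacing between passes elided
done

if [ "$streak" -ge "$required_passes" ]; then
  echo "recovery: durable"
else
  echo "recovery: failed, capture artifacts and hold force-cycles"
fi
```

The reset-to-zero on any failed pass is the point: a one-minute green flash never accumulates enough streak to count as recovery.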
kubectl get pods -n joelclaw # all pods
curl -s localhost:3111/api/inngest # system-bus-worker → 200
curl -s localhost:7880/ # LiveKit → "OK"
curl -s localhost:8108/health # Typesense → {"ok":true}
curl -s localhost:8288/health # Inngest → {"status":200}
curl -s localhost:9070/deployments # Restate admin → deployments list
curl -s localhost:9627/xrpc/_health # PDS → {"version":"..."}
kubectl exec -n joelclaw redis-0 -- redis-cli ping # → PONG
joelclaw restate cron status # Dkron scheduler → healthy via temporary CLI tunnel

| Service | Type | Pod | Ports (Mac→NodePort) | Helm? |
|---|---|---|---|---|
| Redis | StatefulSet | redis-0 | 6379→6379 | No |
| Typesense | StatefulSet | typesense-0 | 8108→8108 | No |
| Inngest | StatefulSet | inngest-0 | 8288→8288, 8289→8289 | No |
| Restate | StatefulSet | restate-0 | 8080→8080, 9070→9070, 9071→9071 | No |
| system-bus-worker | Deployment | system-bus-worker-* | 3111→3111 | No |
| restate-worker | Deployment | restate-worker-* | in-cluster only (restate-worker:9080) | No |
| docs-api | Deployment | docs-api-* | 3838→3838 | No |
| LiveKit | Deployment | livekit-server-* | 7880→7880, 7881→7881 | Yes (livekit/livekit-server 1.9.0) |
| PDS | Deployment | bluesky-pds-* | 9627→3000 | Yes (nerkho/bluesky-pds 0.4.2) |
| MinIO | StatefulSet | minio-0 | 30900→30900, 30901→30901 | No |
| Dkron | StatefulSet | dkron-0 | in-cluster only (dkron-svc:8080) | No |
| AIStor Operator (aistor ns) | Deployments | adminjob-operator, object-store-operator | n/a | Yes (minio/aistor-operator) |
| AIStor ObjectStore (aistor ns) | StatefulSet | aistor-s3-pool-0-0 | 31000 (S3 TLS), 31001 (console) | Yes (minio/aistor-objectstore) |
- deployment/restate-worker is intentionally privileged and mounts /dev/kvm (hostPath type "", optional).
- PVC firecracker-images at /tmp/firecracker-test stores kernel, rootfs, and snapshot artifacts.
- When nestedVirtualization is OFF: /dev/kvm is absent and the microvm DAG handler fails, but shell/infer/noop handlers work normally.
- When nestedVirtualization is ON: Firecracker one-shot exec works (create workspace ext4 → write command → boot VM → guest executes → poweroff → read results).
- To reset Restate state: scale down, rm -rf /restate-data/*, scale back up, re-register the worker:
  curl -X POST http://localhost:9070/deployments -H 'content-type: application/json' -d '{"uri":"http://restate-worker:9080"}'

⚠️ PDS port trap: Docker maps 9627→3000 (host→container). NodePort must be 3000 to match the container-side port. If set to 9627, traffic won't route.
Rule: NodePort value = Docker's container-side port, not host-side.
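A minimal sketch of the rule, with the mapping string standing in for what docker port reports:

```shell
# host:container mapping as Docker reports it for the PDS service
mapping="9627:3000"
container_port="${mapping##*:}"

# The Service's nodePort must equal the container-side port (3000 here);
# using the host-side 9627 is exactly the trap described above.
node_port=3000
if [ "$node_port" -eq "$container_port" ]; then
  echo "nodePort matches container-side port"
fi
```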
Status: local sandbox remains the default/live path; the k8s backend is now code-landed and opt-in, but still needs supervised rollout before calling it earned runtime.
The agent runner executes sandboxed story runs as isolated k8s Jobs. Jobs are created dynamically via @joelclaw/agent-execution/job-spec — no static manifests.
See k8s/agent-runner.yaml for the full specification.
Required components:
- agent CLIs available in the image (claude and/or other installed CLIs)
- a /workspace working directory
- the job runner entrypoint at /app/packages/agent-execution/src/job-runner.ts

Configuration via environment variables:
- WORKFLOW_ID, REQUEST_ID, STORY_ID, SANDBOX_PROFILE, BASE_SHA, EXECUTION_BACKEND, JOB_NAME, JOB_NAMESPACE
- REPO_URL, REPO_BRANCH, optional HOST_REQUESTED_CWD
- AGENT_NAME, AGENT_MODEL, AGENT_VARIANT, AGENT_PROGRAM
- SESSION_ID, TIMEOUT_SECONDS
- TASK_PROMPT_B64 (base64-encoded)
- VERIFICATION_COMMANDS_B64 (base64-encoded JSON array)
- RESULT_CALLBACK_URL, RESULT_CALLBACK_TOKEN

Expected behavior:
- decode the task prompt from TASK_PROMPT_B64
- materialize the repo from REPO_URL / REPO_BRANCH at BASE_SHA
- execute AGENT_PROGRAM
- emit SandboxExecutionResult markers to stdout and POST the same result to /internal/agent-result

Current truthful limit:
pi remains local-backend only for now; do not pretend the pod runner can execute pi story runs yet.

import { generateJobSpec, generateJobDeletion } from "@joelclaw/agent-execution";
// 1. Generate Job spec
const spec = generateJobSpec(request, {
runtime: {
image: "ghcr.io/joelhooks/agent-runner:latest",
imagePullPolicy: "Always",
command: ["bun", "run", "/app/packages/agent-execution/src/job-runner.ts"],
},
namespace: "joelclaw",
imagePullSecret: "ghcr-pull",
resultCallbackUrl: "http://host.docker.internal:3111/internal/agent-result",
resultCallbackToken: process.env.OTEL_EMIT_TOKEN,
});
// 2. Apply to cluster (via kubectl or k8s client library)
// 3. Job runs → Pod materializes repo, executes agent, posts SandboxExecutionResult callback
// 4. Host worker can recover the same terminal result from log markers if callback delivery fails
// 5. Job auto-deletes after TTL (default: 5 minutes)
// Cancel a running Job
const deletion = generateJobDeletion("req-xyz");
// kubectl delete job ${deletion.name} -n ${deletion.namespace}

Job defaults:

- CPU: 500m request, 2 limit
- Memory: 1Gi request, 4Gi limit
- Timeout: 1 hour
- TTL after completion: 5 minutes
- Backoff limit: 0 (no retries)

# List agent runner Jobs
kubectl get jobs -n joelclaw -l app.kubernetes.io/name=agent-runner
# Check Job status
kubectl describe job <job-name> -n joelclaw
# View logs
kubectl logs job/<job-name> -n joelclaw
# Check for stale Jobs (should be auto-deleted by TTL)
kubectl get jobs -n joelclaw --show-all

Implementation files:

- Job spec generator (packages/agent-execution/src/job-spec.ts)
- Reference manifest (k8s/agent-runner.yaml)
- Tests (packages/agent-execution/__tests__/job-spec.test.ts)

k8s pods can mount NAS storage over NFS via a LAN route through the Colima bridge.
k8s pod → Talos container (10.5.0.x) → Docker NAT → Colima VM
→ ip route 192.168.1.0/24 via 192.168.64.1 dev col0
→ macOS host (IP forwarding enabled) → LAN → NAS (192.168.1.163)

Root cause of prior failures: VZ framework's shared networking on eth0 doesn't properly forward LAN-bound traffic. The fix routes LAN traffic through col0 (Colima bridge → macOS host) instead.
The LAN route is set in two places for reliability:
- Colima provision hook (~/.colima/default/colima.yaml): runs on colima start (cold boot)
- Reboot heal script (~/Code/joelhooks/joelclaw/infra/k8s-reboot-heal.sh): reasserts the route during reboot recovery ticks

Both execute: ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0
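A sketch of the check-then-assert pattern both places rely on. The route line is stubbed here so the matching logic stands alone; on the real VM it would come from colima ssh -- ip route | grep 192.168.1.0:

```shell
# Stand-in for: colima ssh -- ip route | grep 192.168.1.0
route_line="192.168.1.0/24 via 192.168.64.1 dev col0"

case "$route_line" in
  *"via 192.168.64.1 dev col0"*)
    status="present" ;;   # nothing to do; ip route replace is idempotent anyway
  *)
    status="missing" ;;   # would run: ip route replace 192.168.1.0/24 via 192.168.64.1 dev col0
esac
echo "LAN route: $status"
```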
com.joel.colima-tunnel is deprecated. Colima/Lima already forwards the docker-published host ports for joelclaw-controlplane-1, so a second autossh daemon on those same ports is not redundancy — it's interference.
Rules:
- com.joel.colima is the only boot/start helper for the VM; it must not keep a periodic StartInterval
- com.joel.colima-tunnel should be absent from /Library/LaunchDaemons/; install-critical-launchdaemons.sh removes it instead of reinstalling it
- Lima itself forwards the docker-published host ports for joelclaw-controlplane-1 (3838, 6379, 7880, 7881, 8108, 8288, 8289, 9627, 64784)
- do not run extra ssh listeners on those host ports; that can kill Lima's own forwarders
- infra/colima-tunnel.sh is now only a deprecated compatibility stub so stale launchd installs exit cleanly instead of fighting Lima
- com.joel.kube-operator-access is the allowed exception because it owns dedicated operator-only loopback ports that Colima/Lima do not publish themselves: 16443 -> 10.5.0.2:6443 for kube-apiserver and 15000 -> 10.5.0.2:50000 for Talos
- the operator tunnel runs ssh -F ~/.colima/_lima/colima/ssh.config -S none -o ControlPath=none -o ControlMaster=no -o ControlPersist=no; do not trust the generic Lima mux path for long-lived kubectl/talos access after a rebuild
- kubectl should target https://127.0.0.1:16443 and talosctl should use 127.0.0.1:15000
- com.joel.k8s-reboot-heal must use the same JSON status check; a plain colima status false-negative can force-cycle the VM and retrigger the flannel/NAS failure cascade during reboot recovery
- persist escalation markers in ~/.local/state/k8s-reboot-heal.env so Talos and workload warmup can finish before another escalation is even considered
- after recovery, confirm the LAN route 192.168.1.0/24 via 192.168.64.1 dev col0 exists again and NFS is reachable from the Colima VM
- watch for failed to load flannel 'subnet.env' file; treat recent FailedCreatePodSandBox events with that message as a restart signal for the flannel pod

| PV | NFS Path | Capacity | Access | Use |
|---|---|---|---|---|
| nas-nvme | 192.168.1.163:/volume2/data | 1.5TB | RWX | NVMe RAID1: backups, snapshots, models, sessions |
| nas-hdd | 192.168.1.163:/volume1/joelclaw | 50TB | RWX | HDD RAID5: books, docs-artifacts, archives, otel |
| minio-nfs-pv | 192.168.1.163:/volume1/joelclaw | 1TB | RWO | HDD tier: MinIO object storage (same export) |
volumes:
- name: nas
persistentVolumeClaim:
claimName: nas-nvme
containers:
- volumeMounts:
- name: nas
mountPath: /nas
# Optional: subPath for specific dir
subPath: typesense

- Use nfsvers=3,tcp,resvport,noatime mount options. NFSv4 has issues with Asustor ADM.
- Use the soft mount option: returns errors, doesn't hang pods.
- Verify the route with colima ssh -- ip route | grep 192.168.1.0

# From Colima VM
colima ssh -- timeout 2 bash -c "echo > /dev/tcp/192.168.1.163/2049" && echo "NFS OK"
# From k8s pod
kubectl run nfs-test --image=busybox --restart=Never -n joelclaw \
--overrides='{"spec":{"tolerations":[{"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"}],"containers":[{"name":"t","image":"busybox","command":["sh","-c","ls /nas && echo OK"],"volumeMounts":[{"name":"n","mountPath":"/nas"}]}],"volumes":[{"name":"n","persistentVolumeClaim":{"claimName":"nas-nvme"}}]}}'
kubectl logs nfs-test -n joelclaw && kubectl delete pod nfs-test -n joelclaw --force

# Manifests (redis, typesense, inngest, dkron)
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/
# Restate runtime
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/restate.yaml
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/firecracker-pvc.yaml
kubectl rollout status statefulset/restate -n joelclaw
~/Code/joelhooks/joelclaw/k8s/publish-restate-worker.sh
curl -fsS http://localhost:9070/deployments
# Dkron phase-1 scheduler (ClusterIP API + CLI-managed short-lived tunnel access)
kubectl apply -f ~/Code/joelhooks/joelclaw/k8s/dkron.yaml
kubectl rollout status statefulset/dkron -n joelclaw
joelclaw restate cron status
joelclaw restate cron sync-tier1 # seed/update ADR-0216 tier-1 jobs
# system-bus worker (build + push GHCR + apply + rollout wait)
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh
# LiveKit (Helm + reconcile patches)
~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh joelclaw
# AIStor (Helm operator + objectstore)
# Defaults to isolated `aistor` namespace to avoid service-name collisions with legacy `joelclaw/minio`.
# Cutover override (explicit only): AISTOR_OBJECTSTORE_NAMESPACE=joelclaw AISTOR_ALLOW_JOELCLAW_NAMESPACE=true
~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh
# PDS (Helm) — always patch NodePort to 3000
# (export current values first if the release already exists)
helm get values bluesky-pds -n joelclaw > /tmp/pds-values-live.yaml 2>/dev/null || true
helm upgrade --install bluesky-pds nerkho/bluesky-pds \
-n joelclaw -f /tmp/pds-values-live.yaml
kubectl patch svc bluesky-pds -n joelclaw --type='json' \
-p='[{"op":"replace","path":"/spec/ports/0/nodePort","value":3000}]'

- Workflow: .github/workflows/system-bus-worker-deploy.yml
- Trigger: push to main touching packages/system-bus/** or worker deploy files
- Image: ghcr.io/joelhooks/system-bus-worker:${GITHUB_SHA} + :latest
- Runs on: self-hosted runner
- Deploys only when the self-hosted runner is online on the Mac Mini

Cause: GITHUB_TOKEN (default Actions token) does not have packages:write scope for this repo. A dedicated PAT is required.
Fix already applied: Workflow uses secrets.GHCR_PAT (not secrets.GITHUB_TOKEN) for the GHCR login step. The PAT is stored in:
- GitHub Actions secret GHCR_PAT (set via GitHub UI)
- local secret ghcr_pat (secrets lease ghcr_pat)

If this breaks again: the PAT may have expired. Regenerate at github.com → Settings → Developer settings → PATs, update both stores.
Local fallback (bypass GHA entirely):
DOCKER_CONFIG_DIR=$(mktemp -d)
echo '{"credsStore":""}' > "$DOCKER_CONFIG_DIR/config.json"
export DOCKER_CONFIG="$DOCKER_CONFIG_DIR"
secrets lease ghcr_pat | docker login ghcr.io -u joelhooks --password-stdin
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.shNote: publish-system-bus-worker.sh uses gh auth token internally — if gh auth is stale, use the Docker login above before running the script, or patch it to use secrets lease ghcr_pat directly.
- Do not use kubectl port-forward for persistent service exposure. All long-lived operator surfaces MUST use NodePort + Docker port mappings. The narrow exception is a CLI-managed, short-lived tunnel for an otherwise in-cluster-only control surface (for example joelclaw restate cron * tunneling to dkron-svc). Port-forwards silently die on idle/restart/pod changes, so do not leave them running.
- firecracker-images is stateful runtime data. Treat it like a real runtime PVC: kernel, rootfs, and snapshot loss will break the microVM path.
- Watch VM disk usage with colima ssh -- df -h /. Alert at >80%.
- launchd plists need /opt/homebrew/bin. Colima shells to limactl, kubectl/talosctl live in homebrew. launchd's default PATH is /usr/bin:/bin:/usr/sbin:/sbin (no homebrew). The canonical PATH for infra plists is: /opt/homebrew/bin:/Users/joel/.local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin. Discovered Feb 2026: missing PATH caused 6 days of silent recovery failures.
- Also add export PATH="/opt/homebrew/bin:..." to the script itself.
- After a VM restart (colima stop && start), the SSH port changes but the mux socket (~/.colima/_lima/colima/ssh.sock) caches the old connection. Symptoms: kubectl port-forward fails with "tls: internal error", kubectl get nodes may intermittently work then fail. Fix: rm -f ~/.colima/_lima/colima/ssh.sock && pkill -f "ssh.*colima", then re-establish tunnels with ssh -o ControlPath=none. Always verify the SSH port with colima ssh-config | grep Port after restart.
- Pinning the 6443/50000 ports requires a hostconfig.json edit. See references/operations.md for the procedure.
- Zombie VM: colima status reports "Running" but the docker socket / SSH tunnels are dead. All k8s ports unresponsive. colima start is a no-op. Only colima restart recovers. Detect with: ssh -F ~/.colima/_lima/colima/ssh.config lima-colima "docker info"; if that fails while colima status passes, it's a zombie. The heal script handles this automatically.
- The Talos container has no shell, so you cannot docker exec into it. Kernel modules like br_netfilter must be loaded at the Colima VM level: ssh lima-colima "sudo modprobe br_netfilter".
- If the AIStor objectstore is installed into joelclaw, it can claim svc/minio and break legacy MinIO assumptions. Keep the AIStor objectstore in its isolated namespace (aistor) unless intentionally cutting over.
- AIStor helm upgrade can fail on a MutatingWebhookConfiguration caBundle ownership conflict. Current mitigation in this cluster: set operators.object-store.webhook.enabled=false in k8s/aistor-operator-values.yaml.
- minio/minio:RELEASE.2025-10-15T17-29-55Z is not available on Docker Hub in this environment (ErrImagePull). Legacy fallback currently relies on minio/minio:latest.
- restate-worker privilege is intentional. Do not “harden” away /dev/kvm, privileged: true, or the unconfined seccomp profile unless you are simultaneously changing the Firecracker runtime contract.
- Never name the Dkron service svc/dkron. Kubernetes injects DKRON_* env vars into pods, which collides with Dkron's own config parsing. Use dkron-peer and dkron-svc.
- dkron/dkron:latest currently needs root on the local-path PVC. Non-root hardening caused permission denied under /data/raft/snapshots/permTest and CrashLoopBackOff.
- Host workers and the CLI expect Typesense on localhost:8108; at one point typesense had been restored as ClusterIP only. The fix was to make k8s/typesense.yaml a NodePort service on 8108 again so host worker + CLI writes have a stable path without reviving a launchd port-forward sidecar.
- docs-api-env: the manifest depends on secret docs-api-env with key PDF_BRAIN_API_TOKEN. The token lives in agent-secrets as pdf_brain_api_token; recreate the k8s secret before applying k8s/docs-api.yaml on a rebuilt cluster or the Deployment will stay broken.
- If the Typesense collection system_knowledge is missing you will see 404 {"message":"Collection not found"}. The CLI now auto-heals this on first joelclaw knowledge search by recreating the collection and re-syncing ADRs + skills, but an explicit joelclaw knowledge sync is still the blunt proof command.
- On a rebuilt cluster, recreate bluesky-pds-secrets, reinstall the bluesky-pds Helm release, force the service nodePort back to 3000, then recreate Joel's account if the PVC was wiped. The new account returns a fresh DID, so update the pds_joel_did secret afterward or host dual-write will keep authenticating against a dead repo.
- com.atproto.server.createSession succeeded against joel.pds.panda.tail7af24.ts.net but rejected the raw DID. packages/system-bus/src/lib/pds.ts now resolves the handle from pds_joel_did via describeRepo before it asks for a session, which keeps the dual-write path aligned with reality.
- On large collections (docs_chunks_v2 ~223k rows), Typesense exhausts its 1024 default FD limit and starts logging Fail to open /proc/self/fd: Too many open files. The pod stays Running but 503s everything, and the external symptom is identical to a raft leader-election failure. Fix: wrap the container command with ulimit -n 1048576 before exec-ing /opt/typesense-server. Canonical form lives in k8s/typesense.yaml since 2026-04-19.
- The Talos container is memory-capped at 4 GiB: docker inspect joelclaw-controlplane-1 --format '{{.HostConfig.Memory}}' returns 4294967296 even though the Colima VM has 16 GiB. Every k8s pod inside Talos shares those 4 GiB. Under Typesense otel_events / docs_chunks_v2 load the Talos container pegs at 99.8% and typesense + kubectl tunnel both go unresponsive together. Live fix without restart: ssh -F ~/.colima/_lima/colima/ssh.config lima-colima "sudo docker update --memory=12g --memory-swap=12g joelclaw-controlplane-1". After that, bump per-pod resources.limits.memory to match the real workload (Typesense raised from 4 Gi → 8 Gi on 2026-04-20).

| Path | What |
|---|---|
~/Code/joelhooks/joelclaw/k8s/*.yaml | Service manifests |
~/Code/joelhooks/joelclaw/k8s/livekit-values.yaml | LiveKit Helm values (source controlled) |
~/Code/joelhooks/joelclaw/k8s/reconcile-livekit.sh | LiveKit Helm deploy + post-upgrade reconcile |
~/Code/joelhooks/joelclaw/k8s/aistor-operator-values.yaml | AIStor operator Helm values |
~/Code/joelhooks/joelclaw/k8s/aistor-objectstore-values.yaml | AIStor objectstore Helm values |
~/Code/joelhooks/joelclaw/k8s/reconcile-aistor.sh | AIStor deploy + upgrade reconcile script |
~/Code/joelhooks/joelclaw/k8s/dkron.yaml | Dkron scheduler StatefulSet + services |
~/Code/joelhooks/joelclaw/k8s/publish-system-bus-worker.sh | Build/push/deploy system-bus worker to k8s |
~/Code/joelhooks/joelclaw/infra/k8s-reboot-heal.sh | Reboot auto-heal script for Colima/Talos/taint/flannel |
~/Code/joelhooks/joelclaw/infra/kube-operator-access.sh | launchd-managed kubectl/talos operator tunnel on 16443/15000 |
~/Code/joelhooks/joelclaw/infra/launchd/com.joel.k8s-reboot-heal.plist | launchd timer for reboot auto-heal |
~/Code/joelhooks/joelclaw/infra/launchd/com.joel.kube-operator-access.plist | launchd service for stable operator access |
~/Code/joelhooks/joelclaw/skills/k8s/references/operations.md | Cluster operations + recovery notes |
~/.talos/config | Talos client config (stable endpoint: 127.0.0.1:15000) |
~/.kube/config | Kubeconfig (stable server: https://127.0.0.1:16443) |
~/.colima/default/colima.yaml | Colima VM config |
~/Code/joelhooks/joelclaw/infra/colima-tunnel.sh | Deprecated compatibility stub; exits cleanly so stale launchd installs stop fighting Lima |
~/.local/bin/colima-tunnel | Compatibility wrapper for the deprecated tunnel stub |
~/.local/caddy/Caddyfile | Caddy HTTPS proxy (Tailscale) |
~/Code/joelhooks/joelclaw/k8s/nas-nvme-pv.yaml | NAS NVMe NFS PV/PVC (1.5TB) |
~/Code/joelhooks/joelclaw/k8s/nas-hdd-pv.yaml | NAS HDD NFS PV/PVC (50TB) |
Read references/operations.md for: