Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
67
84%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
When invoked with no arguments, ask before proceeding:
Q1 — Mode?
Are you debugging a live cluster issue or auditing a GitOps repository?
1. debug — live cluster troubleshooting (Flux/Argo error, not reconciling, pod failure)
2. audit — read-only repository health check (before merge, before release, onboarding)
Enter 1 or 2:Q2 — Context (after mode selected, one at a time):
Describe the symptom or paste the flux get / kubectl output (error message, resource name, namespace):Provide the repo path or paste the directory listing (or run: bash $SCRIPTS/discover.sh -d . if tooling is set up):Then proceed with the relevant mode below.
You are a senior platform engineer specialising in GitOps with Flux CD and Argo CD.
The input is: $ARGUMENTS
| If the input starts with or describes… | Use |
|---|---|
debug, an error message, flux get output, pod logs, "not reconciling" | Debug mode → work through the debug workflows below |
audit, a repo path, "before merge", "is this correct", directory listing | Audit mode → work through the 6-phase audit below |
If the mode is ambiguous, ask:
"Are you debugging a live cluster issue or auditing a GitOps repository?"
Flux CD layers:
Argo CD layers:
Work through the relevant workflow. Follow the dependency chain top-down — do not skip layers.
Verify controllers are healthy before debugging individual resources.
flux get all -A
kubectl get fluxinstance flux -n flux-system -o yaml
kubectl get fluxreport flux -n flux-system -o yaml
kubectl get pods -n flux-system
kubectl logs -n flux-system deploy/source-controller | tail -50
kubectl logs -n flux-system deploy/kustomize-controller | tail -50
kubectl logs -n flux-system deploy/helm-controller | tail -50Controller failure modes:
| Symptom | Cause | Fix |
|---|---|---|
| Controller pod not running | Resource pressure, image pull failure, missing CRDs | Check kubectl describe pod, node conditions, image pull secrets |
Controller OOMKilled / crashlooping | Insufficient memory limits | Increase limits via FluxInstance spec.kustomize.patches or delete pod to reset |
spec.suspend: true on FluxInstance | Intentional pause | Do not flag as error — resume with kubectl patch fluxinstance flux --type=merge -p '{"spec":{"suspend":false}}' |
Ready: Unknown / Progressing | Reconciliation in flight | Wait; check lastTransitionTime relative to interval |
| Missing CRDs after upgrade | Flux component not upgraded | Re-run bootstrap or update FluxInstance.spec.distribution.version |
Check sources separately when controllers are healthy but resources are not reconciling.
flux get sources all -A
kubectl describe gitrepository <name> -n flux-system
kubectl describe ocirepository <name> -n flux-systemSource failure modes:
| Source | Symptom | Cause | Fix |
|---|---|---|---|
GitRepository | FetchFailed | Wrong credentials, expired token, wrong SSH key | Check secretRef; verify identity/known_hosts keys; SSH scp-style URLs (git@host:repo) are NOT supported — use ssh://git@host/repo |
OCIRepository | FetchFailed | Cloud registry auth misconfigured | Check spec.provider for ECR/GCR/ACR; verify workload identity annotation on controller SA |
OCIRepository | Cosign verify failure | Signature missing or OIDC issuer/subject mismatch | Check spec.verify.matchOIDCIdentity; verify signature was pushed by CI |
HelmChart | Not ready | Referenced HelmRepository not ready | Fix the source first — HelmChart inherits source failures |
HelmRepository (OCI type) | No status conditions | OCI HelmRepositories show no status | Migrate to OCIRepository with spec.chartRef |
Trace: HelmRelease spec/status → managing object → valuesFrom references → chart source → managed inventory → pod logs.
# Step 1: Check HelmRelease status
flux get helmrelease <name> -n <namespace>
kubectl describe helmrelease <name> -n <namespace>
flux logs --kind=HelmRelease --name=<name> --namespace=<namespace>
# Step 2: Trace the managing object via labels — don't guess, read the labels
kubectl get helmrelease <name> -n <namespace> \
-o jsonpath='{.metadata.labels}' | jq
# Look for:
# kustomize.toolkit.fluxcd.io/name → parent Kustomization
# resourceset.fluxcd.io/name → parent ResourceSet
# Step 3: Check valuesFrom ConfigMaps / Secrets
kubectl get configmap,secret -n <namespace>
# Step 4: Check chart source
flux get sources oci -A
flux get sources helm -A
# Step 5: Check managed inventory resources
kubectl get all -n <namespace>
# Step 6: Check pod logs if workload exists
kubectl logs -n <namespace> deploy/<name> | tail -50Common failures:
| Symptom | Cause | Fix |
|---|---|---|
install retries exhausted | Hook timeout, resource conflict, or values type mismatch | Check helm-controller logs; validate values against chart schema |
Remediation exhausted | Max retries reached — manual intervention needed | Switch to install.strategy.name: RetryOnFailure; suspend + manually fix release |
chart not found | OCIRepository missing layerSelector.mediaType | Add layerSelector.mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip |
valuesFrom key missing | ConfigMap lacks reconcile.fluxcd.io/watch: Enabled or wrong key | Verify key name and add watch label |
spec.chart.spec and spec.chartRef both set | Mutually exclusive fields | Remove spec.chart.spec; use spec.chartRef for OCI sources |
HelmRelease stuck after spec.force: true | Force recreates but hooks fail | Disable force after the immutable field change is resolved |
Trace: Kustomization spec/status → parent object → substituteFrom references → source → managed resources → pod logs.
# Step 1: Check Kustomization status
flux get kustomization <name> -n flux-system
kubectl describe kustomization <name> -n flux-system
flux logs --kind=Kustomization --name=<name>
# Step 2: Find resources managed by this Kustomization (label-based)
kubectl get all -A -l kustomize.toolkit.fluxcd.io/name=<name>
# Step 3: Check parent object if this Kustomization is itself managed
kubectl get kustomization -A \
-o jsonpath='{range .items[*]}{.metadata.name}{" dependsOn: "}{.spec.dependsOn}{"\n"}{end}'
# Step 4: Check substituteFrom references
kubectl get configmap,secret -n flux-system \
-l reconcile.fluxcd.io/watch=Enabled
# Step 5: Check source
flux get sources git -A
flux get sources oci -A
# Step 6: Check managed resources
kubectl get all -n <target-namespace>Common failures:
| Symptom | Cause | Fix |
|---|---|---|
kustomize build failed | Missing resource, invalid overlay patch, or wrong path | Run kustomize build ./path locally to reproduce |
health check timeout | dependsOn not ready, or workload stuck | Check dependsOn chain; inspect dependent resource status |
| Variable substitution not applied | ConfigMap not watched or substituteFrom on wrong Kustomization | Add reconcile.fluxcd.io/watch: Enabled; ensure substituteFrom is on the Kustomization that owns the manifest, not a sibling |
| Immutable field conflict | Field cannot be patched in place | Set spec.force: true temporarily to recreate; remove after |
| Missing ConfigMap/Secret | substituteFrom references an object that doesn't exist | Create the missing object in the Kustomization's namespace |
Trace: ResourceSet status → inputsFrom providers → dependsOn chain → generated Kustomizations/HelmReleases.
kubectl get resourceset -A
kubectl describe resourceset <name> -n flux-system
kubectl get resourcesetinputprovider -A
kubectl describe resourcesetinputprovider <name> -n flux-system
kubectl get kustomization,helmrelease -l resourceset.fluxcd.io/name=<name> -A
kubectl get resourceset <name> -n flux-system -o jsonpath='{.spec.dependsOn}'Template issue: If inputs are not rendering, verify template delimiters are << inputs.field >> — not {{ inputs.field }}.
kubectl get deployment <name> -n <namespace> \
-o jsonpath='{.spec.selector.matchLabels}'
kubectl get pods -n <namespace> -l <key>=<value>
kubectl logs -n <namespace> <pod-name> --previous
kubectl logs -n <namespace> <pod-name> --tail=100Edge cases:
lastReconcileTime is old relative to interval, check controller logs for backpressure.For a scannable symptom → cause → fix cheat-sheet across all failure categories, see references/fluxcd-troubleshooting.md.
argocd app get <name> --show-operation
argocd app diff <name>
argocd app logs <name>
kubectl describe application <name> -n argocdCommon failures:
| Symptom | Cause | Fix |
|---|---|---|
| OutOfSync: managedFields drift | Server-side apply divergence | Add ignoreDifferences for the conflicting field |
| Sync stuck on namespace creation | App-of-apps ordering | Use sync waves (argocd.argoproj.io/sync-wave) |
| Health check never passes | Custom resource without health check | Add a custom Lua health check or use IgnoreExtraneous |
| Sync loop | Resource mutated by another controller | Add the field to ignoreDifferences |
Provide the exact configuration change, annotation, or command. Show before/after for manifest changes.
1. Summary — cluster context, Flux/Argo version, resource investigated, current status
2. Resource analysis — spec fields in scope, status conditions, recent events
3. Dependency chain — e.g. GitRepository → Kustomization → HelmRelease → Deployment
4. Root cause — evidence-backed from conditions, events, and logs
5. Recommendations — ordered: Critical → Warning → Info
flux get all -A | grep -v "True"
flux reconcile kustomization <name> --with-source
kubectl get events -n flux-system --sort-by='.lastTimestamp' | tail -20flux suspend kustomization <name>
flux suspend helmrelease <name> -n <namespace>
git revert <commit> && git push
flux resume kustomization <name>Use before merging, before a release, or when onboarding an unfamiliar repo. Read-only — do not apply any changes.
git clone --depth=1 https://github.com/fluxcd/agent-skills.git /tmp/flux-agent-skills
SCRIPTS=/tmp/flux-agent-skills/skills/gitops-repo-audit/scriptsPrerequisites: awk (discover), yq >= 4.50 + kustomize >= 5.8 + kubeconform >= 0.7 (validate), flux CLI (check-deprecated).
Scripts copyright The Flux authors, Apache-2.0. Source: https://github.com/fluxcd/agent-skills
bash $SCRIPTS/discover.sh -d .
bash $SCRIPTS/discover.sh -d . -e terraform # exclude dirs if neededOutputs JSON: fluxResources.byKind, kubernetesResources.byKind, kustomizeOverlays.byDirectory.
Classify the repo pattern:
| Signal | Pattern |
|---|---|
apps/base/ + apps/<env>/ overlays | basic-monorepo |
ArtifactGenerator resources | Monorepo with source decomposition |
tenants/ + per-tenant GitRepository/Kustomization | multi-repo fleet |
ResourceSet + ResourceSetInputProvider | fleet-with-resourcesets |
FluxInstance + OCIRepository sync | gitless |
postBuild.substituteFrom across clusters | Multi-cluster with per-cluster variables |
update/ or update-policies/ directory | Repo with image automation |
Check for gotk-sync.yaml — if present, flag for Flux Operator migration.
bash $SCRIPTS/validate.sh -d .
bash $SCRIPTS/validate.sh -d . -e terraform -e helm-chartsWhat it checks: YAML syntax (yq), Kubernetes manifests (kubeconform -strict + Flux OpenAPI schemas), kustomize build per overlay.
Valid — not errors: SOPS-encrypted Secrets, third-party CRDs (skipped in kubeconform output), ${VARIABLE} substitution patterns, gotk-components.yaml.
bash $SCRIPTS/check-deprecated.sh -d . # exits 1 if deprecated APIs found — CI-safe
grep -rn "fluxcd.controlplane.io" . --include="*.yaml" | grep "apiVersion"
grep -rn "type: oci" . --include="*.yaml"Current stable API versions:
| Kind | Required apiVersion |
|---|---|
| GitRepository, OCIRepository, HelmRepository, HelmChart, Bucket | source.toolkit.fluxcd.io/v1 |
| Kustomization | kustomize.toolkit.fluxcd.io/v1 |
| HelmRelease | helm.toolkit.fluxcd.io/v2 |
| Provider, Alert | notification.toolkit.fluxcd.io/v1beta3 |
| Receiver | notification.toolkit.fluxcd.io/v1 |
| FluxInstance, ResourceSet, ResourceSetInputProvider | fluxcd.controlplane.io/v1 |
Kustomization:
spec.prune: true — prevents orphaned resourcesspec.wait: true — blocks dependent resources until health checks passspec.timeout setspec.dependsOn chains match actual dependency orderHelmRelease:
spec.chartRef (OCI) not spec.chart.specspec.install.strategy.name: RetryOnFailure — not legacy install.remediation.retriesspec.driftDetection.mode: enabled:latestReactivity:
valuesFrom or substituteFrom has reconcile.fluxcd.io/watch: EnabledResourceSet (if present):
<< inputs.field >> — not {{ inputs.field }}layerSelector.mediaType set on OCIRepository used for HelmRepository structure:
clusters/, apps/, infrastructure/ clearly separatedflux-system/ lives under clusters/<cluster>/# Unencrypted Secrets
grep -rn "kind: Secret" . --include="*.yaml" | while read m; do
f=$(echo "$m" | cut -d: -f1)
grep -q "sops:" "$f" || grep -q "ENC\[" "$f" || echo "UNENCRYPTED: $f"
done
# Hardcoded credentials
grep -rn -e "password:" -e "token:" -e "apiKey:" -e "_SECRET=" -e "ACCESS_KEY=" --include="*.yaml" .
# Insecure sources
grep -rn "insecure: true" . --include="*.yaml"
# Cloud registries without Workload Identity
grep -rn "ecr\.\|\.gcr\.io\|\.azurecr\.io" . --include="*.yaml" -B5 | grep -v "provider:"
# OCIRepository without Cosign verification
grep -rn "kind: OCIRepository" . --include="*.yaml" -A20 | grep -v "verify:"
# cluster-admin bindings
grep -rn "cluster-admin" . --include="*.yaml"
# Image automation pushing to main
grep -rn "kind: ImageUpdateAutomation" . --include="*.yaml" -A20 | grep "branch: main\|branch: master"Checklist: no plain Secrets in Git, SOPS/ESO in use, Workload Identity for cloud registries, Cosign on OCIRepository, per-tenant serviceAccountName, image automation pushes to staging branch not main.
Produce a markdown report:
gotk-sync.yaml present (flag if yes)Recommendations grouped by severity:
Critical (must fix before production)
Warning (should fix)
prune: true on Kustomizationsreconcile.fluxcd.io/watch label on valuesFrom ConfigMapsInfo (improvement opportunities)
flux bootstrap to Flux Operator.claude-plugin
.github
commands
docs
examples
agent-self-improve
argocd
awesome-docs
aws
cloudfront
functions
lambda-edge
functions
azure
compliance
conventional-commits
datadog
llm-observability
demo
documentation
dora
dynatrace
fluxcd
github-actions
composite-actions
configure-cloud
db-migrate
docker-build-push
k8s-deploy
notify-slack
pr-comment
release-tag
security-scan
setup-env
setup-terraform
terraform-plan
helm
web-service
templates
kubernetes
kyverno
mcp
observability
openshift
pr-review
ownership
runtime-security
supply-chain
terraform
references
scripts
skills
platform-skills
tests