nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.

FluxCD Troubleshooting — Failure Pattern Quick Reference

Scannable incident cheat-sheet. Symptom → most likely cause → exact fix. For the full 5-workflow debug procedure, use /platform-skills:gitops debug.


Controller failures

| Symptom | Cause | Fix |
|---|---|---|
| Controller pod not running | Resource pressure, image pull failure, CRDs not installed | `kubectl describe pod -n flux-system`; check node conditions and image pull secrets |
| Controller OOMKilled / crashlooping | Memory limit too low | Raise limits via FluxInstance `spec.kustomize.patches`; delete pod to trigger immediate reschedule |
| `spec.suspend: true` on FluxInstance | Intentional pause, not an error | Confirm intent; resume: `kubectl patch fluxinstance flux --type=merge -p '{"spec":{"suspend":false}}'` |
| `Ready: Unknown` / Progressing | Reconciliation in flight | Wait; check `lastTransitionTime` vs interval; if older than 2× interval, check controller logs for backpressure |
| CRDs missing after upgrade | Flux component not upgraded | Re-run bootstrap or bump `FluxInstance.spec.distribution.version` |
| `flux-system` namespace empty | Flux never installed | Run `flux install` or deploy FluxInstance via Flux Operator |
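
The "older than 2× interval" rule for a resource sitting in `Ready: Unknown` can be checked mechanically. The sketch below is one possible way to do it, assuming GNU `date` (for `date -d`) and an interval already converted to seconds; `is_stale` is a hypothetical helper name, not part of the Flux CLI.

```bash
#!/usr/bin/env bash
# is_stale LAST_TRANSITION INTERVAL_SECONDS [NOW_EPOCH]
# Returns 0 (stale) when the condition is older than 2x the interval,
# 1 when it is still within the window, 2 if the timestamp cannot be parsed.
# Assumption: GNU date is available (on macOS, coreutils provides it as gdate).
is_stale() {
  local last_transition="$1" interval_s="$2" now="${3:-$(date +%s)}"
  local last_epoch
  last_epoch=$(date -u -d "$last_transition" +%s) || return 2
  (( now - last_epoch > 2 * interval_s ))
}

# Example usage (not executed here); the timestamp could be read with:
#   kubectl get kustomization <name> -n <ns> \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'
#   is_stale "<timestamp>" 600 && echo "stale: check controller logs"
```

The optional third argument pins "now" to a fixed epoch, which keeps the check reproducible when replaying an incident timeline.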

Source failures

GitRepository

| Symptom | Cause | Fix |
|---|---|---|
| `FetchFailed: authentication required` | Missing or expired SSH key / PAT | Verify `spec.secretRef`; check `identity` + `known_hosts` keys in the Secret |
| `FetchFailed: repository not found` | Wrong URL or repo is private | Check `spec.url`; verify credentials have read access |
| `FetchFailed: reference not found` | Branch / tag does not exist | Check `spec.ref.branch` or `spec.ref.tag`; ensure it exists in Git |
| TLS error | Self-signed cert or wrong CA | Set `spec.secretRef` with `caFile` key, or `spec.insecure: true` (non-production only) |
| SSH URL rejected | SCP-style URL (`git@host:repo`) not supported | Change to `ssh://git@host/org/repo` |
| Stale `lastFetchedAt` | Controller backpressure or queue depth | Check `kubectl logs -n flux-system deploy/source-controller` for queue metrics |
| `sparseCheckout` slow / timeout | Large repo with narrow checkout; race on first clone | Increase `spec.timeout`; use `spec.ignore` patterns to reduce fetched content |
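
The SCP-style URL fix in the table above is a purely textual rewrite, so it can be scripted when many GitRepository manifests need updating. A minimal sketch, assuming bash; `to_ssh_url` is a hypothetical helper name, and URLs that are not SCP-style are passed through unchanged.

```bash
#!/usr/bin/env bash
# to_ssh_url URL
# Rewrites an SCP-style Git URL (user@host:path) into the ssh:// form.
# Anything that does not look SCP-style (https://, ssh://, ...) is echoed as-is.
to_ssh_url() {
  local url="$1"
  # user@host:path, where user and host contain no '/', ':' or '@'
  local scp_re='^([^@/:]+)@([^:/]+):(.+)$'
  if [[ "$url" =~ $scp_re ]]; then
    # Strip a leading slash from the path so the result has a single '/'
    printf 'ssh://%s@%s/%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]#/}"
  else
    printf '%s\n' "$url"
  fi
}
```

The result is the value to put in `spec.url`; the Secret referenced by `spec.secretRef` is unaffected by the rewrite.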

OCIRepository

| Symptom | Cause | Fix |
|---|---|---|
| `FetchFailed: unauthorized` | Cloud registry auth not configured | Set `spec.provider` to `aws`, `gcp`, or `azure`; verify workload identity annotation on source-controller SA |
| Cosign verification failed | Signature missing or OIDC issuer/subject mismatch | Check `spec.verify.matchOIDCIdentity`; verify CI pushed a signature after the image push step |
| `FetchFailed: layer not found` | `layerSelector.mediaType` missing or wrong | Set `layerSelector.mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip` for Helm charts |
| OCI HelmRepository shows no status | OCI type HelmRepository does not report conditions | Migrate to OCIRepository + `spec.chartRef` on the HelmRelease |

HelmChart / HelmRepository

| Symptom | Cause | Fix |
|---|---|---|
| HelmChart not ready | Source HelmRepository not ready | Fix the HelmRepository first; HelmChart inherits its source's failure |
| `chart not found in repository` | Wrong chart name or version constraint | Verify `spec.chart` name and `spec.version` semver range against what the repo publishes |
| Version constraint yields no candidates | Semver range too narrow or chart not yet published | Widen the range or pin a specific version that exists |

Kustomization failures

| Symptom | Cause | Fix |
|---|---|---|
| `kustomize build failed` | Missing resource, invalid patch, or wrong path | Run `kustomize build ./path` locally to reproduce; check relative paths in `kustomization.yaml` |
| `BuildFailed: accumulating resources` | `resources:` reference points to a file that doesn't exist | Verify filenames and paths; check for case sensitivity issues |
| Variable not substituted (`${VAR}` left as literal) | `substituteFrom` ConfigMap/Secret missing `reconcile.fluxcd.io/watch: Enabled`, wrong key name, or `substituteFrom` on the wrong Kustomization | Add watch label; verify key names; ensure `substituteFrom` is on the Kustomization that owns the manifest, not a sibling |
| `health check timeout` | `dependsOn` resource not ready, or workload stuck in rollout | Inspect `dependsOn` chain; check the dependent resource's own status |
| Orphaned resources remain after file deletion | `spec.prune: false` | Set `spec.prune: true` unless orphan retention is intentional |
| `pruning disabled` alert | Same cause: `spec.prune: false` | Same fix: set `spec.prune: true` unless orphan retention is intentional |
| Immutable field conflict (`field is immutable`) | Trying to patch a field Kubernetes won't allow changing in place | Set `spec.force: true` temporarily to recreate the resource; remove it after the conflict is resolved |
| Variable substitution error: missing ConfigMap/Secret | `substituteFrom` references an object that doesn't exist | Create the missing ConfigMap/Secret in the Kustomization's own namespace |
| RBAC: `forbidden` on apply | Controller SA lacks permission to manage the target resource | Add a RoleBinding/ClusterRoleBinding for the kustomize-controller SA or the Kustomization's `serviceAccountName` |
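
For the "variable not substituted" row, it can help to list exactly which `${VAR}` references survived into the applied manifest before hunting for the missing key. The helper below is a rough sketch; `find_unsubstituted` is a hypothetical name, and it only does a textual scan, so a legitimate literal `${...}` in application config would also be reported.

```bash
#!/usr/bin/env bash
# find_unsubstituted
# Reads YAML (or any text) on stdin and prints the unique names of
# ${VAR}-style references still present, one per line.
# Also matches forms with defaults such as ${VAR:=value}.
find_unsubstituted() {
  grep -oE '\$\{[A-Za-z_][A-Za-z0-9_]*' | sed 's/^\${//' | sort -u
}

# Example usage (not executed here):
#   kubectl get deploy <name> -n <ns> -o yaml | find_unsubstituted
#   kustomize build ./path | find_unsubstituted
```

Each name printed should then be checked against the keys of the ConfigMap/Secret referenced in `substituteFrom`.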

HelmRelease failures

| Symptom | Cause | Fix |
|---|---|---|
| `rendered manifests contain a resource that already exists` / `cannot be imported into the current release: invalid ownership metadata` | Ownership conflict: a resource exists in the cluster that Helm did not create (or was created by a different release). | Layer: chart rendering. Evidence: (1) extract the conflicting resource name from the error message; it is often different from the HelmRelease name; (2) `kubectl get <kind> <resource-name> -o yaml \| grep "meta.helm.sh"` and read the `meta.helm.sh/release-name` and `meta.helm.sh/release-namespace` annotations to identify the owning release; (3) `helm list -A` to find the owning release across all namespaces; (4) `flux logs --kind=HelmRelease --name=<name> -n <ns>` for full error context. Root cause: Helm cannot adopt a resource it did not create. Fix, choose one: (a) most common: `helm uninstall <owning-release> -n <owning-ns>` then `flux reconcile helmrelease <name> -n <ns>`; (b) if the resource is unowned/orphaned: `kubectl delete <kind> <resource-name>` then reconcile; (c) to adopt without deleting: add `meta.helm.sh/release-name: <release>` and `meta.helm.sh/release-namespace: <ns>` annotations plus the `app.kubernetes.io/managed-by: Helm` label to the resource, then reconcile. Blast radius: `helm uninstall` deletes all Helm-managed resources including CRDs; for cert-manager this causes a brief certificate issuance lapse, so confirm ownership before choosing option (a). Validation: `flux get helmrelease <name> -n <ns>` shows `Ready=True`; `kubectl get pods -n <ns>` shows all pods Running. |
| `install retries exhausted` | Hook timeout, resource conflict, pre/post-install hook failed | Check helm-controller logs; validate values against chart schema; inspect hook job/pod logs |
| `upgrade retries exhausted` | Same root causes but on upgrade path | Suspend, manually `helm rollback`, fix values, then resume |
| Remediation exhausted | Max retries reached; Flux stops retrying | Switch to `install.strategy.name: RetryOnFailure`; suspend + manually fix the Helm release state |
| Legacy vs modern remediation mismatch | Both `install.remediation.retries` and `install.strategy.name` set | Use modern only: `install.strategy.name: RetryOnFailure` and `upgrade.strategy.name: RetryOnFailure` |
| `chart not found` | OCIRepository missing `layerSelector.mediaType` | Add `layerSelector.mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip` |
| `valuesFrom` key missing | ConfigMap/Secret key name doesn't match what HelmRelease expects | Verify `valuesFrom[].valuesKey`; check the actual key in the ConfigMap/Secret |
| `valuesFrom` ConfigMap not watched | ConfigMap updated but HelmRelease not re-reconciled | Add `reconcile.fluxcd.io/watch: Enabled` label to the ConfigMap |
| `spec.chart.spec` and `spec.chartRef` both set | Mutually exclusive fields | Remove `spec.chart.spec`; use only `spec.chartRef` for OCI sources |
| Drift detected but no correction | `spec.driftDetection.mode: warn` detects but doesn't fix | Change to `mode: enabled` to allow Flux to correct drift |
| HelmRelease stuck after `spec.force: true` | Force recreates but hooks fail on the recreated resource | Disable force after the immutable field conflict is resolved; check hook pod logs |
| Release in failed state, Flux not retrying | Remediation exhausted; Flux gives up | `flux suspend helmrelease <name> -n <ns>`; `helm rollback <name> -n <ns>`; fix the values; `flux resume helmrelease <name> -n <ns>` |
| Namespace created but never pruned | `targetNamespace` + `createNamespace: true` on HelmRelease; namespace created outside GitOps lifecycle | Remove `targetNamespace`/`createNamespace` from HelmRelease; create the namespace in the parent Kustomization or ResourceSet instead |
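
Option (c) in the ownership-conflict row (adopting an existing resource instead of deleting it) comes down to two `kubectl` calls. The wrapper below is a sketch under the assumption that the annotations and label named in that row are all the release needs; `adopt_into_release` is a hypothetical helper name, and it should only be run after confirming which release is meant to own the resource.

```bash
#!/usr/bin/env bash
# adopt_into_release KIND NAME NAMESPACE RELEASE RELEASE_NAMESPACE
# Stamps Helm ownership metadata onto an existing resource so the named
# release can adopt it on the next reconcile.
# For cluster-scoped kinds the -n flag is ignored by kubectl.
adopt_into_release() {
  local kind="$1" name="$2" ns="$3" release="$4" release_ns="$5"
  kubectl annotate "$kind" "$name" -n "$ns" --overwrite \
    "meta.helm.sh/release-name=$release" \
    "meta.helm.sh/release-namespace=$release_ns" &&
  kubectl label "$kind" "$name" -n "$ns" --overwrite \
    "app.kubernetes.io/managed-by=Helm"
}

# Example usage (not executed here):
#   adopt_into_release configmap <resource-name> <ns> <release> <ns>
#   flux reconcile helmrelease <name> -n <ns>
```

`--overwrite` is used because a resource owned by a different release already carries these keys; without it `kubectl` refuses to change existing values.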

ResourceSet failures

| Symptom | Cause | Fix |
|---|---|---|
| No resources generated | ResourceSetInputProvider not ready or returning empty inputs | Check provider status: `kubectl describe resourcesetinputprovider <name> -n flux-system` |
| Template rendering broken | Wrong delimiter: using `{{ }}` instead of `<< >>` | Change all template expressions to `<< inputs.field >>` format |
| Generated Kustomization/HelmRelease fails | Problem in the generated resource, not the ResourceSet itself | Drill into the generated resource with Workflow 2 or 3 |
| `dependsOn` not satisfied | ResourceSet depends on a Kustomization/HelmRelease that is not ready | Check the named dependency's status; `dependsOn` can reference any kind, not just Flux resources |
| `inputsFrom` provider returns stale data | Provider polling interval too long | Reduce `spec.interval` on the ResourceSetInputProvider |

Finding the managing object (label-based tracing)

Every resource created by Flux carries labels identifying its parent. Use these to trace the full ownership chain without guessing.

```bash
# Who manages this HelmRelease?
kubectl get helmrelease <name> -n <ns> -o jsonpath='{.metadata.labels}' | jq

# Key labels to look for:
#   kustomize.toolkit.fluxcd.io/name    → parent Kustomization
#   kustomize.toolkit.fluxcd.io/namespace
#   resourceset.fluxcd.io/name          → parent ResourceSet
#   resourceset.fluxcd.io/namespace

# Find all resources managed by a specific Kustomization.
# Note: `kubectl get all` only covers a subset of built-in kinds;
# ConfigMaps, Secrets, and custom resources must be queried by kind.
kubectl get all -A -l kustomize.toolkit.fluxcd.io/name=<kustomization-name>

# Find all resources generated by a ResourceSet:
kubectl get kustomization,helmrelease -A \
  -l resourceset.fluxcd.io/name=<resourceset-name>
```
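
When the trace has to be repeated up several levels, the label lookup can be wrapped in a small function. This is only a sketch: `flux_parent` is a hypothetical helper name, it reads just the two Kustomization labels listed above, and it relies on `kubectl`'s JSONPath syntax for escaping dots in label keys.

```bash
#!/usr/bin/env bash
# flux_parent KIND NAME NAMESPACE
# Prints the parent Kustomization of a resource based on its Flux labels.
# Returns 1 if the resource carries no kustomize.toolkit.fluxcd.io/name label.
flux_parent() {
  local kind="$1" name="$2" ns="$3"
  local ks_name ks_ns
  # Dots inside the label key are escaped so JSONPath treats it as one key
  ks_name=$(kubectl get "$kind" "$name" -n "$ns" \
    -o jsonpath='{.metadata.labels.kustomize\.toolkit\.fluxcd\.io/name}')
  ks_ns=$(kubectl get "$kind" "$name" -n "$ns" \
    -o jsonpath='{.metadata.labels.kustomize\.toolkit\.fluxcd\.io/namespace}')
  if [[ -n "$ks_name" ]]; then
    printf 'Kustomization %s/%s\n' "$ks_ns" "$ks_name"
  else
    printf 'no Kustomization parent label found\n'
    return 1
  fi
}

# Example usage (not executed here):
#   flux_parent helmrelease <name> <ns>
```

Calling it again on the Kustomization it prints walks one level further up the chain.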

General debugging checklist

Run in this order — each step narrows the failure layer:

  1. `flux get all -A`: anything not `Ready: True`?
  2. `kubectl get pods -n flux-system`: all controllers running?
  3. `kubectl get fluxinstance flux -n flux-system -o yaml`: FluxInstance healthy? (Flux Operator only)
  4. `kubectl get fluxreport flux -n flux-system -o yaml`: cluster-wide reconciliation summary
  5. `flux get sources all -A`: source fetching cleanly?
  6. Check `spec.dependsOn`: is a dependency blocking?
  7. Find the managing object via labels (see above)
  8. Check controller SA RBAC: `kubectl auth can-i --list --as=system:serviceaccount:flux-system:kustomize-controller`
  9. Check pod logs for the affected workload: `kubectl logs -n <ns> deploy/<name> --tail=100`
  10. Check node pressure: `kubectl describe nodes | grep -A5 Conditions`
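
The RBAC step in the checklist lists everything the controller may do; when a specific `forbidden` error is in hand, a targeted query is usually quicker. A minimal sketch, assuming the default `kustomize-controller` service account in `flux-system` unless overridden; `sa_can_i` is a hypothetical helper name wrapping `kubectl auth can-i`.

```bash
#!/usr/bin/env bash
# sa_can_i VERB RESOURCE NAMESPACE [SERVICE_ACCOUNT] [SA_NAMESPACE]
# Asks the API server whether a service account may perform VERB on
# RESOURCE in NAMESPACE. kubectl prints yes/no and exits non-zero on "no".
# Defaults assume the stock Flux install: kustomize-controller in flux-system.
sa_can_i() {
  local verb="$1" resource="$2" ns="$3"
  local sa="${4:-kustomize-controller}" sa_ns="${5:-flux-system}"
  kubectl auth can-i "$verb" "$resource" -n "$ns" \
    --as="system:serviceaccount:${sa_ns}:${sa}"
}

# Example usage (not executed here):
#   sa_can_i create deployments <ns>
#   sa_can_i patch configmaps <ns> <serviceAccountName> <ns>
```

The fourth and fifth arguments cover Kustomizations that set `serviceAccountName`, where the impersonated account is not the controller's own.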
