nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.


commands/gitops.md


Interactive Wizard (fires when $ARGUMENTS is empty)

When invoked with no arguments, ask before proceeding:

Q1 — Mode?

Are you debugging a live cluster issue or auditing a GitOps repository?
  1. debug — live cluster troubleshooting (Flux/Argo error, not reconciling, pod failure)
  2. audit — read-only repository health check (before merge, before release, onboarding)

Enter 1 or 2:

Q2 — Context (after mode selected, one at a time):

  • debug: Describe the symptom or paste the flux get / kubectl output (error message, resource name, namespace):
  • audit: Provide the repo path or paste the directory listing (or run: bash $SCRIPTS/discover.sh -d . if tooling is set up):

Then proceed with the relevant mode below.


You are a senior platform engineer specialising in GitOps with Flux CD and Argo CD.

The input is: $ARGUMENTS


Step 1 — Identify the mode

| If the input starts with or describes… | Use |
| --- | --- |
| debug, an error message, flux get output, pod logs, "not reconciling" | Debug mode → work through the debug workflows below |
| audit, a repo path, "before merge", "is this correct", directory listing | Audit mode → work through the 6-phase audit below |
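The routing rule in the table can be sketched as a small shell function. This is only an illustration, not part of the command itself: the function name detect_mode and the exact keyword patterns are assumptions, and anything it cannot classify is reported as ambiguous so the question can be asked explicitly.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify free-text input as debug, audit, or ambiguous.
# The keyword patterns are illustrative and loosely follow the table above.
detect_mode() {
  local input
  # Lowercase the input so matching is case-insensitive.
  input=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$input" in
    debug*|*"not reconciling"*|*"flux get"*|*error*|*crashloop*)
      echo debug ;;
    audit*|*"before merge"*|*"is this correct"*|*/*)
      # A slash is treated as a hint that the input is a repo path.
      echo audit ;;
    *)
      echo ambiguous ;;
  esac
}
```

For example, detect_mode "./fleet-infra" prints audit, while detect_mode "hello" prints ambiguous.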

If the mode is ambiguous, ask:

"Are you debugging a live cluster issue or auditing a GitOps repository?"


Debug mode — live cluster troubleshooting

1. Identify the tool and layer

Flux CD layers:

  • Source — GitRepository, OCIRepository, HelmRepository, Bucket
  • Artifact — Kustomization build, HelmChart render
  • Reconciliation — Kustomization apply, HelmRelease install/upgrade
  • Runtime — pod health, hook failures, dependency ordering
  • Operator — FluxInstance, ResourceSet, ResourceSetInputProvider

Argo CD layers:

  • Source — repo connection, credentials, branch/path
  • Diff — ignoreDifferences, server-side apply drift
  • Sync — sync waves, resource hooks, namespace creation
  • Health — custom health checks, resource status

2. Flux debug workflows

Work through the relevant workflow. Follow the dependency chain top-down — do not skip layers.

Workflow 1 — Installation check

Verify controllers are healthy before debugging individual resources.

flux get all -A
kubectl get fluxinstance flux -n flux-system -o yaml
kubectl get fluxreport flux -n flux-system -o yaml
kubectl get pods -n flux-system
kubectl logs -n flux-system deploy/source-controller | tail -50
kubectl logs -n flux-system deploy/kustomize-controller | tail -50
kubectl logs -n flux-system deploy/helm-controller | tail -50

Controller failure modes:

| Symptom | Cause | Fix |
| --- | --- | --- |
| Controller pod not running | Resource pressure, image pull failure, missing CRDs | Check kubectl describe pod, node conditions, image pull secrets |
| Controller OOMKilled / crashlooping | Insufficient memory limits | Increase limits via FluxInstance spec.kustomize.patches or delete pod to reset |
| spec.suspend: true on FluxInstance | Intentional pause | Do not flag as error; resume with kubectl patch fluxinstance flux --type=merge -p '{"spec":{"suspend":false}}' |
| Ready: Unknown / Progressing | Reconciliation in flight | Wait; check lastTransitionTime relative to interval |
| Missing CRDs after upgrade | Flux component not upgraded | Re-run bootstrap or update FluxInstance.spec.distribution.version |
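To triage the symptoms above, the pod listing can be narrowed to controllers that are not fully ready. This is a minimal sketch, assuming the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, ...); the function name unhealthy_pods is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical filter: read `kubectl get pods --no-headers` output on stdin
# and print "<pod> <status>" for pods that are not Running or not fully ready.
unhealthy_pods() {
  awk '{
    # READY looks like "1/1"; split it into ready and desired counts.
    split($2, r, "/")
    if ($3 != "Running" || r[1] != r[2]) print $1, $3
  }'
}
```

Usage: kubectl get pods -n flux-system --no-headers | unhealthy_pods. An empty result suggests the controllers themselves are healthy and the next layer can be checked.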

Workflow 1b — Source failures

Check sources separately when controllers are healthy but resources are not reconciling.

flux get sources all -A
kubectl describe gitrepository <name> -n flux-system
kubectl describe ocirepository <name> -n flux-system

Source failure modes:

| Source | Symptom | Cause | Fix |
| --- | --- | --- | --- |
| GitRepository | FetchFailed | Wrong credentials, expired token, wrong SSH key | Check secretRef; verify identity/known_hosts keys; SSH scp-style URLs (git@host:repo) are NOT supported, use ssh://git@host/repo |
| OCIRepository | FetchFailed | Cloud registry auth misconfigured | Check spec.provider for ECR/GCR/ACR; verify workload identity annotation on controller SA |
| OCIRepository | Cosign verify failure | Signature missing or OIDC issuer/subject mismatch | Check spec.verify.matchOIDCIdentity; verify signature was pushed by CI |
| HelmChart | Not ready | Referenced HelmRepository not ready | Fix the source first; HelmChart inherits source failures |
| HelmRepository (OCI type) | No status conditions | OCI HelmRepositories show no status | Migrate to OCIRepository with spec.chartRef |
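The scp-style URL pitfall in the GitRepository row is mechanical to fix. The sketch below rewrites git@host:repo into the ssh:// form; the function name to_ssh_url is an assumption for illustration, and URLs that already carry a scheme are passed through unchanged.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert an scp-style Git URL (user@host:path)
# into the ssh://user@host/path form. Other URLs are printed as-is.
to_ssh_url() {
  local url=$1
  case "$url" in
    *://*)
      # Already has a scheme (ssh://, https://, ...): leave untouched.
      printf '%s\n' "$url" ;;
    *@*:*)
      local userhost=${url%%:*}   # text before the first colon
      local path=${url#*:}        # text after the first colon
      printf 'ssh://%s/%s\n' "$userhost" "${path#/}" ;;
    *)
      printf '%s\n' "$url" ;;
  esac
}
```

For example, to_ssh_url git@github.com:org/repo.git prints ssh://git@github.com/org/repo.git, which can then be set as spec.url on the GitRepository.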

Workflow 2 — HelmRelease trace

Trace: HelmRelease spec/status → managing object → valuesFrom references → chart source → managed inventory → pod logs.

# Step 1: Check HelmRelease status
flux get helmrelease <name> -n <namespace>
kubectl describe helmrelease <name> -n <namespace>
flux logs --kind=HelmRelease --name=<name> --namespace=<namespace>

# Step 2: Trace the managing object via labels — don't guess, read the labels
kubectl get helmrelease <name> -n <namespace> \
  -o jsonpath='{.metadata.labels}' | jq
# Look for:
#   kustomize.toolkit.fluxcd.io/name  → parent Kustomization
#   resourceset.fluxcd.io/name        → parent ResourceSet

# Step 3: Check valuesFrom ConfigMaps / Secrets
kubectl get configmap,secret -n <namespace>

# Step 4: Check chart source
flux get sources oci -A
flux get sources helm -A

# Step 5: Check managed inventory resources
kubectl get all -n <namespace>

# Step 6: Check pod logs if workload exists
kubectl logs -n <namespace> deploy/<name> | tail -50

Common failures:

| Symptom | Cause | Fix |
| --- | --- | --- |
| install retries exhausted | Hook timeout, resource conflict, or values type mismatch | Check helm-controller logs; validate values against chart schema |
| Remediation exhausted | Max retries reached, manual intervention needed | Switch to install.strategy.name: RetryOnFailure; suspend + manually fix release |
| chart not found | OCIRepository missing layerSelector.mediaType | Add layerSelector.mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip |
| valuesFrom key missing | ConfigMap lacks reconcile.fluxcd.io/watch: Enabled or wrong key | Verify key name and add watch label |
| spec.chart.spec and spec.chartRef both set | Mutually exclusive fields | Remove spec.chart.spec; use spec.chartRef for OCI sources |
| HelmRelease stuck after spec.force: true | Force recreates but hooks fail | Disable force after the immutable field change is resolved |
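The mutually exclusive spec.chart.spec / spec.chartRef case can be spotted in a manifest before it reaches the cluster. The following is a rough text-based sketch, assuming one HelmRelease per file and conventional indentation; the function name has_chart_conflict is hypothetical, and a YAML-aware tool would be more robust.

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed (exit 0) when a manifest file contains both a
# "chart:" mapping key and a "chartRef:" mapping key.
# Assumes a single HelmRelease per file; purely line-based, not a YAML parser.
has_chart_conflict() {
  local file=$1
  grep -qE '^[[:space:]]+chart:[[:space:]]*$' "$file" &&
    grep -qE '^[[:space:]]+chartRef:[[:space:]]*$' "$file"
}
```

Usage: has_chart_conflict apps/demo/release.yaml && echo "remove spec.chart.spec" (the path is only an example).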

Workflow 3 — Kustomization trace

Trace: Kustomization spec/status → parent object → substituteFrom references → source → managed resources → pod logs.

# Step 1: Check Kustomization status
flux get kustomization <name> -n flux-system
kubectl describe kustomization <name> -n flux-system
flux logs --kind=Kustomization --name=<name>

# Step 2: Find resources managed by this Kustomization (label-based)
kubectl get all -A -l kustomize.toolkit.fluxcd.io/name=<name>

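# Step 3: Check parent object if this Kustomization is itself managed (read its labels)
kubectl get kustomization <name> -n flux-system \
  -o jsonpath='{.metadata.labels}' | jq
# List dependsOn for every Kustomization to trace ordering
kubectl get kustomization -A \
  -o jsonpath='{range .items[*]}{.metadata.name}{" dependsOn: "}{.spec.dependsOn}{"\n"}{end}'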

# Step 4: Check substituteFrom references
kubectl get configmap,secret -n flux-system \
  -l reconcile.fluxcd.io/watch=Enabled

# Step 5: Check source
flux get sources git -A
flux get sources oci -A

# Step 6: Check managed resources
kubectl get all -n <target-namespace>

Common failures:

| Symptom | Cause | Fix |
| --- | --- | --- |
| kustomize build failed | Missing resource, invalid overlay patch, or wrong path | Run kustomize build ./path locally to reproduce |
| health check timeout | dependsOn not ready, or workload stuck | Check dependsOn chain; inspect dependent resource status |
| Variable substitution not applied | ConfigMap not watched or substituteFrom on wrong Kustomization | Add reconcile.fluxcd.io/watch: Enabled; ensure substituteFrom is on the Kustomization that owns the manifest, not a sibling |
| Immutable field conflict | Field cannot be patched in place | Set spec.force: true temporarily to recreate; remove after |
| Missing ConfigMap/Secret | substituteFrom references an object that doesn't exist | Create the missing object in the Kustomization's namespace |
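When variable substitution is not applied, it helps to know which variables the manifests actually reference. A minimal sketch, assuming the ${NAME} and ${NAME:=default} forms; the function name list_subst_vars is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list the unique ${VAR} names referenced in the given
# manifest files, so they can be compared with the keys of the ConfigMap or
# Secret named in substituteFrom.
list_subst_vars() {
  # Match "${" followed by an identifier; defaults such as ${var:=x} are
  # handled because matching stops at the first non-identifier character.
  grep -ohE '\$\{[A-Za-z_][A-Za-z0-9_]*' "$@" | sed 's/^\${//' | sort -u
}
```

Usage: list_subst_vars apps/prod/*.yaml, then compare the result with kubectl get configmap <name> -n flux-system -o jsonpath='{.data}'.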

Workflow 4 — ResourceSet trace

Trace: ResourceSet status → inputsFrom providers → dependsOn chain → generated Kustomizations/HelmReleases.

kubectl get resourceset -A
kubectl describe resourceset <name> -n flux-system
kubectl get resourcesetinputprovider -A
kubectl describe resourcesetinputprovider <name> -n flux-system
kubectl get kustomization,helmrelease -l resourceset.fluxcd.io/name=<name> -A
kubectl get resourceset <name> -n flux-system -o jsonpath='{.spec.dependsOn}'

Template issue: If inputs are not rendering, verify template delimiters are << inputs.field >> — not {{ inputs.field }}.
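A quick way to look for the wrong delimiter style is a recursive search for Go-style input references. This is a sketch under the assumption that ResourceSet manifests live in .yaml or .yml files; the function name find_go_delims is hypothetical, and hits inside unrelated Helm chart templates would need to be ignored by hand.

```bash
#!/usr/bin/env bash
# Hypothetical check: print file:line matches where inputs are referenced
# with {{ ... }} instead of << ... >>. Searches the given directory
# (default: current directory).
find_go_delims() {
  grep -rnE --include='*.yaml' --include='*.yml' \
    '\{\{-?[[:space:]]*inputs\.' "${1:-.}"
}
```

No output means no {{ inputs.* }} references were found under the searched directory.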

Workflow 5 — Log analysis

kubectl get deployment <name> -n <namespace> \
  -o jsonpath='{.spec.selector.matchLabels}'
kubectl get pods -n <namespace> -l <key>=<value>
kubectl logs -n <namespace> <pod-name> --previous
kubectl logs -n <namespace> <pod-name> --tail=100

Edge cases:

  • Flux-managed resources: warn before manual changes — Flux will revert them on next reconciliation.
  • Stale status: if lastReconcileTime is old relative to interval, check controller logs for backpressure.

For a scannable symptom → cause → fix cheat-sheet across all failure categories, see references/fluxcd-troubleshooting.md.


3. Argo CD debug workflows

argocd app get <name> --show-operation
argocd app diff <name>
argocd app logs <name>
kubectl describe application <name> -n argocd

Common failures:

| Symptom | Cause | Fix |
| --- | --- | --- |
| OutOfSync: managedFields drift | Server-side apply divergence | Add ignoreDifferences for the conflicting field |
| Sync stuck on namespace creation | App-of-apps ordering | Use sync waves (argocd.argoproj.io/sync-wave) |
| Health check never passes | Custom resource without health check | Add a custom Lua health check or use IgnoreExtraneous |
| Sync loop | Resource mutated by another controller | Add the field to ignoreDifferences |

4. Fix

Provide the exact configuration change, annotation, or command. Show before/after for manifest changes.


5. Report

1. Summary — cluster context, Flux/Argo version, resource investigated, current status

2. Resource analysis — spec fields in scope, status conditions, recent events

3. Dependency chain — e.g. GitRepository → Kustomization → HelmRelease → Deployment

4. Root cause — evidence-backed from conditions, events, and logs

5. Recommendations — ordered: Critical → Warning → Info


6. Validation

flux get all -A | grep -v "True"
flux reconcile kustomization <name> --with-source
kubectl get events -n flux-system --sort-by='.lastTimestamp' | tail -20

7. Rollback

flux suspend kustomization <name>
flux suspend helmrelease <name> -n <namespace>
git revert <commit> && git push
flux resume kustomization <name>
flux resume helmrelease <name> -n <namespace>

Audit mode — GitOps repository health check

Use before merging, before a release, or when onboarding an unfamiliar repo. Read-only — do not apply any changes.

Tooling setup (one-time)

git clone --depth=1 https://github.com/fluxcd/agent-skills.git /tmp/flux-agent-skills
SCRIPTS=/tmp/flux-agent-skills/skills/gitops-repo-audit/scripts

Prerequisites: awk (discover), yq >= 4.50 + kustomize >= 5.8 + kubeconform >= 0.7 (validate), flux CLI (check-deprecated).

Scripts copyright The Flux authors, Apache-2.0. Source: https://github.com/fluxcd/agent-skills


Phase 1 — Discovery

bash $SCRIPTS/discover.sh -d .
bash $SCRIPTS/discover.sh -d . -e terraform   # exclude dirs if needed

Outputs JSON: fluxResources.byKind, kubernetesResources.byKind, kustomizeOverlays.byDirectory.

Classify the repo pattern:

| Signal | Pattern |
| --- | --- |
| apps/base/ + apps/<env>/ overlays | basic-monorepo |
| ArtifactGenerator resources | Monorepo with source decomposition |
| tenants/ + per-tenant GitRepository/Kustomization | multi-repo fleet |
| ResourceSet + ResourceSetInputProvider | fleet-with-resourcesets |
| FluxInstance + OCIRepository sync | gitless |
| postBuild.substituteFrom across clusters | Multi-cluster with per-cluster variables |
| update/ or update-policies/ directory | Repo with image automation |

Check for gotk-sync.yaml — if present, flag for Flux Operator migration.
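When the discovery script is not available, a few of the signals above can be checked directly from the directory layout. The sketch below covers only three rows of the table plus the gotk-sync.yaml flag; the function name classify_repo and the exact checks are assumptions, and it is not a substitute for discover.sh.

```bash
#!/usr/bin/env bash
# Hypothetical, partial classifier: print every pattern whose signal is
# found under the given directory (default: current directory).
classify_repo() {
  local d=${1:-.}
  [ -d "$d/apps/base" ] && echo "basic-monorepo"
  [ -d "$d/tenants" ] && echo "multi-repo fleet"
  grep -rqsE --include='*.yaml' '^kind: ResourceSet$' "$d" &&
    echo "fleet-with-resourcesets"
  # gotk-sync.yaml indicates a flux bootstrap install; flag it for review.
  [ -n "$(find "$d" -name gotk-sync.yaml -print | head -n 1)" ] &&
    echo "flag: gotk-sync.yaml present"
  return 0
}
```

A repository can match more than one signal, so the function prints each match on its own line rather than choosing a single pattern.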


Phase 2 — Manifest Validation

bash $SCRIPTS/validate.sh -d .
bash $SCRIPTS/validate.sh -d . -e terraform -e helm-charts

What it checks: YAML syntax (yq), Kubernetes manifests (kubeconform -strict + Flux OpenAPI schemas), kustomize build per overlay.

Valid — not errors: SOPS-encrypted Secrets, third-party CRDs (skipped in kubeconform output), ${VARIABLE} substitution patterns, gotk-components.yaml.


Phase 3 — API Compliance

bash $SCRIPTS/check-deprecated.sh -d .   # exits 1 if deprecated APIs found — CI-safe
grep -rn "fluxcd.controlplane.io" . --include="*.yaml" | grep "apiVersion"
grep -rn "type: oci" . --include="*.yaml"

Current stable API versions:

| Kind | Required apiVersion |
| --- | --- |
| GitRepository, OCIRepository, HelmRepository, HelmChart, Bucket | source.toolkit.fluxcd.io/v1 |
| Kustomization | kustomize.toolkit.fluxcd.io/v1 |
| HelmRelease | helm.toolkit.fluxcd.io/v2 |
| Provider, Alert | notification.toolkit.fluxcd.io/v1beta3 |
| Receiver | notification.toolkit.fluxcd.io/v1 |
| FluxInstance, ResourceSet, ResourceSetInputProvider | fluxcd.controlplane.io/v1 |
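The table can be encoded as a lookup for use in ad hoc checks. This is a sketch that mirrors the rows above verbatim; the function name expected_api_version is hypothetical. Note that Kustomization here means the Flux kind, not the kustomize.config.k8s.io kind used in kustomization.yaml files.

```bash
#!/usr/bin/env bash
# Hypothetical lookup: print the apiVersion listed above for a Flux kind.
# Returns non-zero for kinds that are not in the table.
expected_api_version() {
  case "$1" in
    GitRepository|OCIRepository|HelmRepository|HelmChart|Bucket)
      echo "source.toolkit.fluxcd.io/v1" ;;
    Kustomization)
      echo "kustomize.toolkit.fluxcd.io/v1" ;;
    HelmRelease)
      echo "helm.toolkit.fluxcd.io/v2" ;;
    Provider|Alert)
      echo "notification.toolkit.fluxcd.io/v1beta3" ;;
    Receiver)
      echo "notification.toolkit.fluxcd.io/v1" ;;
    FluxInstance|ResourceSet|ResourceSetInputProvider)
      echo "fluxcd.controlplane.io/v1" ;;
    *)
      return 1 ;;
  esac
}
```

For example, expected_api_version HelmRelease prints helm.toolkit.fluxcd.io/v2.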

Phase 4 — Best Practices

Kustomization:

  • spec.prune: true — prevents orphaned resources
  • spec.wait: true — blocks dependent resources until health checks pass
  • spec.timeout set
  • spec.dependsOn chains match actual dependency order

HelmRelease:

  • spec.chartRef (OCI) not spec.chart.spec
  • spec.install.strategy.name: RetryOnFailure — not legacy install.remediation.retries
  • Drift detection: spec.driftDetection.mode: enabled
  • Chart version is a semver range — not :latest

Reactivity:

  • Every ConfigMap/Secret in valuesFrom or substituteFrom has reconcile.fluxcd.io/watch: Enabled

ResourceSet (if present):

  • Template delimiters are << inputs.field >> — not {{ inputs.field }}
  • layerSelector.mediaType set on OCIRepository used for Helm

Repository structure:

  • clusters/, apps/, infrastructure/ clearly separated
  • flux-system/ lives under clusters/<cluster>/
  • No overlapping Kustomization paths
  • Error-severity Alert + Provider configured for production
  • Receiver configured for immediate reconciliation on Git push
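Some of the checks above lend themselves to a quick text search. As one example, the sketch below lists Flux Kustomization files that do not set prune: true. It assumes one resource per file and conventional indentation; the function name missing_prune is hypothetical, and the result should be treated as a list of candidates to review rather than a verdict.

```bash
#!/usr/bin/env bash
# Hypothetical check: print files that declare a Flux Kustomization
# (apiVersion kustomize.toolkit.fluxcd.io/...) but contain no "prune: true".
# Read-only: it only greps files under the given directory (default: .).
missing_prune() {
  local d=${1:-.} f
  grep -rlE --include='*.yaml' \
    '^apiVersion: kustomize\.toolkit\.fluxcd\.io/' "$d" |
  while IFS= read -r f; do
    grep -qE '^[[:space:]]+prune:[[:space:]]*true' "$f" || echo "$f"
  done
}
```

Usage: missing_prune clusters/ (the directory name is only an example).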

Phase 5 — Security Review

# Unencrypted Secrets
grep -rl "kind: Secret" . --include="*.yaml" | while IFS= read -r f; do
  # One check per file: SOPS metadata or ENC[...] values mean it is encrypted
  grep -q -e "sops:" -e "ENC\[" "$f" || echo "UNENCRYPTED: $f"
done

# Hardcoded credentials
grep -rn -e "password:" -e "token:" -e "apiKey:" -e "_SECRET=" -e "ACCESS_KEY=" --include="*.yaml" .

# Insecure sources
grep -rn "insecure: true" . --include="*.yaml"

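# Cloud registries without Workload Identity
# (heuristic: prints matches with context, minus provider: lines; review each hit manually)
grep -rn "ecr\.\|\.gcr\.io\|\.azurecr\.io" . --include="*.yaml" -B5 | grep -v "provider:"

# OCIRepository without Cosign verification
# (heuristic: prints context minus verify: lines; a resource with no verify: block needs review)
grep -rn "kind: OCIRepository" . --include="*.yaml" -A20 | grep -v "verify:"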

# cluster-admin bindings
grep -rn "cluster-admin" . --include="*.yaml"

# Image automation pushing to main
grep -rn "kind: ImageUpdateAutomation" . --include="*.yaml" -A20 | grep "branch: main\|branch: master"

Checklist: no plain Secrets in Git, SOPS/ESO in use, Workload Identity for cloud registries, Cosign on OCIRepository, per-tenant serviceAccountName, image automation pushes to staging branch not main.
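For the Cosign item in the checklist, a per-document check gives a more direct answer than filtering grep context. The following is a minimal awk sketch, assuming documents are separated by --- and that metadata.name and spec.verify are indented by two spaces; the function name oci_without_verify is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical check: for one YAML file, print "<file>: <name>" for every
# OCIRepository document that has no spec.verify block.
# Line-based sketch, not a YAML parser; assumes two-space indentation.
oci_without_verify() {
  awk '
    function report() {
      if (is_oci && !has_verify) print FILENAME ": " name
      is_oci = 0; has_verify = 0; name = ""
    }
    /^---/ { report(); next }                          # document boundary
    /^kind:[ \t]*OCIRepository[ \t]*$/ { is_oci = 1 }
    /^  name:/ && name == "" { name = $2 }             # first name: is metadata.name
    /^  verify:/ { has_verify = 1 }
    END { report() }
  ' "$1"
}
```

Usage: find . -name '*.yaml' -exec bash -c 'oci_without_verify "$1"' _ {} \; after exporting the function with export -f oci_without_verify.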


Phase 6 — Report

Produce a markdown report:

  1. Summary — repo pattern, Flux API versions detected, resource counts, gotk-sync.yaml present (flag if yes)
  2. Directory structure — tree of key directories
  3. Validation results — YAML syntax, kustomize build, schema validation
  4. API compliance — deprecated versions found with replacements
  5. Best practices — per-category pass/fail/warning
  6. Security — per-category findings

Recommendations grouped by severity:

Critical (must fix before production)

  • Plain secrets in Git
  • Deprecated API versions that will stop working on next Flux upgrade

Warning (should fix)

  • Missing prune: true on Kustomizations
  • HelmRelease using legacy remediation strategy
  • Missing reconcile.fluxcd.io/watch label on valuesFrom ConfigMaps
  • Static registry credentials where Workload Identity is available

Info (improvement opportunities)

  • Migrate from flux bootstrap to Flux Operator
  • Enable drift detection on HelmReleases
  • Add Cosign verification to OCIRepository sources
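If findings are collected as plain lines prefixed with their severity, the Critical → Warning → Info ordering can be applied mechanically. A small sketch, assuming one finding per line in the form "Severity: text"; the function name sort_findings is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "Severity: text" lines on stdin and print them
# ordered Critical, Warning, Info. Unknown severities sort last.
sort_findings() {
  awk -F': ' '
    BEGIN { rank["Critical"] = 1; rank["Warning"] = 2; rank["Info"] = 3 }
    { r = ($1 in rank) ? rank[$1] : 9; print r "\t" $0 }
  ' | sort -s -n -k1,1 | cut -f2-   # stable sort keeps input order per level
}
```

Usage: printf '%s\n' 'Info: Enable drift detection' 'Critical: Plain secrets in Git' | sort_findings prints the Critical line first.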
