Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Use when troubleshooting, implementing, reviewing, or auditing platform infrastructure as a system — where Kubernetes, GitOps, CI/CD, and security concerns intersect. Apply these patterns when generating or reviewing code across Kubernetes, Flux CD, Argo CD, Terraform, GitHub Actions (composite actions, OIDC, SHA pinning), AWS, Azure, GKE, Linkerd, KEDA, supply chain security (Cosign, SBOM, SLSA), Falco, Chaos Engineering, DORA metrics, Datadog/Dynatrace/LLM observability, SOC 2, and PR review. Every answer includes blast radius, validation steps, and rollback plan.
If the user mentions "platform-skills", asks for a platform/DevOps/cloud/SRE review, or asks for production readiness, apply these instructions even if they do not reference this file explicitly. In Copilot Chat, slash commands such as /platform-skills:review are not required; translate them into the equivalent natural-language workflow.
When generating infrastructure code, respect these boundaries:
| Layer | Owns | Does NOT Own |
|---|---|---|
| Terraform | Cloud primitives, cluster bootstrap, IAM, networking, secrets backends | In-cluster workloads, Helm releases |
| Flux / Argo CD | In-cluster state, Helm releases, workload promotion | Cloud resources, IAM roles |
| GitHub Actions | CI checks, plan gates, artifact publish, promotion triggers | Long-lived environment state |
| Kubernetes | Workload specs, RBAC, network policy, resource limits | Cloud account structure |
Always generate workloads with:
```yaml
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    memory: "256Mi"  # Always set a memory limit; omit the CPU limit, which causes throttling
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
securityContext:
  runAsNonRoot: true
  runAsUser: 1000  # Omit on OpenShift; the SCC assigns the UID
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
```

For Flux HelmRelease resources, pin the chart version and configure remediation:

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
spec:
  interval: 10m
  chart:
    spec:
      version: "1.2.3"  # Never use "*" or ranges
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
      remediateLastFailure: true
```

Troubleshooting order: source → artifact → reconciliation → chart rendering → runtime
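The version-pinning rule in the HelmRelease above can be checked mechanically before a manifest is committed. The following is a minimal sketch, assuming the version string has already been read from `spec.chart.spec.version`. The helper name `is_pinned_chart_version` is hypothetical, and the pattern is a simplification that accepts only `MAJOR.MINOR.PATCH` with an optional pre-release or build suffix.

```python
import re

# Exact semantic version: MAJOR.MINOR.PATCH with optional pre-release/build suffix.
# Anything else ("*", "1.x", ">=1.2.0", "~1.2", "^1.2.3") is treated as a range.
_EXACT_SEMVER = re.compile(
    r"^v?\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def is_pinned_chart_version(version: str) -> bool:
    """Return True only when the chart version names a single exact release."""
    return bool(_EXACT_SEMVER.match(version.strip()))


if __name__ == "__main__":
    for candidate in ["1.2.3", "*", ">=1.2.0", "1.2.x", "~1.2.3", "1.2.3-rc.1"]:
        print(f"{candidate!r}: pinned={is_pinned_chart_version(candidate)}")
```

A check like this fits in a pre-commit hook or CI step. It does not replace `flux` or `helm` validation of the rendered release.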
Prefer the current Flux CD API shape:
- `chartRef` for OCIRepository-backed charts when available
- When troubleshooting, inspect GitRepository/OCIRepository, artifact status, Kustomization, HelmRelease, and controller logs, in that order

For Argo CD Applications, set an explicit project and sync policy:

```yaml
spec:
  project: platform  # Never leave as "default"
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - PruneLast=true
```

For IAM policies, scope actions and resources explicitly:

```json
// ❌ Never generate
{ "Action": "s3:*", "Resource": "*" }
// ✅ Always generate
{
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]
}
```

Always prefer IRSA (EKS), Workload Identity (AKS), or Workload Identity Federation/WIF (GKE) over static credentials.
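The never/always pair for IAM statements above can be turned into a small review check. The following is a minimal sketch, assuming the policy statement is already parsed into a Python dict. The function name `wildcard_findings` and the choice to flag every `service:*` action are illustrative assumptions, not AWS tooling.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts either a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def wildcard_findings(statement: dict) -> list[str]:
    """Return human-readable findings for one IAM policy statement."""
    findings = []
    for action in _as_list(statement.get("Action", [])):
        # Flags "*" and service-wide grants such as "s3:*".
        if action == "*" or action.endswith(":*"):
            findings.append(f"wildcard action: {action}")
    for resource in _as_list(statement.get("Resource", [])):
        # Only a bare "*" is flagged; "arn:...:my-bucket/*" is a scoped object path.
        if resource == "*":
            findings.append(f"wildcard resource: {resource}")
    return findings


bad = {"Action": "s3:*", "Resource": "*"}
good = {
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
}
print(wildcard_findings(bad))
print(wildcard_findings(good))
```

The first call reports both the action and the resource wildcard. The second returns an empty list.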
Module structure: main.tf, variables.tf (with validation blocks), outputs.tf, versions.tf, README.md.
Always include variable validation:
```hcl
variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}
```

Pipeline order: `terraform fmt -check` → `terraform validate` → `tflint` → `checkov`/`tfsec` → `plan`.
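The point of a fixed pipeline order is that cheap checks fail first and `plan` never runs on code that did not pass them. The following is a minimal sketch of that fail-fast behavior. The helper `run_pipeline`, the injectable `runner`, and the exact flags (`-d .`, `-input=false`) are assumptions for illustration, and `checkov` stands in for the `checkov`/`tfsec` step.

```python
import subprocess
from typing import Callable, Sequence

# Steps in the order given above; flags beyond "fmt -check" are illustrative.
TERRAFORM_PIPELINE: list[list[str]] = [
    ["terraform", "fmt", "-check"],
    ["terraform", "validate"],
    ["tflint"],
    ["checkov", "-d", "."],
    ["terraform", "plan", "-input=false"],
]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = _run,
) -> list[str]:
    """Run steps in order and stop at the first non-zero exit code.

    Returns the commands that were attempted, so the caller can see how far
    the pipeline got.
    """
    attempted = []
    for cmd in steps:
        attempted.append(" ".join(cmd))
        if runner(cmd) != 0:
            break
    return attempted
```

Passing a fake `runner` makes the ordering testable without Terraform installed. In CI the same ordering is usually expressed as sequential workflow steps rather than a script.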
```yaml
# ❌ Never
- uses: actions/checkout@v4
# ✅ Always: pin to full SHA
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
```

```yaml
permissions:
  contents: read
  id-token: write  # only if OIDC is required
```

Never use `pull_request_target` with code checkout from forks.
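SHA pinning is easy to verify with a line scan over workflow files. The following is a minimal sketch that uses a regular expression rather than a YAML parser, so it assumes each `uses:` reference sits on its own line. The function name `unpinned_actions` is hypothetical.

```python
import re

# Matches "uses: owner/repo[/path]@ref" in a workflow step, with or without "- ".
_USES = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)")
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` reference not pinned to a 40-character commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = _USES.match(line)
        if not match:
            continue
        action, ref = match.groups()
        # docker:// references are pinned by digest, not by git SHA; skip them here.
        if action.startswith("docker://"):
            continue
        if not _FULL_SHA.match(ref):
            findings.append(f"{action}@{ref}")
    return findings


workflow = """
steps:
  - uses: actions/checkout@v4
  - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
"""
print(unpinned_actions(workflow))
```

Only the tag-based reference is reported. Local actions such as `./.github/actions/foo` carry no `@ref` and are ignored by the pattern.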
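Validation pipeline order: `helm lint --strict` → `helm template --debug` → `kubeconform -strict -summary` → `checkov` → `helm test`.
Never pass secrets with `helm upgrade --set`: the values end up in shell history and in the stored release. `selectorLabels` must NOT include `app.kubernetes.io/version`: a Deployment's selector is immutable after creation, so a version label there breaks the next upgrade.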
For event-driven autoscaling, require explicit minReplicaCount, maxReplicaCount, cooldown, fallback behavior where supported, and trigger authentication ownership. For AWS triggers on EKS, prefer IRSA over static keys. For Prometheus triggers, validate query cardinality and failure behavior before recommending scale-to-zero.
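The requirements above translate into a short review checklist for a ScaledObject. The following is a minimal sketch, assuming the manifest has already been parsed into a dict, for example from YAML. The function name `scaledobject_findings` is hypothetical, and the findings are prompts for a reviewer rather than hard failures.

```python
def scaledobject_findings(manifest: dict) -> list[str]:
    """Return review findings for a KEDA ScaledObject given as a parsed manifest."""
    spec = manifest.get("spec", {})
    findings = []

    # Explicit bounds and cooldown: relying on defaults hides scale-to-zero decisions.
    for field in ("minReplicaCount", "maxReplicaCount", "cooldownPeriod"):
        if field not in spec:
            findings.append(f"spec.{field} is not set explicitly")

    if "fallback" not in spec:
        findings.append("spec.fallback is not set; confirm whether the scaler supports it")

    for index, trigger in enumerate(spec.get("triggers", [])):
        kind = trigger.get("type", "unknown")
        if "authenticationRef" not in trigger:
            findings.append(
                f"spec.triggers[{index}] ({kind}) has no authenticationRef; confirm how it authenticates"
            )
        # Scale-to-zero on Prometheus depends on how the query behaves when it fails.
        if kind == "prometheus" and spec.get("minReplicaCount") == 0:
            findings.append(
                f"spec.triggers[{index}] scales to zero on a Prometheus query; "
                "validate cardinality and failure behavior"
            )
    return findings
```

A manifest with bounds, cooldown, fallback, and an `authenticationRef` on each trigger returns an empty list. Anything else yields one line per gap.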
For container and deployment pipelines, prefer:
For runtime security, treat Falco findings as signals that need triage, ownership, severity, and follow-up policy where appropriate. Do not convert every runtime signal directly into a blocking admission policy without checking noise, scope, and rollout plan.
Always use the new CEL-based policy types — never kyverno.io/v1 ClusterPolicy for new work:
```yaml
apiVersion: policies.kyverno.io/v1
kind: ValidatingPolicy
metadata:
  name: require-team-labels
  annotations:
    policies.kyverno.io/title: Require team labels
    policies.kyverno.io/severity: medium
spec:
  validationActions: [Audit]  # Always start in Audit; promote to Deny after zero violations
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        resources: ["deployments"]
        operations: ["CREATE", "UPDATE"]
  matchConditions:
    - name: exclude-system-namespaces
      expression: "!(['kube-system','kube-public','flux-system'].exists(ns, ns == object.metadata.namespace))"
  validations:
    - expression: "object.metadata.labels != null && 'app.kubernetes.io/team' in object.metadata.labels"
      message: "Deployment must have app.kubernetes.io/team label"
```

Promotion: `kubectl patch validatingpolicy <name> --type merge -p '{"spec":{"validationActions":["Deny"]}}'`, only after confirmed zero violations in PolicyReport.
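The "zero violations" condition for promotion can be checked from PolicyReport data instead of by eye. The following is a minimal sketch, assuming the input is the parsed JSON of a command such as `kubectl get policyreport -A -o json`, and that each result's `policy` field equals the policy name. Both points are assumptions to verify against the Kyverno version in use, and the function names are hypothetical.

```python
def blocking_results(policy_reports: dict, policy_name: str) -> list[dict]:
    """Return the fail/error results recorded for one policy.

    `policy_reports` is a Kubernetes List object whose `items` are PolicyReports,
    each carrying a `results` array with `policy` and `result` fields.
    """
    blocking = []
    for report in policy_reports.get("items", []):
        for result in report.get("results", []):
            if result.get("policy") != policy_name:
                continue
            if result.get("result") in ("fail", "error"):
                blocking.append(result)
    return blocking


def safe_to_promote(policy_reports: dict, policy_name: str) -> bool:
    """Promote from Audit to Deny only when no result for the policy blocks."""
    return not blocking_results(policy_reports, policy_name)
```

An empty result set also returns `True`, so confirm the policy has actually been evaluated before treating the answer as a green light.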
Never generate:
- `validationFailureAction: Enforce`; use `validationActions: [Deny]` instead
- `spec.rules[].match.any[].resources`; use `matchConstraints.resourceRules` instead

For OPA/Conftest policies, write Rego v1:

```rego
# METADATA
# title: IAM least privilege
# entrypoint: true
package terraform.iam

import rego.v1

deny contains msg if {
    some name
    policy := input.resource.aws_iam_policy[name]
    some statement in policy.policy.Statement
    statement.Action == "*"
    msg := sprintf("IAM policy '%s' must not use wildcard Action", [name])
}
```

Always add `import rego.v1`. Rules must be named `deny`, `warn`, or `violation`.
Pipeline: conftest fmt --check → regal lint → conftest verify → conftest test.
Never generate:
- `publicly_accessible = true` on RDS, Redshift, or OpenSearch
- `encrypted = false` on any storage resource
- `skip_final_snapshot = true` on production databases
- `deletion_protection = false` on production databases
- `is_multi_region_trail = false` on CloudTrail
- `enable_log_file_validation = false` on CloudTrail
- `enable = false` on GuardDuty
- `:latest` image tags

Always include:

```hcl
resource "aws_kms_key" "..." { enable_key_rotation = true } # CC6.7
resource "aws_cloudtrail" "..." {
  is_multi_region_trail      = true
  enable_log_file_validation = true
} # CC7.2
resource "aws_guardduty_detector" "..." { enable = true } # CC7.1
resource "aws_db_instance" "..." {
  backup_retention_period = 35
  deletion_protection     = true
} # A1.2
```

Commit subject line: `<type>(<scope>): <imperative WHY, ≤72 chars, lowercase, no period>`.
Types: feat, fix, refactor, chore, ci, docs, test, perf.
Never add Co-authored-by: Claude or any AI attribution.
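The subject-line format and type list above are regular enough to lint. The following is a minimal sketch, assuming the scope is mandatory, the 72-character limit applies to the whole subject line, and "lowercase" means the summary contains no uppercase letters at all. All three are strict readings of the template, and the function name `subject_problems` is hypothetical.

```python
import re

ALLOWED_TYPES = ("feat", "fix", "refactor", "chore", "ci", "docs", "test", "perf")

# <type>(<scope>): <summary>
_SUBJECT = re.compile(r"^(?P<type>[a-z]+)\((?P<scope>[^()\s]+)\): (?P<summary>.+)$")


def subject_problems(subject: str) -> list[str]:
    """Return the rule violations found in a commit subject line."""
    match = _SUBJECT.match(subject)
    if not match:
        return ["subject does not match <type>(<scope>): <summary>"]
    problems = []
    if match["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {match['type']}")
    if len(subject) > 72:
        problems.append("subject is longer than 72 characters")
    summary = match["summary"]
    if summary != summary.lower():
        problems.append("summary is not lowercase")
    if summary.endswith("."):
        problems.append("summary ends with a period")
    return problems


print(subject_problems("fix(flux): pin chart version to stop surprise upgrades"))
print(subject_problems("feat(ci): Add OIDC login."))
```

The first subject passes with no problems. The second is reported for capitalization and the trailing period.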
When reviewing any PR that touches infrastructure, check all six dimensions:
- `:latest` images

Reference files:

- `references/kubernetes.md`: cluster baselines, RBAC, network policy
- `references/openshift.md`: routes, SCCs, operators
- `references/fluxcd.md`: GitOps reconciliation, troubleshooting
- `references/fluxcd-helmrelease.md`: HelmRelease chartRef, drift detection, remediation
- `references/fluxcd-kustomization.md`: CEL readyExpr, postBuild, SOPS, SSA annotations
- `references/fluxcd-security.md`: source auth, OCI supply chain, RBAC, image automation
- `references/fluxcd-troubleshooting.md`: controller-specific incident diagnosis
- `references/argocd.md`: app design, ApplicationSets
- `references/aws.md`: IAM, EKS, account model
- `references/aws-cloudfront.md`: CloudFront distributions, OAC, cache/security policies
- `references/aws-waf.md`: Web ACLs, managed rules, rate limiting, Shield/FMS
- `references/aws-mcp-profiles.md`: AWS MCP profile management and multi-account auth
- `references/azure.md`: AKS, workload identity, RBAC
- `references/terraform.md`: module design, state, testing
- `references/github-actions.md`: workflow security, OIDC
- `references/composite-actions.md`: composite actions patterns, multi-cloud k8s deploy (EKS/AKS/GKE OIDC), private repo access, reusable-workflow decision guide
- `references/platform-operating-model.md`: cross-cutting architecture
- `references/platform-mindset.md`: DevEx, RFC/ADR, incident communication
- `references/secrets.md`: External Secrets Operator, Sealed Secrets, provider setup
- `references/compliance.md`: SOC 2 controls in Terraform
- `references/helm.md`: chart scaffolding, lint pipeline
- `references/mcp.md`: MCP protocol, TypeScript/Python SDKs
- `references/observability.md`: logging, metrics, tracing, alerting
- `references/documentation.md`: docstrings, OpenAPI 3.1, doc sites
- `references/datadog.md`: Agent setup, APM, monitors, SLOs
- `references/dynatrace.md`: Operator, instrumentation, SLOs
- `references/conventional-commits.md`: commit spec, tooling
- `references/opa.md`: Rego v1, rule types, testing, Conftest CLI
- `references/kyverno.md`: CEL policies, Audit→Deny, PolicyException
- `references/pr-review.md`: cost, drift, ownership, compliance, upgrade, rollback
- `references/keda.md`: ScaledObject, ScaledJob, TriggerAuthentication, all scalers, IRSA, GitOps integration
- `references/agent-self-improve.md`: .learnings/ setup, WAL protocol, VFM scoring, proactive agent behavior
- `references/supply-chain.md`: Cosign signing, Syft SBOM, Trivy CVE gates, SLSA Level 2, Kyverno enforcement
- `references/runtime-security.md`: Falco eBPF, custom rules, Falcosidekick alert routing, Kyverno bridge
- `references/chaos.md`: Litmus Chaos v3, Chaos Mesh v2, steady-state hypothesis, GameDay workflow
- `references/dora.md`: Deployment Frequency, Lead Time, Change Failure Rate, MTTR (GitHub Actions + Prometheus)
- `references/llm-observability.md`: Datadog LLMObs instrumentation, eval bootstrap, trace RCA
- `references/awesome-docs.md`: animated GitHub-safe SVG doc generation, 4 patterns, timing math, GitHub constraints
- `references/renovate.md`: dependency update automation, regex managers, private registries
- `examples/`: working, production-ready code examples