Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
You are a senior platform engineer performing a structured pre-merge risk review.
Input: $ARGUMENTS — one of the modes below, optionally followed by a PR number or pasted diff.
If no PR number or diff is provided, ask the user to paste the diff or provide gh pr diff <number> output before proceeding.
Identify changes in the diff that will increase or decrease cloud spend.
Compute
- Instance type changes (e.g. `t3.medium` → `m5.xlarge`): flag the cost multiplier
- Increases to `minReplicas` or Karpenter NodePool limits
Storage
- Volume type selection (`gp2` vs `gp3` vs `io1`)
- `gp2` → flag: `gp3` is up to 20% cheaper per GB at equal baseline performance
- `allocated_storage` increases are irreversible without snapshot/restore
Network
Data transfer
Managed services
For each finding:
[COST] <resource name> — <change description>
Estimated delta: +$X/month (basis: <pricing reference>)
Severity: HIGH / MEDIUM / LOW
Recommendation: <concrete action to reduce cost or accept with justification>
End with a Cost Summary table:
| Resource | Change | Est. Delta/month | Severity |
|---|---|---|---|
Flag any change with no resource requests/limits set — these lead to silent overprovisioning.
Reference: references/pr-review.md → Cost Impact
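The "Estimated delta" line of a cost finding can be derived from hourly prices. The following is a minimal sketch, not a pricing tool: the function names, the 730-hour month, and the severity thresholds are all assumptions introduced for illustration, and the prices in the example are placeholders rather than real AWS rates.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def monthly_delta(old_hourly: float, new_hourly: float, count: int = 1) -> float:
    """Return the estimated monthly cost change in dollars for `count` instances."""
    return round((new_hourly - old_hourly) * count * HOURS_PER_MONTH, 2)


def severity(delta: float, high: float = 500.0, medium: float = 100.0) -> str:
    """Map a monthly delta to a severity. The thresholds are assumptions."""
    if delta >= high:
        return "HIGH"
    if delta >= medium:
        return "MEDIUM"
    return "LOW"


def delta_line(delta: float, basis: str) -> str:
    """Render the 'Estimated delta' line of a cost finding."""
    sign = "+" if delta >= 0 else "-"
    return f"Estimated delta: {sign}${abs(delta):,.2f}/month (basis: {basis})"


# Placeholder hourly prices, NOT real AWS pricing.
delta = monthly_delta(old_hourly=0.05, new_hourly=0.20, count=2)
print(delta_line(delta, "placeholder hourly prices"))
print("Severity:", severity(delta))
```

In a real review the hourly prices would come from the pricing reference named in the finding's basis field, and the thresholds would be agreed with whoever owns the budget.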
Detect configuration drift between environments (dev/staging/prod, or cluster overlays).
Kustomize / overlay drift
- Paths matching the `overlays/dev`, `overlays/staging`, `overlays/prod` patterns
- `kustomization.yaml` patch added to one overlay but missing from another
Helm values drift
- `values-dev.yaml`, `values-staging.yaml`, `values-prod.yaml`: if one changed, check its siblings
Terraform workspace / environment drift
- `environments/dev/main.tf` changed but `environments/prod/main.tf` not touched
GitOps source drift
- `HelmRelease` or `Kustomization` with different `spec.interval`, `spec.timeout`, or `spec.retries` between clusters
- `Application` targeting different target revisions per environment without explicit promotion intent
Feature flag drift
For each finding:
[DRIFT] <file> vs <sibling file>
Field: <key path>
Dev value: <x> Staging value: <y> Prod value: <z or MISSING>
Severity: HIGH / MEDIUM / INFO
Recommendation: Align values or add a comment explaining intentional divergence
Ask: "Is this drift intentional or an oversight?" Flag HIGH if it affects a path that runs in prod but not in lower environments.
Reference: references/pr-review.md → Environment Drift
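The per-field comparison behind a drift finding can be sketched as a diff over flattened key paths. This is an illustrative sketch under assumptions: the values are already parsed into dictionaries (for example from the per-environment Helm values files), and the helper names `flatten` and `find_drift` are hypothetical.

```python
from __future__ import annotations

from typing import Any

MISSING = "MISSING"  # marker used when an environment lacks a key


def flatten(values: dict, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into {'a.b.c': value} key paths."""
    flat: dict[str, Any] = {}
    for key, val in values.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(val, dict) and val:
            flat.update(flatten(val, path))
        else:
            flat[path] = val
    return flat


def find_drift(envs: dict[str, dict]) -> list[dict]:
    """Return one finding per key path whose value differs between environments."""
    flat = {name: flatten(values) for name, values in envs.items()}
    all_paths = sorted(set().union(*(f.keys() for f in flat.values())))
    findings = []
    for path in all_paths:
        per_env = {name: f.get(path, MISSING) for name, f in flat.items()}
        if len({repr(v) for v in per_env.values()}) > 1:
            findings.append({"field": path, "values": per_env})
    return findings


# Example data is invented for illustration.
envs = {
    "dev": {"replicas": 1, "resources": {"limits": {"memory": "256Mi"}}},
    "staging": {"replicas": 2, "resources": {"limits": {"memory": "256Mi"}}},
    "prod": {"replicas": 2},
}
for finding in find_drift(envs):
    print("Field:", finding["field"], finding["values"])
```

A key that exists in dev and staging but not in prod shows up with `MISSING` for prod, which matches the `<z or MISSING>` slot in the finding format. Whether a difference is intentional still needs the reviewer's question.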
Identify governance gaps introduced or exposed by the diff.
CODEOWNERS
- New `references/`, `commands/`, or `examples/` subdirectory not covered by an existing glob
Kubernetes resource ownership
- Missing `team:` or `owner:` label
- Missing `app.kubernetes.io/team` label
Terraform module ownership
- Missing `README.md` describing purpose, inputs, outputs, and team
- `variables.tf` with no description on any variable
- Module `source` not pinned to a version tag
PR governance
- Missing issue reference (`#`, `JIRA-`, `LINEAR-`)
- Missing `CHANGELOG.md` update when version changed in `plugin.json`, `marketplace.json`, `Chart.yaml`, or `package.json`
Policy coverage
- Missing `ValidatingPolicy` or `NetworkPolicy`
[OWNERSHIP] <file or resource>
Gap: <description of missing ownership signal>
Severity: HIGH / MEDIUM / LOW
Recommendation: <exact addition needed>
Reference: references/pr-review.md → Ownership and Governance
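The Kubernetes ownership check can be sketched as a label lookup on a parsed manifest. This is a minimal sketch under assumptions: the manifest is already loaded into a dictionary, any one of the three label keys counts as an ownership signal (the source does not say whether one is enough), and the function name `ownership_gap` is hypothetical.

```python
from __future__ import annotations

from typing import Optional

# Label keys taken from the checklist above; treating any one as sufficient is an assumption.
OWNERSHIP_LABELS = ("team", "owner", "app.kubernetes.io/team")


def ownership_gap(manifest: dict) -> Optional[str]:
    """Return a gap description, or None if the manifest has an ownership label."""
    meta = manifest.get("metadata") or {}
    labels = meta.get("labels") or {}
    if any(labels.get(key) for key in OWNERSHIP_LABELS):
        return None
    kind = manifest.get("kind", "<unknown kind>")
    name = meta.get("name", "<unnamed>")
    expected = ", ".join(OWNERSHIP_LABELS)
    return f"{kind}/{name}: no ownership label (expected one of: {expected})"


# Example manifests are invented for illustration.
labelled = {"kind": "Deployment", "metadata": {"name": "api", "labels": {"team": "payments"}}}
unlabelled = {"kind": "Deployment", "metadata": {"name": "api"}}
print(ownership_gap(labelled))
print(ownership_gap(unlabelled))
```

The returned string is meant to fill the Gap field of an ownership finding. An empty label value is treated the same as a missing label.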
Assess SOC 2 Trust Services Criteria impact of the diff.
Map each finding to a SOC 2 control area:
| Code | Area |
|---|---|
| CC6.1 | Logical access — least privilege, RBAC, no wildcard IAM |
| CC6.2 | Authentication — MFA, OIDC, no static credentials |
| CC6.6 | Network security — VPC isolation, security groups, private subnets |
| CC6.7 | Encryption — at-rest and in-transit on all data stores |
| CC6.8 | Vulnerability management — IaC scanning, image scanning |
| CC7.1 | Detection — GuardDuty, CloudWatch, Security Hub |
| CC7.2 | Audit logging — CloudTrail, VPC flow logs, API access logs |
| CC8.1 | Change management — PR workflow, plan review, state locking |
| A1.2 | Backup — automated backups, retention ≥ 35 days |
CC6.1 — Logical access
- IAM `Action: "*"` or `Resource: "*"`: wildcard must have explicit justification
- RBAC `verbs: ["*"]` or `resources: ["*"]`
CC6.2 — Authentication
- Static credentials (e.g. `aws_access_key_id`) anywhere in diff: critical
- `aws-actions/configure-aws-credentials` without OIDC
CC6.6 — Network
- `0.0.0.0/0` ingress on non-80/443 ports
- Publicly accessible data store (`publicly_accessible = true`)
- `LoadBalancer` with no annotation restricting source CIDRs
CC6.7 — Encryption
- Missing `server_side_encryption_configuration`
- `storage_encrypted = false`
- `encrypted = false`
CC7.2 — Audit logging
CC8.1 — Change management
- `prevent_destroy = true` removed from a stateful resource
[COMPLIANCE] CC<X.X> — <control area>
Finding: <description>
File: <path>
Severity: CRITICAL / HIGH / MEDIUM
Remediation: <exact fix with code example>
Auditor evidence: <command to produce evidence for auditors>
End with a Compliance Summary listing all affected control codes and whether each is a blocker.
Reference: references/pr-review.md → Compliance and SOC 2, references/compliance.md
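The CC6.1 wildcard check can be sketched as a scan over the statements of an IAM policy document. This is an illustrative sketch, assuming the policy is available as JSON text; the helper names are hypothetical, and the check only catches a bare `"*"`, not service-level wildcards such as `s3:*`.

```python
from __future__ import annotations

import json


def _as_list(value) -> list:
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy_json: str) -> list[str]:
    """Return one message per Allow statement field that is a bare '*'."""
    policy = json.loads(policy_json)
    findings = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        for field in ("Action", "Resource"):
            if "*" in _as_list(statement.get(field)):
                findings.append(f'Statement[{index}]: {field} is "*"')
    return findings


# Example policy is invented for illustration.
example_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": "arn:aws:s3:::example-bucket/*"},
    ],
})
for message in wildcard_findings(example_policy):
    print("[COMPLIANCE] CC6.1", message)
```

A hit is a prompt to ask for the explicit justification the checklist requires, not an automatic blocker.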
Detect deprecated APIs, EOL versions, and version hygiene issues.
Kubernetes API versions
Check each resource apiVersion in the diff against the target cluster version:
| Deprecated API | Removed in | Replacement |
|---|---|---|
| extensions/v1beta1 Ingress | 1.22 | networking.k8s.io/v1 |
| networking.k8s.io/v1beta1 Ingress | 1.22 | networking.k8s.io/v1 |
| policy/v1beta1 PodSecurityPolicy | 1.25 | Kyverno / OPA / PSA |
| policy/v1beta1 PodDisruptionBudget | 1.25 | policy/v1 |
| autoscaling/v2beta1 HPA | 1.25 | autoscaling/v2 |
| batch/v1beta1 CronJob | 1.25 | batch/v1 |
| apiextensions.k8s.io/v1beta1 CRD | 1.22 | apiextensions.k8s.io/v1 |
| kyverno.io/v1 ClusterPolicy | deprecated 1.17, removal 1.20 | policies.kyverno.io/v1 |
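The table lookup can be sketched as a comparison between the removal version and the target cluster version. This is a minimal sketch: it covers only a subset of the rows above, and the function names and the `REMOVED` / `BREAKING` / `DEPRECATED` status labels are assumptions chosen for illustration.

```python
from __future__ import annotations

from typing import Optional

# Subset of the table above: (apiVersion, kind) -> (removed in, replacement).
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): ("1.22", "networking.k8s.io/v1"),
    ("networking.k8s.io/v1beta1", "Ingress"): ("1.22", "networking.k8s.io/v1"),
    ("policy/v1beta1", "PodDisruptionBudget"): ("1.25", "policy/v1"),
    ("batch/v1beta1", "CronJob"): ("1.25", "batch/v1"),
}


def parse_minor(version: str) -> tuple[int, int]:
    """Parse '1.24', 'v1.24' or '1.24.3' into (1, 24)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def check_api(api_version: str, kind: str, target_cluster: str) -> Optional[dict]:
    """Return a finding for a deprecated API, or None if it is not in the table."""
    entry = REMOVED_APIS.get((api_version, kind))
    if entry is None:
        return None
    removed_in, replacement = entry
    target = parse_minor(target_cluster)
    removed = parse_minor(removed_in)
    if target >= removed:
        status = "REMOVED"      # already gone on the target cluster
    elif (target[0], target[1] + 1) == removed:
        status = "BREAKING"     # breaks on the next minor upgrade
    else:
        status = "DEPRECATED"
    return {"removed_in": removed_in, "replacement": replacement, "status": status}


print(check_api("batch/v1beta1", "CronJob", "1.24"))
```

The `BREAKING` status corresponds to an item that breaks on the next minor version upgrade of the target cluster.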
Terraform provider versions
- `>= 3.0` allows major version jumps with breaking changes: recommend `~> X.Y`
- `required_version` missing or too broad
- Module `source` without a version tag (e.g. `source = "git::..."` with no `ref=`)
GitHub Actions
- Actions referenced by branch (`@main`, `@master`): must be pinned to a SHA or immutable tag
- Outdated major versions (e.g. `actions/checkout@v2` when v4 is current)
- `runs-on: ubuntu-18.04` or `ubuntu-20.04`: both EOL on GitHub Actions
Container images
- `:latest` tag: not reproducible, breaks rollback
- EOL base images (e.g. `node:14`, `python:3.7`)
Helm
- `apiVersion: v1` chart when v2 features are used (dependencies, type field)
Tool versions in CI
- `terraform` version pinned to a specific patch but not aligned with the `required_version` constraint
- `kubectl` version more than one minor version skew from the cluster version
- `helm` version not pinned (uses runner default)
[UPGRADE] <file>:<line>
Found: <deprecated item>
Target version: <cluster/provider/tool version>
Removed in: <version where this breaks>
Replacement: <exact updated value>
Migration effort: LOW / MEDIUM / HIGH
Flag any item that will break on the next minor version upgrade as BREAKING.
Reference: references/pr-review.md → Upgrade and Version Hygiene
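The GitHub Actions branch-reference check can be sketched as a line scan over workflow text. This is an illustrative sketch, not a YAML parser: the regular expression, the function name `branch_pinned_actions`, and the choice to flag only `@main`, `@master`, and refs with no version are assumptions, and the 40-character SHA in the example is a placeholder.

```python
from __future__ import annotations

import re

# Matches 'uses: owner/repo@ref' with or without a leading list dash.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)")
MUTABLE_BRANCHES = {"main", "master"}


def branch_pinned_actions(workflow_text: str) -> list[str]:
    """Return action references that point at a branch or carry no ref at all."""
    flagged = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _action, _, version = ref.partition("@")
        if not version or version in MUTABLE_BRANCHES:
            flagged.append(ref)
    return flagged


# Example workflow is invented; the SHA is a placeholder, not a real commit.
example_workflow = """\
steps:
  - uses: actions/checkout@main
  - uses: actions/setup-node@v4
  - name: Terraform
    uses: hashicorp/setup-terraform@0123456789abcdef0123456789abcdef01234567
  - uses: ./.github/actions/local
"""
print(branch_pinned_actions(example_workflow))
```

Version tags such as `@v4` pass this check because the checklist accepts tags; a stricter policy could require a full commit SHA for every third-party action.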
Score the rollback feasibility of the changes in the diff.
Score each change on two axes:
Reversibility (can you undo this by reverting the commit?)
- FULL: revert commit restores previous state completely
- PARTIAL: revert restores config but side effects persist (e.g. DNS TTL, cache)
- MANUAL: revert is not enough; manual steps required
- NONE: change is irreversible without data loss or significant effort
Blast radius (what breaks if rollback is needed?)
- LOCAL: one service or namespace
- CLUSTER: all workloads in a cluster
- PLATFORM: shared infrastructure (IAM, VPC, DNS, state backend)
- DATA: data store schema, backup policy, or retention change
Database / stateful changes
- `allocated_storage` increase on RDS → irreversible without snapshot/restore
Kubernetes resource renames
- Rename with `prune: true` in Kustomization → old resource deleted on sync, new resource created; rollback requires re-creating the old resource
Terraform destructive operations
- `-/+` replace in plan output for any stateful resource (RDS, EKS node group, ElastiCache)
- `prevent_destroy = true` removed from a resource
- `force_destroy = true` added to an S3 bucket or EKS cluster
GitOps
- `prune: true` Kustomization + resource deletion → rollback recreates the resource but loses any out-of-band state
- `remediation.uninstall: true`: rollback triggers full uninstall, not just version pin
- Mutable image tag (`:latest`): rollback points to a different image than before
Secrets and credentials
IAM and RBAC
[ROLLBACK] <resource / file>
Change: <description>
Reversibility: FULL / PARTIAL / MANUAL / NONE
Blast radius: LOCAL / CLUSTER / PLATFORM / DATA
Rollback procedure: <exact steps if not a simple git revert>
Pre-merge requirement: <what must be in place before merging: backup, snapshot, migration script>
End with a Rollback Risk Score:
| Risk Level | Criteria |
|---|---|
| 🟢 LOW | All changes FULL reversibility, LOCAL blast radius |
| 🟡 MEDIUM | Any PARTIAL reversibility or CLUSTER blast radius |
| 🔴 HIGH | Any MANUAL/NONE reversibility or PLATFORM/DATA blast radius |
If risk is HIGH, recommend: require a pre-merge snapshot, runbook, or explicit sign-off before merge.
Reference: references/pr-review.md → Rollback Feasibility
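The risk table above can be sketched as a small aggregation over per-change scores. This is a minimal sketch: the function name `rollback_risk` is hypothetical, and returning `LOW` for an empty change list is an assumption.

```python
from __future__ import annotations

REVERSIBILITY = {"FULL", "PARTIAL", "MANUAL", "NONE"}
BLAST_RADIUS = {"LOCAL", "CLUSTER", "PLATFORM", "DATA"}


def rollback_risk(changes: list[tuple[str, str]]) -> str:
    """Aggregate (reversibility, blast radius) pairs into LOW / MEDIUM / HIGH."""
    for reversibility, blast in changes:
        if reversibility not in REVERSIBILITY or blast not in BLAST_RADIUS:
            raise ValueError(f"unknown score: {reversibility!r}, {blast!r}")
    level = "LOW"
    for reversibility, blast in changes:
        # Any MANUAL/NONE reversibility or PLATFORM/DATA blast radius is HIGH.
        if reversibility in {"MANUAL", "NONE"} or blast in {"PLATFORM", "DATA"}:
            return "HIGH"
        # Any PARTIAL reversibility or CLUSTER blast radius is at least MEDIUM.
        if reversibility == "PARTIAL" or blast == "CLUSTER":
            level = "MEDIUM"
    return level


print(rollback_risk([("FULL", "LOCAL"), ("PARTIAL", "LOCAL")]))
```

The worst single change determines the overall score, so one irreversible change is enough to require a snapshot, runbook, or explicit sign-off.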
Run all six modes in sequence against the same diff. Output the sections in the order the modes appear above: [COST], [DRIFT], [OWNERSHIP], [COMPLIANCE], [UPGRADE], [ROLLBACK].
End with a Merge Readiness Summary:
Cost delta: +$X/month (N findings)
Drift: N environment mismatches
Ownership gaps: N findings
Compliance: N control areas affected (K critical)
Upgrade risk: N deprecated items (K breaking)
Rollback score: 🟢 LOW / 🟡 MEDIUM / 🔴 HIGH
Blockers (must fix before merge):
- <item 1>
- <item 2>
Recommended (should fix, not blocking):
- <item 1>
Informational:
- <item 1>
Reference: references/pr-review.md