Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Guidance for platform engineers on treating developers as customers, communicating across teams, and proactively solving systemic problems.
A platform team that does not think like a product team builds infrastructure nobody uses. The shift:
| Infrastructure Team Thinking | Platform Product Thinking |
|---|---|
| "We provide the tooling" | "We solve developer problems" |
| Feature-driven roadmap | Problem-driven roadmap |
| Adoption assumed | Adoption measured |
| Documentation as afterthought | Documentation as product surface |
| Support as interruption | Support as signal |
A golden path is the opinionated, supported, low-friction route from idea to production. It does not prevent other paths — it makes the right path the easy path.
What a golden path covers:
Anti-patterns:
Backstage is the catalog and golden-path delivery surface. Use it to:
Minimum viable `catalog-info.yaml` for the catalog:

```yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing
  annotations:
    github.com/project-slug: org/payments-api
    backstage.io/techdocs-ref: dir:.
spec:
  type: service
  lifecycle: production
  owner: group:payments-team
  system: checkout
```

Use the SPACE framework to avoid measuring only velocity:
| Dimension | Example Metrics |
|---|---|
| Satisfaction | Developer NPS, quarterly survey scores |
| Performance | Deployment frequency, lead time, change failure rate |
| Activity | PR merge rate, pipeline success rate |
| Communication | PR review turnaround, incident MTTR |
| Efficiency | Time from commit to production, CI duration |
DORA four key metrics as a baseline:
- Deployment frequency
- Lead time for changes
- Change failure rate
- Time to restore service
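As a minimal sketch, the four DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore service) can be derived from a list of deployment records. The `Deployment` record and its field names are hypothetical; in practice the data would come from whatever your pipeline and incident tooling export.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, not taken from any specific tool.
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments, window_days):
    """Return the four DORA metrics for deployments inside one reporting window."""
    if not deployments:
        raise ValueError("need at least one deployment")
    if window_days <= 0:
        raise ValueError("window_days must be positive")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]

    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment in the window has been restored yet.
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here rather than means so that one unusually slow change does not dominate the number; that is a design choice for this sketch, not a requirement of the metrics.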
Where to collect signals:
Run a friction audit before starting roadmap planning:
Common friction sources and platform responses:
| Friction | Platform Response |
|---|---|
| "I don't know what secrets to set" | Self-service secret template in Backstage |
| "CI keeps failing on flaky tests" | Quarantine lane + test reliability SLO |
| "I have to wait for infra ticket" | Self-service Terraform module via Atlantis PR |
| "Staging is always broken" | Environment health dashboard, owner alerts |
| "I don't know who owns this" | Backstage catalog with on-call surfaced |
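The "flaky tests" row above needs a working definition before a quarantine lane can be automated. One possible sketch, assuming CI results are available as `(test_name, commit_sha, passed)` tuples (a hypothetical export format): a test is treated as flaky when the same commit produced both a pass and a fail.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return the sorted names of tests that both passed and failed on one commit.

    runs: iterable of (test_name, commit_sha, passed) tuples. This input shape
    is an assumption for illustration, not a GitHub Actions API.
    """
    outcomes = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(bool(passed))

    # Two distinct outcomes for the same code means the result is nondeterministic.
    flaky = {test for (test, _sha), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)
```

A test that fails on one commit and passes on a later one is not flagged, since the code changed between the runs; only disagreement on identical code counts.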
Platform engineers work across engineering, security, finance, and product. Each audience needs a different frame:
| Audience | What They Care About | How to Frame Platform Work |
|---|---|---|
| Engineering teams | Speed, reliability, not blocked | Reduce toil, faster deploys, self-service |
| Security | Risk, compliance, audit | Controls enforced by default, audit trail |
| Finance/FinOps | Cost, waste, forecast | Resource tagging, rightsizing, showback |
| Product/Leadership | Outcomes, not infrastructure | DORA metrics, incident reduction, time to market |
Follow the Context → Problem → Solution → Trade-offs structure:
- Context: What is the system and who uses it?
- Problem: What specific failure mode or gap exists?
- Solution: What change addresses the root cause?
- Trade-offs: What does the solution cost or constrain?
Avoid leading with the technology. Lead with the outcome.
Bad:
"We need to implement a service mesh with mTLS and SPIFFE-based identity."
Good:
"Services can currently make arbitrary calls to each other with no authentication. If one service is compromised, it can reach any other. We want to enforce that only the payments service can call the billing service — that requires identity at the network layer, which a service mesh provides."
Use a Request for Comments (RFC) for decisions that affect more than one team, and an Architecture Decision Record (ADR) to capture what was decided and why.
RFC template minimum:

```markdown
# RFC: [title]
## Status: Draft | In Review | Accepted | Rejected
## Problem
What is broken or missing? Why does this matter now?
## Proposal
What change are you proposing? Be concrete.
## Alternatives considered
What else did you evaluate and why did you reject it?
## Impact
Which teams are affected? What migration is required?
## Open questions
What is not yet decided?
```

ADR template minimum:
```markdown
# ADR-0042: [title]
## Status: Accepted
## Context
What situation forced this decision?
## Decision
What did we decide?
## Consequences
What becomes easier? What becomes harder? What must we monitor?
```

Store ADRs in `docs/decisions/` in the relevant repo.
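Numbering ADRs by hand tends to produce collisions when two pull requests land close together. A small helper can pick the next number from the files already present. This is a sketch under an assumed naming convention, `NNNN-slug.md`, which the handbook does not prescribe; `next_adr_path` is a hypothetical name.

```python
import re


def next_adr_path(existing_files, title, directory="docs/decisions"):
    """Build the path for a new ADR, numbered one past the highest existing ADR.

    existing_files: file names already in the directory (for example from
    os.listdir). Files that do not start with four digits and a hyphen are ignored.
    """
    numbers = []
    for name in existing_files:
        match = re.match(r"(\d{4})-", name)
        if match:
            numbers.append(int(match.group(1)))

    next_number = max(numbers, default=0) + 1
    # Lowercase the title and collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{directory}/{next_number:04d}-{slug}.md"
```

For example, with `0041-a.md` and `0042-use-flux.md` already present, a title of "Adopt External Secrets Operator" yields `docs/decisions/0043-adopt-external-secrets-operator.md`.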
Structure incident updates for the broadest useful audience:
```text
[STATUS] [SEVERITY] [COMPONENT] - [IMPACT STATEMENT]
Time: 14:32 UTC
Severity: SEV-2
Affected: Payment checkout, ~15% of transactions failing
Status: Investigating
What we know: Elevated error rate started at 14:28 after deploy of payments-api v2.3.1
What we are doing: Rollback in progress, ETA 10 minutes
Next update: 14:50 UTC or sooner if status changes
```

Avoid:
Structure:
Blameless means: the system made it possible for a human to make that mistake. Fix the system.
Do not wait for incidents. Use these signals to find problems before they surface:
| Signal | Where to Look | What to Ask |
|---|---|---|
| Error budget burn rate | SLO dashboard | Which service is burning fastest? |
| P99 latency trends | APM (Datadog, Grafana) | What is climbing week-over-week? |
| CI failure rate | GitHub Actions / pipeline metrics | Which test is flaky? Which step adds 5 min? |
| Support ticket volume | Slack/Jira categories | What is the top category this sprint? |
| Cost anomalies | AWS Cost Explorer / Azure Cost Mgmt | What resource class is growing unexpectedly? |
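The first row of the table relies on error budget burn rate, which is the observed error rate divided by the error rate the SLO allows. A burn rate of 1 spends the budget exactly over the SLO window; anything higher exhausts it early. The sketch below assumes a request-based SLO; the function names are illustrative only.

```python
def burn_rate(failed_requests, total_requests, slo_target):
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")

    error_budget = 1 - slo_target
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / error_budget


def hours_to_exhaustion(window_hours, rate):
    """Hours until the whole budget is spent if the current burn rate holds."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_hours / rate
```

With 50 failures in 10,000 requests against a 99.9% target, the burn rate is about 5. If that rate held, the budget for a 30-day (720-hour) window would be gone in roughly 144 hours, or 6 days.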
When a problem recurs, ask: is this a systemic issue or a local one?
- Local fix: correct the specific instance.
- Systemic fix: remove the class of problem.
| Problem | Local Fix | Systemic Fix |
|---|---|---|
| Secret rotated manually | Rotate the secret | Automate rotation with ESO + Vault TTL |
| Developer opened port 22 | Close the port | AWS Config rule + auto-remediation Lambda |
| Helm values had wrong image tag | Fix the tag | Pin tags in CI artifact promotion |
| Pod OOMKilled | Increase memory limit | Add VPA + alerting on limit utilization |
Prefer systemic fixes. Local fixes accrue as operational debt.
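To illustrate the "port 22" row, the detection half of a systemic fix can be a pure function over a security group description. The sketch below assumes the dictionary shape returned by EC2 `DescribeSecurityGroups` (`IpPermissions`, `FromPort`, `ToPort`, `IpRanges`, `Ipv6Ranges`). It is not the AWS Config rule itself, and the remediation call is deliberately left out.

```python
WORLD_CIDRS = {"0.0.0.0/0", "::/0"}


def open_ssh_rules(security_group):
    """Return the world-open CIDRs through which port 22 is reachable.

    security_group: one entry from the "SecurityGroups" list of an EC2
    DescribeSecurityGroups response (assumed shape).
    """
    exposed = []
    for perm in security_group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            covers_ssh = True
        elif protocol == "tcp":
            covers_ssh = perm.get("FromPort", 0) <= 22 <= perm.get("ToPort", 65535)
        else:
            covers_ssh = False
        if not covers_ssh:
            continue

        cidrs = [r["CidrIp"] for r in perm.get("IpRanges", [])]
        cidrs += [r["CidrIpv6"] for r in perm.get("Ipv6Ranges", [])]
        exposed.extend(c for c in cidrs if c in WORLD_CIDRS)
    return exposed
```

Keeping the check free of AWS calls makes it easy to unit test; a Lambda handler would fetch the group, call this function, and revoke the offending rule only when the returned list is non-empty.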
Run monthly:
Run this quarterly with the team:
Developer Experience
Operations
Security and Compliance
Cost