nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.

Platform Mindset Reference

Name: nitinjain999/platform-skills
Rating: 67.2 (1 review)
Author: nitinjain999

Guidance for platform engineers on treating developers as customers, communicating across teams, and proactively solving systemic problems.

Product Mindset: Developers as Customers

The Core Shift

A platform team that does not think like a product team builds infrastructure nobody uses. The shift:

Infrastructure Team Thinking	Platform Product Thinking
"We provide the tooling"	"We solve developer problems"
Feature-driven roadmap	Problem-driven roadmap
Adoption assumed	Adoption measured
Documentation as afterthought	Documentation as product surface
Support as interruption	Support as signal

Golden Paths

A golden path is the opinionated, supported, low-friction route from idea to production. It does not prevent other paths — it makes the right path the easy path.

What a golden path covers:

Repo template with CI already wired
Service scaffold with observability, secrets, and RBAC pre-configured
One-command local environment that mirrors production
Documented promotion flow: dev → staging → production
Runbook template embedded in the service repo

Anti-patterns:

Golden paths that require a ticket to the platform team to use
Golden paths that only work for one language or framework
Undocumented golden paths where knowledge lives in Slack

Developer Portal (Backstage)

Backstage is the catalog and golden-path delivery surface. Use it to:

Register every service with owner, runbook, on-call, and SLO metadata
Surface software templates (scaffolding) for new services
Display TechDocs alongside the service it documents
Expose CI status, deploy history, and alert state in one place

Minimum viable catalog-info.yaml:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing
  annotations:
    github.com/project-slug: org/payments-api
    backstage.io/techdocs-ref: dir:.
spec:
  type: service
  lifecycle: production
  owner: group:payments-team
  system: checkout
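
A catalog only pays off if every entry is complete, so it can help to check descriptors in CI. The following is a minimal sketch, assuming the descriptor has already been parsed into a Python dict (for example by a YAML loader). The function name `missing_catalog_fields` and the `REQUIRED_PATHS` list are illustrative assumptions based on the example above, not Backstage's own schema validation.

```python
from typing import Any

# Dotted paths treated as the minimum for this sketch (an assumption,
# derived from the example descriptor, not from Backstage's schema).
REQUIRED_PATHS = [
    "apiVersion",
    "kind",
    "metadata.name",
    "spec.type",
    "spec.lifecycle",
    "spec.owner",
]


def missing_catalog_fields(entity: dict[str, Any]) -> list[str]:
    """Return the required dotted paths that are absent or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node: Any = entity
        for key in path.split("."):
            # Walk one level down; anything that is not a dict ends the walk.
            node = node.get(key) if isinstance(node, dict) else None
        if node in (None, "", [], {}):
            missing.append(path)
    return missing


# Hypothetical parsed descriptor with the owner left out.
entity = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "payments-api"},
    "spec": {"type": "service", "lifecycle": "production"},
}
print(missing_catalog_fields(entity))  # ['spec.owner']
```

Failing the pipeline on a non-empty result keeps ownerless services out of the catalog without requiring a platform ticket.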

Measuring Developer Experience (DevEx)

Use the SPACE framework to avoid measuring only velocity:

Dimension	Example Metrics
Satisfaction and well-being	Developer NPS, quarterly survey scores
Performance	Deployment frequency, lead time, change failure rate
Activity	PR merge rate, pipeline success rate
Communication and collaboration	PR review turnaround, incident MTTR
Efficiency and flow	Time from commit to production, CI duration
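
Developer NPS, listed in the table above, is simple arithmetic: the percentage of promoters minus the percentage of detractors. A small sketch, assuming survey answers on the usual 0 to 10 scale; the function name and the sample responses are hypothetical.

```python
def developer_nps(scores: list[int]) -> float:
    """Net Promoter Score from 0-10 survey answers.

    Promoters score 9-10 and detractors 0-6; passives (7-8) only
    count toward the total number of responses.
    """
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


# Hypothetical survey: 4 promoters, 3 passives, 3 detractors.
responses = [10, 9, 9, 10, 8, 7, 7, 6, 5, 3]
print(developer_nps(responses))  # 10.0
```

The score ranges from -100 to 100, so the trend between surveys is usually more informative than any single value.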

DORA four key metrics as a baseline:

Deployment frequency — how often you deploy to production
Lead time for changes — commit to production in minutes/hours
Change failure rate — percentage of deployments causing incidents
MTTR — time to restore after a failure
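
The four metrics can be derived from a plain list of deployment records. The sketch below is one possible calculation, not a standard implementation: the `Deployment` fields are hypothetical, lead time is summarised with a median, and time to restore is measured from the failing deploy to recovery, which is an assumption about how incidents are linked to deploys.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # first commit of the change
    deployed_at: datetime                   # change reached production
    caused_incident: bool = False
    restored_at: Optional[datetime] = None  # set once service was restored


def dora_metrics(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Compute the four DORA metrics over one reporting window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    failures = [d for d in deploys if d.caused_incident]
    restore_minutes = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_minutes": mean(restore_minutes) if restore_minutes else 0.0,
    }


# Hypothetical two-day window with four deploys, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), caused_incident=True,
               restored_at=t0 + timedelta(hours=4, minutes=30)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
metrics = dora_metrics(deploys, window_days=2)
print(metrics)
```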

Where to collect signals:

CI system (pipeline duration, failure rate)
Git platform (PR age, review latency)
Developer surveys (2x/year minimum)
Support ticket volume and categories (most actionable signal for friction)

Reducing Friction: The Audit Approach

Run a friction audit before starting roadmap planning:

Shadow a developer — watch them onboard a new service end-to-end
Collect ticket categories — group support requests by root cause
Measure wait times — how long does each handoff take?
Ask "why" five times, because each manual step has a root cause that automation or documentation can remove
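
Steps two and three of the audit (group tickets by root cause, measure wait times) combine naturally into one ranking. A minimal sketch, assuming tickets can be exported as (category, hours waited) pairs; the category names and numbers are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical ticket export: (root-cause category, hours waited on the handoff).
tickets = [
    ("secrets", 4.0),
    ("infra-request", 30.0),
    ("flaky-ci", 1.5),
    ("infra-request", 48.0),
    ("secrets", 2.0),
    ("infra-request", 18.0),
]


def rank_friction(tickets):
    """Rank root-cause categories by total hours developers spent waiting."""
    count = Counter()
    waited = defaultdict(float)
    for category, hours in tickets:
        count[category] += 1
        waited[category] += hours
    return sorted(
        ((cat, count[cat], waited[cat]) for cat in count),
        key=lambda row: row[2],
        reverse=True,
    )


for category, n, hours in rank_friction(tickets):
    print(f"{category:15} tickets={n} waited={hours:.1f}h")
```

Ranking by total wait rather than ticket count surfaces the handoffs that cost developers the most time, which is usually the better input for roadmap planning.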

Common friction sources and platform responses:

Friction	Platform Response
"I don't know what secrets to set"	Self-service secret template in Backstage
"CI keeps failing on flaky tests"	Quarantine lane + test reliability SLO
"I have to wait for infra ticket"	Self-service Terraform module via Atlantis PR
"Staging is always broken"	Environment health dashboard, owner alerts
"I don't know who owns this"	Backstage catalog with on-call surfaced

Collaboration and Communication

Working with Cross-Functional Teams

Platform engineers work across engineering, security, finance, and product. Each audience needs a different frame:

Audience	What They Care About	How to Frame Platform Work
Engineering teams	Speed, reliability, not being blocked	Reduce toil, faster deploys, self-service
Security	Risk, compliance, audit	Controls enforced by default, audit trail
Finance/FinOps	Cost, waste, forecast	Resource tagging, rightsizing, showback
Product/Leadership	Outcomes, not infrastructure	DORA metrics, incident reduction, time to market

Explaining Complex Technical Concepts

Follow the Context → Problem → Solution → Trade-offs structure:

Context: What is the system and who uses it?
Problem: What specific failure mode or gap exists?
Solution: What change addresses the root cause?
Trade-offs: What does the solution cost or constrain?

Avoid leading with the technology. Lead with the outcome.

Bad:

"We need to implement a service mesh with mTLS and SPIFFE-based identity."

Good:

"Services can currently make arbitrary calls to each other with no authentication. If one service is compromised, it can reach any other. We want to enforce that only the payments service can call the billing service — that requires identity at the network layer, which a service mesh provides."

RFC and ADR Process

Use Request for Comments (RFC) for decisions that affect more than one team, and Architecture Decision Records (ADR) to capture what was decided and why.

RFC template minimum:

# RFC: [title]

## Status: Draft | In Review | Accepted | Rejected

## Problem
What is broken or missing? Why does this matter now?

## Proposal
What change are you proposing? Be concrete.

## Alternatives considered
What else did you evaluate and why did you reject it?

## Impact
Which teams are affected? What migration is required?

## Open questions
What is not yet decided?

ADR template minimum:

# ADR-0042: [title]

## Status: Accepted

## Context
What situation forced this decision?

## Decision
What did we decide?

## Consequences
What becomes easier? What becomes harder? What must we monitor?

Store ADRs in docs/decisions/ in the relevant repo.

Incident Communication

Structure incident updates for the broadest useful audience:

[STATUS] [SEVERITY] [COMPONENT] - [IMPACT STATEMENT]

Time: 14:32 UTC
Severity: SEV-2
Affected: Payment checkout — ~15% of transactions failing
Status: Investigating

What we know: Elevated error rate started at 14:28 after deploy of payments-api v2.3.1
What we are doing: Rollback in progress, ETA 10 minutes
Next update: 14:50 UTC or sooner if status changes

Avoid:

Technical jargon in customer-facing updates
Uncertain timelines with no stated next-update time
Silence longer than 30 minutes during an active incident
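
The update layout and the 30-minute rule are easy to encode so that responders do not have to remember them under pressure. This is a sketch under assumptions: the function names, parameters, and the rule that a next-update time must be in the future are illustrative, not part of any incident tool.

```python
from datetime import datetime, timedelta, timezone

MAX_SILENCE = timedelta(minutes=30)  # longest acceptable gap between updates


def format_update(*, status: str, severity: str, component: str, impact: str,
                  known: str, doing: str, now: datetime, next_update: datetime) -> str:
    """Render an update in the layout shown above; a next-update time is mandatory."""
    if next_update <= now:
        raise ValueError("next update must be in the future")
    return "\n".join([
        f"[{status.upper()}] [{severity}] [{component}] - {impact}",
        "",
        f"Time: {now:%H:%M} UTC",
        f"Severity: {severity}",
        f"Affected: {component}, {impact}",
        f"Status: {status}",
        "",
        f"What we know: {known}",
        f"What we are doing: {doing}",
        f"Next update: {next_update:%H:%M} UTC or sooner if status changes",
    ])


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """True when the silence since the last update exceeds MAX_SILENCE."""
    return now - last_update > MAX_SILENCE


now = datetime(2024, 1, 1, 14, 32, tzinfo=timezone.utc)
text = format_update(
    status="Investigating",
    severity="SEV-2",
    component="Payment checkout",
    impact="~15% of transactions failing",
    known="Elevated error rate started at 14:28 after deploy of payments-api v2.3.1",
    doing="Rollback in progress, ETA 10 minutes",
    now=now,
    next_update=now + timedelta(minutes=18),
)
print(text)
```

A chat bot or cron job could call `update_overdue` during an active incident and nudge the incident commander before the silence limit is reached.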

Post-Mortem / Blameless Retrospective

Structure:

Timeline — what happened, in order, with timestamps
Impact — duration, affected users, business impact
Root cause — the systemic cause, not the human who made a change
Contributing factors — what made the root cause possible
Action items — with owner and due date, not vague "improve X"

Blameless means: the system made it possible for a human to make that mistake. Fix the system.

Problem-Solving at Scale

Proactive Problem Identification

Do not wait for incidents. Use these signals to find problems before they surface:

Signal	Where to Look	What to Ask
Error budget burn rate	SLO dashboard	Which service is burning fastest?
P99 latency trends	APM (Datadog, Grafana)	What is climbing week-over-week?
CI failure rate	GitHub Actions / pipeline metrics	Which test is flaky? Which step adds 5 min?
Support ticket volume	Slack/Jira categories	What is the top category this sprint?
Cost anomalies	AWS Cost Explorer / Azure Cost Mgmt	What resource class is growing unexpectedly?
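
Error budget burn rate, the first signal in the table, is the observed error rate divided by the error rate the SLO allows. A minimal sketch of that arithmetic, assuming request counts are available per service; the service names, counts, and SLO targets are hypothetical.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget lasts exactly the SLO window; higher values
    mean it runs out early.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days a full budget would last if the current burn rate held."""
    return float("inf") if rate <= 0 else window_days / rate


# Hypothetical counts: (failed requests, total requests, SLO target).
services = {
    "payments-api": (50, 10_000, 0.999),
    "search": (20, 10_000, 0.99),
    "auth": (3, 10_000, 0.999),
}
rates = {name: burn_rate(*args) for name, args in services.items()}
fastest = max(rates, key=rates.get)
print(fastest, round(rates[fastest], 2), round(days_until_exhausted(rates[fastest]), 1))
```

Here `payments-api` burns its budget about five times faster than sustainable, so a 30-day budget would be gone in roughly six days if nothing changed.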

Systemic Fix vs. Local Fix

When a problem recurs, ask: is this a systemic issue or a local one?

Local fix: correct the specific instance.
Systemic fix: remove the class of problem.

Problem	Local Fix	Systemic Fix
Secret rotated manually	Rotate the secret	Automate rotation with External Secrets Operator (ESO) + Vault TTL
Developer opened port 22	Close the port	AWS Config rule + auto-remediation Lambda
Helm values had wrong image tag	Fix the tag	Pin tags in CI artifact promotion
Pod OOMKilled	Increase memory limit	Add VPA + alerting on limit utilization
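
As an illustration of a systemic fix, the image-tag row could be enforced with a promotion gate that refuses mutable references. This is a sketch only: the `MUTABLE_TAGS` set and the function name are assumptions, and the parsing covers common reference shapes rather than the full image reference grammar.

```python
# Tags treated as mutable for this sketch (an assumed, team-specific list).
MUTABLE_TAGS = {"latest", "main", "master", "dev"}


def is_pinned(image: str) -> bool:
    """True if the reference carries a digest or an explicit, non-mutable tag."""
    if "@sha256:" in image:
        return True
    # Only inspect the last path segment so a registry port such as
    # registry.example.com:5000 is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    _name, sep, tag = last_segment.partition(":")
    return bool(sep) and tag != "" and tag not in MUTABLE_TAGS


# Hypothetical references a promotion step might see.
candidates = [
    "registry.example.com:5000/payments-api",
    "payments-api:latest",
    "payments-api:2.3.1",
    "payments-api@sha256:0123abcd",
]
for ref in candidates:
    print(f"{ref:45} pinned={is_pinned(ref)}")
```

Running the check in CI before promotion removes the class of problem, whereas fixing the tag by hand only corrects one instance.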