Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Measure, benchmark, and instrument DORA metrics for production engineering teams.
Add DORA event emission to a GitHub Actions workflow.
Steps:

1. Identify which events to capture.
2. Detect: does a Prometheus Pushgateway exist in the stack? If not, it must be deployed before instrumentation can work:

   ```shell
   kubectl get svc -A | grep pushgateway
   ```

   If absent, deploy via Helm before proceeding:

   ```shell
   helm upgrade --install prometheus-pushgateway prometheus-community/prometheus-pushgateway \
     --namespace monitoring \
     --create-namespace
   ```

3. Generate GitHub Actions steps for each event type:
   - `dora_deployment_timestamp` and `dora_lead_time_seconds` to Pushgateway
   - `dora_incident_start_timestamp`
   - `dora_incident_duration_seconds` and `dora_incident_caused_by_deploy`

Output: exact YAML to append to the existing workflow, using the Pushgateway job name convention `job/dora/instance/<repo-owner_repo-name>`. Sanitize `owner/repo` → `owner_repo` to avoid breaking Pushgateway path segments.
Example deploy event step:

```yaml
- name: Push DORA deploy metrics
  if: success()
  env:
    PUSHGATEWAY_URL: ${{ secrets.PUSHGATEWAY_URL }}
    REPO: ${{ github.repository }}
  run: |
    DEPLOY_TS=$(date +%s)
    # Lead time from first commit in this batch; requires fetch-depth: 0 in checkout.
    FIRST_COMMIT_TS=$(git log --reverse --format="%ct" origin/main..HEAD | head -1)
    # Fall back to the deploy time (lead time 0) when the commit range is empty.
    FIRST_COMMIT_TS=${FIRST_COMMIT_TS:-$DEPLOY_TS}
    LEAD_TIME=$((DEPLOY_TS - FIRST_COMMIT_TS))
    # Replace "/" with "_" so the repo name is a single Pushgateway path segment.
    INSTANCE="${REPO//\//_}"
    cat <<EOF | curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/dora/instance/${INSTANCE}"
    # TYPE dora_deployment_timestamp gauge
    dora_deployment_timestamp{repo="${REPO}",env="production"} ${DEPLOY_TS}
    # TYPE dora_lead_time_seconds gauge
    dora_lead_time_seconds{repo="${REPO}",env="production"} ${LEAD_TIME}
    EOF
```

Warn: Change Failure Rate requires incident source integration. A rate of 0% without incident data is a configuration gap, not a real metric. Never report 0% CFR without confirmed incident source connectivity.
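Since Change Failure Rate depends on incident data, the incident-resolved event could be rendered in the same exposition format as the deploy metrics. The following is a minimal sketch, not the handbook's implementation: `render_incident_metrics` is a hypothetical helper, and the label set and the 0/1 encoding of `dora_incident_caused_by_deploy` are assumptions.

```shell
# Sketch: render incident metrics in the Prometheus text exposition format.
# render_incident_metrics is a hypothetical helper name; the labels and the
# 0/1 value for dora_incident_caused_by_deploy are assumptions.
render_incident_metrics() {
  local repo="$1" start_ts="$2" end_ts="$3" caused_by_deploy="$4"
  # Incident duration in seconds, from the two Unix timestamps.
  local duration=$((end_ts - start_ts))
  cat <<EOF
# TYPE dora_incident_duration_seconds gauge
dora_incident_duration_seconds{repo="${repo}",env="production"} ${duration}
# TYPE dora_incident_caused_by_deploy gauge
dora_incident_caused_by_deploy{repo="${repo}",env="production"} ${caused_by_deploy}
EOF
}

# Usage (assumes PUSHGATEWAY_URL, REPO and INSTANCE as in the deploy step):
# render_incident_metrics "$REPO" "$START_TS" "$END_TS" 1 |
#   curl --data-binary @- "${PUSHGATEWAY_URL}/metrics/job/dora/instance/${INSTANCE}"
```

Keeping the rendering separate from the `curl` call makes the payload easy to inspect before anything is pushed.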
Reference: references/dora.md → Open-source instrumentation pattern
Generate a Grafana dashboard for all four DORA metrics.
Steps:
1. Confirm the four recording rules are deployed before building the dashboard:
   - `dora:deployment_frequency:rate30d`
   - `dora:lead_time_seconds:p50`
   - `dora:change_failure_rate:ratio30d`
   - `dora:mttr_seconds:p50`

   Verify with:

   ```shell
   curl -s 'http://prometheus:9090/api/v1/query?query=dora:deployment_frequency:rate30d' | jq .
   ```

   If any query returns no data, check the recording rule deployment; see references/dora.md for the full rule set.

2. Dashboard structure: four panels arranged in a 2×2 grid, one per DORA metric. Each panel includes DORA performance band overlays as threshold regions (Elite/High/Medium/Low). A time range selector provides 30/60/90 day views.

3. Import the complete dashboard JSON from examples/dora/grafana-dashboard.json:

   ```shell
   # Import via Grafana API
   curl -s -X POST http://grafana:3000/api/dashboards/import \
     -H "Content-Type: application/json" \
     -d @examples/dora/grafana-dashboard.json
   ```

   Threshold values for each performance tier are embedded directly in the panel JSON as threshold bands. Update them in the panel thresholds field if your organisation uses different band definitions.
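The recording-rule check from the first step can be repeated for all four rules in one pass. This is a rough sketch under assumptions: `PROMETHEUS_URL`, `rule_query_url` and `check_dora_rules` are illustrative names that do not come from the handbook, and the check relies on `curl` and `jq` being installed.

```shell
# Sketch: check that each DORA recording rule returns data.
# PROMETHEUS_URL and both function names are assumptions for illustration.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://prometheus:9090}"

DORA_RULES="dora:deployment_frequency:rate30d dora:lead_time_seconds:p50 dora:change_failure_rate:ratio30d dora:mttr_seconds:p50"

# Build the instant-query URL for one recording rule.
rule_query_url() {
  printf '%s/api/v1/query?query=%s\n' "$PROMETHEUS_URL" "$1"
}

# Print every rule whose query returns an empty result; succeed only if none do.
check_dora_rules() {
  local rule count missing=0
  for rule in $DORA_RULES; do
    count="$(curl -s "$(rule_query_url "$rule")" | jq '.data.result | length')"
    if [ "${count:-0}" -eq 0 ]; then
      echo "no data: ${rule}"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Calling `check_dora_rules` before the import step gives a non-zero exit status when any rule is missing, so it can gate a scripted dashboard rollout.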
Reference: references/dora.md → DORA performance bands
Classify current metric values against DORA performance bands.
Steps:
1. Accept current metric values, either provided directly or queried from Prometheus:

   ```shell
   # Query current values
   curl -s 'http://prometheus:9090/api/v1/query?query=dora:deployment_frequency:rate30d'
   curl -s 'http://prometheus:9090/api/v1/query?query=dora:lead_time_seconds:p50'
   curl -s 'http://prometheus:9090/api/v1/query?query=dora:change_failure_rate:ratio30d'
   curl -s 'http://prometheus:9090/api/v1/query?query=dora:mttr_seconds:p50'
   ```

2. Map each metric to Elite / High / Medium / Low using the 2023 DORA performance bands:

   | Metric | Elite | High | Medium | Low |
   |---|---|---|---|---|
   | Deployment Frequency | Multiple/day | Weekly–monthly | Monthly–6 months | Less than once per 6 months |
   | Lead Time for Changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | > 1 month |
   | Change Failure Rate | < 5% | 5–10% | 10–15% | > 15% |
   | MTTR | < 1 hour | < 1 day | 1 day – 1 week | > 1 week |

3. Identify the weakest metric (the one furthest from Elite) as the highest-leverage improvement target.
4. Suggest the most impactful improvement for that specific metric.

Output: tier table + one-sentence recommendation per metric.
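The mapping in the table above can be expressed as small shell functions. This is only a sketch for two of the four metrics: the function names are hypothetical, the change failure rate is assumed to arrive as a ratio between 0 and 1, and where two bands share a boundary (10%, 15%, 1 week) the value is placed in the better tier, which is a choice the table itself does not make.

```shell
# Sketch: map metric values to the tiers in the performance-band table.
# classify_mttr and classify_cfr are hypothetical names. Shared boundaries
# (10%, 15%, 1 week) resolve to the better tier; that rule is an assumption.

# MTTR in seconds, e.g. the value of dora:mttr_seconds:p50.
classify_mttr() {
  awk -v s="$1" 'BEGIN {
    if (s < 3600) print "Elite"          # under 1 hour
    else if (s < 86400) print "High"     # under 1 day
    else if (s <= 604800) print "Medium" # 1 day to 1 week
    else print "Low"
  }'
}

# Change failure rate as a ratio between 0 and 1 (assumed unit).
classify_cfr() {
  awk -v r="$1" 'BEGIN {
    if (r < 0.05) print "Elite"
    else if (r <= 0.10) print "High"
    else if (r <= 0.15) print "Medium"
    else print "Low"
  }'
}
```

For example, `classify_mttr 7200` prints `High` and `classify_cfr 0.12` prints `Medium`. A ratio of exactly 0 would classify as Elite, so it should only be reported once incident source connectivity is confirmed.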
Reference: references/dora.md → DORA performance bands
Diagnose gaps in DORA metric data.
Steps:
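1. Missing deployment events

   Does the workflow trigger on the right event (`on: push` to main/release branch)?

   ```shell
   # Check the workflow trigger
   grep -A5 '^on:' .github/workflows/*.yaml
   # Check that Pushgateway is healthy
   curl http://pushgateway:9091/-/healthy
   # Check that Prometheus has an active Pushgateway scrape target
   curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job == "pushgateway")'
   ```

2. Missing MTTR (or stuck at 0)

   Are `repository_dispatch` events configured for the incident lifecycle?

   ```shell
   grep -r 'repository_dispatch' .github/workflows/
   ```

3. Change failure rate is exactly 0%

4. Metrics stop at a specific date

   ```shell
   kubectl describe pod -n monitoring -l app=prometheus-pushgateway | grep -i retention
   ```

   Pushgateway keeps pushed metrics only in memory, and loses them on restart, unless it is started with `--persistence.file`.

Reference: references/dora.md → Anti-pattern detection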
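A series that stopped at a specific date can also be detected from the data itself by comparing the newest deployment timestamp with the current time. The following is a hedged sketch: `is_stale` and its 7-day default are illustrative assumptions, and the commented query assumes the Prometheus address used elsewhere in this document.

```shell
# Sketch: flag deployment metrics whose newest timestamp is too old.
# is_stale and the 7-day default are assumptions for illustration.
is_stale() {
  local last_ts="$1" now_ts="$2" max_age_days="${3:-7}"
  # True when the newest event is older than the allowed age.
  [ $((now_ts - last_ts)) -gt $((max_age_days * 86400)) ]
}

# Usage (requires curl and jq):
# last="$(curl -s 'http://prometheus:9090/api/v1/query?query=max(dora_deployment_timestamp)' \
#   | jq -r '.data.result[0].value[1]')"
# is_stale "$last" "$(date +%s)" && echo "no deployment events in the last 7 days"
```

A stale result narrows the search to the push path (workflow trigger, Pushgateway health, persistence) rather than the dashboard.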
After completing any DORA mode, log findings while context is fresh:
- `ERR` in .learnings/ERRORS.md
- `LRN` in .learnings/LEARNINGS.md

Use `/platform-skills:self-improve log` for each entry. Do not defer to end of session.