
nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.


.github/copilot-instructions.md

Platform Engineering Instructions for GitHub Copilot

Version: 1.28.0

Source: https://github.com/nitinjain999/platform-skills

Scope: project-level — applies to every Copilot Chat session in this workspace

Install: ./install.sh --copilot --target /path/to/your-project

Upgrade: git pull in the platform-skills clone → rerun install.sh or copy this file → commit

Use when troubleshooting, implementing, reviewing, or auditing platform infrastructure as a system — where Kubernetes, GitOps, CI/CD, and security concerns intersect. Apply these patterns when generating or reviewing code across Kubernetes, Flux CD, Argo CD, Terraform, GitHub Actions (composite actions, OIDC, SHA pinning), AWS, Azure, GKE, Linkerd, KEDA, supply chain security (Cosign, SBOM, SLSA), Falco, Chaos Engineering, DORA metrics, Datadog/Dynatrace/LLM observability, SOC 2, and PR review. Every answer includes blast radius, validation steps, and rollback plan.

If the user mentions "platform-skills", asks for a platform/DevOps/cloud/SRE review, or asks for production readiness, apply these instructions even if they do not reference this file explicitly. In Copilot Chat, slash commands such as /platform-skills:review are not required; translate them into the equivalent natural-language workflow.

Response Contract

  • For reviews: lead with findings ordered by severity, then give validation and rollback notes
  • For generation: include runnable code, assumptions, validation commands, and rollback path
  • For troubleshooting: use evidence-first diagnosis before proposing changes
  • For risky production changes: state blast radius before the change plan
  • For uncertain or environment-specific advice: ask for the missing fact or mark the assumption clearly

Core Principles

  • Production-first: Always include blast radius, rollback plan, and validation steps
  • Root-cause over symptoms: Explain why a problem occurs, not just how to fix it
  • Least privilege by default: Never suggest wildcard IAM actions or resources
  • GitOps pull model: Cluster state lives in Git, not in pipeline scripts
  • Explicit over implicit: Make security choices, environment differences, and promotion flows visible

Layer Ownership

When generating infrastructure code, respect these boundaries:

| Layer | Owns | Does NOT Own |
|---|---|---|
| Terraform | Cloud primitives, cluster bootstrap, IAM, networking, secrets backends | In-cluster workloads, Helm releases |
| Flux / Argo CD | In-cluster state, Helm releases, workload promotion | Cloud resources, IAM roles |
| GitHub Actions | CI checks, plan gates, artifact publish, promotion triggers | Long-lived environment state |
| Kubernetes | Workload specs, RBAC, network policy, resource limits | Cloud account structure |

Kubernetes & OpenShift

Always generate workloads with:

resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    memory: "256Mi"        # Always set a memory limit; omit the CPU limit, since CPU limits cause throttling

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

securityContext:
  runAsNonRoot: true
  runAsUser: 1000            # Omit on OpenShift — SCC assigns the UID
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop: ["ALL"]
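
The baseline above can be turned into a small pre-merge check. The following is a minimal sketch, assuming the manifest has already been parsed into a Python dict and that all settings live at container level, as in the snippet above. The function name `check_container` and the message wording are hypothetical.

```python
from typing import Any


def check_container(container: dict[str, Any]) -> list[str]:
    """Return baseline violations for one parsed container spec."""
    problems: list[str] = []
    resources = container.get("resources", {})
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})

    # Requests for both cpu and memory, a memory limit, and no cpu limit.
    for key in ("cpu", "memory"):
        if key not in requests:
            problems.append(f"missing resources.requests.{key}")
    if "memory" not in limits:
        problems.append("missing resources.limits.memory")
    if "cpu" in limits:
        problems.append("resources.limits.cpu is set (baseline omits it)")

    # Both probes must be present; their contents are not inspected here.
    for probe in ("livenessProbe", "readinessProbe"):
        if probe not in container:
            problems.append(f"missing {probe}")

    # Security context settings from the baseline.
    sc = container.get("securityContext", {})
    if sc.get("runAsNonRoot") is not True:
        problems.append("securityContext.runAsNonRoot must be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        problems.append("securityContext.allowPrivilegeEscalation must be false")
    if sc.get("readOnlyRootFilesystem") is not True:
        problems.append("securityContext.readOnlyRootFilesystem must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        problems.append("securityContext.capabilities.drop must include ALL")
    return problems
```

The check deliberately ignores `runAsUser`, since the baseline omits it on OpenShift.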

Flux CD

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
spec:
  interval: 10m
  chart:
    spec:
      version: "1.2.3"   # Never use "*" or ranges
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
      remediateLastFailure: true
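
The "never use `*` or ranges" rule for `version` can be checked mechanically. This is a rough sketch, assuming chart versions are plain semantic versions without a leading `v`. The helper name `is_pinned_chart_version` is hypothetical.

```python
import re

# Exact semantic version: MAJOR.MINOR.PATCH with optional pre-release and build parts.
EXACT_VERSION_RE = re.compile(
    r"^\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def is_pinned_chart_version(version: str) -> bool:
    """Return True only for an exact version, never a wildcard or range."""
    return EXACT_VERSION_RE.match(version) is not None
```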

Troubleshooting order: source → artifact → reconciliation → chart rendering → runtime

Prefer the current Flux CD API shape:

  • Use chartRef for OCIRepository-backed charts when available
  • Keep source auth, image automation, and workload reconciliation ownership separate
  • For stuck reconciliations, inspect GitRepository/OCIRepository, artifact status, Kustomization, HelmRelease, and controller logs in that order

Argo CD

spec:
  project: platform          # Never leave as "default"
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - PruneLast=true

AWS IAM

// ❌ Never generate
{ "Action": "s3:*", "Resource": "*" }

// ✅ Always generate
{
  "Action": ["s3:GetObject", "s3:ListBucket"],
  "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]
}
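
One way to enforce the wildcard rule in review tooling is sketched below. It assumes a complete policy document with a top-level `Statement` key, which the fragments above omit. The function name `find_wildcards` is hypothetical. Only a bare `*` resource is flagged, so object ARNs such as `arn:aws:s3:::my-bucket/*` still pass.

```python
import json
from typing import Any


def _as_list(value: Any) -> list[Any]:
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcards(policy_json: str) -> list[str]:
    """Report wildcard actions and resources in Allow statements."""
    policy = json.loads(policy_json)
    findings: list[str] = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") == "Deny":
            continue  # wildcards in Deny statements narrow access, so skip them
        for action in _as_list(statement.get("Action")):
            if action == "*" or action.endswith(":*"):
                findings.append(f"Statement[{index}]: wildcard Action {action!r}")
        for resource in _as_list(statement.get("Resource")):
            if resource == "*":
                findings.append(f"Statement[{index}]: wildcard Resource {resource!r}")
    return findings
```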

Always prefer IRSA (EKS), Workload Identity (AKS), or Workload Identity Federation/WIF (GKE) over static credentials.

Terraform

Module structure: main.tf, variables.tf (with validation blocks), outputs.tf, versions.tf, README.md.

Always include variable validation:

variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "environment must be dev, staging, or production"
  }
}

Pipeline order: terraform fmt -check → terraform validate → tflint → checkov/tfsec → plan.
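
The pipeline order above amounts to "run each step in sequence and stop at the first failure". A minimal sketch of that behavior follows. The step order comes from the text, but the choice of checkov over tfsec, the `-d .` argument, and the names `TERRAFORM_PIPELINE` and `run_pipeline` are assumptions for illustration.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Step order from the text; exact arguments are assumptions.
TERRAFORM_PIPELINE: list[list[str]] = [
    ["terraform", "fmt", "-check"],
    ["terraform", "validate"],
    ["tflint"],
    ["checkov", "-d", "."],
    ["terraform", "plan"],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> tuple[list[str], Optional[str]]:
    """Run steps in order; return (steps executed, first failing step or None)."""
    executed: list[str] = []
    for cmd in steps:
        name = " ".join(cmd)
        executed.append(name)
        if runner(cmd) != 0:
            return executed, name  # later steps never run
    return executed, None
```

Passing a custom `runner` lets the ordering logic be exercised without the real tools installed.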

GitHub Actions

# ❌ Never
- uses: actions/checkout@v4

# ✅ Always — pin to full SHA
- uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11  # v4.1.1

permissions:
  contents: read
  id-token: write   # only if OIDC is required

Never use pull_request_target with code checkout from forks.
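
The SHA-pinning rule can be linted with a short script. This is a sketch under simple assumptions: workflow text is scanned line by line with a regular expression rather than a YAML parser, quoted `uses:` values are not handled, and local (`./`) and `docker://` references are skipped. The function name `unpinned_actions` is hypothetical.

```python
import re

# Capture the reference after "uses:", stopping at whitespace or a comment.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)", re.MULTILINE)
# A pinned reference ends with "@" followed by a full 40-character commit SHA.
FULL_SHA_RE = re.compile(r"@[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full commit SHA."""
    unpinned: list[str] = []
    for ref in USES_RE.findall(workflow_text):
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # out of scope for this sketch
        if not FULL_SHA_RE.search(ref):
            unpinned.append(ref)
    return unpinned
```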

Helm

Validation pipeline order: helm lint --strict → helm template --debug → kubeconform -strict -summary → checkov → helm test.

Never use helm upgrade --set to pass secrets. selectorLabels must NOT include app.kubernetes.io/version: a Deployment's spec.selector is immutable after creation, so a label that changes with every release breaks upgrades.

KEDA

For event-driven autoscaling, require explicit minReplicaCount, maxReplicaCount, cooldown, fallback behavior where supported, and trigger authentication ownership. For AWS triggers on EKS, prefer IRSA over static keys. For Prometheus triggers, validate query cardinality and failure behavior before recommending scale-to-zero.
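
The KEDA requirements above can be expressed as a review helper. The sketch below assumes a ScaledObject manifest already parsed into a dict and maps "cooldown" to the `cooldownPeriod` field. The function name `check_scaled_object` and the note wording are hypothetical, and the notes are prompts for a reviewer rather than hard failures, since not every trigger needs authentication or supports fallback.

```python
from typing import Any


def check_scaled_object(manifest: dict[str, Any]) -> list[str]:
    """Return review notes for a parsed KEDA ScaledObject manifest."""
    spec = manifest.get("spec", {})
    notes: list[str] = []

    # Scaling bounds and cooldown should be explicit, not left to defaults.
    for field in ("minReplicaCount", "maxReplicaCount", "cooldownPeriod"):
        if field not in spec:
            notes.append(f"spec.{field} is not set explicitly")
    if "fallback" not in spec:
        notes.append("spec.fallback is not set; confirm the scaler does not support it")

    # Every trigger should make credential ownership visible.
    triggers = spec.get("triggers", [])
    for index, trigger in enumerate(triggers):
        if "authenticationRef" not in trigger:
            notes.append(
                f"spec.triggers[{index}] has no authenticationRef; confirm who owns its credentials"
            )

    # Scale-to-zero on a Prometheus query needs extra validation first.
    scales_to_zero = spec.get("minReplicaCount") == 0
    if scales_to_zero and any(t.get("type") == "prometheus" for t in triggers):
        notes.append(
            "prometheus trigger with scale-to-zero; validate query cardinality and failure behavior"
        )
    return notes
```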

Supply Chain and Runtime Security

For container and deployment pipelines, prefer:

  • Cosign keyless signing and verification
  • Syft SBOM generation and attestation
  • Trivy or Grype severity gates
  • SLSA provenance for build artifacts
  • Kyverno image verification at admission

For runtime security, treat Falco findings as signals that need triage, ownership, severity, and follow-up policy where appropriate. Do not convert every runtime signal directly into a blocking admission policy without checking noise, scope, and rollout plan.

Kyverno (policies.kyverno.io/v1)

Always use the new CEL-based policy types — never kyverno.io/v1 ClusterPolicy for new work:

apiVersion: policies.kyverno.io/v1
kind: ValidatingPolicy
metadata:
  name: require-team-labels
  annotations:
    policies.kyverno.io/title: Require team labels
    policies.kyverno.io/severity: medium
spec:
  validationActions: [Audit]   # Always start in Audit; promote to Deny after zero violations
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        resources: ["deployments"]
        operations: ["CREATE", "UPDATE"]
  matchConditions:
    - name: exclude-system-namespaces
      expression: "!(['kube-system','kube-public','flux-system'].exists(ns, ns == object.metadata.namespace))"
  validations:
    - expression: "object.metadata.labels != null && 'app.kubernetes.io/team' in object.metadata.labels"
      message: "Deployment must have app.kubernetes.io/team label"

Promotion: kubectl patch validatingpolicy <name> --type merge -p '{"spec":{"validationActions":["Deny"]}}' — only after confirmed zero violations in PolicyReport.

Never generate:

  • validationFailureAction: Enforce — use validationActions: [Deny] instead
  • spec.rules[].match.any[].resources — use matchConstraints.resourceRules instead

OPA / Conftest (Rego)

# METADATA
# title: IAM least privilege
# entrypoint: true
package terraform.iam

import rego.v1

deny contains msg if {
    some name
    policy := input.resource.aws_iam_policy[name]
    some statement in policy.policy.Statement
    statement.Action == "*"
    msg := sprintf("IAM policy '%s' must not use wildcard Action", [name])
}

Always add import rego.v1. Rules must be named deny, warn, or violation. Pipeline: conftest fmt --check → regal lint → conftest verify → conftest test.

SOC 2 Compliance (Terraform)

Never generate:

  • publicly_accessible = true on RDS, Redshift, or OpenSearch
  • encrypted = false on any storage resource
  • skip_final_snapshot = true on production databases
  • deletion_protection = false on production databases
  • is_multi_region_trail = false on CloudTrail
  • enable_log_file_validation = false on CloudTrail
  • enable = false on GuardDuty
  • Images tagged :latest

Always include:

resource "aws_kms_key" "..." { enable_key_rotation = true }          # CC6.7
resource "aws_cloudtrail" "..." {
  is_multi_region_trail      = true
  enable_log_file_validation = true
}                                                                      # CC7.2
resource "aws_guardduty_detector" "..." { enable = true }             # CC7.1
resource "aws_db_instance" "..." {
  backup_retention_period = 35
  deletion_protection     = true
}                                                                      # A1.2
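
Part of the "Never generate" list can be caught with a plain text scan before a proper policy engine runs. The sketch below is line-based and naive: it does not parse HCL, it cannot tell production from non-production resources, and it leaves out GuardDuty's `enable` because the bare attribute name is too ambiguous without resource context. The names `FORBIDDEN` and `scan_terraform` are hypothetical.

```python
import re

# Attribute values taken from the "Never generate" list above.
FORBIDDEN = {
    "publicly_accessible": "true",
    "encrypted": "false",
    "skip_final_snapshot": "true",
    "deletion_protection": "false",
    "is_multi_region_trail": "false",
    "enable_log_file_validation": "false",
}
ASSIGN_RE = re.compile(r"\b([a-z_]+)\s*=\s*(true|false)\b")


def scan_terraform(text: str) -> list[str]:
    """Return findings as 'line N: ...' strings for forbidden settings."""
    findings: list[str] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive comment stripping
        for name, value in ASSIGN_RE.findall(code):
            if FORBIDDEN.get(name) == value:
                findings.append(f"line {lineno}: {name} = {value}")
        if ":latest" in code:
            findings.append(f"line {lineno}: image tagged :latest")
    return findings
```

A scan like this is only a first gate; the OPA and Checkov steps described elsewhere in this file remain the authoritative checks.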

Conventional Commits

Subject line: <type>(<scope>): <imperative WHY, ≤72 chars, lowercase, no period>. Types: feat, fix, refactor, chore, ci, docs, test, perf. Never add Co-authored-by: Claude or any AI attribution.
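
The subject-line format can be validated in a commit hook. The following sketch makes three assumptions that the text does not settle: the scope is treated as optional, the 72-character limit applies to the whole subject line, and "lowercase" is read as "the description starts with a lowercase letter". The function name `check_subject` is hypothetical.

```python
import re

TYPES = ("feat", "fix", "refactor", "chore", "ci", "docs", "test", "perf")
SUBJECT_RE = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^()]+)\))?: (?P<desc>.+)$"
)


def check_subject(subject: str) -> list[str]:
    """Return problems found in a commit subject line; empty means it passes."""
    match = SUBJECT_RE.match(subject)
    if match is None:
        return ["subject does not match <type>(<scope>): <description>"]
    problems: list[str] = []
    if match["type"] not in TYPES:
        problems.append(f"unknown type {match['type']!r}")
    if len(subject) > 72:
        problems.append("subject is longer than 72 characters")
    description = match["desc"]
    if description[0].isupper():
        problems.append("description must start lowercase")
    if description.endswith("."):
        problems.append("description must not end with a period")
    return problems
```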

PR Review dimensions

When reviewing any PR that touches infrastructure, check all six dimensions:

  1. Cost — replica count, instance type, storage, NAT Gateways, data transfer
  2. Drift — dev/staging/prod overlay and values file alignment
  3. Ownership — CODEOWNERS coverage, team labels, Terraform README
  4. Compliance — SOC 2 CC6.1–CC8.1 control impact
  5. Upgrade — deprecated K8s APIs, loose Terraform provider constraints, :latest images
  6. Rollback — score each change: FULL / PARTIAL / MANUAL / NONE × LOCAL / CLUSTER / PLATFORM / DATA
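
The two-axis rollback score in item 6 can be modeled as a pair of enums so that review tooling rejects anything outside the eight listed values. This is a sketch only: the axis names `Reversibility` and `BlastScope` and the helper `parse_score` are hypothetical labels, not terms from the handbook.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    FULL = "FULL"
    PARTIAL = "PARTIAL"
    MANUAL = "MANUAL"
    NONE = "NONE"


class BlastScope(Enum):
    LOCAL = "LOCAL"
    CLUSTER = "CLUSTER"
    PLATFORM = "PLATFORM"
    DATA = "DATA"


@dataclass(frozen=True)
class RollbackScore:
    reversibility: Reversibility
    scope: BlastScope

    def label(self) -> str:
        """Render the score in the 'X × Y' form used in reviews."""
        return f"{self.reversibility.value} × {self.scope.value}"


def parse_score(text: str) -> RollbackScore:
    """Parse a score such as 'PARTIAL x CLUSTER'; raise ValueError if invalid."""
    parts = re.split(r"\s+[xX×]\s+", text.strip())
    if len(parts) != 2:
        raise ValueError(f"expected '<reversibility> x <scope>', got {text!r}")
    left, right = (part.upper() for part in parts)
    return RollbackScore(Reversibility(left), BlastScope(right))
```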

Troubleshooting Structure

  1. Symptom — exact error and observable behavior
  2. Evidence to collect — exact commands to run
  3. Root cause — why this happens
  4. Fix — specific change with justification
  5. Validation — how to verify it worked
  6. Prevention — how to avoid in future
  7. Rollback — how to safely undo

Reference Files

  • references/kubernetes.md — cluster baselines, RBAC, network policy
  • references/openshift.md — routes, SCCs, operators
  • references/fluxcd.md — GitOps reconciliation, troubleshooting
  • references/fluxcd-helmrelease.md — HelmRelease chartRef, drift detection, remediation
  • references/fluxcd-kustomization.md — CEL readyExpr, postBuild, SOPS, SSA annotations
  • references/fluxcd-security.md — source auth, OCI supply chain, RBAC, image automation
  • references/fluxcd-troubleshooting.md — controller-specific incident diagnosis
  • references/argocd.md — app design, ApplicationSets
  • references/aws.md — IAM, EKS, account model
  • references/aws-cloudfront.md — CloudFront distributions, OAC, cache/security policies
  • references/aws-waf.md — Web ACLs, managed rules, rate limiting, Shield/FMS
  • references/aws-mcp-profiles.md — AWS MCP profile management and multi-account auth
  • references/azure.md — AKS, workload identity, RBAC
  • references/terraform.md — module design, state, testing
  • references/github-actions.md — workflow security, OIDC
  • references/composite-actions.md — composite actions patterns, multi-cloud k8s deploy (EKS/AKS/GKE OIDC), private repo access, reusable-workflow decision guide
  • references/platform-operating-model.md — cross-cutting architecture
  • references/platform-mindset.md — DevEx, RFC/ADR, incident communication
  • references/secrets.md — External Secrets Operator, Sealed Secrets, provider setup
  • references/compliance.md — SOC 2 controls in Terraform
  • references/helm.md — chart scaffolding, lint pipeline
  • references/mcp.md — MCP protocol, TypeScript/Python SDKs
  • references/observability.md — logging, metrics, tracing, alerting
  • references/documentation.md — docstrings, OpenAPI 3.1, doc sites
  • references/datadog.md — Agent setup, APM, monitors, SLOs
  • references/dynatrace.md — Operator, instrumentation, SLOs
  • references/conventional-commits.md — commit spec, tooling
  • references/opa.md — Rego v1, rule types, testing, Conftest CLI
  • references/kyverno.md — CEL policies, Audit→Deny, PolicyException
  • references/pr-review.md — cost, drift, ownership, compliance, upgrade, rollback
  • references/keda.md — ScaledObject, ScaledJob, TriggerAuthentication, all scalers, IRSA, GitOps integration
  • references/agent-self-improve.md — .learnings/ setup, WAL protocol, VFM scoring, proactive agent behavior
  • references/supply-chain.md — Cosign signing, Syft SBOM, Trivy CVE gates, SLSA Level 2, Kyverno enforcement
  • references/runtime-security.md — Falco eBPF, custom rules, Falcosidekick alert routing, Kyverno bridge
  • references/chaos.md — Litmus Chaos v3, Chaos Mesh v2, steady-state hypothesis, GameDay workflow
  • references/dora.md — Deployment Frequency, Lead Time, Change Failure Rate, MTTR — GitHub Actions + Prometheus
  • references/llm-observability.md — Datadog LLMObs instrumentation, eval bootstrap, trace RCA
  • references/awesome-docs.md — animated GitHub-safe SVG doc generation, 4 patterns, timing math, GitHub constraints
  • references/renovate.md — dependency update automation, regex managers, private registries
  • examples/ — working, production-ready code examples
