Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
Every /platform-skills:<command> slash command available in Claude Code, with all modes, what to pass, and what to expect back.
```
claude plugin install platform-skills
# Then restart Claude Code
```

Commands work in any conversation: type the slash command, or describe your problem and the skill activates automatically.

| Command | What it's for |
|---|---|
| /platform-skills:review | Production-readiness review of any config |
| /platform-skills:debug | Structured troubleshooting for any symptom |
| /platform-skills:terraform | Terraform validation pipeline + blast radius |
| /platform-skills:gitops | Flux CD / Argo CD — debug live cluster issues or audit a GitOps repo |
| /platform-skills:linkerd | Linkerd mTLS, injection, policy, multi-cluster |
| /platform-skills:linux | Linux, DNS, load balancing, VPC/VNet, networking |
| /platform-skills:helmcheck | Helm chart scaffold, review, security audit |
| /platform-skills:commit | Conventional commit message generation |
| /platform-skills:observability | Instrument, alert, dashboard, load test, capacity |
| /platform-skills:opa | OPA/Conftest Rego policy generate, test, validate |
| /platform-skills:kyverno | Kyverno policy generate, test, audit, debug, migrate |
| /platform-skills:compliance | SOC 2 gap analysis and Terraform remediation |
| /platform-skills:datadog | Datadog setup, APM, monitors, SLOs, incidents |
| /platform-skills:dynatrace | Dynatrace Operator, OneAgent, SLOs, incidents |
| /platform-skills:document | Docstrings, OpenAPI specs, docs sites, guides |
| /platform-skills:mcp | MCP server scaffold, review, debug |
| /platform-skills:aws-profile | AWS profile management for MCP servers — discover, switch, login, org-scan |
| /platform-skills:product | DevEx, RFC/ADR, post-mortems, capacity, cost |
| /platform-skills:pr-review | Comprehensive PR risk review |
| /platform-skills:triage | Triage and resolve PR comments |
| /platform-skills:keda | KEDA ScaledObject/ScaledJob — generate, debug, review, scale |
| /platform-skills:self-improve | Bootstrap, log, review, or promote agent self-improvement entries |
| /platform-skills:chaos | Install Litmus Chaos or Chaos Mesh, generate fault experiments, schedule chaos, run GameDay, debug, report |
| /platform-skills:dora | Instrument DORA metrics, generate Grafana dashboards, benchmark against performance bands, debug metric gaps |
| /platform-skills:awesome-docs | Generate any animated Markdown doc (README, architecture guide, runbook, tutorial, RFC, post-mortem, or custom), convert existing Markdown, update/diff/audit, preview, export |
| /platform-skills:aws | CloudFront, WAF, Lambda@Edge, Firewall Manager multi-account enforcement, and Terraform module generation |
| /platform-skills:composite-actions | Generate, review, secure, and test composite GitHub Actions |
| /platform-skills:fluxcd | FluxCD entry point — routes to debug, audit, or helm review based on your input |
| /platform-skills:renovate | Generate renovate.json (with private registry support), pre-commit hook, or GHA validation workflow |

/platform-skills:review

What it does: Senior-engineer production-readiness review of any platform config. Evaluates correctness → security → operational safety → deprecations. Returns findings as Critical / Improvement / Note.

Works on: Kubernetes manifests, Terraform modules, GitHub Actions workflows, Helm values, RBAC configs, network policies, Dockerfiles, any YAML.

```
/platform-skills:review [paste file content or describe what to review]
```

What gets checked:

| Priority | Checks |
|---|---|
| 1. Correctness | API versions, required fields, label/namespace consistency, will it do what it intends? |
| 2. Security | Least-privilege RBAC/IAM, no plaintext secrets, non-root containers, SHA-pinned actions, scoped IAM |
| 3. Operational safety | Rollback path, blast radius, resource limits, liveness/readiness probes, GitOps prune behaviour |
| 4. Deprecations | Deprecated APIs, action versions, fields that will break on next minor version |
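
A few of the checks in the table above can be approximated mechanically. The sketch below is illustrative only and is not the skill's implementation: it takes an already-parsed Deployment manifest (a plain Python dict) and flags plaintext secrets, mutable image tags, and missing resource limits. The function name `review_deployment`, the `SECRET_HINTS` keyword list, and the severity assigned to each finding are assumptions made for this example.

```python
# Hypothetical sketch: a few mechanical checks on a parsed Deployment manifest.
SECRET_HINTS = ("PASSWORD", "SECRET", "TOKEN", "KEY")  # assumed keyword list


def review_deployment(manifest):
    """Return (severity, message) findings for a Deployment given as a dict."""
    findings = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")

        # Security: a secret-looking env var with a literal value is plaintext.
        for env in container.get("env", []):
            env_name = env.get("name", "")
            if "value" in env and any(h in env_name.upper() for h in SECRET_HINTS):
                findings.append(("Critical", f"{name}: plaintext secret in env var {env_name}"))

        # Operational safety: a missing or 'latest' tag makes rollbacks ambiguous.
        last_segment = container.get("image", "").rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            findings.append(("Improvement", f"{name}: image tag is missing or 'latest'"))

        # Operational safety: no resource limits set on the container.
        if not container.get("resources", {}).get("limits"):
            findings.append(("Improvement", f"{name}: no resource limits"))
    return findings


# The orders-service manifest from the examples in this section, as a dict.
manifest = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "orders-service"},
    "spec": {
        "replicas": 1,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "app",
                        "image": "my-registry/orders:latest",
                        "env": [{"name": "DB_PASSWORD", "value": "hunter2"}],
                    }
                ]
            }
        },
    },
}

findings = review_deployment(manifest)
for severity, message in findings:
    print(f"[{severity}] {message}")
```

A real review covers far more than this (API versions, RBAC, probes, deprecations); the point is only to show how a table row turns into a concrete finding.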
Examples:
Review a Deployment manifest by pasting it inline:

```
/platform-skills:review
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-service
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: app
          image: my-registry/orders:latest
          env:
            - name: DB_PASSWORD
              value: "hunter2"
```

Review a GitHub Actions workflow file by path:

```
/platform-skills:review .github/workflows/deploy.yml
```

Review a Terraform IAM module inline:

```
/platform-skills:review
resource "aws_iam_role_policy" "app" {
  policy = jsonencode({
    Statement = [{ Effect = "Allow", Action = "*", Resource = "*" }]
  })
}
```

Review Helm values against a specific chart:

```
/platform-skills:review my values.yaml for the ingress-nginx chart: are resource limits set? Is the service account locked down?
```

What comes back: A structured report grouped by severity. Critical items block merge; Improvements are non-blocking; Notes are informational.

/platform-skills:debug

What it does: Structured troubleshooting using a 6-step framework: classify layer → collect evidence → root-cause hypothesis → fix → validate → rollback. It requires gathering evidence before a fix is applied.

Works on: Any platform symptom across Terraform, Kubernetes, OpenShift, Flux CD, Argo CD, Linkerd, GitHub Actions, AWS/Azure, and secrets management.

```
/platform-skills:debug [symptom or error message]
```

The output follows the six steps above, in order.

Examples:
Pod stuck in CrashLoopBackOff:

```
/platform-skills:debug orders-service pod CrashLoopBackOff: exit code 137
```

Flux reconciliation not picking up merged changes:

```
/platform-skills:debug Flux Kustomization stuck in NotReady, "context deadline exceeded", changes merged 20 minutes ago but cluster not updated
```

GitHub Actions OIDC failing:

```
/platform-skills:debug GitHub Actions OIDC error: "Error assuming role with web identity: not authorized to perform sts:AssumeRoleWithWebIdentity"
```

Mysterious 503s after a deploy:

```
/platform-skills:debug 503 errors spiked immediately after deploying v2.3.0 of payments-service, rolling back didn't fully stop them
```

AWS resource creation failing:

```
/platform-skills:debug Terraform apply fails: "Error creating IAM role: LimitExceeded: Cannot exceed quota for InstanceSessionsPerInstanceProfile: 1"
```

/platform-skills:terraform

What it does: Full Terraform review covering the validation pipeline, blast radius analysis, IAM/security audit, state impact, and module design, in that order.

```
/platform-skills:terraform [paste terraform code, plan output, or describe the change]
```

The 6-section output:

| Section | What it covers |
|---|---|
| 1. Validation pipeline | fmt → validate → tflint → tfsec/checkov — pass/fail per gate with exact errors |
| 2. Blast radius | What gets created/modified/destroyed, what gets replaced (destructive), downstream dependencies, mid-apply failure impact |
| 3. IAM and security | Wildcard actions/resources, default_tags enforcement, sensitive variables, state backend encryption |
| 4. State impact | Migration requirements, unmanaged resources that could conflict, state isolation by environment |
| 5. Module design | Variable validation blocks, typed outputs, provider placement in caller not module |
| 6. Recommended actions | Exact HCL snippets for each fix |
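
To make the blast-radius section concrete, here is a minimal sketch (not the skill's implementation) that groups a plan's changes into create, update, replace, and destroy. It assumes the plan has been exported with `terraform show -json <planfile>` and relies only on the `resource_changes[].address` and `resource_changes[].change.actions` fields of that output. The function name `blast_radius` and the sample resource addresses are made up for illustration.

```python
import json
from collections import defaultdict


def blast_radius(plan):
    """Group resource addresses by the kind of change the plan makes."""
    summary = defaultdict(list)
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "create" in actions and "delete" in actions:
            kind = "replace"  # destroy-and-recreate: the destructive case
        elif actions == ["delete"]:
            kind = "destroy"
        elif actions == ["create"]:
            kind = "create"
        elif actions == ["update"]:
            kind = "update"
        else:
            continue  # "no-op" and "read" do not change infrastructure
        summary[kind].append(rc["address"])
    return dict(summary)


# Made-up plan fragment for illustration; real input would come from
# `terraform show -json <planfile>`.
sample = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_subnet_group.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_policy.app", "change": {"actions": ["update"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}}
  ]
}
""")

result = blast_radius(sample)
print(result)
```

Anything that lands under `replace` or `destroy` is where downstream dependencies and mid-apply failure impact deserve the closest look.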
Examples:
Review an IAM policy for wildcards:

```
/platform-skills:terraform
resource "aws_iam_policy" "app" {
  name = "app-policy"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:*"]
      Resource = "*"
    }]
  })
}
```

Blast radius check before applying:

```
/platform-skills:terraform I'm about to run terraform apply on a change that modifies aws_db_subnet_group: what could break and how do I validate it?
```

Review a Terraform plan output:

```
/platform-skills:terraform [paste output of terraform plan]
```

Check a module for reuse quality:

```
/platform-skills:terraform review my EKS module: does it follow module design best practices? Are variables validated?
```

Check state isolation strategy:

```
/platform-skills:terraform we have one state file for all environments in s3://company-tfstate, is this a problem?
```

/platform-skills:gitops

What it does: Two modes for all GitOps work.

- debug, for live cluster issues: five structured Flux workflows (installation, source, HelmRelease, Kustomization, ResourceSet) plus Argo CD diagnostics, producing a 5-section report with root cause and rollback.
- audit, for repo health: 6-phase read-only analysis (discovery, manifest validation, API compliance, best practices, security) producing a prioritized Critical / Warning / Info report.

Usage: /platform-skills:gitops <mode>

```
/platform-skills:gitops debug [describe symptom or paste flux/argocd output]
/platform-skills:gitops audit [repo path or paste directory listing]
```

Examples:

Flux Kustomization not ready:

```
/platform-skills:gitops debug flux get kustomizations -A shows apps NotReady: "dependency not found: infrastructure"
```

HelmRelease stuck on upgrade:

```
/platform-skills:gitops debug HelmRelease orders-service stuck in "upgrade retries exhausted"
```

Argo CD perpetually OutOfSync:

```
/platform-skills:gitops debug ArgoCD application shows OutOfSync despite successful manual sync
```

Audit a repo before a release:

```
/platform-skills:gitops audit ./clusters
```

Security-focused audit:

```
/platform-skills:gitops audit - check for plain secrets and missing Cosign verification
```

/platform-skills:linkerd

What it does: Linkerd-specific diagnostics across 8 problem classes, with exact linkerd CLI evidence commands, root-cause hypothesis, fix, and validation using linkerd viz edges / linkerd check.

```
/platform-skills:linkerd [describe the Linkerd symptom or paste linkerd check / viz output]
```

Problem classes:

| Class | What it diagnoses |
|---|---|
| Injection | Proxies not being injected; annotation vs namespace annotation conflicts |
| mTLS | Edges showing plaintext; certificate expiry; trust anchor mismatch |
| Authorization policy | Traffic denied by Server/AuthorizationPolicy; identity string mismatches |
| Observability | Missing metrics; PodMonitor selector not matching; linkerd viz not showing data |
| Traffic management | HTTPRoute not splitting traffic; retries not firing; timeout not respected |
| Multi-cluster | Mirrored services unreachable; gateway health; firewall blocking port 4143 |
| Performance | High proxy latency; proxy CPU/memory pressure |
| Control plane | identity/destination/proxy-injector component failures |
Evidence commands provided:

```
linkerd check
linkerd check --proxy
linkerd viz edges deployment -n <namespace>
linkerd viz stat deploy -n <namespace>
linkerd viz tap deploy/<name> -n <namespace>
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.linkerd\.io/proxy-version}{"\n"}{end}'
kubectl get secret linkerd-identity-issuer -n linkerd -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
```

Examples:

mTLS not working between services:

```
/platform-skills:linkerd linkerd viz edges shows plaintext between orders-service and payments-service
```

Proxy injection not happening:

```
/platform-skills:linkerd pods in the checkout namespace are not getting the linkerd-proxy sidecar after I added the namespace annotation
```

Traffic not splitting on HTTPRoute:

```
/platform-skills:linkerd HTTPRoute for canary deploy is configured 90/10 but all traffic is going to stable, no traffic to canary pods
```

Multi-cluster service unreachable:

```
/platform-skills:linkerd mirrored service payments-service-remote is unreachable from the primary cluster: gateway shows healthy but curl times out
```

Certificate expiry:

```
/platform-skills:linkerd linkerd check shows "Certificate not yet valid" on the identity issuer, cert renews in 2 days but already failing
```

/platform-skills:linux

What it does: Linux administration and networking diagnostics across 7 topic areas, plus a general-purpose troubleshoot checklist. Each topic has its own structured framework. Always ends with a validation command and rollback.

```
/platform-skills:linux [topic: dns | lb | vpc | process | disk | network | security-groups | troubleshoot]
```

Topic frameworks:

dns

Walks the resolution path client → resolver → authoritative; identifies the break; provides exact dig/nslookup commands; covers CoreDNS health inside Kubernetes clusters, ndots behaviour, and search domain issues.

```
/platform-skills:linux dns pod cannot resolve payments-service.checkout.svc.cluster.local

/platform-skills:linux dns external DNS propagation delay after updating Route 53 record: TTL is 300s but clients still seeing old IP after 10 minutes
```

lb

Diagnoses L4 vs L7 choice, health check failures, routing issues, target group type, source IP preservation on NLB.

```
/platform-skills:linux lb ALB returning 502: target group shows healthy but upstream still getting 502s intermittently

/platform-skills:linux lb should I use ALB or NLB for a gRPC service? What are the implications?
```

vpc

Subnet tier review, route table correctness, IGW/NAT GW attachment, security group rules. Peering vs Transit Gateway scale/cost trade-off. PrivateLink producer/consumer placement.

```
/platform-skills:linux vpc EKS pods in private subnet cannot reach ECR: what route and security group changes are needed?

/platform-skills:linux vpc 12 VPCs need to talk to each other: peering or Transit Gateway?
```

process

Crashed services, resource exhaustion, misconfigurations. Provides systemctl, journalctl, ps, lsof, strace commands. Covers memory (free -h, /proc/meminfo) and CPU (vmstat, mpstat).

```
/platform-skills:linux process nginx keeps restarting: how do I find the root cause and make the fix survive a reboot?

/platform-skills:linux process Java process OOMKilled by the OS but -Xmx is set to 2g and node has 8g RAM
```

disk

Checks both space (df -hT) and inodes (df -i). Finds large files with du -sh, identifies deleted-but-open files with lsof | grep deleted.

```
/platform-skills:linux disk /var/log is at 95%: how do I find what's consuming space without taking down the service?

/platform-skills:linux disk disk shows full but df shows 80%: how to find the "missing" space?
```

network

Full connectivity ladder: L3 (ping) → L4 (nc -zv) → L7 (curl -v). Interface state, route table, socket stats. Kernel tuning for high-traffic services (somaxconn, tcp_max_syn_backlog, ip_local_port_range).

```
/platform-skills:linux network two pods can ping each other but TCP connection is refused on port 8080, same namespace, same node

/platform-skills:linux network high connection reset rate under load, suspect kernel TCP backlog limits
```

security-groups

Maps traffic flow source → SG on LB → SG on target → NACL. Identifies missing or incorrect rules. Notes NLB source IP preservation behaviour.

```
/platform-skills:linux security-groups ALB to ECS task health check failing: security group rule looks correct but target health shows unhealthy

/platform-skills:linux security-groups NLB target is unhealthy but the same target passes curl from within the VPC
```

troubleshoot

General-purpose structured checklist when you don't know the topic. Classifies symptom first, then applies the appropriate framework.

```
/platform-skills:linux troubleshoot service was reachable 30 minutes ago, now timing out, no recent deploys
```

/platform-skills:helmcheck

What it does: Three modes: scaffold a production-ready chart from scratch, review an existing chart for structural issues, or run a security audit.

```
/platform-skills:helmcheck [create <workload-type> | review | security] [chart path or description]
```

create

Generates a complete, production-ready Helm chart based on workload type.

| Workload type | Resources generated |
|---|---|
| Web service | Deployment + Service + Ingress |
| Worker | Deployment only (no Service) |
| CronJob | CronJob + ServiceAccount |
| Stateful | StatefulSet + PVC + Headless Service |
What gets generated:
- Chart.yaml with name, description, type, version, appVersion
- _helpers.tpl with all 6 standard helpers: name, fullname, chart, labels, selectorLabels, serviceAccountName. selectorLabels correctly excludes app.kubernetes.io/version (immutable after creation)
- values.yaml: every key commented, defaults that work with zero overrides, hardened securityContext (non-root, readOnlyRootFilesystem, drop ALL), resource requests and limits present
- deployment.yaml, service.yaml, serviceaccount.yaml (with automountServiceAccountToken: false)
- ingress.yaml, hpa.yaml (autoscaling/v2), pdb.yaml (policy/v1), networkpolicy.yaml
- helm lint --strict, helm template --debug, kubeconform -strict

Examples:

```
/platform-skills:helmcheck create web service for a Node.js REST API: needs ingress, HPA, and network policy

/platform-skills:helmcheck create worker for a Python background job that reads from SQS, no inbound traffic

/platform-skills:helmcheck create stateful for PostgreSQL with PVC and headless service
```

review

Checks a chart against a severity table. Reports Critical/High/Medium/Low findings with exact fixes.

Checks include: missing _helpers.tpl, no resource limits, no probes, hardcoded image tags, wrong label immutability, missing NOTES.txt, automountServiceAccountToken: true, undocumented values.

```
/platform-skills:helmcheck review ./charts/orders-service

/platform-skills:helmcheck review - the chart has no liveness probes and I suspect the values.yaml has secrets hardcoded
```

security

Full security audit across pod security, RBAC, network, and secrets.

Checks include:

- securityContext, running as root, readOnlyRootFilesystem: false, capabilities not dropped, privileged: true, allowPrivilegeEscalation: true
- automountServiceAccountToken: true, ClusterRole used where Role would do, wildcard verbs
- hostNetwork/hostPID/hostIPC: true

```
/platform-skills:helmcheck security ./charts/payments-service

/platform-skills:helmcheck security - our chart runs as root and mounts the host docker socket, help me fix it
```

/platform-skills:commit

What it does: Generates, validates, and stages Conventional Commits messages. Four modes: analyze → generate → stage → validate.

```
/platform-skills:commit [analyze|generate|stage|validate] [optional: type/scope override or description]
```

Commit message format:

```
<type>(<scope>): <imperative subject, 72 chars max>

<body: explains WHY, wraps at 72 chars>

<footers: BREAKING CHANGE:, Fixes #N>
```

analyze

Inspects staged diff (or unstaged if nothing staged). Groups files by logical concern, detects type and scope, flags breaking changes.

Type detected from:
| Type | When |
|---|---|
| feat | New capability or behavior added |
| fix | Corrects broken behavior |
| refactor | Restructures without changing behavior |
| perf | Measurably improves performance |
| test | Tests only |
| docs | Documentation only |
| chore | Deps, build tooling, no production effect |
| ci | CI/CD pipeline changes |
| revert | Reverts a prior commit |

Returns: detected type, inferred scope, breaking change flag, one-line WHY summary.

```
/platform-skills:commit analyze

/platform-skills:commit analyze - I changed the auth middleware to reject tokens missing the 'sub' claim
```

generate

Runs analyze internally, then writes the full commit message: subject (imperative mood, WHY-focused, ≤72 chars), body, and footers.

```
/platform-skills:commit generate

/platform-skills:commit generate feat auth - I added OIDC login support with PKCE flow

/platform-skills:commit generate - breaking change, removed the /v1/orders endpoint
```

Example output:

```
feat(auth): add OIDC login with PKCE flow

Previous password-based login could not support SSO providers. PKCE
prevents auth code interception on public clients without requiring
a client secret.

BREAKING CHANGE: /api/auth/login now returns an authorization_url
instead of a session token. Clients must redirect to this URL.
Closes #142
```

stage

Lists all modified files, groups them by logical change, identifies unrelated changes that should be separate commits, and stages the chosen group.

```
/platform-skills:commit stage
```

Useful when: you've changed multiple unrelated things and want clean atomic commits.

Example:

```
/platform-skills:commit stage
```

Output groups might be:

- src/auth/oidc.ts, src/auth/oidc.test.ts → feat(auth): add OIDC
- package.json, package-lock.json → chore(deps): bump axios to 1.7.0
- README.md → docs: document OIDC setup

validate

Checks an existing commit message against the full spec. Reports PASS or exact violations with the corrected line.

Rules checked: valid type, lowercase scope, ! only with BREAKING CHANGE footer, subject starts lowercase, no trailing period, ≤72 chars, blank line before body, body wraps at 72 chars, well-formed footers.
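
As a rough illustration of how several of these rules can be checked, the sketch below validates a commit message against a subset of them. It is an assumption-laden example rather than the skill's own validator: `validate_commit` and `HEADER_RE` are hypothetical names, the type list is taken from the table in the analyze section, the 72-character limit is applied to the whole first line, and body wrapping and footer syntax are not checked.

```python
import re

# Types from the "Type detected from" table in the analyze section.
VALID_TYPES = {"feat", "fix", "refactor", "perf", "test", "docs", "chore", "ci", "revert"}

# <type>(<scope>)!: <subject>, where scope and "!" are optional.
HEADER_RE = re.compile(
    r"^(?P<type>\w+)(?:\((?P<scope>[^)]*)\))?(?P<bang>!)?: (?P<subject>.+)$"
)


def validate_commit(message):
    """Return a list of rule violations; an empty list means PASS."""
    lines = message.splitlines()
    header = lines[0] if lines else ""
    match = HEADER_RE.match(header)
    if not match:
        return ["header does not match <type>(<scope>): <subject>"]

    violations = []
    if match["type"] not in VALID_TYPES:
        violations.append(f"invalid type '{match['type']}'")
    scope = match["scope"]
    if scope is not None and scope != scope.lower():
        violations.append("scope must be lowercase")
    subject = match["subject"]
    if subject[0].isupper():
        violations.append("subject must start lowercase")
    if subject.endswith("."):
        violations.append("subject must not end with a period")
    if len(header) > 72:
        violations.append("first line exceeds 72 characters")
    if len(lines) > 1 and lines[1].strip():
        violations.append("blank line required before body")
    has_breaking_footer = any(l.startswith("BREAKING CHANGE:") for l in lines[1:])
    if match["bang"] and not has_breaking_footer:
        violations.append("'!' requires a BREAKING CHANGE footer")
    return violations


bad = validate_commit("Fix: updated auth service")
good = validate_commit("feat(auth): add OIDC login with PKCE flow")
print(bad)
print(good)
```

Returning a list of violations instead of a boolean mirrors the "PASS or exact violations" behaviour described for this mode.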

```
/platform-skills:commit validate "Fix: updated auth service"

/platform-skills:commit validate "$(git log -1 --format=%B)"
```

/platform-skills:observability

What it does: Add the three pillars to a service (logs/metrics/traces), build Grafana dashboards, write Prometheus alerting rules, design k6 load tests, or estimate capacity and HPA configuration.

```
/platform-skills:observability [instrument|dashboard|alert|loadtest|capacity] [service description]
```

instrument

Adds structured logging, RED metrics, and OpenTelemetry tracing to a service. Asks for: language/framework, existing logging library, metrics backend (Prometheus/Datadog/CloudWatch), tracing backend (Jaeger/Tempo/OTLP).

What gets generated:

- http_requests_total counter + http_request_duration_seconds histogram, labelled by route and status
- /metrics Prometheus scrape endpoint
- /healthz and /readyz health check endpoints

```
/platform-skills:observability instrument a Node.js Express API: Prometheus metrics, Tempo tracing, Pino logging

/platform-skills:observability instrument a Python FastAPI service. I'm already using structlog, add RED metrics and OpenTelemetry spans for the payment and order endpoints
```

dashboard

Generates a Grafana dashboard using RED (request-based services) or USE (resource-based infrastructure) method.

Panels generated: request rate (req/s), error rate (%), p50/p95/p99 latency (ms), active connections, queue depth. Threshold lines at SLO boundaries, template variables for env and service.

```
/platform-skills:observability dashboard for the orders-service: RED method, SLO is 99.9% availability and p99 < 500ms

/platform-skills:observability dashboard USE method for the Postgres database: CPU, memory, disk I/O, connection pool
```

alert

Writes Prometheus alerting rules using rate() over 5-minute windows, with for: duration, severity label, and runbook annotation.

Alert design enforced:

- runbook annotation URL

```
/platform-skills:observability alert write SLO burn-rate alerts for orders-service: 99.9% availability target, 30-day window

/platform-skills:observability alert high p99 latency alert for the payments-service: critical at 1s, warning at 500ms, fire after 5 minutes
```

loadtest

Writes a k6 load test with ramp-up → steady-state → ramp-down stages, thresholds matching the SLO, and check() assertions on status code and response time.

Asks for: target endpoint, expected peak RPS, SLO thresholds (p95 latency, error rate).

```
/platform-skills:observability loadtest for POST /orders: peak 500 RPS, p95 must be under 200ms, error rate under 0.1%

/platform-skills:observability loadtest simulate 2000 concurrent users on the checkout flow, ramp up over 5 minutes
```

capacity

Estimates replica count, resource requests/limits, and HPA configuration based on expected load.

Formula used: replicas = ceil((peak_rps × avg_latency_s) / target_concurrency_per_pod) + 50% headroom. HPA CPU target ≤ 60%.
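
The formula can be worked through in a few lines. The sketch below is one reading of it, assuming "+ 50% headroom" means multiplying the rounded-up base replica count by 1.5 and rounding up again; the function name `estimate_replicas` and the input numbers are illustrative, not taken from the skill.

```python
import math


def estimate_replicas(peak_rps, avg_latency_s, target_concurrency_per_pod, headroom=0.5):
    """Replica estimate: ceil(in-flight requests / per-pod concurrency), plus headroom."""
    # Requests in flight at peak = arrival rate x time each request spends in the system.
    in_flight = peak_rps * avg_latency_s
    base = math.ceil(in_flight / target_concurrency_per_pod)
    # Assumed interpretation of "+ 50% headroom": scale the base count, then round up.
    return math.ceil(base * (1 + headroom))


# Illustrative inputs: 400 RPS at 250 ms average latency, 8 concurrent requests per pod.
replicas = estimate_replicas(peak_rps=400, avg_latency_s=0.25, target_concurrency_per_pod=8)
print(replicas)  # 400 * 0.25 = 100 in flight; 100 / 8 = 12.5 -> 13; 13 * 1.5 = 19.5 -> 20
```

The product of request rate and latency is Little's law, which is why average latency (not p99) is the natural input here.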

```
/platform-skills:observability capacity orders-service handles 200 RPS at p99 150ms, expecting 3× growth: what are the resource requests and HPA settings?

/platform-skills:observability capacity current 4 pods at 70% CPU at 1000 RPS: what's the right HPA min/max and target utilisation?
```

/platform-skills:opa

What it does: Generate Rego policies, write unit tests, run the full validation pipeline (fmt → regal → conftest verify), explain policies in plain English, or debug why a rule isn't firing.

```
/platform-skills:opa [generate|test|validate|explain|debug] [policy description or file path]
```

generate

Writes a production-ready Rego policy from a description. Asks for: target resource type (Terraform HCL/plan JSON, Kubernetes manifest, GitHub Actions, Dockerfile), rule logic, and whether a named package is needed.

What gets generated:

- title, description, authors, entrypoint: true
- package <namespace>: named packages for multi-domain repos (package terraform.iam, package k8s.pods)
- import rego.v1
- deny for hard failures, warn for advisory, violation for Gatekeeper
- msg including resource name and remediation hint

```
/platform-skills:opa generate a policy that denies Kubernetes Deployments without resource limits set on all containers

/platform-skills:opa generate a Terraform policy that denies S3 buckets with public ACLs and warns if versioning is not enabled

/platform-skills:opa generate a policy for GitHub Actions workflows that warns if actions are not pinned to a SHA commit
```

test

Writes _test.rego unit tests for a given policy with minimal focused fixtures.

Structure generated:

- package <namespace>_test
- import data.<namespace>
- deny/warn rule: a positive test (test_deny_*, asserts count > 0) and a negative test (test_allow_*, asserts count == 0)
- conftest verify --policy <dir>

```
/platform-skills:opa test write unit tests for my S3 encryption policy

/platform-skills:opa test - [paste policy file] - generate tests for all deny and warn rules
```

validate

Runs the full 5-step validation pipeline in order. Fixes each stage before proceeding to the next.

| Step | Tool | What it catches |
|---|---|---|
| 1. Format check | conftest fmt --check | Inconsistent indentation, non-canonical formatting |
| 2. Auto-format | conftest fmt | Rewrites files in place if check failed |
| 3. Lint | regal lint | Style violations, no-defined-entrypoint, unused variables |
| 4. Unit tests | conftest verify --policy | Logic correctness against fixtures |
| 5. Integration test | conftest test --policy <dir> <input-files> | Real input parsing against actual files |

```
/platform-skills:opa validate ./policies

/platform-skills:opa validate - regal is reporting "no-defined-entrypoint" on all my files, how do I fix it?
```

explain

Translates Rego into plain English, rule by rule. Maps each input.<field> to the actual resource attribute being read. Notes data.* dependencies.

For each rule explains:

```
/platform-skills:opa explain
deny contains msg if {
  some name
  bucket := input.resource.aws_s3_bucket[name]
  bucket.acl == "public-read"
  msg := sprintf("S3 bucket '%s' must not be public", [name])
}
```

```
/platform-skills:opa explain ./policies/deny_unencrypted_s3.rego
```

debug

Diagnoses why a policy is not firing (or firing when it shouldn't). Checks in order: namespace mismatch → rule name → input shape → partial vs set comprehension → import rego.v1 missing → some missing.

```
/platform-skills:opa debug my deny rule produces no output when I run conftest test against main.tf

/platform-skills:opa debug - I have a warn rule that fires on everything, even compliant resources

/platform-skills:opa debug - conftest test passes with exit 0 but I expected a failure - [paste policy and input]
```

/platform-skills:kyverno

What it does: Generate, test, audit, debug, and migrate Kyverno policies using the new CEL-based policy types (ValidatingPolicy, MutatingPolicy, GeneratingPolicy, ImageValidatingPolicy; all apiVersion: policies.kyverno.io/v1). Covers matchConstraints, matchConditions, CEL validations/mutations, generator.Apply(), Audit→Deny promotion, PolicyException, PolicyReport analysis, and migration from legacy ClusterPolicy or PodSecurityPolicy.

```
/platform-skills:kyverno [generate|test|audit|debug|migrate] [policy description or file path]
```

generate

Writes a production-ready Kyverno policy using the new CEL-based types. Always starts in validationActions: [Audit] unless Deny is explicitly requested.

What gets generated:

- apiVersion: policies.kyverno.io/v1 with the appropriate kind (ValidatingPolicy, MutatingPolicy, GeneratingPolicy, or ImageValidatingPolicy)
- annotations block: policies.kyverno.io/title, category, severity, description
- matchConstraints.resourceRules targeting only the required kinds and operations
- matchConditions with CEL to exclude system namespaces, replacing the old exclude block
- ValidatingPolicy: validations[].expression (CEL boolean); messageExpression for dynamic messages
- MutatingPolicy: mutations[].patchType: ApplyConfiguration with Object{...} CEL for merges; patchType: JSONPatch with [JSONPatch{...}] CEL for precise path operations
- GeneratingPolicy: variables with dyn() for inline resources; generate[].expression using generator.Apply(namespace, [resources]); evaluation.synchronize.enabled: true
- ImageValidatingPolicy: matchImageReferences; attestors with Cosign keyless or key-based; validations using verifyImageSignatures() CEL function
- kyverno apply <policy.yaml> --resource <manifest.yaml> --detailed-results

```
/platform-skills:kyverno generate a ValidatingPolicy that requires all Deployments to have app.kubernetes.io/team and app.kubernetes.io/name labels

/platform-skills:kyverno generate a ValidatingPolicy that denies privileged containers in all namespaces except kube-system

/platform-skills:kyverno generate a GeneratingPolicy that creates a default-deny-ingress NetworkPolicy in every new namespace

/platform-skills:kyverno generate an ImageValidatingPolicy that requires all images to be signed with Cosign keyless (Sigstore)
```

test

Writes a kyverno-test.yaml manifest and resource fixture files to verify a policy with the kyverno CLI.

Structure generated:

- matchConditions exclude a namespace
- kyverno-test.yaml referencing all resources with expected results
- kyverno test .

```
/platform-skills:kyverno test - [paste ValidatingPolicy YAML] - write the test manifest and resource fixtures

/platform-skills:kyverno test write tests for my disallow-privileged-containers ValidatingPolicy
```

audit

Reads PolicyReport data from a running cluster and produces a ranked, actionable violation summary.

What it does:

- kubectl get policyreport -A and kubectl get clusterpolicyreport for all failures
- kubectl patch command to promote each zero-violation policy from Audit to Deny: kubectl patch validatingpolicy <name> --type merge -p '{"spec":{"validationActions":["Deny"]}}'

```
/platform-skills:kyverno audit - here is my policyreport output: [paste JSON or describe violations]

/platform-skills:kyverno audit we're ready to move require-labels to Deny, what violations remain?
```

debug

Diagnoses why a Kyverno policy is not behaving as expected.

Checks in order:

- kubectl get validatingwebhookconfigurations
- matchConstraints.resourceRules not covering the resource kind, apiGroup, or operation
- matchConditions CEL expression filtering out the resource unexpectedly
- validationActions: [Audit], where the policy reports violations but does not block; check PolicyReport, not admission events
- evaluation.background.enabled: false, where existing resources are never evaluated
- kubectl describe events on the resource

```
/platform-skills:kyverno debug my ValidatingPolicy is in Audit mode but policyreport shows no violations for existing Deployments

/platform-skills:kyverno debug my CEL expression blocks every Pod even when it should pass - [paste ValidatingPolicy YAML]

/platform-skills:kyverno debug CEL evaluation error on admission - [paste policy and the admission event]
```

migrate

Guides migration from legacy ClusterPolicy (kyverno.io/v1) or PodSecurityPolicy to the new CEL-based types.

From legacy ClusterPolicy:
| Legacy field | New equivalent |
|---|---|
| spec.rules[].match.any[].resources | spec.matchConstraints.resourceRules[] |
| spec.rules[].exclude | spec.matchConditions with CEL negation |
| validate.pattern (JMESPath anchors) | validations[].expression (CEL boolean) |
| validate.deny.conditions | validations[].expression with inverted CEL |
| mutate.patchStrategicMerge | mutations[].patchType: ApplyConfiguration with Object{...} |
| mutate.patchesJSON6902 | mutations[].patchType: JSONPatch with [JSONPatch{...}] |
| generate.data / generate.clone | generate[].expression using generator.Apply() and resource.Get() |
| validationFailureAction: Enforce | validationActions: [Deny] |
| validationFailureAction: Audit | validationActions: [Audit] |
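
The last two rows of the table are a direct value mapping, which the following sketch illustrates. It is only an illustration: `migrate_validation_actions` is a hypothetical helper operating on a policy spec that has already been parsed into a dict, and defaulting to Audit when the legacy field is absent is an assumption chosen to match the Audit-first behaviour described for generate.

```python
# Mapping taken from the last two rows of the migration table.
ACTION_MAP = {"Enforce": ["Deny"], "Audit": ["Audit"]}


def migrate_validation_actions(legacy_spec):
    """Translate a legacy validationFailureAction into a validationActions list."""
    # Assumption: an absent legacy field is treated as Audit.
    legacy = legacy_spec.get("validationFailureAction", "Audit")
    if legacy not in ACTION_MAP:
        raise ValueError(f"unknown validationFailureAction: {legacy!r}")
    return {"validationActions": ACTION_MAP[legacy]}


new_spec = migrate_validation_actions({"validationFailureAction": "Enforce", "rules": []})
print(new_spec)
```

The other rows of the table involve rewriting match blocks and patterns into CEL, which needs case-by-case translation and is not covered by a lookup like this.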
From PodSecurityPolicy:
- ValidatingPolicy CEL expression
- [Audit] mode first
- [Deny] with zero violations

From OPA/Gatekeeper:
- ValidatingPolicy CEL expressions
- input.review.object → object, deny rule → validations[].expression with inverted logic

/platform-skills:kyverno migrate I'm migrating from PSP — here are my existing PodSecurityPolicies: [paste YAML]
/platform-skills:kyverno migrate translate this legacy ClusterPolicy to the new ValidatingPolicy type: [paste YAML]
/platform-skills:kyverno migrate translate this Gatekeeper ConstraintTemplate to a Kyverno ValidatingPolicy: [paste YAML]

/platform-skills:compliance
What it does: SOC 2 compliance for Terraform infrastructure. Gap analysis, control implementation, audit evidence collection, Checkov remediation — all mapped to Trust Services Criteria (TSC).
/platform-skills:compliance [topic: gap | control | evidence | remediate | checklist]

gap
Maps your Terraform config or description to SOC 2 TSC criteria, identifies gaps, and prioritises: Critical (audit blocker) / High (likely finding) / Medium (improvement).
Output format:
Criterion | Finding | Severity | Fix
CC6.7 | S3 bucket missing KMS CMK | Critical | Add aws_s3_bucket_server_side_encryption_configuration
CC7.2 | CloudTrail not multi-region | Critical | Set is_multi_region_trail = true
CC6.6 | Security group allows 0.0.0.0/0 | High | Restrict to VPN CIDR

/platform-skills:compliance gap analyze my EKS Terraform module for SOC 2 CC6.1 access control gaps
/platform-skills:compliance gap we're going through a SOC 2 Type II audit in 3 months — run a gap analysis on our AWS infrastructure config

control
Implements a specific SOC 2 control in Terraform. States the criterion, provides the exact resource(s), lists the Checkov rule IDs, and shows the auditor evidence command.
/platform-skills:compliance control implement CC6.7 encryption at rest for our RDS instances
/platform-skills:compliance control CC7.2 — how do I implement CloudTrail with integrity validation and multi-region coverage?

evidence
Provides copy-paste AWS CLI commands to gather audit evidence for a specific criterion. Notes what each output proves and flags any elevated permissions required.
/platform-skills:compliance evidence CC6.7 — what AWS CLI commands do I run to show auditors that all S3 buckets have encryption enabled?
/platform-skills:compliance evidence CC6.6 — how do I prove to auditors that no security groups allow 0.0.0.0/0 on port 22?

remediate
Fixes a specific Checkov or audit finding. Provides: criterion mapping, root cause, exact old → new Terraform block, blast radius (will this replace the resource?), validation steps, rollback.
/platform-skills:compliance remediate CKV_AWS_18: S3 bucket does not have access logging enabled
/platform-skills:compliance remediate CKV_AWS_86: CloudFront distribution does not have logging enabled

checklist
Runs through the full SOC 2 readiness checklist. For each item: pass / fail / unknown. For fails and unknowns: what's needed + which Checkov rule enforces it. Ends with a prioritised action list.
/platform-skills:compliance checklist — we have Terraform for EKS, RDS, S3, IAM, and CloudTrail

/platform-skills:datadog
What it does: End-to-end Datadog coverage — Agent deployment on Kubernetes, APM instrumentation, monitors, dashboards, SLOs, live incident investigation via the Datadog MCP server, pup CLI operations, and LLM Observability instrumentation and evaluation.
/platform-skills:datadog [setup|instrument|monitor|dashboard|slo|investigate|debug|pup|llmo] [service or description]

setup
Deploys the Datadog Agent on Kubernetes via Helm. Asks for: Kubernetes distribution (EKS/AKS/GKE), Datadog site (EU datadoghq.eu / US datadoghq.com), features needed.
Generates: Helm values with API key from Kubernetes Secret (never hardcoded), APM enabled, log collection enabled, cluster name set, Cluster Agent with 2 replicas. Adds Unified Service Tagging (DD_ENV, DD_SERVICE, DD_VERSION) to app Deployment.
/platform-skills:datadog setup EKS cluster, EU site, need APM and log collection
/platform-skills:datadog setup how do I add Unified Service Tagging to my existing Deployments?

instrument
Adds APM tracing to a service. Asks for: language and framework, whether log-trace correlation is needed.
- dd-trace init as the first import
- ddtrace-run entry point or patch_all() + DD_TRACE_ENABLED

/platform-skills:datadog instrument Node.js Express orders service — need log-trace correlation
/platform-skills:datadog instrument Python FastAPI payments service — add custom spans for the payment processing flow

monitor
Generates a Terraform datadog_monitor resource. Sets notify_no_data: true, warning and critical thresholds, PagerDuty/Slack notification handles, and service:, env:, team: tags for routing.
/platform-skills:datadog monitor high error rate on orders-service — critical at 5%, warning at 2%, notify @pagerduty-platform @slack-alerts
/platform-skills:datadog monitor p99 latency for the checkout service — critical over 1s, warning over 500ms

dashboard
Generates a Terraform datadog_dashboard with RED method widgets using APM metrics. Template variables for env and service.
/platform-skills:datadog dashboard orders-service RED dashboard — request rate, error rate, p50/p95/p99 latency

slo
Generates a Terraform datadog_service_level_objective. Links to monitors for error budget burn alerts.
/platform-skills:datadog slo orders-service availability SLO — 99.9% target, 30-day window, warn at 99.95%

investigate
Live incident investigation using the Datadog MCP server. Requires the MCP server connected — see references/datadog.md → MCP Server Setup.
Runs a 4-phase investigation through the MCP:
| Phase | What gets queried |
|---|---|
| 1. Triage | Active monitors in ALERT/WARN, event stream last 30 min, recent deployments |
| 2. Signals | Error logs for the incident window, APM error rate + p99 time series, sample failing traces |
| 3. Root cause | Before/after metric comparison, endpoint-level error breakdown, host CPU/memory |
| 4. Resolution | Acknowledge/resolve monitor, post incident Slack update, create post-mortem notebook |
/platform-skills:datadog investigate orders-service error rate spiked at 14:30 UTC — still ongoing

Natural language queries Claude sends to the MCP:
What monitors are firing for service:orders-service env:production right now?
Show error logs for orders-service between 14:30 and 15:00 UTC.
Compare the error rate before and after 14:30 UTC.
Which endpoints have the highest error rate?

debug
Diagnoses Datadog data gaps without the MCP server. Classifies: Agent unhealthy, APM missing, logs not ingested, Monitor in No Data, custom metric not visible.
/platform-skills:datadog debug APM traces are missing for the payments-service — agent shows healthy
/platform-skills:datadog debug monitor is stuck in "No Data" — the service is definitely running

pup mode
Scripted Datadog operations via the pup CLI — log search, metric queries, monitor management, and post-deploy quality gates.
/platform-skills:datadog pup search error logs for orders-service in the last 30 minutes
/platform-skills:datadog pup query p99 latency for orders-service over the last hour
/platform-skills:datadog pup mute the high error rate monitor for orders-service for 1 hour
/platform-skills:datadog pup generate a post-deploy gate script that fails CI if error rate exceeds 5%

llmo mode
Instrument an AI application with Datadog LLM Observability, bootstrap evaluators, or root-cause LLM failures.
/platform-skills:datadog llmo instrument my Python OpenAI app with LLMObs — I need faithfulness scoring
/platform-skills:datadog llmo add evaluation scores to my Node.js LLM spans and set up a CI quality gate
/platform-skills:datadog llmo root-cause this failing trace — trace ID 8f3a2b1c9d4e5f6a
/platform-skills:datadog llmo compare gpt-4o vs gpt-4o-mini on my orders-assistant over the last 7 days

/platform-skills:dynatrace
What it does: Dynatrace Operator deployment, OneAgent injection, code-level instrumentation, custom metrics, SLOs, dashboards, anomaly detection, Davis AI, and live incident investigation via the Dynatrace MCP server.
/platform-skills:dynatrace [setup|instrument|monitor|slo|dashboard|investigate|debug] [service or description]

setup
Deploys the Dynatrace Operator and OneAgent on Kubernetes. Asks for: environment ID, Kubernetes distribution, monitoring mode.
- cloudNativeFullStack for automatic injection — no pod restarts required
- apiToken and dataIngestToken — never plain values
- metadataEnrichment: true for k8s metadata on all telemetry

Required token scopes:
- apiToken: ReadConfig, WriteConfig, DataExport, LogExport, ReadSyntheticData, WriteAnomalyDetection
- dataIngestToken: metrics.ingest, logs.ingest

/platform-skills:dynatrace setup EKS cluster, environment ID abc12345, cloudNativeFullStack injection

instrument
Adds custom spans for business logic (OneAgent auto-instruments HTTP/DB/cache/messaging). Asks for: language, operations to trace.
- @dynatrace/oneagent-sdk
- oneagent-sdk
- x-dynatrace header

/platform-skills:dynatrace instrument Node.js — add custom spans for the payment processing and order creation flows

monitor
Generates Terraform dynatrace_service_anomalies_v2 for failure rate and response time, plus dynatrace_alerting profile linked to PagerDuty/Slack/Opsgenie.
/platform-skills:dynatrace monitor orders-service — auto-detect anomalies, alert to PagerDuty platform-oncall

slo
Generates Terraform dynatrace_slo_v2. Uses built-in availability metrics: builtin:service.errors.server.successCount / builtin:service.requestCount.server.
/platform-skills:dynatrace slo orders-service availability — 99.9% target, 30-day timeframe, warn at 99.5%

dashboard
Generates Terraform dynatrace_json_dashboard with DATA_EXPLORER tiles for key service metrics.
/platform-skills:dynatrace dashboard orders-service — request count, response time, error rate, availability

investigate
Live incident investigation using the Dynatrace MCP server. Requires the MCP server connected — see references/dynatrace.md → MCP Server Setup.
Cost note: `execute_dql` queries scan Grail data and may incur costs. Start with short timeframes (1h–24h). Set `DT_GRAIL_QUERY_BUDGET_GB` to cap session spend.
Runs a 4-phase investigation:
| Phase | What gets queried |
|---|---|
| 1. Triage | Open Davis AI Problems, root cause entity and affected services, k8s events |
| 2. Signals | Error logs via DQL, exceptions with stack traces, distributed traces with errors |
| 3. Root cause (Davis AI) | Davis Copilot plain-English explanation, Davis Analyzer automated root cause |
| 4. Resolution | Create Dynatrace Notebook, send Slack/email, close Problem with resolution note |
Example DQL queries the MCP runs:
fetch logs
| filter service.name == "orders-service" and loglevel == "ERROR"
| sort timestamp desc
| limit 50
| fields timestamp, content, trace_id, span_id

/platform-skills:dynatrace investigate orders-service has an open Davis AI Problem since 14:00 UTC

debug
Diagnoses data gaps without the MCP server. Classifies: injection failure, traces broken, custom metrics missing, SLO showing 0%, Davis AI not firing Problems.
/platform-skills:dynatrace debug OneAgent not injecting into pods in the checkout namespace
/platform-skills:dynatrace debug custom metrics not showing in Metrics Explorer — ingest endpoint returns 202

/platform-skills:document
What it does: Generate or improve technical documentation — docstrings, OpenAPI 3.1 specs, documentation sites, and getting started guides.
/platform-skills:document [docstrings|openapi|site|guide] [language/framework] [path or description]

docstrings
Adds or improves inline documentation. Asks for: language and preferred style.
| Language | Style options |
|---|---|
| Python | Google, NumPy, Sphinx |
| TypeScript/JavaScript | JSDoc |
For each undocumented public function/class: purpose, all parameters with types, return value, exceptions raised, at least one example. Validates examples run (python -m doctest, tsc --noEmit). Generates coverage report (interrogate, typedoc-coverage).
Does NOT document: obvious getters/setters, private methods without complex invariants.
/platform-skills:document docstrings python Google-style — add docstrings to all public functions in src/auth/
/platform-skills:document docstrings typescript JSDoc — document the OrderService class and all public methods

openapi
Generates or improves an OpenAPI 3.1 spec. Asks for: framework and existing routes or codebase.
What gets generated:
- components/schemas
- components/responses
- operationId on every operation
- npx @redocly/cli lint

Framework-specific patterns:
- Field(description=...) + route docstrings → auto-generates /docs
- @ApiProperty, @ApiOperation, @ApiResponse decorators

/platform-skills:document openapi generate a spec for our Orders REST API — POST /orders, GET /orders/{id}, GET /orders with pagination
/platform-skills:document openapi FastAPI — generate OpenAPI spec from the existing route functions in app/routes/

site
Sets up a documentation site for a project.
| Project type | Recommended generator |
|---|---|
| Python library | MkDocs + mkdocstrings + Material theme |
| TypeScript SDK | TypeDoc + typedoc-material-theme |
| API portal | Redocly or Stoplight |
| General docs | Docusaurus |
Generates site config, nav structure (Getting Started, API Reference, Guides, Changelog), docstring auto-generation from source, search plugin, serve and build commands.
/platform-skills:document site Python library — set up MkDocs with auto-generated API reference from docstrings
/platform-skills:document site REST API portal — we want rendered API docs with try-it-out for partners

guide
Writes a getting started guide or tutorial. Structure enforced:
Rules: all code examples tested and runnable, realistic values (not <YOUR_VALUE> placeholders), one concept per section, expected output for each command.
/platform-skills:document guide write a getting started guide for the orders-service SDK — Node.js, installing from npm, making the first API call
/platform-skills:document guide Kubernetes deployment guide for new engineers — assumes basic kubectl, no Helm knowledge

/platform-skills:mcp
What it does: Scaffold, review, or debug Model Context Protocol (MCP) server implementations in TypeScript or Python.
/platform-skills:mcp [create|review|debug] [typescript|python] [description]

create
Scaffolds a production-ready MCP server. Asks for: language, transport (stdio / HTTP+SSE), list of tools and resources needed.
What gets generated:
- (mcp package)
- z.any() or empty schemas
- isError: true content on failures, no unhandled exceptions

/platform-skills:mcp create typescript — MCP server that exposes Kubernetes pod logs and events as tools, uses stdio transport
/platform-skills:mcp create python — MCP server that wraps our internal deployment API — 3 tools: list_services, get_status, trigger_deploy

review
Reviews an existing MCP server or client. Evaluates in priority order:
- z.any()
- isError: true with message, no unhandled exceptions

/platform-skills:mcp review my MCP server — focus on whether error handling is correct and if the schemas are tight enough

debug
Diagnoses protocol and integration failures. Classifies: Transport → Protocol → Schema → Handler → Integration.
Evidence to collect:
# Verify protocol compliance
npx @modelcontextprotocol/inspector node dist/index.js
# Smoke-test via stdio
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}' | node dist/index.js

/platform-skills:mcp debug tool is not appearing in the tools list in Claude — server starts without errors
/platform-skills:mcp debug tool call returns empty result but no error — the handler is executing
/platform-skills:mcp debug schema validation rejecting a valid input — Zod error: "Expected string, received number"

/platform-skills:aws-profile
What it does: Discover all AWS profiles from ~/.aws/config, check credential TTL, switch profiles across VS Code and Claude Code MCP server configs, and scan AWS Organization accounts for unconfigured entries.
/platform-skills:aws-profile [discover|status|switch <profile>|login <profile>|org-scan] [flags]

Modes:
| Mode | What it does |
|---|---|
| `discover` | List all profiles classified by type (SSO/assumed-role/Granted/static) with TTL and env tags |
| `status` | Show which profile each MCP server uses and whether it will auto-refresh on expiry |
| `switch` | Patch AWS_PROFILE in VS Code and/or Claude Code MCP configs, with prod safety guard |
| `login` | Emit the correct auth command for the profile type (sso login / assume / role chain) |
| `org-scan` | List all AWS Org accounts, flag those without configured profiles |
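To make the `discover` classification concrete, here is a minimal Python sketch that reads AWS config text and assigns each profile one of the four types from the table. The classification rules are assumptions for illustration (for example, treating a `credential_process` that mentions `granted` as a Granted profile); the skill's actual logic may differ, and TTL and env tags are not covered. The sample config, account IDs, and URL are made up.

```python
import configparser

# Hypothetical sample config; account IDs, names, and URLs are made up.
SAMPLE_CONFIG = """
[default]
region = eu-west-1

[profile dev-sandbox]
sso_session = corp
sso_account_id = 111111111111
sso_role_name = Developer

[profile prod-platform-eu]
role_arn = arn:aws:iam::222222222222:role/Platform
source_profile = default

[sso-session corp]
sso_start_url = https://example.awsapps.com/start
sso_region = eu-west-1
"""


def classify_profile(settings: dict) -> str:
    """Assumed rules: Granted, then SSO, then assumed-role, else static."""
    if "granted" in settings.get("credential_process", ""):
        return "Granted"
    if "sso_session" in settings or "sso_start_url" in settings:
        return "SSO"
    if "role_arn" in settings:
        return "assumed-role"
    return "static"


def discover_profiles(config_text: str) -> dict:
    """Map profile name to type, skipping non-profile sections."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    profiles = {}
    for section in parser.sections():
        if section == "default":
            name = "default"
        elif section.startswith("profile "):
            name = section[len("profile "):]
        else:
            continue  # e.g. [sso-session ...] blocks are not profiles
        profiles[name] = classify_profile(dict(parser[section]))
    return profiles


print(discover_profiles(SAMPLE_CONFIG))
```

To run it against a real file, pass the contents of `~/.aws/config` instead of `SAMPLE_CONFIG`.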
Common usage:
/platform-skills:aws-profile discover
/platform-skills:aws-profile discover --type sso --chain
/platform-skills:aws-profile status --watch
/platform-skills:aws-profile switch prod-platform-eu --scope workspace --confirm
/platform-skills:aws-profile switch --env dev --scope global
/platform-skills:aws-profile login dev-sandbox
/platform-skills:aws-profile org-scan --profile org-management --generate-profiles

/platform-skills:product
What it does: Platform product thinking — DevEx audits, friction analysis, RFC/ADR drafting, incident communication, blameless post-mortems, capacity planning, cost optimisation, and platform health review.
/platform-skills:product [topic: devex | friction | rfc | adr | incident | postmortem | capacity | cost | review]
Every response ends with: Next step (one concrete action to take immediately) + Signal to watch (one metric or observable that confirms the change is working).

devex
Audits developer experience using the SPACE framework (Satisfaction, Performance, Activity, Communication, Efficiency). Identifies the top friction point from description, proposes one systemic fix (not a local patch), and suggests one metric to track improvement.
/platform-skills:product devex developers are spending 45 minutes to get a new service to production on day 1
/platform-skills:product devex CI pipeline takes 25 minutes — engineers skip it and push directly to staging

friction
Maps the problem to the friction audit table (onboarding / CI / secrets / environment / ownership). States root cause, proposes the platform-level response, defines measurable "done".
/platform-skills:product friction every team has their own Terraform repo with different patterns — no shared modules
/platform-skills:product friction rotating secrets takes 3 engineers and an incident window

rfc
Produces a complete RFC:
/platform-skills:product rfc draft an RFC for migrating from Argo CD to Flux CD — 15 teams affected, GitOps repo restructure required
/platform-skills:product rfc adopt Gateway API as the standard ingress interface across all clusters

adr
Produces a complete Architecture Decision Record:
/platform-skills:product adr decision to use Crossplane instead of Terraform for cluster-level infrastructure
/platform-skills:product adr we decided to use a mono-repo GitOps structure over per-team repos

incident
Produces a structured incident status update:
/platform-skills:product incident orders-service is returning 500s, started at 14:30 UTC, about 20% of requests affected

postmortem
Produces a blameless post-mortem structure:
/platform-skills:product postmortem database failover at 02:15 UTC caused 18 minutes of downtime for checkout — root cause was missing health check on standby

capacity
Ties growth to a business metric, states current baseline and projected growth, recommends headroom target and trigger threshold, proposes next review date.
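The baseline-to-projection arithmetic behind a capacity recommendation can be sketched as follows, using the figures from the capacity example prompt (8 pods at 60% CPU, 3× growth). This assumes CPU demand scales linearly with traffic, which real services often violate; the function name and the target-utilisation parameter are illustrative choices, not the skill's output.

```python
def pods_needed(current_pods: int, cpu_pct: int, growth: int, target_cpu_pct: int) -> int:
    """Pods required to serve `growth` times the load at a target CPU percentage.

    Assumes CPU demand grows linearly with request rate. Integer arithmetic
    avoids floating-point rounding before the ceiling is taken.
    """
    # Total demand in "pod-percent" units after growth.
    demand = current_pods * cpu_pct * growth
    # Ceiling division: round up to whole pods.
    return -(-demand // target_cpu_pct)


# 8 pods at 60% CPU today, 3x growth expected.
print(pods_needed(8, 60, 3, target_cpu_pct=60))  # keep today's utilisation
print(pods_needed(8, 60, 3, target_cpu_pct=50))  # leave more headroom
```

Holding utilisation at 60% gives 24 pods; lowering the target to 50% gives 29. The gap between the two is the headroom being bought.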
/platform-skills:product capacity orders-service — currently 500 RPS at 60% CPU on 8 pods, expecting 3× growth in Q3

cost
Identifies top cost driver, applies the monthly cost loop (rightsizing → unused resources → showback), proposes a specific reduction action with owner and deadline.
/platform-skills:product cost EC2 spend is $40k/month — largest cluster has m5.2xlarge nodes at 15% average CPU
/platform-skills:product cost we have 200 EBS volumes and no visibility into which are orphaned

review
Runs the full platform health checklist covering Developer Experience, Operations, Security and Compliance, and Cost. Flags gaps and proposes the minimum action to close each.
/platform-skills:product review we're a 3-team platform org, 12 clusters, planning a SOC 2 audit in 6 months

/platform-skills:pr-review
Comprehensive pre-merge risk review across six dimensions. Each mode inspects the diff and current file state, reports findings with severity, and recommends concrete fixes.
Modes
| Mode | What it reviews |
|---|---|
| `cost` | Compute, storage, and network spend delta |
| `drift` | Environment alignment across dev/staging/prod overlays and values files |
| `ownership` | CODEOWNERS gaps, missing team labels, Terraform module README, PR governance |
| `compliance` | SOC 2 control impact — IAM, encryption, logging, network, change management |
| `upgrade` | Deprecated Kubernetes APIs, loose Terraform provider constraints, floating action versions, :latest images |
| `rollback` | Reversibility and blast radius score for every change |
| `full` | All six modes in sequence with a Merge Readiness Summary |
Usage
/platform-skills:pr-review cost [paste gh pr diff output or PR number]
/platform-skills:pr-review drift
/platform-skills:pr-review ownership
/platform-skills:pr-review compliance
/platform-skills:pr-review upgrade
/platform-skills:pr-review rollback
/platform-skills:pr-review full 42

Example prompts
/platform-skills:pr-review cost — here is the diff for PR #42: [paste output of gh pr diff 42]
/platform-skills:pr-review drift — values-dev.yaml changed, check if prod is aligned
/platform-skills:pr-review compliance — new IAM role added, verify CC6.1 and CC6.2
/platform-skills:pr-review rollback — we're renaming a Deployment and adding an RDS storage increase
/platform-skills:pr-review full 42

Multi-mode workflow
/platform-skills:pr-review full # get the complete risk picture
# fix blockers
/platform-skills:review # validate specific manifests before re-review

Reference: references/pr-review.md

/platform-skills:triage
Triage a PR review or issue comment from a bot, CI tool, or human reviewer. The command fetches the comment and diff with gh, classifies the thread, applies a minimal fix when the feedback is valid, replies, and resolves the review thread.
Modes
| Mode | What it does |
|---|---|
| `<PR number> <comment ID>` | Triage one specific comment |
| `--all <PR number>` | Triage every unresolved review thread on the PR |
Usage
/platform-skills:triage 42 123456789
/platform-skills:triage --all 42

Classifications
| Classification | Meaning |
|---|---|
| `ACTIONABLE_FIX` | Real issue in the changed files; apply the minimal fix, reply, and resolve |
| `INFORMATIONAL` | Question or non-blocking suggestion; answer, reply, and resolve |
| `NOT_APPLICABLE` | Status message, duplicate, already fixed, or outside this PR; explain and resolve |
Reference: commands/triage.md and examples/triage/README.md
/platform-skills:keda
Design, generate, debug, and review KEDA (Kubernetes Event-Driven Autoscaling) ScaledObject and ScaledJob resources.
Modes
| Mode | What it does |
|---|---|
| `generate` | Write a production-ready ScaledObject or ScaledJob from a description |
| `debug` | Diagnose why a ScaledObject is not scaling as expected |
| `review` | Correctness, security, and operational safety review |
| `scale` | Design a scaling strategy for a workload from requirements |
Usage
/platform-skills:keda generate
/platform-skills:keda generate SQS queue, scale-to-zero, IRSA auth, max 20 replicas
/platform-skills:keda generate cron schedule weekday 08:00-20:00 Europe/Berlin, safety-net Prometheus trigger
/platform-skills:keda debug
/platform-skills:keda review
[paste ScaledObject YAML]
/platform-skills:keda scale

Generate examples
/platform-skills:keda generate
Generates a ScaledObject for the orders-processor Deployment using SQS. Uses IRSA (no static credentials), scale-to-zero, activationQueueLength to prevent flapping, and HPA stabilization.
/platform-skills:keda generate cron schedule checkout-api 08:00-20:00 Europe/Berlin
Generates a ScaledObject with weekday/weekend Cron windows plus a Prometheus safety-net trigger for unexpected spikes.
/platform-skills:keda generate ScaledJob SQS batch one-job-per-message
Generates a ScaledJob with restartPolicy: Never, activeDeadlineSeconds, and security hardening.
Debug checklist
The debug mode works through this checklist in order:
1. Active: true? — check kubectl describe scaledobject
2. <unknown> targets? — metrics adapter connection
3. activationThreshold / activationQueueLength too high?
4. minReplicaCount: 1 preventing scale-to-zero?
5. cooldownPeriod elapsed?

Key rules enforced
- TriggerAuthentication
- timezone explicitly on Cron triggers
- restoreToOriginalReplicaCount: true

Reference: commands/keda.md, references/keda.md, and examples/keda/

/platform-skills:self-improve
What it does: Bootstrap and operate a self-improving agent workspace. Creates and maintains .learnings/ directories, logs LRN/ERR/FEAT entries, reviews accumulated patterns, and promotes recurring learnings to project memory.
Works on: Any project workspace. Most useful when working with an AI assistant over multiple sessions to accumulate reusable patterns, catch recurring errors, and evolve agent behavior.
/platform-skills:self-improve [init|log|review|promote] [description or file path]
| Mode | What it does |
|---|---|
| `init` | Scaffold .learnings/ directory with LEARNINGS.md, ERRORS.md, FEATURE_REQUESTS.md, and memory/working-buffer.md |
| `log` | Append a new LRN, ERR, or FEAT entry to the appropriate file based on the description |
| `review` | Scan .learnings/ for recurring patterns (threshold: 3+ occurrences) and surface candidates for promotion |
| `promote` | Promote a recurring pattern from .learnings/ into CLAUDE.md or project memory |
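The `review` threshold of 3+ occurrences could be sketched as below. This assumes entries follow the `[YYYY-MM-DD] [ID] [TAG] description` shape listed under the key rules for this command, and that a "pattern" means entries sharing the same TAG; both the grouping rule and the sample entries are assumptions for illustration, and the VFM ranking is not modelled.

```python
import re
from collections import Counter

# Matches "[YYYY-MM-DD] [ID] [TAG] description".
ENTRY = re.compile(r"^\[(\d{4}-\d{2}-\d{2})\] \[([^\]]+)\] \[([^\]]+)\] (.+)$")


def promotion_candidates(lines: list[str], threshold: int = 3) -> list[str]:
    """Return tags seen at least `threshold` times, most frequent first."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            counts[match.group(3)] += 1  # group 3 is the TAG field
    recurring = [(tag, n) for tag, n in counts.items() if n >= threshold]
    # Sort by count descending, then tag name for a stable order.
    return [tag for tag, _ in sorted(recurring, key=lambda item: (-item[1], item[0]))]


# Hypothetical log entries, made up for this example.
entries = [
    "[2026-01-05] [ERR-001] [helm] upgrade failed due to immutable selectorLabels",
    "[2026-01-09] [ERR-002] [helm] selector changed after chart version bump",
    "[2026-01-12] [LRN-001] [terraform] pin provider versions",
    "[2026-01-20] [ERR-003] [helm] selectorLabels included app version",
]
print(promotion_candidates(entries))
```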
Example usage:
/platform-skills:self-improve init
Creates the full .learnings/ scaffold in the current workspace.
/platform-skills:self-improve log ERR helm upgrade failed due to immutable selectorLabels
Appends an ERR entry with timestamp, context, and resolution to .learnings/ERRORS.md.
/platform-skills:self-improve review
Scans all .learnings/ files, identifies patterns appearing 3+ times, and returns promotion candidates ranked by VFM score.
/platform-skills:self-improve promote "never include app.kubernetes.io/version in selectorLabels"
Appends the pattern to CLAUDE.md memory and marks the source entry as promoted.
Key rules enforced
- [YYYY-MM-DD] [ID] [TAG] description with Context:, Resolution:, Frequency: fields
- .learnings/ files; never edit existing entries

Reference: commands/self-improve.md, references/agent-self-improve.md, and examples/agent-self-improve/

/platform-skills:supply-chain
Secure the software supply chain from build pipeline to running container.
Modes
| Mode | What it does |
|---|---|
| `audit` | Review an existing pipeline for supply chain security gaps |
| `sign` | Walk through Cosign keyless signing setup (Sigstore/Rekor, no key management) |
| `sbom` | Generate and attest an SBOM with Syft |
| `scan` | Trivy or Grype CVE scan with configurable severity gate |
| `enforce` | Generate Kyverno ImageValidatingPolicy to block unsigned images at admission |
| `slsa` | SLSA Level 2 provenance via slsa-github-generator GitHub Actions reusable workflow |
Usage
/platform-skills:supply-chain audit
/platform-skills:supply-chain sign
/platform-skills:supply-chain sbom
/platform-skills:supply-chain scan
/platform-skills:supply-chain enforce
/platform-skills:supply-chain slsa
[paste workflow YAML or describe your registry/CI setup]

Key rules enforced
- (@sha256:…), never the tag
- ImageValidatingPolicy in Audit mode first, then Deny

Reference: references/supply-chain.md and examples/supply-chain/
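The digest rule under "Key rules enforced" lends itself to a mechanical check. The sketch below is one way to test whether an image reference is pinned by a SHA-256 digest instead of a tag; the function name and the sample references are hypothetical, and this only validates the reference string, not the signature itself.

```python
import re

# A digest-pinned reference ends with "@sha256:" followed by 64 hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True when the image reference ends in a full sha256 digest."""
    return _DIGEST_SUFFIX.search(image_ref) is not None


# Hypothetical image references for illustration.
tagged = "ghcr.io/example/orders-service:1.4.2"
pinned = "ghcr.io/example/orders-service@sha256:" + "a" * 64

print(is_digest_pinned(tagged))
print(is_digest_pinned(pinned))
```

A check like this could run in CI before the signing step, so that a tag-only reference fails fast.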
/platform-skills:runtime-security
Detect and respond to in-container threats at the syscall level using Falco.
Modes
| Mode | What it does |
|---|---|
| `install` | Deploy Falco on EKS/GKE with eBPF driver via Helm |
| `rules` | Write and unit-test custom Falco rules |
| `alerts` | Configure Falcosidekick to route alerts to Slack, PagerDuty, or webhook |
| `debug` | Diagnose why a Falco rule is not firing |
| `harden` | Map Falco alert metadata to Kyverno admission enforcement |
Usage
/platform-skills:runtime-security install
/platform-skills:runtime-security rules
[describe the threat you want to detect]
/platform-skills:runtime-security alerts
/platform-skills:runtime-security debug
[paste kubectl logs output from Falco pod]
/platform-skills:runtime-security harden

Key rules enforced
- minimumpriority: warning for Slack routing; suppress DEBUG/INFO noise
- falco-event-generator before production

Reference: references/runtime-security.md and examples/runtime-security/

/platform-skills:chaos
Litmus Chaos and Chaos Mesh fault injection, steady-state hypothesis, GameDay workflow, and DORA feedback loop.
Modes:
| Mode | What it does |
|---|---|
| `install` | Helm install for Litmus Chaos or Chaos Mesh with namespace isolation and RBAC |
| `experiment` | Generate a fault experiment (ChaosEngine or Chaos Mesh CRD) from a description |
| `schedule` | Wrap an experiment in a recurring schedule (ChaosSchedule or Chaos Mesh Schedule CRD) |
| `gameday` | Structured GameDay runbook: steady-state → blast radius → inject → observe → verdict → DORA impact |
| `debug` | Diagnose failed or stuck experiments — ChaosResult, chaos-runner logs, RBAC gaps |
| `report` | Summarize blast radius, steady-state probe timeline, recovery time, and DORA delta |
Usage:
/platform-skills:chaos install [litmus|chaos-mesh] [namespace]
/platform-skills:chaos experiment [fault-class] [target-workload]
/platform-skills:chaos schedule [experiment-name] [interval]
/platform-skills:chaos gameday
/platform-skills:chaos debug [experiment-name]
/platform-skills:chaos report [experiment-name]

Examples:
/platform-skills:chaos install litmus
/platform-skills:chaos experiment pod-delete my-service
/platform-skills:chaos schedule pod-delete weekly
/platform-skills:chaos gameday
/platform-skills:chaos debug pod-delete-engine
/platform-skills:chaos report pod-delete-engine

Reference: references/chaos.md and examples/chaos/

/platform-skills:dora
GitHub Actions + Prometheus instrumentation for all four DORA metrics — Deployment Frequency, Lead Time for Changes, Change Failure Rate, and MTTR.
Modes:
| Mode | What it does |
|---|---|
| `instrument` | Generate GitHub Actions steps to push deploy and incident events to Prometheus Pushgateway |
| `dashboard` | Generate a Grafana dashboard with four DORA panels and Elite/High/Medium/Low threshold bands |
| `benchmark` | Classify current metric values against 2023 DORA performance bands; identify weakest metric |
| `debug` | Diagnose missing deployment events, missing MTTR, CFR stuck at 0%, or metrics stopping at a date |
Usage:
/platform-skills:dora instrument [workflow-file]
/platform-skills:dora dashboard
/platform-skills:dora benchmark [metric-values]
/platform-skills:dora debug [metric-name]

Examples:
/platform-skills:dora instrument .github/workflows/deploy.yml
/platform-skills:dora dashboard
/platform-skills:dora benchmark "deploy_freq=2, lead_time=3600, cfr=8, mttr=7200"
/platform-skills:dora debug mttr

Reference: references/dora.md and examples/dora/
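For readers scripting around the `benchmark` mode, the `key=value` string in the example above can be parsed as sketched here. The parser is a hypothetical helper; the units behind each number (for instance whether `lead_time` is in seconds) are not stated in this reference, so the sketch only converts values to floats and does not classify them into performance bands.

```python
def parse_metric_values(raw: str) -> dict[str, float]:
    """Parse 'name=value, name=value' into a dict of floats."""
    metrics = {}
    for pair in raw.split(","):
        key, sep, value = pair.strip().partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected name=value, got {pair.strip()!r}")
        metrics[key.strip()] = float(value)
    return metrics


print(parse_metric_values("deploy_freq=2, lead_time=3600, cfr=8, mttr=7200"))
```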
/platform-skills:awesome-docs
Generate, convert, and maintain GitHub-safe Markdown documents with animated SVG diagrams.
Usage: /platform-skills:awesome-docs <mode> [topic or file path]
generate
Create any animated Markdown document from scratch. The skill asks for doc type (readme, architecture-guide, runbook, tutorial, api-reference, how-it-works, rfc, post-mortem, or custom), topic, output path, and key components — then generates relevant SVGs one at a time with confirmation before the next.
/platform-skills:awesome-docs generate readme for orders-service
/platform-skills:awesome-docs generate architecture-guide for the KEDA autoscaling system
/platform-skills:awesome-docs generate runbook for Kubernetes cluster upgrade
/platform-skills:awesome-docs generate post-mortem for the 2026-05-21 checkout outage
/platform-skills:awesome-docs generate --theme docs-light tutorial for setting up Falco

convert
Inject animated SVGs into an existing plain Markdown doc.
/platform-skills:awesome-docs convert docs/keda-guide.md
/platform-skills:awesome-docs convert README.mdupdateRevise a single diagram in an existing doc.
/platform-skills:awesome-docs update assets/keda-arch-flow.svg
/platform-skills:awesome-docs update "Architecture section"diffDetect stale diagrams vs git HEAD.
/platform-skills:awesome-docs diff docs/architecture.mdauditQuality check — missing captions, broken refs, env-specific IDs, missing diagrams.
/platform-skills:awesome-docs audit docs/architecture.mdpreviewOpen the doc locally in a browser before committing.
/platform-skills:awesome-docs preview docs/architecture.mdexportGenerate animated HTML for Confluence/Notion, or get PNG export instructions.
/platform-skills:awesome-docs export docs/architecture.md html
/platform-skills:awesome-docs export docs/architecture.md pngReference: references/awesome-docs.md and examples/awesome-docs/
Paste context: paste the manifest, error output, plan output, or code block directly after the command. The more concrete the input, the more actionable the output.
You don't need the slash command: describe your problem in plain English and the skill activates automatically when you're working with relevant files.
Chain commands: run /platform-skills:debug to diagnose, then /platform-skills:review to validate the fix before merging.
Multi-mode workflows:
instrument → alert → dashboard → loadtest
generate → test → validate
create → review → security
investigate → postmortem → rfc (if systemic)
gap → remediate → evidence → checklist
Deep-dive references: each command points to a references/<domain>.md file. Read those for full spec coverage, edge cases, and worked examples.
/platform-skills:aws

What it does: Structured guidance for AWS CloudFront, WAF, Lambda@Edge, and multi-account security patterns. Routes to the correct reference section, calls out common footguns (WAF scope, us-east-1 constraint, Lambda@Edge numbered ARN), and generates production-ready Terraform modules with best practices.
Usage: /platform-skills:aws <mode>
/platform-skills:aws cloudfront # distribution setup, OAC, cache policies, security headers
/platform-skills:aws waf # web ACL, managed rules, rate limiting, logging
/platform-skills:aws lambda-edge # Lambda@Edge vs CloudFront Functions decision + implementation
/platform-skills:aws multi-account # Firewall Manager, FMS policy, Organizations enforcement
/platform-skills:aws review # production-readiness checklist for CloudFront + WAF config
/platform-skills:aws terraform # generate complete Terraform module scaffold
Examples:
/platform-skills:aws cloudfront — how do I restrict my S3 origin to only CloudFront?
/platform-skills:aws waf — add rate limiting on /api/ to 500 req/5min
/platform-skills:aws lambda-edge — should I use Lambda@Edge or CloudFront Functions for URL rewrites?
/platform-skills:aws multi-account — enforce WAF on all CloudFront distributions in our production OU
/platform-skills:aws review — here is my distribution Terraform, is it production ready?
/platform-skills:aws terraform — generate a CloudFront + S3 + WAF module for a static site
Reference: commands/aws.md, references/aws-cloudfront.md, references/aws-waf.md, and examples/aws/
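
The three footguns named above can be expressed as simple pre-flight checks. The sketch below is an illustration, not part of the skill: it flags a web ACL whose scope is not `CLOUDFRONT` or whose region is not us-east-1, and a Lambda@Edge ARN that lacks a numbered version. The function names, messages, and sample account ID are hypothetical.

```python
import re

EDGE_REGION = "us-east-1"

# Matches a Lambda function ARN with an optional trailing qualifier.
LAMBDA_ARN = re.compile(
    r"^arn:aws:lambda:(?P<region>[a-z0-9-]+):\d{12}:function:"
    r"(?P<name>[A-Za-z0-9_-]+)(?::(?P<qualifier>[^:]+))?$"
)


def check_waf(scope: str, region: str) -> list[str]:
    """Flag a web ACL that cannot be attached to a CloudFront distribution."""
    problems = []
    if scope != "CLOUDFRONT":
        problems.append(f'scope is "{scope}"; a CloudFront web ACL needs "CLOUDFRONT"')
    if region != EDGE_REGION:
        problems.append(f"web ACL is in {region}; CLOUDFRONT scope requires {EDGE_REGION}")
    return problems


def check_lambda_edge_arn(arn: str) -> list[str]:
    """Flag an ARN that CloudFront would reject for a Lambda@Edge association."""
    match = LAMBDA_ARN.match(arn)
    if match is None:
        return ["not a Lambda function ARN"]
    problems = []
    if match["region"] != EDGE_REGION:
        problems.append(f"function is in {match['region']}; Lambda@Edge requires {EDGE_REGION}")
    qualifier = match["qualifier"]
    if qualifier is None or not qualifier.isdigit():
        problems.append("ARN must end in a numbered version, not $LATEST, an alias, or nothing")
    return problems


# Hypothetical ARN with a placeholder account ID.
print(check_lambda_edge_arn("arn:aws:lambda:us-east-1:123456789012:function:rewrite:$LATEST"))
print(check_waf("REGIONAL", "eu-west-1"))
```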
/platform-skills:composite-actions

What it does: Interview-driven scaffold, review, hardening, and test generation for composite GitHub Actions. Enforces SHA pinning, secrets-as-inputs, $GITHUB_OUTPUT / $GITHUB_STEP_SUMMARY observability, input validation, and actionlint-clean workflows.
Usage: /platform-skills:composite-actions <mode>
/platform-skills:composite-actions generate # interview-driven full repo scaffold (action.yml, README, test/release workflows, dependabot.yml)
/platform-skills:composite-actions review # review an existing action.yml for correctness and security
/platform-skills:composite-actions secure # harden with SHA pinning, env isolation, ::add-mask::
/platform-skills:composite-actions test # generate a test workflow for the composite action
/platform-skills:composite-actions migrate # migrate a JS/Docker action to composite steps
/platform-skills:composite-actions publish # versioning, release workflow, dependabot.yml
Examples:
/platform-skills:composite-actions generate — build a composite action that deploys to Kubernetes with dry-run support
/platform-skills:composite-actions review — here is my action.yml, check it for security issues
/platform-skills:composite-actions secure — pin all uses: refs to SHAs and mask sensitive outputs
/platform-skills:composite-actions test — generate a test-action.yml for my db-migrate composite action
/platform-skills:composite-actions migrate — convert this Docker action to a composite action
/platform-skills:composite-actions publish — set up semantic versioning and a release.yml workflow
Reference: commands/composite-actions.md, references/composite-actions.md, and examples/github-actions/composite-actions/
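
To make the SHA-pinning rule concrete, here is a minimal sketch of a check that lists every `uses:` reference not pinned to a full 40-character commit SHA. It is a line-based approximation written for illustration, not the skill's own logic: it skips local (`./`) and `docker://` references, and the SHA in the sample is a placeholder.

```python
import re

USES_LINE = re.compile(r"^\s*(?:-\s*)?uses:\s*(?P<ref>\S+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_refs(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line number, ref) for each uses: entry not pinned to a full SHA."""
    findings = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_LINE.match(line)
        if match is None:
            continue
        ref = match.group("ref").strip("'\"")
        # Local actions carry no version; docker:// refs are out of scope here.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not FULL_SHA.match(version):
            findings.append((number, ref))
    return findings


sample = """\
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567 # placeholder SHA
  - uses: ./local-action
"""

for number, ref in unpinned_refs(sample):
    print(f"line {number}: {ref} is not pinned to a commit SHA")
```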
/platform-skills:fluxcd

What it does: Smart entry point for all FluxCD work. Identifies the right workflow from your input (live cluster debugging with a 5-workflow structured trace, GitOps repository audit with a 6-phase analysis, Helm chart review, or manifest production-readiness review) and routes directly to the correct command.
Usage: /platform-skills:fluxcd [describe your situation, error, repo path, or intent]
/platform-skills:fluxcd [error message or flux get output] # → debug mode
/platform-skills:fluxcd [repo path or "audit"] # → audit mode
/platform-skills:fluxcd [Chart.yaml or "helm"] # → helm mode
/platform-skills:fluxcd [Kustomization or HelmRelease YAML] # → review mode
Routing logic:
| If your input contains… | Routes to |
|---|---|
| Error message, flux get output, pod logs, "not reconciling" | /platform-skills:gitops debug (5-workflow debug) |
| Repo path, "audit", "review", "before merge", "is this correct" | /platform-skills:gitops audit (6-phase audit) |
| Helm chart path, Chart.yaml, values.yaml, "helm", "chart" | /platform-skills:helmcheck (chart review) |
| A manifest to review (Kustomization, HelmRelease, FluxInstance YAML) | /platform-skills:review (production-readiness check) |
Examples:
/platform-skills:fluxcd — my HelmRelease is stuck in "Progressing" after a values change
/platform-skills:fluxcd ./clusters — audit our GitOps repo before the Monday release
/platform-skills:fluxcd — is this FluxInstance YAML production-ready? [paste YAML]
/platform-skills:fluxcd — review this Chart.yaml and values.yaml for security issues
Reference: commands/fluxcd.md, references/fluxcd.md, references/fluxcd-sources.md, references/fluxcd-operator.md, and examples/fluxcd/
/platform-skills:renovate

What it does: Interactive wizard that scans your repo for dependency file types and generates a complete Renovate setup: renovate.json with per-ecosystem rules, private registry auth, custom regex managers for internal sources, a pre-commit validation hook, and a GitHub Actions validation workflow. The workflow requires no secrets or tokens; registry credentials use env-var templating.
Works on: Any repo using GitHub Actions, Terraform, Helm, Go, Node, Python, Docker, Rust, Kubernetes manifests, or any combination, including repos with internal GitHub org modules, private Terraform registries, ECR/GCR/ACR/Harbor image registries, and private Helm OCI or HTTP chart registries.
Usage: /platform-skills:renovate [generate|workflow|precommit|all]
Wizard questions (asked when no arguments are given, or for any mode that emits renovate.json):
| Q | Asks | Options |
|---|---|---|
| Q1 | Mode | generate / workflow / precommit / all |
| Q2 | Pinning strategy | digest (SHA for Actions + images) / semver (version tags only) |
| Q3 | Automerge scope | patch-only / minor+patch / none |
| Q4 | Update schedule | weekday mornings / monday / weekend / always |
| Q5 | Internal Terraform modules | GitHub org / private TF registry / no |
| Q6 | Private Helm registry | OCI / HTTP / no |
| Q7 | Private container registry | ECR / GCR / ACR / Harbor / no (multi-select) |
Modes:
| Mode | What it does |
|---|---|
| generate | Scans repo → detects ecosystems → emits renovate.json with per-manager packageRules, hostRules for private registries, regexManagers for internal module sources |
| precommit | Emits .pre-commit-config.yaml with renovatebot/pre-commit-hooks for local validation before every commit |
| workflow | Emits .github/workflows/validate-renovate.yml: JSON syntax check + renovate-config-validator (Node 24) + ecosystem coverage scan, triggered on PRs that touch renovate.json |
| all | Runs generate → precommit → workflow in sequence, prints a single consolidated commit block at the end |
Examples:
/platform-skills:renovate
/platform-skills:renovate all
/platform-skills:renovate generate
/platform-skills:renovate precommit
/platform-skills:renovate workflow
Reference: references/renovate.md
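
As a rough illustration of how wizard answers Q2 to Q4 might translate into a renovate.json, the sketch below maps a pinning strategy, automerge scope, and schedule onto Renovate config keys. It is an assumption about the shape of the output, not what the skill actually emits: the exact schedule strings and the choice of presets are illustrative, and private-registry `hostRules` and regex managers (Q5 to Q7) are left out.

```python
import json

# Assumed mapping from the wizard's schedule options to Renovate schedule strings.
SCHEDULES = {
    "weekday mornings": ["before 6am every weekday"],
    "monday": ["before 6am on monday"],
    "weekend": ["every weekend"],
    "always": None,  # no schedule key: Renovate runs at any time
}

AUTOMERGE = {
    "patch-only": ["patch"],
    "minor+patch": ["minor", "patch"],
    "none": [],
}


def build_config(pinning: str, automerge: str, schedule: str) -> dict:
    """Assemble a renovate.json-shaped dict from three wizard answers."""
    config = {
        "$schema": "https://docs.renovatebot.com/renovate-schema.json",
        "extends": ["config:recommended"],
    }
    if pinning == "digest":
        # Presets that pin GitHub Actions and Docker images to digests.
        config["extends"] += ["helpers:pinGitHubActionDigests", "docker:pinDigests"]
    if SCHEDULES[schedule]:
        config["schedule"] = SCHEDULES[schedule]
    if AUTOMERGE[automerge]:
        config["packageRules"] = [
            {"matchUpdateTypes": AUTOMERGE[automerge], "automerge": True}
        ]
    return config


print(json.dumps(build_config("digest", "patch-only", "monday"), indent=2))
```

Running the generated file through renovate-config-validator, as the workflow mode does, is the reliable way to confirm that a config like this is accepted by the Renovate version in use.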