Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
All notable changes to Platform Skills will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
agents/openai.yaml so Codex can display and invoke platform-skills with a default $platform-skills prompt..cursorrules and .cursor/rules/*.mdc to the current release and documented project/global Cursor installation.install.sh, a README install matrix, and copy-ready prompts for platform, DevOps, cloud, and SRE teams evaluating the skill.PROMPTS.md plus issue templates for agent/editor support, domain guide requests, and example contributions.${CODEX_HOME:-$HOME/.codex}/skills/platform-skills so SKILL.md, references/, examples/, and agents/openai.yaml stay together.commands/renovate.md: new /platform-skills:renovate command with four modes — generate (scan repo for dep file types, emit renovate.json covering only detected ecosystems with per-manager automerge/schedule/grouping rules and minimumReleaseAge: "3 days"), workflow (emit .github/workflows/validate-renovate.yml for schema + config-validator + coverage validation on PRs that touch renovate.json), precommit (interactive semver/automerge wizard and best-practices adviser), and all (run all modes in sequence).references/renovate.md: 11-section reference — manager catalog, preset reference, package rules patterns, dependency dashboard, GitOps integration (Renovate vs Flux Image Reflector ownership boundary), security hardening (pinDigests, osvVulnerabilityAlerts, minimumReleaseAge), regex manager templates, troubleshooting, private registries, custom regex managers, pre-commit hook..github/workflows/validate-renovate.yml: CI workflow triggering on PRs that touch renovate.json — three parallel jobs (JSON syntax via jq, config-validator via npx renovate-config-validator, coverage bash scan) with GITHUB_STEP_SUMMARY result table. 
No secrets required.renovate.json: added dependencyDashboard, osvVulnerabilityAlerts, minimumReleaseAge: "3 days" on automerge rules, packageRules for gomod/npm/pip, npmDedupe postUpdateOption.SKILL.md: Renovate row in tool table; /platform-skills:renovate slash command entry.COMMANDS.md: TOC entry and full /platform-skills:renovate command section (31 commands total).commands/aws-profile.md: new /platform-skills:aws-profile command with five modes — discover (profile table with TTL traffic-light), status (per-server credential health), switch (prod-gated profile patching across VS Code + Claude Code), login (type-aware auth command), org-scan (AWS Org account coverage).references/aws-mcp-profiles.md: 15-section reference covering AWS MCP server catalog, context budget table, profile type detection, auth flows, credential lifecycle, credential_process pattern, config file locations, VS Code input variables, team sharing, multi-account EKS, health check checklist, and prod safety.commands/mcp.md: new configure-aws mode with upfront cost/token warning, CLI-first alternative check, credential_process validation, version-pinned config generation for VS Code and Claude Code, --multi-account EKS named instances, and AWS MCP server catalog with 5 starter kits.examples/mcp/aws-multiprofile/: 13 example files — VS Code global/workspace/input/template configs, Claude Code snippet, aws-config-credential-process.ini, .gitignore.snippet, and 5 starter-kit JSON files (eks-debug, observe, knowledge, deploy, cost) with real PyPI version pins.SKILL.md: AWS MCP Profiles row in tool table; /platform-skills:aws-profile slash command entry.COMMANDS.md: TOC entry and full /platform-skills:aws-profile command section (30 commands total).references/mcp.md: AWS Multi-Account Profiles pointer section.evals/) for quality testing: pr-triage, opa-policy, flux-helmrelease. 
Each scenario includes task.md, criteria.json, and capability.txt.evals/) for 3 skill domains: PR triage, OPA/Rego policy, and Flux CD HelmRelease diagnosis.workflow_dispatch).meta.helm.sh annotations; evidence commands use helm list -A for cross-namespace owner lookup.references/opa.md: added Terraform CI Pipeline Placement section.commands/triage.md: strengthened reply format with concrete example.continue-on-error; publish now uses --skip-evals with separate eval step.--skip-evals from publish step so eval results link to the registry version and appear on the Evals tab.ci_placement criteria updated to check for runnable conftest test command presence.rendered manifests contain a resource that already exists) added to troubleshooting reference with evidence commands and fix.✅ Fixed ending requirement added to commands/triage.md.SKILL.md description updated to "as a system" framing — signals cross-cutting platform scope to reduce Tessl Distinctiveness conflict risk with single-technology skills.SKILL.md description rewritten to lead with actions (troubleshoot, implement, review, audit) and explicit "Use when" trigger for improved Tessl Completeness score.SKILL.md frontmatter description value quoted to fix YAML parse error (unquoted colon in value caused mapping values are not allowed error).SKILL.md frontmatter description trimmed from 1,357 → 580 characters to pass marketplace 1-1024 character validation.plugin.json description trimmed from 1,254 → 717 characters (same limit).references/fluxcd-sources.md — authoritative source CRD reference: GitRepository (auth options, sparse checkout, SSH note), OCIRepository (Cosign/Notation keyless verification, cloud OIDC, layerSelector), HelmRepository, HelmChart, Bucket, ExternalArtifact, ArtifactGenerator (monorepo decomposition + Helm chart composition, copy strategies: Overwrite/Merge/Extract). 
Source selection decision matrix for all 7 source kinds.references/fluxcd-resourcesets.md — complete ResourceSet and ResourceSetInputProvider reference: template syntax (<< >> delimiters, slim-sprig functions), input strategies (Flatten/Permute with name normalization rules), all 8 provider types (Static, OCIArtifactTag, ECRArtifactTag, GitHubTag, GitHubPullRequest, GitLabTag, AzureContainerTag, ExternalService), CEL-based dependency readiness expressions, conditional resource inclusion, copyFrom annotation, built-in input fields, deduplication, gitless multi-tenant fleet pattern, preview environments per PR, gitless image automation with limit: 1 warning.references/fluxcd-notifications.md — notification-controller reference: Provider (all 4 categories: messaging, alerting/monitoring, event streaming, Git commit status; key spec fields), Alert (severity levels, inclusion/exclusion regex filtering, eventSources with wildcard/cross-namespace/matchLabels, eventMetadata), Receiver (HMAC webhook, CEL resourceFilter, webhookPath from status), common patterns: Slack error alerting, GitHub commit status on PRs, Datadog event forwarding, immediate Git sync on push.references/fluxcd-operator.md — Flux Operator deep dive: FluxInstance rules (one per cluster, named flux), cluster sizing table (small/medium/large: concurrency and memory), multi-tenancy (SA impersonation, cross-namespace restrictions), sync modes (OCI gitless vs Git), kustomize patches for controller Deployments, optional components (image-reflector, image-automation, source-watcher), FluxReport (5-minute auto-update, reporting interval annotation), runtime ConfigMap (flux-runtime-info) with SSA Merge for Terraform co-ownership, reconciliation trigger annotation, migration path from flux bootstrap.references/fluxcd-kustomization.md — Kustomization advanced reference: core fields with descriptions, CEL-based dependency readyExpr, postBuild substituteFrom with correct Kustomization scoping rule, SOPS decryption 
(age key + Cloud KMS workload identity), CEL health check expressions, remote cluster deployment (secret-based and secretless workload identity), SSA apply behaviour annotations (Override/Merge/IfNotPresent/Ignore/force/prune-Disabled), status inventory format, reconciliation triggers, common mistakes table.references/fluxcd-helmrelease.md — HelmRelease deep dive: spec.chartRef vs spec.chart.spec mutual exclusivity, values merging order, install/upgrade RetryOnFailure strategy (legacy vs modern fields), drift detection modes (disabled/warn/enabled) with ignore rules for HPA-managed fields, CRD lifecycle table (Create/Skip/CreateReplace), dependencies, CEL health check expressions, post-renderers (Kustomize patches applied after Helm render, sequential pipeline), remote cluster deployment, status tracking with history, common mistakes table.references/fluxcd-terraform.md — Terraform + Flux Operator bootstrap: ownership model (Terraform = ephemeral Job, Flux = steady state), repository layout, gitops resources (create-once) vs managed resources (every run), runtime info variable substitution, Terraform + Git co-ownership via SSA Merge, sync authentication (PAT/GitHub App/OCI), node scheduling at all 3 layers (Job/Operator/controllers), drift and revision counter, shared operator values file, debugging with debug_on_failure.references/fluxcd-mcp.md — Flux MCP server (flux-operator-mcp): Homebrew install, JSON MCP config (absolute path warning, --read-only for production), 6 core workflows (cluster inspection, HelmRelease trace, Kustomization trace, ResourceSet trace, multi-cluster comparison, force reconciliation), log analysis steps, applying manifests with overwrite: true warning, key guidelines (always call get_kubernetes_api_versions first, secrets are masked, avoid modifying Flux-managed resources).references/datadog.md — added FluxCD integration section: built-in Agent integration (7.51.0+, minimum 7.49.1), supported controllers 
(source/kustomize/helm/notification), pod annotation setup via FluxInstance kustomize patch, log collection config, key metrics table (reconcile duration, reconcile condition, workqueue depth/retries, active workers, process CPU/memory, leader election), three recommended monitors (HelmRelease failure, Kustomization failure, workqueue saturation), service check fluxcd.openmetrics.health, dashboard guidance.references/dynatrace.md — added FluxCD integration section: Method 1 ActiveGate Prometheus scraping (DynaKube config + pod annotation patches via FluxInstance), Method 2 OTel Collector pipeline (scrape config + OTLP exporter to DT endpoint), key metrics under gotk_ namespace, Dynatrace metric expression for reconciliation failure rate, Davis AI anomaly detection recommendations (gotk_reconcile_condition, workqueue_depth, process_resident_memory_bytes), log ingestion with field extraction, required token scope (metrics.ingest).references/fluxcd-troubleshooting.md — scannable incident cheat-sheet derived from fluxcd/agent-skills gitops-cluster-debug references. Organised by failure category: controller failures, source failures (GitRepository, OCIRepository, HelmChart/HelmRepository), Kustomization failures, HelmRelease failures, ResourceSet failures. Each entry: symptom → most likely cause → exact fix. Includes label-based ownership tracing section (how to find the managing Kustomization or ResourceSet from resource labels without guessing), and a 10-step general debugging checklist.commands/gitops.md (debug mode) — Workflow 2 (HelmRelease trace) and Workflow 3 (Kustomization trace) now include explicit label-tracing steps: read kustomize.toolkit.fluxcd.io/name and resourceset.fluxcd.io/name labels to identify the managing object. 
Added reference pointer to fluxcd-troubleshooting.md at end of debug workflows.references/fluxcd-migration.md — Flux API migration guide: v2.7 v1beta1 removals (all 5 toolkits), v2.8 v1beta2 removals (4 toolkits), current stable apiVersion table, CLI 3-step upgrade (migrate files → migrate in-cluster → upgrade components), Flux Operator 3-step upgrade (upgrade operator → migrate files → update FluxInstance version), CI gate pattern, common post-migration issues.references/fluxcd-security.md — Security audit checklist with grep commands: unencrypted secrets (Secret without sops:/ENC[), hardcoded credentials (password:, token:, ACCESS_KEY=), insecure sources (insecure: true), cloud registries without Workload Identity, cross-namespace source refs, OCIRepository without Cosign, cluster-admin bindings, image automation pushing to main. Automated scan via upstream validate.sh and check-deprecated.sh.commands/gitops.md — merged gitops-audit into gitops as a subcommand. Command now has two modes: debug (five structured Flux debug workflows: installation check, source failures, HelmRelease trace, Kustomization trace, ResourceSet trace — plus Argo CD diagnostics, producing a 5-section report) and audit (six-phase read-only repo analysis: discovery via discover.sh, manifest validation via validate.sh, API compliance via check-deprecated.sh, best practices checklist, security grep patterns, Critical/Warning/Info report). commands/gitops-audit.md deleted — its content lives in the audit subcommand.references/fluxcd.md — Repository patterns section expanded: three pattern descriptions (monorepo, multi-repo Git-based, multi-repo OCI-based), identification heuristics table (8 signals), layout rules (flux-system placement, no overlapping paths, no workloads in default namespace).references/fluxcd.md — major expansion from 194 → 430+ lines. 
New sections: full CRD reference table (18 CRDs with apiVersion and controller), source selection decision matrix, Flux Operator and FluxInstance (installation, upgrade lifecycle, FluxReport health, gitless OCI sync YAML, one-per-cluster rule), ResourceSet and ResourceSetInputProvider (N-input templating, << inputs.field >> delimiters, OCIArtifactTag input provider for gitless image automation), gitless OCI delivery model (why to prefer it for Flux Operator deployments, CI push commands, OCIRepository source YAML, HelmRelease from OCI with layerSelector), reconcile.fluxcd.io/watch: Enabled label (what it does, when to use on valuesFrom/substituteFrom ConfigMaps), common mistakes table (5 entries: template delimiters, mutually exclusive chart fields, multiple FluxInstances, remediation strategy, OCIRepository layerSelector). Image automation section extended with gitless model (OCIArtifactTag + ResourceSet) alongside existing Git-based model, with comparison table (Git credentials, audit trail, PR approval, recommended for).commands/gitops.md — major upgrade from 65 → 200+ lines. Replaced flat evidence-collection approach with 5 structured debug workflows: (1) Installation Check — FluxInstance status, FluxReport health, crashlooping controller logs; (2) HelmRelease trace — spec/status → managing object → valuesFrom references → chart source → inventory → pod logs; (3) Kustomization trace — spec/status → parent object → substituteFrom → source → managed resources; (4) ResourceSet trace — status → inputsFrom providers → dependsOn → generated resources; (5) Log analysis — Deployment → matchLabels → pods → logs. Added structured 5-section report format (Summary, Resource Analysis, Dependency Chain, Root Cause, Recommendations). Added edge cases: spec.suspend: true, Ready: Unknown, Flux-managed resource revert warning, stale status backpressure. 
Argo CD content retained.examples/fluxcd/multi-tenant/ — multi-tenant repo pattern: clusters/production/tenants.yaml, tenants/team-a/ (namespace, ServiceAccount in flux-system, RoleBinding to tenant-reconciler ClusterRole, GitRepository, Kustomization with serviceAccountName and targetNamespace for RBAC isolation), platform/rbac/tenant-role.yaml (ClusterRole scoped to workload resources), platform/network-policy/default-deny.yaml (default-deny-all + allow-same-namespace + DNS egress per tenant namespace). README includes bootstrap instructions, troubleshooting, and RBAC can-i verification.examples/fluxcd/helm-releases/ — Helm chart management via OCI: releases/base/cert-manager/ocirepository.yaml (semver 1.x, layerSelector.mediaType), releases/base/cert-manager/helmrelease.yaml (spec.chartRef, RetryOnFailure, driftDetection.mode: enabled, valuesFrom), releases/base/cert-manager/values-configmap.yaml (reconcile.fluxcd.io/watch: Enabled), releases/staging/kustomization.yaml (1 replica overlay), releases/production/kustomization.yaml (3 replicas, PDB, anti-affinity). README covers key patterns and common mistakes.examples/fluxcd/image-automation/ — both image automation models side-by-side: git-based/imagerepository.yaml, git-based/imagepolicy.yaml (semver range), git-based/imageupdateautomation.yaml (staging branch push, not main); gitless/resourcesetinputprovider.yaml (OCIArtifactTag, semver filter), gitless/resourceset.yaml (<< inputs.tag >> substitution). README includes comparison table, setup steps for both models, safety rules, and troubleshooting.examples/fluxcd/flux-operator/ — Flux Operator bootstrap: fluxinstance.yaml (gitless OCI sync, networkPolicy: true, all 6 components, reconcile annotations), ocirepository.yaml (Cosign verify), verify-policy.yaml (keyless Cosign with OIDC issuer + subject matching for GitHub Actions). 
README covers Flux Operator vs flux bootstrap comparison table, install steps, push-and-sign CI commands, health checks, and migration from flux bootstrap.examples/fluxcd/README.md — updated from "planned patterns" stubs to full 5-example index table with pattern descriptions, choose-the-right-pattern guide, and shared best practices matrix.commands/fluxcd.md — new /platform-skills:fluxcd slash command. Smart entry point that identifies the right workflow from input: error/logs/symptom → /platform-skills:gitops debug (5-workflow debug); repo path/audit intent → /platform-skills:gitops audit (6-phase audit); helm chart path → /platform-skills:helmcheck; manifest YAML → /platform-skills:review. Includes quick reference apiVersion table for all 18 FluxCD CRDs and links to all 11 fluxcd-*.md reference files.examples/flux/ → renamed to examples/fluxcd/ — consistent naming with references/fluxcd.md, commands/fluxcd.md, and examples/argocd/. All 5 subdirectories preserved: basic-monorepo/, multi-tenant/, helm-releases/, image-automation/, flux-operator/.references/flux.md → renamed to references/fluxcd.md — consistent naming with argocd.md, datadog.md, kyverno.md; all references updated across SKILL.md, HOW_IT_WORKS.md, GETTING_STARTED.md, README.md, COMMANDS.md, CONTRIBUTING.md, tests/*.sh, examples/fluxcd/README.md.SKILL.md — Domain 29 updated to GitOps (debug + audit subcommands); updated Flux/FluxCD reference descriptions; added 8 new FluxCD reference pointers; added /platform-skills:fluxcd slash command; removed /platform-skills:gitops-audit; updated examples/fluxcd/ path.skills/platform-skills/SKILL.md — synced with root SKILL.md.COMMANDS.md — added /platform-skills:fluxcd to command index table; updated gitops section to document debug and audit subcommands; removed gitops-audit section.HOW_IT_WORKS.md — updated gitops row to gitops debug / gitops audit; removed gitops-audit; added fluxcd row.GETTING_STARTED.md — updated gitops row to show debug/audit subcommands; 
replaced gitops-audit row with fluxcd entry; expanded FluxCD reference table with all 9 new fluxcd-*.md entries.INSTALLATION.md — version string updated to v1.25.0..claude-plugin/plugin.json — version bumped to 1.25.0; ./commands/fluxcd.md added; ./commands/gitops-audit.md removed (merged into gitops)..claude-plugin/marketplace.json — version bumped to 1.25.0; description rewritten to 628 chars (was 1,784 chars over the 1,024 limit); added gitops-audit, flux-operator, fluxinstance, resourceset, gitless-delivery, oci-gitops keywords.references/composite-actions.md — comprehensive reference covering: decision matrix (composite vs JavaScript vs Docker action), action.yml anatomy, variables and secrets (full mechanism table: ${{ inputs.* }}, env:, $GITHUB_ENV, $GITHUB_OUTPUT, $GITHUB_PATH, $GITHUB_STEP_SUMMARY, ::add-mask:: — with end-to-end secrets flow diagram), inputs and outputs (type annotations, boolean string comparison, multi-step $GITHUB_OUTPUT chaining), nine security requirements (SHA pinning, shell: on every run step, secrets as inputs, env isolation from shell injection, masking, ${{ github.action_path }}, minimum token permissions), observability ($GITHUB_STEP_SUMMARY job summary, ::group::/::endgroup:: log grouping, ::error::/::warning::/::notice:: inline annotations, RUNNER_DEBUG debug mode), fail-fast input validation pattern (enum + regex + required check with ::error:: annotations), step control flow (IDs, conditionals, continue-on-error, $GITHUB_ENV sharing), file references and scripts, multi-OS patterns (${{ runner.os }}, pwsh vs bash), idempotency guidance (skip-if-exists, upsert, kubectl apply patterns), timeout and concurrency (step-level timeout-minutes, caller concurrency: recommendation), testing (local path test workflow, matrix strategy, act commands, actionlint verification), breaking changes (deprecation warnings, dual-input support, CHANGELOG migration notes), versioning and release (semver + floating major tag, release workflow with 
actionlint gate), dependabot.yml configuration for github-actions ecosystem, anti-patterns table (15 entries), and production checklist (security, correctness, observability, testing, documentation, release)
- commands/composite-actions.md — slash command /platform-skills:composite-actions with four modes:
  - generate: 9-question interview (purpose, repo destination, pinning strategy, inputs with secret flag, outputs, cloud credentials, notifications, job summary, confirm) → full scaffold: action.yml + README.md + CHANGELOG.md + .gitignore + .github/dependabot.yml + .github/workflows/test-action.yml + .github/workflows/release.yml; if existing repo: branches as feat/add-<name>-composite-action, commits all files, opens PR via gh pr create with inputs/outputs/secrets summary and test plan checklist; post-generation tip to run /platform-skills:awesome-docs generate
  - review: audit existing action.yml — CRITICAL (unpinned tags, missing shell:, injection, direct secrets access) / WARNING (no validation, unmasked secrets, relative paths, no summary, no timeout) / INFORMATIONAL (branding, descriptions, dependabot, test workflow) — scored N/20
  - secure: fix in place — resolve tags to SHAs via gh api, add shell: bash, move inputs.* to env: blocks, add ::add-mask::, inject validation step, inject job summary step
  - test: generate test workflow with local path reference and matrix, plus act commands for all key input variants
- examples/github-actions/composite-actions/:
  - docker-build-push/ — full repo structure: action.yml (multi-platform Buildx, GHCR via ephemeral GITHUB_TOKEN, GHA layer cache, SLSA provenance + SBOM attestations, input validation, ::group:: phases, $GITHUB_STEP_SUMMARY), README.md (architecture diagram, variables & secrets flow, build-args secret guidance), CHANGELOG.md, .github/workflows/test-action.yml (4 test jobs: defaults, multi-platform, validation failure assertion, custom Dockerfile), .github/workflows/release.yml (actionlint gate + SHA pinning check + floating major tag), .github/dependabot.yml, .actionlint.yaml
  - notify-slack/ — full repo structure: action.yml (webhook URL masked immediately with ::add-mask::, payload built via printf to avoid quoting issues, channel override, @mention on failure, status_emoji + http_status outputs, 1-minute timeout on curl), README.md (secrets flow diagram, logged vs masked table, Slack webhook setup guide), CHANGELOG.md, .github/workflows/test-action.yml (3 test jobs: invalid webhook validation, invalid status validation, status/emoji matrix), .github/workflows/release.yml, .github/dependabot.yml, .actionlint.yaml
  - k8s-deploy/ — action.yml (base64 kubeconfig decoded to chmod-600 temp file, ::add-mask:: on decoded content, kubectl apply with --dry-run=server support, rollout status with configurable timeout, kubectl events on failure, cleanup post-step runs if: always()), README.md (kubeconfig secret flow, logged vs masked table, minimum RBAC snippet, base64 encoding commands, service account setup), .github/workflows/test-action.yml, .github/workflows/release.yml, .github/dependabot.yml, .actionlint.yaml
  - terraform-plan/ — action.yml (AWS + Azure OIDC dual-cloud support, Terraform provider cache, fmt → validate → plan pipeline with -detailed-exitcode, idempotent PR comment upsert via hidden marker <!-- terraform-plan:<dir> --> using github-script, plan truncated to 65,000 chars), README.md (OIDC auth flow, logged vs masked table, has_changes output usage), .github/workflows/test-action.yml, .github/workflows/release.yml, .github/dependabot.yml, .actionlint.yaml
  - security-scan/ — action.yml (Trivy image/fs/repo scan, severity enum validation, registry_password masked, ::error:: annotations for CRITICAL findings, ::warning:: for HIGH, SARIF output support, fail_on_findings gate), README.md (build → scan → deploy pipeline example, no-credentials pattern for GHCR), .github/workflows/test-action.yml, .github/workflows/release.yml, .github/dependabot.yml, .actionlint.yaml
  - release-tag/ — action.yml (conventional commit auto-detection: feat!/BREAKING CHANGE → major, feat → minor, fix/perf/etc → patch; $GITHUB_OUTPUT chaining across 4 steps: bump → git_tag → changelog → create_release; floating major tag update; draft + prerelease support; is_new_release=false skip path), README.md (output chaining diagram, secrets flow, idempotency note, combined release + notify example), .github/workflows/test-action.yml, .github/workflows/release.yml, .github/dependabot.yml, .actionlint.yaml
  - pr-comment/ — action.yml (idempotent upsert via hidden HTML marker, listComments pagination, collapsible <details> wrapper, delete_on_close for PR closed event, update_existing toggle, github-script with explicit token, comment_id/comment_url/action_taken outputs), README.md (unique marker pattern for multiple instances, delete-on-close example, token scoping), .github/workflows/test-action.yml, .github/workflows/release.yml, .github/dependabot.yml, .actionlint.yaml
  - setup-env/ — tutorial-baseline multi-runtime action: action.yml (Node.js/Python/Go runtime choice, per-runtime version defaults, dependency cache restore: ~/.npm/~/.cache/pip/~/go/pkg/mod, optional install_dependencies flag, runtime_version + cache_hit outputs, input validation, $GITHUB_STEP_SUMMARY), README.md (runtime matrix usage example), CHANGELOG.md, .github/workflows/test-action.yml (4 test jobs: invalid runtime assertion, matrix across all runtimes/versions, default versions, cache disabled), .github/workflows/release.yml, .github/dependabot.yml, .actionlint.yaml
  - configure-cloud/ — action.yml (AWS + Azure + GKE OIDC via cloud_provider switch; GKE uses google-github-actions/auth WIF + get-gke-credentials; full required-input validation per provider; SHA-pinned actions; aws_account_id output), README.md (per-provider quick-start examples, OIDC trust prerequisites)
  - setup-terraform/ — action.yml (version install via hashicorp/setup-terraform, provider plugin cache at ~/.terraform.d/plugin-cache, terraform_wrapper toggle, working directory support, terraform_version output), README.md (cache key strategy, workspace pattern)
  - db-migrate/ — action.yml (Flyway / Liquibase / golang-migrate; ::add-mask:: on database_url; nc health check with retry loop; advisory lock timeout; dry-run; post-migration verify; job summary), README.md (per-tool quick-start, rollback guide, connection URL format)
- examples/github-actions/composite-actions/README.md — index with 11-example quick-reference table, decision guide ("Need to build a container image? → docker-build-push"), shared best-practices matrix (shell on every run step, SHA pinning, secrets-as-inputs, masking, env isolation, input validation, job summary, log groups, annotations, timeout, idempotency, dependabot, release workflow)
- commands/composite-actions.md — extended with two new modes:
  - migrate: scan .github/workflows/ for repeated step sequences (3+ occurrences), rank by impact (count × step-count), extract to composite action, rewrite callers
  - publish: GitHub Actions Marketplace checklist — prerequisite verification (root placement, unique name, description length, branding), Feather icon/color suggestions per action type, release tag commands, submit walkthrough
- references/composite-actions.md — extended with five new sections:
  - $GITHUB_STATE — passing data from main run to post-step cleanup; pure composite workaround using $GITHUB_ENV + if: always()
  - shell:, empty ${{ secrets.* }}, ::add-mask:: not working, uses: ./ not found, actionlint shellcheck errors, empty step outputs (3 causes), floating tag not updated, boolean comparison always false
- references/github-actions.md — extended with composite actions section: key rules summary, pointer to references/composite-actions.md, slash command, and links to all 11 examples

Thanks to @geetika-sv for contributing the composite GitHub Actions skill — 11 production-ready examples, the /platform-skills:composite-actions command, and references/composite-actions.md in this release.
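The conventional-commit detection listed for release-tag/ (feat!/BREAKING CHANGE → major, feat → minor, fix/perf/etc → patch, with a skip path when nothing is releasable) can be sketched roughly as below. This is an illustrative Python sketch, not the action's actual implementation: the names `detect_bump`, `next_version`, and `PATCH_TYPES` are hypothetical, the patch set is assumed to be only `fix` and `perf` because the changelog leaves "etc" unspecified, and any type carrying `!` is assumed to count as breaking.

```python
import re

# Assumption: the changelog says "fix/perf/etc"; only these two are treated as patch here.
PATCH_TYPES = {"fix", "perf"}

# Matches a conventional commit header such as "feat(api)!: drop v1 endpoints".
HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?:")

RANK = {None: 0, "patch": 1, "minor": 2, "major": 3}


def detect_bump(messages):
    """Return "major", "minor", "patch", or None when no commit is releasable."""
    best = None
    for message in messages:
        header = message.splitlines()[0] if message else ""
        match = HEADER.match(header)
        if "BREAKING CHANGE" in message or (match and match.group("bang")):
            level = "major"
        elif match and match.group("type") == "feat":
            level = "minor"
        elif match and match.group("type") in PATCH_TYPES:
            level = "patch"
        else:
            level = None
        # Keep the highest-impact bump seen so far.
        if RANK[level] > RANK[best]:
            best = level
    return best


def next_version(current, bump):
    """Apply a bump to a plain MAJOR.MINOR.PATCH string; None leaves it unchanged."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current
```

A `None` result corresponds to the is_new_release=false skip path: the caller would leave the version untouched and skip tagging.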
init global / init local subcommandscommands/self-improve.md — init mode split into three explicit forms:
init global — scaffolds ~/.claude/ directly, no prompt; offers to wire all three hooks and create ~/.claude/CLAUDE.md; idempotent (reports existing state and stops if already set up)init local — scaffolds .learnings/ and memory/ in $PWD, no prompt; asks gitignore vs commit; idempotentinit (no argument) — asks user to choose, then delegates to init global or init localcommands/self-improve.md — argument-hint updated to [init [global|local]|log [LRN|ERR|FEAT]|promote <ID>|migrate [global|local]|status|resume|review|state]commands/self-improve.md — Path Resolution preamble updated to short-circuit on init global / init local before auto-detectionreferences/agent-self-improve.md — Bootstrap section updated with explicit subcommand examples; line 26 updated from bare init to init global / init local examplesexamples/agent-self-improve/HOW_IT_WORKS.md — init section rewritten with both subcommands, idempotency notestatus and migratecommands/self-improve.md — new ## Mode: status — read-only health summary: counts across all three log files, working-buffer age, sessions since last review, any WAL entries still PENDING, pending error count, and a prioritised action-item listcommands/self-improve.md — new ## Mode: migrate — moves workspace between global and local scope; chooses merge or replace for conflicting entries; writes a WAL entry before moving any files; verifies write to new path and confirms deletion of old path; idempotent if already at target scopepromote supports global targetcommands/self-improve.md — promote mode now includes ~/.claude/CLAUDE.md as a valid target for rules that should apply across all projects (e.g. 
"never do X on any repo"); scope determination step added before drafting the promoted linereferences/agent-self-improve.md — Promotion targets table updated with ~/.claude/CLAUDE.md rowreview flags stale resolved entriescommands/self-improve.md — review mode now reports entries with Status: resolved that are older than 30 days and have no Action containing "promoted", surfacing them as promotion candidates rather than silently aging outresume warns on stale working buffercommands/self-improve.md — resume mode now reads the Last updated: timestamp from memory/working-buffer.md before proceeding; warns if buffer is 3+ days old (task may be abandoned); blocks with an explicit confirmation prompt if 7+ days oldSKILL.md — self-improve reference line updated to mention global/project-local workspace, init global/init local, status/migrate subcommandsSKILL.md — /platform-skills:self-improve slash command description updated to reflect full capability setskills/platform-skills/SKILL.md — same updates applied (kept in sync with root SKILL.md)commands/self-improve.md — added ## Path Resolution preamble that all modes evaluate before acting: detects global setup (~/.claude/.learnings/ exists), project setup (.learnings/ in $PWD), or auto-creates global on first use; all path references updated to use LEARNINGS_BASEcommands/self-improve.md — init mode now asks global vs project-local before creating any directories; step 4 skips .gitignore check for global setup; step 5 targets the correct settings.json (~/.claude/ for global, .claude/ for project) and enforces absolute paths in the PostToolUse hook commandcommands/self-improve.md — log confirm message now shows resolved $LEARNINGS_BASE path instead of hardcoded .learnings/commands/self-improve.md — resume, review, promote, state, WAL Protocol, Working Buffer, SESSION-STATE, and Daily Notes proactive protocols updated to use $LEARNINGS_BASEreferences/agent-self-improve.md — new Global vs project scope section under Directory 
layout: scope comparison table, auto-detection order, promotion-targets-stay-local rule, working PostToolUse hook JSON example with absolute paths
- examples/agent-self-improve/HOW_IT_WORKS.md — updated init description to mention global vs project choice; corrected "What This Skill Cannot Do" section which previously stated learnings cannot persist across projects
- examples/agent-self-improve/README.md — updated setup instructions to reflect global scope option and hook wiring steps
- examples/agent-self-improve/scripts/session-end.sh — Stop hook script: drains .pending-errors.log into ERR entries on every session close, saves daily notes, auto-logs PENDING WAL as ERR, clears session marker, increments session counter with review reminder every 5 sessions, nudges if no LRN logged today
- examples/agent-self-improve/scripts/session-start-reminder.sh — PreToolUse hook script: marker-based session detection fires once per session; prints memory-load banner with active task and pending error count; silent on all subsequent tool calls
- examples/agent-self-improve/global-claude.md — template for ~/.claude/CLAUDE.md: temporary path override table, session-start instruction, in-session logging rules (log LRN/ERR/FEAT at moment of discovery), VFM_THRESHOLD=60, Agent Rules section
- examples/agent-self-improve/settings.json.example — all 3 hooks wired: Stop (session-end.sh), PreToolUse (session-start-reminder.sh), PostToolUse (inline error capture to .pending-errors.log)
- examples/agent-self-improve/scripts/session-end.ps1 — PowerShell 5.1+ equivalent of session-end.sh for Windows native; same logic (daily notes, error drain, WAL check, session counter, review reminder)
- examples/agent-self-improve/scripts/session-start-reminder.ps1 — PowerShell equivalent of session-start-reminder.sh for Windows native; marker-based once-per-session banner
- examples/agent-self-improve/settings-windows.json.example — all 3 hooks wired with PowerShell commands for Windows native (%USERPROFILE% paths, powershell
-NonInteractive -File ...)
- examples/agent-self-improve/scripts/session-end.sh — shebang changed from #!/bin/bash to #!/usr/bin/env bash for portability on Linux distros where bash is not at /bin/bash
- examples/agent-self-improve/scripts/session-start-reminder.sh — same shebang fix
- examples/agent-self-improve/README.md — new Platform support table (macOS, Linux, Alpine, WSL, Git Bash, Windows native); Windows recommendation (WSL/Git Bash is simpler); PowerShell copy/setup instructions; execution policy note
- commands/self-improve.md — Path Resolution preamble updated with cross-platform ~ resolution note; init global step 5 updated to detect platform and point to correct scripts (session-end.sh vs .ps1, settings.json.example vs settings-windows.json.example)
- references/agent-self-improve.md — new Platform Compatibility section: global config path table per OS, hook script selection table, per-platform settings.json snippets (bash and PowerShell), Alpine install note, execution policy note; old inline hook JSON examples replaced with a pointer to the new section
- commands/self-improve.md — path resolution steps reordered: explicit init global/init local now short-circuit (steps 1–2) before auto-detection probes (steps 3–4), matching the changelog claim and preventing an existing .learnings/ repo from hijacking an explicit init global call
- references/agent-self-improve.md — copy command updated to copy both session-end.sh and session-start-reminder.sh; added mkdir -p ~/.claude/scripts to prevent hook wiring failure when the scripts directory doesn't exist yet
- examples/agent-self-improve/scripts/session-start-reminder.sh — added mkdir -p "$MEMORY_DIR" before touch "$SESSION_MARKER" to prevent banner-on-every-tool-call when ~/.claude/memory doesn't exist

Thanks to @geetika-sv for contributing the self-improve global workspace, cross-platform support, and init global/init local/status/migrate subcommands in this release.
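
The resume staleness check described in this release (warn when the working buffer is 3+ days old, require explicit confirmation at 7+ days) can be sketched in Python as follows. This is a minimal illustration and not the skill's actual implementation: the function names, the return labels, and the ISO 8601 format of the `Last updated:` value are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Thresholds taken from the changelog entry for resume mode.
WARN_AFTER = timedelta(days=3)      # buffer is 3+ days old: warn
CONFIRM_AFTER = timedelta(days=7)   # buffer is 7+ days old: block until confirmed


def parse_last_updated(buffer_text: str) -> datetime:
    """Extract the 'Last updated:' value from working-buffer text.

    Assumption: the value is an ISO 8601 timestamp; the real file may differ.
    """
    for line in buffer_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Last updated:"):
            return datetime.fromisoformat(stripped.split(":", 1)[1].strip())
    raise ValueError("no 'Last updated:' line found")


def classify_buffer_age(last_updated: datetime, now: datetime) -> str:
    """Return 'fresh', 'warn', or 'confirm' (hypothetical labels)."""
    age = now - last_updated
    if age >= CONFIRM_AFTER:
        return "confirm"
    if age >= WARN_AFTER:
        return "warn"
    return "fresh"
```

Checking the larger threshold first keeps the two bands from overlapping: a buffer that is 7 or more days old always requires confirmation and never falls through to the softer warning.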
- references/aws-cloudfront.md — comprehensive reference covering: distribution anatomy, OAC (Origin Access Control replaces legacy OAI), cache policies, origin request policies, security headers (aws_cloudfront_response_headers_policy with HSTS preload + CSP + X-Frame + XSS), WAF integration (CLOUDFRONT scope requires us-east-1 — documented as the most common footgun), Lambda@Edge (all 4 event types with memory/timeout/body-access table), CloudFront Functions vs Lambda@Edge decision matrix (~6× cost difference), standard logging v2 + real-time logging + Athena partitioning, multi-account patterns (shared CDN account with cross-account OAC, per-account distributions + FMS enforcement), Organizations SCP, and production readiness checklist
- references/aws-waf.md — comprehensive reference covering: scope section (CLOUDFRONT vs REGIONAL, prominent footgun callout), web ACL anatomy, full managed rule groups table (14 groups, free vs paid), rate-based rules with aggregation keys (IP, forwarded IP, HTTP header, query string, custom key combinations), custom rules (IP sets, geo match, regex, label matching), CAPTCHA/Challenge actions, WAF logging (CW log group must start with aws-waf-logs-), testing workflow (Count → Block), multi-account FMS (prerequisites, policy anatomy, FIRST/MIDDLE/LAST rule group ownership model, auto-remediation, compliance dashboard, cross-account central logging), and Shield Advanced integration
- commands/aws.md — slash command /platform-skills:aws with six modes: cloudfront (distribution design, OAC, caching, security headers), waf (scope, managed rules, rate limiting, logging), lambda-edge (event type selection, constraints, deployment), multi-account (FMS policy design, OU targeting, audit mode), review (production-readiness checklist for CloudFront + WAF Terraform), terraform (module structure, provider aliases, validation blocks, best practices)
- examples/aws/cloudfront/ — production-ready Terraform module:
  - versions.tf — provider constraints (aws >= 5.0.0, archive ~> 2.4), us_east_1 alias, required_version = ">= 1.7.0"
  - variables.tf — 15 variables with validation{} blocks for price_class, geo_restriction_type, and name (regex + length)
  - locals.tf — origin IDs, use_custom_domain, viewer_certificate, default_origin_id, use_alb derived values
  - main.tf — S3 bucket (versioning + SSE-KMS + public access block), OAC with cloudfront.amazonaws.com service principal + AWS:SourceArn condition, response headers policy, CloudFront Function, distribution with dynamic blocks for ALB origin + custom header enforcement, Lambda@Edge association via qualified_arn, geo restriction, standard logging v2
  - outputs.tf — distribution_id/arn/domain_name/hosted_zone_id, origin_bucket_name/arn, oac_id
  - lambda-edge/main.tf — Lambda@Edge: publish = true, provider = aws.us_east_1, dual trust policy (lambda.amazonaws.com + edgelambda.amazonaws.com), pre-created CloudWatch log group, qualified_arn output (with explicit note: do not use .arn which resolves to $LATEST)
  - lambda-edge/functions/viewer-request.js — auth check (missing Bearer → 401), method restriction on /api/, A/B routing by cookie (20% to /v2/)
  - functions/url-rewrite.js — CloudFront Functions JS 2.0: trailing slash → index.html, SPA deep-link rewrite, clean URL rewrite, path traversal blocking
- examples/aws/waf/ — production-ready Terraform module:
  - versions.tf — us_east_1 alias with comment explaining CLOUDFRONT scope requirement
  - variables.tf — 9 variables with validation: default_action (allow/block), rate_limit (0 or 100–2B), cloudwatch_log_retention_days (validates against CloudWatch valid values list)
  - main.tf — WebACL with: IP allowlist (priority 0, dynamic), IP reputation (5), AnonymousIpList (6), CRS with SizeRestrictions_BODY override to Count (10), KnownBadInputs (11), geo block (20, dynamic), rate limit with custom 429 + Retry-After header (30, dynamic), Bot Control with inspection level (40, dynamic), ATP with login path config (50, dynamic); logging config with field redaction (authorization, cookie headers) + filter to keep only BLOCK/COUNT actions
  - outputs.tf — web_acl_arn/id, web_acl_capacity (WCU consumption), log_group_name/arn
- examples/aws/firewall-manager/main.tf — self-contained FMS module: aws_fms_admin_account, aws_fms_policy for AWS::CloudFront::Distribution with OU include + excluded accounts, preProcessRuleGroups (IP reputation + CRS + KnownBadInputs at priorities 5/10/11), overrideCustomerWebACLAssociation = false to allow app-team local rules, remediation_enabled defaulting to false (audit mode) with explanation
- examples/aws/README.md — updated index with cloudfront, waf, firewall-manager examples; quick-start snippets for single-account (WAF → CloudFront) and multi-account (FMS) patterns; key patterns summary and checklist
- references/awesome-docs.md — comprehensive reference covering: SVG Pattern A (architecture flow — offset-path animated dots, glow animations, arrowhead markers, legend), SVG Pattern B (decision/lifecycle loop — 3-state CSS cycling, status bar), SVG Pattern C (field explainer carousel — timing formula slot = T/N, N × slot = T invariant, tooltip structure), SVG Pattern D (timeline phases — bar math y = bottom - (N/max)*height), GitHub SVG animation constraints table (allowed vs blocked), theme system (github-dark, docs-light, custom), multi-platform export rules (GitHub
SVG, Confluence/Notion HTML, PNG manual steps), quality checklist, and curated external references (MDN, GitHub Docs, Shields.io, SVGO, Primer CSS)
- commands/awesome-docs.md — slash command /platform-skills:awesome-docs with seven modes: generate (doc-type-aware guided interview with outline/preview step → full animated doc with incremental SVG confirmation for any doc type — readme, architecture-guide, runbook, tutorial, api-reference, how-it-works, rfc, post-mortem, or custom), convert (8-type doc classification; batch mode for directory input; inject SVGs into existing Markdown with conflict detection), update (revise a single diagram), diff (detect stale diagrams; --base <branch|tag|SHA> flag, defaults to HEAD), audit (quality punch list), preview (cross-platform local browser preview via python3 -m http.server), export (GitHub SVG default; Confluence/Notion animated HTML; PNG manual step)
- references/awesome-docs.md extended: SVG Pattern E (sequence diagram — actor lifelines, message arrows, CSS step animation, activation boxes), SVG Pattern F (state machine — state nodes, transition arrows, current-state cycling ring), field carousel multi-format syntax colors (YAML, HCL/Terraform, JSON, TOML)
- examples/awesome-docs/arch-flow.svg — reusable architecture flow SVG template (github-dark, 3 boxes, 2 animated dot paths)
- examples/awesome-docs/lifecycle-loop.svg — reusable lifecycle loop SVG template (3-state CSS, diamond gate, status bar)
- examples/awesome-docs/field-carousel.svg — reusable field carousel SVG template (4 fields, 8s cycle, N × slot = T timing)
- examples/awesome-docs/timeline-phases.svg — reusable timeline phases SVG template (3 phases, replica chart with correct bar math)
- examples/awesome-docs/sequence-diagram.svg — auth token request flow demo (4-step CSS-sequenced messages with activation boxes)
- examples/awesome-docs/state-machine.svg — deployment lifecycle demo (Pending → Deploying → Healthy, error/retry paths, current-state cycling
ring)
- examples/awesome-docs/docs-audit-workflow.yml — GitHub Actions workflow: SVG size check, caption check, broken-ref check, external-href check, stale-diagram diff report, PR comment
- references/datadog.md — two new sections: pup CLI (install, config, common operations, scripting post-deploy gates, troubleshooting table) and Datadog Labs Claude Skills (dd-pup, dd-apm, dd-logs, dd-monitors, dd-docs — install commands, capability table, when-to-use decision matrix)
- references/llm-observability.md — new reference covering: LLMObs vs traditional APM decision matrix, Python instrumentation (@llm, @workflow, @retrieval decorators, LLMObs.annotate()), Node.js instrumentation (llmobs.trace() callback, llmobs.annotate()), environment variables, eval bootstrap workflow, submit_evaluation() pattern, pup-based CI quality gate, trace RCA with dd-llmo-eval-trace-rca, experiment analysis with dd-llmo-experiment-analyzer, manual pup fallback commands, troubleshooting table, and security guidance
- commands/datadog.md — two new modes: pup (log search, metric query, monitor management, post-deploy gate generation) and llmo (instrument/evaluate/rca/experiment classification flow)
- examples/datadog/llm-observability/llmobs-python.py — complete Python LLMObs example with @llm, @workflow, @retrieval decorators and faithfulness evaluation
- examples/datadog/llm-observability/llmobs-nodejs.js — complete Node.js LLMObs example with llmobs.trace() and evaluation submission
- examples/datadog/llm-observability/evaluator-bootstrap.py — faithfulness and quality evaluator stubs with LLM-as-judge pattern
- references/chaos.md — comprehensive reference covering: decision matrix (Litmus Chaos v3 vs Chaos Mesh v2), fault taxonomy (pod/node/network/stress), steady-state hypothesis pattern (httpProbe and promProbe), blast radius scoping rules, ChaosEngine and NetworkChaos YAML, GitOps integration via ChaosSchedule, rollback semantics, DORA feedback loop, and troubleshooting for stuck experiments and RBAC
gaps
- commands/chaos.md — slash command /platform-skills:chaos with six modes: install, experiment, schedule, gameday, debug, report
- examples/chaos/ — eight working examples and one validator:
  - litmus-install-values.yaml — Helm values for Litmus Chaos v3
  - chaos-mesh-install-values.yaml — Helm values for Chaos Mesh v2
  - pod-delete-experiment.yaml — Litmus ChaosEngine targeting a Deployment with HTTP probe
  - network-loss-experiment.yaml — Chaos Mesh NetworkChaos with 20% packet loss
  - cpu-stress-experiment.yaml — Litmus pod-cpu-hog with Prometheus steady-state probe
  - chaos-schedule.yaml — weekly pod-delete ChaosSchedule (staging only)
  - gameday-runbook.md — structured GameDay template (steady-state → blast radius → inject → observe → verdict → DORA impact)
  - chaos-validate.sh — domain validator
- references/dora.md — comprehensive reference covering: the four DORA metrics with exact definitions, 2023 performance bands (Elite/High/Medium/Low), GitHub Actions + Prometheus Pushgateway open-source instrumentation pattern, all four Prometheus recording rules, PagerDuty and OpsGenie incident webhook integration for MTTR, SaaS decision matrix (Sleuth, LinearB, Cortex, open-source), five anti-pattern detections (commits vs deploys, all alerts vs customer incidents, partial outages, staging vs production, PR open vs first commit), chaos engineering cross-reference for high change failure rate, and agent platform rules
- commands/dora.md — slash command /platform-skills:dora with four modes: instrument, dashboard, benchmark, debug
- examples/dora/ — five working examples, one validator, and an AMP variant:
  - deployment-event-step.yaml — GitHub Actions step pushing deploy timestamp and lead time to Pushgateway
  - incident-webhook-handler.yaml — full GitHub Actions workflow triggered by PagerDuty/OpsGenie webhook for MTTR tracking
  - prometheus-recording-rules.yaml — all four DORA Prometheus recording rules
  - grafana-dashboard.json — Grafana dashboard with four DORA panels and Elite/High/Medium/Low threshold bands
  - dora-validate.sh — domain validator
  - amp-variant/ — Amazon Managed Prometheus variant: amp-workspace.tf (Terraform module provisioning AMP workspace + recording rules via terraform-aws-modules/managed-service-prometheus/aws), pushgateway-helm-values.yaml, prometheus-agent-values.yaml (SigV4/IRSA remote_write), amp-recording-rules-deploy.sh (AWS CLI fallback), grafana-amp-datasource.yaml, grafana-amg-datasource.json

Four improvements to the proactive agent pattern in Domain 24:
- memory/SESSION-STATE.md) for corrections, preferences, decisions, and proper nouns. Written before responding when any of these occur (not only before destructive ops). Second read in compaction recovery after working-buffer.
- memory/YYYY-MM-DD.md) of notable exchanges, discoveries, completed tasks, and errors. Survives compaction as a searchable history. Third read in compaction recovery.
- commands/self-improve.md resume mode: working-buffer → SESSION-STATE → daily notes → resource verification. "Never ask where were we?" rule added.
- state mode in commands/self-improve.md — explicit slash command for capturing session state entries
- memory/SESSION-STATE.md and memory/YYYY-MM-DD.md templates added to examples/agent-self-improve/memory/
- .gitignore recommendation updated: gitignore memory/ entirely (daily notes grow fast)
- A ## Closing — Log learnings section added to commands/triage.md, commands/supply-chain.md, and commands/runtime-security.md — each workflow now ends with an explicit step to log errors and learnings via /platform-skills:self-improve log immediately after completing work, preventing context loss across sessions
- .claude/settings.json (local, not committed) prompts incremental learning capture after every git commit
- CLAUDE.md and reference guides (references/supply-chain.md, references/runtime-security.md, references/kyverno.md) as permanent agent rules

New Domain 25: Supply Chain Security — secure the build pipeline and image lifecycle.
- references/supply-chain.md — comprehensive reference covering:
  - ImageValidatingPolicy for admission enforcement
- commands/supply-chain.md — slash command /platform-skills:supply-chain with six modes: audit, sign, sbom, scan, enforce, slsa
- examples/supply-chain/ — five working GitHub Actions workflows and one Kyverno policy:
  - sign-and-push.yaml — build → Cosign keyless sign → push
  - sbom-attest.yaml — Syft SBOM generation and Cosign attestation
  - trivy-gate.yaml — Trivy scan with CRITICAL+HIGH severity gate and GitHub Security tab upload
  - kyverno-verify-image.yaml — ImageValidatingPolicy blocking unsigned images (Audit mode)
  - slsa-provenance.yaml — SLSA Level 2 provenance via slsa-github-generator
  - supply-chain-validate.sh — domain validator

New Domain 26: Runtime Security — detect in-container threats with Falco.
- references/runtime-security.md — comprehensive reference covering:
  - modern_ebpf vs ebpf vs kmod (kmod never on managed K8s)
- commands/runtime-security.md — slash command /platform-skills:runtime-security with five modes: install, rules, alerts, debug, harden
- examples/runtime-security/ — four working Helm values and one Kyverno policy:
  - falco-values.yaml — Falco Helm values with eBPF driver and resource limits
  - falco-custom-rules.yaml — shell-in-container, privilege escalation, unexpected outbound
  - falcosidekick-values.yaml — Slack + webhook routing with deduplication
  - falco-kyverno-bridge.yaml — ValidatingPolicy blocking Falco-flagged workload re-admission
  - runtime-security-validate.sh — domain validator

New Domain 24: Agent Self-Improvement — bootstrap and operate self-improving, proactive agent workspaces.
- references/agent-self-improve.md — comprehensive reference guide covering:
  - .learnings/ directory layout and entry lifecycle (LRN/ERR/FEAT)
- commands/self-improve.md — slash command definition for /platform-skills:self-improve with five modes: init, log, resume, review, promote
- examples/agent-self-improve/ — ready-to-copy workspace scaffold:
  - README.md — overview and copy commands
  - .learnings/LEARNINGS.md — positive learnings template
  - .learnings/ERRORS.md — error log template
  - .learnings/FEATURE_REQUESTS.md — feature request template
  - memory/working-buffer.md — WAL scratchpad template
- SKILL.md (root and skills/platform-skills/) with Domain 24 entry, reference bullet, and /platform-skills:self-improve slash command
- README.md with Domain 26 badge and domain table row
- COMMANDS.md with ToC entry and full command reference section

New Domain 23: KEDA — design, generate, debug, and review event-driven autoscaling with KEDA v2.14.
- references/keda.md — comprehensive reference guide covering:
  - ScaledObject spec: minReplicaCount, maxReplicaCount, pollingInterval, cooldownPeriod, advanced HPA behavior overrides, scale-to-zero with cold-start guidance
  - ScaledJob spec: batch processing per message with scaling strategies (accurate/default/custom)
  - TriggerAuthentication and ClusterTriggerAuthentication: Kubernetes Secret reference, IRSA (AWS), Azure Workload Identity, GCP Workload Identity
- /platform-skills:keda (commands/keda.md) — four modes:
  - generate — write a production-ready ScaledObject or ScaledJob from a description
  - debug — diagnose scaling failures with a structured checklist
  - review — correctness, security, and operational safety review
  - scale — design the scaling strategy for a workload from requirements
- examples/keda/ — four working examples:
  - scaledobject-sqs.yaml — SQS queue trigger with IRSA, scale-to-zero, HPA stabilization
  - scaledobject-prometheus.yaml — HTTP request rate trigger with Cron floor for business hours
  - scaledobject-kafka.yaml — Kafka consumer lag trigger with SASL/TLS auth
  - scaledjob-sqs.yaml — batch ScaledJob (one Job per SQS message) with security hardening

New command /platform-skills:triage — triages any PR comment (from a bot, CI tool, or human reviewer) using the gh CLI directly inside Claude Code. No workflow or secrets required.
- commands/triage.md — full skill definition with two modes:
  - <PR number> <comment ID> — triage one specific comment
  - --all <PR number> — triage every unresolved thread on a PR in one pass
- ACTIONABLE_FIX, INFORMATIONAL, or NOT_APPLICABLE
- ACTIONABLE_FIX: reads the referenced file, applies the minimal fix with the Edit tool, commits and pushes to the PR branch
- gh api graphql resolveReviewThread mutation
- --all mode
- examples/triage/ — 11 fully documented scenarios:
  - actionable-fix/ — wildcard IAM (SOC 2 CC6.1), missing resource limits, deprecated networking.k8s.io/v1beta1 API, wrong liveness probe path, hardcoded secret reference
  - informational/ — replica count question, PDB follow-up suggestion, KMS rotation evidence request
  - not-applicable/ — CI status bot message, already-fixed-in-later-commit, comment on file not in PR diff

Enhanced /platform-skills:review with structured output for automated PR comment workflows.
- commands/review.md — adds --bot flag and GitHub-flavoured markdown output format
- MERGE_READY / NEEDS_FIX / BLOCKED result values based on finding severity
- <!-- platform-skills-review --> HTML marker for idempotent comment updates on re-push

New Domain 21: PR Review — comprehensive pre-merge risk review across six dimensions. Each mode inspects the diff and current file state, reports findings with severity, and recommends concrete fixes.
- references/pr-review.md — cost impact (compute, storage, network spend delta with AWS pricing reference), environment drift detection (Kustomize overlay gaps, Helm values sibling mismatches, Terraform workspace drift, GitOps source drift, feature flag drift), ownership and governance (CODEOWNERS gaps, missing team labels, Terraform module README requirements, PR governance checklist), SOC 2 compliance mapping (CC6.1–CC8.1, A1.2 with Terraform patterns and aws CLI evidence collection commands), deprecated API / version hygiene (Kubernetes API removal timeline, Terraform provider constraint patterns, GitHub Actions SHA pinning, container digest pinning), rollback feasibility scoring (reversibility matrix: FULL/PARTIAL/MANUAL/NONE, blast radius: LOCAL/CLUSTER/PLATFORM/DATA, pre-merge requirements for high-risk changes, GitOps rollback patterns), and bot comment triage workflow with GH CLI commands for resolving threads
- /platform-skills:pr-review (commands/pr-review.md) — six modes plus a combined full mode: cost (spend delta with severity per resource), drift (environment alignment across overlays, values files, Terraform workspaces), ownership (CODEOWNERS, team labels, module README, PR governance), compliance (SOC 2 control impact with remediation and auditor evidence commands), upgrade (deprecated Kubernetes APIs, loose Terraform constraints, floating action versions, :latest images), rollback (reversibility and blast radius score per change with Rollback Risk Score: 🟢/🟡/🔴), full (all six modes with Merge Readiness Summary)

New Domain 20: Kyverno — write and maintain Kubernetes-native admission policies using the CEL-based types (ValidatingPolicy, MutatingPolicy, GeneratingPolicy, ImageValidatingPolicy) introduced in Kyverno v1.14–v1.15. Covers Audit→Deny promotion, PolicyExceptions, PolicyReport analysis, and migration from legacy ClusterPolicy or PodSecurityPolicy. All policies use apiVersion: policies.kyverno.io/v1.
- references/kyverno.md — all five new policy kinds with namespace-scoped variants, matchConstraints/matchConditions CEL filtering, ValidatingPolicy with validations[].expression and messageExpression, MutatingPolicy with ApplyConfiguration (Object{...}) and JSONPatch ([JSONPatch{...}]) patch types, container iteration with .map()/.all(), mutateExisting, GeneratingPolicy with generator.Apply() and dyn() inline resources, ImageValidatingPolicy with verifyImageSignatures(), Cosign keyless/key-based and Notary attestors, Audit→Deny promotion workflow with PolicyReport queries, PolicyException (kyverno.io/v2), kyverno-cli install and test manifest structure, PolicyReport export to CSV, GitHub Actions integration, three full annotated example policies, and a troubleshooting table covering 11 failure modes
- /platform-skills:kyverno (commands/kyverno.md) — five modes: generate (write CEL-based policy from description, always Audit-first), test (write kyverno-test.yaml with pass/fail/skip fixtures), audit (rank PolicyReport violations and produce Deny promotion plan), debug (diagnose webhook registration, matchConstraints/matchConditions gaps, background scan, CEL errors, PolicyException suppression), migrate (legacy ClusterPolicy→new-type translation table, PSP field mapping, Gatekeeper ConstraintTemplate translation)
- examples/kyverno/ — two ValidatingPolicy examples (require-team-labels.yaml, disallow-privileged-containers.yaml), one GeneratingPolicy example (generate-default-networkpolicy.yaml), four test resource fixtures, and a kyverno-test.yaml that exercises pass and fail cases for each validate policy

New Domain 19: OPA / Conftest — generate Rego policies, write unit tests, run the full validation pipeline (fmt → regal lint → conftest verify → integration test), explain existing policies in plain English, and debug why a rule is not firing.
- references/opa.md — Rego v1 syntax, METADATA blocks, rule types (deny/warn/violation), package namespacing, input shape analysis with conftest parse, policy examples (Terraform IAM, Kubernetes pod security), unit test patterns with input fixtures, full validation pipeline (conftest fmt, regal lint, conftest verify, integration test), GitHub Actions workflow, shared data.* allow-lists, and troubleshooting table
- /platform-skills:opa (commands/opa.md) — five modes: generate (write Rego from description with correct structure), test (write _test.rego unit tests with fixtures), validate (run fmt/regal/verify/integration pipeline), explain (translate policy to plain English), debug (diagnose namespace mismatch, wrong rule name, input shape issues)
- examples/opa/conftest/ — working S3 encryption policy (deny_unencrypted_s3.rego), full unit test suite (deny_unencrypted_s3_test.rego), sample Terraform input, and .regal/config.yaml

New Domain 18: Conventional Commits — analyze git diffs or staged changes, generate commit messages that explain WHY a change was made, intelligently stage files for atomic commits, and validate messages against the Conventional Commits 1.0.0 specification.
- references/conventional-commits.md — message structure (subject/body/footer rules), type classification table with decision guidance, scope rules, breaking change patterns with examples, atomic commit strategy with git add -p, full examples for fix/feat/chore/breaking/revert, commitlint setup with scope allow-list, husky git hook configuration, GitHub Actions PR title lint workflow, semantic-release configuration with version bump rules, and a validation rules table
- /platform-skills:commit (commands/commit.md) — four modes: analyze (classify type/scope/breaking from diff), generate (produce full conventional commit message), stage (intelligently group and stage files for atomic commits), validate (check an existing message against the spec)
- examples/conventional-commits/commitlint/ — commitlint config with scope allow-list, husky commit-msg hook, and package.json for installation

New Domain 16: Datadog — deploy and configure the Datadog Agent on Kubernetes, instrument services with APM and log correlation, and manage monitors, dashboards, SLOs, and synthetic tests as Terraform-managed resources.
- references/datadog.md — Agent Helm setup, APM instrumentation (Node.js/Python), Unified Service Tagging, Log Management with processing rules, Monitors (Terraform), Dashboards (Terraform), SLOs, Synthetic tests, and troubleshooting table
- /platform-skills:datadog (commands/datadog.md) — six modes: setup, instrument, monitor, dashboard, slo, debug
- examples/datadog/terraform/monitors.tf — Terraform monitors for error rate and latency, plus SLO resource

New Domain 17: Dynatrace — deploy OneAgent via the Kubernetes Operator with automatic injection, configure anomaly detection, ingest custom metrics, define SLOs, and manage dashboards and alerting profiles with the Dynatrace Terraform provider.
- references/dynatrace.md — Operator and DynaKube CR setup, Node.js/Python/Java SDK instrumentation, Log Monitoring, custom metrics via MINT API, SLOs, Terraform provider (anomaly detection, SLOs, alerting), Davis AI problem feeds, troubleshooting table, and token scopes
- /platform-skills:dynatrace (commands/dynatrace.md) — six modes: setup, instrument, monitor, slo, dashboard, debug
- examples/dynatrace/operator/dynakube.yaml — production DynaKube CR with cloudNativeFullStack injection and ActiveGate

New Domain 13: MCP (Model Context Protocol) — build, review, and debug MCP servers and clients covering TypeScript and Python SDKs, Zod/Pydantic schema validation, stdio/HTTP/SSE transports, protocol compliance, authentication, and rate limiting.
- references/mcp.md — protocol fundamentals, TypeScript SDK patterns, Python FastMCP patterns, schema design, error handling, testing with MCP Inspector, security, and deployment checklist
- /platform-skills:mcp (commands/mcp.md) — three modes: create (scaffold production-ready server), review (protocol compliance and security audit), debug (diagnose transport/protocol/schema/handler failures)
- examples/mcp/docs-server/ — complete TypeScript MCP server with search_docs and get_doc tools and a docs://index resource

New Domain 14: Observability — instrument services with structured logging, Prometheus metrics, and OpenTelemetry tracing; build Grafana dashboards, write alerting rules, run k6 load tests, and plan capacity.
- references/observability.md — Pino/structlog structured logging, prom-client/prometheus-client metrics, OpenTelemetry tracing, RED/USE method alerting rules, Grafana dashboard templates, k6 load testing, and capacity planning formulas
- /platform-skills:observability (commands/observability.md) — five modes: instrument, dashboard, alert, loadtest, capacity
- examples/observability/prometheus-alerts/ — RED method Prometheus alerting rules for a production service

New Domain 15: Documentation — generate and validate inline docstrings (Google/NumPy/Sphinx/JSDoc), OpenAPI 3.1 specifications, documentation sites (MkDocs, TypeDoc), and developer getting started guides.
- references/documentation.md — Google/NumPy/Sphinx docstring styles, TypeScript JSDoc, OpenAPI 3.1 spec with shared components, FastAPI and NestJS auto-documentation, coverage measurement, MkDocs site setup, and guide structure
- /platform-skills:document (commands/document.md) — four modes: docstrings, openapi, site, guide
- examples/documentation/openapi-spec/ — complete OpenAPI 3.1 Orders API with shared components, security schemes, and response definitions

New Domain 12: Helm (Helmcheck) — production-grade Helm chart development covering scaffolding, values design, template patterns, dependency management, security hardening, and the full lint/validation pipeline.
Reference guide:
- references/helm.md — complete template patterns, standard _helpers.tpl, security-hardened Deployment baseline, conditional resource templates (Ingress, HPA, PDB, NetworkPolicy), values.yaml design principles, values.schema.json type safety, dependency management, lint/validation pipeline, and GitOps integration (Flux HelmRelease, Argo CD Application)

Slash command:
- /platform-skills:helmcheck (commands/helmcheck.md) — three modes: create (scaffold a production-ready chart), review (analyse chart structure and quality), security (audit RBAC, pod security, network policies, and secrets handling)

Example chart:
- examples/helm/web-service/ — full production chart: security-hardened Deployment, Service, Ingress, ServiceAccount, HPA, PDB, NetworkPolicy, values.schema.json, and helm test pod compliant with restricted PodSecurity

Platform rules added:
- helm lint --strict → helm template --debug → kubeconform -strict -summary → checkov → helm test in order; fail CI on any helm lint --strict warning

Tests:
- tests/validate-helmcheck.sh — 62 checks covering command structure, SKILL.md integration, reference guide sections, chart file presence, Chart.yaml correctness, schema validity, security baselines, template safety, and HOW_IT_WORKS completeness
- HOW_IT_WORKS.md — explains how AI coding agents work, how skills activate, how to write effective prompts, how the review workflow runs, and what the agent cannot do

New Domain 11: Compliance — SOC 2 Trust Services Criteria mapped to Terraform patterns, covering 11 criteria across access, encryption, detection, logging, change management, and availability.
Reference guide:

- references/compliance.md - full SOC 2 TSC coverage with Terraform patterns, Checkov rule IDs, AWS CLI evidence commands, and a pre-audit readiness checklist

Criteria covered with Terraform patterns, Checkov rules, and evidence commands:

- CC6.1 - IAM least privilege: scoped policies, IRSA, SCPs denying privilege escalation
- CC6.2 - Authentication: MFA-enforcement SCP, GitHub Actions OIDC (no static credentials)
- CC6.3 - Access removal: access key rotation via Config access-keys-rotated rule
- CC6.6 - Network security: security groups, private subnets, VPC flow logs, WAF (aws_wafv2_web_acl) with rate limiting and AWS managed rule groups
- CC6.7 - Encryption: S3, RDS, EBS, KMS; extended to DynamoDB, ECR, ElastiCache, OpenSearch, Kinesis, EFS, and Redshift
- CC6.8 - Vulnerability management: aws_ecr_registry_scanning_configuration (ENHANCED + CONTINUOUS_SCAN), aws_inspector2_enabler, aws_ssm_patch_baseline
- CC7.1 - Detection: aws_guardduty_detector (S3, EKS, malware sources), 14 CIS Benchmark CloudWatch metric filters and alarms (CIS 3.1–3.14), aws_securityhub_account with AWS Foundational and CIS standards
- CC7.2 - Audit logging: multi-region CloudTrail with KMS encryption, S3 object lock (COMPLIANCE mode, 365-day), AWS Config recorder with 12 SOC 2-relevant managed rules, VPC flow logs
- CC7.3 - Incident response: KMS-encrypted SNS topic with email and PagerDuty subscriptions, GuardDuty HIGH/CRITICAL findings → EventBridge → SNS pipeline, Config delivery channel with SNS notification
- CC8.1 - Change management: S3 + DynamoDB state locking, PR-gated Terraform plan/apply workflow with plan posted as PR comment
- A1.1 - Availability: multi-AZ EKS node groups, RDS multi-AZ
- A1.2 / A1.3 - Backup and recovery: aws_backup_plan (daily 35-day + monthly 1-year), aws_backup_vault_lock_configuration (COMPLIANCE mode, 35-day minimum), cross-region copy for DR, tag-based backup selection

Slash command:

- /platform-skills:compliance (commands/compliance.md) - five modes: gap (find audit blockers), control (implement a specific criterion), evidence (generate audit-ready CLI commands), remediate (fix a Checkov finding with blast-radius analysis), checklist (full SOC 2 readiness review)

Examples:

- examples/compliance/checkov-config.yaml - Checkov config with all SOC 2 check IDs grouped by criterion and documented suppression format
- examples/compliance/iam/ - IRSA application role, GitHub Actions OIDC trust (plan + apply), SCPs for MFA enforcement and privilege escalation prevention
- examples/compliance/logging/ - CloudTrail with object lock, AWS Config recorder with 12 managed rules, VPC flow logs
- examples/compliance/network/ - WAF, security groups, and VPC flow logs
- examples/compliance/encryption-data-services/ - DynamoDB, ECR, ElastiCache, OpenSearch, Kinesis, EFS, and Redshift encryption/logging controls
- examples/compliance/vulnerability/ - Inspector v2, ECR enhanced scanning, and SSM patching
- examples/compliance/detection/ - GuardDuty, CIS CloudWatch alarms, and Security Hub
- examples/compliance/incident-response/ - KMS-encrypted SNS, EventBridge alert routing, and PagerDuty subscription
- examples/compliance/backup/ - AWS Backup plan, vault lock, and cross-region DR copies

Infrastructure updates:
- Compliance) added to skills/platform-skills/SKILL.md tool-routing and reference file list
- references/compliance.md added to REQUIRED_REFERENCES in tests/validate-skill.sh
- examples/compliance added to EXAMPLE_DOMAINS in tests/validate-skill.sh
- compliance, soc2, checkov, audit-logging, iam-least-privilege, cloudtrail added to marketplace.json keywords

- references/platform-mindset.md - product mindset, developer experience (SPACE/DORA metrics, friction audits, golden paths, Backstage), collaboration and communication (RFC/ADR templates, incident updates, blameless post-mortems), and proactive problem-solving (systemic fixes, capacity planning, cost optimisation loop, quarterly platform health review)
- /platform-skills:product command (commands/product.md) - applies product thinking to platform work across nine topics: devex, friction, rfc, adr, incident, postmortem, capacity, cost, review
- references/linux-networking.md - Linux administration (process/service management, disk, memory/CPU, kernel tuning, permissions) and networking fundamentals (DNS record types, CoreDNS, Route 53, Azure Private DNS, ALB/NLB, Ingress/Gateway API, VPC/VNet CIDR design, subnet tiers, peering vs Transit Gateway, PrivateLink, NSGs, troubleshooting checklists)
- /platform-skills:linux command (commands/linux.md) - structured guidance for dns, lb, vpc, process, disk, network, security-groups, and troubleshoot topics
- Linux & Networking
- Platform Mindset

- SKILL.md from the repository root to skills/platform-skills/SKILL.md to match the Claude Code proper skill directory format
- .claude-plugin/plugin.json to declare "skills": ["./skills/platform-skills"] (directory reference) instead of the old root-level path
- tests/validate-skill.sh to reference skills/platform-skills/SKILL.md in all three validation checks

- .claude-plugin/plugin.json - registration manifest explicitly declaring plugin name, version, author, skill (SKILL.md), and all 5 command paths so Claude Code uses consistent /platform-skills:<name> namespacing regardless of install directory name

Five explicit slash commands added under commands/, invoked as /platform-skills:<name>:
- /platform-skills:debug - structured troubleshooting: classifies the problem layer, lists evidence commands, forms a root-cause hypothesis, proposes a fix with validation and rollback
- /platform-skills:review - production-readiness review of any manifest, Terraform file, workflow, or Helm values: correctness, security, operational safety, deprecations
- /platform-skills:terraform - full Terraform validation pipeline walkthrough (fmt, validate, tflint, tfsec/checkov), blast radius, IAM risk, state impact, module design
- /platform-skills:gitops - Flux CD and Argo CD troubleshooting: classifies by source/artifact/reconciliation/runtime layer, provides exact evidence commands, fix, and rollback
- /platform-skills:linkerd - Linkerd diagnostics: injection, mTLS, authorization policy, observability, traffic management, multi-cluster, control plane

- references/linkerd.md: mTLS, proxy injection, authorization policy, traffic management (HTTPRoute, retries, timeouts, canary splits), golden-signal observability (linkerd viz stat/tap), Prometheus PodMonitor integration, multi-cluster mirroring, and 5-scenario troubleshooting guide
- references/linkerd.md added to tests/validate-skill.sh required references
- claude plugin install . → claude plugin marketplace add $(pwd) + claude plugin install platform-skills); added upgrade flow; updated What We're Looking For to reflect current roadmap priorities

- references/kubernetes.md: 401 vs 403 diagnosis, kubectl auth can-i evidence collection, binding scope matrix (Role/ClusterRole × RoleBinding/ClusterRoleBinding), ClusterRoleBinding audit query
- references/fluxcd.md: ImageRepository, ImagePolicy, ImageUpdateAutomation setup, registry auth table (GHCR/ECR/ACR/GAR/Docker Hub), troubleshooting table, safety rules for staging vs. production promotion
- references/secrets.md: decision matrix (ESO vs. Sealed Secrets), ESO setup for AWS/Azure/Vault/static credentials, Sealed Secrets seal/rotate/backup workflow, troubleshooting tables for both patterns, least-privilege IAM example, operational rules
- references/secrets.md to REQUIRED_REFERENCES in tests/validate-skill.sh
- helm, docker, containers, deployment, rbac, secrets, security, gke

- assignees format in renovate.json (removed @ prefix)
- kubeconform flag in validate.yml: -skip-kinds → -skip (correct flag name for v0.6.4)
- LICENSE copyright placeholder [yyyy] [name of copyright owner] → 2026 Nitin Jain

- default_tags provider block, ASG propagate_at_launch, EBS/Lambda propagation gaps, AWS Config required-tags rule, cost allocation tag activation steps, org-level tag policy enforcement
- merge(local.common_tags, {...}) pattern, tag inheritance gap explanation, Azure Policy deny/modify enforcement, remediation task for existing resources, AKS managed resource group tagging
- examples/kubernetes/*.yaml (4 files), examples/openshift/*.yaml (2 files), examples/aws/iam/*.json (2 files), examples/azure/workload-identity/ (main.tf + serviceaccount.yaml)
- tests/validate-skill.sh - checks SKILL.md frontmatter, all reference files exist, each example domain has at least one asset beyond README.md, SKILL.md references every reference file; wired into validate.yml as a blocking CI job
- .github/copilot-instructions.md - GitHub Copilot automatically applies Platform Skills patterns (no Claude Code required)
- VSCODE_INTEGRATION.md - comprehensive guide for VSCode with Claude Code extension, GitHub Copilot split-screen, and browser workflows
- QUICKSTART.md - 5-minute install and first-use guide
- INSTALLATION.md - full installation methods, team setup, troubleshooting

- validate.yml and release.yml marketplace.json validation - field paths now match marketplace format (plugins[0].version, plugins[0].description, etc.)
- actions/create-release (archived action) with gh release create CLI in release.yml
- hashicorp/setup-terraform in validate.yml - floating @v4.0.0 tag was causing the workflow's own security check to fail
- actions/setup-node step from publish-marketplace job
- # placeholder link in README - replaced with real URL
- QUICKSTART.md, INSTALLATION.md, and VSCODE_INTEGRATION.md to README navigation and repository structure table
- v1.0.0 examples in CHANGELOG release checklist - replaced with vX.Y.Z placeholder
- path: fields - were pointing at Flux monorepo paths instead of Argo CD-appropriate paths
- main.tf - added required_providers block with minimum azurerm >= 3.87.0
- README.md-only entries
- platform-skills (was platform-skills-marketplace)
- claude, subcommand is claude plugin (not claude-code skill)

Initial release of Platform Skills - A comprehensive Claude Agent Skill for platform engineering across 8 domains: Kubernetes, OpenShift, Argo CD, Flux CD, AWS, Azure, Terraform, and GitHub Actions.
.github/workflows/release.yml)
.github/workflows/validate.yml)
renovate.json)
Major Release:
Minor Release:
Patch Release:
Before releasing:
- .claude-plugin/marketplace.json
- git tag -a vX.Y.Z -m "Release vX.Y.Z"
- git push origin vX.Y.Z

| Tool | Version | Last Verified |
|---|---|---|
| Flux CD | 2.2+ | 2026-04-02 |
| Terraform | 1.5+ | 2026-04-02 |
| AWS CLI | 2.x | 2026-04-02 |
| Azure CLI | 2.50+ | 2026-04-02 |
| kubectl | 1.28+ | 2026-04-02 |
| Helm | 3.12+ | 2026-04-02 |
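
One way to use the minimum versions in the table above is to compare them against what is installed locally. The sketch below is illustrative only: version_ge and check_tool are hypothetical helper names that are not part of Platform Skills, and it assumes a sort that supports -V (version sort, as in GNU coreutils). Extracting the version string from each CLI is left out because output formats differ between tools.

```shell
#!/usr/bin/env bash
# Sketch: compare a detected tool version against a minimum from the table.
# Both helpers are hypothetical and assume `sort -V` is available.

# Returns 0 when version $1 is greater than or equal to minimum $2.
# The smaller of the two sorts first, so the minimum must come out on top.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Prints one status line for a tool; returns non-zero when below the minimum.
check_tool() {
  local name="$1" found="$2" minimum="$3"
  if version_ge "$found" "$minimum"; then
    echo "ok   $name $found (>= $minimum)"
  else
    echo "FAIL $name $found (< $minimum)"
    return 1
  fi
}
```

For example, `check_tool terraform 1.5.7 1.5` reports ok, while `check_tool helm 3.9.0 3.12` reports FAIL and returns a non-zero status. Version sort is used rather than plain string comparison so that 3.9.0 is correctly treated as older than 3.12.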
None currently.
N/A - Initial release
Thank you to all contributors who helped build Platform Skills:
See CONTRIBUTING.md to join this list!