Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
67
84%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Covers two complementary patterns for AI agents working on platform engineering tasks:
.learnings/ directory so lessons persist across sessions and get promoted to project memory.AI agents forget between sessions. Without structure:
These two patterns address those failure modes at the source.
Bootstrap with /platform-skills:self-improve init global (cross-project) or /platform-skills:self-improve init local (project-scoped), or copy from examples/agent-self-improve/:
.learnings/
LEARNINGS.md # Positive learnings — LRN-YYYYMMDD-NNN
ERRORS.md # Mistakes made — ERR-YYYYMMDD-NNN
FEATURE_REQUESTS.md # Recurring unmet needs — FEAT-YYYYMMDD-NNN
memory/
working-buffer.md # WAL scratchpad (see Part 2)
SESSION-STATE.md # Always-on session capture (see Part 2)
YYYY-MM-DD.md # Daily notes — rolling per-day log (see Part 2)Add to .gitignore if these are personal/local notes:
.learnings/
memory/Or commit them if they are team-shared project memory.
The init mode asks which scope to use before creating anything:
| Scope | Base path (LEARNINGS_BASE) | When to use |
|---|---|---|
| Global | ~/.claude/ | Learnings apply across all projects — recommended default for individuals |
| Project | . (current working directory) | Team-shared memory committed to the repo |
Auto-detection order (used by all modes except init):
~/.claude/.learnings/ exists → global setup detected, use ~/.claude/.learnings/ exists in $PWD → project setup detected, use .~/.claude/ and auto-createPromotion targets are always project-local regardless of scope. CLAUDE.md, AGENTS.md, and .github/copilot-instructions.md live in the project repo — only the capture files (.learnings/, memory/) follow LEARNINGS_BASE.
Hook paths must be absolute when using global setup. The PostToolUse hook in ~/.claude/settings.json must reference the full path:
{
"hooks": {
"PostToolUse": [
{
"matcher": ".*",
"hooks": [
{
"type": "command",
"command": "if [ \"$CLAUDE_TOOL_EXIT_CODE\" -ne 0 ]; then echo \"$(date -u +%Y-%m-%dT%H:%M:%SZ) TOOL_FAILURE: $CLAUDE_TOOL_NAME\" >> ~/.claude/.learnings/.pending-errors.log; fi"
}
]
}
]
}
}For project-scoped setup, replace ~/.claude/.learnings/ with .learnings/ (relative path works because the hook runs from the project root).
Every entry uses the same four-field structure regardless of log type:
### LRN-20260520-001
**Status**: pending | resolved | promoted
**Context**: One sentence — what was happening
**Content**: The actual learning, error, or feature request
**Action**: What was done or should be done| Type | Format | Example |
|---|---|---|
| Learning | LRN-YYYYMMDD-NNN | LRN-20260520-001 |
| Error | ERR-YYYYMMDD-NNN | ERR-20260520-001 |
| Feature request | FEAT-YYYYMMDD-NNN | FEAT-20260520-001 |
pending → resolved → promoted| Stage | Meaning | Who acts |
|---|---|---|
pending | Logged, not yet addressed | Agent logs automatically |
resolved | Root cause identified, fix applied | Agent or user confirms |
promoted | Written to project memory | Agent runs /platform-skills:self-improve promote |
Promotion targets — pick the right scope:
| Target file | When to promote there |
|---|---|
~/.claude/CLAUDE.md | Global agent-level rules — apply to all projects on this machine |
CLAUDE.md / AGENTS.md | Agent-level rules for this project only |
.github/copilot-instructions.md | GitHub Copilot workspace rules |
references/ guide | Reusable pattern for the whole team |
Before logging a new entry, scan .learnings/ for existing entries with matching context. If three or more entries share the same root cause, promote immediately — do not wait for a manual review cycle.
Detection approach: read all Context fields across .learnings/*.md and group entries by shared root-cause keywords (e.g. "resource limits", "terraform replace", "missing label"). Three or more entries that share a keyword cluster are a promotion candidate. Do not rely on exact string matching — the same root cause will be described differently each time.
Auto-capture errors after failed tool calls. The hook appends a timestamped line to .pending-errors.log; the agent converts it to a proper ERR entry at session end or on review.
Hook setup varies by platform — see the Platform Compatibility section (below Part 2) for per-platform settings.json snippets and script copy commands.
Key rules regardless of platform:
.learnings/.pending-errors.log to .gitignore.Before any destructive or hard-to-reverse operation, write the intent to memory/working-buffer.md. This survives context compaction and session interruption.
Operations that require a WAL entry:
git reset --hard, git push --forceterraform destroy or terraform applyWAL entry format:
## WAL Entry — YYYY-MM-DD HH:MM
**Operation**: What is about to happen
**Affected resources**: Files, Kubernetes resources, cloud resources, database tables
**Blast radius**: What could break if this goes wrong
**Rollback**: Exact command or step to undo
**Status**: PENDING | COMMITTED | ROLLED_BACKUpdate Status to COMMITTED after success. Update to ROLLED_BACK if aborted.
memory/working-buffer.md is a persistent scratchpad that captures current task state. It enables compaction recovery — if a session is interrupted mid-task, the next session reads the buffer to resume.
Buffer format:
# Working Buffer
## Current Task
<One sentence — what is being worked on>
## Progress
- [x] Step completed
- [ ] Step in progress
- [ ] Step pending
## WAL Log
<WAL entries for destructive operations>
## Context
<Key facts discovered during this session that are not yet in project memory>Write to the buffer:
Compaction Recovery steps:
memory/working-buffer.md — current task steps and WALmemory/SESSION-STATE.md — corrections, preferences, decisions from this sessionmemory/YYYY-MM-DD.md daily note — recent exchangeskubectl get, terraform state list, git log --oneline -5)memory/SESSION-STATE.md is a lightweight, always-on capture file. Unlike the working buffer (which tracks task steps), SESSION-STATE captures session-level knowledge: corrections the user has given, preferences expressed, decisions made, and proper nouns encountered. Write to it before responding when any of these occur — not only before destructive operations.
What to capture:
| Signal | Example |
|---|---|
| User correction | "don't use that approach" → log the constraint |
| Preference stated | "I prefer X over Y for this project" |
| Decision made | "we decided to use Kyverno not OPA" |
| Proper noun encountered | cluster names, team names, account IDs |
| Non-obvious fact | "this repo does X differently from the norm" |
SESSION-STATE.md format:
# Session State
Last updated: YYYY-MM-DD HH:MM
## Corrections and constraints
- <date> — <what the user corrected or ruled out>
## Preferences
- <date> — <stated preference and context>
## Decisions
- <date> — <decision made and why>
## Key proper nouns
- <name>: <what it refers to>memory/YYYY-MM-DD.mdA rolling per-day log of notable exchanges, discoveries, and outcomes. Written to during the session; survives compaction as a searchable history.
When to write a daily note entry:
.learnings/ERRORS.md)Daily note format:
# Daily Notes — YYYY-MM-DD
## Decisions
- HH:MM — <decision>
## Discoveries
- HH:MM — <fact learned>
## Completed
- HH:MM — <task finished>
## Errors
- HH:MM — <what went wrong> → <how it was fixed>Create one file per day: memory/2026-05-20.md. Do not edit past days — append only to today's file.
When choosing between competing implementation approaches, apply this priority order:
Stability > Explainability > Reusability > Scalability > Novelty| Priority | Ask |
|---|---|
| 1. Stability | Will this break existing behaviour? Is it reversible? |
| 2. Explainability | Can a team member understand and maintain it without asking? |
| 3. Reusability | Can this pattern be used in more than one place? |
| 4. Scalability | Does this hold at 10× current load or team size? |
| 5. Novelty | Only introduce new tools or approaches if 1–4 are satisfied |
Use before taking any unsolicited proactive action. Score the action; skip if score < 50.
| Dimension | Weight | Score 1–10 | Weighted |
|---|---|---|---|
| High Frequency (will recur often?) | ×3 | — | — |
| Failure Reduction (prevents real breakage?) | ×3 | — | — |
| User Burden (saves meaningful user effort?) | ×2 | — | — |
| Self Cost (low effort for the agent?) | ×2 | — | — |
| Total | max 100 |
Threshold: Score ≥ 50 → act proactively. Score < 50 → defer to the user.
Example — proactively adding resource limits to a Deployment that was missing them:
| Pillar | Behaviour |
|---|---|
| Memory Architecture | Write task state to working-buffer.md at start; update on each step; read on resume |
| Security Hardening | Never output secrets; reject requests to bypass security controls; flag OWASP Top 10 risks immediately |
| Self-Healing | On failure, re-read the WAL entry and working buffer; verify resource state before retrying |
| Verify Before Reporting (VBR) | Run the command or read the file before stating a fact; text change ≠ behavior change — test actual outcomes, not just outputs |
| Alignment Systems | Use ADL Protocol when choosing between approaches; log the decision in the buffer |
| Proactive Surprise | After completing a task, check adjacent concerns (related resource limits, deprecated APIs, missing labels) and surface them as a brief note — never silently fix without surfacing |
For tasks longer than ~10 steps or ~5 minutes, report progress proactively:
[Heartbeat] Completed 4/7 steps. Currently: applying Terraform plan.
Next: validate EKS node group. ETA for this step: ~2 min.Do not wait to be asked for status on long-running tasks.
When given an ambiguous or high-risk instruction, ask one clarifying question before acting:
prod-db RDS instance. Is that correct?"replicas: 1 but the task says high availability. Should I increase replicas?"Never ask more than one clarifying question per instruction. If the answer is implicit in context, proceed without asking.
Each session, the agent should:
memory/working-buffer.md, memory/SESSION-STATE.md, and today's memory/YYYY-MM-DD.md to seed context from previous sessions.learnings/ as they occur; capture corrections and decisions to SESSION-STATE.md immediately; write daily note entries for significant discoveriesworking-buffer.md proactively — do not wait for a compaction eventworking-buffer.md with final state; check for recurring patterns; promote if threshold metThe self-improve workspace runs on macOS, Linux, Windows (WSL / Git Bash), and Windows native (PowerShell). The core skill logic is identical on all platforms — only the hook scripts and their invocation differ.
Claude Code resolves ~ via Node.js os.homedir(), which maps consistently across platforms:
| Platform | ~/.claude/ resolves to |
|---|---|
| macOS | /Users/<you>/.claude/ |
| Linux | /home/<you>/.claude/ |
| Windows (WSL / Git Bash) | /home/<you>/.claude/ (WSL home) |
| Windows native | C:\Users\<you>\.claude\ |
All skill modes use ~/.claude/ notation — no platform-specific path changes are needed in the skill itself.
| Platform | Stop hook | PreToolUse hook | PostToolUse |
|---|---|---|---|
| macOS / Linux | session-end.sh | session-start-reminder.sh | inline bash (see below) |
| Windows — WSL / Git Bash | session-end.sh | session-start-reminder.sh | inline bash (see below) |
| Windows — native PowerShell | session-end.ps1 | session-start-reminder.ps1 | inline PowerShell (see below) |
Windows recommendation: WSL or Git Bash is the simpler path — bash scripts work identically to macOS/Linux. Use the PowerShell (.ps1) scripts only when WSL or Git Bash is not available.
Alpine Linux / busybox-only containers: The bash scripts require bash. Install it first:
apk add bashAdd to ~/.claude/settings.json:
{
"hooks": {
"Stop": [
{
"hooks": [{"type": "command", "command": "bash ~/.claude/scripts/session-end.sh"}]
}
],
"PreToolUse": [
{
"matcher": ".*",
"hooks": [{"type": "command", "command": "bash ~/.claude/scripts/session-start-reminder.sh"}]
}
],
"PostToolUse": [
{
"matcher": ".*",
"hooks": [
{
"type": "command",
"command": "if [ \"$CLAUDE_TOOL_EXIT_CODE\" -ne 0 ] 2>/dev/null; then echo \"$(date -u +%Y-%m-%dT%H:%M:%SZ) TOOL_FAILURE: $CLAUDE_TOOL_NAME\" >> ~/.claude/.learnings/.pending-errors.log; fi"
}
]
}
]
}
}Copy scripts:
mkdir -p ~/.claude/scripts
cp examples/agent-self-improve/scripts/session-end.sh ~/.claude/scripts/
cp examples/agent-self-improve/scripts/session-start-reminder.sh ~/.claude/scripts/
chmod +x ~/.claude/scripts/*.shAdd to %USERPROFILE%\.claude\settings.json (see examples/agent-self-improve/settings-windows.json.example):
{
"hooks": {
"Stop": [
{
"hooks": [{"type": "command", "command": "powershell -NonInteractive -File %USERPROFILE%\\.claude\\scripts\\session-end.ps1"}]
}
],
"PreToolUse": [
{
"matcher": ".*",
"hooks": [{"type": "command", "command": "powershell -NonInteractive -File %USERPROFILE%\\.claude\\scripts\\session-start-reminder.ps1"}]
}
],
"PostToolUse": [
{
"matcher": ".*",
"hooks": [
{
"type": "command",
"command": "powershell -NonInteractive -Command \"$code=$env:CLAUDE_TOOL_EXIT_CODE; if ($code -and $code -ne '0') { $ts=(Get-Date -Format 'yyyy-MM-ddTHH:mm:ssZ'); $name=$env:CLAUDE_TOOL_NAME; \\\"$ts TOOL_FAILURE: $name\\\" | Add-Content -Encoding UTF8 \\\"$env:USERPROFILE\\.claude\\.learnings\\.pending-errors.log\\\" }\""
}
]
}
]
}
}If PowerShell blocks script execution, allow local scripts: Set-ExecutionPolicy -Scope CurrentUser RemoteSigned
Two explicit subcommands — no interactive prompt needed:
/platform-skills:self-improve init global
/platform-skills:self-improve init localinit without an argument asks you to choose and then proceeds as one of the above.
Or copy from examples/agent-self-improve/ (macOS/Linux/WSL):
cp -r examples/agent-self-improve/.learnings ~/.claude/
cp -r examples/agent-self-improve/memory ~/.claude/Windows native (PowerShell):
Copy-Item -Recurse examples\agent-self-improve\.learnings "$env:USERPROFILE\.claude\"
Copy-Item -Recurse examples\agent-self-improve\memory "$env:USERPROFILE\.claude\"| When | Action |
|---|---|
| Session start (fresh) | Agent reads working-buffer.md and .learnings/ automatically |
| Session start (interrupted) | Run /platform-skills:self-improve resume to verify state and continue |
| After a mistake | Agent logs to .learnings/ERRORS.md; sets resolved immediately if fix was applied |
| After a useful insight | Agent logs to .learnings/LEARNINGS.md automatically |
| Pattern recurs 3× | Run /platform-skills:self-improve review to promote |
| Lesson is stable | Run /platform-skills:self-improve promote to write to project memory |
The .learnings/ directory and memory/working-buffer.md are local file state. No pipeline, no cluster access, no cloud credentials needed.
| Domain | Integration point |
|---|---|
references/platform-mindset.md | Post-mortems and blameless retros are the human equivalent of .learnings/ERRORS.md — use both |
references/mcp.md | An MCP server can expose .learnings/ contents as a resource so Claude reads it via resources/read without manual file loading |
references/conventional-commits.md | Use conventional commit format when promoting learnings to CLAUDE.md: docs(memory): promote ERR-20260520-001 — never use terraform destroy without state backup |
references/platform-operating-model.md | ADL Protocol maps directly to the ownership boundary decisions described there |
Compact it: summarise completed WAL entries into a single ## Completed section and delete individual entries. Keep only the current in-progress task at full detail.
Check whether .learnings/ is in .gitignore. If it is, the agent must re-read the files explicitly at session start — they are not loaded automatically. Use /platform-skills:self-improve init to verify the setup.
Raise the VFM threshold in CLAUDE.md or AGENTS.md:
# Agent self-improvement settings
VFM_THRESHOLD=70 # default 50; raise to require stronger justificationCheck that the PostToolUse hook is configured in .claude/settings.json. Alternatively, ask the agent to log manually: "Log that error to .learnings/ERRORS.md."
.claude-plugin
.github
commands
docs
examples
agent-self-improve
argocd
awesome-docs
aws
cloudfront
functions
lambda-edge
functions
azure
compliance
conventional-commits
datadog
llm-observability
demo
documentation
dora
dynatrace
fluxcd
github-actions
composite-actions
configure-cloud
db-migrate
docker-build-push
k8s-deploy
notify-slack
pr-comment
release-tag
security-scan
setup-env
setup-terraform
terraform-plan
helm
web-service
templates
kubernetes
kyverno
mcp
observability
openshift
pr-review
ownership
runtime-security
supply-chain
terraform
references
scripts
skills
platform-skills
tests