Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
67
84%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Bootstrap and operate a self-improving, proactive agent workspace.
Before executing any mode, resolve LEARNINGS_BASE:
init global → use ~/.claude/init local → use . (current working directory)~/.claude/.learnings/ exists → use ~/.claude/ as base (global setup).learnings/ exists in the current working directory → use . as base (project setup)init (no argument) → ask the user to choose (see init mode below)~/.claude/, create the directories, and inform the user that global setup was auto-created~/.claude/ resolves consistently across all platforms (macOS, Linux, Windows) because Claude Code uses os.homedir() for ~. On Windows this maps to C:\Users\<you>\.claude\ — no manual path adjustment needed.
All path references in every mode below use LEARNINGS_BASE as the root:
| Logical path | Resolved path (global) | Resolved path (project) |
|---|---|---|
.learnings/LEARNINGS.md | ~/.claude/.learnings/LEARNINGS.md | .learnings/LEARNINGS.md |
.learnings/ERRORS.md | ~/.claude/.learnings/ERRORS.md | .learnings/ERRORS.md |
.learnings/FEATURE_REQUESTS.md | ~/.claude/.learnings/FEATURE_REQUESTS.md | .learnings/FEATURE_REQUESTS.md |
memory/working-buffer.md | ~/.claude/memory/working-buffer.md | memory/working-buffer.md |
memory/SESSION-STATE.md | ~/.claude/memory/SESSION-STATE.md | memory/SESSION-STATE.md |
memory/YYYY-MM-DD.md | ~/.claude/memory/YYYY-MM-DD.md | memory/YYYY-MM-DD.md |
.learnings/.pending-errors.log | ~/.claude/.learnings/.pending-errors.log | .learnings/.pending-errors.log |
Promotion targets (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md) always remain project-local regardless of scope — only the capture files follow LEARNINGS_BASE.
Reference: references/agent-self-improve.md → Global vs project scope
Scaffold the global workspace under ~/.claude/ — learnings persist across all projects.
/platform-skills:self-improve init globalSteps:
LEARNINGS_BASE=~/.claude/~/.claude/.learnings/ already exists: report current state, list existing files, and stop — do not overwrite~/.claude/.learnings/
LEARNINGS.md
ERRORS.md
FEATURE_REQUESTS.md
~/.claude/memory/
working-buffer.md
SESSION-STATE.mdStatus: example~/.claude/settings.json:
session-end.sh, session-start-reminder.sh) + inline bash PostToolUse; point to settings.json.examplesession-end.ps1, session-start-reminder.ps1) + inline PowerShell PostToolUse; point to settings-windows.json.exampleapk add bashexamples/agent-self-improve/scripts/; PostToolUse hook must use absolute path to .pending-errors.log (global setup)~/.claude/CLAUDE.md from the template at examples/agent-self-improve/global-claude.md✓ ~/.claude/.learnings/LEARNINGS.md — positive learnings
✓ ~/.claude/.learnings/ERRORS.md — mistakes and root causes
✓ ~/.claude/.learnings/FEATURE_REQUESTS.md — recurring unmet needs
✓ ~/.claude/memory/working-buffer.md — WAL scratchpad and task state
✓ ~/.claude/memory/SESSION-STATE.md — always-on session capture
✓ ~/.claude/memory/YYYY-MM-DD.md — daily notes (created on first use)/platform-skills:self-improve review after a few sessionsReference: references/agent-self-improve.md → Global vs project scope
Scaffold a project-local workspace in the current working directory — learnings live in the repo.
/platform-skills:self-improve init localSteps:
LEARNINGS_BASE=. (current working directory).learnings/ already exists in $PWD: report current state, list existing files, and stop — do not overwrite.learnings/
LEARNINGS.md
ERRORS.md
FEATURE_REQUESTS.md
memory/
working-buffer.md
SESSION-STATE.mdStatus: example.gitignore — ask the user:
.learnings/ and memory/ to .gitignorememory/ daily notes grow fast.claude/settings.json (this project only):
.learnings/.pending-errors.log (relative path is safe here — hooks run from project root)~/.claude/settings.json even for project-local learnings✓ .learnings/LEARNINGS.md — positive learnings
✓ .learnings/ERRORS.md — mistakes and root causes
✓ .learnings/FEATURE_REQUESTS.md — recurring unmet needs
✓ memory/working-buffer.md — WAL scratchpad and task state
✓ memory/SESSION-STATE.md — always-on session capture
✓ memory/YYYY-MM-DD.md — daily notes (created on first use)/platform-skills:self-improve review after a few sessionsReference: references/agent-self-improve.md → Global vs project scope
When called without global or local, ask the user to choose:
init global if neither ~/.claude/.learnings/ nor .learnings/ in $PWD exists~/.claude/.learnings/ already exists, recommend init local (global already set up).learnings/ in $PWD already exists, report its state and suggest using log, resume, or review insteadThen proceed as init global or init local based on the answer.
Reference: references/agent-self-improve.md → Directory layout, Entry format
Log a learning, error, or feature request to the appropriate file.
Steps:
LRN) — a technique, pattern, or shortcut that workedERR) — a mistake, misunderstanding, or failed assumptionFEAT) — a need that was unmet by the current skill or tool set<TYPE>-YYYYMMDD-NNN where NNN is the next sequential number in that file today$LEARNINGS_BASE/.learnings/ for an existing entry with the same context keywords. If one exists, update its Action field and keep the existing ID — do not create a duplicate.### LRN-20260520-001
**Status**: pending
**Context**: <one sentence — what was happening>
**Content**: <the learning, error description, or feature request>
**Action**: <what was done or should be done>Status: resolved and record what was done in Action.<ID> in $LEARNINGS_BASE/.learnings/<FILE>.md"Reference: references/agent-self-improve.md → Entry format, Recurring Pattern Detection
Resume an incomplete task after context compaction or session interruption.
Steps:
$LEARNINGS_BASE/memory/working-buffer.md — identify current task and last [x] step$LEARNINGS_BASE/memory/SESSION-STATE.md — reload corrections, preferences, and decisions$LEARNINGS_BASE/memory/YYYY-MM-DD.md — reload recent session exchangeskubectl get <resource> -n <namespace>terraform state listgit log --oneline -5[ ] step — do not re-run already-committed stepsStatus: PENDING, determine whether the operation completed (check the resource) and update to COMMITTED or ROLLED_BACK accordinglyNever ask "where were we?" — the buffer and session state answer that.
Reference: references/agent-self-improve.md → Compaction Recovery, SESSION-STATE
Scan $LEARNINGS_BASE/.learnings/ for recurring patterns and surface actionable items.
Steps:
$LEARNINGS_BASE/.learnings/ filesPROMOTION CANDIDATE — ERR: "missing resource limits"
Entries: ERR-20260518-001, ERR-20260519-002, ERR-20260520-001
Suggested target: .github/copilot-instructions.md → "Always add resource limits"pending state older than 7 daysresolved state older than 30 days — these are stale and should be either promoted or discarded:
STALE RESOLVED — LRN-20260410-001: "helm diff before upgrade" (45 days in resolved)
Action: run /platform-skills:self-improve promote LRN-20260410-001 or set Status: discardedFEAT entries that could be addressed by an existing platform-skills domain$LEARNINGS_BASE/.learnings/.pending-errors.log if it exists and is non-empty — convert each line to a proper ERR entry and clear the logLearnings: 8 total, 3 pending, 5 resolved (1 stale)
Errors: 5 total, 1 pending, 4 resolved
Feature requests: 2 total, 2 pending
Promotion candidates: 1 | Stale resolved: 1Reference: references/agent-self-improve.md → Recurring Pattern Detection
Promote a resolved entry to the correct memory file.
Steps:
Read the entry by ID (e.g. ERR-20260520-001)
Determine if this rule applies globally (all projects) or locally (this project only):
~/.claude/CLAUDE.md under ## Agent RulesCLAUDE.md / AGENTS.md or .github/copilot-instructions.mdIdentify the correct promotion target:
| Target | Scope | When to use |
|---|---|---|
~/.claude/CLAUDE.md → ## Agent Rules | Global | Rule applies across all projects (only available for global setup) |
CLAUDE.md / AGENTS.md | Project | Agent-level rules for this project only |
.github/copilot-instructions.md | Project | GitHub Copilot workspace rules |
A references/ guide | Shared | Reusable pattern for the whole team |
Draft the promoted line — imperative voice, ≤ 80 characters:
kubectl delete without first capturing the manifest"helm diff upgrade before helm upgrade to preview changes"Ask the user to confirm the target file and wording before writing
Append to the confirmed target file under ## Agent Rules or ## Platform Rules
Update the entry Status in $LEARNINGS_BASE/.learnings/ from resolved to promoted
Commit with a conventional commit message:
docs(memory): promote <ID> — <imperative summary>
Reference: references/agent-self-improve.md → Entry lifecycle, Promotion targets
Capture a correction, preference, decision, or proper noun to memory/SESSION-STATE.md.
Steps:
$LEARNINGS_BASE/memory/SESSION-STATE.md:
- YYYY-MM-DD — <one sentence capturing what was said or decided>Last updated: timestamp at the top of the file$LEARNINGS_BASE/memory/SESSION-STATE.md"When to invoke proactively (without being asked):
Reference: references/agent-self-improve.md → SESSION-STATE, Compaction Recovery
Print a one-screen health summary of the self-improve workspace — no changes made.
/platform-skills:self-improve statusSteps:
LEARNINGS_BASE (same auto-detection as all other modes)$LEARNINGS_BASE/.learnings/ files and $LEARNINGS_BASE/memory/working-buffer.mdSelf-Improve Status
───────────────────────────────────────────────
Workspace: ~/.claude/ (global) [or: ./ (local)]
Learnings 3 pending 8 resolved 2 promoted
Errors 1 pending 4 resolved 0 promoted
Feature reqs 2 pending 0 resolved 0 promoted
Pending errors log: 2 unprocessed entries
Working buffer: active task — "deploy payments service"
Buffer age: 2 days
Last session: 2026-05-23 (today)
Sessions since review: 3 of 5
Action items:
• 1 pending ERR older than 7 days → run review
• 2 stale resolved LRN (30+ days) → promote or discard
• Run /platform-skills:self-improve review (due in 2 sessions)
───────────────────────────────────────────────.pending-errors.log is non-empty, note count but do not drain it (status is read-only)Reference: references/agent-self-improve.md → Entry lifecycle
Move the workspace from one scope to the other without losing any entries.
/platform-skills:self-improve migrate global # project-local → ~/.claude/
/platform-skills:self-improve migrate local # ~/.claude/ → current projectSteps:
migrate global: source is .learnings/ and memory/ in $PWDmigrate local: source is ~/.claude/.learnings/ and ~/.claude/memory/$LEARNINGS_BASE/memory/working-buffer.md before moving anything.learnings/*.md entries and memory/ files to the targetMigrated to ~/.claude/ (global):
✓ .learnings/LEARNINGS.md — 8 entries
✓ .learnings/ERRORS.md — 5 entries
✓ .learnings/FEATURE_REQUESTS.md — 2 entries
✓ memory/working-buffer.md
✓ memory/SESSION-STATE.md
Source removed: ./.learnings/, ./memory/.claude/settings.json, offer to update them for the new scopeReference: references/agent-self-improve.md → Global vs project scope
These protocols run automatically when the proactive agent pattern is active. No explicit mode is required.
Before any destructive or hard-to-reverse operation:
$LEARNINGS_BASE/memory/working-buffer.md before acting## WAL Entry — YYYY-MM-DD HH:MM
**Operation**: <what is about to happen>
**Affected resources**: <list files, K8s resources, cloud resources>
**Blast radius**: <what could break>
**Rollback**: <exact command to undo>
**Status**: PENDINGStatus to COMMITTED after successROLLED_BACK if abortedDestructive operations requiring a WAL entry: deleting files, git reset --hard, git push --force, terraform destroy, dropping database tables, modifying shared infrastructure.
Maintain $LEARNINGS_BASE/memory/working-buffer.md as a live task scratchpad:
[x] progress markersMaintain $LEARNINGS_BASE/memory/SESSION-STATE.md as always-on session capture. Write to it before responding whenever:
This file is the second read in compaction recovery (after working-buffer, before daily notes).
Write notable exchanges, discoveries, and outcomes to $LEARNINGS_BASE/memory/YYYY-MM-DD.md (today's date). One file per day, append-only. Read today's file at session start alongside the working buffer.
Before reporting a task as complete:
kubectl get, terraform plan)Never claim a fix is done based on the intent to fix it. Evidence required.
When choosing between implementation approaches, apply this priority:
Stability > Explainability > Reusability > Scalability > NoveltyLog the decision in the buffer when it was a non-obvious choice.
Before unsolicited proactive action, score against four dimensions (max 100). Act only if score ≥ 50:
Check CLAUDE.md or AGENTS.md for VFM_THRESHOLD=<N> before applying the default of 50. Use that value if present.
For tasks > 10 steps or > 5 minutes, report progress without waiting to be asked:
[Heartbeat] Completed 4/7 steps. Currently: <step>. Next: <step>.Ask one clarifying question before acting on ambiguous or high-risk instructions. Never ask more than one question per instruction.
.claude-plugin
.github
commands
docs
examples
agent-self-improve
argocd
awesome-docs
aws
cloudfront
functions
lambda-edge
functions
azure
compliance
conventional-commits
datadog
llm-observability
demo
documentation
dora
dynatrace
fluxcd
github-actions
composite-actions
configure-cloud
db-migrate
docker-build-push
k8s-deploy
notify-slack
pr-comment
release-tag
security-scan
setup-env
setup-terraform
terraform-plan
helm
web-service
templates
kubernetes
kyverno
mcp
observability
openshift
pr-review
ownership
runtime-security
supply-chain
terraform
references
scripts
skills
platform-skills
tests