Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.
This document explains what the self-improve skill does, why it exists, and how to use each part of it.
AI agents forget between sessions. Every time you start a new conversation, the agent has no memory of the last one. This means:
The self-improve skill gives the agent a persistent memory layer and a set of behavioral protocols that survive session boundaries, context compaction, and interruptions.
The skill combines two complementary patterns:
| Pattern | What it does |
|---|---|
| Self-Improving Agent | Captures mistakes, learnings, and feature requests in .learnings/. Detects recurring patterns. Promotes stable lessons to project memory. |
| Proactive Agent | Gives the agent a consistent framework for when and how to act without being asked — WAL protocol, working buffer, decision scoring, heartbeat, and reverse prompting. |
You can use either pattern in isolation, but they work best together.
After running /platform-skills:self-improve init, your project gains:

    .learnings/
      LEARNINGS.md          # Positive learnings: what worked, useful techniques
      ERRORS.md             # Mistakes and wrong assumptions: what broke and why
      FEATURE_REQUESTS.md   # Recurring needs the current tool set couldn't meet
      .pending-errors.log   # Scratch file written by the PostToolUse hook (gitignored)
    memory/
      working-buffer.md     # Live task state and WAL log

Add to .gitignore for personal/local notes:

    .learnings/
    memory/working-buffer.md

Commit the directories if you want the team to share and build on them.
Every entry in .learnings/ follows the same structure and lifecycle:

    pending → resolved → promoted

Example entry:

    ### ERR-20260520-001
    **Status**: pending
    **Context**: Applying a Terraform plan that replaced an RDS instance
    **Content**: Assumed changing db_subnet_group_name was non-destructive. It forces replacement.
    **Action**: Added lifecycle { prevent_destroy = true }. Promote to references/terraform.md.

| Stage | Meaning | Who acts |
|---|---|---|
| pending | Logged, not yet addressed | Agent logs automatically |
| resolved | Fix applied, action recorded | Agent sets this in the same session if the fix was applied; otherwise user confirms |
| promoted | Written to project memory | Agent after running /platform-skills:self-improve promote |

Key rule: If the agent logs an error and applies the fix in the same session, it sets Status: resolved immediately. No manual step is needed.
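The lifecycle above can be read as a small state machine. The following is a minimal sketch in Python, not the skill's implementation: the `Entry` class, the `advance` function, and the rule that stages may only move one step forward are assumptions made for illustration.

```python
from dataclasses import dataclass

# Stage order taken from the lifecycle: pending -> resolved -> promoted.
STAGES = ("pending", "resolved", "promoted")


@dataclass
class Entry:
    """Hypothetical in-memory view of one .learnings/ entry."""
    entry_id: str
    status: str = "pending"


def advance(entry: Entry, new_status: str) -> Entry:
    """Move an entry exactly one stage forward (assumed rule: no skips, no reversals)."""
    current = STAGES.index(entry.status)
    target = STAGES.index(new_status)  # raises ValueError for an unknown stage
    if target != current + 1:
        raise ValueError(
            f"cannot move {entry.entry_id} from {entry.status} to {new_status}"
        )
    return Entry(entry.entry_id, new_status)


err = Entry("ERR-20260520-001")
err = advance(err, "resolved")   # fix applied in the same session
err = advance(err, "promoted")   # after running the promote command
print(err.status)
```

Treating the stages as an ordered tuple keeps the check to a single index comparison; a real tool might allow other transitions, such as reopening a resolved entry.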
init global / init local: Bootstrap the workspace

Two explicit subcommands, no interactive prompt:

    /platform-skills:self-improve init global

Creates ~/.claude/.learnings/ and ~/.claude/memory/. Learnings persist across all projects on your machine. Recommended for individuals. Offers to wire all three hooks in ~/.claude/settings.json and create ~/.claude/CLAUDE.md from the template.

    /platform-skills:self-improve init local

Creates .learnings/ and memory/ in the current project directory. Learnings live in the repo and can be committed and shared with the team. Asks whether to gitignore or commit. Offers to add the PostToolUse hook to .claude/settings.json.

    /platform-skills:self-improve init

No argument: asks you to choose global or local, then proceeds as above.

Both subcommands are idempotent: if the target directory already exists, they report the current state and stop without overwriting.
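The idempotency guarantee can be illustrated with a short Python sketch. It assumes a hypothetical helper, `init_workspace`, that returns the strings "created" or "exists"; the directory and file names come from the layout described in this document, and everything else is an assumption.

```python
import tempfile
from pathlib import Path


def init_workspace(base: Path) -> str:
    """Create .learnings/ and memory/ under base unless either already exists."""
    learnings = base / ".learnings"
    memory = base / "memory"
    if learnings.exists() or memory.exists():
        return "exists"  # report and stop: nothing is overwritten
    learnings.mkdir(parents=True)
    memory.mkdir(parents=True)
    for name in ("LEARNINGS.md", "ERRORS.md", "FEATURE_REQUESTS.md"):
        (learnings / name).touch()
    (memory / "working-buffer.md").touch()
    return "created"


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    first = init_workspace(base)
    # Simulate an entry logged between the two init runs.
    (base / ".learnings" / "ERRORS.md").write_text("### ERR-20260520-001\n")
    second = init_workspace(base)
    kept = (base / ".learnings" / "ERRORS.md").read_text()

print(first, second)
```

The second call leaves the logged entry untouched, which is the property that makes re-running init safe.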
log: Capture a learning, error, or feature request

    /platform-skills:self-improve log ERR
    We assumed the EKS node group could be renamed in-place. It cannot: a rename forces replacement of all nodes.

What the agent does:

- Classifies the entry by type (LRN, ERR, or FEAT)
- Checks .learnings/ for an existing entry with the same context and updates it rather than duplicating
- Assigns an ID such as ERR-20260520-001
- Sets Status: resolved if the fix was applied
- Confirms the result, for example: "Logged ERR-20260520-001 in .learnings/ERRORS.md"

resume: Resume after an interruption

    /platform-skills:self-improve resume

Use this when a session was interrupted (context compaction, browser close, long pause). The agent:
- Reads memory/working-buffer.md
- Verifies actual state with commands such as kubectl get <resource> -n <namespace>, terraform state list, and git log --oneline -5
- Reviews any WAL entries still in PENDING state

Example scenario:

    Session 1: You ask the agent to apply a Terraform plan across three modules.
               The agent completes modules 1 and 2, then your laptop sleeps.
    Session 2: /platform-skills:self-improve resume
               Agent reads the buffer: "Step 3: [ ] Apply module 3"
               Agent runs: terraform state list → confirms modules 1 and 2 applied
               Agent applies module 3 and updates the buffer to [x]

review: Surface recurring patterns

    /platform-skills:self-improve review

The agent reads all three .learnings/ files, groups entries by shared root-cause keywords, and reports:
- Promotion candidates: recurring entries that share a root cause
- pending entries older than 7 days
- Totals per file, broken down by status

Example output:

    PROMOTION CANDIDATE — ERR: "resource limits missing"
      Entries: ERR-20260518-001, ERR-20260519-002, ERR-20260520-001
      Suggested target: CLAUDE.md → "Always add resource requests and limits to every container spec"

    Stale pending entries: 1
      ERR-20260510-001 — "helm upgrade failed on rollback" — 10 days old, still pending

    Learnings: 8 total, 3 pending, 5 resolved
    Errors: 5 total, 1 pending, 4 resolved
    Feature requests: 2 total, 2 pending
    Promotion candidates: 1

promote: Write a lesson to project memory

    /platform-skills:self-improve promote ERR-20260520-001

The agent:

1. Reads the entry
2. Identifies the right promotion target:

   | Target | When |
   |---|---|
   | CLAUDE.md / AGENTS.md | Agent-level rule for every session in this project |
   | .github/copilot-instructions.md | GitHub Copilot workspace rules |
   | references/ guide | Reusable pattern for the whole team |

3. Drafts the promoted line in imperative voice (≤ 80 characters):
   - "Never change db_subnet_group_name without a replace plan and snapshot"
   - "Run helm diff upgrade before helm upgrade to preview rendered changes"
4. Asks you to confirm the target file and wording before writing
5. Appends to the confirmed file and updates the entry status to promoted
6. Commits with: docs(memory): promote ERR-20260520-001 — never rename EKS node group in-place
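The drafting and appending steps of promote can be sketched as a pure function. This is an illustration under stated assumptions, not the skill's code: the `promote` helper is hypothetical, and writing the rule as a `- ` bullet at the end of the target file is an assumed layout. The 80-character limit and the `**Status**` field come from this document.

```python
import re

MAX_LEN = 80  # length limit stated for promoted lines


def promote(entry_text: str, rule: str, target_text: str) -> tuple[str, str]:
    """Return (updated entry, updated target file) after promoting one rule."""
    if len(rule) > MAX_LEN:
        raise ValueError(f"rule is {len(rule)} characters; limit is {MAX_LEN}")
    # Flip the first Status field of the entry to "promoted".
    updated_entry, count = re.subn(
        r"(\*\*Status\*\*: )\w+", r"\g<1>promoted", entry_text, count=1
    )
    if count == 0:
        raise ValueError("entry has no Status field")
    # Append the rule as a bullet (assumed format), keeping one rule per line.
    separator = "" if (not target_text or target_text.endswith("\n")) else "\n"
    updated_target = f"{target_text}{separator}- {rule}\n"
    return updated_entry, updated_target


entry = "### ERR-20260520-001\n**Status**: resolved\n"
rule = "Never change db_subnet_group_name without a replace plan and snapshot"
new_entry, new_target = promote(entry, rule, "# Project rules\n")
print(new_entry)
print(new_target)
```

Keeping the function free of file I/O makes it easy to show the user the exact wording and target content for confirmation before anything is written.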
Before any hard-to-reverse operation, the agent writes a WAL (write-ahead log) entry to memory/working-buffer.md:

    ## WAL Entry — 2026-05-20 14:32
    **Operation**: terraform apply — destroys and recreates payments-rds RDS instance
    **Affected resources**: aws_db_instance.payments (us-east-1)
    **Blast radius**: payments-api will lose database connectivity during replacement (~8 min)
    **Rollback**: Restore from snapshot rds:payments-rds-2026-05-20 using aws rds restore-db-instance-from-db-snapshot
    **Status**: PENDING

After the operation succeeds, the agent sets Status: COMMITTED. If the operation is aborted, it sets Status: ROLLED_BACK.

Operations that always get a WAL entry include:

- git reset --hard, git push --force
- terraform destroy or terraform apply

The WAL entry survives context compaction. If the session is interrupted between writing the entry and completing the operation, resume mode reads it and verifies actual resource state before proceeding.
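The WAL protocol described above comes down to two checks: does a command need an entry, and how does the entry's status change afterwards. The sketch below is a hedged illustration; `needs_wal` and `set_status` are hypothetical helpers, and matching commands by prefix is an assumption. The command list and the three status values are the ones named in this document.

```python
import re

# Commands this document lists as always requiring a WAL entry.
ALWAYS_WAL = (
    "git reset --hard",
    "git push --force",
    "terraform destroy",
    "terraform apply",
)
STATUSES = ("PENDING", "COMMITTED", "ROLLED_BACK")


def needs_wal(command: str) -> bool:
    """True if the command starts with one of the always-logged operations."""
    return command.strip().startswith(ALWAYS_WAL)


def set_status(entry: str, new_status: str) -> str:
    """Rewrite the Status line of a single WAL entry."""
    if new_status not in STATUSES:
        raise ValueError(f"unknown status: {new_status}")
    updated, count = re.subn(
        r"^\*\*Status\*\*: \w+$",
        f"**Status**: {new_status}",
        entry,
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise ValueError("entry has no Status line")
    return updated


entry = "**Operation**: terraform apply\n**Status**: PENDING\n"
if needs_wal("terraform apply -auto-approve"):
    entry = set_status(entry, "COMMITTED")  # after the operation succeeds
print(entry)
```

Because the entry is written with Status: PENDING before the command runs, an interrupted session leaves a visible marker that resume can check against real resource state.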
Before taking any unsolicited proactive action, the agent scores the action:
| Dimension | Weight | Example score |
|---|---|---|
| High Frequency (will this recur?) | ×3 | 8 |
| Failure Reduction (prevents real breakage?) | ×3 | 9 |
| User Burden (saves meaningful effort?) | ×2 | 7 |
| Self Cost (low effort for the agent?) | ×2 | 9 |
Threshold: score ≥ 50 → act proactively. Score < 50 → surface as a note and defer.
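The weighted sum behind the threshold can be sketched in a few lines of Python. The weights, the threshold of 50, and the example scores are taken from the table above; the function names, the dictionary keys, and the assumption that each dimension is rated on a 0 to 10 scale are illustrative only.

```python
# Weights from the scoring table.
WEIGHTS = {
    "high_frequency": 3,
    "failure_reduction": 3,
    "user_burden": 2,
    "self_cost": 2,
}
THRESHOLD = 50  # default threshold stated in the text


def score(ratings: dict[str, int]) -> int:
    """Weighted sum of the four dimension ratings."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


def decide(ratings: dict[str, int], threshold: int = THRESHOLD) -> str:
    """Return "act" at or above the threshold, otherwise "defer"."""
    return "act" if score(ratings) >= threshold else "defer"


# Example scores from the table: 8, 9, 7, 9.
example = {
    "high_frequency": 8,
    "failure_reduction": 9,
    "user_burden": 7,
    "self_cost": 9,
}
print(score(example), decide(example))  # 83 act
```

With ratings capped at 10, the maximum possible score is 100, so a threshold of 50 means an action must earn at least half of the available weight.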
Example (proactively adding missing resource limits to a Deployment):

    High Frequency:    8 × 3 = 24  (missing limits is a repeat pattern)
    Failure Reduction: 9 × 3 = 27  (OOM kills cause production incidents)
    User Burden:       7 × 2 = 14  (user would need to find and fix manually)
    Self Cost:         9 × 2 = 18  (trivial three-line edit)
    Total: 83 → act

To raise the threshold for your project, add to CLAUDE.md or AGENTS.md:

    VFM_THRESHOLD=70

Each session follows a consistent cycle that compounds over time:

    Session start
      → Read working-buffer.md and .learnings/ to seed context
      → If buffer shows incomplete task: run /platform-skills:self-improve resume
    During work
      → Log errors to ERRORS.md as they occur (Status: resolved immediately if fix applied)
      → Log useful techniques to LEARNINGS.md
      → Write WAL entry before any destructive operation
      → Update working-buffer.md after each significant step
    Session end
      → Update buffer with final state
      → Check for recurring patterns (three+ same-context entries → promote)
      → Leave buffer intact if task is incomplete
    Next session start
      → Buffer and learnings shorten ramp-up to < 60 seconds

With the PostToolUse hook configured, tool failures are automatically appended to $LEARNINGS_BASE/.learnings/.pending-errors.log.
Global setup (~/.claude/settings.json, applies to all projects):

    {
      "hooks": {
        "PostToolUse": [
          {
            "matcher": ".*",
            "hooks": [
              {
                "type": "command",
                "command": "if [ \"$CLAUDE_TOOL_EXIT_CODE\" -ne 0 ]; then echo \"$(date -u +%Y-%m-%dT%H:%M:%SZ) TOOL_FAILURE: $CLAUDE_TOOL_NAME\" >> ~/.claude/.learnings/.pending-errors.log; fi"
              }
            ]
          }
        ]
      }
    }

Project-local setup (.claude/settings.json in the project root):

    {
      "hooks": {
        "PostToolUse": [
          {
            "matcher": ".*",
            "hooks": [
              {
                "type": "command",
                "command": "if [ \"$CLAUDE_TOOL_EXIT_CODE\" -ne 0 ]; then echo \"$(date -u +%Y-%m-%dT%H:%M:%SZ) TOOL_FAILURE: $CLAUDE_TOOL_NAME\" >> .learnings/.pending-errors.log; fi"
              }
            ]
          }
        ]
      }
    }

Global setup must use the absolute ~/.claude/ path. Relative paths resolve from the project root and will write to the wrong place when global setup is active.
On /platform-skills:self-improve review, the agent reads .pending-errors.log, converts each line into a proper ERR entry in ERRORS.md, and clears the log.
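The conversion step can be sketched as follows. The log line shape (UTC timestamp, then `TOOL_FAILURE:`, then the tool name) follows the hook command shown in this document, and the entry fields follow the entry format used in .learnings/. The `convert` function, the wording of the generated fields, and numbering entries from 001 per day are assumptions; a real implementation would continue from the IDs already present in ERRORS.md.

```python
import re
from collections import defaultdict

# One hook line looks like: 2026-05-20T14:32:00Z TOOL_FAILURE: Bash
LINE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})T\S+ TOOL_FAILURE: (.+)$")


def convert(log_text: str) -> list[str]:
    """Turn each well-formed log line into a pending ERR entry."""
    counters: dict[str, int] = defaultdict(int)
    entries = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip blank or malformed lines
        year, month, day, tool = match.groups()
        date = f"{year}{month}{day}"
        counters[date] += 1  # assumed: sequence restarts at 001 each day
        entry_id = f"ERR-{date}-{counters[date]:03d}"
        entries.append(
            f"### {entry_id}\n"
            "**Status**: pending\n"
            "**Context**: Tool failure recorded by the PostToolUse hook\n"
            f"**Content**: {tool} exited with a non-zero status\n"
            "**Action**: (to be filled in during review)\n"
        )
    return entries


log = (
    "2026-05-20T14:32:00Z TOOL_FAILURE: Bash\n"
    "2026-05-20T14:40:10Z TOOL_FAILURE: Edit\n"
)
entries = convert(log)
print(entries[1].splitlines()[0])  # ### ERR-20260520-002
```

The hook only records the tool name and time, so the Context and Action fields still need the agent or the user to supply the root cause during review.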
For project-local setup, add to .gitignore:

    .learnings/.pending-errors.log

- […] CLAUDE.md, it is up to you to review it periodically and retire it if it becomes stale.
- […] ## Completed block and delete individual entries once the task is done.
- […] (~/.claude/) is local to your machine. To share lessons with the team, use project-local setup and commit .learnings/, or promote entries to a shared references/ file.