CtrlK
BlogDocsLog inGet started
Tessl Logo

nitinjain999/platform-skills

Production-grade platform engineering handbook — Kubernetes, Terraform, Flux CD, GitHub Actions, AWS, and more.

67

Quality

84%

Does it follow best practices?

Impact

No eval scenarios have been run

SecuritybySnyk

Passed

No known issues

Overview
Quality
Evals
Security
Files

HOW_IT_WORKS.mdexamples/agent-self-improve/

How Agent Self-Improvement Works

This document explains what the self-improve skill does, why it exists, and how to use each part of it.


The Problem It Solves

AI agents forget between sessions. Every time you start a new conversation, the agent has no memory of the last one. This means:

  • The same mistake gets made twice (or ten times)
  • A useful shortcut you discovered in one session is gone in the next
  • When a long task is interrupted mid-way, the agent cannot safely resume without manual re-briefing
  • Proactive actions happen on gut feel rather than a consistent decision framework

The self-improve skill gives the agent a persistent memory layer and a set of behavioral protocols that survive session boundaries, context compaction, and interruptions.


Two Patterns in One Skill

The skill combines two complementary patterns:

PatternWhat it does
Self-Improving AgentCaptures mistakes, learnings, and feature requests in .learnings/. Detects recurring patterns. Promotes stable lessons to project memory.
Proactive AgentGives the agent a consistent framework for when and how to act without being asked — WAL protocol, working buffer, decision scoring, heartbeat, and reverse prompting.

You can use either pattern in isolation, but they work best together.


Directory Structure

After running /platform-skills:self-improve init, your project gains:

.learnings/
  LEARNINGS.md          # Positive learnings — what worked, useful techniques
  ERRORS.md             # Mistakes and wrong assumptions — what broke and why
  FEATURE_REQUESTS.md   # Recurring needs the current tool set couldn't meet
  .pending-errors.log   # Scratch file written by the PostToolUse hook (gitignored)
memory/
  working-buffer.md     # Live task state and WAL log

Add to .gitignore for personal/local notes:

.learnings/
memory/working-buffer.md

Commit the directories if you want the team to share and build on them.


Entry Lifecycle

Every entry in .learnings/ follows the same structure and lifecycle:

pending → resolved → promoted
### ERR-20260520-001
**Status**: pending
**Context**: Applying a Terraform plan that replaced an RDS instance
**Content**: Assumed changing db_subnet_group_name was non-destructive. It forces replacement.
**Action**: Added lifecycle { prevent_destroy = true }. Promote to references/terraform.md.
StageMeaningWho acts
pendingLogged, not yet addressedAgent logs automatically
resolvedFix applied — action recordedAgent sets this in the same session if the fix was applied; otherwise user confirms
promotedWritten to project memoryAgent after running /platform-skills:self-improve promote

Key rule: If the agent logs an error and applies the fix in the same session, it sets Status: resolved immediately — no manual step needed.


The Five Modes

init global / init local — Bootstrap the workspace

Two explicit subcommands — no interactive prompt:

/platform-skills:self-improve init global

Creates ~/.claude/.learnings/ and ~/.claude/memory/. Learnings persist across all projects on your machine. Recommended for individuals. Offers to wire all three hooks in ~/.claude/settings.json and create ~/.claude/CLAUDE.md from the template.

/platform-skills:self-improve init local

Creates .learnings/ and memory/ in the current project directory. Learnings live in the repo and can be committed and shared with the team. Asks whether to gitignore or commit. Offers to add the PostToolUse hook to .claude/settings.json.

/platform-skills:self-improve init

No argument — asks you to choose global or local, then proceeds as above.

Both subcommands are idempotent: if the target directory already exists, they report the current state and stop without overwriting.


log — Capture a learning, error, or feature request

/platform-skills:self-improve log ERR
We assumed the EKS node group could be renamed in-place. It cannot — rename forces replacement of all nodes.

What the agent does:

  1. Classifies the entry (LRN, ERR, or FEAT)
  2. Checks .learnings/ for an existing entry with the same context — updates rather than duplicating
  3. Assigns the next sequential ID for today: ERR-20260520-001
  4. Writes the four-field entry and sets Status: resolved if the fix was applied
  5. Confirms: "Logged as ERR-20260520-001 in .learnings/ERRORS.md"

resume — Resume after an interruption

/platform-skills:self-improve resume

Use this when a session was interrupted (context compaction, browser close, long pause). The agent:

  1. Reads memory/working-buffer.md
  2. Identifies the last completed step
  3. Verifies the actual state of affected resources before proceeding:
    • Files: checks they exist and have expected content
    • Kubernetes: kubectl get <resource> -n <namespace>
    • Terraform: terraform state list
    • Git: git log --oneline -5
  4. Resumes from the first incomplete step — does not re-run committed steps
  5. Resolves any WAL entries still in PENDING state

Example scenario:

Session 1: You ask the agent to apply a Terraform plan across three modules.
           The agent completes module 1 and 2, then your laptop sleeps.

Session 2: /platform-skills:self-improve resume
           Agent reads the buffer: "Step 3: [ ] Apply module 3"
           Agent runs: terraform state list → confirms module 1 and 2 applied
           Agent applies module 3 and updates the buffer to [x]

review — Surface recurring patterns

/platform-skills:self-improve review

The agent reads all three .learnings/ files, groups entries by shared root-cause keywords, and reports:

  • Promotion candidates — contexts that appear three or more times
  • Stale entriespending entries older than 7 days
  • Actionable FEATs — feature requests that an existing platform-skills domain could now address

Example output:

PROMOTION CANDIDATE — ERR: "resource limits missing"
Entries: ERR-20260518-001, ERR-20260519-002, ERR-20260520-001
Suggested target: CLAUDE.md → "Always add resource requests and limits to every container spec"

Stale pending entries: 1
  ERR-20260510-001 — "helm upgrade failed on rollback" — 10 days old, still pending

Learnings: 8 total, 3 pending, 5 resolved
Errors: 5 total, 1 pending, 4 resolved
Feature requests: 2 total, 2 pending
Promotion candidates: 1

promote — Write a lesson to project memory

/platform-skills:self-improve promote ERR-20260520-001

The agent:

  1. Reads the entry

  2. Identifies the right promotion target:

    TargetWhen
    CLAUDE.md / AGENTS.mdAgent-level rule for every session in this project
    .github/copilot-instructions.mdGitHub Copilot workspace rules
    references/ guideReusable pattern for the whole team
  3. Drafts the promoted line in imperative voice (≤ 80 characters):

    • ERR → negative rule: "Never change db_subnet_group_name without a replace plan and snapshot"
    • LRN → positive rule: "Run helm diff upgrade before helm upgrade to preview rendered changes"
  4. Asks you to confirm the target file and wording before writing

  5. Appends to the confirmed file and updates the entry status to promoted

  6. Commits with: docs(memory): promote ERR-20260520-001 — never rename EKS node group in-place


WAL Protocol — Safe Destructive Operations

Before any hard-to-reverse operation, the agent writes a WAL entry to memory/working-buffer.md:

## WAL Entry — 2026-05-20 14:32
**Operation**: terraform apply — destroys and recreates payments-rds RDS instance
**Affected resources**: aws_db_instance.payments (us-east-1)
**Blast radius**: payments-api will lose database connectivity during replacement (~8 min)
**Rollback**: Restore from snapshot rds:payments-rds-2026-05-20 using aws rds restore-db-instance-from-db-snapshot
**Status**: PENDING

After the operation succeeds, it updates Status: COMMITTED. If aborted, Status: ROLLED_BACK.

Operations that always get a WAL entry:

  • Deleting or overwriting files
  • git reset --hard, git push --force
  • terraform destroy or terraform apply
  • Dropping database tables or truncating data
  • Modifying shared infrastructure

The WAL entry survives context compaction. If the session is interrupted between writing the entry and completing the operation, resume mode reads it and verifies actual resource state before proceeding.


VFM Scoring — When to Act Without Being Asked

Before taking any unsolicited proactive action, the agent scores the action:

DimensionWeightExample score
High Frequency (will this recur?)×38
Failure Reduction (prevents real breakage?)×39
User Burden (saves meaningful effort?)×27
Self Cost (low effort for the agent?)×29

Threshold: score ≥ 50 → act proactively. Score < 50 → surface as a note and defer.

Example — proactively adding missing resource limits to a Deployment:

High Frequency:    8 × 3 = 24  (missing limits is a repeat pattern)
Failure Reduction: 9 × 3 = 27  (OOM kills cause production incidents)
User Burden:       7 × 2 = 14  (user would need to find and fix manually)
Self Cost:         9 × 2 = 18  (trivial three-line edit)
Total: 83 → act

To raise the threshold for your project, add to CLAUDE.md or AGENTS.md:

VFM_THRESHOLD=70

The Growth Loop

Each session follows a consistent cycle that compounds over time:

Session start
  → Read working-buffer.md and .learnings/ to seed context
  → If buffer shows incomplete task: run /platform-skills:self-improve resume

During work
  → Log errors to ERRORS.md as they occur (Status: resolved immediately if fix applied)
  → Log useful techniques to LEARNINGS.md
  → Write WAL entry before any destructive operation
  → Update working-buffer.md after each significant step

Session end
  → Update buffer with final state
  → Check for recurring patterns (three+ same-context entries → promote)
  → Leave buffer intact if task is incomplete

Next session start
  → Buffer and learnings shorten ramp-up to < 60 seconds

Automatic Error Capture via Hook

With the PostToolUse hook configured, tool failures are automatically appended to $LEARNINGS_BASE/.learnings/.pending-errors.log.

Global setup (~/.claude/settings.json — applies to all projects):

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": ".*",
        "hooks": [
          {
            "type": "command",
            "command": "if [ \"$CLAUDE_TOOL_EXIT_CODE\" -ne 0 ]; then echo \"$(date -u +%Y-%m-%dT%H:%M:%SZ) TOOL_FAILURE: $CLAUDE_TOOL_NAME\" >> ~/.claude/.learnings/.pending-errors.log; fi"
          }
        ]
      }
    ]
  }
}

Project-local setup (.claude/settings.json in the project root):

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": ".*",
        "hooks": [
          {
            "type": "command",
            "command": "if [ \"$CLAUDE_TOOL_EXIT_CODE\" -ne 0 ]; then echo \"$(date -u +%Y-%m-%dT%H:%M:%SZ) TOOL_FAILURE: $CLAUDE_TOOL_NAME\" >> .learnings/.pending-errors.log; fi"
          }
        ]
      }
    ]
  }
}

Global setup must use the absolute ~/.claude/ path. Relative paths resolve from the project root and will write to the wrong place when global setup is active.

On /platform-skills:self-improve review, the agent reads .pending-errors.log, converts each line into a proper ERR entry in ERRORS.md, and clears the log.

For project-local setup, add to .gitignore:

.learnings/.pending-errors.log

What This Skill Cannot Do

  • It does not replace human post-mortems. It captures agent-level errors. System-level incidents and team retrospectives still belong in your incident management process.
  • It cannot verify that promoted rules are being followed. Once a rule is in CLAUDE.md, it is up to you to review it periodically and retire it if it becomes stale.
  • The working buffer is not a database. Keep it lean — summarise completed WAL entries into a single ## Completed block and delete individual entries once the task is done.
  • Global learnings are not shared automatically. The global workspace (~/.claude/) is local to your machine. To share lessons with the team, use project-local setup and commit .learnings/, or promote entries to a shared references/ file.

Further Reading

  • references/agent-self-improve.md — full reference for all protocols
  • commands/self-improve.md — slash command specification
  • HOW_IT_WORKS.md — how the platform-skills skill works in general
  • GETTING_STARTED.md — install and first session

examples

BEFORE_AFTER.md

CHANGELOG.md

CODE_OF_CONDUCT.md

COMMANDS.md

CONTRIBUTING.md

EDITOR_INTEGRATIONS.md

GETTING_STARTED.md

HOW_IT_WORKS.md

install.sh

INSTALLATION.md

LAUNCH.md

PROMPTS.md

QUICKSTART.md

README.md

renovate.json

SECURITY.md

SKILL.md

tessl.json

tile.json