Use when the user is in the middle of a production incident and needs an orchestrated plan — phrases like "we have an outage", "prod is crashing", "page me through this", "what do I do first?", "mitigate or roll back?", "production is down since the last release", "something's broken in staging — help me triage", "we're on an incident call, walk me through the ConfigHub side", "post-incident cleanup". Triage the situation, decide between stabilize-and-mitigate vs head-moving rollback vs drift reconciliation, route to the right mutation skill with the scope and `--change-desc` composed, and drive the post-incident verification + close-out. Do not load for planned releases (use `promote-release`), for routine change management, or for single-Unit edits the user is confidently making on their own (use `cub-mutate`).
Orchestrator for the ConfigHub side of a production incident. Triages, decides stabilize vs. rollback vs. reconcile, and hands off to the mutating skill that will do the actual work. Does not mutate itself.
An incident is an ongoing loss of service. The first move is to return to a known-good state; root-cause analysis comes after the bleeding stops. This skill is biased toward the fastest path back to green, not the most elegant fix.
Related skills: `promote-release`, `cub-mutate`, `verify-apply`, `confighub-core`.

Ask (or derive from context) the following questions in order. Stop at the first one that routes clearly:
**What was applied to the cluster most recently, and is that the suspected cause?**
Only Apply actions change live state — ChangeSet opens, mutations, upgrades, and `--restore` operations don't hit the cluster until the corresponding `cub unit apply`. Ignore the head-revision / ChangeSet noise; look at what the Worker actually deployed in the incident window.
```bash
# Apply actions across the org that started after <incident-window-start>.
# Returns at most one row per Unit (the latest Apply), with Unit + Space
# slug columns so a repeated slug across Spaces is disambiguated.
cub unit-action list --space '*' \
  --where "Action = 'Apply' AND CreatedAt > '<ISO-timestamp>'"

# To scope by app, first get the Unit IDs from the app filter, then use
# UnitID IN (...) in the where clause:
cub unit list --space '*' --filter <app>-home/<app>-app \
  -o jq='[.[] | .Unit.UnitID] | join("'"'"','"'"'")' \
  --no-headers
# => paste the IDs into:
cub unit-action list --space '*' \
  --where "Action = 'Apply' AND CreatedAt > '<ISO-timestamp>' AND UnitID IN ('<id1>','<id2>',...)"
```

If this returns rows plausibly linked to symptoms → rollback path (Path A). If it returns rows but symptoms don't match, or returns empty → mitigate path (Path B).
Inspect a specific action with cub unit-action get <unit-slug> <num> (the Num column from list), adding --data, --livedata, --livestate, or --bridgestate when the payload matters for triage.
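For illustration, a minimal sketch of that inspection, using only the flags named above. Passing `--space` to the get form is an assumption (the list form takes it), and the exact payload shape depends on your cub version:

```bash
# Pull the payload views of the suspect Apply when they matter for triage.
# <unit-slug> and <num> come from the triage listing above; --space on the
# get form is an assumption carried over from the list form.
cub unit-action get <unit-slug> <num> --space <env-space> --data
cub unit-action get <unit-slug> <num> --space <env-space> --livestate
```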
**Did anyone make out-of-band cluster changes (kubectl, argocd sync, flux reconcile) to stabilize?**
If so → `drift-reconcile` to decide who wins per change.

**Is the breakage widespread (many Units) or narrow (one Unit)?**
(Units that went out together in one release share a `ChangeSet.Slug` on their end-tag revisions.)

**Is ConfigHub itself healthy?**
Check `cub worker status --space <workers-space> <worker>`. If the Worker is down, mutations can't apply anyway — route to the `worker-bootstrap` fix first and hold the ConfigHub-side mitigation until the Worker is back. Confirm the API itself is reachable: `cub space list` succeeds.

**Is the user being paged right now, doing post-incident cleanup, or bounding the blast radius?**
Use this triage to pick one of the three paths.
Use when the incident started after a specific recent cub unit apply / promotion apply and reverting is the fastest path back. "Recent change" alone isn't the trigger — it's "recent change that was applied into the cluster."
The triage query in question 1 already identified the Units with applies in the incident window. If those applies came from a single release ChangeSet, restore scope = that ChangeSet. Otherwise, restore per-Unit.
```bash
# Which ChangeSet's end-tag apply produced the suspect applies? Read the
# LastChangeDescription from the triage output, or look at the applied revision:
cub revision list --space <env-space> <unit-slug> \
  --where "RevisionNum = <LastAppliedRevisionNum from triage>" \
  -o jq='.[] | {ChangeSet: .Revision.ChangeSet.Slug, Description: .Revision.Description}'

# Recent ChangeSets in the home Space (useful when the apply query above
# returned multiple Units in the same ChangeSet).
cub changeset list --space <app>-home --where "CreatedAt >= '<incident start ISO timestamp>'"
```

Pick the restore target — `Before:ChangeSet:<home-space>/<slug>` if the suspect is a ChangeSet, or `LastAppliedRevisionNum` (before the incident-window apply — i.e., `PreviousLiveRevisionNum` or a specific prior number) for single-Unit cases.
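For the single-Unit case, one way to shortlist a pre-incident restore target is to list the Unit's revisions created before the incident window. This is a sketch that reuses the `cub revision list` form shown above; the output ordering and columns are assumptions, so eyeball the result rather than scripting against it:

```bash
# Revisions of the suspect Unit created before the incident window — candidates
# for a single-Unit restore target when no ChangeSet groups the change.
cub revision list --space <env-space> <unit-slug> \
  --where "CreatedAt < '<incident-window-start ISO timestamp>'"
```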
Route to rollback-revision with the scope + target, and with a --change-desc draft that makes the incident context explicit:
```
Incident rollback — <one-line symptom>. Reverting <slug-or-changeset> per on-call decision.
User prompt: <verbatim>
Clarifications: <condensed — link to incident ticket / Slack thread, symptom evidence, who approved>
```

`rollback-revision` owns the `cub unit update --restore` + `cub-apply` hand-off. `incident-management` returns after apply completes, then moves to the verify + close-out block below.
Remember: rollback means `cub unit update --restore`. `cub unit apply --revision <N>` is not a rollback — it leaves head unchanged and the bad state returns on the next forward change.
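A minimal sketch of that distinction, assuming `cub unit update` takes the Unit slug and `--space` the way the other unit commands here do. The exact `--restore` argument syntax is owned by `rollback-revision`, so treat this as an illustration, not a copy-paste command:

```bash
# Head-moving rollback: head now points at the known-good state, so future
# forward changes build on it. (What rollback-revision performs.)
cub unit update <unit-slug> --space <env-space> --restore <known-good-revision> \
  --change-desc "Incident rollback — <symptom>. Reverting <unit-slug> per on-call decision."

# NOT a rollback: deploys revision N but leaves head where it was, so the bad
# state comes back on the next forward change.
cub unit apply <unit-slug> --space <env-space> --revision <N>
```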
Use when the incident root cause is not a recent mutation (infra flake, load spike, external dependency down, bad external image on registry side) and a targeted forward fix gets to green faster than a rollback.
Typical mitigations:
| Symptom | Mitigation | Skill |
|---|---|---|
| Saturation / traffic spike | Scale up (set-replicas), raise resources (set-container-resources). | cub-mutate |
| Bad image tag just pushed by CI | Pin to the last known-good image via set-container-image. | cub-mutate |
| Feature flag / env var causing crash | Flip it off via set-env-var. | cub-mutate |
| Broken probe taking pods down | Relax or disable via set-int-path / yq-i / set-starlark / set-cel. | cub-mutate |
| Pod spec wedged on a new admission policy | Adjust the security-context / label the pod sets to satisfy policy. | cub-mutate |
For any mitigation that touches more than one Unit, open a ChangeSet (named incident-<YYYYMMDD>-<ticket>) so the fix can be tagged and rolled back as a set if it doesn't hold. Single-Unit: skip the ChangeSet.
Remember: opening a ChangeSet and running mutations inside it does not touch the cluster. Live state changes only when cub unit apply runs (explicitly, or via --revision ChangeSet:<slug> against the filter once the ChangeSet is closed). Compose the mutations, close the ChangeSet, then hand off to cub-apply to put the fix in front of traffic.
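For illustration, the eventual hand-off to `cub-apply` for a closed incident ChangeSet might look roughly like the sketch below. It assumes `cub unit apply` accepts the same `--space`/`--filter` scoping as the list commands above, which is an assumption to verify before running:

```bash
# Apply every Unit matched by the app filter at the revision recorded by the
# closed incident ChangeSet (run by cub-apply, not by this skill).
cub unit apply --space <env-space> --filter <app>-home/<app>-app \
  --revision ChangeSet:incident-<YYYYMMDD>-<ticket>
```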
Compose the --change-desc the same way — make the incident context explicit — and hand off to cub-mutate for the data change, then cub-apply to push it.
```
Incident mitigation — <symptom>. <What we changed and why>.
User prompt: <verbatim>
Clarifications: <condensed — ticket / channel, decision, expected-to-hold-through>
```

Use after an incident where someone applied hotfixes directly in-cluster (kubectl edit, argocd app sync with an override, etc.) and ConfigHub now diverges from what's live.
Hand off to drift-reconcile with the scope of affected Units. The decisions there (ConfigHub wins / cluster wins / selective merge) should usually lean cluster-wins for incident-time edits that stuck — absorb the stabilization into Data so the next apply doesn't undo it — and ConfigHub-wins for edits that were temporary and the user wants gone.
Don't start Path C until the incident is contained (symptoms resolved, no ongoing paging). Reconciling drift while pods are still crashing just adds confusion.
Every mutation during the incident, regardless of path, carries a --change-desc with:
```
Incident: <ticket-id or short slug>
```

This costs nothing in the moment and saves hours in the postmortem. Skip anything else that doesn't get you back to green — naming Tags, cleaning up slugs, renaming ChangeSets — until the incident is closed.
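As a worked example with hypothetical values (the ticket ID, symptom, and resource change are made up), a Path B mitigation's `--change-desc` might read:

```
Incident: INC-4217
Incident mitigation — checkout pods OOM-killed under a traffic spike. Raised memory limits on the checkout Unit from 512Mi to 1Gi.
User prompt: "prod is crashing, checkout is down, what do I do first?"
Clarifications: ticket INC-4217, #inc-checkout thread; on-call approved the scale-up; expected to hold until the root-cause fix lands.
```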
Once symptoms are gone and the user confirms stable:
Tag the resolution. Tag the rollback / mitigation revisions so they're a first-class reference for the postmortem:
```bash
cub tag create --space <app>-home incident-<YYYYMMDD>-<ticket> \
  --annotation "description=<short incident description> — resolution type: rollback|mitigate|reconcile"
cub unit tag <app>-home/incident-<YYYYMMDD>-<ticket> \
  --space <env-space> --filter <app>-home/<app>-app
```

Run `verify-apply` on the affected scope. It classifies the apply outcome (Progressing / Completed / Failed / Aborted), produces a three-way ConfigHub ↔ controller ↔ cluster agreement table on request, and — once everything converges — closes the incident out with a clickable Revision history + review links.
If Path B with a ChangeSet was used, consider whether the ChangeSet should be rolled back. A mitigation is often temporary — the root-cause fix comes later. Keep the ChangeSet open as a marker, or close it and rely on Tag:incident-* for retrieval.
If Path C absorbed manual cluster fixes into ConfigHub, confirm the absorbed data has passed the Space's platform/standard-vets Triggers. Incident-time edits often produce vet failures; fix them in a follow-up change once you're out of the hot window.
Read-only and decision-only. This skill does not mutate. Every mutation during an incident goes through the skill it hands off to:
- Rollbacks → `rollback-revision`.
- Data changes → `cub-mutate`.
- Applies → `cub-apply`.
- Drift decisions → `drift-reconcile`.
- Close-out verification → `verify-apply`.

If you find yourself about to run `cub unit update` / `cub function do` / `cub function set` / `cub unit apply` from here, stop and hand off.
--change-desc" to save time. Push back — it's one line, costs nothing, saves hours in postmortem. If they insist and the situation is truly dire, log the mutations and the reasoning yourself in the session; a follow-up commit can annotate.cub unit apply --revision <N> as a rollback. Stop and route to rollback-revision — that approach leaves head unchanged and the bad state returns on the next change.worker-bootstrap) first.verify-apply confirms the affected scope is Completed, three-way converged (where applicable), and closed out.cub revision list --space <env-space> --filter <app>-home/<app>-app --tag <app>-home/incident-<...> surfaces every incident-related revision under one query.--change-desc.cub tag get --space <app>-home incident-<...> --web — the incident marker in the GUI.cub changeset get --space <app>-home <slug> --web — if a ChangeSet was opened.cub unit get <unit> --space <env-space> --web per affected Unit.references/changesets.md — ChangeSet lifecycle + rollback via Before:ChangeSet:<slug>.references/revisions.md — restore-target syntax.references/cub-cli.md — --change-desc discipline.rollback-revision (Path A), cub-mutate (Path B data change), cub-apply (Path B runtime), drift-reconcile (Path C), verify-apply (close-out), worker-bootstrap (Worker-down blocker), promote-release (the opposite-direction skill — don't use during an incident).59ea831