Manages planned cluster maintenance across all tiers. Self-Hosted covers node drain procedures for OS patching, hardware changes, and configuration updates. Advanced/BYOC covers maintenance window configuration, patch scheduling, deferral policies, and monitoring during CRL-managed maintenance. Standard and Basic maintenance is fully managed with no customer action. Use when planning maintenance, configuring maintenance windows, or preparing applications for maintenance events.
Manages planned cluster maintenance across all deployment tiers. For Self-Hosted, this means draining and restarting individual nodes. For Advanced/BYOC, this means configuring and managing maintenance windows for CRL-applied patches. For Standard and Basic, maintenance is fully managed with no customer action required.
For permanent node removal: Use managing-cluster-capacity. For pre-maintenance health check: Use reviewing-cluster-health. For version upgrades: Use upgrading-cluster-version.
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines maintenance procedure |
| Goal? | Plan maintenance, Configure maintenance window, Defer a patch, Monitor during maintenance, Prepare application | Routes to the right procedure |
If Self-Hosted:
| Question | Options | Why It Matters |
|---|---|---|
| Maintenance type? | OS patching, Hardware change, Binary upgrade, Config change, Planned restart | Affects sequencing and post-maintenance steps |
| Deployment platform? | Bare metal, VMs, Kubernetes (Operator/Helm/manual) | Changes drain and restart commands |
| Process manager? | systemd, manual, container orchestrator | Changes stop/start commands |
| Target node ID? | Node ID | Required for drain command |
| Long-running queries expected? | Yes (increase drain timeout), No (default timeout) | Determines drain-wait parameter |
If Advanced or BYOC:
| Question | Options | Why It Matters |
|---|---|---|
| Maintenance window configured? | Yes (what schedule), No | Determines if window needs setup |
| Patch pending? | Yes, No, Don't know | Determines urgency |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring |
If Standard or Basic: No context needed — maintenance is fully managed.
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Node Maintenance |
| Advanced | Advanced Maintenance Management |
| BYOC | BYOC Maintenance Management |
| Standard | Standard Maintenance |
| Basic | Basic Maintenance |
Applies when: Tier = Self-Hosted
Self-Hosted operators manage all maintenance directly. The core operation is draining a node to safely move leases and connections before stopping it.
Run all checks before any maintenance operation. Stop if any check fails.
```sql
-- Check 1: All nodes live (STOP if any node is not live)
SELECT n.node_id, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id)
ORDER BY n.node_id;

-- Check 2: No other nodes currently draining (STOP if any draining)
SELECT node_id FROM crdb_internal.gossip_liveness WHERE draining = true;

-- Check 3: Ranges fully replicated (STOP if under-replicated ranges exist)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Check 4: No disruptive jobs running (WAIT or pause before proceeding)
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, status, now() - created AS running_for FROM j
WHERE status IN ('running', 'paused')
  AND job_type IN ('SCHEMA CHANGE', 'BACKUP', 'RESTORE', 'IMPORT', 'NEW SCHEMA CHANGE');

-- Check 5: Not mid-upgrade (STOP if versions differ)
SELECT DISTINCT build_tag FROM crdb_internal.gossip_nodes;

-- Check 6: Storage utilization safe (WARNING if any node > 70%)
SELECT node_id,
       ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
Stop conditions: Do not proceed with maintenance if any node is not live, ranges are under-replicated, another node is draining, or a rolling upgrade is in progress. Wait for running jobs to complete or pause them.
See maintenance-prechecks reference for a consolidated precheck script.
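If you automate the prechecks, a shell gate built on the queries above might look like the following sketch. The CERTS_DIR and HOST defaults and the hard-coded replication factor of 3 are assumptions; adapt them to your deployment.
```bash
#!/usr/bin/env bash
# precheck.sh: sketch of a pre-maintenance gate built on the checks above.
set -u

CERTS_DIR="${CERTS_DIR:-/etc/cockroach/certs}"   # assumption: adjust to your layout
HOST="${HOST:-localhost:26257}"                  # assumption: any live node works

# Run a query and print data rows only (the CSV header is stripped).
sql() { cockroach sql --certs-dir="$CERTS_DIR" --host="$HOST" --format=csv -e "$1" | tail -n +2; }

# STOP if any node is not live
[ -z "$(sql 'SELECT node_id FROM crdb_internal.gossip_nodes WHERE NOT is_live;')" ] \
  || { echo "FAIL: dead node(s) detected" >&2; exit 1; }

# STOP if another node is already draining
[ -z "$(sql 'SELECT node_id FROM crdb_internal.gossip_liveness WHERE draining;')" ] \
  || { echo "FAIL: a node is already draining" >&2; exit 1; }

# STOP if any range is under-replicated (assumes replication factor 3)
[ -z "$(sql 'SELECT 1 FROM crdb_internal.ranges_no_leases WHERE array_length(replicas, 1) < 3 LIMIT 1;')" ] \
  || { echo "FAIL: under-replicated ranges exist" >&2; exit 1; }

echo "Prechecks passed"
```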
If platform = bare metal or VMs:
```bash
cockroach node drain --self --certs-dir=<certs-dir> --host=<node-address>
```
If long-running queries are expected, set --drain-wait longer than your longest expected query, for example:
```bash
cockroach node drain --self --certs-dir=<certs-dir> --host=<node-address> --drain-wait=15m
```
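To watch a drain from another session, you can poll the draining flag and remaining lease count. A sketch; the node ID 2, the 5-second interval, and the connection placeholders are assumptions:
```bash
# Poll drain status and remaining leases for node 2 (hypothetical ID)
# from a session connected to a different node.
while true; do
  cockroach sql --certs-dir=<certs-dir> --host=<other-node-address> -e "
    SELECT l.draining, sum(s.lease_count) AS leases_remaining
    FROM crdb_internal.gossip_liveness l
    JOIN crdb_internal.kv_store_status s USING (node_id)
    WHERE node_id = 2
    GROUP BY l.draining;"
  sleep 5
done
```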
If platform = Kubernetes:
```bash
# The Operator handles the drain automatically during pod eviction
kubectl delete pod <pod-name>
# Or for a rolling restart:
kubectl rollout restart statefulset cockroachdb
```
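After a rolling restart, it helps to block until the StatefulSet has fully cycled before starting the next step; the 15-minute timeout below is an arbitrary assumption:
```bash
# Waits until all pods in the StatefulSet are updated and ready.
kubectl rollout status statefulset cockroachdb --timeout=15m
```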
If process manager = systemd:
```bash
sudo systemctl stop cockroachdb
# ... perform maintenance ...
sudo systemctl start cockroachdb
```
If process manager = manual:
```bash
kill -TERM $(pgrep -f 'cockroach start')
# ... perform maintenance ...
cockroach start --certs-dir=<certs-dir> --store=<path> --join=<addresses> --background
```
Never use kill -9 unless the process is unresponsive to SIGTERM.
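If the manual stop is scripted, it is worth bounding the wait on SIGTERM before anyone reaches for kill -9. A sketch; the 300-second window is an assumption and should exceed your drain settings:
```bash
# Send SIGTERM, then wait up to 300s for the process to exit.
# Assumes exactly one cockroach process on the host.
PID=$(pgrep -f 'cockroach start')
kill -TERM "$PID"
for _ in $(seq 1 300); do
  kill -0 "$PID" 2>/dev/null || { echo "stopped cleanly"; exit 0; }
  sleep 1
done
echo "still running after 300s; investigate before considering kill -9" >&2
exit 1
```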
After the node restarts, verify that it rejoined the cluster and is reacquiring leases:
```sql
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <node_id>;
-- is_live = true
SELECT node_id, lease_count FROM crdb_internal.kv_store_status WHERE node_id = <node_id>;
-- lease_count should increase over minutes as leases rebalance
```
See drain-details reference for drain phases, timeout configuration, and advanced monitoring.
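To confirm rebalancing cluster-wide rather than for a single node, compare per-node range and lease counts; a sketch using the gossip view (connection placeholders as above):
```bash
# Lease and range counts per node; the restarted node's counts should
# climb back toward parity with its peers over several minutes.
cockroach sql --certs-dir=<certs-dir> --host=<node-address> -e "
  SELECT node_id, ranges, leases
  FROM crdb_internal.gossip_nodes
  ORDER BY node_id;"
```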
Periodic storage maintenance for Self-Hosted clusters:
Ballast file verification:
```bash
ls -lh <store-path>/auxiliary/EMERGENCY_BALLAST
# If missing, create:
cockroach debug ballast <store-path>/auxiliary/EMERGENCY_BALLAST --size=1GiB
```
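On multi-store nodes it is easy to miss a store. A sketch that checks every store path and recreates any missing ballast file; the store paths and the 1GiB size are assumptions:
```bash
# Hypothetical store list; substitute your actual --store paths.
for store in /mnt/crdb/store1 /mnt/crdb/store2; do
  ballast="$store/auxiliary/EMERGENCY_BALLAST"
  if [ ! -f "$ballast" ]; then
    echo "recreating ballast in $store"
    cockroach debug ballast "$ballast" --size=1GiB
  fi
done
```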
Disk utilization check:
```sql
SELECT node_id,
       ROUND(capacity / 1073741824.0, 2) AS total_gb,
       ROUND(available / 1073741824.0, 2) AS available_gb,
       ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```
Nodes above 70% utilization should be addressed before maintenance; draining a node temporarily increases load on the remaining nodes.
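The same query can act as a scripted warning ahead of a drain; a sketch, with the 70% threshold and connection placeholders as assumptions:
```bash
# Print any node above 70% storage utilization (CSV header stripped).
HOT=$(cockroach sql --certs-dir=<certs-dir> --host=<node-address> --format=csv -e "
  SELECT node_id FROM crdb_internal.kv_store_status
  WHERE (1 - available::FLOAT / capacity::FLOAT) > 0.70;" | tail -n +2)
[ -z "$HOT" ] || echo "WARNING: node(s) over 70% utilization: $HOT" >&2
```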
Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. CRL applies patches and performs infrastructure maintenance during the configured maintenance window. You do not drain or restart nodes — CRL handles this using rolling restarts.
If no window is configured, CRL applies patches at a time of its choosing.
Cloud Console → Cluster → Settings → Maintenance shows the current schedule.
Cloud API:
```bash
curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.maintenance_window'
```
If a pending patch needs to be delayed (e.g., for testing):
Deferred patches still apply at the end of the deferral period. Deferral only delays — it does not skip.
Single-node clusters experience downtime during maintenance. Consider scaling to 3+ nodes for production workloads.
Cloud Console:
SQL (during maintenance):
```sql
-- Check which nodes are currently live
SELECT node_id, build_tag, is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id)
ORDER BY node_id;
```
Applies when: Tier = BYOC
BYOC maintenance follows the same CRL-managed process as Advanced. Follow all Advanced Maintenance Management steps for maintenance window configuration, patch deferral, and monitoring.
Since BYOC clusters run in your cloud account, you can directly observe maintenance operations:
If AWS:
If GCP:
If Azure:
For infrastructure changes in your cloud account that CRL does not manage (VPC, security groups, IAM, DNS):
Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no nodes, no maintenance windows to configure, and no patches to defer. Cockroach Labs manages all maintenance transparently.
Applies when: Tier = Basic
Basic is a serverless offering. All maintenance is fully managed by Cockroach Labs. The serverless architecture is designed for zero-downtime maintenance.
Read-only monitoring queries are safe on all tiers.
Self-Hosted node maintenance:
- /health?ready=1 returning an error: expected while a node drains; load balancers rely on this check to stop routing new connections to the draining node.
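To probe the endpoint directly, a sketch; this assumes the default HTTP port 8080 and a plain-HTTP deployment (use https and your TLS setup on secure clusters):
```bash
# Prints 200 while the node accepts SQL clients; a draining or not-ready
# node returns an error status (typically 503).
curl -s -o /dev/null -w '%{http_code}\n' "http://<node-address>:8080/health?ready=1"
```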
Advanced/BYOC maintenance windows:
Standard/Basic: No maintenance risk for customers; fully managed by CRL.
See safety-guide reference for detailed risk matrix.
| Issue | Tier | Fix |
|---|---|---|
| Drain very slow | SH | Check SHOW CLUSTER STATEMENTS for stuck queries |
| Drain hangs | SH | Check logs; SIGTERM if unresponsive |
| Node won't rejoin after restart | SH | Verify --join flag; check network connectivity |
| Leases not returning to node | SH | Wait 5-10 min; monitor lease_count |
| Clients not reconnecting | SH | Verify load balancer health check is passing |
| Maintenance window missed | ADV/BYOC | Contact support |
| Unexpected maintenance outside window | ADV/BYOC | Emergency patches may be applied outside windows; check Cloud Console notifications |
| Latency during maintenance | ADV/BYOC | Expected — temporarily reduced capacity; monitor and verify recovery after window |
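For the "Drain very slow" row, a quick way to surface the oldest in-flight statements from another node (connection placeholders are assumptions):
```bash
cockroach sql --certs-dir=<certs-dir> --host=<other-node-address> -e "
  SELECT node_id, now() - start AS running_for, query
  FROM [SHOW CLUSTER STATEMENTS]
  ORDER BY start ASC
  LIMIT 10;"
```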
Skill references: maintenance-prechecks, drain-details, safety-guide
Related skills: managing-cluster-capacity, reviewing-cluster-health, upgrading-cluster-version