Manages CockroachDB cluster capacity across all tiers. Self-Hosted covers node decommissioning for permanent removal and adding nodes for expansion. Advanced/BYOC covers scaling node count and machine size via Cloud Console, API, or Terraform. Standard covers adjusting provisioned compute (vCPUs). Basic auto-scales — guidance covers spending limits and cost management. Use when scaling capacity up or down, permanently removing nodes, or managing costs.
Manages cluster capacity across all CockroachDB deployment tiers. What "capacity" means varies by tier — Self-Hosted manages individual nodes, Advanced/BYOC manage node count and machine size, Standard manages provisioned vCPUs, and Basic auto-scales with cost controls.
For temporary maintenance (not capacity changes): Use performing-cluster-maintenance. For pre-operation health check: Use reviewing-cluster-health.
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Different capacity model per tier |
| Direction? | Scale up (add capacity), Scale down (reduce capacity) | Determines procedure |
If Self-Hosted (scaling down):
| Question | Options | Why It Matters |
|---|---|---|
| How many nodes to remove? | 1, multiple | Multi-node decommission should be done simultaneously |
| Target node IDs? | Node IDs from cockroach node status | Required for CLI commands |
| Is the node alive or dead? | Alive, Dead | Dead nodes use a different procedure |
| Deployment platform? | Bare metal, VMs, Kubernetes | Changes CLI and cleanup steps |
| Current replication factor? | 3, 5, custom | Must have enough nodes remaining |
| Current node count? | Number | Validates remaining capacity |
| Storage utilization? | Low (<60%), Medium (60-80%), High (>80%) | Determines urgency and whether storage maintenance is needed |
If Advanced or BYOC:
| Question | Options | Why It Matters |
|---|---|---|
| Scale method? | Cloud Console, API, Terraform | Determines procedure |
| Current and target configuration? | e.g., 5 nodes → 3 nodes, or 4 vCPU → 8 vCPU | Validates constraints |
| Cloud provider? (BYOC only) | AWS, GCP, Azure | Affects infrastructure verification |
If Standard:
| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for scaling decision |
| Target vCPUs? | Number | Validates workload will fit |
If Basic: Gather cost management goals — Basic auto-scales with no manual capacity control.
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Capacity Management |
| Advanced | Advanced Scaling |
| BYOC | BYOC Scaling |
| Standard | Standard Compute Management |
| Basic | Basic Cost Management |
Applies when: Tier = Self-Hosted
```sql
-- All nodes live
SELECT n.node_id, n.is_live, n.build_tag
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;

-- Ranges fully replicated
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Remaining capacity check
SELECT node_id, store_id,
       ROUND(capacity / 1073741824.0, 2) AS total_gb,
       ROUND(available / 1073741824.0, 2) AS available_gb,
       ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;

-- Replication factor
SHOW ZONE CONFIGURATION FOR RANGE default;
```

Remaining nodes must stay below 60% storage utilization after absorbing the decommissioned node's data, and the node count after decommission must be at least the replication factor.
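To sanity-check the first constraint before removing a node, post-decommission utilization can be estimated from the same store statistics. This is a rough sketch that assumes data redistributes evenly across the surviving nodes; <decom_node_id> is a placeholder for the node you plan to remove.

```sql
-- Estimate cluster-wide utilization after removing <decom_node_id>
-- (used bytes stay constant; the removed node's capacity is excluded)
SELECT
  ROUND(SUM(capacity - available) / 1073741824.0, 2) AS used_gb_total,
  ROUND(SUM(CASE WHEN node_id != <decom_node_id> THEN capacity ELSE 0 END) / 1073741824.0, 2) AS remaining_capacity_gb,
  ROUND(SUM(capacity - available)::FLOAT
        / SUM(CASE WHEN node_id != <decom_node_id> THEN capacity ELSE 0 END)::FLOAT * 100, 2) AS est_utilization_pct
FROM crdb_internal.kv_store_status;
```

If est_utilization_pct lands at or above 60%, add capacity before decommissioning.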
```shell
# Step 1: Drain
cockroach node drain <node_id> --certs-dir=<certs-dir> --host=<any-live-node>

# Step 2: Decommission (single node)
cockroach node decommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>

# Step 2: Decommission (multiple nodes — more efficient, do simultaneously)
cockroach node decommission <id_1> <id_2> <id_3> --certs-dir=<certs-dir> --host=<any-live-node>
```

When a node has been dead longer than server.time_until_store_dead (default 5m), CockroachDB automatically re-replicates its data to surviving nodes. Use this procedure to clean up the dead node and optionally add a replacement.
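Before starting cleanup, it can help to confirm the threshold actually in effect on your cluster; a minimal check (the default is 5m0s):

```sql
-- Current dead-store threshold
SHOW CLUSTER SETTING server.time_until_store_dead;
```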
Step 1: Confirm the node is dead and data is safe
```sql
-- Confirm node is dead
SELECT node_id, is_live FROM crdb_internal.gossip_nodes WHERE node_id = <dead_node_id>;

-- Verify all ranges are fully replicated (no under-replicated after re-replication)
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;

-- Check remaining capacity can handle the load
SELECT node_id, ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```

If under-replicated ranges exist, wait for re-replication to complete before proceeding.
Step 2: Decommission the dead node (metadata cleanup)
```shell
cockroach node decommission <dead_node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Step 3: Add a replacement node (recommended)
If remaining nodes are above 60% utilization, provision a replacement node using the Scaling Up: Add Nodes procedure.
Multiple dead nodes: Decommission all dead nodes simultaneously:
```shell
cockroach node decommission <id_1> <id_2> --certs-dir=<certs-dir> --host=<any-live-node>
```

See the replacing-failed-nodes reference for detailed failure scenarios and recovery procedures.
Monitor decommission progress:

```shell
cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>
```

Wait for gossiped_replicas = 0 and membership = 'decommissioned', then stop the process on the decommissioned node.
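To watch progress continuously from a second terminal, a simple polling loop works; a sketch, assuming the same connection flags as above:

```shell
# Re-run the status check every 30 seconds; stop with Ctrl+C once
# gossiped_replicas reaches 0 and membership shows 'decommissioned'
while true; do
  cockroach node status --decommission --certs-dir=<certs-dir> --host=<any-live-node>
  sleep 30
done
```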
To cancel an in-progress decommission:

```shell
cockroach node recommission <node_id> --certs-dir=<certs-dir> --host=<any-live-node>
```

Recommissioning only works while the node is still in the decommissioning state.
Scaling Up: Add Nodes

Prerequisites:
- The new node runs the same CockroachDB version as the cluster (run cockroach version to confirm)
- The new node is started with --join pointing to existing cluster nodes (see the start-command sketch below)
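A minimal start command for the new node; a sketch, with certificate paths, store path, addresses, and the join list as placeholders to adapt to your deployment:

```shell
# Start the new node and point --join at existing cluster nodes
cockroach start \
  --certs-dir=<certs-dir> \
  --store=<store-path> \
  --listen-addr=<new-node-addr>:26257 \
  --http-addr=<new-node-addr>:8080 \
  --join=<node1-addr>:26257,<node2-addr>:26257,<node3-addr>:26257 \
  --background
```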
Verify the new node joined and is live:

```sql
SELECT node_id, address, is_live FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY node_id;
```

Watch ranges and leases rebalance onto the new node:

```sql
SELECT node_id, range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY node_id;
```

Confirm ranges remain fully replicated:

```sql
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
            ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
```
Check utilization across all nodes:

```sql
SELECT node_id, range_count, lease_count,
       ROUND((1 - available::FLOAT / capacity::FLOAT) * 100, 2) AS utilization_pct
FROM crdb_internal.kv_store_status ORDER BY node_id;
```

Applies when: Tier = Advanced
Advanced clusters are managed by Cockroach Labs. Capacity is adjusted by changing node count or machine size.
```shell
# Scale node count
curl -X PATCH -H "Authorization: Bearer $COCKROACH_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"config": {"num_nodes": <new_count>}}' \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>"
```

Or with Terraform:

```hcl
resource "cockroach_cluster" "example" {
  dedicated {
    num_virtual_cpus = 8   # vCPUs per node
    storage_gib      = 150
    num_nodes        = 5   # total nodes
  }
}
```
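To confirm the change was applied, the cluster can be read back from the same endpoint; a sketch, assuming the same cluster ID and API key as above. Check the returned node count and wait for the cluster to report a ready state:

```shell
# Read back the cluster configuration after the PATCH completes
curl -H "Authorization: Bearer $COCKROACH_API_KEY" \
  "https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>"
```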
```sql
-- Ensure no disruptive jobs are running before scaling down
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j WHERE status = 'running' GROUP BY 1, 2;
```

Applies when: Tier = BYOC
Follow all Advanced Scaling steps. BYOC scaling is managed through the same Cloud Console/API/Terraform interfaces.
Verify the underlying cloud instances after scaling:

If AWS:
```shell
aws ec2 describe-instances --filters "Name=tag:cockroach-cluster,Values=<cluster-name>" \
  --query 'Reservations[].Instances[].{ID:InstanceId,State:State.Name}'
```

If GCP:
```shell
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"
```

If Azure:
```shell
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<name>']"
```

Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes. Capacity is managed by adjusting provisioned compute (vCPUs).
Monitor P99 latency and QPS in Cloud Console for 24-48 hours. If latency increases after scaling down, scale compute back up.
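For a quick SQL-level signal between Console checks, average per-statement service latency since the last statistics reset can be pulled from crdb_internal. A rough sketch: these are per-fingerprint averages, not a true P99, and the column names assume a recent CockroachDB version.

```sql
-- Busiest application workloads and their worst average service latency (ms)
SELECT application_name,
       SUM(count) AS executions,
       ROUND(MAX(service_lat_avg) * 1000, 2) AS worst_avg_service_lat_ms
FROM crdb_internal.node_statement_statistics
WHERE application_name NOT LIKE '$ internal%'
GROUP BY application_name
ORDER BY executions DESC;
```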
Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to manage. Capacity scales automatically based on demand. Cost is managed through spending controls.
If you need explicit control over compute capacity (guaranteed vCPUs), consider upgrading to Standard. If you need dedicated infrastructure, consider Advanced.
| Operation | Tier | Reversible? |
|---|---|---|
| cockroach node decommission | SH | Recommission only before completion |
| Stop decommissioned node | SH | No (must rejoin as new node) |
| Add node to cluster | SH | Yes (decommission to remove) |
| Scale via Console/API | ADV/BYOC | Contact support to reverse |
| Adjust provisioned vCPUs | STD | Yes (scale back) |
| Set spending limit | BAS | Yes (adjust anytime) |
Critical (Self-Hosted):
| Issue | Tier | Fix |
|---|---|---|
| Decommission hangs | SH | Check zone config constraints; investigate stalled ranges (see the query after this table) |
| Recommission fails | SH | Node already fully decommissioned; must rejoin as new |
| New node not rebalancing | SH | Wait for automatic rebalancing; check range_count |
| Scale-down rejected | ADV/BYOC | Below minimum or data won't fit |
| Latency spike after reduction | STD | Scale provisioned vCPUs back up |
| Cloud instances not cleaned up | BYOC | Contact support; verify in cloud console |
| Dead node not re-replicating | SH | Check server.time_until_store_dead; verify surviving nodes have capacity |
| Storage utilization high after scale-down | SH | Add replacement node or increase disk size |
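For the "Decommission hangs" row, a starting point is listing ranges that still report a replica on the decommissioning node. A sketch: <decommissioning_node_id> is a placeholder, and it assumes one store per node so the IDs in the replicas array line up with node IDs.

```sql
-- Ranges that still have a replica on the decommissioning node
SELECT range_id, replicas, replica_localities
FROM crdb_internal.ranges_no_leases
WHERE <decommissioning_node_id> = ANY (replicas)
LIMIT 20;
```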