Performs a comprehensive health check of a CockroachDB cluster. Gathers deployment context first, then provides tier-appropriate diagnostics. Self-Hosted uses SQL against node-level system tables and CLI. Advanced/BYOC use Cloud Console and SQL with node visibility. Standard monitors provisioned compute and workload via Cloud Console. Basic monitors Request Unit consumption and connectivity. Use for daily checks, pre-maintenance validation, post-incident verification, or production readiness assessment.
94
92%
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Passed
No known issues
Performs a comprehensive health check of a CockroachDB cluster. Before running diagnostics, this skill gathers deployment context to provide the right queries and tools for the operator's tier.
For live query issues: Use triaging-live-sql-activity. For background jobs: Use monitoring-background-jobs. For range analysis: Use analyzing-range-distribution.
| Question | Options | Why It Matters |
|---|---|---|
| Deployment tier? | Self-Hosted, Advanced, BYOC, Standard, Basic | Determines available diagnostics and operator responsibilities |
| Reason for health check? | Daily check, Pre-maintenance, Post-incident, Pre-upgrade | Prioritizes which dimensions to check first |
If Self-Hosted:
| Question | Options | Why It Matters |
|---|---|---|
| Access available? | SQL + CLI, SQL only | Determines which tools can be used |
| Cloud provider? | AWS, GCP, Azure, On-Premises | Affects infrastructure-level checks |
| Kubernetes deployment? | Yes (Operator, Helm, manual), No | Changes CLI commands and monitoring |
| Node count and regions? | e.g., 9 nodes, 3 regions | Sets expectations for query results |
If Advanced or BYOC:
| Question | Options | Why It Matters |
|---|---|---|
| Cloud provider? (BYOC only) | AWS, GCP, Azure | For infrastructure-level monitoring in your cloud account |
If Standard:
| Question | Options | Why It Matters |
|---|---|---|
| Current provisioned vCPUs? | Number | Context for compute utilization assessment |
If Basic: No additional context needed.
| Tier | Go To |
|---|---|
| Self-Hosted | Self-Hosted Health Check |
| Advanced | Advanced Health Check |
| BYOC | BYOC Health Check |
| Standard | Standard Health Check |
| Basic | Basic Health Check |
Applies when: Tier = Self-Hosted
SELECT
n.node_id, n.address, n.build_tag AS version, n.locality,
n.is_live, l.epoch,
CASE WHEN n.is_live THEN 'HEALTHY'
WHEN n.is_live IS NULL THEN 'UNKNOWN'
ELSE 'DOWN' END AS health_status
FROM crdb_internal.gossip_nodes n
LEFT JOIN crdb_internal.gossip_liveness l ON n.node_id = l.node_id
ORDER BY n.node_id;is_live = false (from gossip_nodes) requires immediate investigationepoch suggests repeated restarts (node flapping)If CLI is available:
cockroach node status --certs-dir=<certs-dir> --host=<node-address>SELECT build_tag AS version, COUNT(*) AS node_count,
array_agg(node_id ORDER BY node_id) AS node_ids
FROM crdb_internal.gossip_nodes GROUP BY build_tag;SELECT node_id, store_id,
ROUND(capacity / 1073741824.0, 2) AS total_gb,
ROUND(available / 1073741824.0, 2) AS available_gb,
ROUND((1 - (available::FLOAT / capacity::FLOAT)) * 100, 2) AS utilization_pct,
CASE WHEN (available::FLOAT / capacity::FLOAT) < 0.10 THEN 'CRITICAL'
WHEN (available::FLOAT / capacity::FLOAT) < 0.30 THEN 'WARNING'
ELSE 'OK' END AS capacity_status,
range_count, lease_count
FROM crdb_internal.kv_store_status ORDER BY utilization_pct DESC;SELECT
CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
WHEN array_length(replicas, 1) = 2 THEN 'under_replicated'
WHEN array_length(replicas, 1) = 1 THEN 'critically_under_replicated'
ELSE 'unknown' END AS replication_status,
COUNT(*) AS range_count
FROM crdb_internal.ranges_no_leases GROUP BY 1 ORDER BY 1;SELECT node_id,
to_timestamp((metrics->>'security.certificate.expiration.ca')::FLOAT)::TIMESTAMPTZ AS ca_expires,
to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ AS node_cert_expires,
CASE WHEN to_timestamp((metrics->>'security.certificate.expiration.node')::FLOAT)::TIMESTAMPTZ
< now() + INTERVAL '90 days' THEN 'EXPIRING_SOON'
ELSE 'OK' END AS cert_status
FROM crdb_internal.kv_node_status ORDER BY node_cert_expires;SELECT variable, value FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'server.time_until_store_dead', 'admission.kv.enabled',
'cluster.preserve_downgrade_option', 'gc.ttlseconds'
) ORDER BY variable;SELECT 'live_nodes' AS metric, COUNT(*)::TEXT AS value
FROM crdb_internal.gossip_nodes WHERE is_live = true
UNION ALL SELECT 'dead_nodes', COUNT(*)::TEXT
FROM crdb_internal.gossip_nodes WHERE is_live = false
UNION ALL SELECT 'distinct_versions', COUNT(DISTINCT build_tag)::TEXT
FROM crdb_internal.gossip_nodes
UNION ALL SELECT 'total_ranges', COUNT(*)::TEXT
FROM crdb_internal.ranges_no_leases
UNION ALL SELECT 'min_store_available_pct',
ROUND(MIN(available::FLOAT / capacity::FLOAT) * 100, 2)::TEXT
FROM crdb_internal.kv_store_status
UNION ALL SELECT 'cluster_version', value
FROM [SHOW CLUSTER SETTING version];If reason = Pre-maintenance, also check for running jobs:
WITH j AS (SHOW JOBS)
SELECT job_type, COUNT(*) FROM j WHERE status = 'running' GROUP BY job_type;Use when verifying a cluster is ready for production workloads or during periodic operational reviews.
-- Node count and replication (minimum 3 nodes for production)
SELECT COUNT(*) AS total_nodes,
COUNT(*) FILTER (WHERE n.is_live) AS live_nodes,
COUNT(DISTINCT n.locality) AS distinct_localities
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id);
-- Critical production settings check
SELECT variable, value,
CASE
WHEN variable = 'kv.rangefeed.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'kv.rangefeed.enabled' AND value = 'false' THEN 'WARN: should be true for CDC'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'sql.stats.automatic_collection.enabled' AND value = 'false' THEN 'WARN: should be true'
WHEN variable = 'admission.kv.enabled' AND value = 'true' THEN 'OK'
WHEN variable = 'admission.kv.enabled' AND value = 'false' THEN 'WARN: recommended for production'
WHEN variable = 'cluster.preserve_downgrade_option' AND value != '' THEN 'INFO: finalization pending'
ELSE 'OK'
END AS assessment
FROM [SHOW ALL CLUSTER SETTINGS]
WHERE variable IN (
'kv.rangefeed.enabled', 'sql.stats.automatic_collection.enabled',
'admission.kv.enabled', 'cluster.preserve_downgrade_option',
'server.time_until_store_dead', 'gc.ttlseconds'
) ORDER BY variable;
-- Enterprise license status (Self-Hosted only)
SELECT value AS organization FROM [SHOW CLUSTER SETTING cluster.organization];See production-readiness reference for the full production readiness checklist.
Applies when: Tier = Advanced
Advanced clusters are dedicated single-tenant clusters managed by Cockroach Labs. You have node-level visibility via both Cloud Console and SQL.
-- Node liveness (nodes are visible on Advanced)
SELECT n.node_id, n.build_tag, n.is_live
FROM crdb_internal.gossip_nodes n
JOIN crdb_internal.gossip_liveness l USING (node_id) ORDER BY n.node_id;
-- Version consistency
SELECT build_tag AS version, COUNT(*) FROM crdb_internal.gossip_nodes GROUP BY 1;
-- Range health
SELECT CASE WHEN array_length(replicas, 1) >= 3 THEN 'fully_replicated'
ELSE 'under_replicated' END AS status, COUNT(*)
FROM crdb_internal.ranges_no_leases GROUP BY 1;
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, COUNT(*) FROM j
WHERE status IN ('running', 'failed') AND created > now() - INTERVAL '24 hours'
GROUP BY job_type, status;curl -s -H "Authorization: Bearer $COCKROACH_API_KEY" \
"https://cockroachlabs.cloud/api/v1/clusters/<cluster-id>" | jq '.state, .cockroach_version'Applies when: Tier = BYOC
BYOC clusters are dedicated and run in your cloud account. You have the same CockroachDB visibility as Advanced, plus direct access to the underlying infrastructure.
Run all Advanced Health Check steps.
If AWS:
aws ec2 describe-instance-status --filters "Name=tag:cockroach-cluster,Values=<cluster-name>"If GCP:
gcloud compute instances list --filter="labels.cockroach-cluster=<cluster-name>"If Azure:
az vm list --resource-group <rg> --query "[?tags.cockroachCluster=='<cluster-name>']"Applies when: Tier = Standard
Standard is a multi-tenant managed service. There are no individual nodes to monitor — Cockroach Labs manages all infrastructure, replication, and capacity. Health checking focuses on your workload performance and provisioned compute.
RUNNING-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';Note: Node-level system tables (crdb_internal.gossip_nodes, kv_store_status, etc.) are not available on Standard. Use Cloud Console for all infrastructure health monitoring.
Applies when: Tier = Basic
Basic is a serverless offering that auto-scales. There are no nodes or provisioned compute to monitor. Cockroach Labs manages all infrastructure. Health checking focuses on connectivity, consumption, and spending.
RUNNING-- Verify connectivity
SELECT 1;
-- Current version
SELECT version();
-- Recent failed jobs
WITH j AS (SHOW JOBS)
SELECT job_type, status, description FROM j
WHERE status = 'failed' AND created > now() - INTERVAL '24 hours';All queries in this skill are read-only. No data is modified.
crdb_internal.ranges_no_leases can be slow on large clusters — consider using LIMIT| Issue | Tier | Fix |
|---|---|---|
crdb_internal.kv_node_status empty | SH | Grant admin or VIEWCLUSTERMETADATA |
crdb_internal table not found | STD/BAS | Expected — use Cloud Console |
| Node missing from gossip_nodes | SH | Check node process; verify --join address |
| Cloud Console shows degraded | ADV/BYOC | Check Cloud status page; contact support |
| High RU consumption | BAS | Profile queries; set spending limits |
| Cloud API returns 401 | ADV/BYOC | Regenerate API key |
| High latency on first connection | BAS | Expected cold start after idle period |
Skill references:
Related skills:
Official CockroachDB Documentation:
84bc1e4
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.