Monitors CockroachDB background job health by identifying failed, paused, and long-running jobs using SHOW JOBS and SHOW AUTOMATIC JOBS. Surfaces schema changes, backups/restores, automatic statistics collection, and SQL stats compaction jobs without DB Console access. Use when investigating schema change delays, failed backups, or automatic job issues.
86
81%
Does it follow best practices?
Impact
94%
1.95xAverage score across 3 eval scenarios
Passed
No known issues
Monitors background job health by identifying failed, paused, and long-running jobs that are distinct from user queries. Uses SQL-only interfaces (SHOW JOBS and SHOW AUTOMATIC JOBS) to surface schema changes, backups/restores, automatic statistics collection, and SQL stats compaction without requiring DB Console access.
For live query monitoring: Use triaging-live-sql-activity to monitor currently executing user queries. Note that background jobs execute statements that may not appear in SHOW CLUSTER STATEMENTS.
For historical query analysis: Use profiling-statement-fingerprints for query pattern trends. Note that background jobs are excluded from statement statistics.
Required SQL access:
VIEWJOB system privilege (read-only monitoring)CONTROLJOB role option (pause/cancel/resume jobs)Check your privileges:
SHOW GRANTS ON ROLE <username>;Look for:
VIEWJOB in the privilege_type column (system privilege)CONTROLJOB in role options (check with SHOW USERS)See permissions reference for detailed RBAC setup.
Key distinction:
Background jobs are excluded from:
SHOW CLUSTER STATEMENTS (live query monitoring)crdb_internal.statement_statistics (historical query analysis)Common job types:
| Category | Job Types | Examples |
|---|---|---|
| User-initiated | SCHEMA CHANGE, BACKUP, RESTORE, IMPORT, CHANGEFEED | ALTER TABLE, CREATE INDEX, BACKUP DATABASE, RESTORE |
| Automatic | SCHEMA CHANGE GC, AUTO CREATE STATS, AUTO SQL STATS COMPACTION | Post-DROP cleanup, table statistics refresh, stats table maintenance |
See job types reference for complete catalog.
| Interface | Scope | Time Window | Use Case |
|---|---|---|---|
SHOW JOBS | User-initiated + automatic | Last 12 hours (default) | Monitor backups, schema changes, user operations |
SHOW AUTOMATIC JOBS | Automatic only | Configurable (recommend 24h) | Monitor AUTO CREATE STATS, AUTO SQL STATS COMPACTION, SCHEMA CHANGE GC |
Time retention:
crdb_internal.jobs tableSHOW JOBS display window: 12 hours (configurable with SHOW JOBS SELECT * FROM [SHOW JOBS] WHERE ...)SHOW AUTOMATIC JOBS display window: Configurable with WHERE created > now() - INTERVAL '...'| Status | Meaning | Action Required |
|---|---|---|
running | Job is actively executing | Monitor progress via fraction_completed |
succeeded | Job completed successfully | None |
failed | Job encountered an error | Investigate error column, may need to retry |
paused | Job manually paused | Resume with RESUME JOB if appropriate |
canceled | Job was canceled (terminal state) | Retry operation if needed |
pending | Job queued but not started | Monitor; may indicate resource constraints |
reverting | Job failed and is rolling back changes | Wait for completion; check error after |
Running status sub-states:
performing backup: Backup job actively transferring datarestoring: Restore job actively applying datawaiting for MVCC GC: SCHEMA CHANGE GC waiting for garbage collection eligibilitySee job states reference for detailed state transitions and "waiting for MVCC GC" explanation.
Identify jobs that failed with error messages:
-- Failed jobs in last 12 hours
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
finished,
now() - created AS total_duration,
error
FROM j
WHERE status = 'failed'
AND created > now() - INTERVAL '12 hours'
ORDER BY created DESC
LIMIT 50;Key columns:
error: Failure reason (check for permission errors, disk space, network issues)description: Human-readable description of what the job was doingtotal_duration: How long the job ran before failingCommon failure patterns:
Find jobs running longer than expected threshold:
-- Jobs running longer than 1 hour
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
status,
running_status,
created,
now() - created AS running_for,
fraction_completed,
coordinator_id
FROM j
WHERE status = 'running'
AND created < now() - INTERVAL '1 hour'
ORDER BY created
LIMIT 50;Key columns:
running_for: Total elapsed time since job startedfraction_completed: Progress estimate (0.0 to 1.0, NULL if unavailable)running_status: Sub-state details (e.g., "waiting for MVCC GC")Customizable thresholds:
Identify jobs that are paused and may need attention:
-- Paused jobs needing resume
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS paused_for,
coordinator_id
FROM j
WHERE status = 'paused'
ORDER BY created
LIMIT 50;Action required:
Resume with RESUME JOB <job_id> after verifying the pause reason.
Common reasons for paused jobs:
Find SCHEMA CHANGE GC jobs waiting for garbage collection:
-- Schema change cleanup jobs waiting for GC
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS waiting_for,
running_status
FROM j
WHERE status = 'running'
AND job_type = 'SCHEMA CHANGE GC'
AND running_status LIKE '%waiting for MVCC GC%'
ORDER BY created
LIMIT 50;Interpretation:
gc.ttlseconds setting (default 25 hours)gc.ttlseconds + some overheadgc.ttlseconds (check setting with SHOW CLUSTER SETTING gc.ttlseconds)Why this happens: After DROP TABLE/INDEX operations, CockroachDB must wait for all reads at older timestamps to complete before physically removing data. This prevents "time-travel" queries from failing.
See job states reference for detailed MVCC GC explanation.
Monitor automatic background jobs like statistics collection:
-- Automatic jobs in last 24 hours
SELECT
job_id,
job_type,
description,
status,
created,
finished,
COALESCE(finished, now()) - created AS duration
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
AND job_type IN ('AUTO CREATE STATS', 'AUTO SQL STATS COMPACTION')
ORDER BY created DESC
LIMIT 50;Key job types:
AUTO CREATE STATS: Automatic table statistics refresh (critical for query optimizer)AUTO SQL STATS COMPACTION: Periodic cleanup of statement/transaction statistics tablesHealth indicators:
Aggregated view for pattern analysis:
-- Job distribution by type and status (last 24h)
WITH j AS (SHOW JOBS)
SELECT
job_type,
status,
COUNT(*) AS job_count,
MIN(created) AS oldest,
MAX(created) AS newest
FROM j
WHERE created > now() - INTERVAL '24 hours'
GROUP BY job_type, status
ORDER BY job_type, status;Use case:
Track progress of backup/restore jobs:
-- Active backup/restore jobs with progress
WITH j AS (SHOW JOBS)
SELECT
job_id,
job_type,
description,
created,
now() - created AS running_for,
ROUND(COALESCE(fraction_completed, 0) * 100, 2) AS percent_complete,
CASE
WHEN fraction_completed > 0 AND fraction_completed < 1 THEN
((now() - created) / fraction_completed) - (now() - created)
ELSE NULL
END AS estimated_time_remaining,
running_status
FROM j
WHERE status = 'running'
AND job_type IN ('BACKUP', 'RESTORE')
ORDER BY created
LIMIT 50;Key columns:
percent_complete: Progress percentage (0-100)estimated_time_remaining: Rough estimate based on current progress raterunning_status: Detailed status (e.g., "performing backup to s3://...")Note: fraction_completed may be NULL for some job types or early in execution.
Scenario: User reports ALTER TABLE or CREATE INDEX appears stuck.
Check for running schema changes:
WITH j AS (SHOW JOBS)
SELECT job_id, description, created, now() - created AS running_for,
fraction_completed, running_status
FROM j
WHERE status = 'running'
AND job_type IN ('SCHEMA CHANGE', 'NEW SCHEMA CHANGE')
ORDER BY created;Identify MVCC GC waits:
-- Use Query 4 to find "waiting for MVCC GC" jobsInterpret results:
running_status = "waiting for MVCC GC": Normal for post-DROP cleanup (wait up to gc.ttlseconds)fraction_completed: Check for contention, large table size, or resource constraintserror column for specific failure reasonNext steps:
SHOW CLUSTER SETTING gc.ttlseconds and waitScenario: Scheduled backup job failed.
Find recent failed backups:
-- Use Query 1 filtered for BACKUP job type
WITH j AS (SHOW JOBS)
SELECT job_id, description, created, finished, error
FROM j
WHERE status = 'failed'
AND job_type = 'BACKUP'
AND created > now() - INTERVAL '24 hours'
ORDER BY created DESC;Analyze error messages:
Verify backup destination:
-- Check SHOW BACKUP for successful backups to same destination
SHOW BACKUP 's3://bucket/path';Remediate and retry:
Scenario: Proactive monitoring of automatic background jobs.
Check AUTO CREATE STATS frequency:
-- Use Query 5 to see recent automatic statistics jobs
SELECT job_type, status, COUNT(*) AS job_count,
MAX(created) AS most_recent
FROM [SHOW AUTOMATIC JOBS]
WHERE created > now() - INTERVAL '24 hours'
AND job_type = 'AUTO CREATE STATS'
GROUP BY job_type, status;Expected pattern:
Warning signs:
Verify settings:
SHOW CLUSTER SETTING sql.stats.automatic_collection.enabled; -- Should be true
SHOW CLUSTER SETTING sql.stats.automatic_collection.min_stale_rows;Scenario: Track progress of expected long-running operations.
Identify long-running jobs:
-- Use Query 2 with custom threshold
WITH j AS (SHOW JOBS)
SELECT job_id, job_type, description,
now() - created AS running_for,
fraction_completed
FROM j
WHERE status = 'running'
AND created < now() - INTERVAL '30 minutes'
ORDER BY created;Monitor progress over time:
-- Re-run every 10-15 minutes, track fraction_completed changes
-- Example: 0.25 → 0.40 → 0.55 indicates steady progressEstimate completion:
-- Use Query 7 for backup/restore jobs with time estimatesDecide on action:
Read-only operations (all diagnostic queries):
All SHOW JOBS and SHOW AUTOMATIC JOBS queries are read-only and safe to run in production. No performance impact on cluster operations.
Job control operations (opt-in):
CAUTION: Pausing or canceling jobs can have data integrity implications
Only proceed with job control if:
CONTROLJOB role optionJob control commands:
-- Pause a running job (can be resumed later)
PAUSE JOB <job_id>;
-- Resume a paused job
RESUME JOB <job_id>;
-- Cancel a job (terminal - cannot be resumed)
CANCEL JOB <job_id>;Risks by job type:
Best practice: Focus on monitoring and diagnosis; only use control operations when explicitly required and authorized.
See permissions reference for CONTROLJOB role option setup.
| Issue | Cause | Fix |
|---|---|---|
SHOW JOBS returns empty | No jobs in last 12h, or insufficient privileges | Grant VIEWJOB privilege; verify cluster has recent job activity |
| "waiting for MVCC GC" for many hours | Normal behavior for SCHEMA CHANGE GC after DROP operations | Wait up to gc.ttlseconds (default 25h); check SHOW CLUSTER SETTING gc.ttlseconds |
| Can't pause/resume job: "permission denied" | Missing CONTROLJOB role option | Use ALTER ROLE <username> WITH CONTROLJOB (not GRANT SYSTEM) |
Job stuck at same fraction_completed | Job may be processing large batch, or actually stuck | Wait 15-30 min and re-check; if no change, investigate with live query triage |
| No AUTO CREATE STATS jobs | Automatic collection disabled | Check sql.stats.automatic_collection.enabled = true |
SHOW AUTOMATIC JOBS shows old jobs only | Need to filter by time window | Add WHERE created > now() - INTERVAL '24 hours' |
| Failed job with "schema change GC" error | Expected for post-DROP cleanup failures | Usually safe to ignore; job will retry automatically |
| Job error: "concurrent schema change" | Multiple schema changes on same table | Wait for first schema change to complete, then retry |
crdb_internal.jobs for up to 14 daysgc.ttlsecondsfraction_completed may be NULL or sparse for some job typesSkill references:
Related skills:
Official CockroachDB Documentation:
84bc1e4
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.