CtrlK
BlogDocsLog inGet started
Tessl Logo

triaging-live-sql-activity

Diagnoses live CockroachDB cluster performance issues by identifying long-running queries, busy sessions, and active transactions using SQL-only interfaces. Use when users report cluster slowness, high CPU, or need to find runaway queries and their source applications without DB Console access.

90

Quality

88%

Does it follow best practices?

Impact

Pending

No eval scenarios have been run

SecuritybySnyk

Passed

No known issues

SKILL.md
Quality
Evals
Security

Triaging Live SQL Activity

Diagnoses live cluster performance issues by identifying currently active long-running queries, busy sessions, and active transactions. Uses SQL-only interfaces (SHOW statements and crdb_internal views) to provide immediate triage without requiring DB Console, HTTP endpoints, or Prometheus access.

When to Use This Skill

  • Users report "the cluster is slow right now"
  • High CPU or memory usage on cluster nodes
  • Need to identify runaway queries or stuck transactions
  • Want to find which applications/users are consuming resources
  • Require immediate triage without DB Console access
  • Need to generate SQL to cancel problematic sessions/queries

For historical performance analysis: Use profiling-statement-fingerprints to analyze query patterns over time, identify slow fingerprints, and investigate trends without needing live queries. For transaction-level analysis: Use profiling-transaction-fingerprints to analyze historical transaction retry patterns, commit latency trends, and statement composition. For background job monitoring: Use monitoring-background-jobs to monitor schema changes, backups, and automatic jobs that don't appear in SHOW CLUSTER STATEMENTS.

Prerequisites

Required SQL access:

  • Connection to any CockroachDB node
  • For cluster-wide visibility: VIEWACTIVITY or VIEWACTIVITYREDACTED privilege
    • VIEWACTIVITYREDACTED: Redacts constants in other users' queries (recommended for privacy)
    • VIEWACTIVITY: Shows full query text for all users
    • Without these: Only see your own sessions/queries
  • Basic understanding of SQL query execution
  • (Optional) CANCELQUERY / CANCELSESSION privileges for cancellation operations

Check your privileges:

SHOW GRANTS ON ROLE <username>;

See permissions reference for detailed RBAC setup.

Core Diagnostic Approach

CockroachDB provides SQL-only interfaces for live activity triage:

InterfacePurposeCluster-wide?
SHOW CLUSTER STATEMENTSCurrently executing queriesYes (with VIEWACTIVITY)
SHOW CLUSTER SESSIONSActive client sessionsYes (with VIEWACTIVITY)
crdb_internal.cluster_transactionsIn-progress transactionsYes (with VIEWACTIVITY)

Triage workflow:

  1. Identify long-running queries (> 5-10 minutes)
  2. Correlate to sessions and applications
  3. Check transaction retry counts (high retries = contention)
  4. Drill down by app/user/client
  5. (Optional) Cancel runaway work

Safety: All diagnostic queries are read-only. Cancellation is opt-in with explicit warnings.

Core Diagnostic Queries

Long-Running Queries

Identify queries running longer than a specified threshold:

-- Queries running longer than 5 minutes
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT
  query_id,
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  start,
  now() - start AS running_for,
  substring(query, 1, 200) AS query_preview,
  distributed,
  phase
FROM q
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

Key columns:

  • running_for: How long the query has been executing
  • query_preview: First 200 characters (protects against massive queries)
  • phase: execution phase (preparing, executing, etc.)
  • distributed: whether query spans multiple nodes

Customizable thresholds:

  • Change INTERVAL '5 minutes' to '10 minutes', '30 seconds', etc.
  • Adjust LIMIT based on cluster size and expected load

Active Sessions

Find sessions with long-running active queries:

-- Sessions with active queries running > 5 minutes
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT
  node_id,
  session_id,
  user_name,
  client_address,
  application_name,
  status,
  active_query_start,
  now() - active_query_start AS active_query_for,
  substring(active_queries, 1, 200) AS active_queries_preview,
  substring(last_active_query, 1, 200) AS last_query_preview
FROM s
WHERE active_query_start IS NOT NULL
  AND active_query_start < now() - INTERVAL '5 minutes'
ORDER BY active_query_start
LIMIT 50;

Key columns:

  • active_query_for: Duration of current active query
  • application_name: Source application for drill-down
  • client_address: Client IP/hostname for troubleshooting
  • status: Session state (Idle, Active, etc.)

Active Transactions

Identify long-running transactions (potential blockers):

-- Transactions running > 5 minutes
SELECT
  id AS txn_id,
  node_id,
  session_id,
  application_name,
  start,
  now() - start AS running_for,
  num_stmts,
  num_retries,
  num_auto_retries,
  substring(txn_string, 1, 200) AS txn_string_preview
FROM crdb_internal.cluster_transactions
WHERE start < now() - INTERVAL '5 minutes'
ORDER BY start
LIMIT 50;

Key columns:

  • num_retries / num_auto_retries: High retry counts indicate contention
  • num_stmts: Number of statements in transaction (large = potentially problematic)
  • txn_string: Transaction fingerprint

Production safety note: crdb_internal.cluster_transactions is production-approved and safe for triage.

Drill-Down by Application, User, or Client

Once you identify suspicious activity, drill down by filtering:

Filter by Application

-- All activity from specific application
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, user_name, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
ORDER BY start;

Filter by User

-- All activity from specific user
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, application_name, client_address,
       active_query_start, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE user_name = 'app_user'
  AND active_query_start IS NOT NULL
ORDER BY active_query_start;

Filter by Client Address

-- All sessions from specific client IP
WITH s AS (SHOW CLUSTER SESSIONS)
SELECT session_id, user_name, application_name,
       status, substring(active_queries, 1, 200) AS active_queries_preview
FROM s
WHERE client_address LIKE '10.0.1.%'
ORDER BY active_query_start;

Combined Filters

-- Long queries from specific app and user
WITH q AS (SHOW CLUSTER STATEMENTS)
SELECT query_id, node_id, start, now() - start AS running_for,
       substring(query, 1, 200) AS query_preview
FROM q
WHERE application_name = 'payments-api'
  AND user_name = 'app_user'
  AND start < now() - INTERVAL '10 minutes'
ORDER BY start;

Safety Considerations

Read-only operations: All diagnostic queries (SHOW statements, crdb_internal.cluster_transactions) are read-only and safe to run in production.

Cancellation operations (opt-in):

CAUTION: Canceling queries/sessions terminates user work

Only proceed if:

  • You've confirmed the query/session is runaway or stuck
  • You have authorization to interrupt user workloads
  • You've notified stakeholders if appropriate
  • You have CANCELQUERY or CANCELSESSION privileges

Canceling Runaway Work (Opt-In)

Cancel a Specific Query

-- 1. Identify the query_id from triage queries above
-- 2. Cancel it
CANCEL QUERY '<query_id>';

Example:

CANCEL QUERY '15f9e0e91f072f0f0000000000000001';

Cancel an Entire Session

-- 1. Identify the session_id from triage queries above
-- 2. Cancel all queries in that session
CANCEL SESSION '<session_id>';

Example:

CANCEL SESSION '15f9e0e91f072f0f';

Verification: After canceling, re-run the triage queries to confirm the query/session is gone.

Required privileges:

  • CANCELQUERY system privilege to cancel queries
  • CANCELSESSION system privilege to cancel sessions
  • Admin role has both by default

See permissions reference for granting these privileges.

Common Triage Workflows

Workflow 1: "Cluster is slow" investigation

Scenario: Users report general slowness.

  1. Check for long-running queries:

    -- Run the "Long-Running Queries" diagnostic
    -- Look for queries running > 5-10 minutes
  2. Identify source applications:

    -- Group by application to find culprits
    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT application_name, COUNT(*) AS num_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    ORDER BY num_queries DESC;
  3. Drill down into specific app:

    -- Filter by top application from step 2
    -- Use "Filter by Application" query
  4. Decide on action:

    • Contact app team to investigate query patterns
    • Cancel specific runaway queries if critical
    • Check for schema/index issues if queries are legitimate

Workflow 2: Find high-retry transactions

Scenario: Suspect contention issues.

  1. Check for high retry counts:

    SELECT application_name, AVG(num_retries) AS avg_retries,
           MAX(num_retries) AS max_retries, COUNT(*) AS num_txns
    FROM crdb_internal.cluster_transactions
    WHERE start < now() - INTERVAL '5 minutes'
    GROUP BY application_name
    HAVING AVG(num_retries) > 5
    ORDER BY avg_retries DESC;
  2. Investigate specific transactions:

    -- Find transactions with >10 retries
    SELECT id, application_name, num_retries, num_stmts,
           substring(txn_string, 1, 200) AS txn_preview
    FROM crdb_internal.cluster_transactions
    WHERE num_retries > 10
    ORDER BY num_retries DESC;
  3. Next steps:

    • Review transaction patterns for contention
    • Check for lock conflicts or hotspots
    • Consider schema changes to reduce contention

Workflow 3: Identify resource hogs by user

Scenario: Need to attribute load to specific users.

  1. Count active queries per user:

    WITH q AS (SHOW CLUSTER STATEMENTS)
    SELECT user_name, COUNT(*) AS num_active_queries,
           AVG(now() - start) AS avg_duration
    FROM q
    GROUP BY user_name
    ORDER BY num_active_queries DESC;
  2. Drill down to specific user's activity:

    -- Use "Filter by User" query
  3. Take action:

    • Contact user if unexpected load
    • Review user's query patterns
    • Cancel if clearly runaway

Troubleshooting

IssueCauseFix
SHOW CLUSTER STATEMENTS returns emptyNo active queries, or insufficient privilegesGrant VIEWACTIVITY or VIEWACTIVITYREDACTED; verify cluster has active load
Query text shows <hidden>Using VIEWACTIVITYREDACTED privilegeThis is expected for privacy; use VIEWACTIVITY if full text needed
Can't cancel query: "permission denied"Missing CANCELQUERY privilegeGrant CANCELQUERY system privilege to your user
crdb_internal.cluster_transactions slowHigh transaction volume on clusterAdd filters (application_name, time threshold) to reduce result set
"relation does not exist" errorTypo in table name or old CockroachDB versionVerify you're using production-approved tables; check CockroachDB version compatibility
Triage queries themselves are slowCluster under extreme loadUse more aggressive filters (shorter time window, specific apps); consider canceling obvious runaway work first

Key Considerations

  • Privacy: Use VIEWACTIVITYREDACTED instead of VIEWACTIVITY to protect sensitive query constants in multi-tenant environments
  • Performance impact: Triage queries are read-only and lightweight, but avoid running them in tight loops during extreme load
  • LIMIT clause: Always include LIMIT to prevent overwhelming output on large clusters
  • Time thresholds: Adjust INTERVAL based on your workload (5 minutes is a reasonable default, but fast OLTP may need 30 seconds)
  • Cancellation is disruptive: Only cancel queries/sessions after confirming they're problematic; coordinate with application teams when possible
  • Not for historical analysis: These queries show current state only; for trends over time, use DB Console or Prometheus metrics
  • Production-approved sources: Only use SHOW CLUSTER STATEMENTS, SHOW CLUSTER SESSIONS, and crdb_internal.cluster_transactions for production triage

References

Skill references:

Related skills:

Official CockroachDB Documentation:

  • SHOW STATEMENTS
  • SHOW SESSIONS
  • CANCEL QUERY
  • CANCEL SESSION
  • crdb_internal
  • VIEWACTIVITY privilege
Repository
cockroachlabs/cockroachdb-skills
Last updated
Created

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.