databricks-incident-runbook

tessl install github:jeremylongshore/claude-code-plugins-plus-skills --skill databricks-incident-runbook

github.com/jeremylongshore/claude-code-plugins-plus-skills

Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed".

Databricks Incident Runbook

Overview

Rapid incident response procedures for Databricks-related outages.

Prerequisites

Access to Databricks workspace
CLI configured with appropriate permissions
Access to monitoring dashboards
Communication channels (Slack, PagerDuty)
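Before an incident, it can help to confirm that the command-line tooling is actually in place. The following is a minimal sketch, assuming the scripts in this runbook need `databricks`, `jq`, and `curl` on the PATH; `missing_tools` is a hypothetical helper name, not part of the Databricks CLI.

```shell
#!/bin/bash
# missing_tools: print each named command that cannot be found on PATH.
# Hypothetical helper for a pre-incident check; prints nothing if all are present.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Calling `missing_tools databricks jq curl` prints one line per missing tool, so empty output means the prerequisites on the tooling side are met.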

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Production data pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, delayed non-critical jobs |
| P4 | No user impact | Next business day | Monitoring gaps, documentation |
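If alerting or paging scripts need the response targets, the table can be encoded directly. This is a sketch only; `response_target` is a hypothetical function name, and the values simply mirror the table (minutes for P1 to P3).

```shell
#!/bin/bash
# response_target: print the response-time target for a severity level.
# P1-P3 are printed in minutes; P4 has no minute-based target in the table.
response_target() {
    case "$1" in
        P1) echo "15" ;;
        P2) echo "60" ;;
        P3) echo "240" ;;
        P4) echo "next-business-day" ;;
        *)  echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```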

Quick Triage

#!/bin/bash
# quick-triage.sh - Run this first during any incident

echo "=== Databricks Quick Triage ==="
echo "Time: $(date)"
echo ""

# 1. Check Databricks status
echo "--- Databricks Status ---"
curl -s https://status.databricks.com/api/v2/status.json | jq '.status.description'
echo ""

# 2. Check workspace connectivity
echo "--- Workspace Connectivity ---"
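# Capture the listing first so the check reflects the CLI's exit status, not head's
if WS_LIST=$(databricks workspace list / --output json 2>/dev/null); then
    echo "$WS_LIST" | jq -r '.[] | .path' | head -5
    echo "Workspace: CONNECTED"
else
    echo "Workspace: CONNECTION FAILED"
fi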
echo ""

# 3. Check recent job failures
echo "--- Recent Job Failures (last 20 runs) ---"
databricks runs list --limit 20 --output json | \
    jq -r '.runs[] | select(.state.result_state == "FAILED") | "\(.run_id): \(.run_name) - \(.state.state_message)"'
echo ""

# 4. Check cluster status
echo "--- Running or Errored Clusters ---"
databricks clusters list --output json | \
    jq -r '.clusters[] | select(.state == "RUNNING" or .state == "ERROR") | "\(.cluster_id): \(.cluster_name) [\(.state)]"'
echo ""

# 5. Check for errors in last hour
echo "--- Recent Errors ---"
# Query system tables via SQL warehouse or notebook
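To restrict the failure list to a time window rather than a fixed number of runs, the JSON can be filtered on `start_time`. This is a minimal sketch, assuming the runs list output has a top-level `runs` array and that `start_time` is in epoch milliseconds; `failed_runs_since` and `one_hour_ago_ms` are hypothetical helper names.

```shell
#!/bin/bash
# failed_runs_since CUTOFF_MS: read a runs-list JSON document on stdin and
# print failed runs whose start_time (assumed epoch milliseconds) is >= CUTOFF_MS.
failed_runs_since() {
    jq -r --argjson cutoff "$1" '
        .runs[]?
        | select(.state.result_state == "FAILED" and .start_time >= $cutoff)
        | "\(.run_id): \(.run_name)"'
}

# one_hour_ago_ms: epoch milliseconds for one hour before now.
one_hour_ago_ms() {
    echo $(( ($(date +%s) - 3600) * 1000 ))
}
```

With these defined, the triage step could pipe its output through the filter: `databricks runs list --limit 20 --output json | failed_runs_since "$(one_hour_ago_ms)"`. Runs older than the limit are still not seen, so the limit may need raising on busy workspaces.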

Decision Tree

Job/Pipeline failing?
├─ YES: Is it a single job or multiple?
│   ├─ SINGLE JOB → Check job-specific issues
│   │   ├─ Cluster failed to start → Check cluster events
│   │   ├─ Code error → Check task output/logs
│   │   ├─ Data issue → Check source data
│   │   └─ Permission error → Check grants
│   │
│   └─ MULTIPLE JOBS → Likely infrastructure issue
│       ├─ Check Databricks status page
│       ├─ Check workspace quotas
│       └─ Check network connectivity
│
└─ NO: Is it a performance issue?
    ├─ Slow queries → Check query plan, cluster sizing
    ├─ Slow cluster startup → Check instance availability
    └─ Data freshness → Check upstream pipelines
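The first branch of the tree (single job versus multiple jobs) can be approximated from the runs list. This is a sketch under the assumption that each run object carries a `job_id` and a `state.result_state`; `failure_scope` is a hypothetical helper name.

```shell
#!/bin/bash
# failure_scope: read a runs-list JSON document on stdin and print
# "none", "single-job", or "multiple-jobs" based on how many distinct
# job_ids have at least one FAILED run.
failure_scope() {
    local count
    count=$(jq '[.runs[]? | select(.state.result_state == "FAILED") | .job_id] | unique | length') || return 1
    if [ "$count" -eq 0 ]; then
        echo "none"
    elif [ "$count" -eq 1 ]; then
        echo "single-job"
    else
        echo "multiple-jobs"
    fi
}
```

A result of `multiple-jobs` points toward the infrastructure branch (status page, quotas, network); `single-job` points toward the job-specific checks.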

Immediate Actions by Error Type

Job Failed - Code Error

# 1. Get run details
RUN_ID="your-run-id"
databricks runs get --run-id $RUN_ID

# 2. Get detailed error output
databricks runs get-output --run-id $RUN_ID | jq '.error'

# 3. Check task-level errors
databricks runs get --run-id $RUN_ID | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message}'

# 4. If notebook task, get notebook output
# (View in UI or use jobs API to get cell outputs)

Cluster Failed to Start

# 1. Check cluster events
CLUSTER_ID="your-cluster-id"
databricks clusters events --cluster-id "$CLUSTER_ID" --limit 20

# 2. Check why the cluster terminated
databricks clusters get --cluster-id "$CLUSTER_ID" | jq '.termination_reason'

# 3. Common causes and fixes
# - QUOTA_EXCEEDED: Terminate unused clusters
# - CLOUD_PROVIDER_LAUNCH_ERROR: Check instance availability
# - DRIVER_UNREACHABLE: Network/firewall issue

# 4. Once the cause is addressed, start the cluster again
# (restart only acts on a RUNNING cluster; use start for a terminated one)
databricks clusters start --cluster-id "$CLUSTER_ID"
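The causes listed in the comments above can be turned into a small lookup so the on-call engineer gets a suggested next step from the termination code. This is a sketch only; `suggest_cluster_fix` is a hypothetical helper, and reading the code from `.termination_reason.code` is an assumption about the shape of the `clusters get` output.

```shell
#!/bin/bash
# suggest_cluster_fix CODE: map a cluster termination code to the
# first-response action listed in this runbook.
suggest_cluster_fix() {
    case "$1" in
        QUOTA_EXCEEDED)              echo "Terminate unused clusters" ;;
        CLOUD_PROVIDER_LAUNCH_ERROR) echo "Check instance availability" ;;
        DRIVER_UNREACHABLE)          echo "Check network/firewall rules" ;;
        *)                           echo "Unrecognized code; review cluster events" ;;
    esac
}
```

For example, `suggest_cluster_fix "$(databricks clusters get --cluster-id "$CLUSTER_ID" | jq -r '.termination_reason.code')"` would print the matching action, falling back to the generic message for codes the runbook does not cover.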

Permission/Auth Errors

# 1. Check current user
databricks current-user me

# 2. Check job permissions
JOB_ID="your-job-id"
databricks permissions get jobs --job-id $JOB_ID

# 3. Check table permissions (run in notebook)
# SHOW GRANTS ON TABLE catalog.schema.table

# 4. Fix: Grant necessary permissions
databricks permissions update jobs --job-id $JOB_ID --json '{
  "access_control_list": [{
    "user_name": "user@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'
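Hand-written JSON is easy to get wrong under pressure, especially when a user name contains characters that need escaping. A minimal sketch that builds the same access-control payload with `jq`; `acl_payload` is a hypothetical helper name.

```shell
#!/bin/bash
# acl_payload USER LEVEL: print a compact JSON access-control payload
# granting LEVEL to USER, with both values safely escaped by jq.
acl_payload() {
    jq -cn --arg user "$1" --arg level "$2" \
        '{access_control_list: [{user_name: $user, permission_level: $level}]}'
}
```

The result can be passed wherever the literal JSON was used, for example `--json "$(acl_payload user@company.com CAN_MANAGE_RUN)"`.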

Data Quality Failures

-- Quick data quality check
SELECT
    COUNT(*) as total_rows,
    COUNT(DISTINCT id) as unique_ids,
    SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) as null_amounts,
    MIN(created_at) as oldest_record,
    MAX(created_at) as newest_record
FROM catalog.schema.table
WHERE created_at > current_timestamp() - INTERVAL 1 DAY;

-- Check for recent changes
DESCRIBE HISTORY catalog.schema.table LIMIT 10;

-- Restore to a previous version if needed
-- (5 is an example; pick the version number from the DESCRIBE HISTORY output)
RESTORE TABLE catalog.schema.table TO VERSION AS OF 5;

Communication Templates

Internal (Slack)

:red_circle: **P1 INCIDENT: [Brief Description]**

**Status:** INVESTIGATING
**Impact:** [Describe user/business impact]
**Started:** [Time]
**Current Action:** [What you're doing now]
**Next Update:** [Time]

**Incident Commander:** @[name]
**Thread:** [link]
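Filling in the template by hand costs time during a P1. The sketch below renders the internal message from positional arguments; `render_incident_message` is a hypothetical helper, and posting the text to Slack is left out because the delivery mechanism depends on your setup.

```shell
#!/bin/bash
# render_incident_message DESCRIPTION IMPACT STARTED ACTION NEXT_UPDATE COMMANDER THREAD
# Prints the internal P1 message with the template fields filled in.
render_incident_message() {
    printf ':red_circle: **P1 INCIDENT: %s**\n\n' "$1"
    printf '**Status:** INVESTIGATING\n'
    printf '**Impact:** %s\n' "$2"
    printf '**Started:** %s\n' "$3"
    printf '**Current Action:** %s\n' "$4"
    printf '**Next Update:** %s\n\n' "$5"
    printf '**Incident Commander:** @%s\n' "$6"
    printf '**Thread:** %s\n' "$7"
}
```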

External (Status Page)

**Data Pipeline Delay**

We are experiencing delays in data processing. Some reports may show stale data.

**Impact:** Dashboard data may be up to [X] hours delayed
**Started:** [Time] UTC
**Current Status:** Our team is actively investigating

We will provide updates every 30 minutes.

Last updated: [Timestamp]

Post-Incident

Evidence Collection

#!/bin/bash
# collect-incident-evidence.sh

# Fail with a usage message if the required arguments are missing
INCIDENT_ID=${1:?Usage: $0 INCIDENT_ID RUN_ID [CLUSTER_ID]}
RUN_ID=${2:?Usage: $0 INCIDENT_ID RUN_ID [CLUSTER_ID]}
CLUSTER_ID=${3:-}   # optional

mkdir -p "incident-$INCIDENT_ID"

# Job run details
databricks runs get --run-id "$RUN_ID" > "incident-$INCIDENT_ID/run_details.json"
databricks runs get-output --run-id "$RUN_ID" > "incident-$INCIDENT_ID/run_output.json"

# Cluster info (only when a cluster ID was supplied)
if [ -n "$CLUSTER_ID" ]; then
    databricks clusters get --cluster-id "$CLUSTER_ID" > "incident-$INCIDENT_ID/cluster_info.json"
    databricks clusters events --cluster-id "$CLUSTER_ID" --limit 50 > "incident-$INCIDENT_ID/cluster_events.json"
fi

# Create summary
cat << EOF > "incident-$INCIDENT_ID/summary.md"
# Incident $INCIDENT_ID

**Date:** $(date)
**Run ID:** $RUN_ID
**Cluster ID:** $CLUSTER_ID

## Evidence Collected
- run_details.json
- run_output.json
- cluster_info.json
- cluster_events.json
EOF

tar -czf "incident-$INCIDENT_ID.tar.gz" "incident-$INCIDENT_ID"
echo "Evidence collected: incident-$INCIDENT_ID.tar.gz"
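The summary in the script lists all four files even when no cluster ID was supplied or a command produced no output. One way to keep the summary honest is to list only files that exist and are non-empty. This is a sketch; `list_evidence` is a hypothetical helper name.

```shell
#!/bin/bash
# list_evidence DIR: print a Markdown bullet for each expected evidence
# file that exists in DIR and is non-empty.
list_evidence() {
    local dir=$1 name
    for name in run_details.json run_output.json cluster_info.json cluster_events.json; do
        [ -s "$dir/$name" ] && echo "- $name"
    done
    return 0
}
```

Its output could replace the hard-coded bullet list in the summary, for example by writing `$(list_evidence "incident-$INCIDENT_ID")` under the "Evidence Collected" heading.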

Postmortem Template

## Incident: [Title]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Incident Commander:** [Name]

### Summary
[1-2 sentence description of what happened]

### Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | [First alert/detection] |
| HH:MM | [Investigation started] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Incident resolved] |

### Root Cause
[Technical explanation of what went wrong]

### Impact
- **Data Impact:** [Tables affected, rows impacted]
- **Users Affected:** [Number, types]
- **Duration:** [How long data was unavailable/stale]
- **Financial Impact:** [If applicable]

### Detection
- **How detected:** [Alert, user report, monitoring]
- **Time to detect:** [Minutes from issue start]
- **Detection gap:** [What could have caught this sooner]

### Response
- **Time to respond:** [Minutes from detection]
- **What worked:** [Effective response actions]
- **What didn't:** [Ineffective actions, dead ends]

### Action Items
| Priority | Action | Owner | Due Date |
|----------|--------|-------|----------|
| P1 | [Preventive measure] | [Name] | [Date] |
| P2 | [Monitoring improvement] | [Name] | [Date] |
| P3 | [Documentation update] | [Name] | [Date] |

### Lessons Learned
1. [Key learning 1]
2. [Key learning 2]
3. [Key learning 3]
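The Duration field and the time-to-detect and time-to-respond figures are all differences between two timestamps. A small sketch for computing them from epoch seconds, in the "X hours Y minutes" form the template uses; `format_duration` is a hypothetical helper name.

```shell
#!/bin/bash
# format_duration START_EPOCH END_EPOCH: print the elapsed time between two
# epoch-second timestamps as "X hours Y minutes" (seconds are truncated).
format_duration() {
    local total=$(( $2 - $1 ))
    if [ "$total" -lt 0 ]; then
        echo "end precedes start" >&2
        return 1
    fi
    echo "$(( total / 3600 )) hours $(( (total % 3600) / 60 )) minutes"
}
```

For an incident that started at epoch 0 and ended 9300 seconds later, `format_duration 0 9300` prints `2 hours 35 minutes`.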

Instructions

Step 1: Quick Triage

Run the triage script to identify the issue source.

Step 2: Follow Decision Tree

Determine whether the issue is on the Databricks side (platform or infrastructure) or in your own code or data.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Step 5: Collect Evidence

Run the evidence collection script and record the timeline while details are fresh, so the postmortem has what it needs.

Output

Issue identified and categorized
Remediation applied
Stakeholders notified
Evidence collected for postmortem

Error Handling

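| Issue | Cause | Solution |
|-------|-------|----------|
| Can't access workspace | Token expired | Re-authenticate |
| CLI commands fail | Network issue | Check VPN |
| Logs unavailable | Cluster terminated | Check cluster events |
| Restore fails | Data files for that version removed by VACUUM (retention exceeded) | Restore to a more recent version; review VACUUM retention settings |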

Examples

One-Line Job Health Check

databricks runs list --job-id $JOB_ID --limit 5 --output json | \
    jq -r '.runs[] | "\(.start_time): \(.state.result_state)"'
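The one-liner prints `start_time` as a raw number and shows `null` for runs that have not finished. The sketch below assumes `start_time` is in epoch milliseconds and that in-progress runs expose `state.life_cycle_state`; `job_health` is a hypothetical helper name.

```shell
#!/bin/bash
# job_health: read a runs-list JSON document on stdin and print one line per
# run with an ISO 8601 UTC start time and the result state, falling back to
# the life-cycle state for runs that have no result yet.
job_health() {
    jq -r '.runs[]? | "\(.start_time / 1000 | floor | todate): \(.state.result_state // .state.life_cycle_state)"'
}
```

It slots into the same pipeline: `databricks runs list --job-id $JOB_ID --limit 5 --output json | job_health`.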

Quick Cluster Restart

databricks clusters restart --cluster-id $CLUSTER_ID && \
    echo "Cluster restart initiated"

Next Steps

For data handling, see databricks-data-handling.