databricks-incident-runbook

Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed".

69

Quality

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues


Quality

Content

72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

A highly actionable incident runbook with executable code, a clear decision tree, and well-structured one-level-deep references. It loses points on conciseness for some explanatory padding and on workflow clarity because destructive/batch steps lack explicit validation checkpoints before proceeding.

Suggestions

Tighten conciseness: drop explanatory sentences like the auth-warning rationale and redundant install-command duplications; trust Claude to know pip/brew basics.

Add explicit validation checkpoints before destructive or batch operations — e.g. confirm `databricks runs get` shows the expected failed task before running `runs repair --rerun-tasks FAILED`, and verify the DESCRIBE HISTORY version exists before `RESTORE TABLE ... TO VERSION AS OF`.

Replace the static Error Handling table's generic guidance with a short feedback loop (detect failure cause → apply matching fix → re-run triage to confirm recovery) for at least the cluster-restart-loop and repair-too-old cases.
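The validation checkpoints suggested above could be expressed as small gate functions that must return true before a destructive command is issued. The following is a minimal sketch, not the skill's actual implementation: the function names are hypothetical, the run payload is assumed to resemble a Jobs API run object with `tasks[].task_key` and `tasks[].state.result_state`, and the history rows are assumed to carry a `version` field as in `DESCRIBE HISTORY` output.

```python
from typing import Any, Dict, List


def failed_task_keys(run: Dict[str, Any]) -> List[str]:
    """Return the keys of tasks whose result state is FAILED.

    Assumes a run payload shaped like {"tasks": [{"task_key": ..., "state": {...}}]}.
    """
    return [
        task["task_key"]
        for task in run.get("tasks", [])
        if task.get("state", {}).get("result_state") == "FAILED"
    ]


def ok_to_repair(run: Dict[str, Any], expected_task: str) -> bool:
    """Checkpoint: allow a repair only if the expected task actually failed."""
    return expected_task in failed_task_keys(run)


def ok_to_restore(history: List[Dict[str, Any]], target_version: int) -> bool:
    """Checkpoint: allow a restore only if the target version is in the history."""
    return any(row.get("version") == target_version for row in history)


if __name__ == "__main__":
    # Hypothetical sample data standing in for parsed command output.
    run = {
        "tasks": [
            {"task_key": "ingest", "state": {"result_state": "SUCCESS"}},
            {"task_key": "transform", "state": {"result_state": "FAILED"}},
        ]
    }
    history = [{"version": 12}, {"version": 11}, {"version": 10}]

    print(ok_to_repair(run, "transform"))  # the expected task failed
    print(ok_to_restore(history, 9))       # version 9 is not in the history
```

A responder would run the repair or restore only when the matching gate returns true, and otherwise return to triage instead of proceeding.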

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | Mostly efficient and action-oriented, but carries some padding Claude doesn't need, e.g. the Prerequisites preamble "Before this runbook runs, the responder must have", the sentence "Running this skill without auth produces misleading output", and version-pin verbosity like "Install: pip install ... or brew install ...". It explains rather than just instructs in a few spots. | 2 / 3 |
| Actionability | Provides fully executable bash/SQL commands throughout (triage script, cluster get/events, runs get/repair, RESTORE TABLE, permissions update) that are copy-paste ready, plus a concrete decision tree routing errors to numbered steps. | 3 / 3 |
| Workflow Clarity | The six-step sequence is clear, but batch/destructive operations (runs repair, permissions update, RESTORE TABLE, xargs cancel-all) lack explicit validation-then-proceed checkpoints. The triage script has no fail-state branching beyond the API check, and an error-handling table substitutes for inline feedback loops. | 2 / 3 |
| Progressive Disclosure | The body is an overview that cleanly offloads long material to three real one-level-deep references (communication-templates.md, evidence-collection.md, postmortem-template.md), each clearly signaled with a relative link and a one-line purpose. Verified the referenced files exist in ./references/. | 3 / 3 |
| Total | | 10 / 12 (Passed) |

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

A strong description: it names concrete actions, gives explicit 'Use when' guidance, and lists natural trigger phrases tightly scoped to Databricks incident response. The third-person voice is consistent throughout.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple concrete actions ("triage, mitigation, and postmortem"), naming the specific incident-response phases the skill performs rather than using vague language. | 3 / 3 |
| Completeness | Explicitly states both what it does (triage/mitigation/postmortem) and when to use it ("Use when responding to Databricks-related outages..."), with an explicit trigger clause. | 3 / 3 |
| Trigger Term Quality | Gives natural phrases an on-call engineer would actually say ("databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed") with good coverage of common variations. | 3 / 3 |
| Distinctiveness Conflict Risk | The Databricks-specific triggers and incident-response niche are narrow and distinct; the skill is unlikely to fire for unrelated requests. | 3 / 3 |
| Total | | 12 / 12 (Passed) |

Validation

87%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 14 / 16 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 14 / 16 (Passed) |

Repository
jeremylongshore/claude-code-plugins-plus-skills
Reviewed
