databricks-incident-runbook

Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed".

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Passed

No known issues

Quality

Content

72%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

A highly actionable incident runbook with executable code, a clear decision tree, and well-structured one-level-deep references. It loses points on conciseness for some explanatory padding and on workflow clarity because destructive/batch steps lack explicit validation checkpoints before proceeding.

Suggestions

Tighten conciseness: drop explanatory sentences such as the auth-warning rationale, and remove the duplicated install commands; trust Claude to know pip/brew basics.

Add explicit validation checkpoints before destructive or batch operations — e.g. confirm `databricks runs get` shows the expected failed task before running `runs repair --rerun-tasks FAILED`, and verify the DESCRIBE HISTORY version exists before `RESTORE TABLE ... TO VERSION AS OF`.

Replace the static Error Handling table's generic guidance with a short feedback loop (detect failure cause → apply matching fix → re-run triage to confirm recovery) for at least the cluster-restart-loop and repair-too-old cases.
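
The validation checkpoints suggested above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the helper names (`failed_task_keys`, `safe_to_repair`, `version_exists`) are hypothetical, the run dictionary is assumed to be shaped like a Jobs API 2.1 `runs/get` response, and the history rows are assumed to be `DESCRIBE HISTORY` output already parsed into dictionaries with a `version` column.

```python
from typing import Any, Dict, List


def failed_task_keys(run: Dict[str, Any]) -> List[str]:
    """Return the task keys whose result_state is FAILED.

    Assumption: `run` looks like a Jobs API 2.1 runs/get response, i.e.
    {"tasks": [{"task_key": ..., "state": {"result_state": ...}}, ...]}.
    """
    return [
        task["task_key"]
        for task in run.get("tasks", [])
        if task.get("state", {}).get("result_state") == "FAILED"
    ]


def safe_to_repair(run: Dict[str, Any], expected_task: str) -> bool:
    """Checkpoint: only proceed with a repair if the expected task really failed."""
    return expected_task in failed_task_keys(run)


def version_exists(history_rows: List[Dict[str, Any]], version: int) -> bool:
    """Checkpoint: confirm the target version appears in the table history.

    Assumption: each row is a dict carrying an integer "version" column.
    """
    return any(row.get("version") == version for row in history_rows)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    run = {
        "tasks": [
            {"task_key": "ingest", "state": {"result_state": "SUCCESS"}},
            {"task_key": "transform", "state": {"result_state": "FAILED"}},
        ]
    }
    history = [{"version": 12}, {"version": 11}]

    print(safe_to_repair(run, "transform"))
    print(version_exists(history, 11))
```

A responder (or agent) would run the destructive command only when the matching checkpoint returns `True`, and stop to re-triage otherwise.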

Dimension	Reasoning	Score
Conciseness	Mostly efficient and action-oriented, but carries some padding Claude doesn't need — e.g. the Prerequisites preamble "Before this runbook runs, the responder must have", the sentence "Running this skill without auth produces misleading output", and version-pin verbosity like "Install: pip install ... or brew install ...". It explains rather than just instructs in a few spots.	2 / 3
Actionability	Provides fully executable bash/SQL commands throughout (triage script, cluster get/events, runs get/repair, RESTORE TABLE, permissions update) that are copy-paste ready, plus a concrete decision tree routing errors to numbered steps.	3 / 3
Workflow Clarity	The six-step sequence is clear, but batch/destructive operations (runs repair, permissions update, RESTORE TABLE, xargs cancel-all) lack explicit validation-then-proceed checkpoints. The triage script has no fail-state branching beyond the API check, and an error-handling table substitutes for inline feedback loops.	2 / 3
Progressive Disclosure	The body is an overview that cleanly offloads long material to three real one-level-deep references (communication-templates.md, evidence-collection.md, postmortem-template.md), each clearly signaled with a relative link and a one-line purpose. Verified the referenced files exist in ./references/.	3 / 3
	Total	10 / 12 Passed
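
One way to picture the inline feedback loop that the Workflow Clarity row asks for (detect failure cause, apply the matching fix, re-run triage to confirm recovery) is the sketch below. It is an assumption-laden illustration: `recover`, the `triage` callable, and the cause labels such as `"cluster_restart_loop"` are hypothetical names, not part of the reviewed skill or of any Databricks API.

```python
from typing import Callable, Dict, Optional


def recover(
    triage: Callable[[], Optional[str]],
    fixes: Dict[str, Callable[[], None]],
    max_attempts: int = 3,
) -> bool:
    """Detect cause -> apply matching fix -> re-run triage to confirm recovery.

    `triage` returns a cause label (hypothetical, e.g. "cluster_restart_loop")
    or None when the system looks healthy. Returns True once triage confirms
    recovery, and False when the responder should escalate instead.
    """
    for _ in range(max_attempts):
        cause = triage()
        if cause is None:
            return True  # triage confirms recovery
        fix = fixes.get(cause)
        if fix is None:
            return False  # unknown cause: stop and escalate
        fix()
    # Out of attempts: one final triage decides the outcome.
    return triage() is None


if __name__ == "__main__":
    # Simulated environment, for illustration only.
    state = {"healthy": False}

    def fake_triage() -> Optional[str]:
        return None if state["healthy"] else "cluster_restart_loop"

    def fake_fix() -> None:
        state["healthy"] = True

    print(recover(fake_triage, {"cluster_restart_loop": fake_fix}))
```

Bounding the loop with `max_attempts` keeps a fix that never takes effect from retrying indefinitely; an unknown cause returns `False` immediately so the responder escalates rather than guessing.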

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

A strong description: it names concrete actions, gives explicit 'Use when' guidance, and lists natural trigger phrases tightly scoped to Databricks incident response. The third-person voice is consistent throughout.

Dimension	Reasoning	Score
Specificity	Lists multiple concrete actions — "triage, mitigation, and postmortem" — naming the specific incident-response phases the skill performs rather than vague language.	3 / 3
Completeness	Explicitly states both what it does (triage/mitigation/postmortem) and when to use it ("Use when responding to Databricks-related outages..."), with an explicit trigger clause.	3 / 3
Trigger Term Quality	Gives natural phrases an on-call engineer would actually say ("databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed") with good coverage of common variations.	3 / 3
Distinctiveness / Conflict Risk	The Databricks-specific triggers and incident-response niche are narrow and distinct; unlikely to fire for unrelated skills.	3 / 3
	Total	12 / 12 Passed

Validation

87%


Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 14 / 16 Passed

Validation for skill structure

Criteria	Description	Result
allowed_tools_field	'allowed-tools' contains unusual tool name(s)	Warning
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	14 / 16 Passed

Repository: jeremylongshore/claude-code-plugins-plus-skills
Commit: 02d6341

Reviewed: about 3 hours ago
