spark-python-data-source

Build custom Python data sources for Apache Spark using the PySpark DataSource API — batch and streaming readers/writers for external systems. Use this skill whenever someone wants to connect Spark to an external system (database, API, message queue, custom protocol), build a Spark connector or plugin in Python, implement a DataSourceReader or DataSourceWriter, pull data from or push data to a system via Spark, or work with the PySpark DataSource API in any way. Even if they just say "read from X in Spark" or "write DataFrame to Y" and there's no native connector, this skill applies.

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Advisory

Suggest reviewing before use

Quality

Content

87%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is a well-structured, token-efficient overview that adds genuinely non-obvious PySpark-specific knowledge and routes detail into 8 real, clearly-signaled reference files. Its one weakness is workflow clarity: the build/test sequence is present but lacks an explicit validation checkpoint and feedback loop, which the rubric flags for batch and database-style operations.

Suggestions

Add an explicit validate→fix→retry checkpoint in the Project Setup / Testing flow: state that `uv run pytest` and `uv run ruff check` must pass before running `uv build`, so the build is gated on validation rather than just listed alongside it.

Provide a short verification checklist for completing a data source (e.g., confirm `name()` returns the format string, run unit tests with mocked external calls, then an integration test against the real system) to make the multi-step process and its checkpoints explicit.

Add an explicit checkpoint for batch/streaming write paths — e.g., validate that streaming offsets are non-overlapping before committing, and retry-with-backoff on transient write failures — since external write operations are exactly where the rubric expects a feedback loop.
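The write-path checkpoint in the last suggestion could be sketched roughly as follows. This is a minimal illustration, not the skill's implementation: it assumes streaming offsets can be represented as integer half-open `(start, end)` ranges, and the names `TransientWriteError`, `offsets_non_overlapping`, `write_with_backoff`, and the `send` callable are all hypothetical.

```python
import random
import time


class TransientWriteError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def offsets_non_overlapping(ranges):
    """Return True if no half-open (start, end) offset range overlaps another."""
    ordered = sorted(ranges)
    # A range whose start is past its end is malformed, so reject it outright.
    if any(start > end for start, end in ordered):
        return False
    # After sorting, each range must end at or before the next one starts.
    return all(
        prev_end <= next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


def write_with_backoff(send, batch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send(batch), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(batch)
        except TransientWriteError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps parallel tasks from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

A caller would check `offsets_non_overlapping(...)` before committing a micro-batch and wrap each external write in `write_with_backoff(...)`. Passing `sleep` as a parameter keeps unit tests fast, since a test can substitute a no-op.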

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The body is lean and assumes Claude's competence: it explicitly declines to repeat general Python best practices ('still apply but aren't repeated here') and concentrates on non-obvious PySpark-specific gotchas (executor-side imports, serialization-driven flat inheritance), with every section earning its tokens; the single 'Spark 4.0+' mention is load-bearing (the API's availability floor) rather than decorative. | 3 / 3 |
| Actionability | Provides fully executable, copy-paste-ready guidance: `uv init`/`uv add`/`uv run pytest`/`uv build` commands, a concrete project layout, a runnable pytest example that patches `requests.post`, and exact class signatures like `class YourBatchWriter(YourWriter, DataSourceWriter)`; the full implementation skeleton is intentionally deferred to a reference, which is appropriate structure rather than a defect. | 3 / 3 |
| Workflow Clarity | Steps are clearly sequenced (architecture → project setup → key decisions → testing) and validation commands exist (`uv run pytest`, `uv run ruff check`), but there is no explicit validate→fix→retry feedback loop or gating checkpoint; because the skill targets batch and database-style external operations, the rubric caps workflow_clarity at 2 when feedback loops are missing. | 2 / 3 |
| Progressive Disclosure | The body is a concise overview that points to 8 one-level-deep reference files, each clearly signaled with a 'Read when...' trigger; all referenced files were verified to exist, and the few cross-links between references are lateral sibling pointers (e.g. production→authentication) rather than deeply nested chains, matching the 'clear overview with well-signaled one-level-deep references' anchor. | 3 / 3 |
| | Total | 11 / 12 Passed |
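The gating checkpoint that the Workflow Clarity row finds missing might look like the sketch below. The three `uv` commands are the ones named in the review; running them in this order, stopping at the first failure, and the `gated_build` helper with its injectable `run` parameter are assumptions made for illustration.

```python
import subprocess

# Commands as listed in the review; gating the build on them is the assumption.
CHECKS = [["uv", "run", "pytest"], ["uv", "run", "ruff", "check"]]
BUILD = ["uv", "build"]


def gated_build(run=None):
    """Run every check, and build only if all of them exit with status 0."""
    if run is None:
        # Default runner: execute the command and report its exit code.
        run = lambda cmd: subprocess.run(cmd).returncode
    for cmd in CHECKS:
        if run(cmd) != 0:
            print(f"check failed: {' '.join(cmd)}; fix and re-run before building")
            return False  # stop here so a failing check can never reach the build
    return run(BUILD) == 0
```

Calling `gated_build()` after each fix gives the validate, fix, retry loop: the build step is only reached once both checks pass.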

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong across the board: it states concrete capabilities, includes natural trigger phrasings users would actually say, gives an explicit 'Use this skill whenever...' clause for both what and when, and carves out a distinct niche with low conflict risk. It uses third-person/imperative voice with no first- or second-person phrasing to penalize.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple concrete actions ('Build custom Python data sources', 'batch and streaming readers/writers', 'build a Spark connector or plugin in Python', 'implement a DataSourceReader or DataSourceWriter', 'pull data from or push data to a system'), matching the 'multiple specific concrete actions' anchor rather than the score-2 'some actions, not comprehensive'. | 3 / 3 |
| Completeness | Explicitly answers both what ('Build custom Python data sources for Apache Spark using the PySpark DataSource API — batch and streaming readers/writers for external systems') and when ('Use this skill whenever someone wants to connect Spark to an external system... this skill applies'), with an explicit 'Use this skill whenever...' trigger clause matching the score-3 anchor. | 3 / 3 |
| Trigger Term Quality | Strong coverage of natural phrasings a user would actually say ('connect Spark to an external system (database, API, message queue)', 'build a Spark connector', 'read from X in Spark', 'write DataFrame to Y'), matching the 'good coverage of natural terms' anchor; technical jargon like 'DataSourceReader' coexists with these without displacing them. | 3 / 3 |
| Distinctiveness Conflict Risk | Occupies a clear niche (the PySpark DataSource API and custom connectors for systems where 'there's no native connector') with distinct triggers unlikely to fire for general Python or general Spark skills, matching the 'clear niche with distinct triggers; unlikely to conflict' anchor. | 3 / 3 |
| | Total | 12 / 12 Passed |

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository: databricks-solutions/ai-dev-kit
Commit: 1c43c21

Reviewed: 3 days ago
