databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.

Quality

92%

Does it follow best practices?

Security

Passed

No findings from the security scan

Quality

Content

85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is highly actionable with executable code, a clear validated workflow, and well-structured one-level references. Its only weakness is conciseness: catalog/schema rules and some rationale are repeated or padded.

Suggestions

Consolidate the catalog/schema rule into one place; it currently appears in the opening blockquote, Critical Rule 6, and the 'MUST DO: Confirm Catalog' section, which is redundant.

Trim the 'Why no flat distributions' justification and other rationale paragraphs to lean directives, since Claude already understands why skewed data is useful for analysis.

Fix the minor typo 'futur dashboard' in Critical Rule 2 and tighten that rule's phrasing to remove unnecessary padding.

Dimension	Reasoning	Score
Conciseness	The body is mostly action-oriented and avoids explaining basic concepts, but catalog/schema emphasis is repeated across the top quote, Critical Rule 6, and the 'MUST DO' section, and the 'Why no flat distributions' rationale is justification-heavy with minor redundancy. Not the lean/efficient level 3, but above the concept-explaining level 1.	2 / 3
Actionability	Provides fully executable, copy-paste-ready code (Databricks Connect serverless setup, pandas_udf Faker pattern, weighted categories, log-normal amounts, FK join pattern, infrastructure CREATE statements) plus a concrete uv install command, matching the executable anchor.	3 / 3
Workflow Clarity	Sequenced Generation Planning Workflow (Steps 1-3) with Pre/Post-Generation checklists, an explicit 'Do NOT proceed until user approves' gate, and validation via get_volume_folder_details, satisfying the clear-sequence-with-validation anchor for a batch/UC-write operation.	3 / 3
Progressive Disclosure	SKILL.md acts as an overview with a clearly signaled references table pointing one level deep to real files (references/1-data-patterns.md, references/2-troubleshooting.md) plus a scripts bundle, matching the well-signaled one-level-deep anchor.	3 / 3
	Total	11 / 12 Passed
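
The skill's own code is not reproduced in this review. As a rough illustration of three patterns named in the Actionability row (weighted categories, log-normal amounts, and foreign keys that always resolve to a parent row), a minimal sketch using only Python's standard library might look like the following. The function name, column names, weights, and log-normal parameters are all hypothetical, and the skill itself is described as using Spark with Faker rather than plain Python.

```python
import random


def generate_orders(n_customers=100, n_orders=1000, seed=42):
    """Hypothetical helper: build a parent ID list and child order rows."""
    rng = random.Random(seed)

    # Parent table: every order must reference one of these IDs.
    customer_ids = [f"CUST-{i:05d}" for i in range(n_customers)]

    # Weighted categories: a skewed mix rather than a flat one-in-three split.
    statuses = ["completed", "pending", "cancelled"]
    status_weights = [0.6, 0.3, 0.1]  # assumed values, for illustration only

    orders = []
    for order_num in range(n_orders):
        orders.append({
            "order_id": f"ORD-{order_num:06d}",
            # Foreign key drawn from the parent list, so it always resolves.
            "customer_id": rng.choice(customer_ids),
            "status": rng.choices(statuses, weights=status_weights, k=1)[0],
            # Log-normal amounts: many small orders and a long right tail.
            "amount": round(rng.lognormvariate(3.5, 0.8), 2),
        })
    return customer_ids, orders
```

Drawing each `customer_id` from the parent list is what keeps the two tables joinable; at Spark scale the same idea would presumably be expressed as a join against the parent DataFrame rather than a Python loop.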

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong: third-person voice, concrete capabilities, explicit 'Use when' triggers, and a clear Databricks-specific niche. It covers what and when without vague fluff.

Dimension	Reasoning	Score
Specificity	Lists multiple concrete actions and details ('Generate realistic synthetic data using Spark + Faker', 'Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta)', 'generate locally and upload to volumes'), matching the multiple-specific-actions anchor.	3 / 3
Completeness	Clearly answers both what (generation, formats, scale) and when via an explicit 'Use when user mentions...' clause, so it is not capped at 2.	3 / 3
Trigger Term Quality	Good coverage of natural terms users would say ('synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', 'sample data'), matching the good-coverage anchor rather than the partial-coverage level below.	3 / 3
Distinctiveness Conflict Risk	Niche is clearly scoped to Databricks synthetic data with Spark+Faker and distinct trigger phrases, making overlap with other skills unlikely.	3 / 3
	Total	12 / 12 Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository: databricks-solutions/ai-dev-kit
Path: databricks-skills/databricks-synthetic-data-gen/SKILL.md
Commit: a7e1d51

Reviewed: about 5 hours ago
