databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.
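For context, here is a minimal sketch of the distributed Spark + Faker pattern this description implies. It assumes a Databricks (or Databricks Connect) Spark session; the schema, column choices, and the `demo_catalog.demo_schema.customers` target are hypothetical, and the skill's actual code may differ.

```python
import pandas as pd
from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def generate(batches):
    fake = Faker()  # one Faker instance per task, created on the executor
    for pdf in batches:
        n = len(pdf)
        yield pd.DataFrame({
            "customer_id": [fake.uuid4() for _ in range(n)],
            "name":        [fake.name() for _ in range(n)],
            "email":       [fake.email() for _ in range(n)],
            "city":        [fake.city() for _ in range(n)],
        })

schema = "customer_id string, name string, email string, city string"
df = (
    spark.range(1_000_000)          # one seed row per synthetic row
    .repartition(64)                # spread generation across workers
    .mapInPandas(generate, schema)
)

# Hypothetical Unity Catalog target; any of Parquet/JSON/CSV/Delta would work.
df.write.mode("overwrite").saveAsTable("demo_catalog.demo_schema.customers")
```

`mapInPandas` rather than RDD APIs keeps the pattern compatible with serverless compute, which is likely why the description calls out serverless execution.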

Quality: 92% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Passed (No known issues)


Quality

Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that hits all the key criteria. It provides specific capabilities (formats, scale, execution modes), uses third-person voice throughout, and includes an explicit 'Use when...' clause with well-chosen natural trigger terms. The description is concise yet comprehensive, clearly distinguishing this skill from general data processing or other data-related skills.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific, concrete actions and capabilities: generate synthetic data using Spark + Faker, serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), scaling from thousands to millions of rows, and local generation with upload to volumes for small datasets. | 3 / 3 |
| Completeness | Clearly answers both 'what does this do' (generate realistic synthetic data with Spark + Faker, multiple formats, scalable) AND 'when should Claude use it', with an explicit 'Use when...' clause listing specific trigger terms. | 3 / 3 |
| Trigger Term Quality | Includes a strong set of natural trigger terms users would actually say: 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', 'sample data'. These cover common variations of how users would phrase their need. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with a clear niche: synthetic/test data generation specifically using Spark + Faker. The combination of technology stack (Spark, Faker), output formats, and specific trigger terms makes it unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Implementation: 85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, well-structured skill that provides highly actionable guidance for synthetic data generation on Databricks. Its main strength is the clear workflow with approval gates, executable code patterns, and comprehensive anti-pattern documentation. The primary weakness is moderate verbosity — some concepts are repeated across sections (catalog confirmation, story-driven data principles) and the plan example, while illustrative, is lengthy.

Suggestions

Consolidate the catalog confirmation guidance: it appears in Critical Rules (#6) and again in the workflow (Step 1, a dedicated MUST DO section, and the pre-generation checklist). A single authoritative location with brief cross-references would save roughly 20 lines.

Consider trimming the 'Data Must Tell a Business Story' section by merging the 'Key principles' bullets with the 'Critical Rules' list, as rules 1-4 largely restate the same concepts.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is fairly long (~300 lines) and includes some redundancy: the 'Critical Rules' section repeats guidance found elsewhere (e.g., catalog confirmation appears in the rules, the workflow, and the checklist). The 'Data Must Tell a Business Story' section, while valuable, is somewhat verbose, with principles that could be condensed. However, most content is domain-specific and not explaining things Claude already knows. | 2 / 3 |
| Actionability | The skill provides fully executable code examples, including the complete Databricks Connect + Faker pattern and concrete patterns for weighted categories, log-normal distributions, FK integrity, and infrastructure setup (see the sketch after this table). Commands are copy-paste ready, with specific library versions and partition sizing guidance. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced (gather requirements → present plan → confirm features → generate → validate) with explicit validation checkpoints, including pre-generation and post-generation checklists. The plan-before-code pattern with user approval gates is well defined, and post-generation validation using get_volume_folder_details provides a feedback loop. | 3 / 3 |
| Progressive Disclosure | The skill provides a clear overview with well-signaled, one-level-deep references to 'references/1-data-patterns.md' for ML patterns and 'references/2-troubleshooting.md' for error resolution. The reference table clearly indicates when to consult each file. Related skills are also linked appropriately. | 3 / 3 |
| Total | | 11 / 12 |

Passed
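As a concrete illustration of the weighted-category, log-normal, and FK-integrity patterns the Actionability row credits, here is a minimal sketch. The column names, key format, and distribution parameters are assumptions for illustration, not values taken from the skill itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_orders = 100_000

# Weighted categories: skewed picks instead of an unrealistic uniform spread
status = rng.choice(["active", "churned", "trial"], size=n_orders, p=[0.7, 0.2, 0.1])

# Log-normal amounts: many small orders, a long tail of large ones
amount = np.round(rng.lognormal(mean=3.5, sigma=1.0, size=n_orders), 2)

# FK integrity: child keys are sampled only from parent keys that exist
customer_ids = np.array([f"C{i:06d}" for i in range(10_000)])
orders = pd.DataFrame({
    "order_id":    np.arange(n_orders),
    "customer_id": rng.choice(customer_ids, size=n_orders),  # every FK resolves
    "status":      status,
    "amount":      amount,
})
```

For small (<10K row) datasets, a frame like this could be generated locally and uploaded to a volume, as the description suggests; at scale, the same logic would run inside the mapInPandas pattern sketched earlier.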

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation checks: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: databricks-solutions/ai-dev-kit (Reviewed)
