databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.
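For context, here is a minimal sketch of the distributed Spark + Faker pattern this description implies. It assumes a Databricks (or Databricks Connect) Spark session; the schema, column choices, and the `demo_catalog.demo_schema.customers` target are hypothetical, and the skill's actual code may differ.

```python
import pandas as pd
from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def generate(batches):
    fake = Faker()  # one Faker instance per task, created on the executor
    for pdf in batches:
        n = len(pdf)
        yield pd.DataFrame({
            "customer_id": [fake.uuid4() for _ in range(n)],
            "name":        [fake.name() for _ in range(n)],
            "email":       [fake.email() for _ in range(n)],
            "city":        [fake.city() for _ in range(n)],
        })

schema = "customer_id string, name string, email string, city string"
df = (
    spark.range(1_000_000)          # one seed row per synthetic row
    .repartition(64)                # spread generation across workers
    .mapInPandas(generate, schema)
)

# Hypothetical Unity Catalog target; any of Parquet/JSON/CSV/Delta would work.
df.write.mode("overwrite").saveAsTable("demo_catalog.demo_schema.customers")
```

`mapInPandas` rather than RDD APIs keeps the pattern compatible with serverless compute, which is likely why the description calls out serverless execution.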

Quality: 92% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Passed (No known issues)


Quality

Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that hits all the key criteria. It provides specific capabilities (formats, scale, execution modes), uses third-person voice throughout, and includes an explicit 'Use when...' clause with well-chosen natural trigger terms. The description is concise yet comprehensive, clearly distinguishing this skill from general data processing or other data-related skills.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific, concrete actions and capabilities: generate synthetic data using Spark + Faker, serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), scaling from thousands to millions of rows, and local generation with upload to volumes for small datasets. | 3 / 3 |
| Completeness | Clearly answers both 'what does this do' (generate realistic synthetic data with Spark + Faker, multiple formats, scalable) AND 'when should Claude use it', with an explicit 'Use when...' clause listing specific trigger terms. | 3 / 3 |
| Trigger Term Quality | Includes a strong set of natural trigger terms users would actually say: 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', 'sample data'. These cover common variations of how users would phrase their need. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with a clear niche: synthetic/test data generation specifically using Spark + Faker. The combination of technology stack (Spark, Faker), output formats, and specific trigger terms makes it unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Implementation: 85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, well-structured skill that provides highly actionable guidance for synthetic data generation on Databricks. Its main strength is the clear workflow with approval gates, executable code patterns, and comprehensive anti-pattern documentation. The primary weakness is moderate verbosity — some concepts are repeated across sections (catalog confirmation, story-driven data principles) and the plan example, while illustrative, is lengthy.

Suggestions

Consolidate the catalog confirmation guidance: it appears in Critical Rules (#6) and again in the workflow (Step 1, a dedicated MUST DO section, and the pre-generation checklist). A single authoritative location with brief cross-references would save roughly 20 lines.

Consider trimming the 'Data Must Tell a Business Story' section by merging the 'Key principles' bullets with the 'Critical Rules' list, as rules 1-4 largely restate the same concepts.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is fairly long (~300 lines) and includes some redundancy: the 'Critical Rules' section repeats guidance found elsewhere (e.g., catalog confirmation appears in the rules, the workflow, and the checklist). The 'Data Must Tell a Business Story' section, while valuable, is somewhat verbose, with principles that could be condensed. However, most content is domain-specific and not explaining things Claude already knows. | 2 / 3 |
| Actionability | The skill provides fully executable code examples, including the complete Databricks Connect + Faker pattern and concrete patterns for weighted categories, log-normal distributions, FK integrity, and infrastructure setup (see the sketch after this table). Commands are copy-paste ready, with specific library versions and partition sizing guidance. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced (gather requirements → present plan → confirm features → generate → validate) with explicit validation checkpoints, including pre-generation and post-generation checklists. The plan-before-code pattern with user approval gates is well defined, and post-generation validation using get_volume_folder_details provides a feedback loop. | 3 / 3 |
| Progressive Disclosure | The skill provides a clear overview with well-signaled, one-level-deep references to 'references/1-data-patterns.md' for ML patterns and 'references/2-troubleshooting.md' for error resolution. The reference table clearly indicates when to consult each file. Related skills are also linked appropriately. | 3 / 3 |
| Total | | 11 / 12 |

Passed
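As a concrete illustration of the weighted-category, log-normal, and FK-integrity patterns the Actionability row credits, here is a minimal sketch. The column names, key format, and distribution parameters are assumptions for illustration, not values taken from the skill itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_orders = 100_000

# Weighted categories: skewed picks instead of an unrealistic uniform spread
status = rng.choice(["active", "churned", "trial"], size=n_orders, p=[0.7, 0.2, 0.1])

# Log-normal amounts: many small orders, a long tail of large ones
amount = np.round(rng.lognormal(mean=3.5, sigma=1.0, size=n_orders), 2)

# FK integrity: child keys are sampled only from parent keys that exist
customer_ids = np.array([f"C{i:06d}" for i in range(10_000)])
orders = pd.DataFrame({
    "order_id":    np.arange(n_orders),
    "customer_id": rng.choice(customer_ids, size=n_orders),  # every FK resolves
    "status":      status,
    "amount":      amount,
})
```

For small (<10K row) datasets, a frame like this could be generated locally and uploaded to a volume, as the description suggests; at scale, the same logic would run inside the mapInPandas pattern sketched earlier.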

Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation checks: 11 / 11 passed

Validation for skill structure

No warnings or errors.

Repository: databricks-solutions/ai-dev-kit (Reviewed)
