
databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.
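The "small datasets (<10K rows), generate locally" path mentioned above can be sketched with stdlib-only code. This is a hedged sketch: the actual skill uses Faker for realistic values, while stdlib `random` stands in here so the example is self-contained, and the field names are illustrative, not taken from the skill.

```python
import csv
import io
import random

# Sketch of the local-generation path for small datasets.
# Faker would normally supply realistic names; stdlib random
# is substituted so this runs without extra dependencies.
random.seed(7)

FIRST = ["Ana", "Ben", "Chloe", "Dev"]
LAST = ["Ibrahim", "Jones", "Kim", "Lopez"]

rows = [
    {
        "customer_id": i,
        "name": f"{random.choice(FIRST)} {random.choice(LAST)}",
        "signup_year": random.randint(2019, 2024),
    }
    for i in range(1, 1001)
]

# Write CSV to an in-memory buffer; the skill would instead
# upload the resulting file to a Unity Catalog volume.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customer_id", "name", "signup_year"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

For larger row counts the skill switches to Spark for distributed generation; the local path above is only the small-dataset shortcut.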

Overall score: 94

Quality: 92%
Does it follow best practices?

Impact: Pending
No eval scenarios have been run.

Security (by Snyk): Passed
No known issues.


Quality

Discovery

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that hits all the marks. It provides specific capabilities (formats, scale, execution modes), uses third-person voice throughout, and includes an explicit 'Use when...' clause with well-chosen natural trigger terms. The description is concise yet comprehensive, clearly distinguishing this skill from other data-related skills.

Dimension scores:

Specificity: 3 / 3
Lists multiple specific, concrete actions and capabilities: generate synthetic data using Spark + Faker, serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), scaling from thousands to millions of rows, and local generation with upload to volumes for small datasets.

Completeness: 3 / 3
Clearly answers both "what does this do" (generate realistic synthetic data using Spark + Faker with multiple formats and scale options) and "when should Claude use it", with an explicit "Use when..." clause listing specific trigger terms.

Trigger Term Quality: 3 / 3
Includes a strong set of natural trigger terms users would actually say: 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', 'sample data'. These cover common variations of how users would phrase their need.

Distinctiveness / Conflict Risk: 3 / 3
Highly distinctive, with a clear niche: synthetic/test data generation specifically using Spark + Faker. The combination of technology stack (Spark, Faker), output formats, and specific trigger terms makes it unlikely to conflict with other skills.

Total: 12 / 12 (Passed)

Implementation

85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, well-structured skill that provides clear actionable guidance for synthetic data generation on Databricks. Its main strengths are excellent progressive disclosure with well-signaled references, fully executable code examples, and a thorough planning workflow with validation checkpoints. The primary weakness is some redundancy across sections (repeated UDF code, repeated catalog/schema reminders) that could be tightened to improve token efficiency.

Suggestions

Remove the duplicate generate_amount UDF from the Quick Start section since it's already in Common Patterns, or consolidate into a single location with a cross-reference.

Consolidate the repeated 'ask for catalog/schema' instruction — it appears in Critical Rules, the planning workflow header, Step 1, and Best Practices. State it once prominently and reference it.
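For context on the first suggestion, a generate_amount helper of the kind the skill duplicates might look like the following. This is a hypothetical stdlib sketch, not the skill's actual code: in the skill, equivalent logic would be wrapped in a Spark pandas UDF so it runs per partition at scale, and the mean/sigma values here are illustrative.

```python
import random

def generate_amount(n, mean=3.0, sigma=0.8, seed=None):
    """Generate n positive, right-skewed amounts via a log-normal draw.

    mean/sigma parameterize the underlying normal distribution, so most
    values cluster low with a long right tail (typical for order totals).
    A seed makes the output reproducible across runs.
    """
    rng = random.Random(seed)
    return [round(rng.lognormvariate(mean, sigma), 2) for _ in range(n)]

# Usage: a reproducible batch of synthetic amounts.
amounts = generate_amount(5, seed=1)
```

Defining the helper once and importing or cross-referencing it, as the suggestion proposes, avoids the two copies drifting apart.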

Dimension scores:

Conciseness: 2 / 3
The skill is fairly well organized but includes some redundancy: the Quick Start code repeats the generate_amount UDF that also appears in Common Patterns, the planning workflow is quite verbose with multiple checklists, and some instructions (such as "ask for catalog/schema") are repeated across Critical Rules, Planning Workflow, and Best Practices. However, it mostly avoids explaining concepts Claude already knows.

Actionability: 3 / 3
Provides fully executable code examples with complete Spark + Faker + Pandas UDF patterns, specific SQL commands for infrastructure creation, concrete distribution parameters (log-normal means/stds), and copy-paste-ready snippets for common patterns such as weighted distributions and date ranges.

Workflow Clarity: 3 / 3
The generation planning workflow is clearly sequenced with explicit steps (gather requirements → present spec → confirm → generate → validate), includes pre-generation and post-generation checklists, and has clear validation checkpoints, including user approval gates before proceeding. The critical rules about .cache()/.persist() and the master-tables-first ordering provide important guardrails.

Progressive Disclosure: 3 / 3
Excellent progressive disclosure, with a clear Quick Reference table linking to six separate reference files and a script, each with a "When to Use" description. The main skill provides a concise overview with actionable quick-start content while deferring detailed setup, troubleshooting, and domain guidance to one-level-deep references.

Total: 11 / 12 (Passed)
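The "concrete distribution parameters" credited in the Actionability row can be illustrated with a stdlib-only sketch. Hedged: the parameter values and column choices below are illustrative, not taken from the skill, and the skill would apply these patterns inside Spark rather than in plain Python.

```python
import random
from datetime import date, timedelta

random.seed(0)

# Log-normal amounts: mean/std parameterize the underlying normal,
# so values cluster low with a long right tail.
amounts = [round(random.lognormvariate(3.0, 0.8), 2) for _ in range(10_000)]

# Weighted categorical column: skew toward the common case.
segments = random.choices(
    ["consumer", "smb", "enterprise"], weights=[70, 25, 5], k=10_000
)

# Uniform dates within a fixed one-year range.
start, span_days = date(2024, 1, 1), 365
dates = [start + timedelta(days=random.randrange(span_days)) for _ in range(10_000)]
```

In the skill these same three patterns (log-normal numerics, weighted categories, bounded date ranges) appear as Spark-friendly snippets; the sketch above only shows the statistical shape of each.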

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 checks passed

Validation for skill structure: no warnings or errors.

Repository: databricks-solutions/ai-dev-kit (Reviewed)
