
databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.


Catalog and schema are always user-supplied — never default to any value. If the user hasn't provided them, ask. For any UC write, always create the schema if it doesn't exist before writing data.

Databricks Synthetic Data Generation

Generate realistic, story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs (strongly recommended).

Quick Reference

| Topic | Guide | When to Use |
| --- | --- | --- |
| Setup & Execution | references/1-setup-and-execution.md | Setting up environment, choosing compute, installing dependencies |
| Generation Approaches | references/2-generation-approaches.md | Choosing Spark UDFs vs Polars local, writing generation code |
| Data Patterns | references/3-data-patterns.md | Creating realistic distributions, referential integrity, time patterns |
| Domain Guidance | references/4-domain-guidance.md | E-commerce, IoT, financial, support/CRM domain patterns |
| Output Formats | references/5-output-formats.md | Choosing output format, saving to volumes/tables |
| Troubleshooting | references/6-troubleshooting.md | Fixing errors, debugging issues |
| Example Script | scripts/generate_synthetic_data.py | Complete Spark + Pandas UDF example |

Package Manager

Prefer uv for all Python operations. Fall back to pip only if uv is not available.

# Preferred
uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
uv run python generate_data.py

# Fallback if uv not available
pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
python generate_data.py

Critical Rules

  1. Strongly prefer to use Spark + Faker + Pandas UDFs for data generation (scalable, parallel)
  2. If the user specifies local execution, use Polars instead of Spark, but suggest Spark for datasets over 30,000 rows
  3. Present a plan for user approval before generating any code
  4. Ask for catalog/schema - do not default
  5. Use serverless compute unless user explicitly requests classic cluster
  6. Generate raw data only - no pre-aggregated fields (unless user requests)
  7. Create master tables first - then generate related tables with valid FKs
  8. NEVER use .cache() or .persist() with serverless compute - these operations are NOT supported and will fail with AnalysisException: PERSIST TABLE is not supported on serverless compute. Instead, write master tables to Delta first, then read them back for FK joins.
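Rules 7 and 8 boil down to: materialize the master table first, then sample foreign keys only from ids that actually exist (on serverless, via a Delta write and read-back rather than .cache()). The idea can be sketched locally with numpy; the names, sizes, and the 5x Enterprise weighting below are illustrative, not taken from the skill's scripts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Master table first: customer ids and tiers (60/30/10 split, as in the plan)
n_customers = 1_000
customer_ids = np.array([f"CUST-{i:05d}" for i in range(n_customers)])
tiers = rng.choice(["Free", "Pro", "Enterprise"], size=n_customers, p=[0.6, 0.3, 0.1])

# Related table second: sample FKs only from ids that exist,
# weighting Enterprise customers 5x so they generate more orders
weights = np.where(tiers == "Enterprise", 5.0, 1.0)
weights = weights / weights.sum()
order_fks = rng.choice(customer_ids, size=10_000, p=weights)

# Every order now references a real customer
assert set(order_fks) <= set(customer_ids)
```

In the Spark version, the read-back DataFrame plays the role of customer_ids: write customers to Delta, read them back, and join or sample against that DataFrame when building orders.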

Generation Planning Workflow

Before generating any code, you MUST present a plan for user approval.

⚠️ MUST DO: Confirm Catalog Before Proceeding

You MUST explicitly ask the user which catalog to use. Do not assume or proceed without confirmation.

Example prompt to user:

"Which Unity Catalog should I use for this data?"

When presenting your plan, always show the selected catalog prominently:

📍 Output Location: catalog_name.schema_name
   Volume: /Volumes/catalog_name/schema_name/raw_data/

This makes it easy for the user to spot and correct if needed.

Step 1: Gather Requirements

Ask the user about:

  • Catalog/Schema - Which catalog to use?
  • What domain/scenario? (e-commerce, support tickets, IoT sensors, etc.)
  • How many tables? What relationships between them?
  • Approximate row counts per table?
  • Output format preference? (Delta table is default)

Step 2: Present Table Specification

Show a clear specification with YOUR ASSUMPTIONS surfaced. Always start with the output location:

📍 Output Location: {user_catalog}.ecommerce_demo
   Volume: /Volumes/{user_catalog}/ecommerce_demo/raw_data/
| Table | Columns | Description | Rows | Key Assumptions |
| --- | --- | --- | --- | --- |
| customers | customer_id, name, email, tier, region | Synthetic customer profiles | 5,000 | Tier: Free 60%, Pro 30%, Enterprise 10% |
| orders | order_id, customer_id (FK), amount, status | Customer purchase transactions | 15,000 | Enterprise customers generate 5x more orders |

Include column-level descriptions in the plan (these become column comments in Unity Catalog):

| Table | Column | Comment |
| --- | --- | --- |
| customers | customer_id | Unique customer identifier (CUST-XXXXX) |
| customers | tier | Customer tier: Free, Pro, Enterprise |
| orders | customer_id | FK to customers.customer_id |
| orders | amount | Order total in USD |

Assumptions I'm making:

  • Amount distribution: log-normal by tier (Enterprise ~$1800, Pro ~$245, Free ~$55)
  • Status: 65% delivered, 15% shipped, 10% processing, 5% pending, 5% cancelled
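These medians are not arbitrary: for a log-normal(mu, sigma) distribution the median is exp(mu), independent of sigma, so the mu values used in the Quick Start UDFs (7.5, 5.5, 4.0) reproduce the quoted figures. A quick sanity check:

```python
import math

# Median of log-normal(mu, sigma) = exp(mu), so:
print(round(math.exp(7.5)))  # 1808 -> Enterprise ~$1800
print(round(math.exp(5.5)))  # 245  -> Pro ~$245
print(round(math.exp(4.0)))  # 55   -> Free ~$55
```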

Ask user: "Does this look correct? Any adjustments to the catalog, tables, or distributions?"

Step 3: Ask About Data Features

  • Skew (non-uniform distributions) - Enabled by default
  • Joins (referential integrity) - Enabled by default
  • Bad data injection (for data quality testing)
  • Multi-language text
  • Incremental mode (append vs overwrite)
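If the user opts into bad data injection, one minimal local sketch looks like the following; the field names, corruption values, and 5% rate are hypothetical, since the skill does not prescribe them:

```python
import random

random.seed(7)

BAD_VALUES = [None, "", "  ", "N/A", "INVALID"]

def inject_bad_data(rows, rate=0.05):
    """Corrupt one random field in roughly `rate` of the rows."""
    out = []
    for row in rows:
        row = dict(row)
        if random.random() < rate:
            field = random.choice(list(row))
            row[field] = random.choice(BAD_VALUES)
        out.append(row)
    return out

clean = [{"customer_id": f"CUST-{i:05d}", "email": f"user{i}@example.com"}
         for i in range(1_000)]
dirty = inject_bad_data(clean)
n_bad = sum(any(v in BAD_VALUES for v in r.values()) for r in dirty)
```

A common Spark equivalent is a conditional overwrite per column, e.g. F.when(F.rand() < 0.05, F.lit(None)).otherwise(F.col("email")).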

Pre-Generation Checklist

  • Catalog confirmed - User explicitly approved which catalog to use
  • Output location shown prominently in plan (easy to spot/change)
  • Table specification shown and approved
  • Assumptions about distributions confirmed
  • User confirmed compute preference (serverless recommended)
  • Data features selected

Do NOT proceed to code generation until user approves the plan, including the catalog.

Post-Generation Checklist

After generating data, use get_volume_folder_details to validate the output matches requirements:

  • Row counts match the plan
  • Schema matches expected columns and types
  • Data distributions look reasonable (check column stats)

Quick Start: Spark + Faker + Pandas UDFs

from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np

# Setup serverless session with dependencies installed on the cluster
# IMPORTANT: Any library used inside Pandas UDFs (faker, holidays, etc.) must be listed here
env = DatabricksEnv().withDependencies("faker")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()

# Define Pandas UDFs
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    from faker import Faker
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
    amounts = []
    for tier in tiers:
        if tier == "Enterprise":
            amounts.append(float(np.random.lognormal(7.5, 0.8)))
        elif tier == "Pro":
            amounts.append(float(np.random.lognormal(5.5, 0.7)))
        else:
            amounts.append(float(np.random.lognormal(4.0, 0.6)))
    return pd.Series(amounts)

# Generate customers
# Use a single rand() column for the tier split; chaining separate F.rand()
# calls would re-draw the random value per branch and skew the 60/30/10 split
customers_df = (
    spark.range(0, 10000, numPartitions=16)
    .withColumn("r", F.rand())
    .select(
        F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
        fake_name(F.col("id")).alias("name"),
        F.when(F.col("r") < 0.6, "Free")
         .when(F.col("r") < 0.9, "Pro")
         .otherwise("Enterprise").alias("tier"),
    )
    .withColumn("arr", generate_amount(F.col("tier")))
)

# Save to Unity Catalog (CATALOG and SCHEMA are user-supplied placeholders, never defaulted)
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")

Common Patterns

Weighted Tier Distribution

# Draw one rand() column and reuse it; calling F.rand() in each branch
# re-draws the value and skews the split away from 60/30/10
df = df.withColumn("r", F.rand())
df = df.withColumn(
    "tier",
    F.when(F.col("r") < 0.6, "Free")
     .when(F.col("r") < 0.9, "Pro")
     .otherwise("Enterprise"),
)

Log-Normal Amounts (Realistic Pricing)

@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
    return pd.Series([
        float(np.random.lognormal({"Enterprise": 7.5, "Pro": 5.5, "Free": 4.0}[t], 0.7))
        for t in tiers
    ])

Date Range (Last 6 Months)

from datetime import datetime, timedelta
END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)

F.date_add(F.lit(START_DATE.date()), (F.rand() * 180).cast("int")).alias("order_date")

Infrastructure Creation

# Always in script - assume catalog exists
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")

Execution Modes

| Mode | Best For | Setup |
| --- | --- | --- |
| DB Connect 16.4+ Serverless | Local dev, Python 3.12+ | DatabricksEnv().withDependencies(...) |
| Serverless Job | Production, scheduled | Job with environments parameter |
| Classic Cluster | Fallback only | Install libraries via the Databricks CLI: databricks libraries install --json '{"cluster_id": "<cluster_id>", "libraries": [{"pypi": {"package": "faker"}}, {"pypi": {"package": "holidays"}}]}' |

See references/1-setup-and-execution.md for detailed setup instructions.

Output Formats

| Format | Use Case | Code |
| --- | --- | --- |
| Parquet (default) | SDP pipeline input | df.write.parquet(path) |
| JSON | Log-style ingestion | df.write.json(path) |
| CSV | Legacy systems | df.write.option("header", "true").csv(path) |
| Delta Table | Direct analytics | df.write.saveAsTable("catalog.schema.table") |

See references/5-output-formats.md for detailed options.

Best Practices Summary

Execution

  • Use serverless (instant start, no cluster wait)
  • Ask for catalog/schema
  • Present plan before generating

Data Generation

  • Spark + Faker + Pandas UDFs for all cases
  • Master tables first, then related tables with valid FKs
  • Non-linear distributions (log-normal, Pareto, exponential)
  • Time patterns (weekday/weekend, holidays, seasonality)
  • Row coherence (correlated attributes)
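The time-pattern bullet (weekday/weekend, holidays, seasonality) usually reduces to a per-day sampling weight for the date column. A sketch with made-up weights follows; the holidays package from the install line is the natural source for real holiday dates:

```python
from datetime import date, timedelta

def day_weight(d: date) -> float:
    # Hypothetical B2B pattern: quieter weekends, spikes on a couple of peak days
    w = 0.5 if d.weekday() >= 5 else 1.0      # Saturday=5, Sunday=6
    if (d.month, d.day) in {(11, 29), (12, 26)}:
        w *= 3.0
    return w

days = [date(2024, 1, 1) + timedelta(days=i) for i in range(180)]
weights = [day_weight(d) for d in days]
# Feed `days` and `weights` into random.choices (or normalize into
# probabilities for numpy's rng.choice) when sampling order_date
```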

Output

  • Create infrastructure in script (CREATE SCHEMA/VOLUME IF NOT EXISTS)
  • Do NOT create catalogs - assume they exist
  • Delta tables as default
  • Add table and column comments for discoverability in Unity Catalog (see references/5-output-formats.md)

Related Skills

  • databricks-unity-catalog - Managing catalogs, schemas, and volumes
  • databricks-bundles - DABs for production deployment

Common Issues

| Issue | Solution |
| --- | --- |
| ModuleNotFoundError: faker | See references/1-setup-and-execution.md |
| Faker UDF is slow | Use pandas_udf for batch processing |
| Out of memory | Increase numPartitions in spark.range() |
| Referential integrity errors | Write master table to Delta first, read back for FK joins |
| PERSIST TABLE is not supported on serverless | NEVER use .cache() or .persist() with serverless; write to a Delta table first, then read back |
| F.window vs Window confusion | Use from pyspark.sql.window import Window for row_number(), rank(), etc.; F.window is for time-based bucketing in aggregations, not row numbering |
| Broadcast variables not supported | NEVER use spark.sparkContext.broadcast() with serverless |

See references/6-troubleshooting.md for full troubleshooting guide.

Repository
databricks-solutions/ai-dev-kit