Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.
94
92%
Does it follow best practices?
Impact
Pending
No eval scenarios have been run
Passed
No known issues
Catalog and schema are always user-supplied — never default to any value. If the user hasn't provided them, ask. For any UC write, always create the schema if it doesn't exist before writing data.
Generate realistic, story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs (strongly recommended).
| Topic | Guide | When to Use |
|---|---|---|
| Setup & Execution | references/1-setup-and-execution.md | Setting up environment, choosing compute, installing dependencies |
| Generation Approaches | references/2-generation-approaches.md | Choosing Spark UDFs vs Polars local, writing generation code |
| Data Patterns | references/3-data-patterns.md | Creating realistic distributions, referential integrity, time patterns |
| Domain Guidance | references/4-domain-guidance.md | E-commerce, IoT, financial, support/CRM domain patterns |
| Output Formats | references/5-output-formats.md | Choosing output format, saving to volumes/tables |
| Troubleshooting | references/6-troubleshooting.md | Fixing errors, debugging issues |
| Example Script | scripts/generate_synthetic_data.py | Complete Spark + Pandas UDF example |
Prefer uv for all Python operations. Fall back to pip only if uv is not available.
# Preferred
uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
uv run python generate_data.py
# Fallback if uv not available
pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
python generate_data.py.cache() or .persist() with serverless compute - these operations are NOT supported and will fail with AnalysisException: PERSIST TABLE is not supported on serverless compute. Instead, write master tables to Delta first, then read them back for FK joins.Before generating any code, you MUST present a plan for user approval.
You MUST explicitly ask the user which catalog to use. Do not assume or proceed without confirmation.
Example prompt to user:
"Which Unity Catalog should I use for this data?"
When presenting your plan, always show the selected catalog prominently:
📍 Output Location: catalog_name.schema_name
Volume: /Volumes/catalog_name/schema_name/raw_data/This makes it easy for the user to spot and correct if needed.
Ask the user about:
Show a clear specification with YOUR ASSUMPTIONS surfaced. Always start with the output location:
📍 Output Location: {user_catalog}.ecommerce_demo
Volume: /Volumes/{user_catalog}/ecommerce_demo/raw_data/| Table | Columns | Description | Rows | Key Assumptions |
|---|---|---|---|---|
| customers | customer_id, name, email, tier, region | Synthetic customer profiles | 5,000 | Tier: Free 60%, Pro 30%, Enterprise 10% |
| orders | order_id, customer_id (FK), amount, status | Customer purchase transactions | 15,000 | Enterprise customers generate 5x more orders |
Include column-level descriptions in the plan (these become column comments in Unity Catalog):
| Table | Column | Comment |
|---|---|---|
| customers | customer_id | Unique customer identifier (CUST-XXXXX) |
| customers | tier | Customer tier: Free, Pro, Enterprise |
| orders | customer_id | FK to customers.customer_id |
| orders | amount | Order total in USD |
Assumptions I'm making:
Ask user: "Does this look correct? Any adjustments to the catalog, tables, or distributions?"
Do NOT proceed to code generation until user approves the plan, including the catalog.
After generating data, use get_volume_folder_details to validate the output matches requirements:
from databricks.connect import DatabricksSession, DatabricksEnv
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np
# Setup serverless session with dependencies installed on the cluster
# IMPORTANT: Any library used inside Pandas UDFs (faker, holidays, etc.) must be listed here
env = DatabricksEnv().withDependencies("faker")
spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate()
# Define Pandas UDFs
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
from faker import Faker
fake = Faker()
return pd.Series([fake.name() for _ in range(len(ids))])
@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
amounts = []
for tier in tiers:
if tier == "Enterprise":
amounts.append(float(np.random.lognormal(7.5, 0.8)))
elif tier == "Pro":
amounts.append(float(np.random.lognormal(5.5, 0.7)))
else:
amounts.append(float(np.random.lognormal(4.0, 0.6)))
return pd.Series(amounts)
# Generate customers
customers_df = (
spark.range(0, 10000, numPartitions=16)
.select(
F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
fake_name(F.col("id")).alias("name"),
F.when(F.rand() < 0.6, "Free")
.when(F.rand() < 0.9, "Pro")
.otherwise("Enterprise").alias("tier"),
)
.withColumn("arr", generate_amount(F.col("tier")))
)
# Save to Unity Catalog
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")F.when(F.rand() < 0.6, "Free")
.when(F.rand() < 0.9, "Pro")
.otherwise("Enterprise").alias("tier")@F.pandas_udf(DoubleType())
def generate_amount(tiers: pd.Series) -> pd.Series:
return pd.Series([
float(np.random.lognormal({"Enterprise": 7.5, "Pro": 5.5, "Free": 4.0}[t], 0.7))
for t in tiers
])from datetime import datetime, timedelta
END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)
F.date_add(F.lit(START_DATE.date()), (F.rand() * 180).cast("int")).alias("order_date")# Always in script - assume catalog exists
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")| Mode | Best For | Setup |
|---|---|---|
| DB Connect 16.4+ Serverless | Local dev, Python 3.12+ | DatabricksEnv().withDependencies(...) |
| Serverless Job | Production, scheduled | Job with environments parameter |
| Classic Cluster | Fallback only | Use Databricks CLI to install libraries. databricks libraries install --json '{"cluster_id": "<cluster_id>", "libraries": [{"pypi": {"package": "faker"}}, {"pypi": {"package": "holidays"}}]}' |
See references/1-setup-and-execution.md for detailed setup instructions.
| Format | Use Case | Code |
|---|---|---|
| Parquet (default) | SDP pipeline input | df.write.parquet(path) |
| JSON | Log-style ingestion | df.write.json(path) |
| CSV | Legacy systems | df.write.option("header", "true").csv(path) |
| Delta Table | Direct analytics | df.write.saveAsTable("catalog.schema.table") |
See references/5-output-formats.md for detailed options.
CREATE SCHEMA/VOLUME IF NOT EXISTS)| Issue | Solution |
|---|---|
ModuleNotFoundError: faker | See references/1-setup-and-execution.md |
| Faker UDF is slow | Use pandas_udf for batch processing |
| Out of memory | Increase numPartitions in spark.range() |
| Referential integrity errors | Write master table to Delta first, read back for FK joins |
PERSIST TABLE is not supported on serverless | NEVER use .cache() or .persist() with serverless - write to Delta table first, then read back |
F.window vs Window confusion | Use from pyspark.sql.window import Window for row_number(), rank(), etc. F.window is for streaming only. |
| Broadcast variables not supported | NEVER use spark.sparkContext.broadcast() with serverless |
See references/6-troubleshooting.md for full troubleshooting guide.
b4071a0
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.