
databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.

75

Quality

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues


Quality

Content

85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is highly actionable with executable code, clear sequenced workflows and checklists, and well-organized one-level references. The main weakness is moderate verbosity from repeating the business-story framing across multiple sections.

Suggestions

Consolidate the 'Data Must Tell a Business Story' framing so the problem→impact→analysis→solution pattern is stated once rather than repeated across the key principles, critical rules, and worked example.

Tighten the worked planning example — the full $2.3M outage narrative duplicates guidance already in the story principles.

Link scripts/generate_synthetic_data.py from the body (e.g., a 'Full reference script' pointer) so the existing bundle is discoverable rather than orphaned.

Dimension | Reasoning | Score

Conciseness

The body is mostly efficient and operational (rules, anti-pattern tables, code patterns), but the 'Data Must Tell a Business Story' framing is restated across the key-principles, critical-rules, and worked planning-example sections, and the example plan narrates the story at length. Tightening this repetition would raise the score to level 3. It sits above level 1 because it avoids explaining basics Claude already knows.

2 / 3

Actionability

Provides fully executable, copy-paste-ready code (Databricks Connect serverless setup, pandas UDF, weighted-category and FK patterns), concrete partition counts by scale, exact output-format and install commands, and a specific common-issues table — matching the anchor for executable code with specific examples.

3 / 3

Workflow Clarity

Sequences the generation workflow (Gather Requirements → Present Plan → Ask About Features) with an explicit catalog-confirmation gate, a pre-generation checklist with the checkpoint 'Do NOT proceed to code generation until user approves', and a post-generation validation step via get_volume_folder_details. It is above level 2 because checkpoints and checklists are explicit rather than implied.

3 / 3

Progressive Disclosure

SKILL.md is a clear overview with a well-signaled References table pointing to two one-level-deep leaf files (verified to contain no nested .md references). The only minor gap is that scripts/generate_synthetic_data.py is not linked from the body. Even so, the overview-plus-references structure fits level 3 better than level 2, which describes poorly signaled or inline references.

3 / 3

Total: 11 / 12

Passed
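For readers unfamiliar with the "weighted-category and FK patterns" named in the Actionability row, the following is a minimal, standard-library-only sketch of the two ideas. It is not the skill's own code (which, per the review, uses Spark, Faker, and pandas UDFs); the table names, column names, tier labels, and weights are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical category labels and weights, for illustration only.
TIERS = ["free", "pro", "enterprise"]
TIER_WEIGHTS = [0.70, 0.25, 0.05]


def generate_customers(n, seed=0):
    """Build a parent table whose 'tier' column follows a weighted distribution."""
    rng = random.Random(seed)
    return [
        {
            "customer_id": i,
            # Weighted-category pattern: sample labels with unequal probabilities.
            "tier": rng.choices(TIERS, weights=TIER_WEIGHTS, k=1)[0],
        }
        for i in range(1, n + 1)
    ]


def generate_orders(customers, n, seed=1):
    """Build a child table whose foreign keys always reference an existing parent."""
    rng = random.Random(seed)
    parent_ids = [c["customer_id"] for c in customers]
    return [
        {
            "order_id": i,
            # FK pattern: draw the key from the parent table's actual IDs,
            # never from an independent random range.
            "customer_id": rng.choice(parent_ids),
            "amount": round(rng.uniform(5.0, 500.0), 2),
        }
        for i in range(1, n + 1)
    ]


customers = generate_customers(1000)
orders = generate_orders(customers, 5000)
```

Sampling child keys from the parent's real IDs is what keeps joins between the two tables free of orphan rows; the same idea carries over to a distributed setting, although the mechanics there would differ.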

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong across all dimensions: it states concrete capabilities, includes natural trigger terms, explicitly covers both what and when, and occupies a distinct niche. It is concise yet complete with no over-claims.

Dimension | Reasoning | Score

Specificity

Enumerates multiple concrete capabilities — 'Generate realistic synthetic data using Spark + Faker', 'serverless execution, multiple output formats (Parquet/JSON/CSV/Delta)', 'scales from thousands to millions of rows', and 'generate locally and upload to volumes' — matching the anchor for listing multiple specific concrete actions; it is above level 2 because it goes beyond naming the domain to specify formats, execution modes, and scale.

3 / 3

Completeness

Clearly answers both what (generation capabilities, formats, scale, execution modes) and when via an explicit 'Use when user mentions ...' clause, satisfying the anchor for both what AND when with explicit triggers.

3 / 3

Trigger Term Quality

Provides broad natural-language triggers a user would actually say — 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', 'sample data' — giving good coverage rather than the partial set of level 2.

3 / 3

Distinctiveness Conflict Risk

Targets a clear niche — Databricks synthetic/test data generation with Spark + Faker — with distinctive triggers unlikely to fire for unrelated skills; voice is third person ('Generate', 'Supports') with no first/second-person penalty.

3 / 3

Total: 12 / 12

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 16 / 16 Passed

Validation for skill structure

No warnings or errors.

Repository
databricks-solutions/ai-dev-kit
Reviewed
