Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.
Overall score: 94 (92%)

Does it follow best practices?

- Impact: Pending. No eval scenarios have been run.
- Quality: Passed. No known issues.
Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that hits all the key criteria. It provides specific capabilities (formats, scale, execution modes), uses third-person voice throughout, and includes an explicit 'Use when...' clause with well-chosen natural trigger terms. The description is concise yet comprehensive, clearly distinguishing this skill from general data processing or other data-related skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions and capabilities: generating synthetic data using Spark + Faker, serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), scaling from thousands to millions of rows, and local generation with upload to volumes for small datasets. | 3 / 3 |
| Completeness | Clearly answers both 'what does this do' (generate realistic synthetic data with Spark + Faker, multiple formats, scalable) and 'when should Claude use it', with an explicit 'Use when...' clause listing specific trigger terms. | 3 / 3 |
| Trigger Term Quality | Includes a strong set of natural trigger terms users would actually say: 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', 'sample data'. These cover common variations of how users would phrase their need. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive, with a clear niche: synthetic/test data generation specifically using Spark + Faker. The combination of technology stack (Spark, Faker), output formats, and specific trigger terms makes it unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |
Implementation: 85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, well-structured skill that provides highly actionable guidance for synthetic data generation on Databricks. Its main strength is the clear workflow with approval gates, executable code patterns, and comprehensive anti-pattern documentation. The primary weakness is moderate verbosity — some concepts are repeated across sections (catalog confirmation, story-driven data principles) and the plan example, while illustrative, is lengthy.
Suggestions
Consolidate the catalog confirmation guidance: it appears in Critical Rules (#6) and again in the workflow (Step 1, the dedicated MUST DO section, and the pre-generation checklist). Keeping a single authoritative location with brief cross-references would save roughly 20 lines.
Consider trimming the 'Data Must Tell a Business Story' section by merging the 'Key principles' bullets with the 'Critical Rules' list, as rules 1-4 largely restate the same concepts.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is fairly long (~300 lines) and includes some redundancy: the 'Critical Rules' section repeats guidance found elsewhere (e.g., catalog confirmation appears in the rules, the workflow, and the checklist). The 'Data Must Tell a Business Story' section, while valuable, is somewhat verbose, with principles that could be condensed. However, most content is domain-specific and not explaining things Claude already knows. | 2 / 3 |
| Actionability | Provides fully executable code examples, including the complete Databricks Connect + Faker pattern, concrete patterns for weighted categories, log-normal distributions, and FK integrity, plus infrastructure setup. Commands are copy-paste ready, with specific library versions and partition sizing guidance. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced (gather requirements → present plan → confirm features → generate → validate) with explicit validation checkpoints, including pre-generation and post-generation checklists. The plan-before-code pattern with user approval gates is well defined, and the post-generation validation using get_volume_folder_details provides a feedback loop. | 3 / 3 |
| Progressive Disclosure | Provides a clear overview with well-signaled, one-level-deep references to 'references/1-data-patterns.md' for ML patterns and 'references/2-troubleshooting.md' for error resolution. The reference table clearly indicates when to consult each file. Related skills are also linked appropriately. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
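The 'weighted categories' and 'log-normal distributions' patterns credited in the Actionability row can be sketched with the Python standard library alone; the field names and distribution parameters below are illustrative assumptions, not values from the skill:

```python
import random

random.seed(7)  # reproducible synthetic data

# Weighted categorical field: most orders use standard shipping.
SHIPPING = ["standard", "express", "overnight"]
WEIGHTS = [0.70, 0.25, 0.05]

def synthetic_order() -> dict:
    """One synthetic order row with a realistic value distribution."""
    return {
        "shipping": random.choices(SHIPPING, weights=WEIGHTS, k=1)[0],
        # Log-normal amounts: many small orders, a long tail of large ones.
        "order_value": round(random.lognormvariate(3.5, 0.8), 2),
    }

orders = [synthetic_order() for _ in range(10_000)]
```

In a Spark-based version, draws like these would typically run inside a UDF or `mapInPandas` function so the distribution logic scales with the number of partitions.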
Validation: 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 checks passed. Validation of skill structure produced no warnings or errors.