data-warehouse-experimentation

Running experiments out of the data warehouse instead of via dedicated experiment platforms. SQL-based assignment, exposure logging discipline, metric definitions in dbt models, statistical analysis in SQL or Python, variance reduction with CUPED, sequential testing, and the operational tradeoffs vs platforms like Statsig and Optimizely. Triggers on warehouse-native experimentation, run experiments in BigQuery, run experiments in Snowflake, dbt experiments, SQL t-test, CUPED variance reduction, exposure log, sample ratio mismatch, sequential testing, mSPRT, doubly robust estimation, build vs buy experimentation. Also triggers when the team is choosing between platform and warehouse, building warehouse-native experiment infrastructure, auditing one, or running an experiment with a custom metric the platform cannot handle.

Quality

88%

Does it follow best practices?

Run evals on this skill

Adds up to 20 points to the overall score

View guide

Securityby

Passed

No findings from the security scan

Quality

Content

85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is a high-quality, actionable playbook with executable SQL/Python, a clear architecture and checklist, and exemplary progressive disclosure into 8 real reference files. Its one weak spot is conciseness: the body is long and contains redundant recap (the 12-considerations framework) and rhetorical prose that restate earlier sections.

Suggestions

Cut or collapse the '12 considerations' framework: it recapitulates the dedicated sections (assignment, exposure, power, CUPED, sequential testing, multiple comparisons). Keep it only if it supplies a distinct decision ordering not already covered; otherwise drop it to remove the redundancy.

Tighten the closing 'build-vs-buy' section: the 'If you are building warehouse-native because...' prose restates the 'When warehouse-native is the right call' factors. Reduce it to a few actionable bullets.

Merge the opening paragraph with 'What this skill is for' — both cover the platform-vs-warehouse framing and sibling-skill disambiguation. Consolidating removes ~200 tokens with no information loss.

Dimension	Reasoning	Score
Conciseness	The body is largely expert-level and not padded with basics Claude already knows, but at ~28KB it carries redundancy: the '12 considerations' framework recapitulates the dedicated sections (assignment, exposure, power, CUPED, sequential testing, multiple comparisons), the closing rhetorical prose restates the 'when it is the right call' section, and the intro overlaps 'What this skill is for' — so it 'could be tightened' rather than earning 'every token earns its place.'	2 / 3
Actionability	It provides multiple fully executable, copy-paste-ready blocks — BigQuery hash assignment SQL, a complete Welch's t-test SQL CTE, a dbt metric model, scipy/statsmodels Python for the t-test, CUPED, and power analysis — plus a concrete exposure-event schema table; this matches 'fully executable code/commands; copy-paste ready' (the CUPED `pre_period_revenue()` helper is the only mildly illustrative line).	3 / 3
Workflow Clarity	It offers a clearly sequenced 4-component architecture (Assignment → Exposure logging → Metric definitions → Analysis), an explicit 12-step design checklist, and concrete validation checkpoints with feedback loops ("Check the SRM before computing any metric," audit-before-celebrating, pre-register stop criteria, pitfall → diagnose → fix), satisfying 'clear sequence with explicit validation steps; feedback loops; checklists.'	3 / 3
Progressive Disclosure	The body is an overview that signals one-level-deep references inline ("Detail in [`references/...`]" at each section) and summarizes all 8 in a dedicated 'Reference files' section; every referenced file exists in references/, content is appropriately split, and navigation is easy — matching 'clear overview with well-signaled one-level-deep references.'	3 / 3
	Total	11 / 12 Passed

Description

92%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong: it states concrete capabilities, packs in natural trigger phrasings, and explicitly covers both what and when. Its only weakness is a mild conflict risk on the build-vs-buy / platform-vs-warehouse trigger, which overlaps a sibling skill.

Dimension	Reasoning	Score
Specificity	It lists multiple specific concrete capabilities — "SQL-based assignment, exposure logging discipline, metric definitions in dbt models, statistical analysis in SQL or Python, variance reduction with CUPED, sequential testing, mSPRT, doubly robust estimation" — matching the anchor for listing multiple specific concrete actions; it is comprehensive, not just a named domain.	3 / 3
Completeness	It explicitly answers both what ("Running experiments out of the data warehouse...") and when ("Triggers on..." and "Also triggers when the team is choosing between platform and warehouse, building warehouse-native experiment infrastructure, auditing one, or running an experiment with a custom metric the platform cannot handle"), so the explicit-trigger threshold for a 3 is met rather than capped at 2.	3 / 3
Trigger Term Quality	The "Triggers on..." clause covers natural phrasings a data scientist would say — "run experiments in BigQuery," "run experiments in Snowflake," "dbt experiments," "SQL t-test," "CUPED variance reduction," "sample ratio mismatch," "sequential testing" — giving good coverage of natural terms rather than only jargon.	3 / 3
Distinctiveness Conflict Risk	The core niche (warehouse-native execution via SQL/dbt/CUPED) is distinct, but the trigger "choosing between platform and warehouse" / "build vs buy experimentation" overlaps with the sibling experimentation-platform-orchestrator skill that the body itself says owns that decision, so it could still trigger for the wrong skill; this is 'somewhat specific but could overlap' rather than 'unlikely to conflict'.	2 / 3
	Total	11 / 12 Passed

Validation

93%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	15 / 16 Passed

Repository: rampstackco/claude-skills
Path: skills/data-warehouse-experimentation/SKILL.md
Commit: bc6d961

Reviewed: about 6 hours ago

Table of Contents

Discovery Implementation Validation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.