Running experiments out of the data warehouse instead of via dedicated experiment platforms. SQL-based assignment, exposure logging discipline, metric definitions in dbt models, statistical analysis in SQL or Python, variance reduction with CUPED, sequential testing, and the operational tradeoffs vs platforms like Statsig and Optimizely. Triggers on warehouse-native experimentation, run experiments in BigQuery, run experiments in Snowflake, dbt experiments, SQL t-test, CUPED variance reduction, exposure log, sample ratio mismatch, sequential testing, mSPRT, doubly robust estimation, build vs buy experimentation. Also triggers when the team is choosing between platform and warehouse, building warehouse-native experiment infrastructure, auditing one, or running an experiment with a custom metric the platform cannot handle.
63
75%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/data-warehouse-experimentation/SKILL.mdA senior data scientist's playbook for running experiments natively out of BigQuery, Snowflake, or any modern data warehouse, with metric definitions in dbt and statistical analysis in SQL or Python.
Most companies that run experiments at scale use a dedicated platform. Statsig, Optimizely, LaunchDarkly with experimentation, PostHog, Amplitude Experiment. The platforms are good. They handle assignment, instrumentation, and analysis in one product, and the SQL-savvy data team does not have to reinvent the variance reduction wheel.
There is a different operational model that mature data teams increasingly choose: warehouse-native experimentation. Assignment happens in code or via feature flags. Exposure events fire to the warehouse like any other event. Metrics are defined as dbt models. Statistical analysis runs as SQL or in a Python notebook against warehouse data. The "experiment platform" is just your existing data stack.
This skill covers when warehouse-native is the right call, the architecture, and the specific techniques that make it work: assignment patterns, exposure logging discipline, metric definitions in dbt, t-tests and CUPED in SQL, sequential testing, and the pitfalls that take down homegrown setups.
When to use this skill: deciding between platform vs warehouse-native, building a warehouse-native experiment infrastructure, auditing an existing one, or running a specific experiment when the platform of record cannot handle a custom metric or segmentation.
This skill spans the operational execution model for warehouse-native experimentation. It does not replace the methodology and interpretation skills; it composes with them.
experiment-design covers methodology: hypotheses, sample size, randomization unit, primary metric. Tool-agnostic. Read it first to design the experiment correctly regardless of where it runs.experimentation-analytics covers interpretation: confidence intervals, p-values, effect size, decision frameworks. Tool-agnostic. Read it when results land.experimentation-platform-orchestrator covers the platform-vs-warehouse decision in detail. Read it to decide whether to use a platform or this skill.feature-flagging covers assignment infrastructure when not running through a platform. Read it for the flag-management discipline that this skill assumes.data-warehouse-experimentation) covers the operational execution: SQL-based assignment, exposure logging, metric definitions in dbt, statistical analysis in SQL or Python, variance reduction, sequential testing.The distinction is between "what to do" (the methodology and interpretation skills) and "how to do it without a vendor platform" (this skill). Read this skill after you have decided warehouse-native is the right call. If you are still deciding, start with experimentation-platform-orchestrator.
Six factors push the decision toward warehouse-native.
Five factors push toward platform.
Detail and a decision tree in references/warehouse-vs-platform-decision.md. Many mature teams use both; warehouse-native for the hard cases, platform for fast iteration on standard experiments.
Four components, in order of data flow.
The flow. User visits the product. Assignment determines the bucket (control or treatment). If the user is exposed to the variant (sees the treatment-specific behavior), an exposure event fires to the warehouse. The user takes actions, generating metric events to the same warehouse. At analysis time, exposure joins to metrics on the assignment unit (typically user_id); the analysis computes lift and produces a decision.
The exposure-event pattern is critical. Without it you can compute only an "intent-to-treat" analysis (everyone assigned, regardless of whether they saw the variant). With it you compute the "exposed" analysis on the population that actually experienced the variant. The latter is usually what you want, especially when the variant only affects a subset of the assigned users (e.g., users who reached a specific page).
Three approaches.
Deterministic hash assignment. The default for warehouse-native.
MOD(ABS(FARM_FINGERPRINT(CONCAT(user_id, 'exp_button_color_v1'))), 100) < 50The salt ('exp_button_color_v1') ensures different experiments produce uncorrelated assignments for the same user. Reproducible (same input always produces the same bucket), no service dependency, salt isolation across experiments. The assignment can be computed inline in any SQL query.
Feature flag assignment. Rely on a feature flag service (LaunchDarkly, Statsig flags, Unleash, internal) to do bucketing; the warehouse just records the assignment that the flag service chose.
-- Read assignment from the flag service's logs
SELECT user_id, variant_id, assigned_at
FROM flag_service.assignments
WHERE flag_key = 'exp_button_color_v1'This works when the flag service is the source of truth for assignment and the warehouse mirrors the assignment table. Useful when assignment must respect flag-service rules (e.g., percentage rollouts, targeting rules) that are inconvenient to replicate in SQL.
Randomized assignment table. Pre-randomize users into a table at experiment start.
CREATE TABLE exp_button_color_v1_assignments AS
SELECT
user_id,
CASE WHEN RAND() < 0.5 THEN 'control' ELSE 'treatment' END AS variant_id
FROM dim_users
WHERE eligible = true;Less common; useful when the eligibility set is fixed at experiment start and you want assignment to be deterministic and explicit (e.g., for compliance audit). The downside: new users joining mid-experiment are not in the table; either skip them or fall back to hash assignment.
The deterministic hash approach is the default for warehouse-native because it requires no service dependency and produces stable, auditable assignments. Detail in references/assignment-and-exposure-patterns.md.
The single most important discipline in warehouse-native experimentation.
Required exposure event schema:
| Field | Type | Notes |
|---|---|---|
experiment_id | string | Unique identifier per experiment. |
variant_id | string | The variant the user was bucketed into. |
user_id | string | The assignment unit. |
exposed_at | timestamp | ISO 8601 UTC. The moment exposure fired. |
context_* | various | Optional context properties: device, page, account_id. |
Fire exposure exactly when the user has seen the variant-specific behavior. Not at page load. Not at session start. Not at app open.
The "delayed exposure" trap. If the variant only matters at button click and you fire exposure at page load, every page-load user enters the analysis whether or not they ever saw the variant. The control group includes users who never reached the button; the treatment group does too. The analysis dilutes the real effect.
Worked example. The treatment shows a new pricing page; the control shows the old one. Fire exposure when the pricing page loads, not when the user lands on the homepage. Users who never reach the pricing page are not exposed to either variant; they should not be in the analysis.
The "always-fire" trap. Some implementations fire exposure on every variant-specific interaction. The user clicks the button five times; exposure fires five times. The exposure log is now five times larger than it should be, and analysis tools that count distinct user_ids in exposure handle this correctly while tools that count rows do not.
The discipline. Fire exactly one exposure event per user per experiment, at the moment of first variant-specific exposure. Use a deterministic flag in the client (or a server-side cache) to enforce single-fire. Detail in references/assignment-and-exposure-patterns.md.
Defining metrics as dbt models gives you four things.
fct_orders model feeds the board dashboard, the experiment analysis, and the executive report. Aligned definitions, no drift.Pattern.
-- models/experiments/exp_metrics_revenue.sql
SELECT
user_id,
SUM(CASE WHEN refunded THEN 0 ELSE amount_cents END) AS net_revenue_cents,
MIN(occurred_at) AS first_purchase_at
FROM {{ ref('fct_orders') }}
WHERE occurred_at >= '{{ var("experiment_start") }}'
GROUP BY user_idThe experiment analysis joins this to the exposure log on user_id and computes group means.
The variance discipline. The same metric definition is used in board dashboards AND in experiment analysis. No "experiment-specific revenue calculation" that is slightly different. Otherwise you get the "the experiment said the revenue lifted but the board did not move" problem, which is almost always a metric-definition mismatch.
The namespace pattern. Use exp_metrics_* for experiment-shaped models that group by user_id and produce one row per user. Use fct_* for the underlying fact tables that feed both metric models and dashboards. Detail in references/metric-definitions-in-dbt.md.
The basic two-sample Welch's t-test in SQL.
WITH metric_by_variant AS (
SELECT
e.variant_id,
COUNT(*) AS n,
AVG(m.net_revenue_cents) AS mean,
VAR_SAMP(m.net_revenue_cents) AS variance
FROM exposures e
LEFT JOIN exp_metrics_revenue m USING (user_id)
WHERE e.experiment_id = 'exp_button_color_v1'
GROUP BY e.variant_id
)
SELECT
control.mean AS control_mean,
treatment.mean AS treatment_mean,
treatment.mean - control.mean AS absolute_lift,
(treatment.mean - control.mean) / NULLIF(control.mean, 0) AS relative_lift,
-- Welch's t-statistic
(treatment.mean - control.mean) /
SQRT(treatment.variance / treatment.n + control.variance / control.n)
AS t_statistic
FROM
(SELECT * FROM metric_by_variant WHERE variant_id = 'control') control,
(SELECT * FROM metric_by_variant WHERE variant_id = 'treatment') treatmentConvert the t-statistic to a p-value or confidence interval using a SQL function (BigQuery: a UDF; Snowflake: native or a stored procedure) or compute in Python on the result of the SQL query.
The SQL pattern is fine for simple t-tests on continuous metrics. For proportions tests, swap variance for p * (1 - p). For non-parametric tests (Mann-Whitney), the SQL gets ugly fast; switch to Python.
Anything more complex than a simple t-test (CUPED, bootstrap, doubly robust estimation, sequential testing) is easier in Python. Use SQL for the SUM-and-AVG aggregations; ship the result to Python for the statistical math. Detail in references/statistical-analysis-templates.md.
The Python pattern, typically in a Jupyter or Hex notebook.
import pandas as pd
import numpy as np
from scipy import stats
# Pull aggregated data from the warehouse
df = warehouse.query("""
SELECT user_id, variant_id, net_revenue_cents
FROM exp_results
WHERE experiment_id = 'exp_button_color_v1'
""")
control = df[df.variant_id == 'control'].net_revenue_cents
treatment = df[df.variant_id == 'treatment'].net_revenue_cents
# Welch's t-test
t, p = stats.ttest_ind(treatment, control, equal_var=False)
# Confidence interval on the mean difference
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var() / len(treatment) + control.var() / len(control))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
print(f"Lift: {diff:.2f} cents (95% CI: [{ci_low:.2f}, {ci_high:.2f}])")
print(f"p-value: {p:.4f}")Python gives you access to the full statistical ecosystem (scipy, statsmodels, numpy) for techniques SQL cannot easily express.
The notebook pattern. One notebook per experiment, parameterized by experiment_id. Version-controlled in git or as Hex projects. Each notebook produces a written-up decision document at the end, archived in a queryable repository (Notion, GitHub markdown, or a dedicated experiment-results table in the warehouse).
Detail and bootstrap templates in references/statistical-analysis-templates.md.
The most powerful variance reduction technique for warehouse-native experimentation: CUPED (Controlled-experiment Using Pre-Experiment Data). Originally from Microsoft.
The intuition. If you can predict a user's metric behavior from pre-experiment data, you can subtract out that predicted variance, leaving a smaller residual to test on.
# Pre-experiment metric for each user (e.g., last 28 days revenue)
pre = pre_period_revenue(user_id)
# Theta is the regression coefficient of the metric on the pre-period
theta = np.cov(metric, pre)[0, 1] / np.var(pre)
# Adjusted metric
adjusted_metric = metric - theta * (pre - pre.mean())Run the t-test on the adjusted metric instead of the raw metric. The mean is preserved (CUPED does not change the point estimate) but the variance is smaller, so the confidence interval is narrower.
CUPED typically reduces variance by 30 to 50 percent on engagement metrics. That is equivalent to running an experiment 1.5x to 2x longer for the same statistical power. Worth the engineering investment for any team running 5+ experiments per quarter.
Other variance reduction techniques.
Detail with worked examples in references/variance-reduction-techniques.md.
Before running, compute required sample size.
from statsmodels.stats.power import tt_ind_solve_power
# Solve for sample size given desired MDE
n_per_arm = tt_ind_solve_power(
effect_size=0.05, # Cohen's d
nobs1=None,
alpha=0.05,
power=0.8
)
# Or solve for MDE given sample size
mde = tt_ind_solve_power(
effect_size=None,
nobs1=8000,
alpha=0.05,
power=0.8
)The "we need 10x more users than we thought" lesson. Most underpowered experiments come from optimistic effect-size assumptions. The team designs the experiment expecting a 10% lift; the actual effect is 1%, undetectable at the planned sample size; the experiment runs forever or stops with an inconclusive result.
The fix. Use the historical distribution of past experiments' observed effects to set realistic MDE expectations. If the median observed effect across the last 30 experiments is 0.5%, plan for a 0.5% MDE on new experiments. The optimism asymmetry is real; correcting it requires looking at the actual distribution of effects, not the wished-for distribution.
Detail in references/power-analysis-calculations.md.
The "peeking" problem. Looking at experiment results before completion inflates the false-positive rate. The naive solution is "do not peek." The practical solution is sequential testing methods that allow valid early stopping.
Three approaches.
statsmodels or custom code; not natively in SQL.For warehouse-native, the practical recommendation is mSPRT in Python. Document the alpha-spending function used. Train one team member on the math; do not rely on a black-box implementation.
The honest version. If you do not have someone on the team who understands sequential testing math, just do not peek. Pre-register sample size, run to completion, analyze once. Sequential testing is statistically correct only when implemented correctly; an incorrect implementation is worse than no peeking discipline at all.
Detail in references/sequential-testing-patterns.md.
Eleven patterns recur in warehouse-native experimentation. Detail in references/common-pitfalls.md.
When designing or running a warehouse-native experiment, walk these 12 considerations.
experimentation-platform-orchestrator if undecided.The output of the framework is an experiment record. Pre-registered sample size and stop criteria, the assignment salt, the exposure log specification, the dbt metric model, the analysis notebook, and a written-up decision. The record lives in version control or in a dedicated experiment-tracking system; the analysis is reproducible from the record.
references/warehouse-vs-platform-decision.md - When each operational model is the right call. Cost considerations at different scales. Hybrid patterns. Migration patterns.references/assignment-and-exposure-patterns.md - Hash assignment SQL templates for BigQuery and Snowflake. Salt naming conventions. Exposure event schema. SRM check SQL.references/metric-definitions-in-dbt.md - dbt model patterns for experiment metrics. Reusing fct models. The exp_metrics namespace. Versioning.references/statistical-analysis-templates.md - SQL and Python templates for Welch's t-test, proportions test, Mann-Whitney, bootstrap. Notebook structure.references/variance-reduction-techniques.md - CUPED math and Python implementation with worked example. Stratification. Regression adjustment. Doubly robust estimation primer.references/power-analysis-calculations.md - MDE math. Sample size calculations. Calibrating effect-size assumptions from historical experiments.references/sequential-testing-patterns.md - mSPRT, confidence sequences, group sequential designs. Honest framing on when to peek.references/common-pitfalls.md - Eleven failure patterns with diagnoses and fixes.Warehouse-native experimentation is powerful but expensive in engineering time. A first-year experimentation team should almost always start with a platform; the platform handles 90 percent of cases and lets you focus on hypotheses, not infrastructure. The team that graduates to warehouse-native does so because their volume, custom metric needs, or trust requirements outgrew what platforms offer.
If you are building warehouse-native because "platforms cost too much" without first running the math: you are underestimating the cost of your team's engineering time. The platform fee that looks expensive on a procurement form is often cheaper than three months of a data engineer's time spent reinventing CUPED.
If you are building it because the platform cannot handle your specific needs: you are probably right and the investment will pay back. Platforms are general; your business is specific. The custom metric the platform cannot express is often the metric that matters most for your decisions.
Honest middle ground: many mature teams use both. The platform for fast iteration on standard experiments where time-to-result matters more than custom depth. Warehouse-native for the hard cases where the platform's metric library or segmentation cannot reach. The hybrid is operationally complex; document the rule for which experiments go where, and revisit annually.
8e70d03
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.