Running experiments out of the data warehouse instead of via dedicated experiment platforms. SQL-based assignment, exposure logging discipline, metric definitions in dbt models, statistical analysis in SQL or Python, variance reduction with CUPED, sequential testing, and the operational tradeoffs vs platforms like Statsig and Optimizely. Triggers on warehouse-native experimentation, run experiments in BigQuery, run experiments in Snowflake, dbt experiments, SQL t-test, CUPED variance reduction, exposure log, sample ratio mismatch, sequential testing, mSPRT, doubly robust estimation, build vs buy experimentation. Also triggers when the team is choosing between platform and warehouse, building warehouse-native experiment infrastructure, auditing one, or running an experiment with a custom metric the platform cannot handle.
63
75%: Does it follow best practices?
Impact: not scored (no eval scenarios have been run)
Passed: No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/data-warehouse-experimentation/SKILL.md
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an excellent skill description that thoroughly covers the domain of warehouse-native experimentation with highly specific capabilities, abundant natural trigger terms, and clear guidance on when to use it. It distinguishes itself well from both general data/SQL skills and platform-based experimentation skills. The description is comprehensive without being padded, and uses appropriate third-person voice throughout.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions and techniques: SQL-based assignment, exposure logging, metric definitions in dbt models, statistical analysis, CUPED variance reduction, sequential testing, and operational tradeoffs vs named platforms (Statsig, Optimizely). | 3 / 3 |
| Completeness | Clearly answers both 'what' (SQL-based assignment, exposure logging, metric definitions, statistical analysis, CUPED, sequential testing, platform tradeoffs) and 'when' with explicit triggers ('Triggers on warehouse-native experimentation...', 'Also triggers when the team is choosing between platform and warehouse, building warehouse-native experiment infrastructure, auditing one, or running an experiment with a custom metric the platform cannot handle'). | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'warehouse-native experimentation', 'run experiments in BigQuery', 'run experiments in Snowflake', 'dbt experiments', 'SQL t-test', 'CUPED', 'exposure log', 'sample ratio mismatch', 'sequential testing', 'mSPRT', 'build vs buy experimentation'. These are highly specific and natural terms a practitioner would use. | 3 / 3 |
| Distinctiveness Conflict Risk | Occupies a very clear niche: warehouse-native experimentation as opposed to platform-based experimentation. The specific mention of BigQuery, Snowflake, dbt, and named platforms like Statsig and Optimizely makes it highly distinguishable from general data analysis, general A/B testing, or general SQL skills. | 3 / 3 |
| Total | | 12 / 12 Passed |
Implementation
50%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
The skill demonstrates deep domain expertise and provides genuinely useful, executable code examples for warehouse-native experimentation. Its main weakness is extreme verbosity — it reads more like a comprehensive blog post or technical guide than a lean skill file, with extensive prose explaining concepts Claude already understands and significant content duplication between the main body and what should be in reference files. The workflow could benefit from explicit validation gates (e.g., mandatory SRM check before any metric analysis).
Suggestions
Cut the body by 60-70%: remove explanatory prose about why things matter (Claude knows), collapse the 'when to use' section to a bullet list, and move all full code examples to reference files, keeping only minimal illustrative snippets in the main body.
Embed explicit validation checkpoints into the workflow: add a mandatory SRM check step before any metric computation, and a metric-definition alignment verification step before publishing results.
Move the detailed SQL/Python templates (t-test, CUPED, power analysis) entirely into the referenced files and replace them with one-line descriptions pointing to those files, since the references already exist for this purpose.
Remove the 'What this skill is for' section that explains relationships to other skills — this is meta-context that consumes tokens without adding actionable guidance for the task at hand.
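The second suggestion above (a mandatory SRM check before any metric computation) could take roughly the following shape. This is a minimal sketch and not code from the reviewed skill: the function names `srm_p_value` and `require_no_srm`, the `SRM_ALPHA = 0.001` threshold, and the two-arm 50/50 default are all assumptions made for illustration. It uses a chi-square goodness-of-fit test with one degree of freedom and only the Python standard library.

```python
import math

# Assumed threshold (not from the skill); SRM checks are commonly run with a
# strict alpha because they are applied to every experiment.
SRM_ALPHA = 0.001


def srm_p_value(control_n: int, treatment_n: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    if not 0.0 < expected_treatment_share < 1.0:
        raise ValueError("expected_treatment_share must be strictly between 0 and 1")
    total = control_n + treatment_n
    if total <= 0:
        raise ValueError("no exposed units to check")
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    stat = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    # For a chi-square variable with 1 df, P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def require_no_srm(control_n: int, treatment_n: int,
                   expected_treatment_share: float = 0.5,
                   alpha: float = SRM_ALPHA) -> float:
    """Hypothetical gate: raise before any metric analysis if the split looks broken."""
    p = srm_p_value(control_n, treatment_n, expected_treatment_share)
    if p < alpha:
        raise RuntimeError(
            f"Sample ratio mismatch (p={p:.2e}); fix assignment or exposure "
            "logging before computing metrics."
        )
    return p


if __name__ == "__main__":
    # Exposure counts would come from the exposure log; these are made-up numbers.
    print(require_no_srm(5000, 5300))
```

Calling `require_no_srm` as the first step of an analysis job turns the SRM check from a consideration into a hard gate: the metric queries never run when the observed split deviates too far from the planned one.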
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~400+ lines. It explains concepts Claude already knows (what platforms are, what dbt does, what a t-test is), includes lengthy prose justifications for decisions, and repeats the same points multiple times (e.g., exposure logging discipline is explained in the architecture section, the exposure log section, the pitfalls section, and the framework checklist). The 'when warehouse-native is the right call' section reads like a blog post, not an instruction set. | 1 / 3 |
| Actionability | The skill provides fully executable SQL and Python code throughout: deterministic hash assignment in SQL, exposure log schema, dbt metric model patterns, a complete Welch's t-test in SQL, Python t-test with scipy, CUPED implementation, and power analysis with statsmodels. Code examples are copy-paste ready with realistic field names and patterns. | 3 / 3 |
| Workflow Clarity | The 12-consideration framework provides a clear sequence, and the data flow architecture (assignment → exposure → metrics → analysis) is well-articulated. However, there are no explicit validation checkpoints or feedback loops in the workflow steps. The SRM check is mentioned as a consideration but not integrated as a mandatory gate before analysis. The pitfalls section lists problems but doesn't embed verification steps into the workflow itself. | 2 / 3 |
| Progressive Disclosure | The skill references 8 separate reference files with clear descriptions and links, which is good structure. However, the main SKILL.md itself contains enormous amounts of detail that should be in those reference files (full SQL templates, full Python examples, lengthy prose on when to use warehouse-native). The body should be a concise overview pointing to the references, but instead it duplicates much of what the references presumably contain. Additionally, no bundle files were provided, so the references are unverifiable. | 2 / 3 |
| Total | | 8 / 12 Passed |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
8e70d03