Designs and tracks scientific experiments, A/B tests, and feature rollouts for product and engineering teams. Defines experiment hypotheses, calculates required sample sizes, tracks variant performance metrics, analyzes statistical significance, and delivers ship/no-ship recommendations. Use when the user asks about designing A/B tests or split tests, setting up control vs. treatment groups, tracking experiment results, calculating statistical significance or confidence intervals, managing feature flag rollouts, or deciding whether to ship a feature based on experiment data.
Manages the full experiment lifecycle: hypothesis definition, statistical design, execution monitoring, results analysis, and ship/no-ship decisions.
Validation checkpoint: If calculated sample size exceeds available daily traffic × planned duration, reduce scope, extend duration, or narrow the hypothesis before proceeding.
Validation checkpoint: If data collection rate in soft launch is < 95% of expected, halt and fix instrumentation before full launch.
Validation checkpoint: If variant/control split deviates by > 5% from the planned ratio (e.g., 50/50 target but observing 55/45), investigate for assignment bugs before continuing.
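The split-deviation checkpoint above can also be framed statistically as a sample ratio mismatch (SRM) test, which catches assignment bugs that a fixed percentage threshold can miss at large sample sizes. A minimal sketch using a chi-squared goodness-of-fit test with one degree of freedom (the function name and alpha threshold here are illustrative, not part of the skill's API):

```python
from math import erfc, sqrt

def srm_check(n_control: int, n_treatment: int,
              expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Return True if the observed split is consistent with the planned ratio.

    Uses a chi-squared goodness-of-fit test (1 degree of freedom). A very
    low p-value signals a sample ratio mismatch, i.e. a likely assignment bug.
    """
    total = n_control + n_treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of the chi-squared distribution with 1 dof
    p_value = erfc(sqrt(chi2 / 2))
    return p_value >= alpha  # False -> investigate before continuing

srm_check(5500, 4500)  # 55/45 observed on a 50/50 target -> False
srm_check(5020, 4980)  # small wobble on a 50/50 target -> True
```

A strict alpha (0.001 rather than 0.05) is conventional for SRM checks, since the test runs continuously and a false alarm halts the experiment.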
Full copy-paste templates are in TEMPLATES.md. Key documents:
Full Python implementations are in STATISTICAL_METHODS.md. Available functions:
- `sample_size(baseline_rate, mde, alpha, power)` — two-proportion z-test sample size per variant (e.g., 10% baseline + 2 pp MDE → ~3,842 per variant at 80% power, α = 0.05).
- `test_proportions(...)` — chi-squared test for conversion rates; returns p-value, absolute lift, and relative lift.
- `test_means(...)` — Welch's t-test for continuous metrics (e.g., revenue per user).
- `proportion_ci(conversions, n, alpha)` — Wilson score confidence interval for proportions.

| Data type | Recommended test |
|---|---|
| Conversion rate (binary) | Two-proportion z-test / chi-squared |
| Continuous metric (revenue, time) | Welch's t-test |
| Non-normal continuous metric | Mann-Whitney U |
| Multiple variants (> 2) | ANOVA + Bonferroni correction |
| Sequential / always-on testing | O'Brien-Fleming or mSPRT |
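The actual `sample_size` implementation lives in STATISTICAL_METHODS.md; as a sketch of the underlying math, here is a pooled-variance normal approximation that reproduces the ~3,842-per-variant figure quoted above (stdlib only, via `statistics.NormalDist`):

```python
from math import ceil
from statistics import NormalDist

def sample_size(baseline_rate: float, mde: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size for a two-sided two-proportion z-test.

    Uses the pooled-variance approximation:
        n = (z_{alpha/2} + z_beta)^2 * 2 * p_bar * (1 - p_bar) / mde^2
    where p_bar is the average of the baseline and treatment rates.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    p_bar = baseline_rate + mde / 2                # pooled rate under H1
    n = (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / mde ** 2
    return ceil(n)

sample_size(0.10, 0.02)  # ~3,842 per variant (80% power, alpha = 0.05)
```

Different tools round or pool slightly differently, so expect answers within a few units of each other; the exact-variance variant (separate p₁q₁ and p₂q₂ terms) gives ~3,841 for the same inputs.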