experiment-designer

The Experiment Designer specialist for Headout's PM OS. Use this skill when a feature or change needs to be validated via an A/B test or controlled experiment before full rollout. It designs the experiment end-to-end: hypothesis, variants, user assignment, sample size, guardrails, measurement window, and the BigQuery/Statsig setup needed to track results. This skill is conditional — not every feature needs a formal experiment design. Use it when the PM says "we want to A/B test this", "how do we measure if this works", "design an experiment for X", "what's our holdout strategy", "help me think through the experiment setup", or when a spec includes an experiment as its validation method. The Experiment Designer works with Headout's experimentation infrastructure: Statsig for assignment and feature flags, BigQuery for outcome measurement, Mixpanel for behavioral signals, and Delphi (#ask-delphi) for ad hoc queries.
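The sample-size and measurement-window steps mentioned above can be illustrated with a rough sketch. This is not the skill's actual method: it assumes a conversion-rate primary metric, a two-sided two-proportion z-test, and equal traffic per arm. The function names, the default alpha and power, and the example numbers are all hypothetical. The 14-day floor mirrors the "Minimum 2-week window" criterion in the eval results.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect a relative lift in a rate.

    Assumption: two-sided z-test on two proportions with unpooled variance.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate in the treatment arm at the MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_arm, arms, daily_eligible_users, min_days=14):
    """Days needed to fill every arm, never shorter than the minimum window."""
    days = math.ceil(n_per_arm * arms / daily_eligible_users)
    return max(days, min_days)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline conversion, 10% relative MDE,
    # control plus one variant, 4,000 eligible users per day.
    n = sample_size_per_arm(baseline=0.05, relative_mde=0.10)
    print(n, duration_days(n, arms=2, daily_eligible_users=4000))
```

A smaller MDE or a lower baseline rate increases the required sample sharply, which is why low-traffic surfaces often need a proxy metric or a different experiment type.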

Score: 91 (1.14x)
Quality: 88% (Does it follow best practices?)
Impact: 96%, 1.14x (average score across 3 eval scenarios)
Security by Snyk: Passed, no known issues

Evaluation results

Checkout Flow Scarcity Badge Experiment
Full experiment design output format and decision framework
Score: 100% · 14%

| Criteria | Without context | With context |
| --- | --- | --- |
| Correct output filename | 100% | 100% |
| Structured hypothesis format | 50% | 100% |
| Specific variant definition | 100% | 100% |
| Assignment unit reasoning | 100% | 100% |
| Traffic split justification | 83% | 100% |
| Target segment dimensions | 100% | 100% |
| Single primary metric | 100% | 100% |
| MDE and baseline stated | 100% | 100% |
| Sample size and duration calculation | 100% | 100% |
| Guardrail metrics named | 100% | 100% |
| Minimum 2-week window | 100% | 100% |
| Statsig flag named | 20% | 100% |
| New events specified | 100% | 100% |
| BQ query or table referenced | 0% | 100% |
| Pre-defined decision framework | 100% | 100% |

Post-Booking Confirmation Trust Signal Experiment
Low-traffic surface and proxy metric handling
Score: 90% · 9%

| Criteria | Without context | With context |
| --- | --- | --- |
| Traffic problem flagged | 100% | 100% |
| Low-traffic alternatives offered | 80% | 100% |
| Alternative experiment type recommended | 10% | 50% |
| Instrumentation gap identified | 100% | 100% |
| Proxy metric documented | 40% | 100% |
| Single primary metric | 100% | 20% |
| Guardrail metrics included | 100% | 100% |
| Decision framework present | 100% | 100% |
| Assumptions named explicitly | 75% | 87% |
| Minimum 2-week window | 100% | 100% |
| Output file correct name | 100% | 100% |
| Feasibility assessment section | 100% | 100% |

Pricing Transparency Multivariate Experiment
Structured critique and multivariate experiment design
Score: 100% · 13%

| Criteria | Without context | With context |
| --- | --- | --- |
| Critique section present | 100% | 100% |
| Internal validity threat flagged | 100% | 100% |
| Novelty effect considered | 100% | 100% |
| Guardrail sufficiency addressed | 62% | 100% |
| Metric gaming risk assessed | 100% | 100% |
| Post-hoc decision framework flagged | 100% | 100% |
| Pre-committed decision framework present | 100% | 100% |
| Multivariate justification | 85% | 100% |
| New events instrumented | 100% | 100% |
| Statsig flag specified | 0% | 100% |
| Sample size covers all arms | 100% | 100% |
| MDE conservatism assessed | 42% | 100% |
| Output file correct name | 100% | 100% |
| Pre-experiment checklist | 100% | 100% |

Repository: headout/pm-os-marketplace
Evaluated with:
Agent: Claude Code
Model: Claude Sonnet 4.6
