experiment-designer

The Experiment Designer specialist for Headout's PM OS. Use this skill when a feature or change needs to be validated via an A/B test or controlled experiment before full rollout. It designs the experiment end-to-end: hypothesis, variants, user assignment, sample size, guardrails, measurement window, and the BigQuery/Statsig setup needed to track results. This skill is conditional — not every feature needs a formal experiment design. Use it when the PM says "we want to A/B test this", "how do we measure if this works", "design an experiment for X", "what's our holdout strategy", "help me think through the experiment setup", or when a spec includes an experiment as its validation method. The Experiment Designer works with Headout's experimentation infrastructure: Statsig for assignment and feature flags, BigQuery for outcome measurement, Mixpanel for behavioral signals, and Delphi (#ask-delphi) for ad hoc queries.
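The sample-size step named in the description can be illustrated with a standard two-proportion power calculation. This is a minimal sketch under stated assumptions, not the skill's own method: the function name `sample_size_per_variant`, the defaults (two-sided alpha of 0.05, power of 0.8), and the example conversion rate are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate, e.g. 0.10 (assumed example value)
    relative_lift: minimum detectable effect as a fraction of baseline, e.g. 0.10
    """
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if relative_lift <= 0:
        raise ValueError("relative_lift must be positive")

    p_control = baseline_rate
    p_variant = baseline_rate * (1 + relative_lift)
    if p_variant >= 1:
        raise ValueError("baseline_rate * (1 + relative_lift) must be below 1")

    # Critical values of the standard normal for the chosen alpha and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Sum of the Bernoulli variances of the two arms.
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)

    n = (z_alpha + z_power) ** 2 * variance / (p_variant - p_control) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, 10% relative lift.
    print(sample_size_per_variant(0.10, 0.10))
```

Dividing the result by expected daily traffic per variant gives a rough lower bound on the measurement window. Smaller detectable lifts raise the required sample size sharply, since it scales with the inverse square of the absolute difference.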

91

1.14x
Quality: 88%
Does it follow best practices?

Impact: 96%, 1.14x
Average score across 3 eval scenarios

Security by Snyk: Passed
No known issues

Quality

Content

77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, highly actionable experiment design skill with excellent workflow clarity and a well-structured multi-step process including validation gates and a pre-finalization critique. Its main weaknesses are moderate verbosity (repeating key points across sections, explaining concepts Claude already knows) and a monolithic structure that could benefit from splitting reference material into separate files. The worked example and decision framework template are particularly valuable.

Suggestions

Trim repeated points — the 'define decision framework before results' message appears in three places (Step 3k, Structured Critique, Common Issues); consolidate to one authoritative statement.

Remove explanatory asides that Claude already knows (e.g., what novelty effects are, what p-hacking means) and replace with just the actionable instruction (e.g., 'Check for novelty effect risk' rather than explaining what it is).

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is thorough but includes some unnecessary verbosity: philosophical framing about poorly designed experiments, explanations of concepts Claude already knows (what p-hacking is, what novelty effects are), and repeated emphasis on the same points (e.g., 'define decision framework before, not after' appears in Step 3k, the critique section, and Common Issues). The content could be tightened by ~30% without losing actionability. | 2 / 3 |
| Actionability | The skill provides highly concrete, executable guidance: specific hypothesis templates, exact decision tree logic, named tools (Statsig, BigQuery, Mixpanel), specific metrics (S2O, C2O, CVR, GBV, ARPU, CM1), sample size calculation parameters, a complete output template with checklist, and a worked example with the scarcity booster. Every step tells Claude exactly what to produce. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced (Steps 1 → 2 → 2.5 → 3 → Critique → Output) with explicit validation checkpoints: Step 2 validates experiment readiness before design, Step 2.5 uses AskUserQuestion to surface blind spots with clear completion criteria, and the Structured Critique section serves as a review gate before finalizing. The decision framework itself includes feedback loops for inconclusive results. | 3 / 3 |
| Progressive Disclosure | The skill references `${CLAUDE_PLUGIN_ROOT}/CLAUDE.md` for context loading, which is appropriate. However, the content is monolithic: all experiment design details, the critique framework, common issues, and the example are inline in a single long file. The structured critique section and common issues could reasonably be split into separate reference files, and no bundle files exist to support this. | 2 / 3 |
| Total | | 10 / 12 |

Passed

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is an excellent skill description that clearly defines its scope, provides rich trigger terms, and explicitly states both what it does and when to use it. The conditional usage note adds valuable context for skill selection, and the specific tooling references ground it in a concrete domain. The description is well-structured, concise, and highly actionable for Claude's skill selection process.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: designs hypothesis, variants, user assignment, sample size, guardrails, measurement window, and BigQuery/Statsig setup. Also specifies the tools involved (Statsig, BigQuery, Mixpanel, Delphi). | 3 / 3 |
| Completeness | Clearly answers both 'what' (designs experiments end-to-end with specific components listed) and 'when' (explicit 'Use when' clause with multiple trigger phrases and conditional usage guidance). Also includes negative scoping ('not every feature needs a formal experiment design'). | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'A/B test', 'experiment', 'how do we measure if this works', 'holdout strategy', 'experiment setup', 'controlled experiment'. These are highly natural phrases a PM would use. | 3 / 3 |
| Distinctiveness / Conflict Risk | Highly distinctive with a clear niche around A/B testing and experiment design. The specific tooling (Statsig, BigQuery, Mixpanel), domain (Headout PM OS), and trigger terms make it very unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 |

Passed

Validation

100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 11 / 11 Passed

Validation for skill structure

No warnings or errors.

Repository
headout/pm-os-marketplace
Reviewed
