exp-driven-dev

Builds features with A/B testing in mind using Ronny Kohavi's frameworks and Netflix/Airbnb experimentation culture. Use when implementing feature flags, choosing metrics, designing experiments, or building for fast iteration. Focuses on guardrail metrics, statistical significance, and experiment-driven development.

Quality

83%

Does it follow best practices?

Impact

—

No eval scenarios have been run

Securityby

Passed

No known issues

Experimentation-Driven Development

When This Skill Activates

Claude uses this skill when:

Building new features that affect core metrics
Implementing A/B testing infrastructure
Making data-driven decisions
Setting up feature flags for gradual rollouts
Choosing which metrics to track

Core Frameworks

1. Experiment Design (Source: Ronny Kohavi, Microsoft/Netflix)

The HITS Framework:

H - Hypothesis:

"We believe that [change] will cause [metric] to [increase/decrease] because [reason]"

I - Implementation:

Feature flag setup
Treatment vs control
Sample size calculation

T - Test:

Run for statistical significance
Monitor guardrail metrics
Watch for unexpected effects

S - Ship or Stop:

Ship if positive
Stop if negative
Iterate if inconclusive

Example:

Hypothesis:
"We believe that adding social proof ('X people bought this') 
will increase conversion rate by 10% 
because it reduces purchase anxiety."

Implementation:
- Control: No social proof
- Treatment: Show "X people bought"
- Sample size: 10,000 users per variant
- Duration: 2 weeks

Test:
- Primary metric: Conversion rate
- Guardrails: Cart abandonment, return rate

Ship or Stop:
- If conversion +5% or more → Ship
- If conversion -2% or less → Stop
- If inconclusive → Iterate and retest

2. Metric Selection

Primary Metric:

ONE metric you're trying to move
Directly tied to business value
Clear success threshold

Guardrail Metrics:

Metrics that shouldn't degrade
Prevent gaming the system
Ensure quality maintained

Example:

Feature: Streamlined checkout

Primary Metric:
✅ Purchase completion rate (+10%)

Guardrail Metrics:
⚠️ Cart abandonment (don't increase)
⚠️ Return rate (don't increase)
⚠️ Support tickets (don't increase)
⚠️ Load time (stay <2s)

3. Statistical Significance

The Math:

Minimum sample size = (Effect size, Confidence, Power)

Typical settings:
- Confidence: 95% (p < 0.05)
- Power: 80% (detect 80% of real effects)
- Effect size: Minimum detectable change

Example:
- Baseline conversion: 10%
- Minimum detectable effect: +1% (to 11%)
- Required: ~15,000 users per variant

Common Mistakes:

❌ Stopping test early (peeking bias)
❌ Running too short (seasonal effects)
❌ Too many variants (dilutes sample)
❌ Changing test mid-flight

4. Feature Flag Architecture

Implementation:

// Feature flag pattern
function checkoutFlow(user) {
  if (isFeatureEnabled(user, 'new-checkout')) {
    return newCheckoutExperience();
  } else {
    return oldCheckoutExperience();
  }
}

// Gradual rollout
function isFeatureEnabled(user, feature) {
  const rolloutPercent = getFeatureRollout(feature);
  const userBucket = hashUserId(user.id) % 100;
  return userBucket < rolloutPercent;
}

// Experiment assignment
function assignExperiment(user, experiment) {
  const variant = consistentHash(user.id, experiment);
  track('experiment_assigned', {
    userId: user.id,
    experiment: experiment,
    variant: variant
  });
  return variant;
}

Decision Tree: Should We Experiment?

NEW FEATURE
│
├─ Affects core metrics? ──────YES──→ EXPERIMENT REQUIRED
│  NO ↓
│
├─ Risky change? ──────────────YES──→ EXPERIMENT RECOMMENDED
│  NO ↓
│
├─ Uncertain impact? ──────────YES──→ EXPERIMENT USEFUL
│  NO ↓
│
├─ Easy to A/B test? ─────────YES──→ WHY NOT EXPERIMENT?
│  NO ↓
│
└─ SHIP WITHOUT TEST ←────────────────┘
   (But still feature flag for rollback)

Action Templates

Template 1: Experiment Spec

# Experiment: [Name]

## Hypothesis
**We believe:** [change]
**Will cause:** [metric] to [increase/decrease]
**Because:** [reasoning]

## Variants

### Control (50%)
[Current experience]

### Treatment (50%)
[New experience]

## Metrics

### Primary Metric
- **What:** [metric name]
- **Current:** [baseline]
- **Target:** [goal]
- **Success:** [threshold]

### Guardrail Metrics
- **Metric 1:** [name] - Don't decrease
- **Metric 2:** [name] - Don't increase
- **Metric 3:** [name] - Maintain

## Sample Size
- **Users needed:** [X per variant]
- **Duration:** [Y days]
- **Confidence:** 95%
- **Power:** 80%

## Implementation
```javascript
if (experiment('feature-name') === 'treatment') {
  // New experience
} else {
  // Old experience
}

Success Criteria

Primary metric improved by [X]%
No guardrail degradation
Statistical significance reached
No unexpected negative effects

Decision

If positive: Ship to 100%
If negative: Rollback, iterate
If inconclusive: Extend or redesign

### Template 2: Feature Flag Implementation

```typescript
// features.ts
export const FEATURES = {
  'new-checkout': {
    rollout: 10,  // 10% of users
    enabled: true,
    description: 'New streamlined checkout flow'
  },
  'ai-recommendations': {
    rollout: 0,  // Not live yet
    enabled: false,
    description: 'AI-powered product recommendations'
  }
};

// feature-flags.ts
export function isEnabled(userId: string, feature: string): boolean {
  const config = FEATURES[feature];
  if (!config || !config.enabled) return false;
  
  const bucket = consistentHash(userId) % 100;
  return bucket < config.rollout;
}

// usage in code
if (isEnabled(user.id, 'new-checkout')) {
  return <NewCheckout />;
} else {
  return <OldCheckout />;
}

Template 3: Experiment Dashboard

# Experiment Dashboard

## Active Experiments

### Experiment 1: [Name]
- **Status:** Running
- **Started:** [date]
- **Progress:** [X]% sample size reached
- **Primary metric:** [current result]
- **Guardrails:** ✅ All healthy

### Experiment 2: [Name]
- **Status:** Complete
- **Result:** Treatment won (+15% conversion)
- **Decision:** Ship to 100%
- **Shipped:** [date]

## Key Metrics

### Experiment Velocity
- **Experiments launched:** [X per month]
- **Win rate:** [Y]%
- **Average duration:** [Z] days

### Impact
- **Revenue impact:** +$[X]
- **Conversion improvement:** +[Y]%
- **User satisfaction:** +[Z] NPS

## Learnings
- [Key insight 1]
- [Key insight 2]
- [Key insight 3]

Quick Reference

🧪 Experiment Checklist

Before Starting:

During Experiment:

Don't peek early (wait for significance)
Monitor guardrails daily
Watch for unexpected effects
Log any external factors (holidays, outages)

After Experiment:

Statistical significance reached
Guardrails not degraded
Decision made (ship/stop/iterate)
Learning documented

Real-World Examples

Example 1: Netflix Experimentation

Volume: 250+ experiments running at once Approach: Everything is an experiment Culture: "Strong opinions, weakly held - let data decide"

Example Test:

Hypothesis: Bigger thumbnails increase engagement
Result: No improvement, actually hurt browse time
Decision: Rollback
Learning: Saved $$ by not shipping

Example 2: Airbnb's Experiments

Test: New search ranking algorithm Primary: Bookings per search Guardrails:

Search quality (ratings of bookings)
Host earnings (don't concentrate bookings)
Guest satisfaction

Result: +3% bookings, all guardrails healthy → Ship

Example 3: Stripe's Feature Flags

Approach: Every feature behind flag Benefits:

Instant rollback (flip flag)
Gradual rollout (1% → 5% → 25% → 100%)
Test in production safely

Example:

if (experiments.isEnabled('instant-payouts')) {
  return <InstantPayouts />;
}

Common Pitfalls

❌ Mistake 1: Peeking Too Early

Problem: Stopping test before statistical significance Fix: Calculate sample size upfront, wait for it

❌ Mistake 2: No Guardrails

Problem: Gaming the metric (increase clicks but hurt quality) Fix: Always define guardrails

❌ Mistake 3: Too Many Variants

Problem: Not enough users per variant Fix: Limit to 2-3 variants max

❌ Mistake 4: Ignoring External Factors

Problem: Holiday spike looks like treatment effect Fix: Note external events, extend duration

Related Skills

metrics-frameworks - For choosing right metrics
growth-embedded - For growth experiments
ship-decisions - For when to ship vs test more
strategic-build - For deciding what to test

Key Quotes

Ronny Kohavi:

"The best way to predict the future is to run an experiment."

Netflix Culture:

"Strong opinions, weakly held. Let data be the tie-breaker."

Airbnb:

"We trust our intuition to generate hypotheses, and we trust data to make decisions."

Further Learning

references/experiment-design-guide.md - Complete methodology
references/statistical-significance.md - Sample size calculations
references/feature-flags-implementation.md - Code examples
references/guardrail-metrics.md - Choosing guardrails

Repository: menkesu/awesome-pm-skills
Commit: 53530ef

Last updated: 4 months ago
Created: 4 months ago

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.