experimentation-analytics

How to read experiment results without fooling yourself. Confidence intervals, p-values, multiple testing, sequential testing, CUPED, heterogeneous treatment effects, ratio metrics, network effects, dashboard reconciliation, and the interpretation failures that produce confidently wrong shipping decisions.

55

Quality

62%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/experimentation-analytics/SKILL.md

Quality

Content

42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a comprehensive knowledge document about experiment interpretation that reads more like a statistics textbook chapter than a concise, actionable skill for Claude. Its greatest strength is the progressive disclosure structure, with well-organized references and practical CI decision rules. Its greatest weakness is extreme verbosity: it explains at length many statistical concepts Claude already knows, and it lacks executable examples (code, SQL, specific calculations) that would make the guidance immediately actionable.

Suggestions

Cut explanatory content about well-known statistical concepts (what a p-value is, what SUTVA stands for, how Bonferroni works) to brief reminders, reducing the document by 40-60%. Focus on the decision rules and gotchas that are genuinely non-obvious.

Add a concrete step-by-step workflow at the top: 'When reading a result panel: 1. Verify allocation stability → 2. Read CI width and bounds → 3. Check guardrail statuses → 4. Apply multiple testing context → 5. Make ship/kill/iterate decision.' Include validation checkpoints.

Add executable examples: a Python/SQL snippet for computing delta-method CIs on ratio metrics, a worked numerical example showing how naive vs correct variance estimation changes a ship decision, or a concrete example of Bonferroni correction applied to a real metric set.

Consolidate the 'common interpretation failures' section and the '14 considerations framework' section—they substantially overlap and the framework section largely just summarizes what was already said in detail above it.
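
As an illustration of the kind of executable example the third suggestion asks for, here is a minimal sketch of a delta-method confidence interval for a ratio metric, compared against a naive binomial variance. It is not taken from the skill itself. The per-user click and view counts are hypothetical, the function names are illustrative, and the sketch assumes the user is the randomization unit and that a normal approximation (z = 1.96) is acceptable.

```python
import math
from statistics import mean


def ratio_delta_method(numer, denom):
    """Ratio of means and its delta-method variance from per-user totals."""
    n = len(numer)
    my, mx = mean(numer), mean(denom)
    r = my / mx
    # Sample variances and covariance across users (the randomization unit).
    var_y = sum((y - my) ** 2 for y in numer) / (n - 1)
    var_x = sum((x - mx) ** 2 for x in denom) / (n - 1)
    cov = sum((y - my) * (x - mx) for y, x in zip(numer, denom)) / (n - 1)
    # First-order Taylor expansion of mean(Y) / mean(X).
    var_r = (var_y - 2 * r * cov + r * r * var_x) / (n * mx * mx)
    return r, var_r


def naive_binomial_variance(numer, denom):
    """Variance that wrongly treats every view as an independent trial."""
    total = sum(denom)
    p = sum(numer) / total
    return p * (1 - p) / total


def diff_ci(r_t, var_t, r_c, var_c, z=1.96):
    """Normal-approximation CI for the treatment-minus-control difference."""
    diff = r_t - r_c
    half = z * math.sqrt(var_t + var_c)
    return diff - half, diff + half


# Hypothetical per-user totals: clicks (numerator) and views (denominator).
control_clicks = [0, 1, 0, 2, 9, 0, 1, 0]
control_views = [2, 3, 1, 4, 10, 2, 3, 1]
treat_clicks = [1, 1, 0, 2, 10, 0, 2, 1]
treat_views = [2, 3, 1, 4, 10, 2, 3, 1]

r_c, var_c = ratio_delta_method(control_clicks, control_views)
r_t, var_t = ratio_delta_method(treat_clicks, treat_views)
naive_c = naive_binomial_variance(control_clicks, control_views)
naive_t = naive_binomial_variance(treat_clicks, treat_views)

ci_delta = diff_ci(r_t, var_t, r_c, var_c)
ci_naive = diff_ci(r_t, naive_t, r_c, naive_c)

print("delta-method CI:", ci_delta)
print("naive CI:       ", ci_naive)
```

On this toy data the delta-method interval comes out wider than the naive one, because views from the same user are correlated and the naive formula ignores that. Whether the difference is large enough to change a ship decision depends on the real data.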

Dimension | Reasoning | Score

Conciseness

Extremely verbose at roughly 4,000 words or more. Extensively explains statistical concepts Claude already knows (what a p-value is, what a CI means, what SUTVA stands for, how Bonferroni correction works). The introductory paragraphs, 'what this skill is for' section, and closing section are largely meta-commentary that doesn't add actionable value. Much of this reads like a statistics textbook rather than a concise skill reference.

1 / 3

Actionability

Provides practical decision rules (e.g., the 5 CI interpretation rules) and heuristics that are genuinely useful, but lacks any executable code, commands, or concrete worked examples with actual numbers flowing through a calculation. The guidance is specific in places (e.g., 'ask which variance estimator your platform uses') but mostly descriptive rather than executable. No code snippets, no SQL queries, no notebook examples despite mentioning exporting data to notebooks.

2 / 3

Workflow Clarity

The 14-consideration framework provides a clear checklist structure, and the CI decision rules are well-sequenced. However, there is no explicit step-by-step workflow for reading a result panel (e.g., 'Step 1: check allocation, Step 2: read CI, Step 3: check guardrails...'). The content reads more as a reference encyclopedia than a sequenced workflow with validation checkpoints. Missing explicit decision trees or flowcharts for the ship/kill/iterate decision.

2 / 3

Progressive Disclosure

Excellent progressive disclosure structure. The main document provides overview-level coverage of each topic with clear references to 7 well-organized reference files (cheatsheet, interpretation guide, statistical methods, reconciliation patterns, presentation templates, platform comparison, failure patterns). References are one level deep, clearly signaled with relative paths, and logically organized. However, no bundle files were provided to verify these references actually exist.

3 / 3

Total: 8 / 12

Passed

Description

82%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description excels at specificity and trigger term coverage, listing a comprehensive set of concrete statistical and experimentation concepts that practitioners would naturally reference. Its main weakness is the absence of an explicit 'Use when...' clause, which would help Claude know precisely when to select this skill. The description is distinctive and clearly carved out for experiment and A/B test interpretation.

Suggestions

Add an explicit 'Use when...' clause, e.g., 'Use when the user asks about interpreting A/B test results, experiment analysis, statistical significance, or making shipping decisions based on experiment data.'

Consider adding common synonyms like 'A/B testing', 'online experiments', 'randomized controlled trials', or 'experimentation platform' to broaden trigger term coverage.

Dimension | Reasoning | Score

Specificity

Lists multiple specific concrete topics and actions: confidence intervals, p-values, multiple testing, sequential testing, CUPED, heterogeneous treatment effects, ratio metrics, network effects, dashboard reconciliation, and interpretation failures. These are highly specific statistical and experimentation concepts.

3 / 3

Completeness

The 'what' is well-covered with the list of statistical concepts and the framing of reading experiment results. However, there is no explicit 'Use when...' clause or equivalent trigger guidance telling Claude when to select this skill, which caps this at 2 per the rubric guidelines.

2 / 3

Trigger Term Quality

Excellent coverage of natural terms users would say when seeking help with experiment analysis: 'confidence intervals', 'p-values', 'multiple testing', 'CUPED', 'heterogeneous treatment effects', 'ratio metrics', 'network effects'. 'A/B testing' is not named but is implied through 'experiment results' and 'shipping decisions'. These are terms practitioners naturally use.

3 / 3

Distinctiveness / Conflict Risk

Highly distinctive niche focused specifically on A/B experiment interpretation and statistical pitfalls. The combination of CUPED, sequential testing, network effects, and shipping decisions creates a very clear domain that is unlikely to conflict with general statistics or data analysis skills.

3 / 3

Total: 11 / 12

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

Criteria | Description | Result

frontmatter_unknown_keys

Unknown frontmatter key(s) found; consider removing or moving to metadata

Warning

Total: 10 / 11

Passed

Repository
rampstackco/claude-skills
Reviewed
