How to read experiment results without fooling yourself. Confidence intervals, p-values, multiple testing, sequential testing, CUPED, heterogeneous treatment effects, ratio metrics, network effects, dashboard reconciliation, and the interpretation failures that produce confidently wrong shipping decisions.
52
58%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/experimentation-analytics/SKILL.md
Quality
Discovery
82%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong description with excellent specificity and domain-relevant trigger terms that would help Claude match it to the right user queries about experiment analysis and A/B testing interpretation. Its main weakness is the lack of an explicit 'Use when...' clause, which would help Claude know precisely when to select this skill over others. The description reads more like a course syllabus than a skill selection guide.
Suggestions
Add an explicit 'Use when...' clause, e.g., 'Use when the user asks about interpreting A/B test results, statistical significance, experiment analysis, or making shipping decisions based on experimental data.'
Consider adding the term 'A/B testing' explicitly since it's the most common way users refer to this domain, even though it's strongly implied by the other terms.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete topics: confidence intervals, p-values, multiple testing, sequential testing, CUPED, heterogeneous treatment effects, ratio metrics, network effects, dashboard reconciliation, and interpretation failures. These are highly specific statistical and experimentation concepts. | 3 / 3 |
| Completeness | The 'what' is well-covered with the list of statistical concepts and the framing of 'how to read experiment results without fooling yourself.' However, there is no explicit 'Use when...' clause or equivalent trigger guidance telling Claude when to select this skill, which caps this at 2 per the rubric guidelines. | 2 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say when seeking help with experiment analysis: 'experiment results', 'confidence intervals', 'p-values', 'multiple testing', 'CUPED'. 'A/B testing' is not named explicitly but is implied by terms such as 'shipping decisions', 'treatment effects', 'ratio metrics', and 'network effects'. These are terms practitioners naturally use. | 3 / 3 |
| Distinctiveness Conflict Risk | Highly distinctive niche focused specifically on A/B testing and experiment interpretation. The combination of specific statistical terms like CUPED, sequential testing, heterogeneous treatment effects, and the framing around 'shipping decisions' makes this very unlikely to conflict with general statistics or data analysis skills. | 3 / 3 |
| Total | | 11 / 12 Passed |
Implementation
35%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads more like a comprehensive statistics textbook chapter than a concise, actionable skill for Claude. Its greatest strength is the depth of coverage and the practical decision rules for CI interpretation, but it is severely undermined by verbosity — explaining concepts Claude already understands (p-values, SUTVA, delta method, Bayesian vs frequentist) at textbook length. The content would be dramatically improved by moving detailed explanations to reference files and keeping only decision rules, checklists, and concrete examples in the main body.
Suggestions
Cut the main body by 60-70%: move detailed statistical explanations (p-value semantics, CUPED mechanics, delta method, Bayesian vs frequentist, network effects) into the referenced files and keep only the decision rules and actionable heuristics in SKILL.md.
Add a concrete step-by-step workflow for reading a result panel (e.g., 'Step 1: Verify allocation balance → Step 2: Read primary metric CI → Step 3: Check guardrails → Step 4: File decision') with explicit validation checkpoints.
Remove meta-commentary sections ('What this skill is for', the lengthy introduction, the closing philosophical section) — these consume tokens without adding actionable guidance.
Add at least one concrete worked example with actual numbers showing the full interpretation flow from panel reading to ship/kill/iterate decision.
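As a rough illustration of what the suggested step-by-step workflow and worked example could look like, the sketch below walks a two-arm conversion test through the four steps named above (allocation balance, primary metric CI, guardrails, decision). Everything in it is an assumption made for illustration, not content from the reviewed skill: the `Arm` type, the function names, the 0.001 sample-ratio-mismatch threshold, the normal-approximation interval, and the decision labels are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    users: int
    conversions: int


def srm_p_value(control: Arm, treatment: Arm, expected_share: float = 0.5) -> float:
    """Step 1: chi-square (1 dof) p-value for sample ratio mismatch."""
    total = control.users + treatment.users
    exp_c = total * expected_share
    exp_t = total * (1 - expected_share)
    chi2 = ((control.users - exp_c) ** 2 / exp_c
            + (treatment.users - exp_t) ** 2 / exp_t)
    # Survival function of a chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def diff_ci(control: Arm, treatment: Arm, z: float = 1.96) -> tuple[float, float]:
    """Step 2: normal-approximation 95% CI for the absolute rate difference."""
    p_c = control.conversions / control.users
    p_t = treatment.conversions / treatment.users
    se = math.sqrt(p_c * (1 - p_c) / control.users
                   + p_t * (1 - p_t) / treatment.users)
    diff = p_t - p_c
    return diff - z * se, diff + z * se


def read_panel(control: Arm, treatment: Arm, guardrails_ok: bool,
               srm_alpha: float = 0.001) -> str:
    """Steps 1-4 in order; thresholds and labels are illustrative assumptions."""
    if srm_p_value(control, treatment) < srm_alpha:
        return "invalid"        # allocation is off: do not read the metrics
    low, high = diff_ci(control, treatment)
    if not guardrails_ok:
        return "iterate"        # Step 3: a guardrail regressed
    if low > 0:
        return "ship"           # whole interval above zero
    if high < 0:
        return "kill"           # whole interval below zero
    return "inconclusive"       # interval straddles zero


if __name__ == "__main__":
    # Hypothetical numbers: 10.0% vs 11.0% conversion on 10,000 users per arm.
    print(read_panel(Arm(10_000, 1_000), Arm(10_000, 1_100), guardrails_ok=True))
```

The ordering is the point of the sketch: the allocation check runs before any metric is read, so a broken split invalidates the panel instead of producing a confident but meaningless interval.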
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~3500+ words. Extensively explains statistical concepts Claude already knows (what a p-value is, what a CI means, SUTVA definition, delta method explanation). The introductory paragraphs, 'what this skill is for' section, and closing section are largely meta-commentary that doesn't add actionable value. Much of this reads like a statistics textbook rather than a concise skill reference. | 1 / 3 |
| Actionability | Provides concrete decision rules (the 5 CI interpretation rules are genuinely useful) and specific platform names, but contains zero executable code, no concrete commands, no worked numerical examples inline, and no copy-paste-ready artifacts. The guidance is specific in places (e.g., 'never report as X, always report as Y') but mostly descriptive rather than executable. | 2 / 3 |
| Workflow Clarity | The 14-consideration framework provides a clear checklist structure, and the CI decision rules are well-sequenced. However, there is no explicit step-by-step workflow for reading a result panel (e.g., 'Step 1: check allocation, Step 2: read CI, Step 3: check guardrails'). The content describes what to think about but doesn't sequence the actual interpretation process with validation checkpoints. | 2 / 3 |
| Progressive Disclosure | References to 7 supporting files are well-organized in a dedicated section with clear descriptions, and inline references are appropriately placed. However, the main SKILL.md itself is monolithic: much of the detailed statistical explanation (CUPED mechanics, delta method details, Bayesian vs frequentist comparison) should be in the reference files rather than inline, and no bundle files were provided to verify the references exist. | 2 / 3 |
| Total | | 7 / 12 Passed |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
8e70d03