When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
55
62%: Does it follow best practices?
Impact: not yet scored. No eval scenarios have been run.
Passed: No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./config/claude/skills/ab-test-setup/SKILL.md
Quality
Discovery
62%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description excels at trigger term coverage and distinctiveness, including a helpful cross-reference to a related skill. However, it is notably weak on specificity—it reads more like a routing rule than a skill description, telling Claude when to use it but barely explaining what concrete actions or outputs the skill provides. Adding specific capabilities would significantly improve it.
Suggestions
Add concrete actions the skill performs, e.g., 'Generates experiment hypotheses, calculates required sample sizes, creates variant copy, designs test plans, and analyzes statistical significance of results.'
Restructure to lead with specific capabilities before the trigger clause, following the pattern: '[What it does]. Use when [triggers].'
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions 'plan, design, or implement an A/B test or experiment' but these are very high-level actions without concrete specifics. It doesn't list what the skill actually does: there are no specific deliverables like 'create hypothesis documents,' 'calculate sample sizes,' 'generate variant copy,' or 'analyze test results.' | 1 / 3 |
| Completeness | The 'when' is explicitly and thoroughly covered with a 'Use when' clause and trigger terms. However, the 'what' is weak: it says 'plan, design, or implement' but doesn't describe what concrete outputs or actions the skill performs. The description is almost entirely trigger-focused with minimal capability description. | 2 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms: 'A/B test,' 'split test,' 'experiment,' 'test this change,' 'variant copy,' 'multivariate test,' 'hypothesis.' These are terms users would naturally use when requesting help with experimentation. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description carves out a clear niche around A/B testing and experimentation, and even explicitly delineates boundaries by directing tracking implementation to 'analytics-tracking.' The trigger terms are specific to this domain and unlikely to conflict with other skills. | 3 / 3 |
| Total | 9 / 12 Passed | |
Implementation
62%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a comprehensive A/B testing guide with strong workflow structure and useful reference tables, but it's more of a knowledge document than an actionable skill for Claude. It explains concepts Claude likely already understands (statistical significance, p-values) while lacking concrete implementation code. The progressive disclosure structure is partially implemented but undermined by missing bundle files and too much inline content.
Suggestions
Remove explanations of concepts Claude already knows (statistical significance definitions, what p-values mean, why you shouldn't peek) and replace with terse reminders or decision rules.
Add concrete, executable code examples for at least one testing tool (e.g., PostHog feature flag setup, LaunchDarkly variant configuration) to increase actionability.
Move the sample size quick reference table and common mistakes into referenced bundle files to reduce the main skill's length and improve progressive disclosure.
Provide the referenced bundle files (references/sample-size-guide.md, references/test-templates.md) or remove the references to avoid broken links.
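As one possible shape for the executable example the second suggestion asks for, the sketch below estimates the visitors needed per variant using the standard two-proportion z-test approximation. It is tool-agnostic and is not taken from the skill itself: the function name `sample_size_per_variant`, its parameters, and the defaults (5% significance, 80% power) are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors per variant for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate, e.g. 0.05 for 5%
    relative_lift: minimum detectable effect relative to baseline, e.g. 0.20 for +20%
    (Hypothetical helper; names and defaults are illustrative assumptions.)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("conversion rates must fall strictly between 0 and 1")
    if p1 == p2:
        raise ValueError("relative_lift must be non-zero")

    # Critical values for the chosen significance level and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Sum of the Bernoulli variances of the two arms.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # 5% baseline, detecting a 20% relative lift (5% -> 6%).
    print(sample_size_per_variant(0.05, 0.20))
```

A helper like this could also stand in for the inline sample size table that the third suggestion proposes moving out of the main file, though that is a design choice for the skill's maintainer.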
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably well-organized but includes some unnecessary explanations Claude already knows (e.g., explaining what statistical significance means, what p-values are, the peeking problem). Some sections like 'Core Principles' state obvious experimentation concepts. The tables and quick references are efficient, but overall it could be tightened by ~30%. | 2 / 3 |
| Actionability | The skill provides structured frameworks (hypothesis template, checklists, tables) which are useful, but lacks executable code or concrete implementation examples. There are no code snippets for setting up tracking, configuring tools, or implementing variants. The guidance is more conceptual/procedural than copy-paste ready, though the hypothesis example and metrics examples add some concreteness. | 2 / 3 |
| Workflow Clarity | The skill presents a clear sequential workflow from initial assessment through hypothesis formation, test design, implementation, running, and analysis. The pre-launch checklist with checkboxes, DO/DON'T lists during testing, and the analysis checklist provide explicit validation checkpoints. The flow is logical and well-sequenced with clear decision points. | 3 / 3 |
| Progressive Disclosure | The skill references two external files (references/sample-size-guide.md and references/test-templates.md) and related skills, which is good structure. However, no bundle files are provided, so these references are broken. Additionally, the main file is quite long (~200+ lines) with content like the full sample size table and common mistakes that could be split into reference files to keep the overview leaner. | 2 / 3 |
| Total | 9 / 12 Passed | |
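The broken-reference finding under Progressive Disclosure is easy to check mechanically. The sketch below, which is not part of Tessl's tooling, reports which of the two referenced files are absent from a skill directory. The two paths come from the review; the function name `missing_references` and the idea of passing the skill directory as an argument are assumptions.

```python
from pathlib import Path

# The two bundle files the review reports as referenced but not provided.
REFERENCED_FILES = [
    "references/sample-size-guide.md",
    "references/test-templates.md",
]


def missing_references(skill_dir, referenced=REFERENCED_FILES):
    """Return the referenced paths that do not exist under skill_dir.

    Hypothetical helper: paths are resolved relative to the skill directory.
    """
    root = Path(skill_dir)
    return [rel for rel in referenced if not (root / rel).is_file()]


if __name__ == "__main__":
    # Directory taken from the optimize command shown on this page.
    for rel in missing_references("./config/claude/skills/ab-test-setup"):
        print(f"missing: {rel}")
```

An empty result would mean both references resolve; any path printed is one to either provide or remove from SKILL.md.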
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
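The review does not say which frontmatter key triggered the warning. A rough way to find candidates is to list the top-level keys in SKILL.md and compare them with an allow-list, as in the sketch below. The allow-list here is an assumption and not the validator's actual rule set, and the parser is deliberately simple: it only recognizes unindented `key:` lines between the opening and closing `---` markers.

```python
import re

# ASSUMPTION: an illustrative allow-list; the real validator may accept other keys.
ASSUMED_ALLOWED_KEYS = {
    "name", "description", "license", "allowed-tools", "metadata", "compatibility",
}


def frontmatter_keys(skill_md_text):
    """Return top-level keys from a leading '---' delimited frontmatter block."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no frontmatter block at the top of the file
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing delimiter reached
        # Unindented "key:" only, so nested keys (e.g. under metadata) are skipped.
        match = re.match(r"^([A-Za-z0-9_-]+)\s*:", line)
        if match:
            keys.append(match.group(1))
    return keys


def unknown_keys(skill_md_text, allowed=ASSUMED_ALLOWED_KEYS):
    """Return top-level frontmatter keys that are not in the allow-list."""
    return [key for key in frontmatter_keys(skill_md_text) if key not in allowed]


if __name__ == "__main__":
    # Hypothetical frontmatter; "version" stands in for an unrecognized key.
    example = "---\nname: ab-test-setup\ndescription: Example.\nversion: 1.0.0\n---\n# Body\n"
    print(unknown_keys(example))
```

Following the warning's own advice, any key this flags could be removed or nested under `metadata`. A full YAML parser would be more robust for multi-line values.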
3974caa