When the user wants to plan, design, or implement an A/B test or experiment, or build a growth experimentation program. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," "how long should I run this test," "growth experiments," "experiment velocity," "experiment backlog," "ICE score," "experimentation program," or "experiment playbook." Use this whenever someone is comparing two approaches and wants to measure which performs better, or when they want to build a systematic experimentation practice. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
Score: 90 (87%)

Does it follow best practices?

Impact: Pending (no eval scenarios have been run)
Quality: Passed (no known issues)
Discovery
89%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong skill description with excellent trigger-term coverage and full completeness, explicitly addressing both what the skill does and when to use it. The cross-references to related skills (analytics-tracking, page-cro) are a notable strength for reducing conflict. The main weakness is that the 'what' portion could be more specific about the concrete actions the skill performs beyond the general 'plan, design, or implement.'
Suggestions
Add more specific concrete actions to the 'what' portion, e.g., 'Generates hypotheses, calculates required sample sizes, designs test variants, prioritizes experiment backlogs using ICE scoring, and determines test duration.'
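The ICE scoring mentioned in that suggestion is commonly computed as the average of three 1–10 ratings. A minimal sketch of how a backlog could be prioritized this way (the `Experiment` fields and example entries are hypothetical, not taken from the skill):

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    impact: int      # 1-10: expected effect on the goal metric
    confidence: int  # 1-10: how sure we are the effect will appear
    ease: int        # 1-10: how cheap and fast it is to build and run

def ice_score(e: Experiment) -> float:
    """Average the three 1-10 ratings into one priority score."""
    return (e.impact + e.confidence + e.ease) / 3

backlog = [
    Experiment("Shorter signup form", impact=8, confidence=6, ease=9),
    Experiment("New pricing page hero", impact=7, confidence=4, ease=5),
]
# Sort the backlog so the highest-priority experiment comes first.
for e in sorted(backlog, key=ice_score, reverse=True):
    print(f"{e.name}: {ice_score(e):.1f}")
```

Some teams multiply the three scores instead of averaging them; either way, the point is a consistent, comparable number per backlog item.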
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions planning, designing, and implementing A/B tests, building a growth experimentation program, and comparing approaches to measure performance. However, it doesn't list multiple concrete specific actions like 'create hypothesis documents, calculate sample sizes, design test variants, prioritize experiment backlogs' — it stays at a moderate level of specificity. | 2 / 3 |
| Completeness | Clearly answers both 'what' (plan, design, implement A/B tests, build experimentation programs, compare approaches) and 'when' with extensive explicit trigger terms and use-case guidance. Also includes helpful cross-references to related skills (analytics-tracking, page-cro). | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms including 'A/B test,' 'split test,' 'experiment,' 'variant copy,' 'multivariate test,' 'hypothesis,' 'statistical significance,' 'ICE score,' 'experiment backlog,' and many more variations that users would naturally say. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description carves out a clear niche around A/B testing and experimentation, and explicitly differentiates itself from related skills by referencing analytics-tracking for implementation and page-cro for page-level conversion optimization, reducing conflict risk. | 3 / 3 |
| Total | | 11 / 12 Passed |
Implementation
85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, comprehensive skill that provides highly actionable guidance for A/B testing and experimentation programs. Its main weakness is moderate verbosity — some sections explain concepts Claude already understands (statistical significance definitions, basic test type descriptions) and could be trimmed. The workflow clarity and progressive disclosure are excellent, with proper checklists, validation steps, and external references.
Suggestions
Trim explanations of concepts Claude already knows, such as the definition of statistical significance, what A/B vs MVT tests are, and the general description of client-side vs server-side testing — keep only the actionable tool recommendations and decision criteria.
Condense the 'Common Mistakes' section into the relevant workflow sections (e.g., merge design mistakes into the hypothesis/design sections) rather than repeating guidance in a separate section.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably well-structured but includes some unnecessary explanation Claude already knows (e.g., explaining what statistical significance means, what A/B vs MVT tests are, the peeking problem). The Growth Experimentation Program section adds significant length. Some tables restate common knowledge rather than providing novel, project-specific guidance. | 2 / 3 |
| Actionability | The skill provides concrete, actionable frameworks: a fill-in-the-blank hypothesis template, specific sample size tables, checklists with checkboxes, ICE scoring methodology, an experiment playbook template with specific fields, and clear cadence recommendations. The guidance is specific enough to execute immediately. | 3 / 3 |
| Workflow Clarity | Multi-step processes are clearly sequenced with explicit validation checkpoints: the pre-launch checklist includes tracking verification and QA, the analysis checklist has ordered steps, the experiment loop is numbered, and the cadence section provides clear review cycles. The 'During the Test' DO/Avoid lists serve as guardrails. | 3 / 3 |
| Progressive Disclosure | The skill provides a clear overview with well-signaled one-level-deep references to external files (references/sample-size-guide.md, references/test-templates.md) and related skills (page-cro, analytics-tracking, copywriting). Content is appropriately structured with headers and tables for scanning, and detailed reference material is correctly deferred. | 3 / 3 |
| Total | | 11 / 12 Passed |
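The sample-size tables the review credits live in the skill's references/sample-size-guide.md; for context, tables like these are typically derived from the standard two-proportion power calculation. A sketch using only the Python standard library (the baseline rate and target lift below are illustrative, not values from the skill):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base: float, p_target: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-sided
    two-proportion z-test at the given alpha and power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    delta = p_target - p_base
    n = (z_alpha + z_beta) ** 2 * variance / delta ** 2
    return math.ceil(n)

# Detecting a lift from a 10% to a 12% conversion rate needs roughly
# 3,800+ visitors per variant; divide by daily traffic per variant to
# estimate how long the test must run.
print(sample_size_per_variant(0.10, 0.12))
```

This is the same arithmetic behind the "how long should I run this test" question: required sample size divided by traffic gives duration, which is why low-traffic pages often cannot support small-lift tests.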
Validation
100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 checks passed. Validation for skill structure reported no warnings or errors.