When the user wants to plan, design, or implement an A/B test or experiment, or build a growth experimentation program. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," "how long should I run this test," "growth experiments," "experiment velocity," "experiment backlog," "ICE score," "experimentation program," or "experiment playbook." Use this whenever someone is comparing two approaches and wants to measure which performs better, or when they want to build a systematic experimentation practice. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
90
87%
Does it follow best practices?
Impact: Pending. No eval scenarios have been run.
Passed: No known issues.
Quality
Discovery
89%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong skill description with excellent trigger term coverage and clear completeness, explicitly addressing both what the skill does and when to use it. The cross-references to related skills are a notable strength for reducing conflict. The main weakness is that the 'what' portion could be more specific about concrete actions the skill performs beyond planning/designing/implementing.
Suggestions
Add more specific concrete actions to the capability description, e.g., 'create test hypotheses, calculate required sample sizes, design experiment structures, prioritize experiment backlogs using ICE scoring, analyze statistical significance of results.'
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions planning, designing, and implementing A/B tests, building a growth experimentation program, and comparing approaches to measure performance. However, it doesn't list multiple concrete specific actions like 'create hypothesis documents, calculate sample sizes, analyze test results, prioritize experiment backlogs'; it stays at a moderate level of specificity. | 2 / 3 |
| Completeness | The description clearly answers both 'what' (plan, design, implement A/B tests, build experimentation programs, compare approaches) and 'when' with extensive explicit trigger terms and use-case guidance. It also includes helpful cross-references to related skills (analytics-tracking, page-cro). | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms including 'A/B test,' 'split test,' 'experiment,' 'variant copy,' 'multivariate test,' 'hypothesis,' 'statistical significance,' 'ICE score,' 'experiment backlog,' and many more variations a user would naturally say. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description carves out a clear niche around A/B testing and experimentation, and explicitly differentiates itself from related skills by referencing 'analytics-tracking' for tracking implementation and 'page-cro' for page-level conversion optimization, reducing conflict risk. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
Implementation
85%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, well-structured skill that provides highly actionable guidance for A/B testing and experimentation programs. Its main strength is the concrete frameworks, templates, and reference tables that make it immediately executable. The primary weakness is moderate verbosity—some sections explain concepts Claude already knows (statistical significance basics, core principles of testing) that could be trimmed to save tokens.
Suggestions
Trim the 'Core Principles' section and 'The Peeking Problem' explanation—Claude already understands these concepts. Reduce to brief reminders rather than explanations.
Condense the 'Statistical Significance' explanation (95% confidence, p-values) to a single line since Claude knows statistics well.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is fairly comprehensive but includes some unnecessary explanations Claude already knows (e.g., explaining what statistical significance means, what A/B tests are, the peeking problem). The tables and frameworks are efficient, but sections like 'Core Principles' explain basics that Claude would already understand. The Growth Experimentation Program section adds substantial length but provides genuinely useful structured frameworks. | 2 / 3 |
| Actionability | The skill provides highly concrete, actionable guidance: specific hypothesis templates with fill-in-the-blank structure, sample size reference tables with exact numbers, ICE scoring methodology, pre-launch checklists, analysis checklists, experiment playbook templates with specific fields, and clear metrics targets. The guidance is specific enough to execute immediately. | 3 / 3 |
| Workflow Clarity | Multi-step processes are clearly sequenced with explicit validation checkpoints: the pre-launch checklist includes tracking verification and QA, the experiment loop is numbered and cyclical, the analysis checklist has ordered steps with decision gates (e.g., 'Reach sample size? If not, result is preliminary'), and the cadence section provides clear feedback loops at weekly/bi-weekly/monthly/quarterly intervals. Guardrail metrics serve as explicit stop conditions. | 3 / 3 |
| Progressive Disclosure | The skill is well-structured with clear sections and appropriate references to external files (references/sample-size-guide.md, references/test-templates.md) that are one level deep and clearly signaled. It also references related skills (page-cro, analytics-tracking, copywriting) and checks for existing context files. The main content serves as a comprehensive overview without burying critical information in nested references. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
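The decision gate quoted in the Workflow Clarity row ('Reach sample size? If not, result is preliminary') could be sketched roughly as follows. This is an illustration only, not code from the skill or its reference files: it assumes the common normal-approximation formula for a two-sided two-proportion test, default values of 95% confidence and 80% power, and the function names `required_sample_size` and `result_status` are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion z-test).

    Assumption: normal approximation; the skill's own tables may use a
    different method or different defaults.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # expected variant conversion rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def result_status(visitors_per_variant, needed_per_variant):
    """Hypothetical gate: treat results as preliminary until the sample size is reached."""
    return "final" if visitors_per_variant >= needed_per_variant else "preliminary"


if __name__ == "__main__":
    needed = required_sample_size(baseline_rate=0.05, relative_lift=0.20)
    print(needed, result_status(3000, needed))
```

Smaller expected lifts drive the required sample size up sharply, which is why a gate like this is usually checked before any result is read as conclusive.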
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
1bcff9f