When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
Overall quality: 66%
Does it follow best practices? Passed; no known issues.
Impact: Pending. No eval scenarios have been run.
Optimize this skill with Tessl:
`npx tessl skill review --optimize ./config/claude/skills/ab-test-setup/SKILL.md`

Quality
Discovery
62%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description excels at trigger term coverage and distinctiveness, including a helpful boundary reference to a related skill. However, it is notably weak on specificity—it fails to describe what the skill actually does beyond vague verbs like 'plan, design, implement.' Adding concrete capabilities (e.g., 'generates hypothesis statements, calculates sample sizes, creates variant copy') would significantly improve it.
Suggestions
Add specific concrete actions the skill performs, e.g., 'Generates hypothesis statements, calculates required sample sizes, creates variant copy, and designs experiment frameworks.'
Replace or supplement the vague 'plan, design, or implement' with enumerated deliverables or outputs the skill produces.
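To make the suggestions concrete, here is one way the frontmatter description could be rewritten. This is an illustrative sketch only, not the skill's actual metadata; the field names follow the common SKILL.md frontmatter convention, and the enumerated deliverables are taken from the suggestions above.

```yaml
# Illustrative sketch; the skill's real frontmatter may differ.
name: ab-test-setup
description: >
  Generates hypothesis statements, calculates required sample sizes,
  creates variant copy, and designs experiment frameworks. Use when the
  user wants to plan, design, or implement an A/B test, or mentions
  "A/B test," "split test," "experiment," "test this change," "variant
  copy," "multivariate test," or "hypothesis." For tracking
  implementation, see analytics-tracking.
```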
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description mentions 'plan, design, or implement an A/B test or experiment' but these are very high-level actions without concrete specifics. It doesn't list what the skill actually does—no specific outputs, deliverables, or capabilities are named beyond the generic verbs 'plan, design, implement.' | 1 / 3 |
| Completeness | The 'when' is very well covered with explicit trigger terms and a 'Use when' equivalent clause. However, the 'what' is weak—it says 'plan, design, or implement' but doesn't describe what the skill actually produces or its concrete capabilities. The boundary note about analytics-tracking is a nice touch but doesn't compensate for the missing 'what.' | 2 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms: 'A/B test,' 'split test,' 'experiment,' 'test this change,' 'variant copy,' 'multivariate test,' 'hypothesis.' These are terms users would naturally use when requesting this type of work. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description carves out a clear niche around A/B testing and experimentation, with distinct trigger terms. It even explicitly delineates a boundary with the analytics-tracking skill, reducing conflict risk. | 3 / 3 |
| Total | | 9 / 12 (Passed) |
Implementation
70%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured skill that excels in workflow clarity and progressive disclosure, with clear sequential steps and appropriate references to deeper materials. Its main weaknesses are moderate verbosity (explaining concepts Claude already understands like statistical significance and the peeking problem) and limited actionability—it's more of a procedural guide than an executable one, which is somewhat appropriate for a planning/design skill but could still benefit from more concrete examples.
Suggestions
Remove explanations of concepts Claude already knows (e.g., what statistical significance means, why peeking is bad) and replace with just the actionable rule (e.g., 'Require 95% confidence; never stop before reaching pre-calculated sample size').
Add a concrete, end-to-end worked example showing a complete test plan from hypothesis through analysis decision, rather than scattered partial examples across sections.
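As a sketch of the kind of executable example the suggestion calls for, here is a minimal sample-size calculation for a two-proportion z-test using only the Python standard library. The 10% baseline and 12% target rates are made-up numbers for illustration:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1: float, p2: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect a change from rate p1 to p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired statistical power
    p_bar = (p1 + p2) / 2                           # pooled rate under H0
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p1 - p2) ** 2)
    return math.ceil(n)

# Example: detect a lift from a 10% to a 12% conversion rate
n = sample_size_per_variant(0.10, 0.12)
print(n)  # roughly 3,800 to 3,900 visitors per variant
```

Note how a smaller expected lift drives the required sample size up sharply, which is exactly the trade-off a test plan should surface before launch.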
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably well-organized but includes some unnecessary explanations Claude already knows (e.g., explaining what statistical significance means, explaining the peeking problem in detail, defining test types). Some sections like 'Core Principles' state obvious experimentation concepts. However, the tables and quick references are efficient. | 2 / 3 |
| Actionability | The skill provides structured frameworks (hypothesis template, checklists, tables) which are useful, but lacks executable code or concrete implementation examples. The implementation section mentions tools but doesn't show actual code for setting up a test in any of them. The guidance is more conceptual/procedural than copy-paste ready. | 2 / 3 |
| Workflow Clarity | The workflow is clearly sequenced from initial assessment through hypothesis formation, test design, sample size calculation, implementation, running, and analysis. The pre-launch checklist with checkboxes, DO/DON'T lists during the test, and the analysis checklist provide explicit validation checkpoints and a clear sequential process. | 3 / 3 |
| Progressive Disclosure | The skill provides a clear overview with well-signaled one-level-deep references to detailed materials (references/sample-size-guide.md, references/test-templates.md). Content is appropriately split into scannable sections with tables, and related skills are clearly linked at the bottom. | 3 / 3 |
| Total | | 10 / 12 (Passed) |
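To illustrate the recommendation about replacing conceptual explanations with an actionable rule, the anti-peeking guard could be encoded directly in the analysis step. A minimal sketch, assuming a two-proportion z-test and a pre-calculated `required_n` (function and parameter names are hypothetical):

```python
import math
from statistics import NormalDist

def analyze(conv_a: int, n_a: int, conv_b: int, n_b: int,
            required_n: int, alpha: float = 0.05) -> str:
    """Two-proportion z-test that refuses to report before the
    pre-calculated sample size is reached (no peeking)."""
    if min(n_a, n_b) < required_n:
        return "keep running: pre-calculated sample size not yet reached"
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return "significant" if p_value < alpha else "not significant"
```

Encoding the rule as a hard guard rather than prose is what moves the skill from procedural to executable guidance.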
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |
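The single warning can typically be cleared by nesting custom frontmatter fields under `metadata`. A hypothetical sketch, since the report does not name the offending key(s):

```yaml
# Before: an unrecognized top-level key triggers frontmatter_unknown_keys
# owner: growth-team        <- hypothetical custom key

# After: custom fields live under metadata
metadata:
  owner: growth-team
```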