Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.
Overall score: 68

- Does it follow best practices? 51%
- Impact: 100% (1.09x average score across 3 eval scenarios)
- Passed: no known issues

Optimize this skill with Tessl: `npx tessl skill review --optimize ./skills/ab-test-setup/SKILL.md`

Quality
Discovery
40%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description identifies a clear niche (A/B test setup) and mentions key structural elements, giving it reasonable distinctiveness. However, it lacks an explicit 'Use when...' clause, which is critical for skill selection, and the specific capabilities could be more concretely enumerated. Adding trigger guidance and common user-facing keywords would significantly improve its effectiveness.
Suggestions
Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to set up an A/B test, design an experiment, or plan a split test.'
Include common keyword variations such as 'experiment', 'split test', 'variant testing', 'conversion tracking', and 'sample size'.
List more concrete actions, e.g., 'Guides users through defining hypotheses, selecting success metrics, calculating sample sizes, and validating execution readiness before launch.'
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (A/B testing) and mentions some specific elements (hypothesis, metrics, execution readiness), but doesn't list concrete actions beyond 'setting up'. It describes structure rather than specific capabilities like 'define hypotheses, configure metrics, validate sample sizes'. | 2 / 3 |
| Completeness | Describes what the skill does (structured guide for A/B test setup with mandatory gates) but has no explicit 'Use when...' clause or equivalent trigger guidance. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' itself is only moderately clear, so this scores a 1. | 1 / 3 |
| Trigger Term Quality | Includes 'A/B tests', which is a strong natural keyword, plus 'hypothesis' and 'metrics', which users might mention. However, it misses common variations like 'experiment', 'split test', 'variant testing', 'conversion', or 'statistical significance'. | 2 / 3 |
| Distinctiveness / Conflict Risk | A/B testing with mandatory gates for hypothesis, metrics, and execution readiness is a fairly distinct niche. It's unlikely to conflict with other skills given the specific domain of experimentation setup with structured checkpoints. | 3 / 3 |
| Total | | 8 / 12 Passed |
Implementation
62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured process guide with strong workflow clarity, featuring explicit gates and refusal conditions that make the A/B test setup process safe and rigorous. Its main weaknesses are the lack of concrete executable examples (no formulas, templates, or filled-out examples) and some unnecessary philosophical framing that consumes tokens without adding actionable value. The content would benefit from concrete artifacts like a sample hypothesis statement, a sample size calculation, and references to supplementary files.
Suggestions
Add a concrete, filled-out example of a complete hypothesis (e.g., 'Observation: Cart abandonment is 68%. Change: Add progress indicator. Expectation: Reduce abandonment by 5pp. Audience: Mobile users. Metric: Cart completion rate.').
Include an executable sample size calculation formula or code snippet (e.g., Python using statsmodels) rather than just listing the inputs needed.
Remove the 'Final Reminder' section and trim motivational language—Claude doesn't need encouragement to follow instructions.
Extract detailed sections (Metrics Definition, Analysis Discipline, Documentation template) into referenced supplementary files to improve progressive disclosure.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably structured but includes motivational/philosophical content ('A/B testing is not about proving ideas right...') and explanations of concepts Claude already knows (what A/B, A/B/n, MVT tests are). The 'Final Reminder' and some framing text could be cut without losing actionability. | 2 / 3 |
| Actionability | The skill provides clear checklists and gates, which are actionable as process guidance. However, it lacks concrete executable examples: no sample size calculation code or formulas, no specific analytics tool commands, no filled-out hypothesis template. It describes what to do but rarely shows how, concretely. | 2 / 3 |
| Workflow Clarity | The multi-step process is clearly sequenced with numbered stages, explicit hard gates (Hypothesis Lock, Execution Readiness Gate), and clear refusal/stop conditions. Validation checkpoints are well-defined with feedback loops (e.g., 'If assumptions are weak → warn and recommend delaying'). | 3 / 3 |
| Progressive Disclosure | The content is well-organized with clear section headers, but it's a monolithic document with no references to external files for detailed topics (e.g., sample size calculation methods, example test records, metric selection guides). Some sections, like Metrics Definition and Analysis, could be split out. | 2 / 3 |
| Total | | 9 / 12 Passed |
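The analysis-stage checkpoints the review praises could be backed by an equally concrete significance check. A hedged sketch using statsmodels' two-proportion z-test, with made-up conversion counts for illustration:

```python
# Sketch of a post-test significance check for two variants.
# Conversion counts are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

conversions = [320, 370]     # control, variant
observations = [1000, 1000]  # visitors per variant

# Two-sided z-test for equality of the two conversion rates
z_stat, p_value = proportions_ztest(count=conversions, nobs=observations)

if p_value < 0.05:
    print(f"Significant difference (p = {p_value:.4f})")
else:
    print(f"No significant difference (p = {p_value:.4f})")
```

A snippet like this, placed alongside the skill's Analysis Discipline section, would turn its stop conditions into something an agent can actually run.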
Validation
90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 Passed |