Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.
48
51%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/ab-test-setup/SKILL.md
Quality
Discovery
40%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description identifies a clear niche (A/B test setup) and mentions key structural elements, giving it reasonable distinctiveness. However, it lacks an explicit 'Use when...' clause, which is critical for skill selection, and the specific actions/capabilities are underspecified. Adding trigger guidance and more concrete action verbs would significantly improve it.
Suggestions
Add an explicit 'Use when...' clause, e.g., 'Use when the user wants to plan, design, or set up an A/B test, split test, or experiment.'
Include common trigger term variations such as 'split test', 'experiment design', 'variant testing', 'conversion optimization'.
List more specific concrete actions, e.g., 'Guides users through defining hypotheses, selecting success metrics, calculating sample sizes, and validating execution readiness before launch.'
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (A/B testing) and mentions some specific elements (hypothesis, metrics, execution readiness), but doesn't list concrete actions beyond 'setting up'. It describes structure rather than specific capabilities like 'define hypotheses, configure metrics, validate sample sizes'. | 2 / 3 |
| Completeness | Describes what the skill does (structured guide for A/B test setup with mandatory gates) but has no explicit 'Use when...' clause or equivalent trigger guidance. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' itself is only moderately clear, so this scores at 1. | 1 / 3 |
| Trigger Term Quality | Includes 'A/B tests' which is a strong natural keyword, plus 'hypothesis' and 'metrics' which users might mention. However, it misses common variations like 'split test', 'experiment', 'variant testing', 'conversion', or 'statistical significance'. | 2 / 3 |
| Distinctiveness Conflict Risk | A/B testing with mandatory gates for hypothesis, metrics, and execution readiness is a fairly distinct niche. It's unlikely to conflict with other skills given the specific domain of experimentation setup with structured checkpoints. | 3 / 3 |
| Total | | 8 / 12 Passed |
Implementation
62%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured procedural skill with strong workflow clarity, featuring explicit hard gates and refusal conditions that make the process safe and bounded. Its main weaknesses are moderate verbosity (motivational content, concept explanations Claude doesn't need) and a lack of concrete examples — a sample hypothesis, example metric definitions, or a sample size calculation would significantly improve actionability. The monolithic structure is acceptable but could benefit from splitting detailed sections into reference files.
Suggestions
Add a concrete example hypothesis that passes the quality checklist, showing the exact format expected (observation, change, direction, audience, success criteria).
Include a sample size calculation example — either a formula or a specific tool/command (e.g., a Python snippet using statsmodels) to make the sample size step executable.
Remove the 'Final Reminder' motivational section and trim the 'Purpose & Scope' to a single line — these consume tokens without adding actionable guidance.
Consider splitting the 'Analyzing Results' and 'Documentation & Learning' sections into a separate reference file to keep the main skill focused on the setup workflow.
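The sample size suggestion above could be met with something like the following. This is a minimal sketch, not the skill's own method: it assumes a conversion-rate metric, a two-sided two-proportion z-test, and equal traffic per variant, and it uses only the Python standard library rather than statsmodels. The function name `sample_size_per_variant` and its parameters are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, absolute_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    baseline_rate: control conversion rate, e.g. 0.10 for 10%.
    absolute_lift: smallest absolute change worth detecting, e.g. 0.02.
    Assumes equal allocation between control and variant.
    """
    p1 = baseline_rate
    p2 = baseline_rate + absolute_lift
    if not (0 < p1 < 1 and 0 < p2 < 1) or absolute_lift == 0:
        raise ValueError("rates must lie strictly between 0 and 1, and the lift must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    p_bar = (p1 + p2) / 2                          # pooled rate under the null hypothesis

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Illustrative inputs only: 10% baseline, detect a 2-point absolute lift.
    print(sample_size_per_variant(0.10, 0.02))
```

Halving the detectable lift roughly quadruples the required sample, which is why the sample size step belongs before launch rather than after.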
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is reasonably structured but includes some unnecessary padding: the 'Final Reminder' motivational section, the 'Purpose & Scope' preamble, and the 'When to Use' / 'Limitations' boilerplate at the end add little value. Claude already understands A/B testing concepts like peeking and statistical power; the skill should focus on the procedural gates rather than explaining why they matter. | 2 / 3 |
| Actionability | The skill provides clear checklists and gate conditions, which are actionable for a process-oriented skill. However, it lacks concrete examples: no sample hypothesis, no sample size calculation formula or tool command, no example metric definition. The guidance is specific enough to follow but not copy-paste ready in any dimension. | 2 / 3 |
| Workflow Clarity | The multi-step process is clearly sequenced with numbered phases, two explicit hard gates (Hypothesis Lock and Execution Readiness Gate), and clear refusal conditions. The workflow includes validation checkpoints and feedback loops (e.g., 'If assumptions are weak → warn and recommend delaying'; 'If any item is missing, stop and resolve it'). | 3 / 3 |
| Progressive Disclosure | The content is a single monolithic file with no references to supporting documents. At ~150 lines covering hypothesis design, metrics definition, sample size planning, execution monitoring, analysis, and documentation, some sections (e.g., detailed analysis discipline, documentation templates) could be split into separate reference files. However, for a process-oriented skill without a bundle, the section headers provide reasonable navigation. | 2 / 3 |
| Total | | 9 / 12 Passed |
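To make the Actionability gap concrete, a hypothesis gate with a worked example might look roughly like this. It is a sketch under assumptions: the five fields come from the suggestion list in this review (observation, change, direction, audience, success criteria), while the `Hypothesis` class, the `hypothesis_lock` function, and every value in the example are hypothetical and not taken from the skill itself.

```python
from dataclasses import dataclass, fields


@dataclass
class Hypothesis:
    # Field names follow the format named in this review; the skill's real checklist may differ.
    observation: str
    change: str
    direction: str
    audience: str
    success_criteria: str


def missing_fields(hypothesis):
    """Return the names of fields that are empty or whitespace-only."""
    return [f.name for f in fields(hypothesis) if not getattr(hypothesis, f.name).strip()]


def hypothesis_lock(hypothesis):
    """Hard gate: stop unless every field is filled in."""
    missing = missing_fields(hypothesis)
    if missing:
        raise ValueError("Hypothesis incomplete, resolve before continuing: " + ", ".join(missing))
    return hypothesis


# Illustrative values only, invented for this example.
example = Hypothesis(
    observation="Many visitors leave the pricing page without selecting a plan",
    change="Show annual pricing by default instead of monthly",
    direction="Plan-selection rate increases",
    audience="New visitors on desktop",
    success_criteria="Absolute lift of at least 2 points in plan-selection rate",
)

if __name__ == "__main__":
    hypothesis_lock(example)
    print("Hypothesis locked")
```

Raising on a missing field mirrors the 'If any item is missing, stop and resolve it' behaviour quoted in the Workflow Clarity row.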
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
8e8aa13