ab-test-setup

Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.

Quality

—

Does it follow best practices?

Impact

—

No eval scenarios have been run

Security

Advisory

Suggest reviewing before use

Quality

Content

62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is a well-sequenced, gate-driven workflow with strong validation checkpoints and a notably concrete tracking-verification section, but it is monolithic with no progressive disclosure, repeats a few principles, and leaves the sample-size computation step abstract.

Suggestions

Make the sample-size step actionable: provide the formula or a concrete command/library call for computing required sample size from baseline rate, MDE, significance, and power.

Trim repeated principles (e.g. 'no peeking' restated in multiple sections) and remove the generic 'When to Use' boilerplate line that restates the overview.

Consider splitting the tracking-verification and analysis sections into a reference file linked from SKILL.md to introduce one-level-deep progressive disclosure.
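
As an illustration of the first suggestion, here is a minimal sketch of a sample-size calculation. It is not taken from the skill itself. It assumes a two-sided test on a conversion rate, equal allocation between two variants, and an MDE expressed as an absolute difference in rates, and it uses the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_variant` and its parameter names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-proportion test.

    Assumptions (not from the reviewed skill): two-sided test, equal
    allocation, and mde_abs given as an absolute change in the rate
    (e.g. 0.02 for a lift from 10% to 12%).
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must lie strictly in (0, 1) and the MDE must be non-zero")

    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Pooled rate is used for the variance under the null hypothesis.
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Example inputs: 10% baseline, 2 percentage point MDE, 5% significance, 80% power.
    print(sample_size_per_variant(0.10, 0.02))
```

With these example inputs the result is a few thousand users per variant, and halving the MDE roughly quadruples it, which is why the skill's "define upfront" step benefits from an explicit calculation.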

Dimension	Reasoning	Score
Conciseness	The body consists mostly of lean bullet lists that avoid padding with concepts Claude already knows, but it repeats principles across sections (e.g. 'no peeking' appears multiple times) and ends with boilerplate 'When to Use'/'Limitations' filler ('This skill is applicable to execute the workflow or actions described in the overview') that could be trimmed.	2 / 3
Actionability	The tracking-verification section is concrete with executable thresholds ('within 30 seconds', '5+ events per variant', '±5% of configured allocation'), but the core analytic step only says 'Define upfront: Baseline rate, MDE...' without any formula or tool for computing sample size, leaving a key action incomplete.	2 / 3
Workflow Clarity	It presents a clearly sequenced, numbered gate process (1️⃣–8️⃣) with explicit hard gates, 'Do NOT proceed until confirmed' checkpoints, and feedback loops ('If any of the above fails, stop and resolve it before Gate 8'), matching the clear-sequence-with-validation anchor for score 3.	3 / 3
Progressive Disclosure	There are no bundle files or external references at all, so the skill is a single ~250-line monolithic document with content (tracking verification, analysis, documentation) that could be split out but is inline; it is well-organized yet lacks any one-level-deep pointers to deeper material.	2 / 3
	Total	9 / 12 Passed
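
Two of the tracking-verification thresholds quoted in the Actionability row ('5+ events per variant' and '±5% of configured allocation') lend themselves to an automated check. The sketch below is one possible reading, not the skill's own implementation: it assumes the ±5% tolerance means absolute percentage points of traffic share, and the function name `check_tracking` and its input shapes are hypothetical.

```python
def check_tracking(event_counts, configured_allocation, min_events=5, tolerance=0.05):
    """Return a list of human-readable problems; an empty list means the checks pass.

    event_counts: observed events per variant, e.g. {"control": 510, "treatment": 490}
    configured_allocation: intended traffic share per variant, e.g. {"control": 0.5, ...}
    tolerance: assumed to be an absolute difference in traffic share (0.05 = 5 points).
    """
    total = sum(event_counts.values())
    problems = []
    for variant, expected_share in configured_allocation.items():
        count = event_counts.get(variant, 0)
        if count < min_events:
            problems.append(f"{variant}: only {count} events, need at least {min_events}")
        observed_share = count / total if total else 0.0
        if abs(observed_share - expected_share) > tolerance:
            problems.append(
                f"{variant}: observed share {observed_share:.1%} "
                f"vs configured {expected_share:.1%}"
            )
    return problems


if __name__ == "__main__":
    issues = check_tracking(
        {"control": 510, "treatment": 490},
        {"control": 0.5, "treatment": 0.5},
    )
    print("OK" if not issues else "\n".join(issues))
```

Any non-empty result would correspond to the skill's instruction to stop and resolve the failure before proceeding to Gate 8.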

Description

57%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is specific to a clear A/B-testing niche and identifies the right components, but it omits any explicit 'Use when' trigger guidance and offers only a thin set of natural trigger terms, which limits its completeness and trigger coverage.

Suggestions

Add an explicit 'Use when...' clause naming concrete trigger phrases, e.g. 'Use when designing A/B tests, experiments, or split tests, or when the user mentions hypotheses, sample size, or test guardrails.'

Broaden trigger terms to include natural variations users would say: 'A/B testing', 'experimentation', 'split test', 'conversion rate optimization', 'experiment design'.

List a couple more concrete actions (e.g. 'define hypotheses, select metrics, compute sample size, verify tracking') to push specificity toward a full action list.

Dimension	Reasoning	Score
Specificity	It names the domain ('setting up A/B tests') and references specific components ('hypothesis, metrics, and execution readiness'), but does not enumerate multiple concrete actions, placing it at 'names domain and some actions, not comprehensive' rather than the full list of score 3.	2 / 3
Completeness	It clearly states what the skill does, but contains no 'Use when...' clause or equivalent explicit trigger guidance for when to invoke it; per the judging guideline, a missing explicit trigger caps completeness at 2.	2 / 3
Trigger Term Quality	It includes a natural term ('A/B tests') a user would say, but lacks common variations like 'A/B testing', 'experimentation', 'split test', or 'conversion rate optimization', so it is 'some relevant keywords but missing common variations'.	2 / 3
Distinctiveness / Conflict Risk	'A/B test setup with mandatory gates for hypothesis, metrics, and execution readiness' carves a clear niche that is unlikely to trigger for unrelated skills, satisfying the distinct-triggers anchor for score 3.	3 / 3
	Total	9 / 12 Passed

Validation

93%

Showing warnings and errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	15 / 16 Passed

Repository: boisenoise/skills-collections
Commit: f36337d

Reviewed: 2 days ago
