A discipline for designing experiments (A/B tests, multivariate, holdouts) so the results actually answer the question you asked. Hypothesis writing, sample size, duration, segment analysis, interpretation, decision-making, and the common failure modes that produce confidently wrong shipping decisions.
52
58%: Does it follow best practices?
Impact: — (No eval scenarios have been run)
Passed: No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/experiment-design/SKILL.md
Quality
Discovery: 82%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong description with excellent specificity and natural trigger terms covering the experimentation domain comprehensively. Its main weakness is the absence of an explicit 'Use when...' clause, which would help Claude know precisely when to select this skill. The second-person 'you' in 'the question you asked' is a minor voice issue but doesn't significantly harm clarity.
Suggestions
Add an explicit 'Use when...' clause, e.g., 'Use when the user asks about designing A/B tests, calculating sample sizes, interpreting experiment results, or making shipping decisions based on test data.'
Switch from second person ('the question you asked') to third person voice to align with style guidelines, e.g., 'so results actually answer the intended question.'
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: designing experiments (A/B tests, multivariate, holdouts), hypothesis writing, sample size calculation, duration planning, segment analysis, interpretation, and decision-making. Also mentions failure modes. | 3 / 3 |
| Completeness | Clearly answers 'what does this do' with detailed capabilities, but lacks an explicit 'Use when...' clause or equivalent trigger guidance. The 'when' is only implied through the domain description, which per the rubric caps completeness at 2. | 2 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'A/B tests', 'multivariate', 'holdouts', 'hypothesis', 'sample size', 'duration', 'segment analysis', 'experiments'. These are terms practitioners naturally use when seeking help with experimentation. | 3 / 3 |
| Distinctiveness / Conflict Risk | Occupies a clear niche around experimentation and A/B testing with distinct trigger terms like 'A/B tests', 'holdouts', 'sample size', and 'multivariate'. Unlikely to conflict with other skills unless there's a dedicated statistics skill. | 3 / 3 |
| Total | | 11 / 12 Passed |
Implementation: 35%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a knowledgeable and well-structured guide to experiment design, but it is far too verbose for a skill file—it reads more like a blog post or internal wiki article than a concise instruction set for Claude. It explains many concepts Claude already understands (statistics, p-hacking, network effects) at length rather than stating actionable rules concisely. The progressive disclosure architecture is well-designed but entirely undelivered since no bundle files exist.
Suggestions
Cut the content by 60-70%: remove explanations of why things matter (Claude knows) and reduce each section to its actionable rule, constraint, or decision criterion. For example, the peeking section needs only: 'Daily peeking inflates false positive rate to 30%+. Use sequential testing (mSPRT) if platform supports it. Otherwise pre-commit analysis date and do not decide early.'
Provide the referenced bundle files (hypothesis-templates.md, sample-size-tables.md, results-interpretation-checklist.md, common-failures.md, pre-experiment-readiness-checklist.md) or inline the most critical ones—the pre-experiment checklist and results checklist are essential workflow artifacts that are currently missing.
Add a concrete, copy-paste-ready experiment design template (e.g., a markdown template with fields for hypothesis, primary metric, guardrails, MDE, duration, segments, decision rule) inline in the SKILL.md rather than deferring to a nonexistent reference file.
Remove the philosophical/motivational framing (intro paragraphs about 'sloppy experimentation', closing paragraphs about 'saying I don't know') which consume tokens without adding actionable guidance.
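The condensed peeking rule quoted in the first suggestion above can be checked with a small simulation. The following is a minimal sketch, not part of the reviewed skill: it runs A/A tests (no true difference), applies a two-proportion z-test after every simulated day, and compares how often any daily look crosses the significance threshold against how often the single final look does. The function names, traffic numbers, and baseline rate are all assumptions chosen for illustration; the exact inflation depends on the number of looks and the traffic per look.

```python
import math
import random


def two_sided_p_value(conv_a, conv_b, n_per_arm):
    """Two-sided p-value of a pooled two-proportion z-test (equal arm sizes)."""
    pooled = (conv_a + conv_b) / (2 * n_per_arm)
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to test
    z = (conv_a - conv_b) / n_per_arm / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_peeking(runs=500, days=30, users_per_day=100,
                     rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with daily peeking, rate with one final look).

    Both arms share the same true conversion rate, so every "significant"
    result is a false positive.
    """
    rng = random.Random(seed)
    peek_hits = 0
    final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        any_look_significant = False
        p = 1.0
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            p = two_sided_p_value(conv_a, conv_b, n)
            if p < alpha:
                any_look_significant = True  # a peeker would have stopped here
        peek_hits += any_look_significant
        final_hits += p < alpha  # decision made only on the pre-committed date
    return peek_hits / runs, final_hits / runs


if __name__ == "__main__":
    peek_fpr, final_fpr = simulate_peeking()
    print(f"daily peeking: {peek_fpr:.1%}  single final look: {final_fpr:.1%}")
```

Under these assumed parameters the peeking rate should land well above the nominal 5%, while the single-look rate stays close to it, which is the behavior the pre-committed analysis date is meant to protect.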
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~3500+ words. It extensively explains concepts Claude already knows (what p-hacking is, what novelty effects are, what network effects are, basic statistics). Entire paragraphs explain why things matter rather than just stating the actionable rule. The 'What NOT to A/B test' section spends hundreds of words explaining obvious things like 'don't A/B test legal requirements.' The introductory framing ('The default state of experimentation in most companies is sloppy...') and closing philosophical paragraphs add no actionable value. | 1 / 3 |
| Actionability | The skill provides concrete frameworks (four-part hypothesis structure, three-bucket decision model, pre-commitment checklist items) and good/bad hypothesis examples. However, it lacks executable artifacts: no code, no templates inline, no sample size formulas, no concrete decision-rule syntax. The actionable content is described rather than provided directly, with templates and checklists deferred to reference files that don't exist in the bundle. | 2 / 3 |
| Workflow Clarity | The 12-consideration framework provides a clear sequence, and the three-bucket decision model with ranked inconclusive resolution paths is well-structured. However, there is no explicit step-by-step workflow with validation checkpoints for the full experiment lifecycle. The pre-experiment readiness checklist and results interpretation checklist are referenced but not provided inline or in bundle files, leaving gaps in the actual workflow. | 2 / 3 |
| Progressive Disclosure | The skill references seven well-organized reference files and three related skills, with clear one-level-deep navigation. However, none of the referenced bundle files actually exist, meaning the progressive disclosure structure is promised but not delivered. The main SKILL.md also contains far too much inline content that should be in the reference files (e.g., the detailed network effects discussion, the full peeking math explanation). | 2 / 3 |
| Total | | 7 / 12 Passed |
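The Actionability row above notes that the skill contains no sample size formula. As an illustration of the kind of executable artifact that would close that gap, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions, n per arm = (z_{1-α/2} + z_{power})² · (p₁(1−p₁) + p₂(1−p₂)) / (p₂ − p₁)². The function names, defaults, and the two-arm equal-split assumption are hypothetical and not taken from the reviewed skill.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Users needed in each arm to detect an absolute lift of `mde_abs`.

    Assumes a two-sided test on conversion rates, equal arm sizes, and the
    normal approximation to the binomial.
    """
    if not 0 < baseline < 1 or not 0 < baseline + mde_abs < 1:
        raise ValueError("baseline and baseline + mde_abs must lie in (0, 1)")
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def days_required(n_per_arm, daily_eligible_users, arms=2):
    """Whole days of traffic needed to fill every arm (assumed even split)."""
    return math.ceil(n_per_arm * arms / daily_eligible_users)


if __name__ == "__main__":
    n = sample_size_per_arm(baseline=0.10, mde_abs=0.01)
    print(n, "users per arm;", days_required(n, daily_eligible_users=2000), "days")
```

Because the required sample grows with the inverse square of the minimum detectable effect, halving the MDE roughly quadruples the traffic needed, which is usually the constraint that decides whether an experiment is worth running at all.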
Validation: 90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
8e70d03