
experiment-design

A discipline for designing experiments (A/B tests, multivariate, holdouts) so the results actually answer the question you asked. Hypothesis writing, sample size, duration, segment analysis, interpretation, decision-making, and the common failure modes that produce confidently wrong shipping decisions.

52

Quality

58%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/experiment-design/SKILL.md

Quality

Discovery

82%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a strong description with excellent specificity and natural trigger terms covering the experimentation domain comprehensively. Its main weakness is the absence of an explicit 'Use when...' clause, which would help Claude know precisely when to select this skill. The second-person 'you' in 'the question you asked' is a minor voice issue but doesn't significantly harm clarity.

Suggestions

Add an explicit 'Use when...' clause, e.g., 'Use when the user asks about designing A/B tests, calculating sample sizes, interpreting experiment results, or making shipping decisions based on test data.'

Switch from second person ('the question you asked') to third person voice to align with style guidelines, e.g., 'so results actually answer the intended question.'

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Lists multiple specific concrete actions: designing experiments (A/B tests, multivariate, holdouts), hypothesis writing, sample size calculation, duration planning, segment analysis, interpretation, and decision-making. Also mentions failure modes. | 3 / 3 |
| Completeness | Clearly answers 'what does this do' with detailed capabilities, but lacks an explicit 'Use when...' clause or equivalent trigger guidance. The 'when' is only implied through the domain description, which per the rubric caps completeness at 2. | 2 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'A/B tests', 'multivariate', 'holdouts', 'hypothesis', 'sample size', 'duration', 'segment analysis', 'experiments'. These are terms practitioners naturally use when seeking help with experimentation. | 3 / 3 |
| Distinctiveness Conflict Risk | Occupies a clear niche around experimentation and A/B testing with distinct trigger terms like 'A/B tests', 'holdouts', 'sample size', and 'multivariate'. Unlikely to conflict with other skills unless there's a dedicated statistics skill. | 3 / 3 |
| Total | | 11 / 12 |

Passed

Implementation

35%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a knowledgeable and well-structured guide to experiment design, but it is far too verbose for a skill file: it reads more like a blog post or internal wiki article than a concise instruction set for Claude. It explains at length many concepts Claude already understands (statistics, p-hacking, network effects) instead of stating actionable rules concisely. The progressive disclosure architecture is well designed but undelivered, because none of the referenced bundle files exist.

Suggestions

Cut the content by 60-70%: remove explanations of why things matter (Claude knows) and reduce each section to its actionable rule, constraint, or decision criterion. For example, the peeking section needs only: 'Daily peeking inflates false positive rate to 30%+. Use sequential testing (mSPRT) if platform supports it. Otherwise pre-commit analysis date and do not decide early.'

Provide the referenced bundle files (hypothesis-templates.md, sample-size-tables.md, results-interpretation-checklist.md, common-failures.md, pre-experiment-readiness-checklist.md) or inline the most critical ones. The pre-experiment checklist and results checklist are essential workflow artifacts, and both are currently missing.

Add a concrete, copy-paste-ready experiment design template (e.g., a markdown template with fields for hypothesis, primary metric, guardrails, MDE, duration, segments, decision rule) inline in the SKILL.md rather than deferring to a nonexistent reference file.

Remove the philosophical/motivational framing (intro paragraphs about 'sloppy experimentation', closing paragraphs about 'saying I don't know') which consume tokens without adding actionable guidance.
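As a rough illustration of the inline template suggested above, here is a minimal Python sketch that holds the listed fields (hypothesis, primary metric, guardrails, MDE, duration, segments, decision rule) and renders them as a checklist. The class name `ExperimentDesign`, the method `to_markdown`, and every sample value are assumptions made for this example; none of them come from the reviewed skill.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ExperimentDesign:
    """Hypothetical pre-commitment record; one field per item in the suggestion."""

    hypothesis: str
    primary_metric: str
    guardrails: list[str]
    mde: str  # minimum detectable effect, stated before launch
    duration: str
    segments: list[str]
    decision_rule: str

    def to_markdown(self) -> str:
        """Render the design as a flat checklist that can be pasted into a doc."""
        rows = [
            ("Hypothesis", self.hypothesis),
            ("Primary metric", self.primary_metric),
            ("Guardrails", ", ".join(self.guardrails)),
            ("MDE", self.mde),
            ("Duration", self.duration),
            ("Segments", ", ".join(self.segments)),
            ("Decision rule", self.decision_rule),
        ]
        return "\n".join(f"- {label}: {value}" for label, value in rows)


# Illustrative values only; they are placeholders, not data from the skill.
design = ExperimentDesign(
    hypothesis="Shortening the signup form raises completed signups",
    primary_metric="signup completion rate",
    guardrails=["support tickets per signup", "7-day retention"],
    mde="+5% relative",
    duration="2 full weeks",
    segments=["new vs returning", "mobile vs desktop"],
    decision_rule="ship if the primary metric improves and no guardrail regresses",
)
print(design.to_markdown())
```

Keeping the fields required (no defaults) means an incomplete design fails at construction time, which is one way to make the pre-commitment explicit.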

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The skill is extremely verbose at 3,500+ words. It extensively explains concepts Claude already knows (what p-hacking is, what novelty effects are, what network effects are, basic statistics). Entire paragraphs explain why things matter rather than just stating the actionable rule. The 'What NOT to A/B test' section spends hundreds of words explaining obvious things like 'don't A/B test legal requirements.' The introductory framing ('The default state of experimentation in most companies is sloppy...') and closing philosophical paragraphs add no actionable value. | 1 / 3 |
| Actionability | The skill provides concrete frameworks (four-part hypothesis structure, three-bucket decision model, pre-commitment checklist items) and good/bad hypothesis examples. However, it lacks executable artifacts: no code, no templates inline, no sample size formulas, no concrete decision-rule syntax. The actionable content is described rather than provided directly, with templates and checklists deferred to reference files that don't exist in the bundle. | 2 / 3 |
| Workflow Clarity | The 12-consideration framework provides a clear sequence, and the three-bucket decision model with ranked inconclusive resolution paths is well-structured. However, there is no explicit step-by-step workflow with validation checkpoints for the full experiment lifecycle. The pre-experiment readiness checklist and results interpretation checklist are referenced but not provided inline or in bundle files, leaving gaps in the actual workflow. | 2 / 3 |
| Progressive Disclosure | The skill references seven well-organized reference files and three related skills, with clear one-level-deep navigation. However, none of the referenced bundle files actually exist, meaning the progressive disclosure structure is promised but not delivered. The main SKILL.md also contains far too much inline content that should be in the reference files (e.g., the detailed network effects discussion, the full peeking math explanation). | 2 / 3 |
| Total | | 7 / 12 |

Passed
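The Actionability reasoning notes that the skill contains no sample size formulas. For reference, a minimal sketch of the standard two-proportion approximation, n per arm = (z_{1-α/2} + z_{power})² · [p₁(1-p₁) + p₂(1-p₂)] / (p₂ - p₁)², is shown below. The function name `sample_size_per_arm`, its parameters, and the defaults (two-sided α = 0.05, 80% power) are assumptions for illustration; the skill itself may prescribe different conventions.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(
    baseline: float,
    relative_mde: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    baseline     -- control conversion rate, e.g. 0.10
    relative_mde -- smallest relative lift worth detecting, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and treated rates must lie strictly in (0, 1)")
    if p1 == p2:
        raise ValueError("relative_mde must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of per-arm Bernoulli variances
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Illustrative inputs only: 10% baseline, +10% relative lift.
n = sample_size_per_arm(baseline=0.10, relative_mde=0.10)
print(f"users per arm: {n}")
```

Because required sample size scales with the inverse square of the effect, halving the MDE roughly quadruples the traffic needed, which is why the MDE should be fixed before the duration is.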

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 10 / 11 |

Passed

Repository
rampstackco/claude-skills
Reviewed
