experimentation-analytics

How to read experiment results without fooling yourself. Confidence intervals, p-values, multiple testing, sequential testing, CUPED, heterogeneous treatment effects, ratio metrics, network effects, dashboard reconciliation, and the interpretation failures that produce confidently wrong shipping decisions.

Quality

70%

Does it follow best practices?

Security

Passed

No findings from the security scan

Fix and improve this skill with Tessl

tessl review fix ./skills/experimentation-analytics/SKILL.md

Quality

Content

62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

A well-structured, well-referenced playbook with excellent progressive disclosure and clear decision rules. However, it runs long for a SKILL.md body, and its guidance is instructional rather than executable, with no explicit validation checkpoints.

Suggestions

Tighten repeated explanations (CUPED and peeking are each explained 2-3 times across sections) and push detail into the reference files to improve conciseness.

Add a short explicit validation checkpoint in the 14-consideration framework (e.g., 'Confirm all required panel fields are present before interpreting') to raise workflow clarity.

Include at least one copy-paste-ready worked numeric example or command to lift actionability toward executable guidance.

Dimension	Reasoning	Score
Conciseness	The body is mostly value-dense, but at ~320 lines it is far longer than a lean overview. Some sections restate concepts and re-explain the same CUPED and peeking lessons several times; this material could be tightened or offloaded to the references.	2 / 3
Actionability	Guidance is concrete in the form of decision rules (e.g., the five CI rules) but largely instructional rather than executable. There is no copy-paste code, and the worked numeric examples appear only as prose, so the skill falls short of fully executable, copy-paste-ready guidance.	2 / 3
Workflow Clarity	The 14-consideration framework gives a sequence, but there is no explicit validate→fix→retry checkpoint for batch/interpretive decisions, and the recommended 'surface missing panel fields before shipping' step is stated once rather than as a recurring checkpoint.	2 / 3
Progressive Disclosure	All seven referenced files exist and are linked one level deep with clearly signaled summaries. The body is an overview that points to detailed references and is easy to navigate, which matches the rubric's 3-point anchor.	3 / 3
	Total	9 / 12 Passed

Description

77%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

A strong, highly specific description with natural trigger terms and clear niche separation. The main gap is the absence of an explicit "Use when..." clause tying the capabilities to a concrete trigger moment.

Suggestions

Add an explicit 'Use when reading an experiment result panel and deciding whether to ship, kill, or iterate' clause to lift completeness to 3.

Lead with the trigger context before the topic catalog so the 'when' is as prominent as the 'what'.

Dimension	Reasoning	Score
Specificity	Enumerates many concrete interpretive capabilities—"confidence intervals, p-values, multiple testing, sequential testing, CUPED, heterogeneous treatment effects, ratio metrics, network effects, dashboard reconciliation"—rather than vague abstractions.	3 / 3
Completeness	It thoroughly answers what the skill does but lacks an explicit "Use when..." trigger clause. The "when" is only implied by the catalog of topics, so under the rubric completeness is capped at 2.	2 / 3
Trigger Term Quality	Covers natural terms users would say—"experiment results", "p-values", "confidence intervals", "dashboard reconciliation"—with good variation and the vernacular framing "without fooling yourself".	3 / 3
Distinctiveness Conflict Risk	The niche is sharply scoped to experiment-result interpretation and explicitly distinguished from adjacent skills (experiment-design, feature-flagging), making conflict with other skills unlikely.	3 / 3
	Total	11 / 12 Passed

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 15 / 16 Passed

Validation for skill structure

Criteria	Description	Result
frontmatter_unknown_keys	Unknown frontmatter key(s) found; consider removing or moving to metadata	Warning

	Total	15 / 16 Passed

Repository: rampstackco/claude-skills
Path: skills/experimentation-analytics/SKILL.md
Commit: bc6d961

Reviewed: about 5 hours ago