Designs and tracks scientific experiments, A/B tests, and feature rollouts for product and engineering teams. Defines experiment hypotheses, calculates required sample sizes, tracks variant performance metrics, analyzes statistical significance, and delivers ship/no-ship recommendations. Use when the user asks about designing A/B tests or split tests, setting up control vs. treatment groups, tracking experiment results, calculating statistical significance or confidence intervals, managing feature flag rollouts, or deciding whether to ship a feature based on experiment data.
**Overall score: 93**

| Category | Result |
|---|---|
| Quality — Does it follow best practices? | 92% — Passed, no known issues |
| Impact | Pending — no eval scenarios have been run |
Discovery — 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong, well-crafted skill description that clearly defines its domain (experiment design and analysis), lists specific concrete capabilities, and provides an explicit 'Use when...' clause with diverse natural trigger terms. It uses proper third-person voice throughout and covers both the 'what' and 'when' comprehensively, making it easy for Claude to select this skill appropriately from a large pool.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: defining hypotheses, calculating sample sizes, tracking variant performance metrics, analyzing statistical significance, and delivering ship/no-ship recommendations. | 3 / 3 |
| Completeness | Clearly answers both 'what' (designs and tracks experiments, calculates sample sizes, analyzes significance, delivers recommendations) and 'when', with an explicit 'Use when...' clause listing six specific trigger scenarios. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'A/B tests', 'split tests', 'control vs. treatment groups', 'statistical significance', 'confidence intervals', 'feature flag rollouts', 'ship a feature', 'experiment data'. | 3 / 3 |
| Distinctiveness / Conflict Risk | Occupies a clear niche around experimentation and A/B testing, with distinct triggers like 'sample sizes', 'ship/no-ship', 'feature flag rollouts', and 'statistical significance' that are unlikely to conflict with general analytics or data science skills. | 3 / 3 |
| **Total** | | **12 / 12 — Passed** |
Implementation — 85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured experiment tracking skill that excels in workflow clarity, with explicit validation checkpoints at every stage and clear decision criteria. The progressive disclosure is excellent, providing just enough inline detail while pointing to dedicated files for implementations and templates. The main area for improvement is actionability: including at least one inline executable code example, rather than deferring all code to external files, would strengthen the skill.
Suggestions
- Include at least one inline executable code snippet (e.g., the sample_size calculation) so the skill has copy-paste-ready code without requiring navigation to STATISTICAL_METHODS.md.
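For illustration, here is a hedged sketch of what that inline snippet could look like. The actual implementation lives in STATISTICAL_METHODS.md and is not shown here, so the function name and signature below are assumptions; the formula is the standard per-variant sample size for a two-proportion test under the normal approximation:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-proportion test
    (normal approximation), given a baseline conversion rate
    and a minimum detectable effect (absolute)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)

# e.g. detect a 2-point lift on a 10% baseline at alpha=0.05, power=0.8
n = sample_size(0.10, 0.02)
```

This is a sketch under stated assumptions, not the skill's actual code; the skill's STATISTICAL_METHODS.md may use a different approximation or an exact test.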
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is lean and efficient throughout. It avoids explaining what A/B tests are or how statistics work conceptually, instead jumping straight into actionable workflow steps. Every section earns its place with specific thresholds, criteria, and references rather than padding. | 3 / 3 |
| Actionability | The skill provides specific thresholds (e.g., '< 95% of expected', '> 5% deviation'), concrete metric examples, and references to Python implementations in STATISTICAL_METHODS.md. However, the actual executable code is deferred to external files rather than included inline, and the templates are similarly referenced but not shown. The guidance is concrete but not fully copy-paste ready within this file. | 2 / 3 |
| Workflow Clarity | The four-step workflow is clearly sequenced with explicit validation checkpoints at each stage, including specific trigger conditions (e.g., data collection rate < 95%, split deviation > 5%). It includes feedback loops (halt and fix, reduce scope) and covers the full lifecycle from design through decision with clear go/no-go criteria. | 3 / 3 |
| Progressive Disclosure | The skill provides a clear overview with well-signaled one-level-deep references to STATISTICAL_METHODS.md and TEMPLATES.md. The main file contains enough context (function signatures, test selection table, example values) to be useful standalone while appropriately deferring full implementations and templates to separate files. | 3 / 3 |
| **Total** | | **11 / 12 — Passed** |
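The ship/no-ship analysis this skill describes typically rests on a significance test over control and treatment conversion counts. As a minimal sketch, assuming a two-proportion z-test (the skill's actual test selection table is referenced but not shown here, so the choice of test and the function name are assumptions):

```python
from math import sqrt
from statistics import NormalDist

def z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test on conversion counts.
    Returns (z statistic, two-sided p-value); positive z means
    treatment (b) converted better than control (a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# e.g. control 100/1000 vs. treatment 150/1000
z, p = z_test(100, 1000, 150, 1000)
ship = p < 0.05 and z > 0  # a simple ship criterion, for illustration only
```

The single-threshold ship criterion here is deliberately simplistic; the skill's own go/no-go criteria reportedly also consider guardrail metrics and data quality checkpoints.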
Validation — 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing them or moving them under metadata | Warning |
| **Total** | 10 / 11 Passed | |
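The frontmatter_unknown_keys warning can be reproduced with a small check. A sketch, assuming the skill file uses YAML-style `---` frontmatter and a hypothetical allowed-key set (the actual spec's key list is not shown in this report, so ALLOWED_KEYS below is an assumption to adjust):

```python
import re

# Hypothetical allowed key set -- replace with the actual spec's list.
ALLOWED_KEYS = {"name", "description", "metadata"}

def unknown_frontmatter_keys(text):
    """Return top-level frontmatter keys not in the allowed set."""
    m = re.match(r"^---\n(.*?)\n---", text, re.S)
    if not m:
        return set()  # no frontmatter block found
    keys = {line.split(":", 1)[0].strip()
            for line in m.group(1).splitlines()
            # skip indented (nested) lines and list items
            if ":" in line and not line.startswith((" ", "\t", "-"))}
    return keys - ALLOWED_KEYS
```

Any key this returns would trigger the warning above; moving such keys under a `metadata` block is the fix the validator suggests.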