Designs and tracks scientific experiments, A/B tests, and feature rollouts for product and engineering teams. Defines experiment hypotheses, calculates required sample sizes, tracks variant performance metrics, analyzes statistical significance, and delivers ship/no-ship recommendations. Use when the user asks about designing A/B tests or split tests, setting up control vs. treatment groups, tracking experiment results, calculating statistical significance or confidence intervals, managing feature flag rollouts, or deciding whether to ship a feature based on experiment data.
**Overall score: 93**

| Category | Result |
|---|---|
| Quality — Does it follow best practices? | 92% — Passed, no known issues |
| Impact | Pending — no eval scenarios have been run |
Discovery — 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong, well-crafted skill description that clearly defines its domain (experiment design and analysis), lists specific concrete capabilities, and provides an explicit 'Use when...' clause with diverse natural trigger terms. It uses proper third-person voice throughout and covers both the 'what' and 'when' comprehensively, making it easy for Claude to select this skill appropriately from a large pool.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: defining hypotheses, calculating sample sizes, tracking variant performance metrics, analyzing statistical significance, and delivering ship/no-ship recommendations. | 3 / 3 |
| Completeness | Clearly answers both 'what' (designs and tracks experiments, calculates sample sizes, analyzes significance, delivers recommendations) and 'when', with an explicit 'Use when...' clause listing six specific trigger scenarios. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'A/B tests', 'split tests', 'control vs. treatment groups', 'statistical significance', 'confidence intervals', 'feature flag rollouts', 'ship a feature', 'experiment data'. | 3 / 3 |
| Distinctiveness / Conflict Risk | Occupies a clear niche around experimentation and A/B testing, with distinct triggers like 'sample sizes', 'ship/no-ship', 'feature flag rollouts', and 'statistical significance' that are unlikely to conflict with general analytics or data science skills. | 3 / 3 |
| **Total** | | **12 / 12 — Passed** |
Implementation — 85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a well-structured experiment tracking skill that excels in workflow clarity, with explicit validation checkpoints at every stage and clear decision criteria. The progressive disclosure is excellent, providing just enough inline detail while pointing to dedicated files for implementations and templates. The main area for improvement is actionability: including at least one inline executable code example, rather than deferring all code to external files, would strengthen the skill.
Suggestions
- Include at least one inline executable code snippet (e.g., the sample_size calculation) so the skill has copy-paste-ready code without requiring navigation to STATISTICAL_METHODS.md.
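For illustration, here is a hedged sketch of what that inline snippet could look like. The actual implementation lives in STATISTICAL_METHODS.md and is not shown here, so the function name and signature below are assumptions; the formula is the standard per-variant sample size for a two-proportion test under the normal approximation:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-proportion test
    (normal approximation), given a baseline conversion rate
    and a minimum detectable effect (absolute)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / mde ** 2
    return ceil(n)

# e.g. detect a 2-point lift on a 10% baseline at alpha=0.05, power=0.8
n = sample_size(0.10, 0.02)
```

This is a sketch under stated assumptions, not the skill's actual code; the skill's STATISTICAL_METHODS.md may use a different approximation or an exact test.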
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is lean and efficient throughout. It avoids explaining what A/B tests are or how statistics work conceptually, instead jumping straight into actionable workflow steps. Every section earns its place with specific thresholds, criteria, and references rather than padding. | 3 / 3 |
| Actionability | The skill provides specific thresholds (e.g., '< 95% of expected', '> 5% deviation'), concrete metric examples, and references to Python implementations in STATISTICAL_METHODS.md. However, the actual executable code is deferred to external files rather than included inline, and the templates are similarly referenced but not shown. The guidance is concrete but not fully copy-paste ready within this file. | 2 / 3 |
| Workflow Clarity | The four-step workflow is clearly sequenced with explicit validation checkpoints at each stage, including specific trigger conditions (e.g., data collection rate < 95%, split deviation > 5%). It includes feedback loops (halt and fix, reduce scope) and covers the full lifecycle from design through decision with clear go/no-go criteria. | 3 / 3 |
| Progressive Disclosure | The skill provides a clear overview with well-signaled one-level-deep references to STATISTICAL_METHODS.md and TEMPLATES.md. The main file contains enough context (function signatures, test selection table, example values) to be useful standalone while appropriately deferring full implementations and templates to separate files. | 3 / 3 |
| **Total** | | **11 / 12 — Passed** |
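The ship/no-ship analysis this skill describes typically rests on a significance test over control and treatment conversion counts. As a minimal sketch, assuming a two-proportion z-test (the skill's actual test selection table is referenced but not shown here, so the choice of test and the function name are assumptions):

```python
from math import sqrt
from statistics import NormalDist

def z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test on conversion counts.
    Returns (z statistic, two-sided p-value); positive z means
    treatment (b) converted better than control (a)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# e.g. control 100/1000 vs. treatment 150/1000
z, p = z_test(100, 1000, 150, 1000)
ship = p < 0.05 and z > 0  # a simple ship criterion, for illustration only
```

The single-threshold ship criterion here is deliberately simplistic; the skill's own go/no-go criteria reportedly also consider guardrail metrics and data quality checkpoints.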
Validation — 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing them or moving them under metadata | Warning |
| **Total** | 10 / 11 Passed | |
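The frontmatter_unknown_keys warning can be reproduced with a small check. A sketch, assuming the skill file uses YAML-style `---` frontmatter and a hypothetical allowed-key set (the actual spec's key list is not shown in this report, so ALLOWED_KEYS below is an assumption to adjust):

```python
import re

# Hypothetical allowed key set -- replace with the actual spec's list.
ALLOWED_KEYS = {"name", "description", "metadata"}

def unknown_frontmatter_keys(text):
    """Return top-level frontmatter keys not in the allowed set."""
    m = re.match(r"^---\n(.*?)\n---", text, re.S)
    if not m:
        return set()  # no frontmatter block found
    keys = {line.split(":", 1)[0].strip()
            for line in m.group(1).splitlines()
            # skip indented (nested) lines and list items
            if ":" in line and not line.startswith((" ", "\t", "-"))}
    return keys - ALLOWED_KEYS
```

Any key this returns would trigger the warning above; moving such keys under a `metadata` block is the fix the validator suggests.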