experiment-tracker

Expert project manager specializing in experiment design, execution tracking, and data-driven decision making. Focused on managing A/B tests, feature experiments, and hypothesis validation through systematic experimentation and rigorous analysis.

Quality: 30% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Passed (No known issues)

Optimize this skill with Tessl

```
npx tessl skill review --optimize ./pm-experiment-tracker/skills/SKILL.md
```

Quality

Discovery: 32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description identifies a clear domain (experimentation and A/B testing) but relies on buzzword-heavy language ('expert project manager', 'data-driven decision making', 'rigorous analysis') rather than concrete actions. It critically lacks a 'Use when...' clause, making it difficult for Claude to know when to select this skill. The first-person framing is avoided, but the description reads more like a resume summary than a functional skill selector.

Suggestions

Add an explicit 'Use when...' clause with trigger scenarios, e.g., 'Use when the user asks about designing A/B tests, tracking experiment results, calculating statistical significance, or deciding whether to ship a feature based on experiment data.'

Replace vague phrases like 'data-driven decision making' and 'rigorous analysis' with concrete actions such as 'define experiment hypotheses, set sample sizes, track variant performance metrics, analyze statistical significance, and recommend ship/no-ship decisions.'

Include common user-facing trigger terms and file/concept variations like 'split test', 'control vs treatment', 'experiment results', 'feature flag', 'rollout decision' to improve keyword coverage.
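Taken together, a description applying these suggestions might read as follows. This is a sketch only; the field names assume the conventional SKILL.md name/description frontmatter rather than anything confirmed by this report:

```yaml
# Hypothetical rewrite combining the suggestions above.
name: experiment-tracker
description: >-
  Design, track, and analyze A/B tests and feature experiments: define
  hypotheses, calculate sample sizes, track control vs treatment metrics,
  test statistical significance, and recommend ship/no-ship decisions.
  Use when the user asks about split tests, experiment results, feature
  flags, rollout decisions, or whether to ship a feature based on
  experiment data.
```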

Dimension scores

Specificity: 2 / 3
Names the domain (experiment/project management) and some actions (experiment design, execution tracking, data-driven decision making, A/B tests, hypothesis validation), but these are more like category labels than concrete specific actions. It doesn't list discrete operations like 'create experiment plans', 'track variant metrics', or 'calculate statistical significance'.

Completeness: 1 / 3
Describes what it does (experiment design, execution tracking, analysis) but completely lacks any 'Use when...' clause or explicit trigger guidance for when Claude should select this skill. Per the rubric, a missing 'Use when...' clause caps completeness at 2, and the 'what' portion is also somewhat vague, warranting a score of 1.

Trigger Term Quality: 2 / 3
Includes some relevant keywords like 'A/B tests', 'feature experiments', 'hypothesis validation', and 'experimentation' that users might naturally say. However, it misses common variations like 'split test', 'experiment results', 'statistical significance', 'control group', 'variant', or 'experiment plan'.

Distinctiveness / Conflict Risk: 2 / 3
The experimentation/A/B testing focus provides some distinctiveness, but terms like 'project manager', 'data-driven decision making', and 'rigorous analysis' are generic enough to overlap with general project management or data analysis skills.

Total: 7 / 12 (Passed)

Implementation: 27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like a persona/role-play prompt than an actionable skill document. It is extremely verbose, spending most of its token budget on identity descriptions, communication style guidance, success metrics, and aspirational capabilities rather than concrete, executable instructions. The templates provide some structure but lack the specificity (actual statistical formulas, code examples, tool commands) needed to make them truly actionable.

Suggestions

Remove all persona/identity/communication style sections and focus exclusively on actionable instructions; Claude doesn't need to be told its personality traits to execute a skill effectively.

Add concrete, executable examples: include actual Python/R code for sample size calculation, statistical significance testing, and confidence interval computation rather than just describing these concepts.

Split the monolithic content into separate files: keep SKILL.md as a concise overview with links to DESIGN_TEMPLATE.md, RESULTS_TEMPLATE.md, STATISTICAL_METHODS.md, and WORKFLOW.md.

Add explicit validation checkpoints with concrete criteria, e.g., 'If observed power < 0.8, extend experiment duration by recalculating with current effect size' rather than vague references to 'quality assurance checks'.
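To make the 'concrete, executable examples' suggestion tangible, a minimal Python sketch of the sample size and significance calculations might look like this. It uses only scipy's normal distribution, and the baseline and effect-size numbers are invented for illustration:

```python
import math

from scipy.stats import norm


def sample_size_per_variant(p_baseline, mde, alpha=0.05, power=0.8):
    """n per arm for a two-sided two-proportion z-test (normal approximation)."""
    p_treatment = p_baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # quantile for the desired power
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test on conversion counts; returns (z, p_value)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pooled = (conversions_a + conversions_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))


# Invented numbers: 10% baseline conversion, 2pp minimum detectable effect.
n = sample_size_per_variant(0.10, 0.02)             # ~3,839 per arm
z, p = two_proportion_ztest(400, 4000, 460, 4000)   # 10.0% vs 11.5%
print(f"n per arm: {n}, z = {z:.2f}, p = {p:.4f}")  # p ≈ 0.03 here
```

The same pattern extends to the power checkpoint in the last suggestion: recompute power with the observed effect size and extend the run if it falls below 0.8.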

Dimension scores

Conciseness: 1 / 3
Extremely verbose with extensive sections describing personality traits, identity, communication style, success metrics, and 'learning & memory' that are either things Claude already knows or vague aspirational statements. The content is heavily padded with emoji headers, motivational language, and redundant descriptions that don't add actionable value.

Actionability: 2 / 3
The markdown templates for experiment design and results reporting are somewhat concrete and usable, but there is no executable code, no specific statistical commands or formulas, and no concrete examples with real numbers showing how to calculate sample sizes, p-values, or confidence intervals. Most guidance remains at the descriptive level rather than providing copy-paste ready instructions.

Workflow Clarity: 2 / 3
The 4-step workflow process is sequenced and logical, but validation checkpoints are vague ('validate implementation', 'quality assurance checks') rather than explicit. There are no concrete feedback loops showing what to do when data quality fails or when early stopping criteria are triggered. Safety monitoring is mentioned but not operationalized with specific thresholds or procedures.

Progressive Disclosure: 1 / 3
The content is a monolithic wall of text with no references to external files for detailed content. Everything is inline in one massive document. The final line references 'core training' which is not a real file reference. Advanced capabilities, templates, and workflow details could all be split into separate referenced documents but are instead dumped into a single file.

Total: 6 / 12 (Passed)
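Concretely, the file split recommended above (and scored under Progressive Disclosure) could look like the layout below. This is one possible arrangement, not something the skill currently ships:

```
pm-experiment-tracker/skills/
├── SKILL.md                 # concise overview linking to the files below
├── DESIGN_TEMPLATE.md       # experiment design template
├── RESULTS_TEMPLATE.md      # results reporting template
├── STATISTICAL_METHODS.md   # formulas and worked code examples
└── WORKFLOW.md              # step-by-step workflow with validation checkpoints
```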

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 passed

Validation for skill structure

frontmatter_unknown_keys: Warning
Unknown frontmatter key(s) found; consider removing or moving to metadata.

Total: 10 / 11 (Passed)
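The one warning is usually quick to clear: move any nonstandard top-level keys under metadata, as the message suggests. A sketch follows; the offending keys aren't shown in this report, so these are hypothetical:

```yaml
# Before: unrecognized top-level keys trigger frontmatter_unknown_keys.
# After (sketch): nonstandard keys relocated under metadata.
name: experiment-tracker
description: ...
metadata:
  author: OpenRoster-ai   # hypothetical unknown key, relocated
  version: "1.0"          # hypothetical unknown key, relocated
```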

Repository: OpenRoster-ai/awesome-openroster (Reviewed)


Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.