World-class senior data scientist skill specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A/B testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking — using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments, building and evaluating classification or regression models, performing causal analysis on observational data, engineering features for structured tabular datasets, or translating statistical findings into data-driven business decisions.
Overall score: 81

- Quality: 88% (Does it follow best practices?)
- Impact: 73% (1.25x average score across 6 eval scenarios)
- Validation: Passed, no known issues
Quality
Discovery
92%. Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong, detailed description that excels in specificity and trigger term coverage, listing concrete techniques, tools, and metrics that a data scientist would naturally reference. The explicit 'Use when...' clause with five distinct scenarios ensures completeness. The main weakness is its very broad scope, which could create overlap with more specialized data science or ML skills in a large skill library. Note: the phrase 'World-class senior data scientist skill' is unnecessary fluff that doesn't aid selection.
Suggestions
Remove the subjective qualifier 'World-class senior data scientist skill specialising in' — it's fluff that doesn't help Claude select the skill. Start directly with the capabilities.
Consider narrowing the scope or explicitly noting boundaries (e.g., 'Not for deep learning, NLP, or unstructured data tasks') to reduce potential conflicts with other ML-related skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions and tools: A/B testing with sample sizing/z-tests/Bonferroni correction, difference-in-differences, feature engineering pipelines with named libraries, cross-validated model evaluation with specific metrics (AUC-ROC, AUC-PR, SHAP), and MLflow tracking. | 3 / 3 |
| Completeness | Clearly answers both 'what' (statistical modeling, experiment design, causal inference, specific techniques and tools) and 'when' with an explicit 'Use when...' clause covering five distinct trigger scenarios: designing experiments, building models, causal analysis, feature engineering, and translating findings into decisions. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'A/B testing', 'experiment design', 'causal inference', 'predictive analytics', 'feature engineering', 'classification', 'regression', 'XGBoost', 'Scikit-learn', 'SHAP', 'cross-validated', 'SQL', 'Python', 'R'. These are terms data scientists and stakeholders naturally use. | 3 / 3 |
| Distinctiveness (Conflict Risk) | While the description is highly specific about data science capabilities, the broad scope covering statistical modeling, ML, causal inference, feature engineering, and business decisions could overlap with more focused skills (e.g., a dedicated ML model building skill, an A/B testing skill, or a general Python data analysis skill). The breadth increases conflict risk with narrower skills in the same domain. | 2 / 3 |
| Total | | 11 / 12 (Passed) |
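The Specificity row cites sample sizing for two-proportion z-tests with Bonferroni correction. A minimal stdlib sketch of that calculation, assuming the usual normal-approximation formula (the function name and defaults are illustrative, not taken from the skill itself):

```python
import math
from statistics import NormalDist

def sample_size_per_arm(baseline_rate, mde, alpha=0.05, power=0.8, n_metrics=1):
    """Per-arm sample size for a two-proportion z-test.

    mde is the absolute minimum detectable effect; alpha is divided by
    the number of metrics tested (Bonferroni correction).
    """
    p1, p2 = baseline_rate, baseline_rate + mde
    alpha_corrected = alpha / n_metrics  # Bonferroni: alpha / n_metrics
    z_alpha = NormalDist().inv_cdf(1 - alpha_corrected / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)

# 10% baseline conversion, 2-point absolute lift, single metric:
n = sample_size_per_arm(0.10, 0.02)  # roughly 3.8k users per arm
```

Testing more metrics raises the required sample size, since the Bonferroni-corrected alpha shrinks as n_metrics grows.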
Implementation
85%. Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, well-structured skill that provides highly actionable, executable code across four core data science workflows. The checklists embedded after each code block add valuable domain-specific guardrails. Minor weaknesses include some verbosity in docstrings, a generic Common Commands section (Docker/kubectl) that doesn't add skill-specific value, and the self-aggrandizing opening line.
Suggestions
Remove or trim the Common Commands section — generic pytest/docker/kubectl commands don't add value for Claude and waste tokens.
Trim function docstrings to remove obvious parameter descriptions (e.g., Claude knows what 'baseline_rate' means from context).
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient with executable code, but includes some unnecessary elements: the docstrings explaining obvious parameters (e.g., 'baseline_rate: current conversion rate'), the 'World-class senior data scientist' self-description, and the Common Commands section with generic Docker/Kubernetes commands that Claude already knows and aren't specific to this skill. | 2 / 3 |
| Actionability | Excellent actionability — every workflow provides fully executable Python code with concrete functions, proper imports, and clear parameter signatures. The checklists add specific, actionable guidance (e.g., 'overfit_gap > 0.05 as a warning sign', 'Bonferroni correction: alpha / n_metrics'). Code is copy-paste ready. | 3 / 3 |
| Workflow Clarity | Each workflow has a clear sequence with explicit validation checkpoints embedded in the checklists (e.g., 'Calculate sample size BEFORE starting', 'Never fit transformers on the full dataset — fit on train, transform test', 'Check overfit_gap > 0.05', 'Validate parallel trends in pre-period before trusting DiD estimates'). The checklists serve as effective feedback loops for error prevention. | 3 / 3 |
| Progressive Disclosure | Content is well-structured with four clearly delineated workflow sections, each containing code + checklist. Advanced/detailed references are appropriately pointed to via one-level-deep links (references/statistical_methods_advanced.md, etc.). The skill serves as a clear overview with navigation to deeper material. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
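Two checkpoints quoted in the Workflow Clarity row ('fit on train, transform test' and 'Check overfit_gap > 0.05') can be sketched in a few lines. The hand-rolled scaler below is a stand-in for a scikit-learn transformer, and the AUC numbers are invented for illustration:

```python
from statistics import mean, pstdev

class Scaler:
    """Hand-rolled stand-in for a scikit-learn StandardScaler."""
    def fit(self, xs):
        self.mu = mean(xs)
        self.sigma = pstdev(xs) or 1.0  # guard against zero variance
        return self

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]

train, test = [1.0, 2.0, 3.0, 4.0], [10.0]
scaler = Scaler().fit(train)          # fit on train only, never the full dataset
test_scaled = scaler.transform(test)  # test reuses the train statistics

train_auc, val_auc = 0.91, 0.84       # illustrative scores, not real results
overfit_gap = train_auc - val_auc
overfitting = overfit_gap > 0.05      # warning threshold from the skill's checklist
```

Fitting the scaler on train and test together would leak test-set statistics into the features, which is exactly the mistake the skill's checklist guards against.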
Validation
100%. Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.