World-class senior data scientist skill specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A/B testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking — using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments, building and evaluating classification or regression models, performing causal analysis on observational data, engineering features for structured tabular datasets, or translating statistical findings into data-driven business decisions.
Overall score: 81

- Quality: 88% (Does it follow best practices?)
- Impact: 1.25x (73% average score across 6 eval scenarios)
- Passed: no known issues
Quality
Discovery
92%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong, detailed description that excels in specificity and trigger-term coverage, listing concrete techniques, tools, and metrics that users would naturally reference. The explicit 'Use when...' clause with five distinct scenarios ensures completeness. The main weaknesses are the very broad scope, which could create overlap with more specialized skills, and the 'world-class senior data scientist' opener, which is fluff that does not aid skill selection. First/second-person voice is correctly avoided.
Suggestions
Remove the subjective qualifier 'World-class senior data scientist skill' as it is fluff that doesn't help Claude distinguish when to select this skill.
Consider narrowing the scope or adding negative triggers (e.g., 'Do not use for deep learning/neural network tasks or unstructured text/image data') to reduce potential overlap with adjacent skills.
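One technique the description lists, difference-in-differences, can be illustrated with the textbook 2x2 estimator: the change in the treated group minus the change in the control group. This is a minimal sketch on synthetic data; the helper name and the simulated effect sizes are illustrative, not drawn from the skill itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def did_estimate(y, treated, post):
    """2x2 difference-in-differences:
    (treated post - treated pre) minus (control post - control pre)."""
    y, treated, post = map(np.asarray, (y, treated, post))
    delta_treated = y[treated & post].mean() - y[treated & ~post].mean()
    delta_control = y[~treated & post].mean() - y[~treated & ~post].mean()
    return delta_treated - delta_control

# Synthetic panel: common time trend of +1.0 shared by both groups,
# a +0.5 baseline difference for the treated group, and a true
# treatment effect of +2.0 applied only to treated units post-period.
n = 4000
treated = rng.random(n) < 0.5
post = rng.random(n) < 0.5
y = (5.0 + 1.0 * post + 0.5 * treated
     + 2.0 * (treated & post) + rng.normal(0, 1, n))

effect = did_estimate(y, treated, post)  # close to the true +2.0
```

Because the common trend and the baseline group difference both cancel in the double difference, the estimate recovers the treatment effect; this cancellation is exactly what the parallel-trends assumption buys.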
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions and tools: A/B testing with sample sizing, z-tests, and Bonferroni correction; difference-in-differences; feature engineering pipelines with named libraries; cross-validated model evaluation with specific metrics (AUC-ROC, AUC-PR, SHAP); and MLflow tracking. | 3 / 3 |
| Completeness | Clearly answers both 'what' (statistical modeling, experiment design, causal inference, specific techniques and tools) and 'when' with an explicit 'Use when...' clause covering five distinct trigger scenarios: designing experiments, building models, causal analysis, feature engineering, and translating findings into decisions. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'A/B testing', 'experiment design', 'causal inference', 'predictive analytics', 'feature engineering', 'classification', 'regression', 'XGBoost', 'Scikit-learn', 'SHAP', 'SQL', 'cross-validated', 'sample sizing'. These are terms data scientists and stakeholders naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | While highly specific within data science, the broad scope covering statistical modeling, ML, causal inference, feature engineering, and business analytics could overlap with more focused skills (e.g., a dedicated ML model building skill, an A/B testing skill, or a general Python data analysis skill). The 'world-class senior data scientist' framing is also generic positioning rather than a clear niche delimiter. | 2 / 3 |
| **Total** | | **11 / 12 (Passed)** |
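The A/B-testing techniques the Specificity row credits (sample sizing, two-proportion z-tests, Bonferroni correction) fit together roughly as sketched below. The function names, conversion rates, and counts are illustrative assumptions, not code from the skill under review.

```python
import numpy as np
from scipy.stats import norm

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-arm sample size to detect a shift from p1 to p2 (two-sided test)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2
    return int(np.ceil(n))

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (successes_a / n_a - successes_b / n_b) / se
    return z, 2 * norm.sf(abs(z))

# Size the test for a hypothetical 10% -> 12% conversion lift, then
# apply a Bonferroni correction because the experiment tracks 3 metrics.
n_per_arm = sample_size_two_proportions(0.10, 0.12)
alpha_adj = 0.05 / 3  # Bonferroni: alpha / n_metrics
z, p = two_proportion_ztest(1300, 10000, 1200, 10000)
```

Note that in this made-up example the raw p-value clears 0.05 but not the Bonferroni-adjusted threshold, which is precisely the situation the correction exists to catch.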
Implementation
85%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, well-structured skill with excellent actionability — every workflow provides production-ready code with validation checklists. The progressive disclosure is clean with clear references to deeper material. The main weakness is moderate verbosity: the Common Commands section contains generic DevOps commands unrelated to data science, and some docstrings over-explain parameters that Claude would understand from context.
Suggestions
Remove or significantly trim the Common Commands section — generic pytest, Docker, and kubectl commands don't add data-science-specific value and waste tokens.
Trim docstring parameter descriptions that restate obvious information (e.g., 'baseline_rate: current conversion rate (e.g. 0.10)' could just be shown via the example call in the checklist).
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is mostly efficient with executable code, but includes some unnecessary elements: the opening line restates the title, docstrings explain obvious parameters (e.g., 'baseline_rate: current conversion rate'), and the Common Commands section includes generic Docker/Kubernetes/pytest commands that Claude already knows and that aren't specific to data science workflows. | 2 / 3 |
| Actionability | Every workflow provides fully executable, copy-paste-ready Python code with concrete functions, proper imports, and structured return values. The checklists add specific, actionable guidance (e.g., 'overfit_gap > 0.05 as a warning sign', 'Bonferroni correction: alpha / n_metrics'). | 3 / 3 |
| Workflow Clarity | Each workflow has a clear sequence with explicit validation checkpoints embedded in the checklists (e.g., 'Calculate sample size BEFORE starting', 'Never fit transformers on the full dataset — fit on train, transform test', 'Check overfit_gap > 0.05', 'Validate parallel trends in pre-period before trusting DiD estimates'). The numbered checklists provide feedback loops and guard rails for each multi-step process. | 3 / 3 |
| Progressive Disclosure | The skill is well structured with four clearly delineated workflow sections, each self-contained with code plus a checklist, and a Reference Documentation section pointing to one-level-deep external files for advanced or detailed content. Navigation is clear and content is appropriately split. | 3 / 3 |
| **Total** | | **11 / 12 (Passed)** |
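Two of the guard rails quoted in the Workflow Clarity row, fitting transformers on the training split only and flagging an overfit gap above 0.05, can be sketched with a standard Scikit-learn pipeline. The dataset, model choice, and threshold comment below are stand-ins, not the skill's own code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler lives inside the pipeline, so fit() learns its statistics
# from the training split only; the test split is transformed with
# training-set parameters, avoiding leakage.
model = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=0))
model.fit(X_train, y_train)

auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
overfit_gap = auc_train - auc_test  # values above ~0.05 warrant a closer look
```

Bundling preprocessing into the pipeline is what makes the 'fit on train, transform test' rule automatic rather than a manual discipline.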
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 Passed
Validation for skill structure: no warnings or errors.