data-science-python

Python data science: notebook structure, data validation, reproducibility, and model documentation

Data Science — Python

Notebook Structure

Organize notebooks in this order:

1. Imports & configuration
2. Data loading
3. Exploratory data analysis (EDA)
4. Feature engineering
5. Modeling
6. Evaluation
7. Conclusions & next steps

Keep notebooks for exploration. Move reusable logic to src/ Python modules with tests.
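
As a minimal sketch of that split, a cleaning helper that started in a notebook cell might move into a module under src/ with a test under tests/. The file names `src/cleaning.py` and `tests/test_cleaning.py` and the function `normalize_column_name` are hypothetical, chosen only for illustration.

```python
# Hypothetical contents of src/cleaning.py (names are illustrative, not from this skill).
import re


def normalize_column_name(name: str) -> str:
    """Convert a raw column header to snake_case."""
    # Replace every run of non-alphanumeric characters with one underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop underscores left at either end, then lowercase.
    return cleaned.strip("_").lower()


# Hypothetical contents of tests/test_cleaning.py, runnable with pytest.
def test_normalize_column_name():
    assert normalize_column_name("  Unit Price ($) ") == "unit_price"
    assert normalize_column_name("Category") == "category"
```

The notebook would then import the function from the module instead of redefining it in a cell.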

Data Validation

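# Always validate right after loading (df is the freshly loaded DataFrame).
# Note: assert statements are skipped when Python runs with -O.
assert len(df) > 0, "DataFrame is empty"
null_counts = df.isnull().sum()
assert null_counts.sum() == 0, f"Nulls found:\n{null_counts[null_counts > 0]}"
assert df["price"].between(0, 1_000_000).all(), "Price out of expected range"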

# For production pipelines: use pandera
import pandera as pa

schema = pa.DataFrameSchema({
    "price": pa.Column(float, pa.Check.ge(0)),                   # non-negative floats
    "category": pa.Column(str, pa.Check.isin(["A", "B", "C"])),  # allowed labels only
})
df = schema.validate(df)  # raises a SchemaError if a check fails
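
Between bare asserts and a full schema library there is a middle option: collect every failed check and raise one exception that lists them all. The sketch below assumes a `price` column as in the checks above. The function names `find_problems` and `validate` are hypothetical.

```python
import pandas as pd


def find_problems(df: pd.DataFrame, max_price: float = 1_000_000) -> list[str]:
    """Return a description of every failed check (an empty list means valid)."""
    problems = []
    if df.empty:
        problems.append("DataFrame is empty")
    # Report each column that contains nulls, with its count.
    null_counts = df.isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        problems.append(f"{column}: {count} null value(s)")
    # Range check on the assumed price column, ignoring nulls already reported.
    if "price" not in df.columns:
        problems.append("missing column: price")
    elif not df["price"].dropna().between(0, max_price).all():
        problems.append(f"price outside [0, {max_price}]")
    return problems


def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Raise ValueError listing all problems; return df unchanged if there are none."""
    problems = find_problems(df)
    if problems:
        raise ValueError("; ".join(problems))
    return df
```

Reporting all failures at once tends to shorten the fix-and-rerun loop when a new data file has several issues.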

Reproducibility

import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
# sklearn: pass random_state=SEED to every estimator and splitter that accepts it
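
`np.random.seed` sets NumPy's legacy global state, which any imported code can also change. One possible alternative, sketched here, is to pass an explicit generator created with `np.random.default_rng`. The helper names `make_rng` and `shuffled_indices` are hypothetical.

```python
import numpy as np

SEED = 42


def make_rng(seed: int = SEED) -> np.random.Generator:
    """Create an independent, seeded generator instead of relying on global state."""
    return np.random.default_rng(seed)


def shuffled_indices(n_rows: int, rng: np.random.Generator) -> np.ndarray:
    """Return a random permutation of row positions drawn from the given generator."""
    return rng.permutation(n_rows)


# Two generators built from the same seed yield the same permutation.
first = shuffled_indices(10, make_rng())
second = shuffled_indices(10, make_rng())
assert (first == second).all()
```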

Pin all dependency versions in requirements.txt or pyproject.toml.
Track experiments: MLflow, or a simple experiments/ log with metadata JSON.
Save models with timestamp + metadata: model_rf_20260115_v1.pkl.
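
For the simple-log option, a sketch using only the standard library might write the pickled model and a JSON metadata file side by side. The function name `save_model_with_metadata` and the metadata keys are assumptions. The file name follows the `model_rf_20260115_v1.pkl` pattern above.

```python
import json
import pickle
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_model_with_metadata(model, name: str, version: int, metadata: dict, out_dir: Path) -> Path:
    """Pickle a model as <name>_<YYYYMMDD>_v<version>.pkl with a JSON sidecar next to it."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    model_path = out_dir / f"{name}_{stamp}_v{version}.pkl"
    with model_path.open("wb") as f:
        pickle.dump(model, f)
    # The sidecar records what is needed to reproduce or audit the run.
    record = {"model_file": model_path.name, "saved_on_utc": stamp, **metadata}
    model_path.with_suffix(".json").write_text(json.dumps(record, indent=2))
    return model_path


# Usage with a stand-in object; a fitted estimator would be passed the same way.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_model_with_metadata({"weights": [1, 2]}, "model_rf", 1, {"seed": 42}, Path(tmp))
    sidecar = json.loads(saved.with_suffix(".json").read_text())
```

In a real project `out_dir` would point at a persistent location such as experiments/ rather than a temporary directory.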

Model Documentation

Document for every model:

Training data: source, date range, size, preprocessing steps.
Feature list and engineering decisions.
Hyperparameters and tuning approach.
Evaluation metrics on held-out test set.
Known limitations and failure modes.
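
One way to keep this checklist machine-readable is a small record stored next to the model. The sketch below is an assumption about how that could look. The class `ModelCard` and its field names are hypothetical, and the values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelCard:
    """Hypothetical record holding the five documentation items listed above."""
    training_data: dict          # source, date range, size, preprocessing steps
    features: list[str]          # feature list and engineering decisions
    hyperparameters: dict        # values plus how they were tuned
    test_metrics: dict           # metrics on the held-out test set
    limitations: list[str] = field(default_factory=list)  # known failure modes

    def to_json(self) -> str:
        """Serialize the card so it can be saved alongside the model file."""
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; they do not describe a real model or dataset.
card = ModelCard(
    training_data={"source": "<source>", "date_range": "<start>/<end>", "rows": 0, "preprocessing": []},
    features=["<feature_name>"],
    hyperparameters={"tuning": "<approach>"},
    test_metrics={},
    limitations=["<known failure mode>"],
)
print(card.to_json())
```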

Code Quality

# Extract transforms into sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=SEED)),
])

Write unit tests for data processing functions in tests/.
Use pathlib.Path; never hardcode file paths.
Use logging, not print, in production code.
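
The last two points can be sketched together: build paths from a single root with `pathlib.Path` and report progress through a module-level logger. The directory layout `data/raw` and the function `list_csv_files` are hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Assumed layout: data/raw/ sits under the directory the code is run from.
PROJECT_ROOT = Path.cwd()
RAW_DIR = PROJECT_ROOT / "data" / "raw"


def list_csv_files(directory: Path) -> list[Path]:
    """Return the CSV files in a directory, logging what was found."""
    if not directory.is_dir():
        logger.warning("Directory %s does not exist", directory)
        return []
    files = sorted(directory.glob("*.csv"))
    logger.info("Found %d CSV file(s) in %s", len(files), directory)
    return files


if __name__ == "__main__":
    # Configure logging once, at the entry point, rather than inside library code.
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    list_csv_files(RAW_DIR)
```

Deriving every path from one root means the code can move between machines by changing that root alone.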

Repository: ucdavis/ai-skills-registry
Path: skills/data-science-python/SKILL.md
Commit: c0b2e4b

Last updated: about 2 months ago
Created: about 2 months ago
