Trains and fine-tunes ML models, builds data preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference into production applications, and designs RAG and LLM-powered systems. Covers MLOps workflows including experiment tracking, drift detection, retraining triggers, and A/B testing. Use when the user asks about training or fine-tuning a model, building ML pipelines, model serving or inference optimization, evaluating model performance, working with frameworks like PyTorch, TensorFlow, scikit-learn, or Hugging Face, setting up vector databases, prompt engineering, or taking an ML prototype to production.
For topics mentioned in this skill's scope but covered in dedicated guides, see:
These are workflow gates — go/no-go decisions that must pass before advancing to the next phase. (For the final pre-promotion sign-off artifact, see the Deployment Checklist in the worked examples.)
After training:
mlflow.log_params({...}); mlflow.log_metrics({...}); mlflow.sklearn.log_model(model, "model").
Before deployment:
After launch:
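The post-launch gate typically hinges on input drift. A minimal pure-NumPy sketch using the Population Stability Index is below; the quantile-binning scheme, the synthetic data, and the conventional 0.2 alert threshold are illustrative assumptions, not part of this skill's spec:

```python
# psi_drift.py — illustrative post-launch drift check (binning and threshold are assumptions)
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between training-time and live feature values,
    using quantile bins derived from the reference sample."""
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    live = np.clip(live, edges[0], edges[-1])  # keep live values inside the bin range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    ref_frac = np.clip(ref_frac, 1e-6, None)   # avoid log(0) on empty bins
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 5000)   # reference feature column
live_stable = rng.normal(0.0, 1.0, 1000)    # same distribution: low PSI
live_shifted = rng.normal(0.8, 1.0, 1000)   # shifted distribution: high PSI

print(psi(train_scores, live_stable) < 0.2)   # stable traffic stays under the alert threshold
print(psi(train_scores, live_shifted) > 0.2)  # shifted traffic trips the retraining trigger
```

A check like this would run per feature on a schedule, with a PSI above the threshold feeding the retraining trigger.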
Scenario: Binary classifier for churn prediction, logistic regression baseline vs. gradient boosting candidate.
Expected output artifact:
Model Evaluation Summary
========================
Task: Binary classification (churn, 30-day horizon)
Dataset: 45,000 train / 5,000 test (stratified split, class balance 18% positive)
Metric            Baseline (LR)   Candidate (GBM)
----------------  --------------  ---------------
AUC-ROC           0.71            0.83
F1 (thresh=0.4)   0.52            0.67
Precision         0.61            0.74
Recall            0.45            0.62
Latency (p50)     3 ms            11 ms
Slice analysis: AUC drops to 0.74 on users < 30 days tenure — flag for review.
Known failure mode: low recall on power users (top 5% by usage); dedicated sub-model recommended.
Recommendation: ship GBM if latency budget allows; re-evaluate slice issue before launch.

Scenario: Wrap a trained scikit-learn model for synchronous REST inference.
# serve.py
import mlflow.sklearn
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_URI = "models:/churn-classifier/Production"
model = mlflow.sklearn.load_model(MODEL_URI)

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]  # ordered feature vector

class PredictResponse(BaseModel):
    score: float
    label: int

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    x = np.array(req.features).reshape(1, -1)
    score = float(model.predict_proba(x)[0, 1])
    return PredictResponse(score=score, label=int(score >= 0.4))

@app.get("/health")
def health():
    return {"status": "ok"}

Scenario: Log a training run with hyperparameters, metrics, and the model artifact.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("churn-prediction")

params = {"n_estimators": 300, "max_depth": 5, "learning_rate": 0.05}

with mlflow.start_run(run_name="gbm-v1"):
    mlflow.log_params(params)
    clf = GradientBoostingClassifier(**params)
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    mlflow.log_metric("auc_roc", auc)
    mlflow.sklearn.log_model(
        clf,
        artifact_path="model",
        registered_model_name="churn-classifier",
    )
    print(f"AUC-ROC: {auc:.4f}")

Purpose: A final sign-off artifact to be completed and stored when promoting a model to production. Unlike the Validation Checkpoints above (which are per-phase go/no-go gates), this checklist is completed once, as a whole, immediately before the production promotion step.
Pre-Deployment Checklist
========================
[ ] Held-out test metrics meet or exceed agreed thresholds (document actual values)
[ ] Slice evaluation run; no unacceptable performance gaps on protected/high-risk groups
[ ] Model artifact versioned and registered (e.g., MLflow Model Registry stage = "Staging")
[ ] Serving endpoint smoke-tested with at least 10 representative + 5 adversarial inputs
[ ] Latency benchmark run: p50 ___ ms, p99 ___ ms (budget: p99 < ___ ms)
[ ] Feature pipeline tested end-to-end from raw source to model input
[ ] Previous model artifact tagged for rollback; serving config version-pinned
[ ] Monitoring dashboards live: input drift, output distribution, error rate, latency
[ ] Rollback runbook written and tested (who does it, how, in what time window)
[ ] Privacy / data-retention controls verified for inference logs
[ ] Human-review queue defined for low-confidence predictions (if required)
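The endpoint smoke-test item above can be sketched as a small harness that runs representative and adversarial payloads through a predict callable. Everything here is an illustrative assumption standing in for the real /predict endpoint: the stub predictor, its 4-feature contract, and the payload sets.

```python
# smoke_test.py — sketch of the checklist's smoke-test item
# (stub predictor, feature contract, and payloads are illustrative assumptions)
import math

def stub_predict(features):
    """Stand-in for the real /predict endpoint: validates input, returns score/label."""
    if len(features) != 4 or any(not math.isfinite(v) for v in features):
        raise ValueError("invalid feature vector")
    score = min(max(sum(features) / 40.0, 0.0), 1.0)  # toy score in [0, 1]
    return {"score": score, "label": int(score >= 0.4)}

def run_smoke(predict, representative, adversarial):
    """Representative inputs must yield well-formed responses;
    adversarial inputs must be rejected. Returns a list of failures."""
    failures = []
    for payload in representative:
        try:
            out = predict(payload)
            if not (0.0 <= out["score"] <= 1.0 and out["label"] in (0, 1)):
                failures.append((payload, "malformed response"))
        except Exception as exc:
            failures.append((payload, repr(exc)))
    for payload in adversarial:
        try:
            predict(payload)
            failures.append((payload, "accepted adversarial input"))
        except Exception:
            pass  # rejecting bad input is the desired outcome
    return failures

representative = [[float(i), float(i % 3), 0.5, 1.0] for i in range(10)]
adversarial = [
    [],                             # empty vector
    [1.0],                          # wrong length
    [float("nan"), 1.0, 1.0, 1.0],  # non-finite value
    [float("inf")] * 4,             # non-finite value
    ["a", 1.0, 1.0, 1.0],           # wrong type
]

print(run_smoke(stub_predict, representative, adversarial))  # an empty list means go
```

Against the real service, the stub would be replaced by an HTTP call to the staged endpoint, with the same pass criteria.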