Trains and fine-tunes ML models, builds data preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference into production applications, and designs RAG and LLM-powered systems. Covers MLOps workflows including experiment tracking, drift detection, retraining triggers, and A/B testing. Use when the user asks about training or fine-tuning a model, building ML pipelines, model serving or inference optimization, evaluating model performance, working with frameworks like PyTorch, TensorFlow, scikit-learn, or Hugging Face, setting up vector databases, prompt engineering, or taking an ML prototype to production.
**Evaluation summary**

| Metric | Value |
| --- | --- |
| Does it follow best practices? | 88% |
| Impact | 81% |
| Average score across 3 eval scenarios | 1.09x |
| Result | Passed; no known issues |
**MLflow experiment tracking and model evaluation summary**

| Criterion | Score A | Score B |
| --- | --- | --- |
| MLflow experiment name | 0% | 0% |
| MLflow run name | 0% | 0% |
| MLflow log_params | 0% | 0% |
| MLflow log_metrics | 0% | 0% |
| MLflow log_model with registered name | 0% | 0% |
| Evaluation summary task description | 40% | 100% |
| Evaluation summary dataset info | 100% | 100% |
| Evaluation summary baseline vs candidate | 100% | 100% |
| Latency in evaluation summary | 100% | 100% |
| Slice evaluation present | 100% | 100% |
| Failure mode identified | 100% | 100% |
| Deployment recommendation | 100% | 100% |
| Stratified split used | 100% | 100% |
| No fabricated numbers in prose | 100% | 100% |
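One of the criteria above is "Stratified split used". As a minimal illustration of what that check expects, here is a pure-Python sketch of a stratified train/test split; the function name, seed, and 20% test fraction are illustrative choices, not part of the evaluated skill:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) such that every class keeps roughly
    the same train/test proportion as in the full label list."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Hold out test_frac of each class, at least one example.
        n_test = max(1, round(len(idxs) * test_frac))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

In practice the same effect is usually achieved with scikit-learn's `train_test_split(X, y, stratify=y)`.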
**FastAPI model serving endpoint with MLflow model registry**

| Criterion | Score A | Score B |
| --- | --- | --- |
| FastAPI used | 100% | 100% |
| Pydantic request schema | 100% | 100% |
| Pydantic response schema | 100% | 100% |
| response_model used | 100% | 100% |
| POST /predict endpoint | 100% | 100% |
| Health check endpoint | 100% | 100% |
| MLflow model registry URI | 100% | 100% |
| mlflow.sklearn.load_model used | 0% | 100% |
| numpy array reshape | 100% | 100% |
| Score extraction | 0% | 100% |
**Pre-deployment checklist and post-launch monitoring thresholds**

| Criterion | Score A | Score B |
| --- | --- | --- |
| Checklist: held-out metrics | 100% | 66% |
| Checklist: slice evaluation | 100% | 100% |
| Checklist: artifact versioning | 100% | 100% |
| Checklist: smoke test | 100% | 100% |
| Checklist: latency benchmark | 66% | 50% |
| Checklist: rollback artifact | 100% | 100% |
| Checklist: monitoring dashboards | 100% | 100% |
| Checklist: rollback runbook | 100% | 100% |
| Checklist: privacy controls | 100% | 100% |
| Drift metric: PSI threshold | 100% | 58% |
| Retraining trigger: performance drop | 30% | 70% |
| Retraining trigger: data volume | 100% | 50% |
| Rollback trigger: error rate | 40% | 100% |
| Rollback trigger: latency duration | 100% | 100% |
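The "Drift metric: PSI threshold" row refers to the Population Stability Index, which compares a feature's live distribution against its training-time distribution; a widely used convention treats PSI above 0.2 as significant drift. A stdlib-only sketch, where the bin count and epsilon floor are illustrative choices:

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference (training) sample
    and a live sample of the same numeric feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width bin edges over the reference range.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Floor at eps so the log below never sees a zero.
        return [max(c / len(values), eps) for c in counts]

    e_frac, a_frac = fractions(expected), fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_frac, a_frac))
```

Identical distributions give PSI near 0, and a shifted live sample pushes it well past the 0.2 alert threshold.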