Trains and fine-tunes ML models, builds data preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference into production applications, and designs RAG and LLM-powered systems. Covers MLOps workflows including experiment tracking, drift detection, retraining triggers, and A/B testing. Use when the user asks about training or fine-tuning a model, building ML pipelines, model serving or inference optimization, evaluating model performance, working with frameworks like PyTorch, TensorFlow, scikit-learn, or Hugging Face, setting up vector databases, prompt engineering, or taking an ML prototype to production.
Overall score: 90

Quality: 88%. Does it follow best practices?
Impact: Pending. No eval scenarios have been run.
Passed. No known issues.

Quality
Discovery
92%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong description that excels in specificity and completeness, with a well-structured 'Use when...' clause containing numerous natural trigger terms. Its main weakness is its extremely broad scope—covering nearly the entire ML lifecycle from data preprocessing to production deployment to LLM systems—which could create overlap with more specialized skills in a large skill library. The description uses proper third-person voice throughout.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: trains/fine-tunes ML models, builds preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference into production, designs RAG and LLM-powered systems, experiment tracking, drift detection, retraining triggers, and A/B testing. | 3 / 3 |
| Completeness | Clearly answers both 'what' (trains models, builds pipelines, deploys as REST APIs, designs RAG systems, covers MLOps workflows) and 'when' with an explicit 'Use when...' clause listing specific trigger scenarios like training a model, building ML pipelines, working with specific frameworks, or taking prototypes to production. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of natural terms users would say: 'training', 'fine-tuning', 'ML pipelines', 'model serving', 'inference optimization', 'PyTorch', 'TensorFlow', 'scikit-learn', 'Hugging Face', 'vector databases', 'prompt engineering', 'ML prototype to production'. These are terms users would naturally use when seeking ML help. | 3 / 3 |
| Distinctiveness / Conflict Risk | While the ML/MLOps focus is fairly specific, the scope is extremely broad—covering everything from data preprocessing to prompt engineering to RAG systems to deployment. The 'prompt engineering' trigger could easily conflict with general LLM/coding skills, and 'vector databases' could overlap with database skills. The breadth increases conflict risk with more specialized skills. | 2 / 3 |
| Total |  | 11 / 12 Passed |
Implementation
85%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, well-structured skill that provides actionable guidance across the ML lifecycle. Its greatest strengths are the concrete worked examples (evaluation summary, serving template, experiment tracking, deployment checklist) and the explicit validation checkpoints with measurable thresholds. Minor verbosity in the mission/workflow preamble and some explanatory text around the checkpoints could be trimmed, but overall token efficiency is reasonable given the breadth of the skill.
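The checkpoints praised above reportedly gate on measurable thresholds, such as a p99 latency budget of 200 ms. As a rough illustration of what such a go/no-go check might look like, here is a minimal sketch; the function names and the nearest-rank percentile method are assumptions for illustration, not code taken from the skill itself:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct% of the samples at or below it."""
    xs = sorted(values)
    rank = math.ceil(pct / 100 * len(xs))
    return xs[rank - 1]

def latency_gate(latencies_ms, threshold_ms=200.0):
    """Go/no-go deployment check: p99 latency must stay under the budget."""
    return percentile(latencies_ms, 99) < threshold_ms

# Simulated request latencies in milliseconds.
samples = list(range(50, 250))
print(percentile(samples, 99), latency_gate(samples))
```

A real serving stack would collect these samples from load-test or canary traffic rather than a fixed list, but the gate logic itself stays this small.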
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is generally efficient but includes some sections that could be tightened. The mission statement and some workflow descriptions are slightly verbose for what Claude already knows. However, the examples and checklists earn their space. The deployment checklist and evaluation summary, while long, provide genuinely useful templates. | 2 / 3 |
| Actionability | The skill provides fully executable Python code for model serving (FastAPI), experiment tracking (MLflow), concrete metric thresholds (PSI > 0.2, p99 < 200ms), and copy-paste ready templates. The evaluation summary example and deployment checklist are specific and immediately usable. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced with explicit validation checkpoints at three stages (after training, before deployment, after launch). Each checkpoint has concrete go/no-go criteria with specific thresholds. The relationship between per-phase gates and the final deployment checklist is explicitly clarified, and rollback/retraining triggers are defined. | 3 / 3 |
| Progressive Disclosure | The skill cleanly separates concerns with one-level-deep references to RAG_SYSTEMS.md, VECTOR_DATABASES.md, and FRAMEWORK_GUIDES.md. The main content stays focused on the core workflow and examples, with clear internal navigation via anchor links between validation checkpoints and the deployment checklist. | 3 / 3 |
| Total |  | 11 / 12 Passed |
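The rubric above cites "PSI > 0.2" as the skill's drift threshold. Population Stability Index compares a baseline feature distribution against live traffic, bin by bin. A minimal pure-Python sketch follows; the function name and the epsilon smoothing are assumptions for illustration, not the skill's actual code:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are bin proportions that each sum to 1.
    Common reading: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2
    significant drift (the retraining trigger cited in the review).
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # clamp to avoid log(0) on empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]  # training-time feature distribution
current = [0.10, 0.20, 0.30, 0.40]   # live traffic distribution
score = psi(baseline, current)       # about 0.228, above the 0.2 trigger
print(f"PSI = {score:.3f}, drift = {score > 0.2}")
```

In practice the proportions would come from histogramming the same feature over a reference window and a recent window, with matching bin edges.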
Validation
90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
Version: 010799b