Trains and fine-tunes ML models, builds data preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference into production applications, and designs RAG and LLM-powered systems. Covers MLOps workflows including experiment tracking, drift detection, retraining triggers, and A/B testing. Use when the user asks about training or fine-tuning a model, building ML pipelines, model serving or inference optimization, evaluating model performance, working with frameworks like PyTorch, TensorFlow, scikit-learn, or Hugging Face, setting up vector databases, prompt engineering, or taking an ML prototype to production.
Quality: 88% (Does it follow best practices?)
Impact: 81% (1.09x average score across 3 eval scenarios)
Validation: Passed (No known issues)

Quality

Discovery: 92%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong, well-structured skill description that clearly articulates specific capabilities and includes an explicit 'Use when...' clause with rich trigger terms. Its main weakness is its extremely broad scope, covering nearly the entire ML lifecycle from data preprocessing to production deployment to LLM systems, which could create overlap with more specialized skills in a large skill library. The description uses appropriate third-person voice throughout.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple concrete actions: trains and fine-tunes models, builds preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference, designs RAG/LLM systems, and covers experiment tracking, drift detection, retraining triggers, and A/B testing. | 3 / 3 |
| Completeness | Clearly answers both 'what' (trains models, builds pipelines, deploys as REST APIs, designs RAG systems, covers MLOps workflows) and 'when', with an explicit 'Use when...' clause listing specific trigger scenarios such as training models, building pipelines, working with specific frameworks, and taking prototypes to production. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of terms users would naturally say: 'training', 'fine-tuning', 'ML pipelines', 'model serving', 'inference optimization', 'PyTorch', 'TensorFlow', 'scikit-learn', 'Hugging Face', 'vector databases', 'prompt engineering', 'ML prototype to production'. | 3 / 3 |
| Distinctiveness / Conflict Risk | While the ML/AI domain is well defined, the scope is extremely broad, covering everything from data preprocessing to RAG systems to prompt engineering to MLOps. The 'prompt engineering' trigger could conflict with general LLM-usage skills, and 'REST APIs' could overlap with web development skills; the breadth raises conflict risk with more specialized skills. | 2 / 3 |
| Total | | 11 / 12 (Passed) |
Implementation: 85%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, well-structured skill that provides actionable guidance with executable code examples, clear multi-phase workflows with explicit validation gates, and appropriate progressive disclosure to supplementary files. The main weakness is moderate verbosity: some sections (the Mission, constraint explanations, and overlapping checklist/checkpoint content) could be tightened to better respect the token budget. Overall it serves as an effective operational guide for AI engineering tasks.
Suggestions
- Trim the Mission section to one line or remove it; Claude doesn't need a philosophical framing of the role.
- Consider consolidating the Validation Checkpoints and the Deployment Checklist into a single artifact to reduce redundancy, or differentiating them more clearly with less explanatory prose.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is generally efficient but has pockets of verbosity: the Mission section restates things Claude already knows, and the Deployment Checklist and Validation Checkpoints overlap in ways that could be tightened. The worked examples are useful but collectively make the file longer than necessary for a SKILL.md overview. | 2 / 3 |
| Actionability | The skill provides fully executable Python code for model serving (FastAPI), experiment tracking (MLflow), concrete metric thresholds (PSI > 0.2, p99 < 200 ms), and a detailed evaluation summary format. All code examples are copy-paste ready with real libraries and realistic patterns; a hedged sketch of these patterns follows this table. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced, with explicit validation checkpoints at three stages (after training, before deployment, after launch). Each checkpoint has concrete pass/fail criteria, and the Deployment Checklist provides a comprehensive sign-off artifact. Feedback loops for drift detection and rollback triggers are well defined. | 3 / 3 |
| Progressive Disclosure | The skill cleanly separates concerns with one-level-deep references to RAG_SYSTEMS.md, VECTOR_DATABASES.md, and FRAMEWORK_GUIDES.md. The main file serves as an effective overview, with worked examples inline (appropriate for a skill file) and clear navigation via internal anchors and external references. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
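For reference, here is a minimal, hedged sketch of the patterns the Actionability and Workflow Clarity rows credit: a FastAPI inference endpoint and a PSI drift gate at the 0.2 threshold. This is not the skill's actual code; the stand-in model, the four-feature schema, and the simulated score shift are assumptions made to keep the example self-contained.

```python
# Illustrative sketch only: a FastAPI inference endpoint plus a PSI drift
# check against the 0.2 retraining threshold cited in the review.
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.linear_model import LogisticRegression

# Stand-in model so the sketch runs as-is; a real service would load a
# persisted artifact (e.g., with joblib) instead of training inline.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 4))
y_ref = (X_ref.sum(axis=1) > 0).astype(int)
model = LogisticRegression().fit(X_ref, y_ref)

app = FastAPI()

class Features(BaseModel):
    values: list[float]  # must match the stand-in model's 4 features

@app.post("/predict")
def predict(features: Features) -> dict:
    proba = model.predict_proba(np.asarray(features.values).reshape(1, -1))
    return {"positive_probability": float(proba[0, 1])}

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between reference and live distributions."""
    cuts = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf  # absorb out-of-range live values
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(actual, cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Drift gate: PSI above 0.2 on the score distribution would fire the
# retraining trigger described in the workflow.
ref_scores = model.predict_proba(X_ref)[:, 1]
live_scores = model.predict_proba(X_ref + 0.5)[:, 1]  # simulated covariate shift
drift = psi(ref_scores, live_scores)
print(f"PSI = {drift:.3f}; retrain = {drift > 0.2}")
```

A production version would load a persisted model artifact, compute PSI per feature as well as on the score distribution, and serve the app with an ASGI server such as uvicorn.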
Validation: 90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation checks: 10 / 11 passed
Validation of skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing them or moving them under metadata (a hypothetical fix follows this table) | Warning |
| Total | | 10 / 11 (Passed) |
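The single warning concerns unknown frontmatter keys. A hypothetical fix, with `version` and `maintainer` standing in for whatever keys the validator actually flagged, nests them under `metadata` instead of leaving them at the top level:

```yaml
# Hypothetical SKILL.md frontmatter: the skill name is assumed, and
# `version`/`maintainer` are invented stand-ins for the flagged keys,
# moved under `metadata` to satisfy the validator.
---
name: ai-ml-engineering
description: Trains and fine-tunes ML models, builds data preprocessing and feature engineering pipelines, deploys models as REST APIs...
metadata:
  version: "1.0"
  maintainer: example-team
---
```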