Trains and fine-tunes ML models, builds data preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference into production applications, and designs RAG and LLM-powered systems. Covers MLOps workflows including experiment tracking, drift detection, retraining triggers, and A/B testing. Use when the user asks about training or fine-tuning a model, building ML pipelines, model serving or inference optimization, evaluating model performance, working with frameworks like PyTorch, TensorFlow, scikit-learn, or Hugging Face, setting up vector databases, prompt engineering, or taking an ML prototype to production.
Quality: 88% (Does it follow best practices?)
Impact: 81% (1.09x average score across 3 eval scenarios)
Validation: Passed (No known issues)

Quality

Discovery: 92%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong, well-structured skill description that clearly articulates specific capabilities and includes an explicit 'Use when...' clause with rich trigger terms. Its main weakness is its extremely broad scope, covering nearly the entire ML lifecycle from data preprocessing to production deployment to LLM systems, which could create overlap with more specialized skills in a large skill library. The description uses appropriate third-person voice throughout.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple concrete actions: trains and fine-tunes models, builds preprocessing and feature engineering pipelines, deploys models as REST APIs, integrates inference, designs RAG/LLM systems, and covers experiment tracking, drift detection, retraining triggers, and A/B testing. | 3 / 3 |
| Completeness | Clearly answers both 'what' (trains models, builds pipelines, deploys as REST APIs, designs RAG systems, covers MLOps workflows) and 'when', with an explicit 'Use when...' clause listing specific trigger scenarios such as training models, building pipelines, working with specific frameworks, and taking prototypes to production. | 3 / 3 |
| Trigger Term Quality | Excellent coverage of terms users would naturally say: 'training', 'fine-tuning', 'ML pipelines', 'model serving', 'inference optimization', 'PyTorch', 'TensorFlow', 'scikit-learn', 'Hugging Face', 'vector databases', 'prompt engineering', 'ML prototype to production'. | 3 / 3 |
| Distinctiveness / Conflict Risk | While the ML/AI domain is well defined, the scope is extremely broad, covering everything from data preprocessing to RAG systems to prompt engineering to MLOps. The 'prompt engineering' trigger could conflict with general LLM-usage skills, and 'REST APIs' could overlap with web development skills; the breadth raises conflict risk with more specialized skills. | 2 / 3 |
| Total | | 11 / 12 (Passed) |
Implementation: 85%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This is a strong, well-structured skill that provides actionable guidance with executable code examples, clear multi-phase workflows with explicit validation gates, and appropriate progressive disclosure to supplementary files. The main weakness is moderate verbosity: some sections (the Mission, constraint explanations, and overlapping checklist/checkpoint content) could be tightened to better respect the token budget. Overall it serves as an effective operational guide for AI engineering tasks.
Suggestions
- Trim the Mission section to one line or remove it; Claude doesn't need a philosophical framing of the role.
- Consider consolidating the Validation Checkpoints and the Deployment Checklist into a single artifact to reduce redundancy, or differentiating them more clearly with less explanatory prose.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is generally efficient but has pockets of verbosity: the Mission section restates things Claude already knows, and the Deployment Checklist and Validation Checkpoints overlap in ways that could be tightened. The worked examples are useful but collectively make the file longer than necessary for a SKILL.md overview. | 2 / 3 |
| Actionability | The skill provides fully executable Python code for model serving (FastAPI), experiment tracking (MLflow), concrete metric thresholds (PSI > 0.2, p99 < 200 ms), and a detailed evaluation summary format. All code examples are copy-paste ready with real libraries and realistic patterns; a hedged sketch of these patterns follows this table. | 3 / 3 |
| Workflow Clarity | The workflow is clearly sequenced, with explicit validation checkpoints at three stages (after training, before deployment, after launch). Each checkpoint has concrete pass/fail criteria, and the Deployment Checklist provides a comprehensive sign-off artifact. Feedback loops for drift detection and rollback triggers are well defined. | 3 / 3 |
| Progressive Disclosure | The skill cleanly separates concerns with one-level-deep references to RAG_SYSTEMS.md, VECTOR_DATABASES.md, and FRAMEWORK_GUIDES.md. The main file serves as an effective overview, with worked examples inline (appropriate for a skill file) and clear navigation via internal anchors and external references. | 3 / 3 |
| Total | | 11 / 12 (Passed) |
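For reference, here is a minimal, hedged sketch of the patterns the Actionability and Workflow Clarity rows credit: a FastAPI inference endpoint and a PSI drift gate at the 0.2 threshold. This is not the skill's actual code; the stand-in model, the four-feature schema, and the simulated score shift are assumptions made to keep the example self-contained.

```python
# Illustrative sketch only: a FastAPI inference endpoint plus a PSI drift
# check against the 0.2 retraining threshold cited in the review.
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.linear_model import LogisticRegression

# Stand-in model so the sketch runs as-is; a real service would load a
# persisted artifact (e.g., with joblib) instead of training inline.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 4))
y_ref = (X_ref.sum(axis=1) > 0).astype(int)
model = LogisticRegression().fit(X_ref, y_ref)

app = FastAPI()

class Features(BaseModel):
    values: list[float]  # must match the stand-in model's 4 features

@app.post("/predict")
def predict(features: Features) -> dict:
    proba = model.predict_proba(np.asarray(features.values).reshape(1, -1))
    return {"positive_probability": float(proba[0, 1])}

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between reference and live distributions."""
    cuts = np.quantile(expected, np.linspace(0.0, 1.0, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf  # absorb out-of-range live values
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(actual, cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

# Drift gate: PSI above 0.2 on the score distribution would fire the
# retraining trigger described in the workflow.
ref_scores = model.predict_proba(X_ref)[:, 1]
live_scores = model.predict_proba(X_ref + 0.5)[:, 1]  # simulated covariate shift
drift = psi(ref_scores, live_scores)
print(f"PSI = {drift:.3f}; retrain = {drift > 0.2}")
```

A production version would load a persisted model artifact, compute PSI per feature as well as on the score distribution, and serve the app with an ASGI server such as uvicorn.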
Validation: 90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation checks: 10 / 11 passed
Validation of skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing them or moving them under metadata (a hypothetical fix follows this table) | Warning |
| Total | | 10 / 11 (Passed) |
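The single warning concerns unknown frontmatter keys. A hypothetical fix, with `version` and `maintainer` standing in for whatever keys the validator actually flagged, nests them under `metadata` instead of leaving them at the top level:

```yaml
# Hypothetical SKILL.md frontmatter: the skill name is assumed, and
# `version`/`maintainer` are invented stand-ins for the flagged keys,
# moved under `metadata` to satisfy the validator.
---
name: ai-ml-engineering
description: Trains and fine-tunes ML models, builds data preprocessing and feature engineering pipelines, deploys models as REST APIs...
metadata:
  version: "1.0"
  maintainer: example-team
---
```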