This skill allows Claude to evaluate machine learning models using a comprehensive suite of metrics. It should be used when the user requests model performance analysis, validation, or testing. Claude can use this skill to assess model accuracy, precision, recall, F1-score, and other relevant metrics. Trigger this skill when the user mentions "evaluate model", "model performance", "testing metrics", "validation results", or requests a comprehensive "model evaluation".
Overall score: 90
Does it follow best practices? 44%
Impact: 99% (1.01x average score across 9 eval scenarios)
Passed. No known issues.

Optimize this skill with Tessl:
npx tessl skill review --optimize ./backups/skills-migration-20251108-070147/plugins/ai-ml/model-evaluation-suite/skills/model-evaluation-suite/SKILL.md

Quality
Discovery
82%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description is reasonably well-structured with explicit trigger terms and a clear 'when to use' clause, which are its strongest aspects. However, it loses points on specificity by not detailing concrete actions beyond 'assess' and using the vague catch-all 'other relevant metrics.' It also uses second/third person inconsistently—phrases like 'This skill allows Claude' and 'Claude can use this skill' are acceptable third-person framing but 'It should be used when the user requests' is passive and slightly awkward. The description would benefit from more concrete action verbs and clearer differentiation from adjacent ML skills.
Suggestions
Replace vague phrases like 'other relevant metrics' and 'comprehensive suite of metrics' with specific additional capabilities (e.g., 'generate confusion matrices, ROC curves, cross-validation reports'); a sketch of these capabilities follows this list.
Improve distinctiveness by clarifying what this skill does NOT cover (e.g., model training, hyperparameter tuning) or by specifying the output format (e.g., 'produces evaluation reports with metric tables and visualizations')
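To ground the first suggestion, here is a minimal sketch of the named capabilities, assuming a Python implementation with scikit-learn and NumPy; the arrays, threshold, and estimator are made up for illustration and are not part of the skill:

```python
# Hypothetical illustration of "confusion matrices, ROC curves,
# cross-validation reports"; not the skill's actual interface.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score

y_true = np.array([0, 1, 1, 0, 1, 0])               # made-up labels
y_score = np.array([0.2, 0.9, 0.6, 0.3, 0.8, 0.4])  # made-up predicted probabilities
y_pred = (y_score >= 0.5).astype(int)               # illustrative decision threshold

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(roc_auc_score(y_true, y_score))    # area under the ROC curve

# Cross-validation report on a toy dataset (accuracy per fold)
X = y_score.reshape(-1, 1)
print(cross_val_score(LogisticRegression(), X, y_true, cv=3))
```

Naming capabilities at this level of concreteness in the description would let an agent match user requests like "show me the confusion matrix" directly to this skill.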
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description names the domain (ML model evaluation) and lists some specific metrics (accuracy, precision, recall, F1-score), but the phrase 'other relevant metrics' is vague and the actual concrete actions beyond 'evaluate' and 'assess' are not well-defined (e.g., does it generate reports, produce confusion matrices, create visualizations?). | 2 / 3 |
| Completeness | The description clearly answers both 'what' (evaluate ML models using metrics like accuracy, precision, recall, F1-score) and 'when' (explicit trigger clause with specific phrases like 'evaluate model', 'model performance', etc.). | 3 / 3 |
| Trigger Term Quality | The description includes explicit trigger terms that users would naturally say: 'evaluate model', 'model performance', 'testing metrics', 'validation results', and 'model evaluation'. These cover common natural language variations well. | 3 / 3 |
| Distinctiveness / Conflict Risk | While ML model evaluation is a reasonably specific niche, terms like 'model performance' and 'testing metrics' could overlap with skills related to general data analysis, statistical testing, or ML training/tuning. The description could be more precise about what distinguishes evaluation from other ML workflow steps. | 2 / 3 |
| Total | | 10 / 12 (Passed) |
Implementation
7%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill content is largely descriptive rather than instructive, explaining what the skill does in abstract terms without providing any concrete, executable guidance. It lacks real code examples, specific command syntax, parameter details, expected output formats, and validation steps. The content reads more like a marketing description than an actionable skill file.
Suggestions
Replace the abstract examples with concrete, executable code showing actual `/eval-model` command syntax with real parameters, input formats, and expected output schemas.
Remove the 'When to Use This Skill' and 'How It Works' sections entirely—this information is already in the skill description and adds no actionable value.
Add a concrete workflow with validation steps, e.g., checking that input data is in the correct format before running evaluation, and handling common error cases.
Include a real example showing actual metric output (e.g., a JSON response with accuracy, precision, recall, F1-score values) so Claude knows what to expect and how to present results; an illustrative sketch follows this list.
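To illustrate the last two suggestions together, here is a minimal sketch of an evaluation workflow with an input-validation checkpoint and JSON metric output, assuming a Python implementation with scikit-learn; the function name, data, and output schema are hypothetical, not the skill's actual /eval-model interface:

```python
# Hypothetical sketch of the kind of concrete, executable guidance the
# skill file should contain; not the skill's real implementation.
import json
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_model(y_true, y_pred):
    # Validation checkpoint: catch malformed input before computing metrics.
    if len(y_true) != len(y_pred):
        raise ValueError(f"Length mismatch: {len(y_true)} labels vs {len(y_pred)} predictions")
    if len(y_true) == 0:
        raise ValueError("Empty evaluation set")
    return {
        "accuracy": round(accuracy_score(y_true, y_pred), 4),
        "precision": round(precision_score(y_true, y_pred, average="macro", zero_division=0), 4),
        "recall": round(recall_score(y_true, y_pred, average="macro", zero_division=0), 4),
        "f1_score": round(f1_score(y_true, y_pred, average="macro", zero_division=0), 4),
    }

# Example output Claude could expect and present as a metric table:
print(json.dumps(evaluate_model([0, 1, 1, 0], [0, 1, 0, 0]), indent=2))
```

An example like this, adapted to the skill's real command syntax, would give Claude both a runnable workflow with error handling and a concrete output schema to present to the user.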
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is verbose and padded with unnecessary explanations. Phrases like 'This skill empowers Claude to perform thorough evaluations' and 'providing detailed performance insights' add no value. The 'When to Use This Skill' section repeats information from the description. The 'How It Works' section describes abstract steps Claude already understands rather than providing concrete instructions. | 1 / 3 |
| Actionability | There is no executable code, no concrete commands with actual syntax, no specific metric calculation examples, and no real code snippets. The '/eval-model' command is mentioned but never shown with actual parameters, expected input formats, or output schemas. The examples describe what 'the skill will' do rather than showing how to do it. | 1 / 3 |
| Workflow Clarity | The steps are vague and abstract ('Invoke the /eval-model command', 'Analyze the model's performance') with no concrete parameters, no validation checkpoints, no error handling, and no feedback loops. There is no guidance on what to do if evaluation fails or produces unexpected results. | 1 / 3 |
| Progressive Disclosure | The content has some structural organization with clear section headers (Overview, How It Works, Examples, Best Practices, Integration), but there are no references to external files and the content is somewhat monolithic. Given there are no bundle files, the inline-only approach is acceptable but the sections contain too much filler content that could be trimmed rather than split. | 2 / 3 |
| Total | | 5 / 12 (Passed) |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 checks passed.
Validation for skill structure: no warnings or errors.