Skill description under review:

> Build this skill allows AI assistant to evaluate machine learning models using a comprehensive suite of metrics. it should be used when the user requests model performance analysis, validation, or testing. AI assistant can use this skill to assess model accuracy, p... Use when appropriate context detected. Trigger with relevant phrases based on skill purpose.
Impact: Pending (no eval scenarios have been run)
Status: Passed (no known issues)
Optimize this skill with Tessl:
`npx tessl skill review --optimize ./plugins/ai-ml/model-evaluation-suite/skills/evaluating-machine-learning-models/SKILL.md`

Quality: 3%

Does it follow best practices?
Discovery: 7%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is poorly constructed, with multiple issues: it begins with 'Build this skill', which is a meta-instruction rather than a description; it uses first/second-person framing ('AI assistant can use this skill'); it is truncated mid-sentence; and it ends with meaningless boilerplate trigger guidance. The core topic (ML model evaluation) is identifiable, but the description is too thin to support reliable skill selection.
Suggestions
- Replace the generic boilerplate 'Use when appropriate context detected' with specific trigger phrases like 'Use when the user asks to evaluate, validate, or benchmark a machine learning model, check accuracy, precision, recall, F1 score, AUC-ROC, or a confusion matrix.'
- List concrete actions the skill performs, e.g., 'Computes classification metrics (accuracy, precision, recall, F1), generates confusion matrices, plots ROC curves, performs cross-validation, and compares model performance.'
- Remove the meta-instruction 'Build this skill allows AI assistant to' and rewrite in the third person, active voice, e.g., 'Evaluates machine learning models using classification and regression metrics.' A sketch of an assembled description follows this list.
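Putting these suggestions together, a rewritten description might look like the sketch below; the wording is assembled from the suggestions above for illustration, not a verified fix:

```yaml
description: >
  Evaluates machine learning models by computing classification metrics
  (accuracy, precision, recall, F1), generating confusion matrices, plotting
  ROC curves, and running cross-validation. Use when the user asks to
  evaluate, validate, or benchmark a model, or mentions accuracy, precision,
  recall, F1 score, AUC-ROC, or a confusion matrix.
```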
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description starts with 'Build this skill allows AI assistant to evaluate machine learning models', which is vague and poorly written. It mentions a 'comprehensive suite of metrics' without listing any specific metrics, and the text is truncated ('accuracy, p...'). No concrete actions are clearly enumerated. | 1 / 3 |
| Completeness | The 'what' is vaguely stated (evaluate ML models) but lacks specifics due to truncation. The 'when' clause is entirely generic boilerplate ('Use when appropriate context detected') rather than an explicit, actionable trigger. It fails to meaningfully answer either question. | 1 / 3 |
| Trigger Term Quality | The description includes some relevant terms ('model performance analysis', 'validation', 'testing', 'accuracy'), but the filler 'Use when appropriate context detected. Trigger with relevant phrases based on skill purpose.' adds no real trigger terms, and the truncation cuts off potentially useful keywords. | 1 / 3 |
| Distinctiveness / Conflict Risk | The ML model evaluation domain is fairly specific and would not overlap with most non-ML skills, but the vague language ('comprehensive suite of metrics') and generic triggers could cause confusion with other data science or analytics skills. | 2 / 3 |
| Total | | 5 / 12 (Passed) |
Implementation: 0%

Reviews the quality of the instructions and guidance provided to agents. A good implementation is clear, handles edge cases, and produces reliable results.
This skill is almost entirely boilerplate with no actionable content. It describes what model evaluation is and when to use it in abstract terms, but never provides concrete code, specific commands with parameters, actual metric-calculation examples, or real implementation details. The `/eval-model` command and `model-evaluation-suite` plugin are referenced but never documented with usage syntax, expected inputs/outputs, or configuration options.
Suggestions
- Replace abstract descriptions with executable code examples showing actual metric calculations (e.g., using sklearn.metrics for accuracy, precision, recall, and F1 score with concrete inputs and outputs); a sketch follows this list.
- Document the `/eval-model` command with its actual syntax, required parameters, expected input format, and example output so Claude can use it directly.
- Remove the generic boilerplate sections (Error Handling, Output, Resources, Instructions) that contain no skill-specific information, and the 'Overview' section that explains concepts Claude already knows.
- Add concrete workflow steps with validation checkpoints, such as verifying the data format before evaluation, checking for class imbalance, and validating metric thresholds before reporting results.
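To make the first and last suggestions concrete, here is a minimal sketch of the kind of executable example the skill could include, assuming scikit-learn is available; the function name, imbalance threshold, and sample labels are illustrative assumptions, not drawn from the skill itself:

```python
# Minimal sketch of a concrete, copy-paste-ready metric calculation.
from collections import Counter

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate_classifier(y_true, y_pred, imbalance_threshold=0.8):
    """Compute core classification metrics with a class-imbalance checkpoint."""
    # Validation checkpoint: if one class dominates, plain accuracy is
    # misleading, so flag it before reporting results.
    counts = Counter(y_true)
    majority_share = max(counts.values()) / len(y_true)
    if majority_share > imbalance_threshold:
        print(
            f"Warning: majority class covers {majority_share:.0%} of labels; "
            "prefer precision/recall/F1 over accuracy."
        )

    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }


# Concrete input and output, as the first suggestion asks for:
print(evaluate_classifier([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```

If `/eval-model` wraps logic like this, documenting it (per the second suggestion) could reuse the same concrete inputs and expected outputs.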
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose, with extensive padding. Explains obvious concepts Claude already knows (what model evaluation is, when to use it) and includes generic boilerplate sections ('Error Handling', 'Output', 'Resources', 'Instructions') that add no actionable value. The 'Overview' section explains what the skill 'empowers Claude to do' rather than giving concrete instructions. | 1 / 3 |
| Actionability | No executable code, no concrete commands, no actual implementation details. References an `/eval-model` command and a `model-evaluation-suite` plugin without showing how to use them, what parameters they accept, or what output they produce. Examples describe what 'the skill will do' in abstract terms rather than providing copy-paste-ready code or commands. | 1 / 3 |
| Workflow Clarity | The workflow steps are entirely abstract ('Analyzing Context', 'Executing Evaluation', 'Presenting Results') with no concrete detail about what to actually do: no validation checkpoints, no error-recovery loops, no specific commands or parameters. The 'Instructions' section is generic boilerplate with no task-specific guidance. | 1 / 3 |
| Progressive Disclosure | A monolithic wall of text in which many sections contain little substance and nothing points to external files for detailed content. The 'Resources' section mentions 'Project documentation' and 'Related skills and commands' without any actual links or file paths, and the content is poorly organized, with redundant, overlapping sections (Overview, How It Works, When to Use, Instructions). | 1 / 3 |
| Total | | 4 / 12 (Passed) |
Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 9 / 11 checks passed
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing them or moving them to metadata | Warning |
| Total | 9 / 11 checks passed | |
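A sketch of frontmatter that would clear both warnings, assuming the spec recognizes `name`, `description`, `allowed-tools`, and a `metadata` block; the tool names and metadata contents are illustrative assumptions, since the skill's actual frontmatter is not shown here:

```yaml
# Hypothetical SKILL.md frontmatter limited to recognized keys.
name: evaluating-machine-learning-models
description: Evaluates machine learning models using classification and regression metrics.
# Use standard tool names only; unusual names triggered the first warning.
allowed-tools: Read, Bash
# Keys the spec does not know (the second warning) move under metadata.
metadata:
  category: ai-ml
```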