Model Evaluation Metrics - Auto-activating skill for ML Training. Triggers on: model evaluation metrics, model evaluation metrics. Part of the ML Training skill category.
Quality: 0% (Does it follow best practices?)
Impact: 92% (1.00x average score across 3 eval scenarios)
Passed; no known issues.

Optimize this skill with Tessl:
npx tessl skill review --optimize ./planned-skills/generated/07-ml-training/model-evaluation-metrics/SKILL.md

Quality
Discovery: 0%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an extremely weak description that essentially just restates its title with no concrete actions, no meaningful trigger terms, and no explicit guidance on when to use it. The trigger terms are duplicated rather than varied, and the description reads as auto-generated boilerplate rather than a useful skill selector.
Suggestions
- Add specific, concrete actions the skill performs, e.g., 'Calculates precision, recall, F1 score, AUC-ROC, and confusion matrices for trained models. Compares model performance across experiments.' (See the sketch after this list for what those actions look like in code.)
- Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks about accuracy, precision, recall, F1, confusion matrix, ROC curve, model performance, or evaluation results.'
- Remove the duplicated trigger term and replace it with diverse natural-language variations users would actually say, such as 'model accuracy', 'how well does my model perform', 'classification report', 'loss curves', etc.
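For illustration, a minimal sketch of what the first suggestion's concrete actions could look like, using scikit-learn on a synthetic binary-classification problem; the dataset, model, and variable names here are placeholders, not content from the skill itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix,
)
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real trained model and a held-out evaluation set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities, needed for AUC-ROC

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("auc_roc  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
```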
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description names a domain ('model evaluation metrics') but lists no concrete actions. There are no specific capabilities like 'calculate precision/recall', 'generate confusion matrices', or 'compare model performance'. It is essentially a label, not a description of what the skill does. | 1 / 3 |
| Completeness | The description fails to answer 'what does this do' beyond naming the topic, and the 'when' clause is just a redundant restatement of the title. There is no explicit 'Use when...' guidance with meaningful triggers. | 1 / 3 |
| Trigger Term Quality | The trigger terms listed are just 'model evaluation metrics' repeated twice. There are no natural keyword variations users might say, such as 'accuracy', 'precision', 'recall', 'F1 score', 'confusion matrix', 'ROC curve', 'AUC', 'loss function', etc. | 1 / 3 |
| Distinctiveness / Conflict Risk | The description is so vague that it could overlap with any ML-related skill. 'Model evaluation metrics' without specifying which metrics, what actions, or what context makes it indistinguishable from other ML training or evaluation skills. | 1 / 3 |
| Total | | 4 / 12 Passed |
Implementation: 0%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is an empty shell with no actual content. It consists entirely of auto-generated boilerplate that describes what the skill would do without providing any actionable information about model evaluation metrics. There is no code, no specific metrics discussed, no formulas, no examples, and no workflows.
Suggestions
- Add concrete, executable code examples for computing common evaluation metrics (accuracy, precision, recall, F1, AUC-ROC) using sklearn or PyTorch.
- Include a workflow for model evaluation: train/val/test split → compute metrics → interpret results → iterate, with specific validation checkpoints (a minimal sketch of this flow follows this list).
- Remove all meta-description sections ('Purpose', 'When to Use', 'Example Triggers') and replace them with actual technical content such as metric selection guidance, code snippets, and common pitfalls.
- Add a quick-reference table mapping task types (classification, regression, ranking) to appropriate metrics, with one-liner code examples for each (a sketch follows the scoring table below).
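A minimal sketch of the evaluation workflow suggested above, assuming a scikit-learn classifier and a synthetic dataset; the hyperparameter grid and all names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 1. Split: hold out a test set, then carve a validation set out of the remainder
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 2-3. Compute metrics on the validation set, interpret, and iterate over candidate settings
best_c, best_f1 = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    val_f1 = f1_score(y_val, model.predict(X_val))
    if val_f1 > best_f1:  # validation checkpoint: keep the best-performing configuration
        best_c, best_f1 = c, val_f1

# 4. Final checkpoint: evaluate the chosen configuration once on the untouched test set
final = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
print("validation F1:", round(best_f1, 3))
print("test F1      :", round(f1_score(y_test, final.predict(X_test)), 3))
```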
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is entirely filler and meta-description. It explains what the skill does in abstract terms without providing any actual knowledge or instructions. Every section restates the same vague concept ('model evaluation metrics') without adding substance. | 1 / 3 |
| Actionability | There is zero concrete guidance: no code, no commands, no specific metrics, no formulas, no examples of computing precision/recall/F1/AUC or any other evaluation metric. It describes rather than instructs. | 1 / 3 |
| Workflow Clarity | No workflow or steps are defined. The 'step-by-step guidance' is merely claimed in a bullet point but never actually provided. There are no sequences, validation checkpoints, or processes of any kind. | 1 / 3 |
| Progressive Disclosure | The content has section headers, but they contain no meaningful information, just repeated boilerplate. There are no references to detailed files, no navigation structure, and no actual content to disclose progressively. | 1 / 3 |
| Total | | 4 / 12 Passed |
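As an illustration of the quick-reference mapping suggested above, one possible sketch is a plain Python dictionary of scikit-learn one-liners; the metric choices and names are assumptions, not taken from the skill:

```python
from sklearn import metrics

# Task type -> list of (metric name, callable). Callers supply the appropriate inputs:
# labels/predictions for classification and regression, 2-D relevance/score arrays for ranking.
METRIC_QUICK_REFERENCE = {
    "classification": [
        ("f1_macro", lambda y_true, y_pred: metrics.f1_score(y_true, y_pred, average="macro")),
        ("roc_auc",  lambda y_true, y_prob: metrics.roc_auc_score(y_true, y_prob)),
    ],
    "regression": [
        ("mae",  metrics.mean_absolute_error),
        ("rmse", lambda y_true, y_pred: metrics.mean_squared_error(y_true, y_pred) ** 0.5),
        ("r2",   metrics.r2_score),
    ],
    "ranking": [
        ("ndcg@10", lambda relevance, scores: metrics.ndcg_score(relevance, scores, k=10)),
    ],
}
```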
Validation: 81%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
9 of 11 validation checks passed.
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 Passed |