Model Evaluation Metrics - Auto-activating skill for ML Training. Triggers on: model evaluation metrics, model evaluation metrics Part of the ML Training skill category.
3% (Does it follow best practices?)
Impact: 92%
1.00x (average score across 3 eval scenarios)
Passed (no known issues)
Optimize this skill with Tessl
`npx tessl skill review --optimize ./planned-skills/generated/07-ml-training/model-evaluation-metrics/SKILL.md`

Quality
Discovery
7%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is extremely weak — it essentially just names a topic ('model evaluation metrics') without describing any concrete capabilities, listing meaningful trigger terms, or providing explicit guidance on when to use the skill. The duplicated trigger term suggests auto-generation without review. It would be nearly indistinguishable from other ML-related skills in a large skill library.
Suggestions
Add specific concrete actions the skill performs, e.g., 'Computes precision, recall, F1 score, AUC-ROC, confusion matrices, and other classification/regression metrics for trained ML models.'
Add a 'Use when...' clause with natural trigger terms: 'Use when the user asks about accuracy, precision, recall, F1, confusion matrix, ROC curve, AUC, RMSE, MAE, model performance, or evaluating a trained model.'
Remove the duplicated trigger term and expand with diverse natural language variations users would actually say, such as 'how good is my model', 'evaluate predictions', 'classification report', 'test set performance'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description names a domain ('model evaluation metrics') but lists no concrete actions. There are no specific capabilities like 'compute precision, recall, F1 scores, generate confusion matrices' — just a vague reference to the topic. | 1 / 3 |
| Completeness | The 'what' is essentially absent — it never explains what the skill actually does beyond naming the topic. The 'when' is only implied through the trigger line but lacks explicit guidance. Missing a 'Use when...' clause caps this at 2, and the weak 'what' brings it to 1. | 1 / 3 |
| Trigger Term Quality | The trigger terms are literally duplicated ('model evaluation metrics, model evaluation metrics') and extremely narrow. Missing natural variations users would say like 'accuracy', 'precision', 'recall', 'F1 score', 'confusion matrix', 'ROC curve', 'AUC', 'loss metrics', etc. | 1 / 3 |
| Distinctiveness / Conflict Risk | The mention of 'ML Training' category and 'model evaluation metrics' provides some specificity to a niche, but the lack of concrete actions or file types means it could overlap with general ML skills, data analysis skills, or statistics skills. | 2 / 3 |
| Total | | 5 / 12 (Passed) |
Implementation
0%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is an empty shell—a template placeholder that contains no actual instructional content about model evaluation metrics. It lacks any concrete code, specific metrics definitions, library examples, or actionable workflows. It provides zero value beyond restating its own title in various phrasings.
Suggestions
Add concrete, executable code examples for common evaluation metrics (e.g., sklearn's accuracy_score, f1_score, roc_auc_score) with actual Python snippets; a sketch of what this could look like follows this list.
Include a clear workflow for model evaluation: train/test split → predict → compute metrics → interpret results, with specific validation checkpoints.
Remove all meta-description sections (Purpose, When to Use, Example Triggers, Capabilities) and replace with actual instructional content covering classification metrics, regression metrics, and when to use each.
Add a quick-reference table of metrics (accuracy, precision, recall, F1, AUC-ROC, MSE, MAE, R²) with one-line descriptions and code snippets for each.
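As a concrete illustration of the first two suggestions, a snippet along these lines could be embedded in the skill. This is only a minimal sketch, assuming scikit-learn and a binary-classification setup; the dataset and model choice are illustrative placeholders, not part of the skill as it exists today.

```python
# Sketch of the suggested workflow: split -> fit -> predict -> compute metrics -> interpret.
# Assumes scikit-learn; dataset and model are placeholders for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report,
)

# 1. Train/test split (the first validation checkpoint: hold out unseen data)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Fit a model and predict on the held-out set
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_scores = model.predict_proba(X_test)[:, 1]  # class-1 probabilities, needed for AUC-ROC

# 3. Compute the core classification metrics
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("auc-roc  :", roc_auc_score(y_test, y_scores))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))

# 4. Interpret: the classification report summarises per-class precision/recall/F1
print(classification_report(y_test, y_pred))
```

The regression counterparts suggested in the quick-reference table (mean_squared_error, mean_absolute_error, r2_score) live in the same sklearn.metrics module, so the table can point at a single import path.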
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is entirely filler and meta-description. It explains what the skill does in abstract terms without providing any actual knowledge or instructions. Every section restates the same vague concept ('model evaluation metrics') without adding substance. | 1 / 3 |
| Actionability | There is zero concrete guidance — no code, no commands, no specific metrics (accuracy, F1, AUC, etc.), no formulas, no library usage examples. It only describes rather than instructs. | 1 / 3 |
| Workflow Clarity | No workflow or steps are defined. The 'step-by-step guidance' is merely claimed in a bullet point but never actually provided. There are no sequences, validation checkpoints, or processes. | 1 / 3 |
| Progressive Disclosure | The content is a flat, monolithic block of meta-descriptions with no meaningful structure, no references to detailed materials, and no navigation to deeper content. The sections are all boilerplate with no real information hierarchy. | 1 / 3 |
| Total | | 4 / 12 (Passed) |
Validation
81%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 9 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 Passed |
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.