Model Evaluation Metrics - Auto-activating skill for ML Training. Triggers on: model evaluation metrics, model evaluation metrics. Part of the ML Training skill category.
Quality: 0% (Does it follow best practices?)
Impact: 92% (1.00x average score across 3 eval scenarios)
Passed; no known issues.

Optimize this skill with Tessl:
npx tessl skill review --optimize ./planned-skills/generated/07-ml-training/model-evaluation-metrics/SKILL.md

Quality
Discovery: 0%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is an extremely weak description that essentially just restates its title with no concrete actions, no meaningful trigger terms, and no explicit guidance on when to use it. The trigger terms are duplicated rather than varied, and the description reads as auto-generated boilerplate rather than a useful skill selector.
Suggestions
- Add specific, concrete actions the skill performs, e.g., 'Calculates precision, recall, F1 score, AUC-ROC, and confusion matrices for trained models. Compares model performance across experiments.' (See the sketch after this list for what those actions look like in code.)
- Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks about accuracy, precision, recall, F1, confusion matrix, ROC curve, model performance, or evaluation results.'
- Remove the duplicated trigger term and replace it with diverse natural-language variations users would actually say, such as 'model accuracy', 'how well does my model perform', 'classification report', 'loss curves', etc.
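For illustration, a minimal sketch of what the first suggestion's concrete actions could look like, using scikit-learn on a synthetic binary-classification problem; the dataset, model, and variable names here are placeholders, not content from the skill itself:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix,
)
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real trained model and a held-out evaluation set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # positive-class probabilities, needed for AUC-ROC

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("auc_roc  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
```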
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description names a domain ('model evaluation metrics') but lists no concrete actions. There are no specific capabilities like 'calculate precision/recall', 'generate confusion matrices', or 'compare model performance'. It is essentially a label, not a description of what the skill does. | 1 / 3 |
| Completeness | The description fails to answer 'what does this do' beyond naming the topic, and the 'when' clause is just a redundant restatement of the title. There is no explicit 'Use when...' guidance with meaningful triggers. | 1 / 3 |
| Trigger Term Quality | The trigger terms listed are just 'model evaluation metrics' repeated twice. There are no natural keyword variations users might say, such as 'accuracy', 'precision', 'recall', 'F1 score', 'confusion matrix', 'ROC curve', 'AUC', 'loss function', etc. | 1 / 3 |
| Distinctiveness / Conflict Risk | The description is so vague that it could overlap with any ML-related skill. 'Model evaluation metrics' without specifying which metrics, what actions, or what context makes it indistinguishable from other ML training or evaluation skills. | 1 / 3 |
| Total | | 4 / 12 Passed |
Implementation: 0%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is an empty shell with no actual content. It consists entirely of auto-generated boilerplate that describes what the skill would do without providing any actionable information about model evaluation metrics. There is no code, no specific metrics discussed, no formulas, no examples, and no workflows.
Suggestions
- Add concrete, executable code examples for computing common evaluation metrics (accuracy, precision, recall, F1, AUC-ROC) using sklearn or PyTorch.
- Include a workflow for model evaluation: train/val/test split → compute metrics → interpret results → iterate, with specific validation checkpoints (a minimal sketch of this flow follows this list).
- Remove all meta-description sections ('Purpose', 'When to Use', 'Example Triggers') and replace them with actual technical content such as metric selection guidance, code snippets, and common pitfalls.
- Add a quick-reference table mapping task types (classification, regression, ranking) to appropriate metrics, with one-liner code examples for each (a sketch follows the scoring table below).
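A minimal sketch of the evaluation workflow suggested above, assuming a scikit-learn classifier and a synthetic dataset; the hyperparameter grid and all names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# 1. Split: hold out a test set, then carve a validation set out of the remainder
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# 2-3. Compute metrics on the validation set, interpret, and iterate over candidate settings
best_c, best_f1 = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    val_f1 = f1_score(y_val, model.predict(X_val))
    if val_f1 > best_f1:  # validation checkpoint: keep the best-performing configuration
        best_c, best_f1 = c, val_f1

# 4. Final checkpoint: evaluate the chosen configuration once on the untouched test set
final = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
print("validation F1:", round(best_f1, 3))
print("test F1      :", round(f1_score(y_test, final.predict(X_test)), 3))
```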
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is entirely filler and meta-description. It explains what the skill does in abstract terms without providing any actual knowledge or instructions. Every section restates the same vague concept ('model evaluation metrics') without adding substance. | 1 / 3 |
| Actionability | There is zero concrete guidance: no code, no commands, no specific metrics, no formulas, no examples of computing precision/recall/F1/AUC or any other evaluation metric. It describes rather than instructs. | 1 / 3 |
| Workflow Clarity | No workflow or steps are defined. The 'step-by-step guidance' is merely claimed in a bullet point but never actually provided. There are no sequences, validation checkpoints, or processes of any kind. | 1 / 3 |
| Progressive Disclosure | The content has section headers, but they contain no meaningful information, just repeated boilerplate. There are no references to detailed files, no navigation structure, and no actual content to disclose progressively. | 1 / 3 |
| Total | | 4 / 12 Passed |
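As an illustration of the quick-reference mapping suggested above, one possible sketch is a plain Python dictionary of scikit-learn one-liners; the metric choices and names are assumptions, not taken from the skill:

```python
from sklearn import metrics

# Task type -> list of (metric name, callable). Callers supply the appropriate inputs:
# labels/predictions for classification and regression, 2-D relevance/score arrays for ranking.
METRIC_QUICK_REFERENCE = {
    "classification": [
        ("f1_macro", lambda y_true, y_pred: metrics.f1_score(y_true, y_pred, average="macro")),
        ("roc_auc",  lambda y_true, y_prob: metrics.roc_auc_score(y_true, y_prob)),
    ],
    "regression": [
        ("mae",  metrics.mean_absolute_error),
        ("rmse", lambda y_true, y_pred: metrics.mean_squared_error(y_true, y_pred) ** 0.5),
        ("r2",   metrics.r2_score),
    ],
    "ranking": [
        ("ndcg@10", lambda relevance, scores: metrics.ndcg_score(relevance, scores, k=10)),
    ],
}
```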
Validation: 81%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
9 of 11 validation checks passed.
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 9 / 11 Passed |