Skill description under review:

> Build this skill allows AI assistant to evaluate machine learning models using a comprehensive suite of metrics. it should be used when the user requests model performance analysis, validation, or testing. AI assistant can use this skill to assess model accuracy, p... Use when appropriate context detected. Trigger with relevant phrases based on skill purpose.
Impact: Pending (no eval scenarios have been run)
Status: Passed (no known issues)
Optimize this skill with Tessl:
`npx tessl skill review --optimize ./plugins/ai-ml/model-evaluation-suite/skills/evaluating-machine-learning-models/SKILL.md`

Quality: 3%

Does it follow best practices?
Discovery: 7%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is poorly constructed, with multiple issues: it begins with 'Build this skill', which is a meta-instruction rather than a description; it uses first/second-person framing ('AI assistant can use this skill'); it is truncated mid-sentence; and it ends with meaningless boilerplate trigger guidance. The core topic (ML model evaluation) is identifiable, but the description is too thin to support reliable skill selection.
Suggestions
- Replace the generic boilerplate 'Use when appropriate context detected' with specific trigger phrases like 'Use when the user asks to evaluate, validate, or benchmark a machine learning model, check accuracy, precision, recall, F1 score, AUC-ROC, or a confusion matrix.'
- List concrete actions the skill performs, e.g., 'Computes classification metrics (accuracy, precision, recall, F1), generates confusion matrices, plots ROC curves, performs cross-validation, and compares model performance.'
- Remove the meta-instruction 'Build this skill allows AI assistant to' and rewrite in the third person, active voice, e.g., 'Evaluates machine learning models using classification and regression metrics.' A sketch of an assembled description follows this list.
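Putting these suggestions together, a rewritten description might look like the sketch below; the wording is assembled from the suggestions above for illustration, not a verified fix:

```yaml
description: >
  Evaluates machine learning models by computing classification metrics
  (accuracy, precision, recall, F1), generating confusion matrices, plotting
  ROC curves, and running cross-validation. Use when the user asks to
  evaluate, validate, or benchmark a model, or mentions accuracy, precision,
  recall, F1 score, AUC-ROC, or a confusion matrix.
```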
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description starts with 'Build this skill allows AI assistant to evaluate machine learning models', which is vague and poorly written. It mentions a 'comprehensive suite of metrics' without listing any specific metrics, and the text is truncated ('accuracy, p...'). No concrete actions are clearly enumerated. | 1 / 3 |
| Completeness | The 'what' is vaguely stated (evaluate ML models) but lacks specifics due to truncation. The 'when' clause is entirely generic boilerplate ('Use when appropriate context detected') rather than an explicit, actionable trigger. It fails to meaningfully answer either question. | 1 / 3 |
| Trigger Term Quality | The description includes some relevant terms ('model performance analysis', 'validation', 'testing', 'accuracy'), but the filler 'Use when appropriate context detected. Trigger with relevant phrases based on skill purpose.' adds no real trigger terms, and the truncation cuts off potentially useful keywords. | 1 / 3 |
| Distinctiveness / Conflict Risk | The ML model evaluation domain is fairly specific and would not overlap with most non-ML skills, but the vague language ('comprehensive suite of metrics') and generic triggers could cause confusion with other data science or analytics skills. | 2 / 3 |
| Total | | 5 / 12 (Passed) |
Implementation: 0%

Reviews the quality of the instructions and guidance provided to agents. A good implementation is clear, handles edge cases, and produces reliable results.
This skill is almost entirely boilerplate with no actionable content. It describes what model evaluation is and when to use it in abstract terms, but never provides concrete code, specific commands with parameters, actual metric-calculation examples, or real implementation details. The `/eval-model` command and `model-evaluation-suite` plugin are referenced but never documented with usage syntax, expected inputs/outputs, or configuration options.
Suggestions
- Replace abstract descriptions with executable code examples showing actual metric calculations (e.g., using sklearn.metrics for accuracy, precision, recall, and F1 score with concrete inputs and outputs); a sketch follows this list.
- Document the `/eval-model` command with its actual syntax, required parameters, expected input format, and example output so Claude can use it directly.
- Remove the generic boilerplate sections (Error Handling, Output, Resources, Instructions) that contain no skill-specific information, and the 'Overview' section that explains concepts Claude already knows.
- Add concrete workflow steps with validation checkpoints, such as verifying the data format before evaluation, checking for class imbalance, and validating metric thresholds before reporting results.
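To make the first and last suggestions concrete, here is a minimal sketch of the kind of executable example the skill could include, assuming scikit-learn is available; the function name, imbalance threshold, and sample labels are illustrative assumptions, not drawn from the skill itself:

```python
# Minimal sketch of a concrete, copy-paste-ready metric calculation.
from collections import Counter

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate_classifier(y_true, y_pred, imbalance_threshold=0.8):
    """Compute core classification metrics with a class-imbalance checkpoint."""
    # Validation checkpoint: if one class dominates, plain accuracy is
    # misleading, so flag it before reporting results.
    counts = Counter(y_true)
    majority_share = max(counts.values()) / len(y_true)
    if majority_share > imbalance_threshold:
        print(
            f"Warning: majority class covers {majority_share:.0%} of labels; "
            "prefer precision/recall/F1 over accuracy."
        )

    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }


# Concrete input and output, as the first suggestion asks for:
print(evaluate_classifier([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```

If `/eval-model` wraps logic like this, documenting it (per the second suggestion) could reuse the same concrete inputs and expected outputs.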
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose, with extensive padding. Explains obvious concepts Claude already knows (what model evaluation is, when to use it) and includes generic boilerplate sections ('Error Handling', 'Output', 'Resources', 'Instructions') that add no actionable value. The 'Overview' section explains what the skill 'empowers Claude to do' rather than giving concrete instructions. | 1 / 3 |
| Actionability | No executable code, no concrete commands, no actual implementation details. References an `/eval-model` command and a `model-evaluation-suite` plugin without showing how to use them, what parameters they accept, or what output they produce. Examples describe what 'the skill will do' in abstract terms rather than providing copy-paste-ready code or commands. | 1 / 3 |
| Workflow Clarity | The workflow steps are entirely abstract ('Analyzing Context', 'Executing Evaluation', 'Presenting Results') with no concrete detail about what to actually do: no validation checkpoints, no error-recovery loops, no specific commands or parameters. The 'Instructions' section is generic boilerplate with no task-specific guidance. | 1 / 3 |
| Progressive Disclosure | A monolithic wall of text in which many sections contain little substance and nothing points to external files for detailed content. The 'Resources' section mentions 'Project documentation' and 'Related skills and commands' without any actual links or file paths, and the content is poorly organized, with redundant, overlapping sections (Overview, How It Works, When to Use, Instructions). | 1 / 3 |
| Total | | 4 / 12 (Passed) |
Validation: 81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 9 / 11 checks passed
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing them or moving them to metadata | Warning |
| Total | 9 / 11 checks passed | |
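A sketch of frontmatter that would clear both warnings, assuming the spec recognizes `name`, `description`, `allowed-tools`, and a `metadata` block; the tool names and metadata contents are illustrative assumptions, since the skill's actual frontmatter is not shown here:

```yaml
# Hypothetical SKILL.md frontmatter limited to recognized keys.
name: evaluating-machine-learning-models
description: Evaluates machine learning models using classification and regression metrics.
# Use standard tool names only; unusual names triggered the first warning.
allowed-tools: Read, Bash
# Keys the spec does not know (the second warning) move under metadata.
metadata:
  category: ai-ml
```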