This skill allows AI assistant to evaluate machine learning models using a comprehensive suite of metrics. it should be used when the user requests model performance analysis, validation, or testing. AI assistant can use this skill to assess model accuracy, p... Use when appropriate context detected. Trigger with relevant phrases based on skill purpose.
Install with Tessl CLI
npx tessl i github:jeremylongshore/claude-code-plugins-plus-skills --skill evaluating-machine-learning-models27
Does it follow best practices?
If you maintain this skill, you can automatically optimize it using the tessl CLI to improve its score:
npx tessl skill review --optimize ./path/to/skill
Discovery — 17%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description suffers from critical issues: it appears truncated mid-sentence, and the trigger guidance is entirely placeholder boilerplate text that provides no actual value. While it identifies the ML model evaluation domain, the lack of specific metrics, concrete actions, and meaningful trigger terms makes it nearly unusable for skill selection.
Suggestions
- Replace the placeholder trigger text ('Use when appropriate context detected...') with specific trigger phrases like 'Use when user asks about model accuracy, precision, recall, F1 score, confusion matrix, ROC curves, or model validation'
- Complete the truncated description and list specific metrics/actions: 'Calculate accuracy, precision, recall, F1 score, AUC-ROC, confusion matrices, and cross-validation scores'
- Add natural user phrases as triggers: 'evaluate my model', 'check model performance', 'test accuracy', 'validate predictions', 'compare models'
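Taken together, the suggestions above point toward a frontmatter description like the following sketch. The field names and schema here are assumptions based on common skill-frontmatter conventions, not the skill's actual file:

```yaml
# Hypothetical revision of the skill's description frontmatter —
# field names and schema are assumed, not taken from the actual skill.
name: evaluating-machine-learning-models
description: >
  Evaluate machine learning models by calculating accuracy, precision,
  recall, F1 score, AUC-ROC, confusion matrices, and cross-validation
  scores. Use when the user asks to "evaluate my model", "check model
  performance", "test accuracy", "validate predictions", or
  "compare models".
```

A description in this shape names concrete metrics and natural user phrases, which addresses both the specificity and trigger-term findings below.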
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (machine learning model evaluation) and mentions some actions like 'assess model accuracy', but the description is truncated ('p...') and uses vague phrases like 'comprehensive suite of metrics' without listing specific metrics or concrete actions. | 2 / 3 |
| Completeness | The 'what' is partially addressed but truncated. The 'when' section is pure placeholder text ('Use when appropriate context detected. Trigger with relevant phrases based on skill purpose') that provides no actual guidance on when to use this skill. | 1 / 3 |
| Trigger Term Quality | The 'Use when' clause is completely generic boilerplate ('appropriate context detected', 'relevant phrases based on skill purpose') with no actual trigger terms. While 'model performance analysis, validation, or testing' appear earlier, the trigger section provides zero natural keywords users would say. | 1 / 3 |
| Distinctiveness / Conflict Risk | The ML model evaluation domain is somewhat specific, but the generic trigger language and incomplete description could cause overlap with other data analysis or ML-related skills. Terms like 'model performance' could conflict with other analytics skills. | 2 / 3 |
| Total | | 6 / 12 — Passed |
Implementation — 7%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is a template-like document filled with generic boilerplate that provides no actionable guidance. It describes what the skill does conceptually but never shows how to actually use it - no real code, no actual command syntax, no concrete examples. The content wastes tokens explaining obvious concepts while failing to deliver the specific, executable instructions Claude needs.
Suggestions
- Replace abstract descriptions with actual executable code showing how to invoke the model evaluation suite with specific parameters and expected output format
- Remove generic sections like 'Prerequisites', 'Instructions', 'Error Handling' that contain only placeholder text with no real content
- Provide concrete command syntax for '/eval-model' including all parameters, input formats, and example output JSON/data structures
- Cut the 'Overview', 'How It Works', and 'When to Use' sections entirely — Claude doesn't need explanations of when to evaluate models
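To illustrate what "actual executable code" could mean here, the following is a hypothetical, self-contained example of the kind of concrete snippet the skill could embed in place of its abstract descriptions. It is not taken from the skill; it simply computes standard binary-classification metrics from two label lists using only the standard library:

```python
# Hypothetical example of a concrete, executable metric computation —
# the kind of content the skill could ship instead of placeholder prose.

def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = evaluate([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(metrics)  # precision, recall, and f1 are all 0.75 on this toy data
```

A skill instruction containing something like this, plus the exact `/eval-model` syntax and its expected JSON output, would give Claude copy-paste ready steps rather than descriptions of what 'the skill will do'.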
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose with extensive padding and explanations of obvious concepts. Sections like 'How It Works', 'When to Use', and generic boilerplate ('This skill provides automated assistance...') explain things Claude already knows and waste tokens. | 1 / 3 |
| Actionability | No executable code, no concrete commands, no actual implementation details. References '/eval-model' command but never shows syntax, parameters, or real usage. Examples describe what 'the skill will do' abstractly rather than providing copy-paste ready instructions. | 1 / 3 |
| Workflow Clarity | Steps are vague and abstract ('Invoke the /eval-model command', 'Analyze the model's performance'). No validation checkpoints, no error recovery, no concrete sequence of actual commands or code to execute. | 1 / 3 |
| Progressive Disclosure | Content is organized into sections with headers, but it's a monolithic document with no references to external files. The structure exists but contains too much inline content that could be split or removed entirely. | 2 / 3 |
| Total | | 5 / 12 — Passed |
Validation — 81%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 13 / 16 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| allowed_tools_field | 'allowed-tools' contains unusual tool name(s) | Warning |
| metadata_version | 'metadata' field is not a dictionary | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 13 / 16 Passed |
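The three warnings above suggest frontmatter fixes along these lines. This is a sketch only: the key names and values are assumptions inferred from the warning messages, not the skill's actual file:

```yaml
# Hypothetical frontmatter addressing the three validation warnings —
# keys and values are assumed, not taken from the skill's actual file.
allowed-tools: Read, Bash   # standard tool names, no unusual entries
metadata:                   # a dictionary, not a bare value
  version: "1.0.0"
# any unknown top-level keys moved under `metadata` or removed
```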
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.