This skill allows Claude to evaluate machine learning models using a comprehensive suite of metrics. It should be used when the user requests model performance analysis, validation, or testing. Claude can use this skill to assess model accuracy, precision, recall, F1-score, and other relevant metrics. Trigger this skill when the user mentions "evaluate model", "model performance", "testing metrics", "validation results", or requests a comprehensive "model evaluation".
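As a minimal sketch of the core metrics named above (accuracy, precision, recall, F1), assuming scikit-learn is available; the label arrays are made up for illustration, not output from any real model:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth and predicted labels for a 3-class problem.
y_true = [0, 1, 2, 2, 1, 0, 1, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
# Macro-averaging weights every class equally, regardless of support.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Swapping `average="macro"` for `"weighted"` would instead weight each class by its frequency in `y_true`.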
Skill score: 90
Does it follow best practices? 44%
Impact: 99% (1.01x average score across 9 eval scenarios)
Status: Passed, no known issues
Metrics specification and result interpretation
| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Multiple metrics selected | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| KPIs highlighted | 70% | 70% |
| Improvement areas identified | 100% | 100% |
| Result interpretation present | 100% | 100% |
| Context-grounded interpretation | 100% | 100% |
| Structured workflow evidence | 100% | 100% |
| Per-class breakdown | 100% | 100% |
| Evaluation code or script | 100% | 100% |
| Output report file | 100% | 100% |
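The per-class breakdown and output-report checks above could be satisfied by something like the following sketch; the labels and the report filename are hypothetical:

```python
from pathlib import Path
from sklearn.metrics import classification_report

# Hypothetical binary test-set labels.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# classification_report gives the per-class precision/recall/F1 breakdown.
report = classification_report(y_true, y_pred, zero_division=0)
Path("evaluation_report.txt").write_text(report)  # the separate report file
print(report)
```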
Multi-model comparison analysis

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Context analysis step | 87% | 100% |
| Metrics specified for comparison | 100% | 100% |
| Side-by-side comparison | 100% | 100% |
| KPI identification | 80% | 100% |
| Improvement area for weaker model | 100% | 100% |
| Result interpretation with context | 100% | 100% |
| Recommendation made | 100% | 100% |
| Evaluation code produced | 100% | 100% |
| Output summary file | 100% | 100% |
| Metric choice justified | 87% | 100% |
| Structured progression | 100% | 100% |
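A minimal sketch of the side-by-side comparison and recommendation steps checked above, using synthetic data and two arbitrary model families; the model names, KPI choice, and dataset are all assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real evaluation set.
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {"accuracy": accuracy_score(y_te, pred),
                     "f1": f1_score(y_te, pred)}

# Side-by-side table, then a recommendation based on the chosen KPI (F1 here).
for name, m in results.items():
    print(f"{name:>8}  acc={m['accuracy']:.3f}  f1={m['f1']:.3f}")
best = max(results, key=lambda n: results[n]["f1"])
print("recommended:", best)
```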
Data validation and deployment readiness

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Data distribution check | 90% | 100% |
| Representativeness assessment | 83% | 100% |
| Deployment-appropriate metrics | 100% | 100% |
| Metrics named explicitly | 100% | 100% |
| KPIs identified | 75% | 100% |
| Improvement areas flagged | 100% | 100% |
| Interpretation with business context | 70% | 90% |
| Deployment recommendation | 100% | 100% |
| Evaluation script produced | 100% | 100% |
| Report or summary file | 100% | 100% |
| Structured workflow | 100% | 83% |
Automated ML evaluation workflow

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Three-stage workflow documented | 100% | 100% |
| Automation mechanism described | 100% | 100% |
| Multiple metrics computed | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Threshold checking implemented | 100% | 100% |
| KPI identified for domain | 100% | 100% |
| Improvement area identified | 100% | 100% |
| Domain-grounded interpretation | 100% | 100% |
| Runnable pipeline script | 100% | 100% |
| Separate report file | 100% | 100% |
| Per-class or breakdown metrics | 66% | 100% |
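The three-stage workflow with threshold checking could be sketched as follows; the iris dataset, the model, and the threshold values are stand-ins, not the skill's actual pipeline:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Stage 1: load data (iris stands in for the real dataset).
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 2: train and compute multiple, explicitly named metrics.
pred = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
metrics = {"accuracy": accuracy_score(y_te, pred),
           "macro_f1": f1_score(y_te, pred, average="macro")}

# Stage 3: gate on hypothetical minimum thresholds for the domain.
thresholds = {"accuracy": 0.90, "macro_f1": 0.85}
passed = all(metrics[k] >= t for k, t in thresholds.items())
print(metrics, "PASS" if passed else "FAIL")
```

In an automated setting, a failing gate would typically exit nonzero so a CI job can block deployment.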
Regression model evaluation

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Regression-appropriate metrics | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| KPI identified with justification | 100% | 100% |
| Per-segment breakdown | 100% | 100% |
| Underperforming segment identified | 100% | 100% |
| Improvement area specified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Launch recommendation made | 100% | 100% |
| Evaluation script produced | 100% | 100% |
| Separate report file created | 100% | 100% |
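A sketch of regression-appropriate metrics with the per-segment breakdown checked above; the segments, noise levels, and value ranges are invented to make one segment visibly underperform:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(1)
segment = np.repeat(["low", "high"], 100)  # hypothetical price segments
y_true = rng.uniform(10, 100, size=200)
y_pred = y_true + rng.normal(0, 5, size=200)
y_pred[segment == "high"] += rng.normal(0, 10, size=100)  # noisier segment

print(f"overall MAE={mean_absolute_error(y_true, y_pred):.2f} "
      f"R2={r2_score(y_true, y_pred):.3f}")

# Per-segment breakdown surfaces the underperforming segment.
segment_mae = {}
for seg in ("low", "high"):
    mask = segment == seg
    segment_mae[seg] = mean_absolute_error(y_true[mask], y_pred[mask])
    print(f"{seg:>5} MAE={segment_mae[seg]:.2f}")
worst = max(segment_mae, key=segment_mae.get)
print("underperforming segment:", worst)
```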
Structured workflow and context analysis

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Metrics appropriate for NLP task | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| KPI identified with clinical justification | 100% | 100% |
| Subgroup breakdown present | 100% | 100% |
| Underperforming subgroup identified | 100% | 100% |
| Improvement area and next step | 100% | 100% |
| Plain-language interpretation | 100% | 100% |
| Executive summary present | 75% | 100% |
| Evaluation script produced | 100% | 100% |
| Three output files created | 100% | 100% |
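The subgroup breakdown checked above can be sketched by slicing metrics per subgroup; the subgroup names and labels are hypothetical, loosely evoking a clinical-notes classifier:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical classifier outputs tagged with a demographic subgroup.
subgroup = np.array(["adult"] * 6 + ["pediatric"] * 6)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1])

scores = {}
for group in np.unique(subgroup):
    mask = subgroup == group
    scores[group] = f1_score(y_true[mask], y_pred[mask])
    print(f"{group:>9}  F1={scores[group]:.2f}")
weakest = min(scores, key=scores.get)
print("underperforming subgroup:", weakest)
```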
Domain-appropriate metrics selection

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Detection-appropriate metrics used | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Per-class breakdown included | 100% | 100% |
| KPI identified with operational justification | 100% | 100% |
| False negatives / missed detections analysed | 100% | 100% |
| Specific weakness identified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Rollout recommendation present | 100% | 100% |
| Evaluation script produced | 100% | 100% |
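The false-negative analysis checked above might look like this sketch; the labels are invented for a detection task where missed defects are the costly error:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical defect-detection labels: 1 = defect present.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = recall_score(y_true, y_pred)  # KPI: missed detections are costly
print(f"TP={tp} FN={fn} recall={recall:.2f}")

# Pinpoint which positives were missed, for the weakness analysis.
missed = np.where((y_true == 1) & (y_pred == 0))[0]
print("missed detections at indices:", missed.tolist())
```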
Unsupervised model evaluation metrics

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Clustering-appropriate metrics used | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Per-cluster breakdown included | 100% | 100% |
| KPI identified with marketing justification | 100% | 100% |
| Anomalous cluster identified | 100% | 100% |
| Specific weakness identified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Campaign readiness recommendation | 100% | 100% |
| Evaluation script produced | 100% | 100% |
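A sketch of clustering-appropriate metrics with a per-cluster breakdown, as checked above; the blob data stands in for real customer features and the cluster count is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic customer-feature blobs stand in for real segmentation input.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)      # higher is better (max 1)
dbi = davies_bouldin_score(X, labels)  # lower is better (min 0)
print(f"silhouette={sil:.3f} davies_bouldin={dbi:.3f}")

# Per-cluster sizes help flag an anomalously small or large cluster.
sizes = np.bincount(labels)
print("cluster sizes:", sizes.tolist())
```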
Multi-model comparison with interpretation

| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Ranking-appropriate metrics used | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Side-by-side comparison present | 100% | 100% |
| KPI identified with search justification | 100% | 100% |
| Weaker model weakness identified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Model recommendation with evidence | 100% | 100% |
| Evaluation script produced | 100% | 100% |
| Separate comparison report file | 100% | 100% |
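Finally, the ranking-metric comparison checked above can be sketched with nDCG; the relevance grades and the two models' scores are invented for one hypothetical query:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical relevance grades for one query, scored by two ranking models.
true_relevance = np.array([[3, 2, 3, 0, 1, 2]])
model_a_scores = np.array([[0.9, 0.8, 0.7, 0.2, 0.1, 0.6]])
model_b_scores = np.array([[0.2, 0.9, 0.3, 0.8, 0.1, 0.7]])

ndcg_a = ndcg_score(true_relevance, model_a_scores, k=5)
ndcg_b = ndcg_score(true_relevance, model_b_scores, k=5)
print(f"model A nDCG@5={ndcg_a:.3f}  model B nDCG@5={ndcg_b:.3f}")
recommended = "A" if ndcg_a >= ndcg_b else "B"
print("recommended model:", recommended)
```

With real search data, this comparison would average nDCG over many queries before making the recommendation.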