
# evaluating-machine-learning-models

This skill lets Claude evaluate machine learning models with a comprehensive suite of metrics. Use it when the user requests model performance analysis, validation, or testing: Claude can assess accuracy, precision, recall, F1-score, and other relevant metrics. Trigger the skill when the user mentions "evaluate model", "model performance", "testing metrics", or "validation results", or requests a comprehensive "model evaluation".
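As a rough illustration of the metrics this skill reports, here is a minimal, dependency-free sketch that computes accuracy, precision, recall, and F1 from two label lists. The toy labels are invented for the example, not output from any evaluation on this page:

```python
# Minimal sketch of accuracy, precision, recall, and F1 for one positive class.
# The label lists below are illustrative toy data.

def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, and F1 against one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(classification_metrics(y_true, y_pred, positive="pos"))
```

Running the same function once per class gives the per-class breakdown several of the eval criteria below check for.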

**Overall score: 90** (1.01x with context)

- Quality: 44% (does it follow best practices?)
- Impact: 99% (1.01x), the average score across 9 eval scenarios

- Security (by Snyk): Passed, no known issues

Optimize this skill with Tessl:

```shell
npx tessl skill review --optimize ./backups/skills-migration-20251108-070147/plugins/ai-ml/model-evaluation-suite/skills/model-evaluation-suite/SKILL.md
```

## Evaluation results

### Sentiment Analysis Model Evaluation Report (score 97%)

*Metrics specification and result interpretation*

| Criteria | Without context | With context |
| --- | --- | --- |
| Multiple metrics selected | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| KPIs highlighted | 70% | 70% |
| Improvement areas identified | 100% | 100% |
| Result interpretation present | 100% | 100% |
| Context-grounded interpretation | 100% | 100% |
| Structured workflow evidence | 100% | 100% |
| Per-class breakdown | 100% | 100% |
| Evaluation code or script | 100% | 100% |
| Output report file | 100% | 100% |

### Fraud Detection Model Selection (score 100%; +4% with context)

*Multi-model comparison analysis*

| Criteria | Without context | With context |
| --- | --- | --- |
| Context analysis step | 87% | 100% |
| Metrics specified for comparison | 100% | 100% |
| Side-by-side comparison | 100% | 100% |
| KPI identification | 80% | 100% |
| Improvement area for weaker model | 100% | 100% |
| Result interpretation with context | 100% | 100% |
| Recommendation made | 100% | 100% |
| Evaluation code produced | 100% | 100% |
| Output summary file | 100% | 100% |
| Metric choice justified | 87% | 100% |
| Structured progression | 100% | 100% |
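The side-by-side comparison and recommendation criteria in this scenario can be sketched as a small loop keyed to one KPI. The candidate model names and metric values below are hypothetical placeholders, not results from the evaluation above:

```python
# Illustrative side-by-side comparison of two candidate models keyed to one KPI.
# Model names and metric values are made-up placeholders.
candidates = {
    "gradient_boosting": {"precision": 0.91, "recall": 0.84, "f1": 0.87},
    "logistic_regression": {"precision": 0.88, "recall": 0.79, "f1": 0.83},
}
kpi = "recall"  # for fraud detection, missed fraud (false negatives) is costliest

for name, metrics in candidates.items():
    row = "  ".join(f"{m}={v:.2f}" for m, v in metrics.items())
    print(f"{name:<22} {row}")

best = max(candidates, key=lambda name: candidates[name][kpi])
print(f"Recommendation: {best} (highest {kpi})")
```

Selecting the KPI up front, then recommending the model that wins on it, mirrors the "KPI identification" and "Recommendation made" criteria.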

### Product Recommendation Model — Production Launch Assessment (score 98%; +6% with context)

*Data validation and deployment readiness*

| Criteria | Without context | With context |
| --- | --- | --- |
| Data distribution check | 90% | 100% |
| Representativeness assessment | 83% | 100% |
| Deployment-appropriate metrics | 100% | 100% |
| Metrics named explicitly | 100% | 100% |
| KPIs identified | 75% | 100% |
| Improvement areas flagged | 100% | 100% |
| Interpretation with business context | 70% | 90% |
| Deployment recommendation | 100% | 100% |
| Evaluation script produced | 100% | 100% |
| Report or summary file | 100% | 100% |
| Structured workflow | 100% | 83% |
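A data distribution check like the one this scenario grades can be sketched as a comparison of class shares between training data and live traffic. The label counts and the 5-point drift threshold below are invented for illustration:

```python
# Sketch of a train-vs-production class distribution check (hypothetical data).
from collections import Counter

def class_distribution(labels):
    """Return each class's share of the label list."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in counts}

train = ["buy"] * 300 + ["skip"] * 700
live = ["buy"] * 150 + ["skip"] * 850

train_dist = class_distribution(train)
live_dist = class_distribution(live)
# Flag any class whose share drifts by more than 5 percentage points.
drifted = {c: abs(train_dist[c] - live_dist.get(c, 0.0))
           for c in train_dist
           if abs(train_dist[c] - live_dist.get(c, 0.0)) > 0.05}
print("drifted classes:", drifted)
```

A non-empty `drifted` dict is a signal that the eval set is no longer representative, which feeds the deployment recommendation.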

### Monthly Model Health Monitor (score 100%; +2% with context)

*Automated ML evaluation workflow*

| Criteria | Without context | With context |
| --- | --- | --- |
| Three-stage workflow documented | 100% | 100% |
| Automation mechanism described | 100% | 100% |
| Multiple metrics computed | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Threshold checking implemented | 100% | 100% |
| KPI identified for domain | 100% | 100% |
| Improvement area identified | 100% | 100% |
| Domain-grounded interpretation | 100% | 100% |
| Runnable pipeline script | 100% | 100% |
| Separate report file | 100% | 100% |
| Per-class or breakdown metrics | 66% | 100% |
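The "threshold checking implemented" criterion can be sketched in a few lines: compare each metric from the latest run against a floor and emit alerts. The metric values and thresholds below are illustrative assumptions:

```python
# Sketch of the threshold-checking step in an automated monthly health check.
# Metric values and thresholds are illustrative assumptions.
thresholds = {"accuracy": 0.90, "f1": 0.85}
latest = {"accuracy": 0.93, "f1": 0.82}

alerts = [f"{m} {latest[m]:.2f} below threshold {t:.2f}"
          for m, t in thresholds.items() if latest[m] < t]
status = "ALERT" if alerts else "HEALTHY"
print(status)
for a in alerts:
    print(" -", a)
```

In a real monitor this check would run on a schedule and write the status into the separate report file the scenario asks for.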

### Property Valuation Model Assessment (score 100%)

*Regression model evaluation*

| Criteria | Without context | With context |
| --- | --- | --- |
| Regression-appropriate metrics | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| KPI identified with justification | 100% | 100% |
| Per-segment breakdown | 100% | 100% |
| Underperforming segment identified | 100% | 100% |
| Improvement area specified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Launch recommendation made | 100% | 100% |
| Evaluation script produced | 100% | 100% |
| Separate report file created | 100% | 100% |
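The regression-appropriate metrics and per-segment breakdown this scenario grades can be sketched with MAE, RMSE, and R² computed per segment. The valuations below are made-up examples:

```python
# Sketch of regression metrics (MAE, RMSE, R^2) with a per-segment breakdown.
# Property valuations below are made-up toy numbers.
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R^2 for one set of predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

segments = {
    "urban": ([300, 420, 510], [310, 400, 520]),
    "rural": ([120, 180, 240], [150, 150, 300]),
}
for name, (actual, predicted) in segments.items():
    print(name, regression_metrics(actual, predicted))
```

Comparing the per-segment numbers is what surfaces the underperforming segment the criteria ask about.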

### NLP Model Audit for Regulatory Compliance (score 100%; +2% with context)

*Structured workflow and context analysis*

| Criteria | Without context | With context |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Metrics appropriate for NLP task | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| KPI identified with clinical justification | 100% | 100% |
| Subgroup breakdown present | 100% | 100% |
| Underperforming subgroup identified | 100% | 100% |
| Improvement area and next step | 100% | 100% |
| Plain-language interpretation | 100% | 100% |
| Executive summary present | 75% | 100% |
| Evaluation script produced | 100% | 100% |
| Three output files created | 100% | 100% |

### Warehouse Inventory Counting Model Assessment (score 100%)

*Domain-appropriate metrics selection*

| Criteria | Without context | With context |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Detection-appropriate metrics used | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Per-class breakdown included | 100% | 100% |
| KPI identified with operational justification | 100% | 100% |
| False negatives / missed detections analysed | 100% | 100% |
| Specific weakness identified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Rollout recommendation present | 100% | 100% |
| Evaluation script produced | 100% | 100% |

### Customer Segmentation Model Review (score 100%)

*Unsupervised model evaluation metrics*

| Criteria | Without context | With context |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Clustering-appropriate metrics used | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Per-cluster breakdown included | 100% | 100% |
| KPI identified with marketing justification | 100% | 100% |
| Anomalous cluster identified | 100% | 100% |
| Specific weakness identified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Campaign readiness recommendation | 100% | 100% |
| Evaluation script produced | 100% | 100% |
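A clustering-appropriate metric like the ones this scenario grades can be sketched with a mean silhouette score, computed from scratch here on tiny one-dimensional toy data (real evaluations would use a library implementation on real feature vectors):

```python
# Sketch of a clustering metric: mean silhouette score on 1-D toy data.
# Assumes at least two clusters; points and labels are illustrative.
def silhouette(points, labels):
    def dist(a, b):
        return abs(a - b)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the rest of this point's own cluster
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        a = sum(own) / len(own) if own else 0.0
        # b: mean distance to the nearest other cluster
        other_means = []
        for other in set(labels) - {lab}:
            ds = [dist(p, q) for q, l in zip(points, labels) if l == other]
            other_means.append(sum(ds) / len(ds))
        b = min(other_means)
        scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
labels = [0, 0, 0, 1, 1, 1]
print(round(silhouette(points, labels), 3))  # near 1.0 means well-separated clusters
```

Computing the per-point scores grouped by cluster gives the per-cluster breakdown, and a cluster with low or negative silhouette is a natural "anomalous cluster" candidate.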

### Search Relevance Model Comparison (score 100%)

*Multi-model comparison with interpretation*

| Criteria | Without context | With context |
| --- | --- | --- |
| Pre-computation plan written | 100% | 100% |
| Ranking-appropriate metrics used | 100% | 100% |
| Metrics explicitly named | 100% | 100% |
| Side-by-side comparison present | 100% | 100% |
| KPI identified with search justification | 100% | 100% |
| Weaker model weakness identified | 100% | 100% |
| Business-context interpretation | 100% | 100% |
| Model recommendation with evidence | 100% | 100% |
| Evaluation script produced | 100% | 100% |
| Separate comparison report file | 100% | 100% |
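A ranking-appropriate metric for this kind of comparison can be sketched with NDCG@k. The two models and their graded relevance lists below are toy values invented for the example:

```python
# Sketch of NDCG@k for comparing two search models on one query.
# Relevance grades (0-3) below are toy values.
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked relevance list."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances, k):
    """DCG of the top-k results, normalised by the ideal ordering's DCG."""
    ideal = sorted(relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevances[:k]) / denom if denom else 0.0

# Graded relevance of the top results each model returned for one query.
model_a = [3, 2, 3, 0, 1]
model_b = [1, 3, 2, 3, 0]
print("model_a NDCG@5:", round(ndcg(model_a, 5), 3))
print("model_b NDCG@5:", round(ndcg(model_b, 5), 3))
```

Averaging NDCG@k across a query set, then presenting both models' averages side by side, is one way to satisfy the comparison and recommendation criteria above.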

**Evaluated**

- Repository: jeremylongshore/claude-code-plugins-plus-skills
- Agent: Claude Code
- Model: Claude Sonnet 4.6
