Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Eval summary:

- Does it follow best practices? 41%
- Impact: 1.79x
- Average score across 3 eval scenarios: 95%
- Status: Passed, no known issues
Optimize this skill with Tessl: `npx tessl skill review --optimize ./plugins/llm-application-dev/skills/llm-evaluation/SKILL.md`

### LLM-as-Judge evaluator
| Criterion | Score | Optimized |
| --- | --- | --- |
| Anthropic client used | 100% | 100% |
| Correct judge model | 100% | 100% |
| max_tokens=500 | 0% | 100% |
| Pydantic output model | 0% | 100% |
| Field validators on scores | 0% | 100% |
| JSON parse via json.loads | 70% | 100% |
| Reasoning field present | 100% | 100% |
| Pairwise comparison model | 100% | 100% |
| Winner uses Literal type | 70% | 100% |
| Confidence field in pairwise | 20% | 100% |
| JSON prompt format | 100% | 100% |
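The criteria above describe an LLM-as-judge evaluator built on the Anthropic client with a Pydantic output model, score validators, a pairwise comparison model whose winner is a `Literal` type, and JSON parsing via `json.loads`. A minimal sketch of that shape — the judge model name and prompt wording are placeholders, not the skill's actual values:

```python
# Hypothetical sketch of the LLM-as-judge pattern the criteria describe.
import json
from typing import Literal

from pydantic import BaseModel, Field, field_validator


class JudgeScore(BaseModel):
    """Structured verdict for a single response."""
    score: float = Field(description="Quality score from 1 to 5")
    reasoning: str = Field(description="Why the judge gave this score")

    @field_validator("score")
    @classmethod
    def score_in_range(cls, v: float) -> float:
        if not 1 <= v <= 5:
            raise ValueError("score must be between 1 and 5")
        return v


class PairwiseVerdict(BaseModel):
    """Structured verdict comparing response A against response B."""
    winner: Literal["A", "B", "tie"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


def parse_verdict(raw: str) -> PairwiseVerdict:
    """Parse the judge's JSON reply into a validated model."""
    return PairwiseVerdict(**json.loads(raw))


def judge_pairwise(client, question: str, a: str, b: str) -> PairwiseVerdict:
    # `client` is assumed to be anthropic.Anthropic(); the model name below
    # is a placeholder for whichever judge model the skill configures.
    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nResponse A: {a}\n\nResponse B: {b}\n\n"
                'Reply with JSON only: {"winner": "A"|"B"|"tie", '
                '"confidence": 0.0-1.0, "reasoning": "..."}'
            ),
        }],
    )
    return parse_verdict(message.content[0].text)
```

Keeping the JSON-prompt contract and the Pydantic model in sync means a malformed judge reply fails loudly at parse time rather than silently skewing scores.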
### Automated text metrics suite

| Criterion | Score | Optimized |
| --- | --- | --- |
| EvaluationSuite pattern | 37% | 100% |
| Aggregation with np.mean | 0% | 100% |
| Returns raw_scores | 71% | 100% |
| BLEU uses SmoothingFunction().method4 | 0% | 100% |
| BLEU from nltk | 0% | 100% |
| ROUGE variant selection | 100% | 100% |
| ROUGE use_stemmer=True | 100% | 100% |
| ROUGE from rouge_score | 100% | 100% |
| BERTScore model type | 0% | 0% |
| BERTScore from bert_score | 0% | 100% |
| ROUGE returns fmeasure | 100% | 100% |
| BERTScore returns P, R, F1 | 0% | 100% |
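An illustrative sketch of the `EvaluationSuite` pattern these criteria check: BLEU from nltk with `SmoothingFunction().method4`, ROUGE-L from rouge_score with `use_stemmer=True` returning `fmeasure`, BERTScore from bert_score returning P, R, F1, and aggregation with `np.mean` that also returns `raw_scores`. The metric backends (nltk, rouge_score, bert_score) are assumed installed; the class and method names are illustrative, not the skill's exact API:

```python
# Illustrative EvaluationSuite sketch; metric libraries imported lazily so the
# aggregation path works without the heavyweight backends installed.
import numpy as np


class EvaluationSuite:
    """Runs reference-based text metrics and aggregates per-example scores."""

    def bleu(self, reference: str, candidate: str) -> float:
        from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
        # method4 smoothing avoids zero BLEU on short candidates
        return sentence_bleu(
            [reference.split()], candidate.split(),
            smoothing_function=SmoothingFunction().method4,
        )

    def rouge_l(self, reference: str, candidate: str) -> float:
        from rouge_score import rouge_scorer
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        return scorer.score(reference, candidate)["rougeL"].fmeasure

    def bertscore_f1(self, references: list, candidates: list) -> float:
        from bert_score import score
        # score() returns precision, recall, F1 tensors
        P, R, F1 = score(candidates, references, lang="en")
        return F1.mean().item()

    def aggregate(self, scores: list) -> dict:
        """Aggregate with np.mean while keeping the raw per-example scores."""
        return {"mean": float(np.mean(scores)), "raw_scores": list(scores)}
```

Returning `raw_scores` alongside the mean is what lets the A/B and regression tooling below run significance tests instead of comparing two opaque averages.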
### A/B testing and regression detection

| Criterion | Score | Optimized |
| --- | --- | --- |
| scipy ttest_ind used | 100% | 100% |
| alpha=0.05 | 100% | 100% |
| Cohen's d formula | 80% | 100% |
| Cohen's d interpretation thresholds | 100% | 100% |
| relative_improvement field | 0% | 100% |
| winner field | 100% | 100% |
| statistically_significant field | 42% | 100% |
| RegressionDetector class | 62% | 100% |
| Regression threshold=0.05 | 0% | 100% |
| Relative change formula | 0% | 100% |
| has_regression boolean | 42% | 100% |
| numpy for score arrays | 100% | 100% |
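A sketch of the A/B comparison and regression checks these criteria describe: `scipy.stats.ttest_ind` at `alpha=0.05`, Cohen's d with pooled standard deviation and the conventional 0.2/0.5/0.8 interpretation thresholds, `winner`/`relative_improvement`/`statistically_significant` fields, and a `RegressionDetector` with a 0.05 relative-change threshold. Function and field names follow the criteria; the surrounding structure is an assumption:

```python
# Hypothetical A/B comparison and regression detection under the criteria above.
import numpy as np
from scipy import stats


def compare_variants(scores_a, scores_b, alpha=0.05):
    """Two-sample t-test plus Cohen's d between two score arrays."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    t_stat, p_value = stats.ttest_ind(a, b)
    # Cohen's d with pooled standard deviation
    pooled_std = np.sqrt(
        ((len(a) - 1) * a.std(ddof=1) ** 2 + (len(b) - 1) * b.std(ddof=1) ** 2)
        / (len(a) + len(b) - 2)
    )
    d = (b.mean() - a.mean()) / pooled_std if pooled_std else 0.0
    magnitude = (  # conventional interpretation thresholds
        "large" if abs(d) >= 0.8 else
        "medium" if abs(d) >= 0.5 else
        "small" if abs(d) >= 0.2 else
        "negligible"
    )
    return {
        "winner": "B" if b.mean() > a.mean() else "A",
        "relative_improvement": (b.mean() - a.mean()) / a.mean(),
        "statistically_significant": bool(p_value < alpha),
        "p_value": float(p_value),
        "effect_size": float(d),
        "effect_magnitude": magnitude,
    }


class RegressionDetector:
    """Flags a regression when mean score drops more than `threshold` relatively."""

    def __init__(self, threshold=0.05):
        self.threshold = threshold

    def check(self, baseline_scores, current_scores):
        baseline = float(np.mean(baseline_scores))
        current = float(np.mean(current_scores))
        # relative change formula: (current - baseline) / baseline
        relative_change = (current - baseline) / baseline
        return {
            "relative_change": relative_change,
            "has_regression": relative_change < -self.threshold,
        }
```

Pairing the significance test with an effect size matters: on large eval sets a tiny, practically meaningless score difference can still clear `alpha=0.05`, and Cohen's d is what separates "detectable" from "worth shipping".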