Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Does it follow best practices? Score: 72
If you maintain this skill, you can automatically optimize it using the tessl CLI to improve its score:
npx tessl skill review --optimize ./path/to/skill
Validation for skill structure
Automated metrics pipeline
| Criterion | Without context | With context |
|---|---|---|
| Metric dataclass fields | 50% | 100% |
| Metric static factory methods | 80% | 100% |
| EvaluationSuite async evaluate | 0% | 0% |
| evaluate return structure | 0% | 100% |
| BLEU smoothing function | 0% | 100% |
| ROUGE score variants | 0% | 100% |
| ROUGE stemmer enabled | 100% | 100% |
| BERTScore model | 0% | 100% |
| BERTScore return keys | 0% | 100% |
| Correct library imports | 100% | 100% |
Without context: $0.7491 · 8m 33s · 23 turns · 177 in / 13,307 out tokens
With context: $0.9183 · 12m 41s · 23 turns · 758 in / 11,626 out tokens
LLM-as-Judge evaluation
| Criterion | Without context | With context |
|---|---|---|
| Anthropic client | 100% | 100% |
| Pydantic quality model | 100% | 100% |
| Pydantic pairwise model | 100% | 100% |
| claude-sonnet-4-6 model | 0% | 100% |
| Quality model fields | 0% | 100% |
| Field range validation | 0% | 100% |
| Winner Literal type | 100% | 100% |
| Pairwise confidence field | 100% | 100% |
| Position bias handling | 100% | 100% |
| Async functions | 0% | 0% |
Without context: $1.2910 · 12m 48s · 33 turns · 259 in / 15,077 out tokens
With context: $0.4586 · 4m 51s · 16 turns · 959 in / 5,740 out tokens
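The "Winner Literal type" and "Position bias handling" criteria above point at a common pattern for pairwise LLM-as-judge comparisons: ask the judge twice with the two responses in swapped order, and trust the verdict only when both runs agree. The following is a minimal sketch of that idea, not the skill's actual code. The `Winner` values, the `debiased_compare` name, and the `judge` callable signature are all assumptions made for illustration; a real judge would call a model rather than a local function.

```python
from typing import Callable, Literal

# Assumed label set; the skill's actual Literal values are not shown on this page.
Winner = Literal["A", "B", "tie"]

# Maps a verdict given in swapped order back to the original labels.
_FLIP = {"A": "B", "B": "A", "tie": "tie"}


def debiased_compare(
    judge: Callable[[str, str], Winner],
    response_a: str,
    response_b: str,
) -> Winner:
    """Run a (hypothetical) judge in both presentation orders.

    The judge returns "A" if it prefers its first argument, "B" if it
    prefers its second, or "tie". A verdict is kept only when both
    orderings agree; otherwise the comparison is reported as a tie.
    """
    first = judge(response_a, response_b)          # A presented first
    second = _FLIP[judge(response_b, response_a)]  # B presented first, relabelled
    return first if first == second else "tie"
```

A judge that always prefers whichever response it sees first would return conflicting verdicts across the two orderings, so this wrapper reports a tie instead of rewarding position.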
A/B testing and regression detection
| Criterion | Without context | With context |
|---|---|---|
| scipy t-test | 100% | 100% |
| Cohen's d formula | 0% | 100% |
| analyze return keys | 20% | 100% |
| Cohen's d interpretation | 100% | 100% |
| Regression threshold default | 100% | 100% |
| Regression detection logic | 37% | 100% |
| Cohen's kappa import | 0% | 100% |
| Kappa interpretation bands | 100% | 100% |
| ABTest add_result method | 0% | 100% |
| report.json output | 100% | 100% |
Without context: $0.5866 · 10m 2s · 20 turns · 153 in / 8,708 out tokens
With context: $0.7240 · 6m 20s · 29 turns · 482 in / 7,623 out tokens
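For readers unfamiliar with the "Cohen's d formula" and "Cohen's d interpretation" criteria above, the sketch below shows the standard pooled-standard-deviation form of Cohen's d and the conventional 0.2 / 0.5 / 0.8 cut-offs. It is an illustration under stated assumptions, not the skill's implementation: the function names and the "negligible" label are hypothetical, and the significance test itself (for example `scipy.stats.ttest_ind`) is left out so that the example needs only the standard library.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d for two independent samples (mean of b minus mean of a),
    scaled by the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    if pooled_sd == 0:
        return 0.0  # no spread in either sample; effect size is undefined, report 0
    return (statistics.mean(b) - statistics.mean(a)) / pooled_sd


def interpret_effect(d: float) -> str:
    """Map |d| to Cohen's conventional small / medium / large labels."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


if __name__ == "__main__":
    # Hypothetical per-example scores for two variants of an LLM application.
    variant_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    variant_b = [2.0, 3.0, 4.0, 5.0, 6.0]
    d = cohens_d(variant_a, variant_b)
    print(f"d = {d:.3f} ({interpret_effect(d)})")
```

Reporting the effect size alongside the p-value helps separate a statistically significant but tiny difference from one large enough to matter in practice.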