Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Eval summary: 68 · Does it follow best practices? 54% · Impact: 91% · Average score across 3 eval scenarios: 1.75x · Passed · No known issues
Optimize this skill with Tessl:
npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md

LLM-as-Judge patterns
| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Anthropic client import | 100% | 100% |
| Correct model name | 0% | 0% |
| Pydantic BaseModel used | 100% | 100% |
| Field constraints on ratings | 0% | 100% |
| Quality rating fields | 100% | 100% |
| Pairwise winner field | 50% | 100% |
| Pairwise confidence field | 0% | 100% |
| Reference eval float scores | 0% | 100% |
| Reference eval issues list | 0% | 100% |
| Three judge functions present | 100% | 100% |
| Results saved to JSON | 100% | 100% |
| No hardcoded API key | 100% | 100% |
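These checks target a recognizable judge pattern: an Anthropic client with no hardcoded key, Pydantic response models with constrained rating fields, pairwise winner/confidence fields, and results written to JSON. Below is a minimal sketch of that shape, assuming the `anthropic` and `pydantic` packages; the class names, prompt, and model string are illustrative and not the skill's exact code.

```python
# Minimal sketch of the judge-output schema these checks look for (names illustrative).
import json
import os
from anthropic import Anthropic
from pydantic import BaseModel, Field

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])  # key from env, never hardcoded

class QualityRating(BaseModel):
    # Rating fields with explicit bounds ("field constraints on ratings").
    accuracy: int = Field(ge=1, le=5)
    relevance: int = Field(ge=1, le=5)
    coherence: int = Field(ge=1, le=5)

class PairwiseVerdict(BaseModel):
    # Pairwise judge output: which response won, and how confident the judge is.
    winner: str = Field(description="'A' or 'B'")
    confidence: float = Field(ge=0.0, le=1.0)

def judge_quality(question: str, answer: str) -> QualityRating:
    """Single-response quality judge: asks the model for JSON matching QualityRating."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name; substitute the model you target
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Rate the answer on accuracy, relevance, and coherence (1-5).\n"
                f"Question: {question}\nAnswer: {answer}\n"
                "Respond with a JSON object whose only keys are accuracy, relevance, coherence."
            ),
        }],
    )
    return QualityRating.model_validate_json(response.content[0].text)

# Persist results to JSON, as the "Results saved to JSON" check expects.
rating = judge_quality("What is BLEU?", "BLEU is an n-gram overlap metric for generated text.")
with open("judge_results.json", "w") as f:
    json.dump(rating.model_dump(), f, indent=2)
```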
Automated metrics suite
| Check | Score 1 | Score 2 |
| --- | --- | --- |
| BLEU library import | 0% | 100% |
| BLEU smoothing method | 0% | 100% |
| ROUGE library import | 100% | 100% |
| ROUGE variants and stemmer | 55% | 0% |
| BERTScore library | 0% | 100% |
| BERTScore model type | 0% | 0% |
| Metric dataclass pattern | 22% | 100% |
| EvaluationSuite class | 25% | 100% |
| Dual output format | 60% | 100% |
| Custom metric support | 14% | 100% |
| Results saved to JSON | 62% | 100% |
| Three metrics covered | 40% | 100% |
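This scenario probes for BLEU with a smoothing function, multiple ROUGE variants with stemming, BERTScore, a Metric dataclass, and an EvaluationSuite that accepts custom metrics and writes results to JSON. The sketch below assumes the nltk, rouge-score, and bert-score libraries; the class and function names are illustrative.

```python
# Sketch of the metric patterns the checks above probe for; library choices and
# names are assumptions, not the skill's canonical code.
import json
from dataclasses import dataclass, asdict

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

@dataclass
class Metric:
    name: str
    value: float

def bleu(reference: str, candidate: str) -> Metric:
    # Smoothing avoids zero scores on short candidates ("BLEU smoothing method").
    smoothie = SmoothingFunction().method1
    value = sentence_bleu([reference.split()], candidate.split(), smoothing_function=smoothie)
    return Metric("bleu", value)

def rouge(reference: str, candidate: str) -> Metric:
    # Multiple ROUGE variants plus a stemmer ("ROUGE variants and stemmer").
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return Metric("rougeL_f1", scores["rougeL"].fmeasure)

def bertscore(reference: str, candidate: str) -> Metric:
    # bert-score returns precision/recall/F1 tensors; lang selects a default model.
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return Metric("bertscore_f1", float(f1.mean()))

class EvaluationSuite:
    """Runs the built-in metrics plus any custom callables ("custom metric support")."""
    def __init__(self, custom_metrics=None):
        self.metrics = [bleu, rouge, bertscore] + list(custom_metrics or [])

    def run(self, reference: str, candidate: str) -> list[Metric]:
        return [m(reference, candidate) for m in self.metrics]

suite = EvaluationSuite()
results = suite.run("The cat sat on the mat.", "A cat was sitting on the mat.")
with open("metric_results.json", "w") as f:
    json.dump([asdict(m) for m in results], f, indent=2)
```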
Statistical testing and regression detection
| Check | Score 1 | Score 2 |
| --- | --- | --- |
| T-test function used | 0% | 100% |
| Cohen's d formula | 100% | 100% |
| Effect size thresholds | 100% | 100% |
| Regression uses relative change | 0% | 100% |
| Regression threshold value | 0% | 100% |
| Cohen's kappa library | 100% | 100% |
| Kappa interpretation thresholds | 90% | 100% |
| P-value reported | 100% | 100% |
| Report includes all four analyses | 100% | 100% |
| Output written to JSON | 100% | 100% |
| Mean scores reported | 100% | 100% |
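The statistical scenario looks for a t-test, Cohen's d with conventional effect-size thresholds, regression detection based on relative change against a threshold, Cohen's kappa for inter-rater agreement, and a JSON report covering all four analyses along with mean scores and the p-value. A sketch under those assumptions, using scipy and scikit-learn; the 10% regression threshold and the sample data are illustrative.

```python
# Sketch of the statistical comparison these checks describe; thresholds and data
# are illustrative placeholders, not the skill's canonical values.
import json
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import cohen_kappa_score

def cohens_d(a, b):
    # Cohen's d: mean difference divided by the pooled standard deviation.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_std = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                         / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_std

def effect_size_label(d):
    # Conventional thresholds: 0.2 small, 0.5 medium, 0.8 large.
    d = abs(d)
    return "large" if d >= 0.8 else "medium" if d >= 0.5 else "small" if d >= 0.2 else "negligible"

def is_regression(baseline_mean, current_mean, threshold=0.10):
    # Regression is flagged on relative change, not absolute difference.
    return (baseline_mean - current_mean) / baseline_mean > threshold

baseline = [0.82, 0.79, 0.85, 0.81, 0.78]   # illustrative per-example scores
current = [0.74, 0.71, 0.77, 0.72, 0.70]
judge_a = [1, 1, 0, 1, 0]                    # illustrative annotator labels
judge_b = [1, 0, 0, 1, 0]

t_stat, p_value = ttest_ind(baseline, current)
d = cohens_d(baseline, current)
kappa = cohen_kappa_score(judge_a, judge_b)

# Report covers all four analyses: t-test, effect size, regression, agreement.
report = {
    "mean_baseline": float(np.mean(baseline)),
    "mean_current": float(np.mean(current)),
    "t_statistic": float(t_stat),
    "p_value": float(p_value),
    "cohens_d": float(d),
    "effect_size": effect_size_label(d),
    "regression_detected": bool(is_regression(np.mean(baseline), np.mean(current))),
    "inter_rater_kappa": float(kappa),
}
with open("statistical_report.json", "w") as f:
    json.dump(report, f, indent=2)
```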