Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Eval summary: 68 · Does it follow best practices? 54% · Impact: 91% · Average score across 3 eval scenarios: 1.75x · Passed · No known issues
Optimize this skill with Tessl:
npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md

LLM-as-Judge patterns
| Check | Score 1 | Score 2 |
| --- | --- | --- |
| Anthropic client import | 100% | 100% |
| Correct model name | 0% | 0% |
| Pydantic BaseModel used | 100% | 100% |
| Field constraints on ratings | 0% | 100% |
| Quality rating fields | 100% | 100% |
| Pairwise winner field | 50% | 100% |
| Pairwise confidence field | 0% | 100% |
| Reference eval float scores | 0% | 100% |
| Reference eval issues list | 0% | 100% |
| Three judge functions present | 100% | 100% |
| Results saved to JSON | 100% | 100% |
| No hardcoded API key | 100% | 100% |
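These checks target a recognizable judge pattern: an Anthropic client with no hardcoded key, Pydantic response models with constrained rating fields, pairwise winner/confidence fields, and results written to JSON. Below is a minimal sketch of that shape, assuming the `anthropic` and `pydantic` packages; the class names, prompt, and model string are illustrative and not the skill's exact code.

```python
# Minimal sketch of the judge-output schema these checks look for (names illustrative).
import json
import os
from anthropic import Anthropic
from pydantic import BaseModel, Field

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])  # key from env, never hardcoded

class QualityRating(BaseModel):
    # Rating fields with explicit bounds ("field constraints on ratings").
    accuracy: int = Field(ge=1, le=5)
    relevance: int = Field(ge=1, le=5)
    coherence: int = Field(ge=1, le=5)

class PairwiseVerdict(BaseModel):
    # Pairwise judge output: which response won, and how confident the judge is.
    winner: str = Field(description="'A' or 'B'")
    confidence: float = Field(ge=0.0, le=1.0)

def judge_quality(question: str, answer: str) -> QualityRating:
    """Single-response quality judge: asks the model for JSON matching QualityRating."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name; substitute the model you target
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Rate the answer on accuracy, relevance, and coherence (1-5).\n"
                f"Question: {question}\nAnswer: {answer}\n"
                "Respond with a JSON object whose only keys are accuracy, relevance, coherence."
            ),
        }],
    )
    return QualityRating.model_validate_json(response.content[0].text)

# Persist results to JSON, as the "Results saved to JSON" check expects.
rating = judge_quality("What is BLEU?", "BLEU is an n-gram overlap metric for generated text.")
with open("judge_results.json", "w") as f:
    json.dump(rating.model_dump(), f, indent=2)
```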
Automated metrics suite
| Check | Score 1 | Score 2 |
| --- | --- | --- |
| BLEU library import | 0% | 100% |
| BLEU smoothing method | 0% | 100% |
| ROUGE library import | 100% | 100% |
| ROUGE variants and stemmer | 55% | 0% |
| BERTScore library | 0% | 100% |
| BERTScore model type | 0% | 0% |
| Metric dataclass pattern | 22% | 100% |
| EvaluationSuite class | 25% | 100% |
| Dual output format | 60% | 100% |
| Custom metric support | 14% | 100% |
| Results saved to JSON | 62% | 100% |
| Three metrics covered | 40% | 100% |
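This scenario probes for BLEU with a smoothing function, multiple ROUGE variants with stemming, BERTScore, a Metric dataclass, and an EvaluationSuite that accepts custom metrics and writes results to JSON. The sketch below assumes the nltk, rouge-score, and bert-score libraries; the class and function names are illustrative.

```python
# Sketch of the metric patterns the checks above probe for; library choices and
# names are assumptions, not the skill's canonical code.
import json
from dataclasses import dataclass, asdict

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

@dataclass
class Metric:
    name: str
    value: float

def bleu(reference: str, candidate: str) -> Metric:
    # Smoothing avoids zero scores on short candidates ("BLEU smoothing method").
    smoothie = SmoothingFunction().method1
    value = sentence_bleu([reference.split()], candidate.split(), smoothing_function=smoothie)
    return Metric("bleu", value)

def rouge(reference: str, candidate: str) -> Metric:
    # Multiple ROUGE variants plus a stemmer ("ROUGE variants and stemmer").
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    return Metric("rougeL_f1", scores["rougeL"].fmeasure)

def bertscore(reference: str, candidate: str) -> Metric:
    # bert-score returns precision/recall/F1 tensors; lang selects a default model.
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return Metric("bertscore_f1", float(f1.mean()))

class EvaluationSuite:
    """Runs the built-in metrics plus any custom callables ("custom metric support")."""
    def __init__(self, custom_metrics=None):
        self.metrics = [bleu, rouge, bertscore] + list(custom_metrics or [])

    def run(self, reference: str, candidate: str) -> list[Metric]:
        return [m(reference, candidate) for m in self.metrics]

suite = EvaluationSuite()
results = suite.run("The cat sat on the mat.", "A cat was sitting on the mat.")
with open("metric_results.json", "w") as f:
    json.dump([asdict(m) for m in results], f, indent=2)
```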
Statistical testing and regression detection
| Check | Score 1 | Score 2 |
| --- | --- | --- |
| T-test function used | 0% | 100% |
| Cohen's d formula | 100% | 100% |
| Effect size thresholds | 100% | 100% |
| Regression uses relative change | 0% | 100% |
| Regression threshold value | 0% | 100% |
| Cohen's kappa library | 100% | 100% |
| Kappa interpretation thresholds | 90% | 100% |
| P-value reported | 100% | 100% |
| Report includes all four analyses | 100% | 100% |
| Output written to JSON | 100% | 100% |
| Mean scores reported | 100% | 100% |
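The statistical scenario looks for a t-test, Cohen's d with conventional effect-size thresholds, regression detection based on relative change against a threshold, Cohen's kappa for inter-rater agreement, and a JSON report covering all four analyses along with mean scores and the p-value. A sketch under those assumptions, using scipy and scikit-learn; the 10% regression threshold and the sample data are illustrative.

```python
# Sketch of the statistical comparison these checks describe; thresholds and data
# are illustrative placeholders, not the skill's canonical values.
import json
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import cohen_kappa_score

def cohens_d(a, b):
    # Cohen's d: mean difference divided by the pooled standard deviation.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_std = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                         / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_std

def effect_size_label(d):
    # Conventional thresholds: 0.2 small, 0.5 medium, 0.8 large.
    d = abs(d)
    return "large" if d >= 0.8 else "medium" if d >= 0.5 else "small" if d >= 0.2 else "negligible"

def is_regression(baseline_mean, current_mean, threshold=0.10):
    # Regression is flagged on relative change, not absolute difference.
    return (baseline_mean - current_mean) / baseline_mean > threshold

baseline = [0.82, 0.79, 0.85, 0.81, 0.78]   # illustrative per-example scores
current = [0.74, 0.71, 0.77, 0.72, 0.70]
judge_a = [1, 1, 0, 1, 0]                    # illustrative annotator labels
judge_b = [1, 0, 0, 1, 0]

t_stat, p_value = ttest_ind(baseline, current)
d = cohens_d(baseline, current)
kappa = cohen_kappa_score(judge_a, judge_b)

# Report covers all four analyses: t-test, effect size, regression, agreement.
report = {
    "mean_baseline": float(np.mean(baseline)),
    "mean_current": float(np.mean(current)),
    "t_statistic": float(t_stat),
    "p_value": float(p_value),
    "cohens_d": float(d),
    "effect_size": effect_size_label(d),
    "regression_detected": bool(is_regression(np.mean(baseline), np.mean(current))),
    "inter_rater_kappa": float(kappa),
}
with open("statistical_report.json", "w") as f:
    json.dump(report, f, indent=2)
```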