
# llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

- **Quality:** 54% (Does it follow best practices?)
- **Impact:** 91%, a 1.75x improvement (average score across 3 eval scenarios)
- **Security** (by Snyk): Passed, no known issues

To optimize this skill with Tessl:

    npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md

## Evaluation results

### Evaluating Customer Support Chatbot Responses

LLM-as-Judge patterns. With context: 92% · Without context: 38%

| Criteria | Without context | With context |
| --- | --- | --- |
| Anthropic client import | 100% | 100% |
| Correct model name | 0% | 0% |
| Pydantic BaseModel used | 100% | 100% |
| Field constraints on ratings | 0% | 100% |
| Quality rating fields | 100% | 100% |
| Pairwise winner field | 50% | 100% |
| Pairwise confidence field | 0% | 100% |
| Reference eval float scores | 0% | 100% |
| Reference eval issues list | 0% | 100% |
| Three judge functions present | 100% | 100% |
| Results saved to JSON | 100% | 100% |
| No hardcoded API key | 100% | 100% |
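The criteria above describe an LLM-as-judge harness: three judge functions (direct quality rating, pairwise comparison, reference-based evaluation), structured response models with constrained rating fields, results persisted to JSON, and the API key read from the environment rather than hardcoded. Here is a minimal, dependency-free sketch of that shape; the skill's criteria call for Pydantic `BaseModel`s with `Field` constraints and the `anthropic` client, so the stdlib dataclasses, prompt wording, and the `call_model` callable below are illustrative assumptions, not the skill's actual code:

```python
import json
import os
from dataclasses import dataclass, asdict, field

# Response schemas. The skill expects Pydantic BaseModels with Field
# constraints; these dataclasses mirror that shape without the dependency.

@dataclass
class QualityRating:
    helpfulness: int  # 1-5
    accuracy: int     # 1-5
    tone: int         # 1-5

    def __post_init__(self):
        for name in ("helpfulness", "accuracy", "tone"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

@dataclass
class PairwiseVerdict:
    winner: str        # "A" or "B"
    confidence: float  # 0.0-1.0

@dataclass
class ReferenceEval:
    score: float                           # 0.0-1.0 vs. the reference answer
    issues: list[str] = field(default_factory=list)

# The three judge functions. `call_model` is any callable taking a prompt and
# returning the judge model's raw JSON reply; a real harness would wrap
# anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"]).

def judge_quality(call_model, response: str) -> QualityRating:
    prompt = ("Rate this support reply 1-5 on helpfulness, accuracy and tone. "
              f"Answer as JSON.\n\nReply:\n{response}")
    return QualityRating(**json.loads(call_model(prompt)))

def judge_pairwise(call_model, a: str, b: str) -> PairwiseVerdict:
    prompt = f"Which reply is better, A or B? Answer as JSON.\n\nA:\n{a}\n\nB:\n{b}"
    return PairwiseVerdict(**json.loads(call_model(prompt)))

def judge_reference(call_model, response: str, reference: str) -> ReferenceEval:
    prompt = ("Score this reply 0-1 against the reference and list issues. "
              f"Answer as JSON.\n\nReply:\n{response}\n\nReference:\n{reference}")
    return ReferenceEval(**json.loads(call_model(prompt)))

def run_evals(call_model, response, alternative, reference,
              path="judge_results.json"):
    results = {
        "quality": asdict(judge_quality(call_model, response)),
        "pairwise": asdict(judge_pairwise(call_model, response, alternative)),
        "reference": asdict(judge_reference(call_model, response, reference)),
    }
    with open(path, "w") as f:
        json.dump(results, f, indent=2)  # results saved to JSON
    return results
```

Because `call_model` is injected, the harness runs offline against a stub that returns canned JSON, which is how the parsing and persistence logic can be tested without an API key.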

### Benchmarking AI-Generated Article Summaries

Automated metrics suite. With context: 82% · Without context: 50%

| Criteria | Without context | With context |
| --- | --- | --- |
| BLEU library import | 0% | 100% |
| BLEU smoothing method | 0% | 100% |
| ROUGE library import | 100% | 100% |
| ROUGE variants and stemmer | 55% | 0% |
| BERTScore library | 0% | 100% |
| BERTScore model type | 0% | 0% |
| Metric dataclass pattern | 22% | 100% |
| EvaluationSuite class | 25% | 100% |
| Dual output format | 60% | 100% |
| Custom metric support | 14% | 100% |
| Results saved to JSON | 62% | 100% |
| Three metrics covered | 40% | 100% |
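These criteria sketch an automated-metrics harness: BLEU with a smoothing method (e.g. nltk's `SmoothingFunction`), ROUGE variants with stemming (e.g. `rouge_score`'s `RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)`), BERTScore, a metric dataclass, an `EvaluationSuite` class with custom-metric support, and dual human-readable/JSON output. Below is a dependency-free sketch of the suite's skeleton with a stdlib token-overlap F1 standing in where the library-backed metrics would register; the class and function names are assumptions matching the criteria, not the skill's actual code:

```python
import json
from collections import Counter
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    score: float  # 0.0-1.0

class EvaluationSuite:
    """Pluggable metric suite: register callables, run them, report twice."""

    def __init__(self):
        self._metrics = {}

    def register(self, name, fn):
        # Custom metric support: any (candidate, reference) -> float callable.
        self._metrics[name] = fn

    def evaluate(self, candidate: str, reference: str) -> list[MetricResult]:
        return [MetricResult(name, fn(candidate, reference))
                for name, fn in self._metrics.items()]

    def report(self, results, json_path=None):
        # Dual output format: human-readable text plus machine-readable JSON.
        text = "\n".join(f"{r.name}: {r.score:.3f}" for r in results)
        payload = {r.name: r.score for r in results}
        if json_path:
            with open(json_path, "w") as f:
                json.dump(payload, f, indent=2)
        return text, payload

def token_f1(candidate: str, reference: str) -> float:
    """Stdlib stand-in metric: multiset token-overlap F1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

suite = EvaluationSuite()
suite.register("token_f1", token_f1)
# BLEU, ROUGE, and BERTScore plug in the same way, e.g. with nltk:
#   from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
#   suite.register("bleu", lambda c, r: sentence_bleu(
#       [r.split()], c.split(),
#       smoothing_function=SmoothingFunction().method1))
```

The registry pattern is what "custom metric support" buys: callers add project-specific metrics without touching the suite itself.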

### Deciding Between Two Prompt Strategies with Statistical Rigor

Statistical testing and regression detection. With context: 100% · Without context: 29%

| Criteria | Without context | With context |
| --- | --- | --- |
| T-test function used | 0% | 100% |
| Cohen's d formula | 100% | 100% |
| Effect size thresholds | 100% | 100% |
| Regression uses relative change | 0% | 100% |
| Regression threshold value | 0% | 100% |
| Cohen's kappa library | 100% | 100% |
| Kappa interpretation thresholds | 90% | 100% |
| P-value reported | 100% | 100% |
| Report includes all four analyses | 100% | 100% |
| Output written to JSON | 100% | 100% |
| Mean scores reported | 100% | 100% |
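These criteria describe a four-part analysis: a significance test with the p-value reported (e.g. `scipy.stats.ttest_ind`), Cohen's d with the conventional 0.2/0.5/0.8 effect-size thresholds, regression detection on relative change against a threshold, and inter-rater agreement via Cohen's kappa with interpretation bands. The pure-Python sketch below covers the arithmetic pieces; the 5% regression threshold and the Landis-Koch kappa bands are assumed choices, and a real harness would take its p-value from scipy rather than recomputing it:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

def effect_size_label(d: float) -> str:
    # Conventional thresholds: 0.2 small, 0.5 medium, 0.8 large.
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"

def is_regression(baseline: float, new: float, threshold: float = 0.05) -> bool:
    # Relative change, not absolute difference; 5% is an assumed default.
    return (new - baseline) / baseline < -threshold

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Two-rater Cohen's kappa over categorical labels."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_obs = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    p_exp = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                for l in labels)
    if p_exp == 1.0:
        return 1.0
    return (p_obs - p_exp) / (1 - p_exp)

def kappa_label(k: float) -> str:
    # Landis-Koch interpretation bands (an assumed choice of scale).
    if k < 0:
        return "poor"
    if k < 0.2:
        return "slight"
    if k < 0.4:
        return "fair"
    if k < 0.6:
        return "moderate"
    if k < 0.8:
        return "substantial"
    return "almost perfect"
```

A full comparison would also run `scipy.stats.ttest_ind(a, b)` and report the resulting p-value alongside the per-strategy mean scores, then dump the combined four-analysis report to JSON.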

Evaluated with:

- Repository: Dicklesworthstone/pi_agent_rust
- Agent: Claude Code
- Model: Claude Sonnet 4.6
