
# llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

- Quality: 41% (Does it follow best practices?)
- Impact: 95%
- Average score across 3 eval scenarios: 1.79x
- Security (by Snyk): Passed, no known issues

Optimize this skill with Tessl:

```
npx tessl skill review --optimize ./plugins/llm-application-dev/skills/llm-evaluation/SKILL.md
```

## Evaluation results

### Customer Support Response Quality Evaluator

LLM-as-Judge evaluator. Overall: 100% with context, 40% without context.

| Criteria | Without context | With context |
| --- | --- | --- |
| Anthropic client used | 100% | 100% |
| Correct judge model | 100% | 100% |
| max_tokens=500 | 0% | 100% |
| Pydantic output model | 0% | 100% |
| Field validators on scores | 0% | 100% |
| JSON parse via json.loads | 70% | 100% |
| Reasoning field present | 100% | 100% |
| Pairwise comparison model | 100% | 100% |
| Winner uses Literal type | 70% | 100% |
| Confidence field in pairwise | 20% | 100% |
| JSON prompt format | 100% | 100% |
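The criteria above describe a judge that returns structured output: a Pydantic model with validated scores and a reasoning field, plus a pairwise-comparison model whose winner is constrained with `Literal` and which carries a confidence field. A minimal sketch of what such models might look like (all class and field names here are illustrative assumptions, not the skill's actual code):

```python
import json
from typing import Literal

from pydantic import BaseModel, Field, field_validator


class JudgeVerdict(BaseModel):
    """Hypothetical structured output for a single-response judge."""

    score: float = Field(description="Quality score from 1 to 5")
    reasoning: str = Field(description="Why the judge assigned this score")

    @field_validator("score")
    @classmethod
    def score_in_range(cls, v: float) -> float:
        # Reject scores outside the rubric's range before they reach metrics.
        if not 1.0 <= v <= 5.0:
            raise ValueError("score must be between 1 and 5")
        return v


class PairwiseComparison(BaseModel):
    """Hypothetical structured output for an A-vs-B judgment."""

    winner: Literal["a", "b", "tie"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


def parse_verdict(raw_reply: str) -> JudgeVerdict:
    # Parse the judge's JSON reply first, then validate it with Pydantic,
    # matching the "JSON parse via json.loads" criterion.
    return JudgeVerdict(**json.loads(raw_reply))
```

Prompting the judge to answer in a fixed JSON shape, then routing the reply through `json.loads` and a validated model, makes malformed or out-of-range verdicts fail loudly instead of silently skewing aggregate scores.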

### Summarization Model Evaluation Suite

Automated text metrics suite. Overall: 87% with context, 47% without context.

| Criteria | Without context | With context |
| --- | --- | --- |
| EvaluationSuite pattern | 37% | 100% |
| Aggregation with np.mean | 0% | 100% |
| Returns raw_scores | 71% | 100% |
| BLEU uses SmoothingFunction().method4 | 0% | 100% |
| BLEU from nltk | 0% | 100% |
| ROUGE variant selection | 100% | 100% |
| ROUGE use_stemmer=True | 100% | 100% |
| ROUGE from rouge_score | 100% | 100% |
| BERTScore model type | 0% | 0% |
| BERTScore from bert_score | 0% | 100% |
| ROUGE returns fmeasure | 100% | 100% |
| BERTScore returns P, R, F1 | 0% | 100% |

### Prompt Strategy Comparison and Regression Analysis

A/B testing and regression detection. Overall: 100% with context, 39% without context.

| Criteria | Without context | With context |
| --- | --- | --- |
| scipy ttest_ind used | 100% | 100% |
| alpha=0.05 | 100% | 100% |
| Cohen's d formula | 80% | 100% |
| Cohen's d interpretation thresholds | 100% | 100% |
| relative_improvement field | 0% | 100% |
| winner field | 100% | 100% |
| statistically_significant field | 42% | 100% |
| RegressionDetector class | 62% | 100% |
| Regression threshold=0.05 | 0% | 100% |
| Relative change formula | 0% | 100% |
| has_regression boolean | 42% | 100% |
| numpy for score arrays | 100% | 100% |
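The criteria above point at a standard statistical toolkit: `scipy.stats.ttest_ind` at `alpha=0.05`, Cohen's d with the conventional interpretation thresholds, and a `RegressionDetector` that flags drops larger than a 5% relative threshold. A sketch under those assumptions (function and field names are illustrative, not the skill's actual code):

```python
import numpy as np
from scipy import stats


def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_std = np.sqrt(
        ((na - 1) * a.std(ddof=1) ** 2 + (nb - 1) * b.std(ddof=1) ** 2)
        / (na + nb - 2)
    )
    return (b.mean() - a.mean()) / pooled_std


def compare_strategies(scores_a, scores_b, alpha: float = 0.05) -> dict:
    """Hypothetical A/B comparison of two prompt strategies' score arrays."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    t_stat, p_value = stats.ttest_ind(a, b)
    d = cohens_d(a, b)
    # Conventional |d| interpretation thresholds: 0.2 / 0.5 / 0.8.
    magnitude = (
        "negligible" if abs(d) < 0.2
        else "small" if abs(d) < 0.5
        else "medium" if abs(d) < 0.8
        else "large"
    )
    return {
        "winner": "b" if b.mean() > a.mean() else "a",
        "relative_improvement": (b.mean() - a.mean()) / a.mean(),
        "statistically_significant": bool(p_value < alpha),
        "effect_size": float(d),
        "effect_magnitude": magnitude,
        "p_value": float(p_value),
    }


class RegressionDetector:
    """Flags a regression when the mean drops by more than `threshold` relative."""

    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold

    def check(self, baseline_scores, new_scores) -> dict:
        baseline = float(np.mean(baseline_scores))
        new = float(np.mean(new_scores))
        # Relative change: (new - baseline) / baseline.
        relative_change = (new - baseline) / baseline
        return {
            "has_regression": relative_change < -self.threshold,
            "relative_change": relative_change,
        }
```

Pairing a significance test with an effect size guards against both failure modes: a large-looking gap on few samples (not significant) and a statistically significant but practically negligible difference.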

Repository: wshobson/agents
Evaluated agent: Claude Code
Evaluated model: Claude Sonnet 4.6
