
llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Install with Tessl CLI

npx tessl i github:wshobson/agents --skill llm-evaluation

Does it follow best practices? Skill structure validation score: 72


Evaluation results

Text Summarization Evaluation Suite
Automated metrics pipeline
Overall: 90% with context · 57% without context

| Criteria | Without context | With context |
| --- | --- | --- |
| Metric dataclass fields | 50% | 100% |
| Metric static factory methods | 80% | 100% |
| EvaluationSuite async evaluate | 0% | 0% |
| evaluate return structure | 0% | 100% |
| BLEU smoothing function | 0% | 100% |
| ROUGE score variants | 0% | 100% |
| ROUGE stemmer enabled | 100% | 100% |
| BERTScore model | 0% | 100% |
| BERTScore return keys | 0% | 100% |
| Correct library imports | 100% | 100% |

Without context: $0.7491 · 8m 33s · 23 turns · 177 in / 13,307 out tokens

With context: $0.9183 · 12m 41s · 23 turns · 758 in / 11,626 out tokens
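
The criteria above outline what such a pipeline would contain: a Metric dataclass with static factory methods, an EvaluationSuite with an async evaluate method, smoothed BLEU, multi-variant ROUGE with stemming, and BERTScore. A minimal sketch follows, assuming field names, method signatures, and an evaluate return structure that the skill itself may define differently:

```python
# Hedged sketch of the automated-metrics pipeline implied by the criteria.
# The Metric fields, factory names, and evaluate() return shape are assumptions.
from dataclasses import dataclass

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer


@dataclass
class Metric:
    name: str
    score: float

    @staticmethod
    def bleu(candidate: str, reference: str) -> "Metric":
        # Smoothing keeps short texts with no 4-gram overlap from scoring zero.
        smoother = SmoothingFunction().method1
        value = sentence_bleu(
            [reference.split()], candidate.split(), smoothing_function=smoother
        )
        return Metric("bleu", value)

    @staticmethod
    def rouge(candidate: str, reference: str) -> dict[str, "Metric"]:
        # Stemming enabled so morphological variants still count as matches.
        scorer = rouge_scorer.RougeScorer(
            ["rouge1", "rouge2", "rougeL"], use_stemmer=True
        )
        scores = scorer.score(reference, candidate)
        return {name: Metric(name, s.fmeasure) for name, s in scores.items()}

    @staticmethod
    def bertscore(candidate: str, reference: str) -> dict[str, float]:
        from bert_score import score  # heavyweight, so imported lazily

        p, r, f1 = score([candidate], [reference], lang="en")
        return {"precision": float(p[0]), "recall": float(r[0]), "f1": float(f1[0])}


class EvaluationSuite:
    async def evaluate(self, candidate: str, reference: str) -> dict[str, float]:
        # Async so callers can fan out over a dataset with asyncio.gather.
        results = {"bleu": Metric.bleu(candidate, reference).score}
        results |= {k: m.score for k, m in Metric.rouge(candidate, reference).items()}
        results |= {f"bertscore_{k}": v
                    for k, v in Metric.bertscore(candidate, reference).items()}
        return results
```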

Customer Support Chatbot Response Judge
LLM-as-Judge evaluation
Overall: 88% with context · 28% without context

| Criteria | Without context | With context |
| --- | --- | --- |
| Anthropic client | 100% | 100% |
| Pydantic quality model | 100% | 100% |
| Pydantic pairwise model | 100% | 100% |
| claude-sonnet-4-6 model | 0% | 100% |
| Quality model fields | 0% | 100% |
| Field range validation | 0% | 100% |
| Winner Literal type | 100% | 100% |
| Pairwise confidence field | 100% | 100% |
| Position bias handling | 100% | 100% |
| Async functions | 0% | 0% |

Without context: $1.2910 · 12m 48s · 33 turns · 259 in / 15,077 out tokens

With context: $0.4586 · 4m 51s · 16 turns · 959 in / 5,740 out tokens
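
Read together, these criteria describe an LLM-as-judge built on the Anthropic async client and Pydantic response models, judging each pair twice in swapped order to control for position bias. A sketch under those assumptions follows; the field names, rubric, and prompt wording are guesses, and only the claude-sonnet-4-6 model id is taken from the criteria table:

```python
# Hedged sketch of the LLM-as-judge setup implied by the criteria. Field names
# and the judging prompt are assumptions, not the skill's actual design.
from typing import Literal

from anthropic import AsyncAnthropic
from pydantic import BaseModel, Field


class QualityJudgment(BaseModel):
    # Assumed per-response rubric with range-validated 1-5 scores.
    helpfulness: int = Field(ge=1, le=5)
    accuracy: int = Field(ge=1, le=5)
    tone: int = Field(ge=1, le=5)
    reasoning: str


class PairwiseJudgment(BaseModel):
    winner: Literal["a", "b", "tie"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str


client = AsyncAnthropic()


async def _ask(question: str, first: str, second: str) -> PairwiseJudgment:
    msg = await client.messages.create(
        model="claude-sonnet-4-6",  # model id from the criteria table
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nResponse A: {first}\n\n"
                f"Response B: {second}\n\nReply with JSON: "
                '{"winner": "a"|"b"|"tie", "confidence": 0-1, "reasoning": "..."}'
            ),
        }],
    )
    return PairwiseJudgment.model_validate_json(msg.content[0].text)


async def judge_pairwise(question: str, a: str, b: str) -> PairwiseJudgment:
    # Judge both orderings; position bias shows up as order-dependent verdicts.
    first = await _ask(question, a, b)
    second = await _ask(question, b, a)
    flipped = {"a": "b", "b": "a", "tie": "tie"}[second.winner]
    if first.winner == flipped:
        return first
    return PairwiseJudgment(
        winner="tie", confidence=0.5,
        reasoning="Verdict flipped with ordering; treating as a tie.",
    )
```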

Prompt Variant Analysis for Question-Answering System
A/B testing and regression detection
Overall: 100% with context · 44% without context

| Criteria | Without context | With context |
| --- | --- | --- |
| scipy t-test | 100% | 100% |
| Cohen's d formula | 0% | 100% |
| analyze return keys | 20% | 100% |
| Cohen's d interpretation | 100% | 100% |
| Regression threshold default | 100% | 100% |
| Regression detection logic | 37% | 100% |
| Cohen's kappa import | 0% | 100% |
| Kappa interpretation bands | 100% | 100% |
| ABTest add_result method | 0% | 100% |
| report.json output | 100% | 100% |

Without context: $0.5866 · 10m 2s · 20 turns · 153 in / 8,708 out tokens

With context: $0.7240 · 6m 20s · 29 turns · 482 in / 7,623 out tokens
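
The criteria here map onto a statistical comparison module: scipy's independent t-test, a pooled-variance Cohen's d with the conventional interpretation bands, Cohen's kappa for judge agreement, and a report.json artifact. A sketch assuming those pieces; the analyze return keys, the 0.05 regression threshold, and the ABTest interface are illustrative guesses:

```python
# Hedged sketch of the A/B analysis implied by the criteria. Return keys,
# the 0.05 regression threshold, and the ABTest shape are assumptions.
import json
from statistics import mean

import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score


def cohens_d(a: list[float], b: list[float]) -> float:
    # Pooled-standard-deviation form of Cohen's d.
    na, nb = len(a), len(b)
    pooled = np.sqrt(
        ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    )
    return float((mean(a) - mean(b)) / pooled)


def interpret_d(d: float) -> str:
    # Conventional effect-size bands.
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"


def interpret_kappa(k: float) -> str:
    # Landis & Koch agreement bands for Cohen's kappa.
    if k < 0:
        return "poor"
    if k < 0.2:
        return "slight"
    if k < 0.4:
        return "fair"
    if k < 0.6:
        return "moderate"
    if k < 0.8:
        return "substantial"
    return "almost perfect"


def analyze(a: list[float], b: list[float], regression_threshold: float = 0.05) -> dict:
    t_stat, p_value = stats.ttest_ind(a, b)
    d = cohens_d(a, b)
    # Assumed rule: flag a regression when variant B trails A by more than
    # the threshold and the difference is statistically significant.
    regressed = (mean(a) - mean(b)) > regression_threshold and p_value < 0.05
    return {
        "t_statistic": float(t_stat),
        "p_value": float(p_value),
        "cohens_d": d,
        "effect_size": interpret_d(d),
        "regression_detected": regressed,
    }


class ABTest:
    def __init__(self) -> None:
        self.results: dict[str, list[float]] = {"a": [], "b": []}

    def add_result(self, variant: str, score: float) -> None:
        self.results[variant].append(score)

    def report(self, path: str = "report.json") -> dict:
        summary = analyze(self.results["a"], self.results["b"])
        with open(path, "w") as f:
            json.dump(summary, f, indent=2)
        return summary


def judge_agreement(labels_a: list[str], labels_b: list[str]) -> str:
    # Inter-rater agreement between two judges over the same label set.
    return interpret_kappa(cohen_kappa_score(labels_a, labels_b))
```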

Evaluated agent: Claude Code


Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.