
llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Install with Tessl CLI

npx tessl i github:wshobson/agents --skill llm-evaluation


Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structure with explicit 'Use when' guidance and covers the domain adequately. However, it relies on somewhat abstract language ('comprehensive evaluation strategies') rather than concrete actions, and the trigger terms could better match natural user language patterns.

Suggestions

Replace abstract language with specific actions like 'create evaluation test suites, calculate accuracy/relevance metrics, compare model outputs, generate benchmark reports'

Add more natural trigger terms users would say: 'test my model', 'eval suite', 'prompt testing', 'measure accuracy', 'compare outputs', 'model benchmarks'
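Applying both suggestions, a revised frontmatter description might read as follows (hypothetical wording, assuming the skill uses standard SKILL.md YAML frontmatter):

```yaml
---
name: llm-evaluation
description: >
  Create evaluation test suites for LLM applications: calculate
  accuracy and relevance metrics, compare model outputs, run
  LLM-as-judge scoring, and generate benchmark reports. Use when
  users say things like "test my model", "build an eval suite",
  "measure accuracy", "compare outputs", or "benchmark prompts".
---
```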

Dimension scores

Specificity (2 / 3): Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, and benchmarking') but lacks concrete actions such as 'create test suites', 'calculate accuracy scores', or 'generate evaluation reports'.

Completeness (3 / 3): Clearly answers both what ('Implement comprehensive evaluation strategies...using automated metrics, human feedback, and benchmarking') and when ('Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks') with explicit trigger guidance.

Trigger Term Quality (2 / 3): Includes some relevant terms ('LLM performance', 'AI application quality', 'evaluation frameworks') but misses common variations users might say, such as 'test my model', 'benchmark prompts', 'measure accuracy', 'eval suite', or 'model testing'.

Distinctiveness / Conflict Risk (2 / 3): Focuses on LLM evaluation, a specific niche, but terms like 'testing' and 'quality' could overlap with general testing/QA skills. It could be more distinctive by naming specific evaluation types (e.g., 'prompt evaluation', 'model comparison').

Total: 9 / 12 (Passed)

Implementation

64%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a comprehensive, highly actionable skill with excellent code examples covering multiple evaluation approaches. However, it is overly long and monolithic, explains concepts Claude already knows (e.g., metric definitions), and lacks clear workflow guidance for implementing an evaluation strategy from start to finish with validation checkpoints.
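As one illustration of the LLM-as-Judge pattern the review credits, here is a minimal sketch. All names here are hypothetical, not taken from the skill itself, and `call_model` stands in for whatever LLM client the project actually uses:

```python
from dataclasses import dataclass
from typing import Callable

JUDGE_PROMPT = """Rate the answer from 1-5 for factual accuracy.
Question: {question}
Answer: {answer}
Reply with a single integer."""


@dataclass
class JudgeResult:
    score: int
    passed: bool


def judge_answer(question: str, answer: str,
                 call_model: Callable[[str], str],
                 threshold: int = 4) -> JudgeResult:
    """Score an answer with an LLM judge and apply a pass threshold."""
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(reply.strip())
    return JudgeResult(score=score, passed=score >= threshold)


# Stubbed judge for demonstration; a real setup would call an LLM API.
result = judge_answer("What is 2+2?", "4", call_model=lambda prompt: "5")
```

Keeping the judge a plain callable makes the pattern trivial to unit-test with a stub, which is exactly the kind of copy-paste-ready structure the Actionability score rewards.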

Suggestions

Add a clear workflow section at the top showing the recommended sequence for implementing evaluation (e.g., 1. Define metrics → 2. Create test dataset → 3. Run baseline → 4. Validate results → 5. Iterate)

Split into multiple files: keep SKILL.md as overview with Quick Start, move detailed implementations to METRICS.md, LLM_JUDGE.md, AB_TESTING.md, BENCHMARKING.md

Remove explanatory text for concepts Claude knows (e.g., 'BLEU: N-gram overlap (translation)') and just show the implementation

Add explicit validation checkpoints, such as 'Verify sample size is sufficient before drawing conclusions' or 'Check inter-rater agreement before trusting human evaluation results'
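The first and last suggestions above can be combined into a small sketch: a workflow that refuses to report results until a validation checkpoint (here, minimum sample size) passes. The names and threshold are illustrative assumptions, not the skill's own code:

```python
import statistics

MIN_SAMPLES = 30  # checkpoint: don't draw conclusions from tiny eval sets


def exact_match(output: str, expected: str) -> float:
    """Simplest possible metric: 1.0 on exact string match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_evaluation(dataset, metric):
    """Minimal workflow: validate dataset size, score each example,
    then summarize. Raises instead of returning unreliable numbers."""
    if len(dataset) < MIN_SAMPLES:
        raise ValueError(
            f"Need >= {MIN_SAMPLES} examples, got {len(dataset)}")
    scores = [metric(ex["output"], ex["expected"]) for ex in dataset]
    return {"mean": statistics.mean(scores), "n": len(scores)}


# 20 correct and 10 incorrect outputs against the same expected answer.
data = ([{"output": "4", "expected": "4"}] * 20
        + [{"output": "5", "expected": "4"}] * 10)
report = run_evaluation(data, exact_match)
```

Failing loudly at the checkpoint, rather than silently returning a mean over too few samples, is the behavior the "verify sample size" suggestion is asking for.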

Dimension scores

Conciseness (2 / 3): The skill is comprehensive but includes some unnecessary verbosity, such as explaining basic concepts like what BLEU/ROUGE are and listing metric definitions Claude already knows. The code examples are valuable, but some sections could be tightened.

Actionability (3 / 3): Excellent executable code throughout, with complete, copy-paste-ready implementations. The EvaluationSuite class, metric implementations, LLM-as-Judge patterns, and A/B testing framework are all fully functional, with proper imports and type hints.

Workflow Clarity (2 / 3): While individual components are well documented, the skill lacks clear end-to-end workflow guidance with validation checkpoints. There is no explicit sequence for setting up an evaluation pipeline, and no validation steps for ensuring evaluation results are reliable before acting on them.

Progressive Disclosure (2 / 3): The content is well organized with clear sections, but it is a monolithic document (~500 lines) that would benefit from being split into separate files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md). The external resources section is good, but internal content organization could be improved.

Total: 9 / 12 (Passed)
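For the A/B testing piece credited above, one self-contained way to compare two models on the same eval set is a paired bootstrap over per-example scores. This is a sketch under my own assumptions, not the skill's implementation:

```python
import random


def bootstrap_win_rate(scores_a, scores_b, n_boot=2000, seed=0):
    """Estimate how often model A's mean score beats model B's when
    the (paired) eval set is resampled with replacement."""
    assert len(scores_a) == len(scores_b), "scores must be paired"
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_boot):
        # Resample example indices, keeping the A/B pairing intact.
        sample = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in sample) / n
        mean_b = sum(scores_b[i] for i in sample) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_boot


# Illustrative per-example scores for two model variants.
a = [0.90, 0.80, 0.85, 0.95, 0.70, 0.90, 0.80, 0.85]
b = [0.60, 0.70, 0.65, 0.75, 0.60, 0.70, 0.55, 0.65]
p_a_better = bootstrap_win_rate(a, b)
```

Because resampling preserves the pairing, the comparison controls for example difficulty, which a naive unpaired mean comparison would not.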

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 passed

Validation for skill structure

Criteria results

skill_md_line_count (Warning): SKILL.md is long (696 lines); consider splitting content into references/ and linking.

Total: 10 / 11 (Passed)
