
llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Overall score: 60

- Quality: 41% — Does it follow best practices?
- Impact: 1.79x
- Evals: 95% (average score across 3 eval scenarios)
- Security (by Snyk): Passed — no known issues

Optimize this skill with Tessl:

```
npx tessl skill review --optimize ./plugins/llm-application-dev/skills/llm-evaluation/SKILL.md
```

Quality

Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness, with an explicit 'Use when' clause, and covers the domain adequately. However, it relies on somewhat abstract category names (automated metrics, human feedback, benchmarking) rather than listing concrete actions, and its trigger terms could be expanded to cover more natural user phrasings. Distinctiveness is moderate: the description could overlap with general testing or ML evaluation skills.

Suggestions

Replace high-level categories with specific concrete actions, e.g., 'Create test suites, compute accuracy/BLEU/ROUGE scores, run A/B comparisons between prompts, build evaluation pipelines'.

Expand trigger terms in the 'Use when' clause to include natural variations like 'evals', 'prompt testing', 'model comparison', 'regression testing', 'scoring LLM outputs'.
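Applying both suggestions, the SKILL.md frontmatter might read something like the following. This is an illustrative sketch, not the reviewed skill's actual frontmatter; the exact wording and field set depend on the skill format in use:

```yaml
---
name: llm-evaluation
description: >
  Build evaluation pipelines for LLM applications: create test suites,
  compute accuracy/BLEU/ROUGE scores, run A/B comparisons between prompts,
  and collect human feedback. Use when running evals, prompt testing,
  model comparison, regression testing, or scoring LLM outputs.
---
```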

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are still fairly high-level categories rather than concrete actions like 'create test suites', 'compute BLEU/ROUGE scores', or 'build evaluation pipelines'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when' clause covering testing LLM performance, measuring AI application quality, or establishing evaluation frameworks). | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', and 'metrics'. However, it misses common natural variations users might say, such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'regression testing', or 'scoring'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on LLM evaluation is a reasonably distinct niche, but terms like 'testing', 'quality', and 'metrics' are broad enough to potentially overlap with general testing/QA skills or monitoring skills. The description could be more precise about what distinguishes it from general software testing or ML model evaluation. | 2 / 3 |
| **Total** | | 9 / 12 (Passed) |

Implementation

14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like a comprehensive textbook chapter or API reference catalog than an actionable skill file. It is excessively long, explains many concepts Claude already knows (metric definitions, statistical test interpretations), and lacks a clear workflow for actually conducting an evaluation. The code examples, while numerous, often reference undefined functions and cannot be executed as-is.

Suggestions

Drastically reduce content to a concise overview (~50-80 lines) with a clear step-by-step workflow (e.g., 1. Define test cases, 2. Choose metrics, 3. Run evaluation, 4. Analyze results, 5. Detect regressions) and move detailed implementations to separate reference files.

Remove metric glossary lists (BLEU, ROUGE, accuracy definitions) that Claude already knows, and focus only on project-specific conventions or non-obvious implementation details.

Make the Quick Start example fully executable by either implementing the referenced functions inline or removing the abstraction layer and showing a simple, complete end-to-end example.

Add explicit validation checkpoints to the workflow, such as verifying test case format before running, checking for sufficient sample sizes before statistical tests, and validating that metric scores fall within expected ranges.
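Taken together, the suggested workflow and checkpoints could be sketched as a minimal, self-contained evaluation loop. All names here (`run_model`, `exact_match`, `evaluate`) are illustrative stand-ins, not APIs from the reviewed skill:

```python
def run_model(prompt: str) -> str:
    # Placeholder for the LLM application under test; replace with a real call.
    return "Paris" if "capital of France" in prompt else ""

def exact_match(expected: str, actual: str) -> float:
    # Simplest possible metric: 1.0 on a case-insensitive exact match.
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def evaluate(test_cases, metric, baseline=None, regression_threshold=0.05):
    # Checkpoint 1: verify test case format before running anything.
    for case in test_cases:
        assert {"input", "expected"} <= case.keys(), f"malformed case: {case}"
    scores = [metric(c["expected"], run_model(c["input"])) for c in test_cases]
    # Checkpoint 2: metric scores must fall within the expected [0, 1] range.
    assert all(0.0 <= s <= 1.0 for s in scores), "metric out of range"
    mean = sum(scores) / len(scores)
    # Checkpoint 3: flag a regression if the mean drops below the baseline.
    regressed = baseline is not None and mean < baseline - regression_threshold
    return {"mean": mean, "scores": scores, "regressed": regressed}

cases = [{"input": "What is the capital of France?", "expected": "Paris"}]
result = evaluate(cases, exact_match, baseline=0.9)
```

A fully inlined example like this is what the Quick Start suggestion above asks for: no undefined helpers, and each validation checkpoint is an explicit assertion rather than implicit behavior.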

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | Extremely verbose at ~500+ lines. Explains well-known concepts (what BLEU, ROUGE, accuracy, precision are), lists metric definitions Claude already knows, and includes extensive boilerplate code for standard patterns. The metric glossary sections add no value for Claude. | 1 / 3 |
| Actionability | Provides substantial code examples that are mostly executable, but many rely on undefined functions (e.g., `calculate_accuracy` and `calculate_bertscore` in the `EvaluationSuite` usage, `check_groundedness`, `your_model`, `your_chain`). The Quick Start example cannot actually run as-is due to missing implementations referenced by the `Metric` static methods. | 2 / 3 |
| Workflow Clarity | No clear workflow or sequencing for how to actually conduct an evaluation end-to-end. The content is organized as a reference catalog of code snippets rather than a guided process. There are no validation checkpoints, no guidance on when to use which approach, and no feedback loops for iterating on evaluation results. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of content with no references to external files. Everything is inlined in a single massive document: the LangSmith integration, A/B testing, benchmarking, human evaluation, and all metric implementations could be split into separate reference files with a concise overview in the main skill. | 1 / 3 |
| **Total** | | 5 / 12 (Passed) |

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 checks passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| skill_md_line_count | SKILL.md is long (667 lines); consider splitting into references/ and linking | Warning |
| **Total** | | 10 / 11 (Passed) |

Repository: wshobson/agents (Reviewed)

