
llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Review summary:

- Quality: 41% ("Does it follow best practices?")
- Impact: 95%, 1.79x (average score across 3 eval scenarios)
- Security (by Snyk): Passed, no known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./plugins/llm-application-dev/skills/llm-evaluation/SKILL.md

Quality

Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness with an explicit 'Use when' clause, which is its strongest aspect. However, it relies on somewhat abstract category-level language ('comprehensive evaluation strategies', 'automated metrics') rather than listing concrete, specific actions. The trigger terms cover the basics but miss many natural variations users might employ when seeking this skill.

Suggestions

Replace abstract categories with concrete actions, e.g., 'Create test suites, compute accuracy/BLEU/ROUGE scores, run A/B comparisons between prompts, detect hallucinations, and track regression across model versions'.

Expand trigger terms in the 'Use when' clause to include natural variations like 'evals', 'prompt testing', 'model comparison', 'hallucination checking', 'scoring LLM outputs', or 'regression testing'.
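Putting both suggestions together, a revised description might look like the sketch below. This is illustrative only: the frontmatter field names (`name`, `description`) follow the common SKILL.md convention, and the wording is assembled from the suggested actions and trigger terms, not taken from the actual skill.

```yaml
---
name: llm-evaluation
description: >
  Create LLM test suites, compute accuracy/BLEU/ROUGE scores, run A/B
  comparisons between prompts, detect hallucinations, and track regressions
  across model versions. Use when running evals, prompt testing, model
  comparison, scoring LLM outputs, or regression testing of AI applications.
---
```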

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (LLM evaluation) and mentions some actions ('automated metrics, human feedback, benchmarking'), but these are more categories than concrete actions. It doesn't list specific tasks like 'create test suites', 'compute BLEU/ROUGE scores', or 'build evaluation pipelines'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when' clause covering testing LLM performance, measuring AI application quality, or establishing evaluation frameworks). | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'LLM', 'evaluation', 'benchmarking', 'AI application quality', but misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'hallucination detection', or 'regression testing'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on LLM evaluation is a reasonably distinct niche, but terms like 'testing', 'performance', and 'quality' are broad enough to potentially overlap with general testing/QA skills or performance optimization skills. | 2 / 3 |

Total: 9 / 12 (Passed)

Implementation

14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads as an exhaustive reference document rather than an actionable skill guide. It is severely verbose, cataloging metrics and patterns Claude already knows, while lacking clear workflows, validation steps, and progressive disclosure. The code examples, while numerous, often depend on undefined functions and are not truly executable without significant additional work.

Suggestions

Drastically reduce content to a concise Quick Start workflow (e.g., 'Step 1: Define test cases → Step 2: Choose metrics → Step 3: Run evaluation → Step 4: Analyze results → Step 5: Decide') with validation checkpoints, and move detailed implementations to separate reference files.
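The suggested five-step workflow could be sketched as a single function with checkpoints. This is a hypothetical skeleton, not code from the skill: `run_model` and the metric functions are placeholders supplied by the caller, and the 0.8 pass threshold is an arbitrary example.

```python
def run_eval(test_cases, run_model, metrics, threshold=0.8):
    """Step 1: test cases are defined by the caller.
    Step 2: metrics are chosen and passed in as {name: fn(output, expected)}.
    Step 3: run the model and score each case.
    Step 4: analyze by averaging each metric.
    Step 5: decide pass/fail against a threshold."""
    # Validation checkpoint: fail fast on an empty suite.
    assert test_cases, "Checkpoint: at least one test case is required"
    results = []
    for case in test_cases:
        output = run_model(case["input"])
        scores = {name: fn(output, case["expected"])
                  for name, fn in metrics.items()}
        results.append({"input": case["input"], "scores": scores})

    # Step 4: average each metric across all cases.
    summary = {
        name: sum(r["scores"][name] for r in results) / len(results)
        for name in metrics
    }
    # Step 5: pass only if every metric clears the threshold.
    passed = all(v >= threshold for v in summary.values())
    return {"summary": summary, "passed": passed}
```

A skeleton like this keeps SKILL.md short while the detailed metric implementations live in reference files.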

Remove metric definition lists (BLEU, ROUGE, precision/recall, etc.) that Claude already knows, and focus only on project-specific conventions or non-obvious implementation details.

Make the Quick Start example fully executable by providing concrete implementations for all referenced functions (calculate_accuracy, check_groundedness, etc.) or removing undefined references.
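For instance, the undefined helpers could be given minimal concrete bodies like the ones below. These are hypothetical stand-ins matching only the function names the review cites; the skill's intended implementations may be more sophisticated (e.g., an NLI model or LLM judge for groundedness).

```python
def calculate_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if not predictions:
        return 0.0
    matches = sum(p.strip() == r.strip()
                  for p, r in zip(predictions, references))
    return matches / len(predictions)

def check_groundedness(answer, context):
    """Crude lexical groundedness: share of answer tokens that also
    appear in the source context. A real check would use an NLI model
    or an LLM-as-judge call instead of token overlap."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```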

Split content into separate files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md, REGRESSION.md) and provide a concise overview with clear navigation links in the main SKILL.md.
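One way the split could look, using the file names from the suggestion and the `references/` directory the validator itself hints at (layout is illustrative):

```text
skills/llm-evaluation/
├── SKILL.md            # concise overview, Quick Start, navigation links
└── references/
    ├── METRICS.md
    ├── LLM_JUDGE.md
    ├── AB_TESTING.md
    └── REGRESSION.md
```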

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | Extremely verbose at ~500+ lines. Extensively lists metric definitions Claude already knows (BLEU, ROUGE, precision/recall, etc.), explains basic concepts like Cohen's kappa interpretation thresholds, and includes lengthy code blocks for straightforward library wrappers. Much of this is reference material that doesn't earn its token cost. | 1 / 3 |
| Actionability | Provides substantial code examples that appear mostly executable, but many rely on undefined functions (e.g., calculate_accuracy, calculate_bertscore in the EvaluationSuite usage, check_groundedness, your_model, your_chain), making them not truly copy-paste ready. The Quick Start example cannot run without significant additional implementation. | 2 / 3 |
| Workflow Clarity | Despite being a complex multi-step domain (evaluation pipelines involve data preparation, model execution, metric calculation, analysis, and decision-making), there is no clear sequenced workflow. The content is organized as a reference catalog of patterns rather than a guided process. No validation checkpoints or feedback loops for catching evaluation errors. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of content with no references to external files despite the massive size clearly warranting decomposition. All metric implementations, judge patterns, A/B testing, regression testing, LangSmith integration, and benchmarking are inlined in a single file with no navigation aids or content splitting. | 1 / 3 |

Total: 5 / 12 (Passed)

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 passed

Validation for skill structure:

| Criteria | Description | Result |
| --- | --- | --- |
| skill_md_line_count | SKILL.md is long (667 lines); consider splitting into references/ and linking | Warning |

Total: 10 / 11 (Passed)

Repository: wshobson/agents (Reviewed)

