Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Quality — 41% · Does it follow best practices?

Impact — 1.79x · Average score across 3 eval scenarios: 95% · Passed · No known issues

Optimize this skill with Tessl:

`npx tessl skill review --optimize ./plugins/llm-application-dev/skills/llm-evaluation/SKILL.md`
Discovery — 67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description has good structural completeness with an explicit 'Use when' clause and covers the domain adequately. However, it relies on somewhat abstract category names rather than concrete actions, and the trigger terms could be expanded to include more natural user language variations. The description is functional but could be more distinctive and specific.
Suggestions
- Replace high-level categories with concrete actions, e.g., 'Build test suites, compute accuracy/BLEU/ROUGE scores, set up human annotation pipelines, run A/B comparisons between model versions'.
- Add more natural trigger terms users would actually say, such as 'evals', 'prompt testing', 'model comparison', 'regression testing', 'accuracy metrics', '.jsonl test sets'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are still fairly high-level categories rather than concrete actions like 'create test suites, compute BLEU/ROUGE scores, build annotation interfaces'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks'). | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', but misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'regression testing', 'A/B testing'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The LLM evaluation niche is reasonably specific, but terms like 'AI application quality' and 'evaluation frameworks' are broad enough to potentially overlap with general testing/QA skills or ML model training skills. | 2 / 3 |
| Total | | 9 / 12 — Passed |
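As an illustration of the suggestions above, a sharpened description might read as follows. The frontmatter below is a hypothetical sketch, not the skill's actual metadata:

```yaml
# Hypothetical SKILL.md frontmatter; wording is illustrative only.
name: llm-evaluation
description: >
  Build test suites for LLM applications, compute accuracy/BLEU/ROUGE
  scores, set up human annotation pipelines, and run A/B comparisons
  between model versions. Use when testing LLM performance, running
  evals, prompt testing, model comparison, regression testing, or
  working with .jsonl test sets.
```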
Implementation — 14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads like a comprehensive textbook chapter rather than an actionable skill file. It is extremely verbose, explaining many concepts Claude already knows (metric definitions, evaluation dimensions), while failing to provide a clear workflow for actually conducting an evaluation. The massive amount of inline code should be split into referenced files, and the content should focus on decision-making guidance (when to use which approach) rather than cataloging every possible metric and pattern.
Suggestions
- Reduce the SKILL.md to a concise overview (~50-80 lines) with a clear workflow: 1) Choose evaluation type, 2) Set up test cases, 3) Run evaluation, 4) Analyze results, 5) Validate findings. Move detailed implementations to separate referenced files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md).
- Remove all metric definition lists (BLEU, ROUGE, accuracy, etc.) and human evaluation dimension descriptions—Claude already knows these. Focus only on project-specific configuration and decision guidance.
- Add explicit validation checkpoints: how to verify test cases are well-formed, how to sanity-check evaluation results, what to do when metrics disagree or seem wrong.
- Make the Quick Start truly self-contained and executable by either defining all referenced functions inline or removing undefined references like calculate_accuracy, check_groundedness, and your_model.
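To make the Quick Start self-contained, the undefined names the review flags (`calculate_accuracy`, `your_model`) could be defined inline. This is a minimal sketch under the assumption that test cases are input/expected pairs scored by exact match; the stub model and sample data are invented for illustration:

```python
# Sketch of a self-contained Quick Start. `calculate_accuracy` and
# `your_model` are the undefined names the review flags; minimal inline
# definitions make the example copy-paste runnable.

def calculate_accuracy(predictions, references):
    """Exact-match accuracy: fraction of predictions equal to the reference."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def your_model(prompt):
    """Stub model for the sketch; replace with a real LLM call."""
    return {"What is 2+2?": "4"}.get(prompt, "unknown")

# Illustrative test cases (not from the reviewed skill).
test_cases = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

predictions = [your_model(case["input"]) for case in test_cases]
references = [case["expected"] for case in test_cases]
print(calculate_accuracy(predictions, references))  # 0.5 with this stub
```

The same pattern applies to the other undefined helpers (`check_groundedness`, `calculate_bleu`): either define a minimal version next to first use, or drop the reference from the Quick Start.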
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~500+ lines. Explains well-known concepts like BLEU, ROUGE, accuracy, precision/recall that Claude already knows. Lists metric definitions (e.g., 'Accuracy: Percentage correct') that add no value. The human evaluation dimensions list and metric taxonomy sections are pure padding. | 1 / 3 |
| Actionability | Provides substantial code examples that appear executable, but many rely on undefined functions (e.g., calculate_accuracy, calculate_bleu, check_groundedness, your_model) making them not truly copy-paste ready. The EvaluationSuite references functions that are defined much later or not at all, creating confusion about how to actually run the code. | 2 / 3 |
| Workflow Clarity | No clear workflow or sequencing for how to actually set up and run an evaluation pipeline end-to-end. The content is organized as a reference catalog of disconnected code snippets rather than a guided process. There are no validation checkpoints, no error handling guidance, and no feedback loops for when evaluations produce unexpected results. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of content with no references to external files. All content is inline despite being far too long for a single SKILL.md. The automated metrics, LLM-as-judge patterns, human evaluation, A/B testing, regression testing, LangSmith integration, and benchmarking sections should each be separate referenced files. | 1 / 3 |
| Total | | 5 / 12 — Passed |
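The validation-checkpoint suggestion could be sketched as follows; the helper names and the 0.3 disagreement threshold are assumptions for illustration, not part of the reviewed skill:

```python
# Hypothetical validation checkpoints: verify test cases are well-formed
# before running, and flag suspicious disagreement between metrics after.

def validate_test_cases(test_cases):
    """Return a list of problems; an empty list means the suite is well-formed."""
    problems = []
    for i, case in enumerate(test_cases):
        if not case.get("input"):
            problems.append(f"case {i}: missing or empty 'input'")
        if "expected" not in case:
            problems.append(f"case {i}: missing 'expected'")
    return problems

def flag_metric_disagreement(scores, max_spread=0.3):
    """Flag when metrics on a 0-1 scale disagree by more than `max_spread`."""
    values = list(scores.values())
    return max(values) - min(values) > max_spread

print(validate_test_cases([{"input": "Q?", "expected": "A"}, {"input": ""}]))
print(flag_metric_disagreement({"accuracy": 0.9, "bleu": 0.4}))  # True
```

When disagreement is flagged, the guidance would be to inspect individual examples before trusting either metric rather than averaging them away.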
Validation — 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (667 lines); consider splitting into references/ and linking | Warning |
| Total | | 10 / 11 Passed |