Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
41% (Does it follow best practices?)
Impact: 95%; 1.79x average score across 3 eval scenarios
Passed; no known issues

Optimize this skill with Tessl:
npx tessl skill review --optimize ./plugins/llm-application-dev/skills/llm-evaluation/SKILL.md

Quality
Discovery
67%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description has good structural completeness with an explicit 'Use when' clause, which is its strongest aspect. However, it relies on somewhat abstract category-level language ('comprehensive evaluation strategies', 'automated metrics') rather than listing concrete, specific actions. The trigger terms cover the basics but miss many natural variations users might employ when seeking this skill.
Suggestions
Replace abstract categories with concrete actions, e.g., 'Create test suites, compute accuracy/BLEU/ROUGE scores, run A/B comparisons between prompts, detect hallucinations, and track regression across model versions'.
Expand trigger terms in the 'Use when' clause to include natural variations like 'evals', 'prompt testing', 'model comparison', 'hallucination checking', 'scoring LLM outputs', or 'regression testing'; a combined rewrite is sketched just below.
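Taken together, a rewritten description might read as follows (illustrative wording only, not the skill's current metadata): 'Create test suites for LLM applications, compute accuracy/BLEU/ROUGE scores, run A/B comparisons between prompts, detect hallucinations, and score LLM outputs with automated metrics and human feedback. Use when running evals, prompt testing, model comparison, hallucination checking, regression testing across model versions, or establishing an evaluation framework.'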
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (LLM evaluation) and mentions some actions ('automated metrics, human feedback, benchmarking'), but these are more categories than concrete actions. It doesn't list specific tasks like 'create test suites', 'compute BLEU/ROUGE scores', or 'build evaluation pipelines'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when' clause covering testing LLM performance, measuring AI application quality, or establishing evaluation frameworks). | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'LLM', 'evaluation', 'benchmarking', 'AI application quality', but misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'hallucination detection', or 'regression testing'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on LLM evaluation is a reasonably distinct niche, but terms like 'testing', 'performance', and 'quality' are broad enough to potentially overlap with general testing/QA skills or performance optimization skills. | 2 / 3 |
| Total | | 9 / 12 Passed |
Implementation
14%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads as an exhaustive reference document rather than an actionable skill guide. It suffers from severe verbosity—cataloging metrics and patterns Claude already knows—while lacking clear workflows, validation steps, and progressive disclosure. The code examples, while numerous, often depend on undefined functions and aren't truly executable without significant additional work.
Suggestions
Drastically reduce content to a concise Quick Start workflow (e.g., 'Step 1: Define test cases → Step 2: Choose metrics → Step 3: Run evaluation → Step 4: Analyze results → Step 5: Decide') with validation checkpoints, and move detailed implementations to separate reference files.
Remove metric definition lists (BLEU, ROUGE, precision/recall, etc.) that Claude already knows, and focus only on project-specific conventions or non-obvious implementation details.
Make the Quick Start example fully executable by providing concrete implementations for all referenced functions (calculate_accuracy, check_groundedness, etc.) or removing undefined references; a minimal executable sketch follows this list.
Split content into separate files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md, REGRESSION.md) and provide a concise overview with clear navigation links in the main SKILL.md.
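To make the workflow and executability suggestions concrete, here is a minimal sketch of the kind of self-contained Quick Start they describe: a five-step flow with an actual calculate_accuracy implementation and no undefined references. The TestCase and run_evaluation names and the stub model are hypothetical stand-ins, not the skill's existing API.

```python
# Hypothetical, stdlib-only sketch of the suggested five-step Quick Start workflow.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str
    expected: str

def calculate_accuracy(expected: str, actual: str) -> float:
    """Exact-match accuracy for a single case: 1.0 on a match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def run_evaluation(model_fn: Callable[[str], str], cases: list[TestCase]) -> dict:
    """Step 3: run every case through the model and aggregate metric scores."""
    scores = [calculate_accuracy(c.expected, model_fn(c.prompt)) for c in cases]
    return {"accuracy": sum(scores) / len(scores), "n_cases": len(scores)}

if __name__ == "__main__":
    # Step 1: define test cases (toy examples; a real suite would load these from a file)
    cases = [
        TestCase(prompt="What is the capital of France?", expected="Paris"),
        TestCase(prompt="What is 2 + 2?", expected="4"),
    ]
    # Step 2: choose metrics -- here just exact-match accuracy
    # Step 3: run the evaluation; the lambda stands in for the real LLM call
    results = run_evaluation(lambda p: "Paris" if "France" in p else "4", cases)
    # Step 4: analyze results
    print(results)
    # Step 5: decide against an explicit pass/fail threshold
    assert results["accuracy"] >= 0.9, "Regression: accuracy below threshold"
```

Keeping the runnable path stdlib-only like this lets the Quick Start execute as pasted, while BLEU/ROUGE, LLM-as-judge, and groundedness checks can live in the proposed reference files and be swapped in behind the same run_evaluation interface.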
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~500+ lines. Extensively lists metric definitions Claude already knows (BLEU, ROUGE, precision/recall, etc.), explains basic concepts like Cohen's kappa interpretation thresholds, and includes lengthy code blocks for straightforward library wrappers. Much of this is reference material that doesn't earn its token cost. | 1 / 3 |
| Actionability | Provides substantial code examples that appear mostly executable, but many rely on undefined functions (e.g., calculate_accuracy, calculate_bertscore in the EvaluationSuite usage, check_groundedness, your_model, your_chain) making them not truly copy-paste ready. The Quick Start example cannot run without significant additional implementation. | 2 / 3 |
| Workflow Clarity | Despite being a complex multi-step domain (evaluation pipelines involve data preparation, model execution, metric calculation, analysis, and decision-making), there is no clear sequenced workflow. The content is organized as a reference catalog of patterns rather than a guided process. No validation checkpoints or feedback loops for catching evaluation errors. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of content with no references to external files despite the massive size clearly warranting decomposition. All metric implementations, judge patterns, A/B testing, regression testing, LangSmith integration, and benchmarking are inlined in a single file with no navigation aids or content splitting. | 1 / 3 |
| Total | | 5 / 12 Passed |
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (667 lines); consider splitting into references/ and linking | Warning |
| Total | 10 / 11 Passed | |