Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Quality — 41% · Does it follow best practices?

Impact — 1.79x · Average score across 3 eval scenarios: 95% · Passed · No known issues

Optimize this skill with Tessl:

`npx tessl skill review --optimize ./plugins/llm-application-dev/skills/llm-evaluation/SKILL.md`
Discovery — 67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description has good structural completeness with an explicit 'Use when' clause and covers the domain adequately. However, it relies on somewhat abstract category names rather than concrete actions, and the trigger terms could be expanded to include more natural user language variations. The description is functional but could be more distinctive and specific.
Suggestions
- Replace high-level categories with concrete actions, e.g., 'Build test suites, compute accuracy/BLEU/ROUGE scores, set up human annotation pipelines, run A/B comparisons between model versions'.
- Add more natural trigger terms users would actually say, such as 'evals', 'prompt testing', 'model comparison', 'regression testing', 'accuracy metrics', '.jsonl test sets'.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are still fairly high-level categories rather than concrete actions like 'create test suites, compute BLEU/ROUGE scores, build annotation interfaces'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks'). | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', but misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'regression testing', 'A/B testing'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The LLM evaluation niche is reasonably specific, but terms like 'AI application quality' and 'evaluation frameworks' are broad enough to potentially overlap with general testing/QA skills or ML model training skills. | 2 / 3 |
| Total | | 9 / 12 — Passed |
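As an illustration of the suggestions above, a sharpened description might read as follows. The frontmatter below is a hypothetical sketch, not the skill's actual metadata:

```yaml
# Hypothetical SKILL.md frontmatter; wording is illustrative only.
name: llm-evaluation
description: >
  Build test suites for LLM applications, compute accuracy/BLEU/ROUGE
  scores, set up human annotation pipelines, and run A/B comparisons
  between model versions. Use when testing LLM performance, running
  evals, prompt testing, model comparison, regression testing, or
  working with .jsonl test sets.
```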
Implementation — 14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads like a comprehensive textbook chapter rather than an actionable skill file. It is extremely verbose, explaining many concepts Claude already knows (metric definitions, evaluation dimensions), while failing to provide a clear workflow for actually conducting an evaluation. The massive amount of inline code should be split into referenced files, and the content should focus on decision-making guidance (when to use which approach) rather than cataloging every possible metric and pattern.
Suggestions
- Reduce the SKILL.md to a concise overview (~50-80 lines) with a clear workflow: 1) Choose evaluation type, 2) Set up test cases, 3) Run evaluation, 4) Analyze results, 5) Validate findings. Move detailed implementations to separate referenced files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md).
- Remove all metric definition lists (BLEU, ROUGE, accuracy, etc.) and human evaluation dimension descriptions—Claude already knows these. Focus only on project-specific configuration and decision guidance.
- Add explicit validation checkpoints: how to verify test cases are well-formed, how to sanity-check evaluation results, what to do when metrics disagree or seem wrong.
- Make the Quick Start truly self-contained and executable by either defining all referenced functions inline or removing undefined references like calculate_accuracy, check_groundedness, and your_model.
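To make the Quick Start self-contained, the undefined names the review flags (`calculate_accuracy`, `your_model`) could be defined inline. This is a minimal sketch under the assumption that test cases are input/expected pairs scored by exact match; the stub model and sample data are invented for illustration:

```python
# Sketch of a self-contained Quick Start. `calculate_accuracy` and
# `your_model` are the undefined names the review flags; minimal inline
# definitions make the example copy-paste runnable.

def calculate_accuracy(predictions, references):
    """Exact-match accuracy: fraction of predictions equal to the reference."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def your_model(prompt):
    """Stub model for the sketch; replace with a real LLM call."""
    return {"What is 2+2?": "4"}.get(prompt, "unknown")

# Illustrative test cases (not from the reviewed skill).
test_cases = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

predictions = [your_model(case["input"]) for case in test_cases]
references = [case["expected"] for case in test_cases]
print(calculate_accuracy(predictions, references))  # 0.5 with this stub
```

The same pattern applies to the other undefined helpers (`check_groundedness`, `calculate_bleu`): either define a minimal version next to first use, or drop the reference from the Quick Start.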
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | Extremely verbose at ~500+ lines. Explains well-known concepts like BLEU, ROUGE, accuracy, precision/recall that Claude already knows. Lists metric definitions (e.g., 'Accuracy: Percentage correct') that add no value. The human evaluation dimensions list and metric taxonomy sections are pure padding. | 1 / 3 |
| Actionability | Provides substantial code examples that appear executable, but many rely on undefined functions (e.g., calculate_accuracy, calculate_bleu, check_groundedness, your_model) making them not truly copy-paste ready. The EvaluationSuite references functions that are defined much later or not at all, creating confusion about how to actually run the code. | 2 / 3 |
| Workflow Clarity | No clear workflow or sequencing for how to actually set up and run an evaluation pipeline end-to-end. The content is organized as a reference catalog of disconnected code snippets rather than a guided process. There are no validation checkpoints, no error handling guidance, and no feedback loops for when evaluations produce unexpected results. | 1 / 3 |
| Progressive Disclosure | Monolithic wall of content with no references to external files. All content is inline despite being far too long for a single SKILL.md. The automated metrics, LLM-as-judge patterns, human evaluation, A/B testing, regression testing, LangSmith integration, and benchmarking sections should each be separate referenced files. | 1 / 3 |
| Total | | 5 / 12 — Passed |
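The validation-checkpoint suggestion could be sketched as follows; the helper names and the 0.3 disagreement threshold are assumptions for illustration, not part of the reviewed skill:

```python
# Hypothetical validation checkpoints: verify test cases are well-formed
# before running, and flag suspicious disagreement between metrics after.

def validate_test_cases(test_cases):
    """Return a list of problems; an empty list means the suite is well-formed."""
    problems = []
    for i, case in enumerate(test_cases):
        if not case.get("input"):
            problems.append(f"case {i}: missing or empty 'input'")
        if "expected" not in case:
            problems.append(f"case {i}: missing 'expected'")
    return problems

def flag_metric_disagreement(scores, max_spread=0.3):
    """Flag when metrics on a 0-1 scale disagree by more than `max_spread`."""
    values = list(scores.values())
    return max(values) - min(values) > max_spread

print(validate_test_cases([{"input": "Q?", "expected": "A"}, {"input": ""}]))
print(flag_metric_disagreement({"accuracy": 0.9, "bleu": 0.4}))  # True
```

When disagreement is flagged, the guidance would be to inspect individual examples before trusting either metric rather than averaging them away.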
Validation — 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (667 lines); consider splitting into references/ and linking | Warning |
| Total | | 10 / 11 Passed |