Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
59
41%
Does it follow best practices?
Impact
91%
1.75x
Average score across 3 eval scenarios
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md
Quality
Discovery
67%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description has good structural completeness with an explicit 'Use when' clause, which is its strongest aspect. However, the capabilities listed are somewhat high-level and could benefit from more concrete, specific actions. The trigger terms cover the domain but miss several natural variations users might employ when seeking evaluation help.
Suggestions
Add more concrete actions such as 'create test suites, compute accuracy/BLEU/ROUGE scores, build evaluation pipelines, compare model outputs' to increase specificity.
Include additional natural trigger terms users might say, such as 'evals', 'prompt testing', 'model comparison', 'regression testing', 'scoring', or 'A/B testing LLM outputs'.
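To make the "concrete actions" suggestion tangible, here is a minimal sketch of two scoring helpers a skill description could point to. The function names `exact_match_accuracy` and `rouge1_f1` are hypothetical, and the second is a simplified unigram-overlap F1 in the spirit of ROUGE-1 (whitespace tokenization, no stemming), not the official ROUGE implementation.

```python
from collections import Counter


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


def rouge1_f1(prediction, reference):
    """Simplified unigram-overlap F1 (ROUGE-1 style, whitespace tokens only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For production scoring, an established metrics library is usually a better choice than hand-rolled helpers; the sketch only shows the level of specificity the suggestion asks for.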
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are still fairly high-level categories rather than concrete actions like 'create test suites', 'compute BLEU/ROUGE scores', or 'build comparison dashboards'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when' clause covering testing LLM performance, measuring AI application quality, or establishing evaluation frameworks). | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', and 'automated metrics'. However, it misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'regression testing', or 'scoring'. | 2 / 3 |
| Distinctiveness Conflict Risk | The focus on LLM evaluation is a reasonably specific niche, but terms like 'testing', 'quality', and 'frameworks' are broad enough to potentially overlap with general testing/QA skills or broader AI development skills. | 2 / 3 |
| Total | 9 / 12 Passed | |
Implementation
14%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads more like a comprehensive textbook chapter or reference manual than an actionable skill file. It is extremely verbose, explains many concepts Claude already knows, and lacks a clear workflow for actually conducting evaluations. The code examples are partially executable, but the core framework relies on undefined functions. The content should be restructured as a concise overview that points to separate detailed files.
Suggestions
Reduce the SKILL.md to a concise overview (~100 lines) with a clear evaluation workflow (define test cases → select metrics → run evaluation → analyze results → iterate), and move detailed implementations into separate bundle files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md).
Remove all metric definition lists (BLEU, ROUGE, Accuracy, etc.) that simply restate what Claude already knows, and instead focus on when to use each metric and project-specific configuration.
Make the Quick Start actually executable by either defining all referenced functions inline or removing the abstraction layer and showing a single concrete end-to-end example.
Add explicit validation checkpoints to the workflow, such as verifying test case format, checking for sufficient sample sizes before drawing conclusions, and validating metric results before comparing against baselines.
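The workflow and validation checkpoints suggested above can be sketched as a single end-to-end example. Everything here is illustrative: `EvalCase`, `toy_model`, `MIN_SAMPLES`, and `run_evaluation` are hypothetical names, the stub model stands in for a real LLM call, and the sample-size threshold is far smaller than a real evaluation would need.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected: str


# Hypothetical threshold; real evaluations need far more samples.
MIN_SAMPLES = 3


def validate_cases(cases):
    """Checkpoint 1: verify test case format and sample size before running."""
    for i, case in enumerate(cases):
        if not case.prompt.strip() or not case.expected.strip():
            raise ValueError(f"test case {i} has an empty prompt or expected answer")
    if len(cases) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} cases, got {len(cases)}")


def toy_model(prompt):
    """Stub standing in for a real model call (assumption for this sketch)."""
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "9"}
    return answers.get(prompt, "unknown")


def run_evaluation(model, cases, baseline_accuracy):
    """Define cases -> run -> score -> validate result -> compare to baseline."""
    validate_cases(cases)
    outputs = [model(case.prompt) for case in cases]
    correct = sum(
        out.strip().lower() == case.expected.strip().lower()
        for out, case in zip(outputs, cases)
    )
    accuracy = correct / len(cases)
    # Checkpoint 2: sanity-check the metric before comparing against a baseline.
    if not 0.0 <= accuracy <= 1.0:
        raise RuntimeError(f"accuracy out of range: {accuracy}")
    return {
        "accuracy": accuracy,
        "n": len(cases),
        "regressed": accuracy < baseline_accuracy,
    }


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "8"),  # deliberately mismatched to show a failure
]
report = run_evaluation(toy_model, cases, baseline_accuracy=0.9)
print(report)
```

The "iterate" step is left to the caller: a `regressed` flag of `True` is the signal to inspect failing cases and adjust prompts or test data before rerunning.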
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | This is extremely verbose at ~500+ lines. It explains concepts Claude already knows (what BLEU, ROUGE, accuracy, precision are), lists metric definitions that are basic ML knowledge, and includes extensive boilerplate code. The metric definition lists (e.g., 'Accuracy: Percentage correct', 'Precision/Recall/F1: Class-specific performance') add no value for Claude. Much of this could be cut by 60-70%. | 1 / 3 |
| Actionability | The code examples are mostly concrete and executable, but several rely on undefined functions (e.g., `calculate_accuracy`, `calculate_bleu`, `calculate_bertscore`, `check_groundedness`, `your_model`) making the Quick Start not actually runnable. The individual metric implementations (BLEU, ROUGE, BERTScore) are executable, but the framework code that ties them together has gaps. | 2 / 3 |
| Workflow Clarity | There is no clear workflow or sequenced process for actually conducting an evaluation. The skill presents a catalog of tools and code snippets but never guides the user through a coherent evaluation process (e.g., 'first define your test cases, then select metrics, then run evaluation, then analyze results, then validate'). No validation checkpoints or feedback loops are present despite evaluation being an iterative process prone to errors. | 1 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no bundle files to offload detailed implementations. The metric implementations, LLM-as-judge patterns, human evaluation frameworks, A/B testing, regression testing, LangSmith integration, and benchmarking are all inlined in a single massive file. These should be split into separate reference files with the SKILL.md serving as a concise overview with navigation. | 1 / 3 |
| Total | 5 / 12 Passed | |
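One way to avoid the undefined-function gap noted under Actionability is to have the framework look metrics up in a registry and fail fast on unknown names. The sketch below is an assumption about how that could be done, not the skill's actual code; `METRICS`, `register`, and `evaluate` are hypothetical names.

```python
from typing import Callable, Dict, List

Metric = Callable[[List[str], List[str]], float]

# Hypothetical registry: every metric the framework references must live here.
METRICS: Dict[str, Metric] = {}


def register(name: str):
    """Decorator that records a metric function under the given name."""
    def decorator(fn: Metric) -> Metric:
        METRICS[name] = fn
        return fn
    return decorator


@register("accuracy")
def accuracy(preds: List[str], refs: List[str]) -> float:
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


@register("avg_length_ratio")
def avg_length_ratio(preds: List[str], refs: List[str]) -> float:
    # Mean ratio of prediction length to reference length, in whitespace tokens.
    ratios = [len(p.split()) / max(len(r.split()), 1) for p, r in zip(preds, refs)]
    return sum(ratios) / len(ratios)


def evaluate(metric_names: List[str], preds: List[str], refs: List[str]) -> Dict[str, float]:
    """Run the requested metrics, rejecting undefined names up front."""
    if not refs or len(preds) != len(refs):
        raise ValueError("preds and refs must be non-empty and the same length")
    missing = [name for name in metric_names if name not in METRICS]
    if missing:
        raise KeyError(f"undefined metrics: {missing}")
    return {name: METRICS[name](preds, refs) for name in metric_names}
```

With this shape, a Quick Start that names a metric which was never implemented raises immediately with the offending name, rather than failing with a `NameError` partway through a run.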
Validation
90%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (696 lines); consider splitting into references/ and linking | Warning |
| Total | 10 / 11 Passed | |
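A length check like `skill_md_line_count` can be approximated locally before submitting a skill. This is a rough sketch only: the `check_line_count` helper and its `max_lines=500` default are assumptions for illustration, since the validator's actual threshold is not stated in this report.

```python
def check_line_count(text: str, max_lines: int = 500) -> dict:
    """Return the line count and a Passed/Warning result.

    max_lines is a hypothetical threshold, not the validator's real limit.
    """
    line_count = len(text.splitlines())
    result = "Warning" if line_count > max_lines else "Passed"
    return {"line_count": line_count, "result": result}
```

In practice the text would be read from the skill file, for example with `pathlib.Path("SKILL.md").read_text()`.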