llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

1.75x

Quality

41%

Does it follow best practices?

Impact

91%

1.75x

Average score across 3 eval scenarios

Securityby

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md

Quality

Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness with an explicit 'Use when' clause and covers the domain adequately. However, it relies on broad categorical terms ('automated metrics', 'comprehensive evaluation strategies') rather than listing specific concrete actions, and its trigger terms could be expanded to cover more natural user phrasings. The description is functional but could be more distinctive and specific.

Suggestions

Replace 'comprehensive evaluation strategies' with specific concrete actions like 'design scoring rubrics, compute automated metrics (BLEU, ROUGE, perplexity), set up A/B comparisons, build evaluation datasets'.

Add more natural trigger terms users would actually say, such as 'evals', 'prompt testing', 'model accuracy', 'hallucination detection', 'regression testing', '.jsonl eval sets'.

Dimension	Reasoning	Score
Specificity	Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are broad categories rather than concrete actions like 'compute BLEU scores, run A/B tests, generate confusion matrices'. 'Implement comprehensive evaluation strategies' is somewhat vague.	2 / 3
Completeness	Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when' clause covering testing LLM performance, measuring AI application quality, or establishing evaluation frameworks).	3 / 3
Trigger Term Quality	Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', but misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'hallucination detection', 'regression testing'.	2 / 3
Distinctiveness Conflict Risk	The LLM evaluation niche is reasonably specific, but terms like 'AI application quality' and 'testing LLM performance' could overlap with general testing/QA skills or LLM development skills. The description doesn't sharply delineate its boundaries from adjacent skills like prompt engineering or general software testing.	2 / 3
	Total	9 / 12 Passed

Implementation

14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads like a comprehensive textbook chapter on LLM evaluation rather than a concise, actionable skill for Claude. It is extremely verbose, explaining well-known ML concepts and metrics that Claude already understands, while lacking a clear end-to-end workflow. The code examples, while numerous, have inconsistent signatures and undefined dependencies that prevent them from being truly executable.

Suggestions

Drastically reduce content to ~100 lines: remove metric definitions Claude already knows (BLEU, ROUGE, accuracy, etc.), keep only the novel patterns like the EvaluationSuite framework, LLM-as-Judge, and regression detection with consistent, executable code.

Add a clear end-to-end workflow section (e.g., '1. Define test cases → 2. Choose metrics → 3. Run evaluation → 4. Validate results → 5. Compare against baseline') with explicit validation checkpoints.

Split detailed implementations (metric code, A/B testing, LangSmith integration, benchmarking) into separate referenced files, keeping SKILL.md as a concise overview with navigation links.

Fix code consistency: ensure function signatures match their usage in the EvaluationSuite (e.g., `calculate_bleu` is called with `prediction`/`reference`/`context` kwargs but defined with `reference`/`hypothesis`), and replace undefined placeholders like `your_model` with minimal working examples.

Dimension	Reasoning	Score
Conciseness	Extremely verbose at ~500+ lines. Explains concepts Claude already knows (what BLEU, ROUGE, accuracy are), lists metric definitions that are basic ML knowledge, includes lengthy code blocks for standard library usage (nltk, rouge_score, bert_score), and provides extensive boilerplate classes. Much of this is textbook content that doesn't add novel, actionable value.	1 / 3
Actionability	Code examples are mostly concrete and use real libraries, but many are incomplete or not truly executable — e.g., `calculate_accuracy` and `calculate_bleu` are referenced in the EvaluationSuite but defined with different signatures later, `your_model` and `your_chain` are undefined placeholders, and the async patterns assume an event loop without showing how to run them. The LangSmith section references undefined variables like `questions` and `expected_answers`.	2 / 3
Workflow Clarity	There is no clear workflow or sequenced process for how to actually set up and run an evaluation end-to-end. The content is organized as a reference catalog of disconnected code snippets rather than a guided process. There are no validation checkpoints, no error handling guidance, and no feedback loops for when evaluations produce unexpected results.	1 / 3
Progressive Disclosure	All content is in a single monolithic file with no references to supporting files. The massive amount of inline code (metric implementations, A/B testing, regression detection, LangSmith integration, benchmarking) should be split into separate reference files. The external resource links at the end are helpful but don't compensate for the lack of internal structure.	1 / 3
	Total	5 / 12 Passed

Validation

90%

Warnings & errors only

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation — 10 / 11 Passed

Validation for skill structure

Criteria	Description	Result
skill_md_line_count	SKILL.md is long (696 lines); consider splitting into references/ and linking	Warning

	Total	10 / 11 Passed

Repository: Dicklesworthstone/pi_agent_rust
Commit: b09ec7f

Reviewed: about 8 hours ago

Table of Contents

Discovery Implementation Validation

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.