
llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

68 · 1.75x

Quality: 54% (Does it follow best practices?)

Impact: 91% (1.75x, average score across 3 eval scenarios)

Security (by Snyk): Passed, no known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md

Quality

Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness with an explicit 'Use when' clause and covers the domain adequately. However, it stays at a somewhat abstract level—listing categories of evaluation approaches rather than concrete actions—and could benefit from more specific trigger terms that users naturally use when seeking LLM evaluation help.

Suggestions

Add more specific concrete actions, e.g., 'create test datasets, compute BLEU/ROUGE/accuracy scores, build evaluation pipelines, compare model outputs side-by-side'.

Expand trigger terms with natural user language variations like 'evals', 'prompt testing', 'model comparison', 'regression testing', 'scoring LLM outputs'.
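To make the suggested concrete actions tangible, a minimal scorer for "compute accuracy scores" plus a token-overlap proxy for ROUGE-1 might look like the sketch below. The function names and normalization are illustrative assumptions, not code from the reviewed skill:

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a cheap proxy for ROUGE-1."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts shared tokens, respecting duplicates
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Score a batch of (prediction, reference) pairs
pairs = [("Paris is the capital", "Paris is the capital of France")]
scores = [token_f1(p, r) for p, r in pairs]
```

A description that names operations at this level of concreteness ("compute exact-match accuracy and token F1 over a test set") gives an agent much stronger retrieval cues than "automated metrics".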

Specificity (2 / 3): Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are still fairly high-level categories rather than concrete actions like 'create test suites, compute BLEU/ROUGE scores, build annotation interfaces'.

Completeness (3 / 3): Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks').

Trigger Term Quality (2 / 3): Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', but misses common natural variations users might say, such as 'evals', 'prompt testing', 'accuracy', 'regression testing', 'model comparison', 'scoring'.

Distinctiveness / Conflict Risk (2 / 3): The focus on LLM evaluation is a reasonably specific niche, but terms like 'testing', 'quality', and 'frameworks' are broad enough to potentially overlap with general testing or software quality assurance skills.

Total: 9 / 12

Passed

Implementation

42%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The skill's primary strength is its highly actionable, executable code examples across a comprehensive range of LLM evaluation techniques. However, it is severely over-long and monolithic, reading more like a textbook chapter than a concise skill file. It explains many concepts Claude already knows (metric definitions, what accuracy means), lacks a clear sequential workflow, and needs to be split into separate reference files, with the main SKILL.md serving as a concise overview with navigation.

Suggestions

Split implementation details (automated metrics, LLM-as-Judge patterns, A/B testing, benchmarking) into separate referenced files and keep SKILL.md as a concise overview with navigation links.

Remove metric glossary lists (BLEU, ROUGE, Accuracy, Precision, etc.) that explain concepts Claude already knows; just reference them in code examples.

Add a clear end-to-end workflow section showing the recommended sequence (define test cases → establish baseline → run evaluation → check regressions → analyze results), with explicit validation checkpoints.

Remove the 'Human Evaluation Frameworks' annotation-task boilerplate, which is generic form-building code that doesn't add LLM-evaluation-specific value.
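The recommended sequence (define test cases, establish baseline, run evaluation, check regressions, analyze results) could be sketched roughly as follows. The generators, scorer, and regression tolerance here are placeholder assumptions for illustration, not the reviewed skill's actual implementation:

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class TestCase:
    prompt: str
    reference: str

def run_eval(cases: list, generate: Callable[[str], str],
             score: Callable[[str, str], float]) -> float:
    """Generate an answer for every case and return the mean score."""
    return mean(score(generate(c.prompt), c.reference) for c in cases)

def check_regression(baseline: float, current: float,
                     tolerance: float = 0.02) -> bool:
    """Flag a regression if the score drops by more than `tolerance`."""
    return current < baseline - tolerance

# 1. Define test cases
cases = [TestCase("2+2?", "4"), TestCase("Capital of France?", "Paris")]
# 2-4. Establish a baseline, evaluate a candidate, check for regressions
# (a trivial containment scorer and stub generators stand in for real models)
score = lambda pred, ref: float(ref.lower() in pred.lower())
baseline = run_eval(cases, lambda p: "4" if "2+2" in p else "Paris", score)
candidate = run_eval(cases, lambda p: "Paris", score)
regressed = check_regression(baseline, candidate)
# 5. Analyze: inspect per-case scores for any case that regressed
```

Making each numbered step an explicit checkpoint in SKILL.md would give agents the coherent process flow the review finds missing.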

Conciseness (1 / 3): This is extremely verbose at ~500+ lines. It explains concepts Claude already knows (what BLEU, ROUGE, accuracy, and precision are), lists metric definitions that are basic ML knowledge, and includes extensive boilerplate code. The metric glossary sections and human evaluation dimension lists add little value for Claude.

Actionability (3 / 3): The code examples are concrete, executable, and copy-paste ready. Functions have proper imports, type hints, and return values. The EvaluationSuite, ABTest, RegressionDetector, and LLM-as-Judge patterns are all fully implemented with real library calls.

Workflow Clarity (2 / 3): While individual components are well-defined, there is no clear end-to-end workflow showing how to sequence evaluation steps (e.g., first establish a baseline, then run the evaluation, then check for regressions, then analyze results). The skill presents a catalog of tools without explicit validation checkpoints or a coherent process flow.

Progressive Disclosure (1 / 3): This is a monolithic wall of code and text with no references to external files. All implementation details for BLEU, ROUGE, BERTScore, LLM-as-Judge, A/B testing, regression testing, benchmarking, and LangSmith integration are inlined. This content should be split across multiple files, with SKILL.md serving as an overview.

Total: 7 / 12

Passed
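For orientation, the LLM-as-Judge pattern the review credits can be outlined in a model-agnostic way. In this sketch the `judge` callable stands in for any real LLM API; the rubric wording and score clamping are assumptions, not the reviewed skill's code:

```python
from typing import Callable

RUBRIC = """Rate the answer from 1-5 for factual accuracy and relevance.
Reply with a single integer."""

def judge_output(question: str, answer: str,
                 judge: Callable[[str], str]) -> int:
    """Ask a judge model to grade an answer; `judge` wraps any LLM call."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nScore:"
    reply = judge(prompt)
    # Take the first digit in the reply and clamp it to the 1-5 scale
    digits = [c for c in reply if c.isdigit()]
    if not digits:
        raise ValueError(f"judge returned no score: {reply!r}")
    return max(1, min(5, int(digits[0])))

# A stub judge for testing; in practice this would call a real model
stub = lambda prompt: "4"
score = judge_output("Capital of France?", "Paris", stub)
```

Injecting the judge as a callable keeps the pattern testable offline and independent of any one provider's SDK.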

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 passed

Validation for skill structure

skill_md_line_count (Warning): SKILL.md is long (696 lines); consider splitting into references/ and linking.

Total: 10 / 11

Passed

Repository: Dicklesworthstone/pi_agent_rust (Reviewed)

