
llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Quality: 41%
Does it follow best practices?

Impact: 91% (1.75x)
Average score across 3 eval scenarios

Security (by Snyk): Passed, no known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md

Quality

Discovery: 67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness with explicit 'what' and 'when' clauses, which is its strongest aspect. However, the capabilities listed are somewhat high-level and could be more concrete, and the trigger terms, while relevant, miss several natural variations users might employ. It occupies a reasonable niche but could be more distinctive.

Suggestions

Add more concrete actions such as 'create test suites, compute accuracy/BLEU/ROUGE scores, build evaluation pipelines, compare model outputs' to improve specificity.

Expand trigger terms to include natural variations like 'evals', 'prompt testing', 'model comparison', 'regression testing', 'accuracy measurement', or 'eval harness'.

Dimension scores

Specificity (2 / 3): Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are still fairly high-level categories rather than concrete actions like 'create test suites', 'compute BLEU/ROUGE scores', or 'build comparison dashboards'.

Completeness (3 / 3): Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks').

Trigger Term Quality (2 / 3): Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', and 'metrics'. However, it misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'regression testing', or 'A/B testing'.

Distinctiveness / Conflict Risk (2 / 3): The focus on LLM evaluation is a reasonably specific niche, but terms like 'AI application quality' and 'testing' could overlap with general testing/QA skills or broader AI development skills. The description could be more distinctive by specifying unique artifacts or workflows.

Total: 9 / 12 (Passed)

Implementation: 14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like a comprehensive textbook chapter or reference manual than an actionable skill file. It is extremely verbose, explains many concepts Claude already knows (metric definitions, what accuracy/precision/recall are), and lacks a clear workflow for actually conducting evaluations. The code examples, while numerous, have inconsistent function signatures and undefined dependencies that prevent them from being truly executable.

Suggestions

Drastically reduce content to ~100 lines: remove metric glossaries Claude already knows, keep only the EvaluationSuite pattern and one LLM-as-Judge example, and move detailed implementations to separate reference files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md)

Add a clear sequential workflow: e.g., 1) Define test cases → 2) Choose metrics → 3) Run evaluation → 4) Validate results meet threshold → 5) Compare against baseline → 6) Report/act on findings, with explicit validation checkpoints

Fix code consistency: ensure the EvaluationSuite's Metric functions match the signatures of the implementations provided (e.g., calculate_bleu takes 'reference' and 'hypothesis' but the suite passes 'prediction' and 'reference' as keyword args); see the sketch after this list.

Remove the 'Common Pitfalls' and 'Best Practices' bullet lists which are generic advice Claude already knows, and replace with a brief decision tree for choosing evaluation approach based on use case
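As a concrete illustration of the second and third suggestions, here is a minimal sketch of an evaluation suite in which every metric shares a single (prediction, reference) signature and the run follows the numbered workflow. The EvaluationSuite and Metric names, the test-case fields, and the 0.5 threshold are illustrative assumptions rather than the skill's actual API; only the nltk imports are real library calls.

    # Illustrative sketch only: EvaluationSuite, Metric, and the test-case
    # fields are hypothetical names, not the skill's actual API.
    from dataclasses import dataclass
    from typing import Callable

    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


    def calculate_bleu(prediction: str, reference: str) -> float:
        # BLEU for one prediction/reference pair, same signature as every other metric.
        smoothing = SmoothingFunction().method1
        return sentence_bleu([reference.split()], prediction.split(),
                             smoothing_function=smoothing)


    def exact_match(prediction: str, reference: str) -> float:
        # 1.0 if the prediction equals the reference after trimming, else 0.0.
        return float(prediction.strip() == reference.strip())


    @dataclass
    class Metric:
        name: str
        fn: Callable[[str, str], float]  # every metric takes (prediction, reference)


    @dataclass
    class EvaluationSuite:
        metrics: list[Metric]

        def run(self, test_cases: list[dict], generate: Callable[[str], str]) -> dict:
            # 1) test cases are defined by the caller as {"input": ..., "reference": ...}
            # 2) metrics were chosen when the suite was constructed
            scores = {m.name: [] for m in self.metrics}
            for case in test_cases:
                prediction = generate(case["input"])  # 3) run the evaluation
                for metric in self.metrics:
                    scores[metric.name].append(metric.fn(prediction, case["reference"]))
            # 4) aggregate so results can be validated against a threshold
            return {name: sum(vals) / len(vals) for name, vals in scores.items()}


    if __name__ == "__main__":
        suite = EvaluationSuite(metrics=[Metric("bleu", calculate_bleu),
                                         Metric("exact_match", exact_match)])
        cases = [{"input": "Say hi", "reference": "hi"}]
        results = suite.run(cases, generate=lambda prompt: "hi")  # stand-in for the real model call
        # 5) compare against a baseline/threshold and 6) report or act on the outcome
        assert results["exact_match"] >= 0.5, f"regression: {results}"
        print(results)

Because every metric shares the same two positional arguments, the suite can call metric.fn(prediction, reference) uniformly; an LLM-as-Judge metric could slot in with the same signature by returning the judge's score as a float.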

Dimension scores

Conciseness (1 / 3): This is extremely verbose at ~500+ lines. It explains basic concepts Claude already knows (what BLEU, ROUGE, accuracy, precision are), lists metric definitions that are textbook knowledge, and includes extensive boilerplate code. The metric glossary sections and human evaluation dimension lists add no value for Claude.

Actionability (2 / 3): The code examples are mostly executable and use real libraries (nltk, rouge_score, bert_score, anthropic), but several rely on undefined functions (calculate_accuracy, calculate_bleu, calculate_bertscore, check_groundedness, your_model, your_chain), making them not truly copy-paste ready. The EvaluationSuite class references functions that are defined later with incompatible signatures.

Workflow Clarity (1 / 3): There is no clear workflow or sequenced process for actually conducting an evaluation. The skill presents a catalog of disconnected code snippets and concepts without guiding the user through when to use what, in what order, or how to validate results. No validation checkpoints or feedback loops are present despite evaluation being an iterative process.

Progressive Disclosure (1 / 3): This is a monolithic wall of text with everything inline. The full implementations of BLEU, ROUGE, BERTScore, LLM-as-Judge patterns, A/B testing, regression testing, benchmarking, and LangSmith integration are all crammed into one file. These should be split into separate reference files, with SKILL.md providing a concise overview and navigation (see the layout sketch after this table).

Total: 5 / 12 (Passed)
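One way to realize the progressive-disclosure and conciseness suggestions is a layout along these lines; the file names are the ones proposed in the suggestions above, and the structure is a sketch rather than a prescribed format:

    llm-evaluation/
    ├── SKILL.md            # ~100-line overview: workflow, EvaluationSuite pattern, one LLM-as-Judge example
    └── references/
        ├── METRICS.md      # detailed BLEU/ROUGE/BERTScore implementations
        ├── LLM_JUDGE.md    # LLM-as-Judge rubrics and prompts
        └── AB_TESTING.md   # A/B testing, regression testing, and benchmarking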

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

skill_md_line_count: Warning. SKILL.md is long (696 lines); consider splitting into references/ and linking.

Total: 10 / 11 (Passed)

Repository: Dicklesworthstone/pi_agent_rust (Reviewed)
