llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

59

1.75x
Quality: 41%
Does it follow best practices?

Impact: 91% (1.75x)
Average score across 3 eval scenarios

Security by Snyk: Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./tests/ext_conformance/artifacts/agents-wshobson/llm-application-dev/skills/llm-evaluation/SKILL.md
Quality

Discovery

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness with an explicit 'Use when' clause, which is its strongest aspect. However, the capabilities listed are somewhat high-level and could benefit from more concrete, specific actions. The trigger terms cover the domain but miss several natural variations users might employ when seeking evaluation help.

Suggestions

Add more concrete actions such as 'create test suites, compute accuracy/BLEU/ROUGE scores, build evaluation pipelines, compare model outputs' to increase specificity.

Include additional natural trigger terms users might say, such as 'evals', 'prompt testing', 'model comparison', 'regression testing', 'scoring', or 'A/B testing LLM outputs'.
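
The trigger-term gap can be checked mechanically. The sketch below is a plain case-insensitive substring check, not the method the reviewer uses to score discovery. The description string is copied from the skill summary in this review, and the term list comes from the suggestion above; `missing_terms` is a hypothetical helper name.

```python
# Description text copied from the skill summary in this review.
DESCRIPTION = (
    "Implement comprehensive evaluation strategies for LLM applications using "
    "automated metrics, human feedback, and benchmarking. Use when testing LLM "
    "performance, measuring AI application quality, or establishing evaluation "
    "frameworks."
)

# Trigger terms proposed in the suggestion above.
SUGGESTED_TERMS = [
    "evals",
    "prompt testing",
    "model comparison",
    "regression testing",
    "scoring",
    "A/B testing LLM outputs",
]


def missing_terms(description: str, terms: list[str]) -> list[str]:
    """Return the terms that do not appear verbatim (case-insensitive)."""
    text = description.lower()
    return [term for term in terms if term.lower() not in text]


missing = missing_terms(DESCRIPTION, SUGGESTED_TERMS)
print(missing)  # every suggested term is absent from the current description
```

A substring check is deliberately strict: "evaluation" does not count as a match for "evals", which mirrors how a user's literal wording may fail to match the description.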

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (LLM evaluation) and mentions some approaches ('automated metrics, human feedback, benchmarking'), but these are still fairly high-level categories rather than concrete actions like 'create test suites', 'compute BLEU/ROUGE scores', or 'build comparison dashboards'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (implement evaluation strategies using automated metrics, human feedback, and benchmarking) and 'when' (explicit 'Use when' clause covering testing LLM performance, measuring AI application quality, or establishing evaluation frameworks). | 3 / 3 |
| Trigger Term Quality | Includes some relevant terms like 'LLM performance', 'AI application quality', 'evaluation frameworks', 'benchmarking', and 'automated metrics'. However, it misses common natural variations users might say such as 'evals', 'prompt testing', 'model comparison', 'accuracy', 'regression testing', or 'scoring'. | 2 / 3 |
| Distinctiveness Conflict Risk | The focus on LLM evaluation is a reasonably specific niche, but terms like 'testing', 'quality', and 'frameworks' are broad enough to potentially overlap with general testing/QA skills or broader AI development skills. | 2 / 3 |
| Total | | 9 / 12 |

Passed

Implementation

14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like a comprehensive textbook chapter or reference manual than an actionable skill file. It is extremely verbose, explains many concepts Claude already knows, and lacks a clear workflow for actually conducting evaluations. The code examples are partially executable but the core framework relies on undefined functions, and the entire content should be restructured with a concise overview pointing to separate detailed files.

Suggestions

Reduce the SKILL.md to a concise overview (~100 lines) with a clear evaluation workflow (define test cases → select metrics → run evaluation → analyze results → iterate), and move detailed implementations into separate bundle files (e.g., METRICS.md, LLM_JUDGE.md, AB_TESTING.md).

Remove all metric definition lists (BLEU, ROUGE, Accuracy, etc.) that simply restate what Claude already knows, and instead focus on when to use each metric and project-specific configuration.

Make the Quick Start actually executable by either defining all referenced functions inline or removing the abstraction layer and showing a single concrete end-to-end example.

Add explicit validation checkpoints to the workflow, such as verifying test case format, checking for sufficient sample sizes before drawing conclusions, and validating metric results before comparing against baselines.
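
The suggested workflow (define test cases, select metrics, run the evaluation, analyze results) with the three validation checkpoints could be sketched as follows. This is a minimal illustration under stated assumptions, not code from the reviewed skill: `MIN_SAMPLES`, `exact_match`, `run_evaluation`, and `toy_model` are hypothetical names, and exact match stands in for whatever metric a project actually selects.

```python
from statistics import mean

MIN_SAMPLES = 3  # hypothetical threshold; pick one suited to your data


def validate_test_cases(cases):
    """Checkpoint 1: every case needs non-empty 'input' and 'expected' strings."""
    for i, case in enumerate(cases):
        for key in ("input", "expected"):
            value = case.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"test case {i} has no usable {key!r}")


def exact_match(prediction, expected):
    """Selected metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return float(prediction.strip().lower() == expected.strip().lower())


def run_evaluation(model, cases, baseline=None):
    validate_test_cases(cases)
    # Checkpoint 2: refuse to draw conclusions from too few samples.
    if len(cases) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} cases, got {len(cases)}")
    scores = [exact_match(model(c["input"]), c["expected"]) for c in cases]
    # Checkpoint 3: sanity-check metric values before comparing to a baseline.
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("metric produced a score outside [0, 1]")
    result = {"accuracy": mean(scores), "n": len(scores)}
    if baseline is not None:
        result["delta_vs_baseline"] = result["accuracy"] - baseline
    return result


def toy_model(prompt):
    """Stand-in for a real LLM call (assumption for illustration only)."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


cases = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "paris"},
    {"input": "capital of Peru", "expected": "Lima"},
]
report = run_evaluation(toy_model, cases, baseline=0.5)
print(report)
```

The iterate step is left to the caller: inspect the failing cases, adjust the prompt or model, and rerun against the same validated test set.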

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | This is extremely verbose at 500+ lines. It explains concepts Claude already knows (what BLEU, ROUGE, accuracy, precision are), lists metric definitions that are basic ML knowledge, and includes extensive boilerplate code. The metric definition lists (e.g., 'Accuracy: Percentage correct', 'Precision/Recall/F1: Class-specific performance') add no value for Claude. The content could be cut by 60-70%. | 1 / 3 |
| Actionability | The code examples are mostly concrete and executable, but several rely on undefined functions (e.g., `calculate_accuracy`, `calculate_bleu`, `calculate_bertscore`, `check_groundedness`, `your_model`), so the Quick Start is not actually runnable. The individual metric implementations (BLEU, ROUGE, BERTScore) are executable, but the framework code that ties them together has gaps. | 2 / 3 |
| Workflow Clarity | There is no clear workflow or sequenced process for actually conducting an evaluation. The skill presents a catalog of tools and code snippets but never guides the user through a coherent evaluation process (e.g., 'first define your test cases, then select metrics, then run evaluation, then analyze results, then validate'). No validation checkpoints or feedback loops are present, even though evaluation is an iterative, error-prone process. | 1 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no bundle files to offload detailed implementations. The metric implementations, LLM-as-judge patterns, human evaluation frameworks, A/B testing, regression testing, LangSmith integration, and benchmarking are all inlined in a single massive file. These should be split into separate reference files with the SKILL.md serving as a concise overview with navigation. | 1 / 3 |
| Total | | 5 / 12 |

Passed
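
To illustrate the Actionability finding, the sketch below shows what defining the referenced helpers inline might look like. The names `calculate_accuracy` and `your_model` come from the review, but their signatures in the skill are not shown here, so the signatures, the `evaluate` wrapper, and the upper-casing stub model are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def calculate_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def your_model(prompt: str) -> str:
    """Stub standing in for a real LLM call: returns the prompt upper-cased."""
    return prompt.upper()


def evaluate(
    model: Callable[[str], str],
    inputs: List[str],
    references: List[str],
    metrics: Dict[str, Callable[[List[str], List[str]], float]],
) -> Dict[str, float]:
    """Run the model once per input, then apply every named metric."""
    predictions = [model(x) for x in inputs]
    return {name: fn(predictions, references) for name, fn in metrics.items()}


results = evaluate(
    your_model,
    inputs=["ok", "no"],
    references=["OK", "yes"],
    metrics={"accuracy": calculate_accuracy},
)
print(results)  # {'accuracy': 0.5}
```

With every name defined in the same block, the example runs as pasted; swapping the stub for a real model call only requires replacing `your_model`.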

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| skill_md_line_count | SKILL.md is long (696 lines); consider splitting into references/ and linking | Warning |
| Total | | 10 / 11 |

Passed
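
A line-count check like `skill_md_line_count` is easy to run locally before submitting a skill. The sketch below is an assumption-laden illustration, not the validator's actual implementation: the 500-line limit is hypothetical (the real threshold is not stated in this review), and `check_line_count` is an invented helper name.

```python
MAX_LINES = 500  # hypothetical limit; the validator's real threshold is not stated


def check_line_count(text: str, max_lines: int = MAX_LINES) -> str:
    """Return a Warning or Passed message based on the number of lines in text."""
    n = len(text.splitlines())
    if n > max_lines:
        return (
            f"Warning: SKILL.md is long ({n} lines); "
            "consider splitting into references/ and linking"
        )
    return f"Passed: SKILL.md has {n} lines"


# Synthetic 696-line document matching the count reported above.
sample = "\n".join(f"line {i}" for i in range(696))
print(check_line_count(sample))
```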

Repository: Dicklesworthstone/pi_agent_rust (Reviewed)
