
llm-evaluation

Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.

Overall: 86 (1.75x)

Quality: Does it follow best practices?

Impact: 91% (1.75x). Average score across 3 eval scenarios.

Security by Snyk: Passed. No known issues.


Quality

Content

65%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The body is highly actionable with abundant executable code, but it is a large monolithic file: it lacks progressive disclosure into separate reference files, and it has no explicit validation or feedback checkpoints for batch evaluation workflows. Trimming its length and splitting detail into referenced files would raise the weaker dimensions.

Suggestions

Move the per-metric implementations (BLEU/ROUGE/BERTScore, LLM-as-judge variants, LangSmith, benchmarking) into reference files under references/ and keep SKILL.md as a concise overview with signaled one-level-deep links to improve progressive_disclosure.

Add an explicit evaluation workflow with validation checkpoints (e.g., run metrics -> inspect failures -> re-run on corrected cases) and a feedback loop for batch evaluation to raise workflow_clarity.

Trim the metric-definition glossary and consolidate near-duplicate LLM-as-judge code blocks to reduce token volume and lift conciseness toward level 3.
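The second suggestion (run metrics -> inspect failures -> re-run on corrected cases) could look roughly like the sketch below. This is a minimal illustration and is not taken from the reviewed SKILL.md: `token_overlap`, `evaluate_batch`, `run_with_checkpoints`, the `fix_case` callback, the 0.5 threshold, and the three-round limit are all hypothetical names and values chosen for the example.

```python
def token_overlap(output, reference):
    """Toy metric (assumption): fraction of reference tokens found in the output."""
    out = set(output.lower().split())
    ref = set(reference.lower().split())
    return len(out & ref) / len(ref) if ref else 0.0


def evaluate_batch(cases, metric, threshold):
    """Checkpoint 1: score every case and split the batch into passes and failures."""
    passed, failed = [], []
    for case in cases:
        score = metric(case["output"], case["reference"])
        scored = {**case, "score": score}
        (passed if score >= threshold else failed).append(scored)
    return passed, failed


def run_with_checkpoints(cases, metric, fix_case, threshold=0.5, max_rounds=3):
    """Run metrics, inspect failures, then re-run only the corrected cases."""
    all_passed, failed = [], []
    pending = list(cases)
    for round_no in range(1, max_rounds + 1):
        passed, failed = evaluate_batch(pending, metric, threshold)
        all_passed.extend(passed)
        # Checkpoint 2: report failures so they can be inspected before retrying.
        print(f"round {round_no}: {len(passed)} passed, {len(failed)} failed")
        if not failed:
            break
        # Feedback loop: fix_case stands in for whatever correction step applies.
        pending = [fix_case(case) for case in failed]
    return all_passed, failed
```

Anything still in the second return value after `max_rounds` is left for manual review, so the loop cannot retry forever.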

Dimension | Reasoning | Score

Conciseness | The ~690-line body is mostly executable code that earns its place, but the metric glossary one-liners (e.g. "BLEU: N-gram overlap") and the sheer volume could be tightened. It is not level 3 because not every token is lean, and not level 1 because it avoids long explanations of concepts Claude already knows. | 2 / 3

Actionability | Provides extensive copy-paste-ready, executable implementations (BLEU, ROUGE, BERTScore, LLM-as-judge, A/B testing, regression, LangSmith), matching the fully-executable anchor. | 3 / 3

Workflow Clarity | Content is a catalog of techniques with organized sections but no explicit validation checkpoints or feedback loops for batch evaluation operations. The rubric caps such workflows at 2, and it is above 1 because a clear Quick Start sequence is present. | 2 / 3

Progressive Disclosure | No bundle files exist and all content sits inline in one ~690-line SKILL.md that could be split into reference files. It is above 1 because sections are well organized, and below 3 because material that should be separate is not broken out with signaled one-level-deep references. | 2 / 3

Total | | 9 / 12 (Passed)

Description

100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is strong across all dimensions: it names concrete capabilities, includes natural trigger terms, supplies an explicit Use-when clause, and occupies a clear niche. No first/second-person voice or vague fluff is present.

Dimension | Reasoning | Score

Specificity | Lists multiple specific concrete actions ("automated metrics, human feedback, and benchmarking"), matching the anchor for enumerating several concrete capabilities rather than vague language. | 3 / 3

Completeness | Explicitly answers both what ("Implement comprehensive evaluation strategies...") and when, via a clear "Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks" clause. | 3 / 3

Trigger Term Quality | "testing LLM performance, measuring AI application quality" are natural phrases a user would say. It is not level 2 because common variations are well covered rather than partially missing. | 3 / 3

Distinctiveness Conflict Risk | The LLM-evaluation niche with distinct trigger phrasing is unlikely to fire for unrelated skills, matching the clear-niche anchor. | 3 / 3

Total | | 12 / 12 (Passed)

Validation

93%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 15 / 16 Passed

Validation for skill structure

Criteria | Description | Result

skill_md_line_count | SKILL.md is long (696 lines); consider splitting into references/ and linking | Warning

Total | | 15 / 16 (Passed)
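A check like skill_md_line_count can be approximated with a few lines of Python. The sketch below is an assumption about how such a check might work, not the validator's actual implementation: the function name `check_line_count` and the 500-line limit are hypothetical, since the report only says that 696 lines counts as long.

```python
def check_line_count(text, max_lines=500):
    """Return (result, message) for a SKILL.md body.

    max_lines=500 is an illustrative threshold, not the validator's real limit.
    """
    line_count = len(text.splitlines())
    if line_count > max_lines:
        message = (
            f"SKILL.md is long ({line_count} lines); "
            "consider splitting into references/ and linking"
        )
        return "Warning", message
    return "Passed", f"SKILL.md has {line_count} lines"
```

Returning a warning instead of a failure mirrors the report above, where the overall validation still shows Passed despite this finding.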

Repository: Dicklesworthstone/pi_agent_rust (Reviewed)
