
agent-evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

Quality: 7% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security, by Snyk: Passed (No known issues)

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/agent-evaluation/SKILL.md

Quality

Discovery

0%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This description reads as a persona/backstory rather than a functional skill description. It uses second-person voice ('You're a quality engineer') which violates the third-person requirement, provides no concrete actions or capabilities, and completely lacks trigger guidance for when Claude should select this skill. It would be nearly impossible for Claude to correctly choose this skill from a list of available options.

Suggestions

Replace the persona narrative with concrete actions in third person, e.g., 'Designs evaluation frameworks for LLM agents, creates test suites, measures output quality, and identifies failure modes in production deployments.'

Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks about evaluating AI agents, creating evals, testing LLM outputs, measuring prompt quality, or debugging agent failures in production.'

Remove the second-person voice ('You're a quality engineer') and philosophical framing, replacing it with specific, actionable capability descriptions that distinguish this skill from general QA or testing skills.
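Taken together, these suggestions might produce frontmatter like the following. This is a hypothetical rewrite for illustration only, not the maintainer's actual description:

```yaml
---
name: agent-evaluation
description: >-
  Designs evaluation frameworks for LLM agents: creates test suites,
  measures output quality statistically, and identifies failure modes in
  production deployments. Use when the user asks about evaluating AI
  agents, creating evals, testing LLM outputs, measuring prompt quality,
  or debugging agent failures in production.
---
```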

Dimension scores:

Specificity (1 / 3): The description uses vague, abstract language about being a 'quality engineer' and philosophical statements about LLM evaluation challenges. It lists no concrete actions like 'create test suites', 'run evaluations', or 'measure accuracy metrics'.

Completeness (1 / 3): The description fails to answer both 'what does this do' and 'when should Claude use it'. There is no 'Use when...' clause or equivalent trigger guidance, and the 'what' is entirely absent; it describes a persona and philosophy rather than capabilities.

Trigger Term Quality (1 / 3): Contains some relevant terms like 'LLM agents', 'benchmarks', and 'production', but these are buried in narrative prose rather than serving as natural trigger keywords. Missing common user terms like 'eval', 'testing', 'accuracy', 'prompt evaluation', or 'agent testing'.

Distinctiveness / Conflict Risk (1 / 3): The description is so vague about actual capabilities that it could overlap with any testing, QA, or AI-related skill. There are no distinct triggers that would help Claude differentiate this from other skills.

Total: 4 / 12 (Passed)

Implementation

14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is essentially a skeleton or outline with no actionable content. It names important concepts (statistical test evaluation, behavioral contract testing, adversarial testing) but provides zero implementation details, code examples, or concrete guidance. The sharp edges table contains placeholder comments instead of actual solutions, and the introductory text wastes tokens on narrative that Claude doesn't need.

Suggestions

Add concrete, executable code examples for each pattern (e.g., a Python function that runs an agent test N times and computes pass rate with confidence intervals for 'Statistical Test Evaluation').
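As a sketch of that first suggestion: the function below runs a nondeterministic agent test N times and reports the pass rate with a Wilson score confidence interval. It is a minimal illustration of the pattern, not code from the skill; the `run_test` callable and the choice of interval are assumptions.

```python
import math
from typing import Callable

def pass_rate_with_ci(run_test: Callable[[], bool], n: int = 20,
                      z: float = 1.96) -> tuple[float, float, float]:
    """Run a flaky/nondeterministic test n times and return
    (pass_rate, ci_low, ci_high) using the Wilson score interval."""
    passes = sum(1 for _ in range(n) if run_test())
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, max(0.0, centre - half), min(1.0, centre + half)
```

Reporting the interval rather than a single pass/fail makes regressions distinguishable from ordinary run-to-run variance.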

Replace the placeholder comments in the Sharp Edges table with actual solutions (e.g., specific techniques for bridging benchmark and production evaluation, handling flaky tests).

Define a clear multi-step workflow for evaluating an agent, with explicit validation checkpoints (e.g., 1. Define behavioral contracts → 2. Run statistical tests → 3. Analyze distributions → 4. Flag regressions).

Remove the introductory narrative paragraphs and replace with a concise quick-start section that shows how to immediately begin evaluating an agent.
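The suggested four-step workflow could be sketched roughly as follows, assuming a hypothetical `agent` callable that maps a prompt to an output string and contract predicates over that output (all names here are illustrative, not from the skill):

```python
from typing import Callable

def evaluate_agent(agent: Callable[[str], str],
                   contracts: dict[str, Callable[[str], bool]],
                   prompts: list[str], runs: int = 10,
                   threshold: float = 0.9) -> dict[str, float]:
    """1. Define behavioral contracts  2. Run statistical tests
    3. Analyze pass-rate distribution  4. Flag regressions below threshold."""
    results = {}
    for name, holds in contracts.items():
        # Repeat each prompt `runs` times to sample nondeterministic output.
        passes = sum(holds(agent(p)) for p in prompts for _ in range(runs))
        rate = passes / (len(prompts) * runs)
        results[name] = rate
        if rate < threshold:  # validation checkpoint: flag as regression
            print(f"REGRESSION: {name} pass rate {rate:.0%} < {threshold:.0%}")
    return results
```

Each contract gets its own pass rate, so a failing behavior points directly at the violated expectation rather than at an opaque aggregate score.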

Dimension scores:

Conciseness (2 / 3): The introductory paragraphs explain concepts Claude already understands (what makes LLM evaluation different from traditional testing). The pattern/anti-pattern sections are lean but the overall content has unnecessary narrative framing.

Actionability (1 / 3): There are no concrete code examples, commands, or executable guidance anywhere. Patterns like 'Statistical Test Evaluation' and 'Behavioral Contract Testing' are named but never explained with actual implementation steps, code, or specific techniques. The sharp edges table has placeholder comments (// ...) instead of actual solutions.

Workflow Clarity (1 / 3): There is no workflow, sequence of steps, or process defined. The skill names concepts (statistical testing, adversarial testing) but never describes how to actually perform them. No validation checkpoints or feedback loops exist.

Progressive Disclosure (1 / 3): The content is a flat list of headings with minimal substance under each. There are no references to detailed files, no navigation structure, and no meaningful content hierarchy. The 'Related Skills' and 'When to Use' sections add no value.

Total: 5 / 12 (Passed)

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 passed

Validation for skill structure

Criteria results:

frontmatter_unknown_keys (Warning): Unknown frontmatter key(s) found; consider removing or moving to metadata.

Total: 10 / 11 (Passed)

Repository: sickn33/antigravity-awesome-skills (Reviewed)


Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.