You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
Overall score: 7%
- Impact: Pending (no eval scenarios have been run)
- Issues: Passed (no known issues)
Optimize this skill with Tessl:

`npx tessl skill review --optimize ./skills/agent-evaluation/SKILL.md`

Quality
Discovery
Score: 0%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description reads as a persona/backstory rather than a functional skill description. It uses second-person voice ('You're a quality engineer') which violates the third-person requirement, provides no concrete actions or capabilities, and completely lacks trigger guidance for when Claude should select this skill. It would be nearly impossible for Claude to correctly choose this skill from a list of available options.
Suggestions
- Replace the persona narrative with concrete actions in third person, e.g., 'Designs evaluation frameworks for LLM agents, creates test suites, measures output quality, and identifies failure modes in production deployments.'
- Add an explicit 'Use when...' clause with natural trigger terms, e.g., 'Use when the user asks about evaluating AI agents, creating evals, testing LLM outputs, measuring prompt quality, or debugging agent failures in production.'
- Remove the second-person voice ('You're a quality engineer') and philosophical framing, replacing it with specific, actionable capability descriptions that distinguish this skill from general QA or testing skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description uses vague, abstract language about being a 'quality engineer' and philosophical statements about LLM evaluation challenges. It lists no concrete actions like 'create test suites', 'run evaluations', or 'measure accuracy metrics'. | 1 / 3 |
| Completeness | The description fails to answer both 'what does this do' and 'when should Claude use it'. There is no 'Use when...' clause or equivalent trigger guidance, and the 'what' is entirely absent: it describes a persona and philosophy rather than capabilities. | 1 / 3 |
| Trigger Term Quality | Contains some relevant terms like 'LLM agents', 'benchmarks', and 'production', but these are buried in narrative prose rather than serving as natural trigger keywords. Missing common user terms like 'eval', 'testing', 'accuracy', 'prompt evaluation', or 'agent testing'. | 1 / 3 |
| Distinctiveness / Conflict Risk | The description is so vague about actual capabilities that it could overlap with any testing, QA, or AI-related skill. There are no distinct triggers that would help Claude differentiate this from other skills. | 1 / 3 |
| Total | | 4 / 12 Passed |
Implementation
Score: 14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is essentially a skeleton or outline with no actionable content. It names important concepts (statistical test evaluation, behavioral contract testing, adversarial testing) but provides zero implementation details, code examples, or concrete guidance. The sharp edges table contains placeholder comments instead of actual solutions, and the introductory text wastes tokens on narrative that Claude doesn't need.
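To make the gap concrete: the 'Behavioral Contract Testing' pattern the skill names could be backed by a small sketch like the one below. The specific contracts (empty-response check, persona leakage, length budget) are illustrative assumptions, not anything the skill currently defines.

```python
import re

def contract_violations(output: str) -> list[str]:
    """Check an agent response against behavioral contracts: invariants
    that must hold even when exact wording varies from run to run.
    The contracts below are illustrative examples only."""
    violations = []
    if not output.strip():
        violations.append("empty response")
    if re.search(r"as an ai language model", output, re.IGNORECASE):
        violations.append("boilerplate persona leakage")
    if len(output) > 2000:
        violations.append("over length budget (2000 chars)")
    return violations

# A response passes the contract when no invariant is violated.
print(contract_violations("Refund issued on 2024-05-01."))  # []
```

Contracts like these sidestep the "no single correct answer" problem: they assert properties of the output rather than exact strings.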
Suggestions
- Add concrete, executable code examples for each pattern (e.g., a Python function that runs an agent test N times and computes pass rate with confidence intervals for 'Statistical Test Evaluation').
- Replace the placeholder comments in the Sharp Edges table with actual solutions (e.g., specific techniques for bridging benchmark and production evaluation, handling flaky tests).
- Define a clear multi-step workflow for evaluating an agent, with explicit validation checkpoints (e.g., 1. Define behavioral contracts → 2. Run statistical tests → 3. Analyze distributions → 4. Flag regressions).
- Remove the introductory narrative paragraphs and replace with a concise quick-start section that shows how to immediately begin evaluating an agent.
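A minimal sketch of the first suggestion, assuming the agent test is stubbed behind a callable; the function and parameter names here are illustrative, and the interval is a standard Wilson score interval rather than anything the skill prescribes.

```python
import math
from typing import Callable

def evaluate_pass_rate(run_test: Callable[[], bool], n: int = 25,
                       z: float = 1.96) -> tuple[float, float, float]:
    """Run a nondeterministic agent test n times and return
    (pass_rate, ci_low, ci_high) using a Wilson score interval
    (z = 1.96 gives roughly 95% confidence)."""
    passes = sum(1 for _ in range(n) if run_test())
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return p, max(0.0, centre - margin), min(1.0, centre + margin)

# Stubbed "agent test" that always passes, for demonstration only.
rate, low, high = evaluate_pass_rate(lambda: True, n=20)
```

Even with a 100% observed pass rate over 20 runs, the interval's lower bound sits well below 1.0, which is precisely why single-run checks mislead for nondeterministic agents.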
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The introductory paragraphs explain concepts Claude already understands (what makes LLM evaluation different from traditional testing). The pattern/anti-pattern sections are lean, but the overall content has unnecessary narrative framing. | 2 / 3 |
| Actionability | There are no concrete code examples, commands, or executable guidance anywhere. Patterns like 'Statistical Test Evaluation' and 'Behavioral Contract Testing' are named but never explained with actual implementation steps, code, or specific techniques. The sharp edges table has placeholder comments (// ...) instead of actual solutions. | 1 / 3 |
| Workflow Clarity | There is no workflow, sequence of steps, or process defined. The skill names concepts (statistical testing, adversarial testing) but never describes how to actually perform them. No validation checkpoints or feedback loops exist. | 1 / 3 |
| Progressive Disclosure | The content is a flat list of headings with minimal substance under each. There are no references to detailed files, no navigation structure, and no meaningful content hierarchy. The 'Related Skills' and 'When to Use' sections add no value. | 1 / 3 |
| Total | | 5 / 12 Passed |
Validation
Score: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
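One way to clear the frontmatter_unknown_keys warning is to move unrecognized top-level keys under a metadata block. The report does not name the offending keys, so the keys below (`version`, `author`) are hypothetical placeholders; check the skill spec for the exact schema before applying.

```yaml
# Before: hypothetical unrecognized top-level keys trigger the warning
name: agent-evaluation
description: ...
version: 1.2.0
author: jane

# After: the same keys nested under `metadata`, per the suggestion above
name: agent-evaluation
description: ...
metadata:
  version: 1.2.0
  author: jane
```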
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.