Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on re...
Install with Tessl CLI
npx tessl i github:sickn33/antigravity-awesome-skills --skill agent-evaluation
Quality
27%
Does it follow best practices?
Impact
99%
0.99x average score across 3 eval scenarios
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/agent-evaluation/SKILL.md

Quality
Discovery
32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description identifies a clear domain (LLM agent testing) and lists relevant capability areas, but suffers from being truncated and lacking explicit trigger guidance. The terms used are somewhat technical category labels rather than concrete actions, and there's no 'Use when...' clause to help Claude know when to select this skill.
Suggestions
Add an explicit 'Use when...' clause with trigger phrases like 'evaluate agent performance', 'test LLM behavior', 'benchmark AI agents', or 'measure agent reliability'
Replace category labels with concrete actions: instead of 'behavioral testing', say 'design behavioral test suites, measure pass rates, identify failure modes'
Include common user terminology variations: 'evals', 'agent evaluation', 'LLM testing', 'AI agent benchmarks', 'agent accuracy'
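The suggestions above could be combined into the skill's frontmatter along these lines; this is a hypothetical sketch of a revised description, not the actual SKILL.md:

```yaml
---
name: agent-evaluation
description: >
  Test and benchmark LLM agents: design behavioral test suites, run benchmark
  suites over multiple iterations, measure pass rates and reliability metrics,
  and monitor agents in production. Use when the user asks to evaluate agent
  performance, run evals, test LLM behavior, benchmark AI agents, or measure
  agent accuracy or reliability.
---
```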
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (LLM agent testing) and lists several action areas (behavioral testing, capability assessment, reliability metrics, production monitoring), but these are category labels rather than concrete actions like 'run benchmark suites' or 'generate test cases'. | 2 / 3 |
| Completeness | Describes what the skill covers (testing/benchmarking LLM agents) but completely lacks a 'Use when...' clause or any explicit trigger guidance. The description also appears truncated mid-sentence. | 1 / 3 |
| Trigger Term Quality | Includes relevant terms like 'LLM agents', 'benchmarking', 'testing', 'reliability metrics', but misses common variations users might say like 'eval', 'evals', 'agent evaluation', 'test harness', or 'accuracy testing'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on 'LLM agents' specifically provides some distinction from general testing skills, but 'testing' and 'benchmarking' are broad terms that could overlap with other QA or performance testing skills. | 2 / 3 |
| Total | 7 / 12 Passed | |
Implementation
22%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is essentially a skeleton or outline rather than actionable guidance. It names important concepts in agent evaluation (statistical testing, behavioral contracts, adversarial testing) but provides zero executable examples, no actual solutions in the sharp edges table, and no workflow for how to implement any evaluation approach. Claude would not know what to do with this skill beyond understanding that agent evaluation is a topic.
Suggestions
Add concrete, executable code examples for at least one evaluation pattern (e.g., a Python snippet showing statistical test evaluation with multiple runs and confidence intervals)
Replace the placeholder '// comments' in the Sharp Edges table with actual solutions or code snippets
Define a clear workflow for evaluating an agent, such as: 1) Define behavioral contracts, 2) Run N iterations, 3) Analyze distribution, 4) Check against thresholds
Remove the persona framing paragraph and replace with actionable quick-start guidance
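The first three suggestions can be sketched together. Below is a minimal, hypothetical example of statistical test evaluation with multiple runs and a confidence interval; `run_agent` is a stand-in for a real agent call and is simulated here, and the threshold and run count are illustrative:

```python
import random

def run_agent(task):
    # Placeholder for a real agent invocation; returns True if the
    # behavioral contract passes. Simulated stochastically here.
    return random.random() < 0.8

def evaluate(task, n_runs=30, threshold=0.7, z=1.96):
    """Run the agent n_runs times, compute the pass rate with a
    normal-approximation confidence interval, and check the lower
    bound against a threshold."""
    results = [run_agent(task) for _ in range(n_runs)]
    p = sum(results) / n_runs
    # Standard error of a proportion (normal approximation)
    se = (p * (1 - p) / n_runs) ** 0.5
    ci = (p - z * se, p + z * se)
    # Conservative: the CI lower bound must clear the threshold
    passed = ci[0] >= threshold
    return {"pass_rate": p, "ci": ci, "passed": passed}

result = evaluate("summarize this ticket")
print(result)
```

This follows the workflow named in the suggestions: define the contract (the boolean returned by `run_agent`), run N iterations, analyze the distribution, and check against a threshold.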
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is relatively brief but includes unnecessary persona framing ('You're a quality engineer who has seen...') that doesn't add actionable value. The capabilities/requirements lists are terse but the sharp edges table has placeholder comments instead of actual solutions. | 2 / 3 |
| Actionability | The skill provides no concrete code, commands, or executable examples. Patterns like 'Statistical Test Evaluation' and 'Behavioral Contract Testing' are named but not demonstrated. The sharp edges table lists problems with '// comments' as placeholders rather than actual solutions. | 1 / 3 |
| Workflow Clarity | There is no workflow defined. No steps, sequences, or validation checkpoints are provided. The content describes concepts and categories but never explains how to actually perform agent evaluation. | 1 / 3 |
| Progressive Disclosure | The content has some structure with clear sections (Patterns, Anti-Patterns, Sharp Edges), but there are no references to detailed documentation. The sections themselves are stubs without substantive content to organize. | 2 / 3 |
| Total | 6 / 12 Passed | |
Validation
90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
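The `frontmatter_unknown_keys` warning above is typically resolved by moving non-spec keys under a metadata block; a hypothetical before/after sketch (the `category` key is illustrative, not taken from the actual skill):

```yaml
# Before: unknown top-level key triggers the warning
---
name: agent-evaluation
description: Testing and benchmarking LLM agents
category: testing        # hypothetical unknown key
---

# After: unknown keys moved under metadata
---
name: agent-evaluation
description: Testing and benchmarking LLM agents
metadata:
  category: testing
---
```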