Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on re...
Install with Tessl CLI
npx tessl i github:sickn33/antigravity-awesome-skills --skill agent-evaluation
Quality
27%
Does it follow best practices?
Impact
99%
0.99x average score across 3 eval scenarios
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/agent-evaluation/SKILL.md

Quality
Discovery
32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description identifies a clear domain (LLM agent testing) and lists relevant capability areas, but suffers from being truncated and lacking explicit trigger guidance. The terms used are somewhat technical category labels rather than concrete actions, and there's no 'Use when...' clause to help Claude know when to select this skill.
Suggestions
Add an explicit 'Use when...' clause with trigger phrases like 'evaluate agent performance', 'test LLM behavior', 'benchmark AI agents', or 'measure agent reliability'
Replace category labels with concrete actions: instead of 'behavioral testing', say 'design behavioral test suites, measure pass rates, identify failure modes'
Include common user terminology variations: 'evals', 'agent evaluation', 'LLM testing', 'AI agent benchmarks', 'agent accuracy'
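The suggestions above could be combined into the skill's frontmatter along these lines; this is a hypothetical sketch of a revised description, not the actual SKILL.md:

```yaml
---
name: agent-evaluation
description: >
  Test and benchmark LLM agents: design behavioral test suites, run benchmark
  suites over multiple iterations, measure pass rates and reliability metrics,
  and monitor agents in production. Use when the user asks to evaluate agent
  performance, run evals, test LLM behavior, benchmark AI agents, or measure
  agent accuracy or reliability.
---
```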
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (LLM agent testing) and lists several action areas (behavioral testing, capability assessment, reliability metrics, production monitoring), but these are category labels rather than concrete actions like 'run benchmark suites' or 'generate test cases'. | 2 / 3 |
| Completeness | Describes what the skill covers (testing/benchmarking LLM agents) but completely lacks a 'Use when...' clause or any explicit trigger guidance. The description also appears truncated mid-sentence. | 1 / 3 |
| Trigger Term Quality | Includes relevant terms like 'LLM agents', 'benchmarking', 'testing', 'reliability metrics', but misses common variations users might say like 'eval', 'evals', 'agent evaluation', 'test harness', or 'accuracy testing'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on 'LLM agents' specifically provides some distinction from general testing skills, but 'testing' and 'benchmarking' are broad terms that could overlap with other QA or performance testing skills. | 2 / 3 |
| Total | 7 / 12 Passed | |
Implementation
22%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill is essentially a skeleton or outline rather than actionable guidance. It names important concepts in agent evaluation (statistical testing, behavioral contracts, adversarial testing) but provides zero executable examples, no actual solutions in the sharp edges table, and no workflow for how to implement any evaluation approach. Claude would not know what to do with this skill beyond understanding that agent evaluation is a topic.
Suggestions
Add concrete, executable code examples for at least one evaluation pattern (e.g., a Python snippet showing statistical test evaluation with multiple runs and confidence intervals)
Replace the placeholder '// comments' in the Sharp Edges table with actual solutions or code snippets
Define a clear workflow for evaluating an agent, such as: 1) Define behavioral contracts, 2) Run N iterations, 3) Analyze distribution, 4) Check against thresholds
Remove the persona framing paragraph and replace with actionable quick-start guidance
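The first three suggestions can be sketched together. Below is a minimal, hypothetical example of statistical test evaluation with multiple runs and a confidence interval; `run_agent` is a stand-in for a real agent call and is simulated here, and the threshold and run count are illustrative:

```python
import random

def run_agent(task):
    # Placeholder for a real agent invocation; returns True if the
    # behavioral contract passes. Simulated stochastically here.
    return random.random() < 0.8

def evaluate(task, n_runs=30, threshold=0.7, z=1.96):
    """Run the agent n_runs times, compute the pass rate with a
    normal-approximation confidence interval, and check the lower
    bound against a threshold."""
    results = [run_agent(task) for _ in range(n_runs)]
    p = sum(results) / n_runs
    # Standard error of a proportion (normal approximation)
    se = (p * (1 - p) / n_runs) ** 0.5
    ci = (p - z * se, p + z * se)
    # Conservative: the CI lower bound must clear the threshold
    passed = ci[0] >= threshold
    return {"pass_rate": p, "ci": ci, "passed": passed}

result = evaluate("summarize this ticket")
print(result)
```

This follows the workflow named in the suggestions: define the contract (the boolean returned by `run_agent`), run N iterations, analyze the distribution, and check against a threshold.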
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The content is relatively brief but includes unnecessary persona framing ('You're a quality engineer who has seen...') that doesn't add actionable value. The capabilities/requirements lists are terse but the sharp edges table has placeholder comments instead of actual solutions. | 2 / 3 |
| Actionability | The skill provides no concrete code, commands, or executable examples. Patterns like 'Statistical Test Evaluation' and 'Behavioral Contract Testing' are named but not demonstrated. The sharp edges table lists problems with '// comments' as placeholders rather than actual solutions. | 1 / 3 |
| Workflow Clarity | There is no workflow defined. No steps, sequences, or validation checkpoints are provided. The content describes concepts and categories but never explains how to actually perform agent evaluation. | 1 / 3 |
| Progressive Disclosure | The content has some structure with clear sections (Patterns, Anti-Patterns, Sharp Edges), but there are no references to detailed documentation. The sections themselves are stubs without substantive content to organize. | 2 / 3 |
| Total | 6 / 12 Passed | |
Validation
90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 10 / 11 Passed
Validation for skill structure
| Criteria | Description | Result |
|---|---|---|
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 Passed | |
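The `frontmatter_unknown_keys` warning above is typically resolved by moving non-spec keys under a metadata block; a hypothetical before/after sketch (the `category` key is illustrative, not taken from the actual skill):

```yaml
# Before: unknown top-level key triggers the warning
---
name: agent-evaluation
description: Testing and benchmarking LLM agents
category: testing        # hypothetical unknown key
---

# After: unknown keys moved under metadata
---
name: agent-evaluation
description: Testing and benchmarking LLM agents
metadata:
  category: testing
---
```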