
agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on re...

52 · 0.99x

Quality: 27%. Does it follow best practices?

Impact: 99% (0.99x). Average score across 3 eval scenarios.

Security (by Snyk): Passed. No known issues.

Optimize this skill with Tessl:

```
npx tessl skill review --optimize ./skills/antigravity-agent-evaluation/SKILL.md
```

Quality

Discovery

32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description identifies a clear domain (LLM agent testing/benchmarking) and lists relevant capability areas, but suffers from being truncated and lacking explicit trigger guidance. The capabilities listed are category-level rather than concrete actions, and there's no 'Use when...' clause to help Claude know when to select this skill.

Suggestions

- Add a 'Use when...' clause with explicit triggers like 'Use when evaluating agent performance, running benchmarks, testing AI agents, or measuring agent reliability'
- Replace category labels with concrete actions: instead of 'behavioral testing', say 'run behavioral test suites, measure task completion rates, evaluate agent responses'
- Ensure the description is not truncated and include common user phrasings like 'agent evaluation', 'test my agent', 'agent performance metrics'
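Putting these suggestions together, a more discoverable description might read something like the following. This is a hypothetical illustration assembled from the suggested phrasings above, not the skill's actual frontmatter:

```yaml
---
name: agent-evaluation
description: >
  Test and benchmark LLM agents: run behavioral test suites, measure task
  completion rates, compute reliability metrics, and monitor agents in
  production. Use when evaluating agent performance, running benchmarks,
  testing AI agents, or measuring agent reliability.
---
```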

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (LLM agent testing) and lists several action areas (behavioral testing, capability assessment, reliability metrics, production monitoring), but these are category labels rather than concrete actions like 'run benchmark suites' or 'generate test cases'. | 2 / 3 |
| Completeness | Describes what the skill covers but has no 'Use when...' clause or explicit trigger guidance. The description is also truncated (ends with 're...'), making it incomplete. Missing the 'when' component entirely. | 1 / 3 |
| Trigger Term Quality | Includes relevant terms like 'LLM agents', 'benchmarking', 'testing', 'reliability metrics', but misses common variations users might say like 'agent evaluation', 'AI testing', 'model performance', or 'agent accuracy'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on 'LLM agents' specifically provides some distinction from general testing skills, but 'testing', 'benchmarking', and 'monitoring' are broad terms that could overlap with other testing or monitoring skills. | 2 / 3 |
| Total | | 7 / 12 |

Passed

Implementation

22%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is essentially a skeleton or outline with no actionable content. It names important concepts in agent evaluation (statistical testing, behavioral contracts, adversarial testing) but provides zero implementation guidance, code examples, or concrete steps. The Sharp Edges table is particularly problematic with empty solution fields that just contain comments.

Suggestions

- Add executable code examples for each pattern (e.g., a Python function showing statistical test evaluation with multiple runs and confidence intervals)
- Complete the Sharp Edges solutions with actual guidance instead of placeholder comments
- Define a concrete workflow for evaluating an agent: setup → baseline tests → statistical runs → analysis → reporting
- Add references to detailed documentation files for complex topics like benchmark design and reliability metrics
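The first suggestion above could be sketched roughly as follows. This is a minimal illustration, not content from the skill itself: `flaky_agent` is a stand-in for a real agent invocation, and the Wilson score interval is one reasonable choice of confidence interval among several.

```python
import math
import random
from typing import Callable

def evaluate_with_confidence(
    run_task: Callable[[], bool],
    n_runs: int = 30,
    z: float = 1.96,  # z-value for a 95% confidence level
) -> dict:
    """Run a pass/fail task repeatedly; report pass rate with a Wilson interval."""
    passes = sum(run_task() for _ in range(n_runs))
    p = passes / n_runs
    # Wilson score interval: better behaved than the normal approximation
    # for small n and pass rates near 0 or 1.
    denom = 1 + z**2 / n_runs
    center = (p + z**2 / (2 * n_runs)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n_runs + z**2 / (4 * n_runs**2))
    return {
        "pass_rate": p,
        "ci_low": max(0.0, center - margin),
        "ci_high": min(1.0, center + margin),
        "n_runs": n_runs,
    }

# Stand-in for a real agent call; replace with your agent invocation.
random.seed(0)
flaky_agent = lambda: random.random() < 0.8

result = evaluate_with_confidence(flaky_agent, n_runs=50)
print(f"pass rate {result['pass_rate']:.2f} "
      f"[{result['ci_low']:.2f}, {result['ci_high']:.2f}]")
```

Running the task many times before drawing conclusions is the point: a single pass or failure of a nondeterministic agent tells you very little, while an interval quantifies how much the observed pass rate can be trusted.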

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The content has some unnecessary narrative framing ('You're a quality engineer who has seen agents...') that doesn't add actionable value. The patterns and anti-patterns sections are appropriately brief but lack substance. | 2 / 3 |
| Actionability | The skill provides no concrete code, commands, or executable examples. Patterns like 'Statistical Test Evaluation' and 'Behavioral Contract Testing' are named but not explained with any implementation details. The Sharp Edges table has empty solutions (just comments). | 1 / 3 |
| Workflow Clarity | There is no workflow defined. The skill mentions concepts like 'behavioral regression tests' and 'capability assessments' but provides no sequence of steps, no validation checkpoints, and no process for actually evaluating an agent. | 1 / 3 |
| Progressive Disclosure | The content has reasonable section structure (Patterns, Anti-Patterns, Sharp Edges, Related Skills) but the sections are mostly empty placeholders. No references to detailed documentation or examples are provided despite the complex topic warranting them. | 2 / 3 |
| Total | | 6 / 12 |

Passed

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 checks passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | 10 / 11 | Passed |

Repository: boisenoise/skills-collections (Reviewed)

