
agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks

33

Quality

30%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/antigravity-agent-evaluation/SKILL.md

Quality

Discovery

32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description identifies a clear domain (LLM agent testing and benchmarking) but relies on category-level language rather than concrete actions. It critically lacks a 'Use when...' clause, making it difficult for Claude to know when to select this skill. The statistical claim about agent performance is interesting context but doesn't aid skill selection.

Suggestions

Add an explicit 'Use when...' clause with trigger scenarios, e.g., 'Use when the user asks about evaluating LLM agents, creating evals, measuring agent accuracy, or setting up agent benchmarks.'

Replace category names with concrete actions, e.g., 'Designs behavioral test suites for LLM agents, runs capability benchmarks, calculates reliability metrics, and sets up production monitoring dashboards.'

Include common user-facing trigger terms like 'eval', 'evaluation', 'agent performance', 'test suite', and 'accuracy measurement' to improve matching.
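The suggestions above can be checked mechanically before a description is published. The following is a minimal sketch, not Tessl's scoring logic: `checkDescription`, `DescriptionReport`, and `TRIGGER_TERMS` are hypothetical names, and the check is a crude case-insensitive substring match over the trigger terms listed in the suggestions.

```typescript
// Hypothetical lint for a skill description. This is an illustration only,
// not how Tessl computes the Discovery score.
interface DescriptionReport {
  hasUseWhenClause: boolean;
  missingTriggerTerms: string[];
}

// Trigger terms copied from the suggestion list above.
const TRIGGER_TERMS = [
  "eval",
  "evaluation",
  "agent performance",
  "test suite",
  "accuracy measurement",
];

function checkDescription(description: string): DescriptionReport {
  const text = description.toLowerCase();
  return {
    // Looks for the literal phrase "use when" anywhere in the description.
    hasUseWhenClause: text.includes("use when"),
    // Substring match: "evals" and "evaluating" both satisfy "eval".
    missingTriggerTerms: TRIGGER_TERMS.filter((term) => !text.includes(term)),
  };
}

// The current description (abridged) and a revision assembled from the
// two example sentences given in the suggestions.
const current =
  "Testing and benchmarking LLM agents including behavioral testing, " +
  "capability assessment, reliability metrics, and production monitoring";
const revised =
  "Designs behavioral test suites for LLM agents, runs capability benchmarks, " +
  "calculates reliability metrics, and sets up production monitoring dashboards. " +
  "Use when the user asks about evaluating LLM agents, creating evals, " +
  "measuring agent accuracy, or setting up agent benchmarks.";

console.log(checkDescription(current));
console.log(checkDescription(revised));
```

Even the revised text still misses some of the listed terms under this literal match, which suggests working the remaining terms into the 'Use when...' clause rather than relying on near-synonyms.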

Dimension | Reasoning | Score

Specificity

The description names the domain (LLM agent testing/benchmarking) and lists several areas (behavioral testing, capability assessment, reliability metrics, production monitoring), but these are more like categories than concrete actions. It doesn't specify what the skill actually does (e.g., 'generates test suites', 'runs benchmarks', 'produces reports').

2 / 3

Completeness

The description addresses 'what' at a high level but completely lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. Per the rubric, a missing 'Use when...' clause should cap completeness at 2, and since the 'what' is also somewhat vague, this scores a 1.

1 / 3

Trigger Term Quality

Includes some relevant terms like 'LLM agents', 'benchmarking', 'testing', 'reliability metrics', and 'production monitoring' that users might naturally use. However, it misses common variations like 'eval', 'evaluation', 'agent evaluation', 'test harness', 'accuracy', or 'performance testing'.

2 / 3

Distinctiveness Conflict Risk

The focus on LLM agent testing/benchmarking is a reasonably specific niche, but the broad terms like 'testing', 'monitoring', and 'capability assessment' could overlap with general software testing skills or monitoring/observability skills. The added statistical claim about '50% on real-world benchmarks' is flavor text that doesn't help with disambiguation.

2 / 3

Total: 7 / 12 (Passed)

Implementation

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is a comprehensive but extremely verbose document that reads more like a textbook chapter than a concise skill reference. The code examples provide structural patterns but are not executable, relying on many undefined types and placeholder methods. The content would be dramatically improved by extracting the large code blocks into separate reference files and keeping the main skill as a lean overview with clear pointers.

Suggestions

Reduce the main SKILL.md to ~100 lines: keep a brief overview of each pattern (2-3 sentences + key interface) and move full implementations to separate files like patterns/statistical-evaluation.ts, patterns/behavioral-contracts.ts, etc.

Make code examples executable by providing concrete implementations rather than abstract classes with undefined types—or explicitly frame them as design patterns/templates rather than runnable code.

Remove explanations of concepts Claude already knows (what confidence intervals are, how chi-squared tests work, what Jaccard similarity measures) and focus on agent-evaluation-specific decisions and thresholds.

Add explicit validation checkpoints to workflows—e.g., after establishing a baseline, verify the baseline is stable before proceeding to regression testing; after detecting leakage, specify concrete remediation steps with verification.
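The baseline checkpoint in the last suggestion could look roughly like the sketch below. It assumes each baseline run is summarized as a pass rate between 0 and 1; the function names, the 0.05 standard-deviation threshold, and the minimum of five runs are illustrative assumptions, not values taken from the skill.

```typescript
// Hypothetical checkpoint: only proceed to regression testing once the
// baseline pass rate is stable across repeated runs of the same suite.
interface StabilityResult {
  mean: number;
  stdDev: number;
  stable: boolean;
}

function checkBaselineStability(
  passRates: number[],
  maxStdDev = 0.05, // assumed threshold; tune per suite
  minRuns = 5 // assumed minimum number of baseline runs
): StabilityResult {
  if (passRates.length < minRuns) {
    throw new Error(`Need at least ${minRuns} baseline runs, got ${passRates.length}`);
  }
  const n = passRates.length;
  const mean = passRates.reduce((sum, r) => sum + r, 0) / n;
  // Sample variance (n - 1 denominator) over the per-run pass rates.
  const variance =
    passRates.reduce((sum, r) => sum + (r - mean) * (r - mean), 0) / (n - 1);
  const stdDev = Math.sqrt(variance);
  return { mean, stdDev, stable: stdDev <= maxStdDev };
}

// Turns the checkpoint result into an explicit next step, so the workflow
// says what to do when the check fails as well as when it passes.
function nextStep(result: StabilityResult): string {
  return result.stable
    ? "proceed: run the regression suite against this baseline"
    : "stop: collect more baseline runs or reduce nondeterminism first";
}

const steadyRuns = [0.8, 0.82, 0.79, 0.81, 0.8];
const noisyRuns = [0.9, 0.6, 0.8, 0.5, 0.7];

console.log(nextStep(checkBaselineStability(steadyRuns)));
console.log(nextStep(checkBaselineStability(noisyRuns)));
```

The point of the design is that the failing branch names a concrete recovery action instead of only reporting a concern, which is the gap the Workflow Clarity score identifies.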

Dimension | Reasoning | Score

Conciseness

Extremely verbose at ~700+ lines. Massive code blocks explain concepts Claude already knows (statistical testing, chi-squared tests, Jaccard similarity). The content reads like a tutorial or textbook rather than a concise skill reference. Explanations of basic testing concepts (the equivalent of a 'What is a PDF' section), interface definitions that add no actionable value, and helper methods with obvious implementations all waste tokens.

1 / 3

Actionability

The code examples are TypeScript-like but not truly executable—they reference undefined types (Agent, AgentOutput, AgentContext, TestCase), use placeholder methods (this.containsRudeLanguage, this.isRelevantToCustomerService), and lack imports or setup instructions. The patterns show structure but couldn't be copy-pasted and run. They're closer to detailed pseudocode than executable guidance.

2 / 3

Workflow Clarity

The Collaboration section has brief workflow sequences (e.g., 'Design agent → Create evaluation suite → Implement → Evaluate → Iterate'), but the main patterns lack explicit validation checkpoints or feedback loops. The Statistical Test Evaluation pattern runs tests and analyzes but doesn't specify what to do when concerns are identified. The regression testing has a deploy/don't-deploy recommendation but no recovery workflow.

2 / 3

Progressive Disclosure

Monolithic wall of text with no references to external files. All content is inline—hundreds of lines of code that could be in separate reference files. No bundle files exist, and no attempt is made to split content into overview vs. detailed references. The skill would benefit enormously from extracting patterns into separate files and keeping SKILL.md as a concise overview.

1 / 3

Total: 6 / 12 (Passed)
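To illustrate the Actionability finding, here is a minimal sketch of what a copy-paste-runnable behavioral check might look like. The type names `Agent`, `AgentOutput`, and `TestCase` and the method name `containsRudeLanguage` come from the review; their fields, the synchronous agent signature, the word list, and the stub agent are all assumptions made so the example runs without a model call.

```typescript
// Illustrative definitions for the types the review flags as undefined.
// Field names are assumptions, not the skill's actual definitions.
interface AgentOutput {
  text: string;
}

// A real agent would almost certainly be async; kept synchronous for brevity.
type Agent = (input: string) => AgentOutput;

interface TestCase {
  name: string;
  input: string;
  check: (output: AgentOutput) => boolean;
}

// Assumed word list standing in for a real tone classifier.
const RUDE_WORDS = ["stupid", "idiot", "shut up"];

function containsRudeLanguage(output: AgentOutput): boolean {
  const text = output.text.toLowerCase();
  return RUDE_WORDS.some((word) => text.includes(word));
}

// Runs every case once and sorts case names into passed and failed.
function runSuite(
  agent: Agent,
  cases: TestCase[]
): { passed: string[]; failed: string[] } {
  const passed: string[] = [];
  const failed: string[] = [];
  for (const testCase of cases) {
    const ok = testCase.check(agent(testCase.input));
    (ok ? passed : failed).push(testCase.name);
  }
  return { passed, failed };
}

// Stub agent: polite about refunds, rude about everything else.
const stubAgent: Agent = (input) => ({
  text: input.includes("refund")
    ? "I can help with your refund."
    : "That is a stupid question.",
});

const politenessCases: TestCase[] = [
  { name: "polite on refund", input: "I want a refund", check: (o) => !containsRudeLanguage(o) },
  { name: "polite on other", input: "What is your name?", check: (o) => !containsRudeLanguage(o) },
];

const suiteResult = runSuite(stubAgent, politenessCases);
console.log(suiteResult);
```

With the stub agent, the first case passes and the second fails, so the harness can be verified before a real agent is substituted for `stubAgent`.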

Validation

81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure

Criteria | Description | Result

skill_md_line_count

SKILL.md is long (1136 lines); consider splitting into references/ and linking

Warning

frontmatter_unknown_keys

Unknown frontmatter key(s) found; consider removing or moving to metadata

Warning

Total: 9 / 11 (Passed)

Repository
boisenoise/skills-collections
Reviewed
