
agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks


Quality: 30% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security (by Snyk): Passed (No known issues)

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/antigravity-agent-evaluation/SKILL.md

Quality

Discovery

32%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description identifies a clear domain (LLM agent testing and benchmarking) and lists relevant subcategories, but it reads more like a topic summary than an actionable skill description. It lacks a 'Use when...' clause, concrete actions the skill performs, and the trailing statistical claim ('even top agents achieve less than 50%') is editorial commentary that doesn't aid skill selection.

Suggestions

Add an explicit 'Use when...' clause with trigger scenarios, e.g., 'Use when the user asks about evaluating LLM agents, creating test suites for AI systems, measuring agent reliability, or setting up evals.'

Replace category labels with concrete actions, e.g., 'Designs behavioral test suites, runs capability benchmarks, calculates reliability metrics, and sets up production monitoring dashboards for LLM agents.'

Remove the editorial claim about '50% on real-world benchmarks' and instead add common user trigger terms like 'eval', 'evaluation', 'agent performance', 'test harness', or 'accuracy measurement'.
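Taken together, these suggestions point toward a frontmatter description along these lines (a hypothetical sketch; the wording is illustrative, not the skill's actual metadata):

```yaml
---
name: agent-evaluation
description: >
  Designs behavioral test suites, runs capability benchmarks, calculates
  reliability metrics, and sets up production monitoring for LLM agents.
  Use when the user asks about evaluating LLM agents, creating evals or
  test harnesses for AI systems, or measuring agent accuracy, reliability,
  or performance.
---
```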

Dimension / Reasoning / Score

Specificity

Names the domain (LLM agent testing/benchmarking) and lists some action areas (behavioral testing, capability assessment, reliability metrics, production monitoring), but these are more like categories than concrete actions. It doesn't specify what the skill actually does (e.g., 'generates test suites', 'runs benchmarks', 'produces reports').

2 / 3

Completeness

Describes what the skill covers at a high level but completely lacks a 'Use when...' clause or any explicit trigger guidance for when Claude should select this skill. The rubric states a missing 'Use when...' clause should cap completeness at 2, and the 'what' itself is also somewhat vague, bringing this to 1.

1 / 3

Trigger Term Quality

Includes relevant terms like 'LLM agents', 'benchmarking', 'testing', 'reliability metrics', and 'production monitoring' which users might naturally use. However, it misses common variations like 'eval', 'evaluation', 'agent evaluation', 'test harness', 'accuracy', or 'performance testing'.

2 / 3

Distinctiveness Conflict Risk

The focus on LLM agent testing/benchmarking is a reasonably specific niche, but terms like 'testing', 'monitoring', and 'metrics' could overlap with general software testing or monitoring skills. The statistical claim about '50% on real-world benchmarks' adds flavor but doesn't help with disambiguation.

2 / 3

Total: 7 / 12

Passed

Implementation

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill is comprehensive in coverage but severely undermined by its verbosity and monolithic structure. The ~700+ lines of mostly non-executable TypeScript class implementations could be condensed to a fraction of the size by providing concise patterns with key interfaces and delegating full implementations to separate reference files. The content reads more like a tutorial or library documentation than a lean skill file optimized for Claude's context window.

Suggestions

Reduce the main file to ~100 lines with concise pattern summaries and key interfaces, moving full class implementations to separate reference files (e.g., patterns/statistical-evaluation.md, patterns/adversarial-testing.md)

Remove metadata sections (Capabilities, Prerequisites, Scope, Ecosystem, Related Skills, When to Use) that duplicate frontmatter concerns and don't provide actionable guidance

Make code examples executable by either providing concrete implementations with real dependencies (e.g., using actual testing frameworks like Jest/Vitest) or simplifying to focused, runnable snippets rather than abstract class hierarchies

Add explicit validation checkpoints to workflows, e.g., 'If baseline pass rate < 0.7, stop and investigate test quality before proceeding to regression testing'
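The last suggestion could be realized as a small gate in the evaluation workflow. A minimal sketch, assuming the 0.7 baseline threshold named above; the `EvalResult` shape and `checkBaseline` name are hypothetical, not from the skill:

```typescript
// Hypothetical validation checkpoint: gate regression testing on a
// minimum baseline pass rate, as the suggestion above describes.
interface EvalResult {
  passed: number;
  total: number;
}

const BASELINE_THRESHOLD = 0.7; // assumed threshold from the suggestion

function checkBaseline(result: EvalResult): boolean {
  const passRate = result.passed / result.total;
  if (passRate < BASELINE_THRESHOLD) {
    console.warn(
      `Baseline pass rate ${passRate.toFixed(2)} < ${BASELINE_THRESHOLD}; ` +
        "stop and investigate test quality before regression testing."
    );
    return false;
  }
  return true;
}

console.log(checkBaseline({ passed: 58, total: 100 })); // 0.58 → false
```

A checkpoint like this turns an implicit workflow step into an explicit stop condition the agent can act on.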

Dimension / Reasoning / Score

Conciseness

Extremely verbose at ~700+ lines. Massive code blocks contain full class implementations with helper methods, type definitions, and extensive inline comments that Claude doesn't need. Concepts like confidence intervals, chi-squared tests, and Jaccard similarity are explained through code that Claude already understands. The metadata sections (Capabilities, Prerequisites, Scope, Ecosystem) add significant overhead with information that could be drastically condensed.

1 / 3

Actionability

The code examples are detailed TypeScript/pseudo-TypeScript but not truly executable—they reference undefined types (Agent, AgentOutput, AgentContext, TestCase), unimplemented helper methods (containsRudeLanguage, isRelevantToCustomerService, containsLegalAdvice, similarity), and abstract interfaces without concrete implementations. They illustrate patterns well but aren't copy-paste ready.
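As an illustration of the copy-paste-ready bar, an undefined helper like `similarity` could be replaced by a concrete, self-contained implementation. A sketch using Jaccard similarity over token sets (the function name and tokenization are assumptions, not the skill's actual code):

```typescript
// Hypothetical concrete replacement for an undefined `similarity` helper:
// Jaccard similarity = |intersection| / |union| over whitespace tokens.
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = Array.from(setA).filter((t) => setB.has(t)).length;
  const union = new Set(Array.from(setA).concat(Array.from(setB))).size;
  return union === 0 ? 1 : intersection / union;
}

// Example: score an agent output against a reference answer.
const score = jaccardSimilarity(
  "refund issued to customer",
  "customer refund was issued"
);
console.log(score.toFixed(2)); // shared: refund, issued, customer → 3/5 = 0.60
```

A snippet like this runs as-is, which is the gap the reasoning above identifies in the skill's abstract class hierarchies.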

2 / 3

Workflow Clarity

The Collaboration section includes brief workflow sequences (design → create suite → implement → evaluate → iterate), and the patterns are logically sequenced. However, there are no explicit validation checkpoints within the main patterns—the statistical evaluator and regression tester don't specify when to stop, what thresholds trigger action, or how to recover from failures in the evaluation pipeline itself.

2 / 3

Progressive Disclosure

This is a monolithic wall of text with no references to external files. All patterns, sharp edges, and collaboration workflows are inlined in a single massive document. The content would benefit enormously from splitting patterns into separate files (e.g., STATISTICAL_TESTING.md, ADVERSARIAL_TESTING.md, REGRESSION_TESTING.md) with a concise overview in the main skill file.

1 / 3

Total: 6 / 12

Passed

Validation

81%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 9 / 11 Passed

Validation for skill structure

Criteria / Description / Result

skill_md_line_count

SKILL.md is long (1132 lines); consider splitting into references/ and linking

Warning

frontmatter_unknown_keys

Unknown frontmatter key(s) found; consider removing or moving to metadata

Warning

Total: 9 / 11

Passed

Repository: boisenoise/skills-collections (Reviewed)

