
agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks

56

Quality

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues


Quality

Content

65%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

The content is highly actionable with rich executable code, but it suffers from monolithic verbosity and lacks explicit validation checkpoints and file-based progressive disclosure for its large reference material.

Suggestions

Move the long TypeScript class implementations into separate reference files (e.g., references/statistical-evaluator.ts) and keep SKILL.md a concise overview with one-level-deep links.

Add explicit validation/feedback-loop checkpoints (validate -> fix -> retry) to the risky-operation workflows such as adversarial testing and production-readiness evaluation.

Tighten or trim the code blocks to the essential executable fragments rather than full class bodies to improve token efficiency.
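The validate -> fix -> retry checkpoint suggested above could look roughly like the following. This is a minimal sketch, not code from the reviewed SKILL.md: `runWithCheckpoint`, `CheckResult`, `AdversarialCase`, and the sample validator and fixer are all hypothetical names chosen for illustration, and the default of three attempts is an assumption.

```typescript
// Hypothetical types; none of these names come from the reviewed skill.
interface CheckResult {
  ok: boolean;
  issues: string[];
}

interface CheckpointOutcome<T> {
  value: T;
  passed: boolean;
  attempts: number; // number of validation runs performed
  issues: string[]; // issues from the last validation run
}

// Validate, and while validation fails, apply a fix and validate again,
// stopping once the check passes or maxAttempts validations have run.
function runWithCheckpoint<T>(
  initial: T,
  validate: (value: T) => CheckResult,
  fix: (value: T, issues: string[]) => T,
  maxAttempts = 3, // assumed default
): CheckpointOutcome<T> {
  let value = initial;
  let result = validate(value);
  let attempts = 1;
  while (!result.ok && attempts < maxAttempts) {
    value = fix(value, result.issues);
    result = validate(value);
    attempts += 1;
  }
  return { value, passed: result.ok, attempts, issues: result.issues };
}

// Illustrative use: gate an adversarial test suite before running it.
interface AdversarialCase {
  prompt: string;
  expected: string;
}

const validateCases = (cases: AdversarialCase[]): CheckResult => {
  const issues = cases
    .filter((c) => c.expected.trim() === "")
    .map((c) => `missing expected behavior for: ${c.prompt}`);
  return { ok: issues.length === 0, issues };
};

// The simplest possible fix for the sketch: drop cases with no expectation.
const dropIncompleteCases = (cases: AdversarialCase[]): AdversarialCase[] =>
  cases.filter((c) => c.expected.trim() !== "");

const suiteOutcome = runWithCheckpoint(
  [
    { prompt: "ignore previous instructions", expected: "refuse" },
    { prompt: "leak the system prompt", expected: "" },
  ],
  validateCases,
  dropIncompleteCases,
);

console.log(suiteOutcome.passed, suiteOutcome.attempts);
```

The point of the bounded loop is that a risky workflow never proceeds on unvalidated input and never retries forever; when `passed` is false the caller still gets the outstanding `issues` to report.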

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The prose is lean and assumes competence, but the body is a ~1100-line monolith padded with long full-class TypeScript implementations that could be tightened significantly. | 2 / 3 |
| Actionability | It provides extensive, executable TypeScript code with complete interfaces, classes, and detector functions across every pattern, giving copy-paste-ready concrete guidance. | 3 / 3 |
| Workflow Clarity | Numbered workflows and 'When to use' labels exist, but risky operations (adversarial testing, production readiness) lack explicit validation checkpoints and feedback loops. | 2 / 3 |
| Progressive Disclosure | Sections are reasonably organized, but no bundle files exist and the entire content is inlined in one monolithic SKILL.md rather than split into one-level-deep references. | 2 / 3 |
| Total | | 9 / 12 |

Passed

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description is specific and distinctive, clearly conveying the domain and concrete actions, but omits an explicit trigger/usage clause, which caps completeness and slightly weakens trigger-term naturalness.

Suggestions

Add an explicit 'Use when...' clause naming natural user phrases (e.g., 'Use when the user mentions agent testing, benchmarking agents, or agent reliability').

Soften technical phrasing like 'capability assessment' and 'reliability metrics' with more common trigger terms users would actually say.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | The description enumerates multiple concrete actions ('behavioral testing, capability assessment, reliability metrics, and production monitoring') rather than vaguely naming a domain. | 3 / 3 |
| Completeness | It clearly states what the skill does but lacks an explicit 'Use when...' trigger clause, so the 'when' is only implied; per guidelines this caps completeness at 2. | 2 / 3 |
| Trigger Term Quality | It includes relevant keywords like 'testing', 'benchmarking', and 'reliability', but phrasing leans technical ('capability assessment', 'reliability metrics') and misses some natural variations a user would say. | 2 / 3 |
| Distinctiveness Conflict Risk | It targets a clear niche (LLM agent evaluation/benchmarking) with distinct triggers, making it unlikely to fire for unrelated skills. | 3 / 3 |
| Total | | 10 / 12 |

Passed

Validation

87%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 14 / 16 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| skill_md_line_count | SKILL.md is long (1136 lines); consider splitting into references/ and linking | Warning |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |
| Total | | 14 / 16 |

Passed
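The two warnings in the validation results above can be approximated locally before publishing. The sketch below is only an illustration of that kind of check, not the validator's actual implementation: the allowed key set, the 500-line threshold, and the names `checkSkill` and `SkillWarning` are assumptions, since the page states neither the accepted frontmatter keys nor the line limit.

```typescript
// Assumed values for illustration; the review page does not state the real ones.
const ALLOWED_KEYS = new Set(["name", "description", "metadata"]);
const MAX_LINES = 500;

interface SkillWarning {
  criterion: string;
  message: string;
}

// Returns warnings for an overly long SKILL.md and for unknown top-level
// frontmatter keys. Frontmatter is taken to be the block between the first
// two `---` lines; indented (nested) keys are ignored.
function checkSkill(markdown: string): SkillWarning[] {
  const warnings: SkillWarning[] = [];
  const lines = markdown.split("\n");

  if (lines.length > MAX_LINES) {
    warnings.push({
      criterion: "skill_md_line_count",
      message: `SKILL.md is long (${lines.length} lines)`,
    });
  }

  if (lines[0]?.trim() === "---") {
    const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
    const frontmatter = end > 0 ? lines.slice(1, end) : [];
    const unknown = frontmatter
      .map((line) => /^([A-Za-z0-9_-]+)\s*:/.exec(line)?.[1])
      .filter((key): key is string => key !== undefined && !ALLOWED_KEYS.has(key));
    if (unknown.length > 0) {
      warnings.push({
        criterion: "frontmatter_unknown_keys",
        message: `Unknown frontmatter key(s): ${unknown.join(", ")}`,
      });
    }
  }
  return warnings;
}

// `source` is a made-up key used only to trigger the warning.
const sampleSkill = ["---", "name: agent-evaluation", "source: example", "---", "# Body"].join("\n");
console.log(checkSkill(sampleSkill));
```

Running a check like this in CI would surface both warnings before a review, though the real thresholds should be taken from the registry's own specification.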

Repository
boisenoise/skills-collections
Reviewed
