
jbvc/agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Quality: 62% (Does it follow best practices?)

Impact: Pending (No eval scenarios have been run)

Security by Snyk: Passed (No known issues)


Quality

Discovery: 75%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description covers a clear niche (LLM agent testing/benchmarking) and includes both 'what' and 'when' components, which is good. However, the capabilities listed are more like category names than concrete actions, and the trigger terms are somewhat repetitive. The inclusion of the factoid about '<50% on real-world benchmarks' is interesting context but doesn't help with skill selection.

Suggestions

Replace category labels with concrete actions, e.g., 'Generates behavioral test suites for LLM agents, runs capability benchmarks, measures reliability metrics, and sets up production monitoring dashboards.'

Diversify trigger terms to include more natural user phrasings like 'evaluate my AI agent,' 'LLM performance testing,' 'agent accuracy,' 'how reliable is my agent,' 'test my chatbot,' or 'agent QA.'

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (LLM agent testing/benchmarking) and lists some areas like 'behavioral testing, capability assessment, reliability metrics, and production monitoring,' but these are more category labels than concrete actions. It doesn't specify what the skill actually does (e.g., 'generates test suites,' 'runs benchmarks,' 'produces reliability reports'). | 2 / 3 |
| Completeness | The description answers both 'what' (testing and benchmarking LLM agents with behavioral testing, capability assessment, reliability metrics, production monitoring) and 'when' (explicit 'Use when' clause with trigger terms). Both components are present and explicit. | 3 / 3 |
| Trigger Term Quality | The 'Use when' clause includes relevant terms like 'agent testing,' 'agent evaluation,' 'benchmark agents,' and 'agent reliability,' but these are somewhat repetitive variations of the same concept. Missing natural user phrases like 'evaluate my AI agent,' 'LLM performance,' 'agent accuracy,' 'how good is my agent,' or 'test my chatbot.' | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on LLM agent testing and benchmarking is a clear, specific niche. The mention of agent-specific benchmarks, reliability metrics, and the statistical claim about real-world benchmarks makes it unlikely to conflict with general testing or general LLM skills. | 3 / 3 |

Total: 10 / 12

Passed

Implementation: 14%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads as an incomplete outline or skeleton rather than a functional skill document. It names important concepts (statistical testing, behavioral contracts, adversarial testing) but provides no concrete implementation guidance, code examples, or executable workflows. The sharp edges table contains placeholder comments instead of solutions, and the patterns section lacks any substantive content beyond titles.

Suggestions

Add concrete, executable code examples for each pattern—e.g., a Python function that runs an agent test N times and computes pass rate with confidence intervals for 'Statistical Test Evaluation'.
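As an illustration of that suggestion, here is a minimal sketch of such a function. The `run_test` callable is a hypothetical stand-in for whatever invokes the agent and checks its output; the interval uses the standard Wilson score formula, which stays well-behaved at small sample sizes and extreme pass rates:

```python
import math

def evaluate_pass_rate(run_test, n=50, z=1.96):
    """Run a (possibly nondeterministic) agent test n times and report
    the pass rate with a Wilson score ~95% confidence interval."""
    passes = sum(1 for _ in range(n) if run_test())
    p = passes / n
    # Wilson score interval: robust for small n and extreme pass rates.
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return {"pass_rate": p,
            "ci_low": max(0.0, center - half),
            "ci_high": min(1.0, center + half),
            "runs": n}
```

A skill could then recommend calling it as `evaluate_pass_rate(lambda: agent_answers_correctly(case))` per test case, where `agent_answers_correctly` is whatever assertion the test defines.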

Replace the placeholder comments in the Sharp Edges table with actual solutions or at minimum specific techniques (e.g., 'Use held-out test sets with hash-based deduplication against training data' for data leakage).
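A minimal sketch of that hash-based deduplication technique, assuming eval cases and training records are plain strings (the normalization step is one possible choice, not the skill's):

```python
import hashlib

def _fingerprint(text: str) -> str:
    # Normalize whitespace and case so trivial reformatting doesn't hide a match.
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe_eval_set(eval_cases, training_texts):
    """Drop eval cases whose normalized text already appears in training data."""
    seen = {_fingerprint(t) for t in training_texts}
    return [case for case in eval_cases if _fingerprint(case) not in seen]
```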

Add a clear step-by-step workflow for evaluating an agent: define behavioral contracts → write test cases → run statistical evaluation → analyze distributions → set up regression monitoring.
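The middle steps of that workflow could be sketched roughly as follows; the `agent` callable, the contract shape, and the 80% regression threshold are all placeholder assumptions, not anything the skill currently defines:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class BehavioralContract:
    name: str
    check: Callable[[str], bool]  # Predicate over the agent's output.

def run_evaluation(agent: Callable[[str], str],
                   cases: List[str],
                   contracts: List[BehavioralContract],
                   runs_per_case: int = 5,
                   threshold: float = 0.8) -> dict:
    """Run each case several times, score every contract on every
    output, and flag any contract whose pass rate falls below threshold."""
    results = {}
    for contract in contracts:
        passed = total = 0
        for case in cases:
            for _ in range(runs_per_case):
                total += 1
                if contract.check(agent(case)):
                    passed += 1
        rate = passed / total if total else 0.0
        results[contract.name] = {"pass_rate": rate,
                                  "regression": rate < threshold}
    return results
```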

Either flesh out the patterns/anti-patterns with inline examples or create referenced files (e.g., BEHAVIORAL_TESTING.md, STATISTICAL_EVALUATION.md) with detailed guidance and link to them.

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | The opening paragraphs contain unnecessary narrative framing ('You're a quality engineer who has seen agents...') that wastes tokens without adding actionable value. The capabilities/requirements lists are metadata-like content that could be in frontmatter. However, the tables and pattern/anti-pattern sections are reasonably concise. | 2 / 3 |
| Actionability | The skill provides zero concrete code, commands, or executable examples. Patterns like 'Statistical Test Evaluation' and 'Behavioral Contract Testing' are named but not demonstrated: there are no code snippets, no specific tools, no example test cases, and no concrete implementation guidance. The sharp edges table has placeholder comments (e.g., '// Bridge benchmark and production evaluation') instead of actual solutions. | 1 / 3 |
| Workflow Clarity | There is no sequenced workflow, no step-by-step process for evaluating an agent, and no validation checkpoints. The patterns section names approaches but never describes how to execute them. For a multi-step domain like agent evaluation (design tests → run tests → analyze results → iterate), the complete absence of a workflow is a significant gap. | 1 / 3 |
| Progressive Disclosure | The content is a shallow outline with no depth anywhere, neither inline nor via references to external files. The 'Related Skills' section mentions other skills, but there are no links to detailed guides, examples, or reference materials. The sharp edges solutions are stub comments rather than actual content or pointers to content. | 1 / 3 |

Total: 5 / 12

Passed

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| frontmatter_unknown_keys | Unknown frontmatter key(s) found; consider removing or moving to metadata | Warning |

Total: 10 / 11

Passed

Reviewed
