Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
68
Does it follow best practices?
If you maintain this skill, you can automatically optimize it using the tessl CLI to improve its score:
npx tessl skill review --optimize ./path/to/skillValidation for skill structure
Statistical and behavioral contract testing
Multi-run execution
100%
100%
Distribution analysis
100%
100%
No string matching scoring
100%
100%
Behavioral invariants
100%
100%
Edge/negative cases
90%
100%
Flakiness handling documented
100%
100%
Multiple behavioral dimensions
100%
100%
Summary report output
100%
100%
Happy path not sole focus
100%
100%
Behavioral invariants documented
85%
100%
Without context: $0.3464 · 1m 46s · 16 turns · 22 in / 6,666 out tokens
With context: $0.6287 · 2m 32s · 28 turns · 33 in / 9,326 out tokens
Benchmark-production gap and multi-dimensional metrics
Benchmark-production gap addressed
100%
100%
Multiple distinct metrics
100%
100%
No single-metric reliance
100%
100%
Metric gaming prevention
100%
80%
Production-representative scenarios
100%
100%
No output string matching in prototype
100%
100%
Per-metric scores reported
100%
100%
Failure modes mapped to metrics
100%
100%
Adversarial or edge inputs noted
100%
100%
Statistical reliability mentioned
0%
100%
Without context: $0.3080 · 1m 41s · 14 turns · 17 in / 5,467 out tokens
With context: $0.6670 · 2m 59s · 28 turns · 33 in / 10,704 out tokens
Adversarial testing and data leakage prevention
Adversarial tests present
100%
100%
Adversarial category labeled
100%
100%
Data leakage risk identified
100%
100%
Data leakage mitigation proposed
100%
100%
Beyond happy path
100%
100%
No exact string matching
100%
100%
Adversarial intent documented
100%
100%
Separate category results
100%
100%
Multi-run or reliability note
0%
100%
Behavioral checks used
100%
100%
Without context: $0.3916 · 2m 16s · 16 turns · 22 in / 8,350 out tokens
With context: $0.6941 · 3m 3s · 26 turns · 279 in / 11,476 out tokens
Table of Contents
If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.