Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
63
44%
Does it follow best practices?
Impact
99%
1.05x
Average score across 3 eval scenarios
Passed
No known issues
Optimize this skill with Tessl
npx tessl skill review --optimize ./docs/v19.7/configuration/agent/skills_external/antigravity-awesome-skills-main/skills/agent-evaluation/SKILL.md
Statistical and behavioral contract testing
Multi-run execution                 100%  100%
Distribution analysis               100%  100%
No string matching scoring          100%  100%
Behavioral invariants               100%  100%
Edge/negative cases                  90%  100%
Flakiness handling documented       100%  100%
Multiple behavioral dimensions      100%  100%
Summary report output               100%  100%
Happy path not sole focus           100%  100%
Behavioral invariants documented     85%  100%
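The criteria above (multi-run execution, distribution analysis, behavioral invariants, no string matching) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `fake_agent` is a hypothetical stand-in for a real agent call, and the invariant names and the step budget of 5 are invented, not taken from the skill itself.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a real, nondeterministic agent call.
def fake_agent(prompt: str, rng: random.Random) -> Dict[str, object]:
    steps = rng.randint(1, 6)
    return {
        "answer": f"result for {prompt}",
        "steps": steps,
        "called_tools": ["search"] * steps,
    }

Invariant = Callable[[Dict[str, object]], bool]

# Behavioral invariants check properties of a run, not exact output strings.
# The names and thresholds are illustrative assumptions.
INVARIANTS: Dict[str, Invariant] = {
    "non_empty_answer": lambda out: bool(str(out["answer"]).strip()),
    "step_budget": lambda out: int(out["steps"]) <= 5,
    "only_allowed_tools": lambda out: set(out["called_tools"]) <= {"search", "calculator"},
}

def run_trials(prompt: str, n_runs: int, seed: int = 0) -> Dict[str, List[bool]]:
    """Run the agent n_runs times and record each invariant per run."""
    rng = random.Random(seed)
    results: Dict[str, List[bool]] = {name: [] for name in INVARIANTS}
    for _ in range(n_runs):
        out = fake_agent(prompt, rng)
        for name, check in INVARIANTS.items():
            results[name].append(check(out))
    return results

def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Report a pass rate per invariant instead of one pass/fail verdict."""
    return {name: sum(runs) / len(runs) for name, runs in results.items()}

if __name__ == "__main__":
    for name, rate in summarize(run_trials("book a flight", n_runs=50)).items():
        print(f"{name}: {rate:.0%} of runs passed")
```

Reporting a pass rate per invariant makes flakiness visible: an invariant that holds in only some runs shows up as a rate below 100% rather than as an intermittent test failure.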
Benchmark-production gap and multi-dimensional metrics
Benchmark-production gap addressed      100%  100%
Multiple distinct metrics               100%  100%
No single-metric reliance               100%  100%
Metric gaming prevention                100%   80%
Production-representative scenarios     100%  100%
No output string matching in prototype  100%  100%
Per-metric scores reported              100%  100%
Failure modes mapped to metrics         100%  100%
Adversarial or edge inputs noted        100%  100%
Statistical reliability mentioned         0%  100%
Adversarial testing and data leakage prevention
Adversarial tests present         100%  100%
Adversarial category labeled      100%  100%
Data leakage risk identified      100%  100%
Data leakage mitigation proposed  100%  100%
Beyond happy path                 100%  100%
No exact string matching          100%  100%
Adversarial intent documented     100%  100%
Separate category results         100%  100%
Multi-run or reliability note       0%  100%
Behavioral checks used            100%  100%
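The data leakage criteria above name the risk but not a detection method. One common heuristic, shown here only as a sketch, is to measure word n-gram overlap between an eval item and text the model may have seen. The function names, the whitespace tokenization, and the default `n=5` are assumptions made for illustration, not part of the skill.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Lowercased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(eval_item: str, corpus_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of the eval item's n-grams that also appear in the corpus.

    A high ratio suggests the item may have leaked into training data.
    Items shorter than n tokens yield 0.0.
    """
    item = ngrams(eval_item, n)
    if not item:
        return 0.0
    corpus: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc, n)
    return len(item & corpus) / len(item)

if __name__ == "__main__":
    docs = ["the quick brown fox jumps over the lazy dog today"]
    print(overlap_ratio("the quick brown fox jumps over the lazy dog", docs))
    print(overlap_ratio("an entirely different sentence about agent testing here", docs))
```

Items above a chosen overlap threshold can be reported in their own category or replaced with freshly written variants, so that leaked items do not inflate the headline score.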
20ba150