Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, where even top agents achieve less than 50% on re...
Install with Tessl CLI
npx tessl i github:sickn33/antigravity-awesome-skills --skill agent-evaluation
Quality: 27%
Does it follow best practices?
Impact: 99% (0.99x)
Average score across 3 eval scenarios
Optimize this skill with Tessl
npx tessl skill review --optimize ./skills/agent-evaluation/SKILL.md
Statistical evaluation and multi-dimensional scoring
Criterion                             Without context   With context
Multi-run per test case               100%              100%
Result aggregation                    100%              100%
No string matching scoring            100%              100%
Multiple scoring dimensions           100%              100%
Flaky test detection                  100%              100%
Results in eval_results.json          100%              100%
eval_design.md present                100%              100%
Design justifies multi-run            100%              100%
Design mentions metric gaming risk    100%              60%
Without context: $0.4438 · 2m 3s · 16 turns · 17 in / 7,325 out tokens
With context: $0.9244 · 3m 56s · 31 turns · 79 in / 14,451 out tokens
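The "Multi-run per test case", "Result aggregation", and "Flaky test detection" criteria above can be illustrated with a minimal sketch. It is not the skill's own implementation: `evaluate_case`, the `run_once` callable, and the default of five runs are hypothetical names and values chosen for illustration. A case is treated as flaky when it passes on some runs and fails on others.

```python
from typing import Callable, Dict


def evaluate_case(run_once: Callable[[], bool], runs: int = 5) -> Dict[str, object]:
    """Run one test case several times and aggregate the outcomes.

    `run_once` is a hypothetical callable that executes the agent on a
    single test case and returns True when the run passes.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    outcomes = [bool(run_once()) for _ in range(runs)]
    passes = sum(outcomes)
    return {
        "runs": runs,
        "pass_rate": passes / runs,
        # Flaky: the same case both passed and failed across runs.
        "flaky": 0 < passes < runs,
    }
```

Reporting a pass rate per case, rather than a single pass or fail, makes nondeterministic agent behavior visible in the aggregated results.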
Behavioral contract and adversarial testing
Criterion                             Without context   With context
Behavioral invariants defined         100%              100%
Adversarial test cases present        100%              100%
Edge case / boundary tests            100%              100%
Not only happy path                   100%              100%
Pass conditions not string matching   100%              100%
Strategy mentions invariants          100%              100%
Strategy mentions adversarial intent  100%              100%
Test cases organized by category      100%              100%
Sufficient test count                 100%              100%
Without context: $0.4062 · 1m 52s · 18 turns · 19 in / 5,598 out tokens
With context: $0.4206 · 1m 48s · 20 turns · 68 in / 5,498 out tokens
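As one possible reading of "Behavioral invariants defined" and "Pass conditions not string matching", the sketch below checks properties of a structured agent trace instead of comparing output strings. The trace fields (`tools_called`, `steps`, `answer`), the tool allowlist, and the step budget are all assumptions made up for this example.

```python
from typing import Callable, Dict, List

Invariant = Callable[[dict], bool]

# Hypothetical invariants: each is a predicate over a structured trace.
INVARIANTS: Dict[str, Invariant] = {
    "only_allowed_tools": lambda t: set(t["tools_called"]) <= {"search", "calculator"},
    "within_step_budget": lambda t: t["steps"] <= 10,
    "produces_answer": lambda t: bool(t["answer"].strip()),
}


def violated_invariants(trace: dict) -> List[str]:
    """Return the names of invariants the trace breaks (empty list = pass)."""
    return [name for name, check in INVARIANTS.items() if not check(trace)]
```

An adversarial case, such as a prompt that tries to make the agent call a forbidden tool, passes only when `violated_invariants` returns an empty list, regardless of the exact wording of the answer.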
Benchmark-production gap and data leakage prevention
Criterion                             Without context   With context
Data leakage risk identified          100%              100%
Benchmark-production gap identified   100%              100%
Test set separation proposed          100%              100%
Production-representative evaluation  100%              100%
Multi-dimensional metrics proposed    100%              100%
risk_matrix.json present              100%              100%
Data leakage in risk_matrix.json      100%              100%
Data governance section               100%              100%
Without context: $0.4767 · 1m 45s · 24 turns · 24 in / 5,394 out tokens
With context: $0.3673 · 1m 30s · 17 turns · 17 in / 4,627 out tokens
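A rough sketch of what "Test set separation proposed" could look like in practice: flag evaluation cases that also appear in data the agent may already have seen, such as prompt examples or fine-tuning data. The function names and the exact-match-after-normalization approach are assumptions for illustration; it catches only verbatim overlap, not paraphrased leakage.

```python
import hashlib
from typing import Iterable, List


def fingerprint(text: str) -> str:
    """Hash a case after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_cases(eval_cases: Iterable[str], seen_corpus: Iterable[str]) -> List[str]:
    """Return eval cases whose normalized text also occurs in seen_corpus."""
    seen = {fingerprint(t) for t in seen_corpus}
    return [case for case in eval_cases if fingerprint(case) in seen]
```

Any case returned here would be a candidate for removal from the held-out set before scores are reported.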