
agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on re...

Install with Tessl CLI

npx tessl i github:sickn33/antigravity-awesome-skills --skill agent-evaluation

52

0.99x

Quality: 27% (Does it follow best practices?)

Impact: 99% (0.99x), average score across 3 eval scenarios
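
As a rough check on the Impact figure, the three per-scenario scores listed under "Evaluation results" (98%, 100%, 100%) average to about 99%. The sketch below assumes a plain unweighted mean rounded to the nearest whole percent. The page does not state how the figure is actually aggregated or rounded, and `average_score` is a hypothetical name.

```python
from statistics import mean

def average_score(scenario_scores):
    """Unweighted mean of per-scenario percentages, rounded to a whole
    percent. The aggregation method is an assumption, not documented."""
    return round(mean(scenario_scores))

# Per-scenario scores as shown under "Evaluation results".
scenario_scores = [98, 100, 100]
print(f"{average_score(scenario_scores)}%")  # prints "99%"
```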

Optimize this skill with Tessl

npx tessl skill review --optimize ./skills/agent-evaluation/SKILL.md

Evaluation results

Document Summarization Agent Evaluation Framework
Statistical evaluation and multi-dimensional scoring

Score: 98% (-2%)

Criteria                              Without context   With context
Multi-run per test case               100%              100%
Result aggregation                    100%              100%
No string matching scoring            100%              100%
Multiple scoring dimensions           100%              100%
Flaky test detection                  100%              100%
Results in eval_results.json          100%              100%
eval_design.md present                100%              100%
Design justifies multi-run            100%              100%
Design mentions metric gaming risk    100%              60%

Without context: $0.4438 · 2m 3s · 16 turns · 17 in / 7,325 out tokens

With context: $0.9244 · 3m 56s · 31 turns · 79 in / 14,451 out tokens
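
To illustrate what the multi-run, result aggregation, and flaky test detection criteria in this scenario are checking for, here is a minimal sketch. It is not the evaluated solution, which this page does not show. The function name, the 0.7 pass threshold, and the sample scores are all hypothetical.

```python
from statistics import mean

def summarize_runs(run_scores, pass_threshold=0.7):
    """Aggregate repeated runs of one test case and flag inconsistent outcomes.

    run_scores: scores in [0, 1] from running the same test case several times.
    pass_threshold: hypothetical cutoff for counting a run as a pass.
    """
    passes = [score >= pass_threshold for score in run_scores]
    pass_rate = sum(passes) / len(passes)
    return {
        "mean_score": mean(run_scores),
        "pass_rate": pass_rate,
        # Flaky: the same input passes on some runs and fails on others.
        "flaky": 0.0 < pass_rate < 1.0,
    }

# Hypothetical scores from five runs of the same test case.
result = summarize_runs([0.9, 0.8, 0.4, 0.85, 0.9])
print(result["pass_rate"], result["flaky"])
```

Reporting a pass rate over several runs, rather than a single pass/fail, is what makes nondeterministic behavior visible in the first place.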

Test Suite Design for an LLM-Powered SQL Query Generator
Behavioral contract and adversarial testing

Score: 100%

Criteria                              Without context   With context
Behavioral invariants defined         100%              100%
Adversarial test cases present        100%              100%
Edge case / boundary tests            100%              100%
Not only happy path                   100%              100%
Pass conditions not string matching   100%              100%
Strategy mentions invariants          100%              100%
Strategy mentions adversarial intent  100%              100%
Test cases organized by category      100%              100%
Sufficient test count                 100%              100%

Without context: $0.4062 · 1m 52s · 18 turns · 19 in / 5,598 out tokens

With context: $0.4206 · 1m 48s · 20 turns · 68 in / 5,498 out tokens
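
One way a pass condition can avoid string matching, as the criteria for this scenario require, is to compare what two queries return instead of how they are written. The sketch below assumes an in-memory SQLite database. The schema, the queries, and the `same_result_set` helper are hypothetical and not taken from the evaluated solution.

```python
import sqlite3

def same_result_set(generated_sql, reference_sql, setup_sql):
    """Return True if both queries yield the same rows (order ignored)
    against a fresh in-memory database built from setup_sql."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        got = sorted(conn.execute(generated_sql).fetchall())
        expected = sorted(conn.execute(reference_sql).fetchall())
        return got == expected
    finally:
        conn.close()

# Hypothetical fixture data.
SETUP = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO users VALUES (1, 'Ada', 36), (2, 'Linus', 17);
"""

# Textually different queries that are behaviorally equivalent on this data.
ok = same_result_set(
    "select name from users where age >= 18",
    "SELECT u.name FROM users AS u WHERE u.age > 17",
    SETUP,
)
print(ok)
```

A check like this only shows equivalence on the fixture data, so boundary rows (here, an age just below the cutoff) matter as much as the comparison itself.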

Production Deployment Readiness Evaluation for a RAG-Based Support Agent
Benchmark-production gap and data leakage prevention

Score: 100%

Criteria                              Without context   With context
Data leakage risk identified          100%              100%
Benchmark-production gap identified   100%              100%
Test set separation proposed          100%              100%
Production-representative evaluation  100%              100%
Multi-dimensional metrics proposed    100%              100%
risk_matrix.json present              100%              100%
Data leakage in risk_matrix.json      100%              100%
Data governance section               100%              100%

Without context: $0.4767 · 1m 45s · 24 turns · 24 in / 5,394 out tokens

With context: $0.3673 · 1m 30s · 17 turns · 17 in / 4,627 out tokens
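
The data leakage and test set separation criteria in this scenario concern evaluation items that also appear in data the agent was tuned or prompted on. A minimal sketch of one such check follows. It assumes exact matching after lowercasing and whitespace normalization, so it will miss paraphrases and near-duplicates. All names and sample strings are hypothetical.

```python
import hashlib

def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def leaked_items(eval_questions, tuning_examples):
    """Return eval questions whose normalized form also appears in the
    tuning examples (exact match only; paraphrases are not detected)."""
    seen = {fingerprint(t) for t in tuning_examples}
    return [q for q in eval_questions if fingerprint(q) in seen]

# Hypothetical data sets.
tuning = ["How do I reset my password?", "Where is my invoice?"]
evals = ["how do I  reset my password?", "Can I change my plan?"]

overlap = leaked_items(evals, tuning)
print(overlap)
```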

Evaluated
Agent: Claude Code
Model: Claude Sonnet 4.6
