CtrlK
BlogDocsLog inGet started
Tessl Logo

agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

63

1.05x
Quality

44%

Does it follow best practices?

Impact

99%

1.05x

Average score across 3 eval scenarios

SecuritybySnyk

Passed

No known issues

Optimize this skill with Tessl

npx tessl skill review --optimize ./docs/v19.7/configuration/agent/skills_external/antigravity-awesome-skills-main/skills/agent-evaluation/SKILL.md
SKILL.md
Quality
Evals
Security

Evaluation results

100%

2%

Regression Test Suite for Upgraded Q&A Agent

Statistical and behavioral contract testing

Criteria
Without context
With context

Multi-run execution

100%

100%

Distribution analysis

100%

100%

No string matching scoring

100%

100%

Behavioral invariants

100%

100%

Edge/negative cases

90%

100%

Flakiness handling documented

100%

100%

Multiple behavioral dimensions

100%

100%

Summary report output

100%

100%

Happy path not sole focus

100%

100%

Behavioral invariants documented

85%

100%

98%

6%

Agent Performance Investigation: High Benchmark, Poor Production

Benchmark-production gap and multi-dimensional metrics

Criteria
Without context
With context

Benchmark-production gap addressed

100%

100%

Multiple distinct metrics

100%

100%

No single-metric reliance

100%

100%

Metric gaming prevention

100%

80%

Production-representative scenarios

100%

100%

No output string matching in prototype

100%

100%

Per-metric scores reported

100%

100%

Failure modes mapped to metrics

100%

100%

Adversarial or edge inputs noted

100%

100%

Statistical reliability mentioned

0%

100%

100%

8%

Hardening the Test Suite for a Code Review Agent

Adversarial testing and data leakage prevention

Criteria
Without context
With context

Adversarial tests present

100%

100%

Adversarial category labeled

100%

100%

Data leakage risk identified

100%

100%

Data leakage mitigation proposed

100%

100%

Beyond happy path

100%

100%

No exact string matching

100%

100%

Adversarial intent documented

100%

100%

Separate category results

100%

100%

Multi-run or reliability note

0%

100%

Behavioral checks used

100%

100%

Repository
duclm1x1/Dive-Ai
Evaluated
Agent
Claude Code
Model
Claude Sonnet 4.6

Table of Contents

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.