CtrlK
BlogDocsLog inGet started
Tessl Logo

agent-evaluation

Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Install with Tessl CLI

npx tessl i github:duclm1x1/Dive-Ai --skill agent-evaluation
What are skills?

68

Does it follow best practices?

Validation for skill structure

SKILL.md
Review
Evals

Evaluation results

100%

2%

Regression Test Suite for Upgraded Q&A Agent

Statistical and behavioral contract testing

Criteria
Without context
With context

Multi-run execution

100%

100%

Distribution analysis

100%

100%

No string matching scoring

100%

100%

Behavioral invariants

100%

100%

Edge/negative cases

90%

100%

Flakiness handling documented

100%

100%

Multiple behavioral dimensions

100%

100%

Summary report output

100%

100%

Happy path not sole focus

100%

100%

Behavioral invariants documented

85%

100%

Without context: $0.3464 · 1m 46s · 16 turns · 22 in / 6,666 out tokens

With context: $0.6287 · 2m 32s · 28 turns · 33 in / 9,326 out tokens

98%

6%

Agent Performance Investigation: High Benchmark, Poor Production

Benchmark-production gap and multi-dimensional metrics

Criteria
Without context
With context

Benchmark-production gap addressed

100%

100%

Multiple distinct metrics

100%

100%

No single-metric reliance

100%

100%

Metric gaming prevention

100%

80%

Production-representative scenarios

100%

100%

No output string matching in prototype

100%

100%

Per-metric scores reported

100%

100%

Failure modes mapped to metrics

100%

100%

Adversarial or edge inputs noted

100%

100%

Statistical reliability mentioned

0%

100%

Without context: $0.3080 · 1m 41s · 14 turns · 17 in / 5,467 out tokens

With context: $0.6670 · 2m 59s · 28 turns · 33 in / 10,704 out tokens

100%

8%

Hardening the Test Suite for a Code Review Agent

Adversarial testing and data leakage prevention

Criteria
Without context
With context

Adversarial tests present

100%

100%

Adversarial category labeled

100%

100%

Data leakage risk identified

100%

100%

Data leakage mitigation proposed

100%

100%

Beyond happy path

100%

100%

No exact string matching

100%

100%

Adversarial intent documented

100%

100%

Separate category results

100%

100%

Multi-run or reliability note

0%

100%

Behavioral checks used

100%

100%

Without context: $0.3916 · 2m 16s · 16 turns · 22 in / 8,350 out tokens

With context: $0.6941 · 3m 3s · 26 turns · 279 in / 11,476 out tokens

Evaluated
Agent
Claude Code

Table of Contents

Is this your skill?

If you maintain this skill, you can claim it as your own. Once claimed, you can manage eval scenarios, bundle related skills, attach documentation or rules, and ensure cross-agent compatibility.