
agent-evaluation

Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, 8-step roadmap, and production integration.

Quality: 88% (Does it follow best practices?)
Impact: 81%, 1.05x (average score across 3 eval scenarios)

Security (by Snyk): Passed. No known issues.


Evaluation results

Set Up an Eval Suite for a Python Code Generation Agent
Coding agent eval design. Overall: 79% (+9%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Correct directory structure | 100% | 75% |
| Code-based grader chosen | 100% | 100% |
| Test passage rate metric | 100% | 100% |
| Build success metric | 100% | 100% |
| Lint or style metric | 0% | 0% |
| Diff size tracked | 0% | 0% |
| Suite balance documented | 50% | 100% |
| Correct positive/negative/edge ratio | 0% | 100% |
| QA cases converted | 100% | 100% |
| Grader type from has_tests | 62% | 50% |
| Outcome-focused grader | 100% | 100% |
| Environment isolation config | 100% | 87% |
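The "Code-based grader chosen" and "Grader type from has_tests" criteria above imply a simple selection rule plus outcome-based scoring. A minimal Python sketch, with all names assumed for illustration (not from any specific eval framework): the grader type is picked from whether the task ships executable tests, and boolean outcome checks are averaged into a per-case score.

```python
# Sketch (assumed names): pick a grader type per task and aggregate
# outcome checks into one score for a coding-agent eval case.

def choose_grader(task: dict) -> str:
    # Code-based grading only works when the task ships executable
    # tests; otherwise a model-based rubric is the usual fallback.
    return "code" if task.get("has_tests") else "model"

def case_score(checks: dict) -> float:
    # Outcome checks (tests pass, build succeeds, lint clean, ...)
    # weigh equally; metrics such as diff size would be tracked
    # separately rather than folded into the pass/fail score.
    return sum(checks.values()) / len(checks)

tasks = [{"id": 1, "has_tests": True}, {"id": 2, "has_tests": False}]
print([choose_grader(t) for t in tasks])  # ['code', 'model']
print(round(case_score({"tests_pass": True,
                        "build_ok": True,
                        "lint_ok": False}), 2))  # 0.67
```

Keeping the score a plain average of observable outcomes matches the "outcome-focused grader" criterion: nothing about the agent's intermediate steps enters the grade.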

Design an Evaluation System for a Customer Support Chatbot
Conversational agent grading rubric. Overall: 65% (-9%)

| Criteria | Without context | With context |
| --- | --- | --- |
| Model-based grader chosen | 100% | 100% |
| Empathy dimension present | 100% | 100% |
| Resolution dimension present | 100% | 100% |
| Efficiency dimension present | 100% | 100% |
| Correct dimension weights | 50% | 70% |
| Empathy threshold | 37% | 0% |
| Resolution rate threshold | 25% | 0% |
| Average turns threshold | 37% | 0% |
| Escalation rate threshold | 37% | 0% |
| Calibration noted | 100% | 100% |
| Outcome-focused criteria | 100% | 100% |
| Sample tasks from transcripts | 100% | 100% |
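The rubric criteria above (empathy, resolution, and efficiency dimensions with weights and outcome thresholds) can be sketched as a weighted aggregator over judge scores. The dimension weights and threshold values below are illustrative assumptions, not the skill's calibrated numbers; per-dimension scores would come from an LLM judge in practice.

```python
# Sketch of a model-based grading rubric for a support chatbot.
# Weights and thresholds are assumptions chosen for illustration.

WEIGHTS = {"empathy": 0.3, "resolution": 0.5, "efficiency": 0.2}

def rubric_score(dims: dict) -> float:
    # Weighted average of 0-1 dimension scores from the judge model.
    return sum(WEIGHTS[d] * s for d, s in dims.items())

def passes(dims: dict, turns: int, escalated: bool) -> bool:
    # Outcome thresholds (assumed, pending calibration):
    # empathy >= 0.7, resolution >= 0.8, <= 6 turns, no escalation.
    return (dims["empathy"] >= 0.7 and dims["resolution"] >= 0.8
            and turns <= 6 and not escalated)

score = rubric_score({"empathy": 0.9, "resolution": 0.8, "efficiency": 0.7})
print(round(score, 2))  # 0.81
```

Weighting resolution highest reflects the "outcome-focused criteria" row: whether the customer's issue was actually resolved matters more than how the conversation felt along the way.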

Operationalize an AI Agent Eval System for Production
Production monitoring and eval saturation. Overall: 100% (+11%)

| Criteria | Without context | With context |
| --- | --- | --- |
| CI/CD trigger events | 100% | 100% |
| Results uploaded as artifacts | 100% | 100% |
| 10% production sample rate | 100% | 100% |
| Low-score alert | 100% | 100% |
| Saturation sliding window | 37% | 100% |
| Variance calculated | 100% | 100% |
| Saturation recommendation | 87% | 100% |
| Saturation correctly identified | 100% | 100% |
| A/B winner determination | 100% | 100% |
| Weekly maintenance items | 100% | 100% |
| Monthly maintenance items | 71% | 100% |
| Quarterly maintenance items | 62% | 100% |
| No intermediate step grading | 100% | 100% |
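The saturation criteria above (sliding window, variance, recommendation) can be sketched as a check over recent run scores: when the window mean sits near the ceiling and the variance collapses, the suite has stopped discriminating between agent versions and harder cases are warranted. The window size and both thresholds below are assumptions for illustration.

```python
from statistics import pvariance

# Sketch of eval-saturation detection over a sliding window of run
# scores. WINDOW, MEAN_CAP, and VAR_FLOOR are assumed values.
WINDOW, MEAN_CAP, VAR_FLOOR = 10, 0.95, 0.001

def is_saturated(scores: list) -> bool:
    window = scores[-WINDOW:]
    if len(window) < WINDOW:
        return False  # not enough history to judge yet
    mean = sum(window) / len(window)
    # Saturated = near-ceiling mean AND near-zero variance: every
    # run scores about the same, so the suite no longer discriminates.
    return mean >= MEAN_CAP and pvariance(window) <= VAR_FLOOR

recent = [0.97, 0.98, 0.97, 0.99, 0.98, 0.97, 0.98, 0.99, 0.98, 0.97]
print(is_saturated(recent))  # True
```

A saturation recommendation would then fire from this flag, e.g. "retire or harden the top-scoring cases", while the variance itself is reported as its own monitoring metric.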

Repository: supercent-io/skills-template
Evaluated with: Claude Code (agent), Claude Sonnet 4.6 (model)

