Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, an 8-step roadmap, and production integration.
- Does it follow best practices? 88%
- Impact: 1.05x (81% average score across 3 eval scenarios)
- Status: Passed, no known issues
Coding agent eval design

| Criterion | Without skill | With skill |
|---|---|---|
| Correct directory structure | 100% | 75% |
| Code-based grader chosen | 100% | 100% |
| Test passage rate metric | 100% | 100% |
| Build success metric | 100% | 100% |
| Lint or style metric | 0% | 0% |
| Diff size tracked | 0% | 0% |
| Suite balance documented | 50% | 100% |
| Correct positive/negative/edge ratio | 0% | 100% |
| QA cases converted | 100% | 100% |
| Grader type from has_tests | 62% | 50% |
| Outcome-focused grader | 100% | 100% |
| Environment isolation config | 100% | 87% |
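Taken together, these criteria describe a code-based grader: pick the grader type from a `has_tests` flag, then score build success and test passage rate, with each task run in an isolated environment. A minimal Python sketch of that shape; the names here (`grade_task`, the `make build` command, the `test_cmds` list) are illustrative assumptions, not part of the skill:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class TaskResult:
    grader: str            # "code" or "model"
    build_ok: bool
    test_pass_rate: float  # fraction of test commands that exited 0

def grade_task(workdir: str, has_tests: bool, test_cmds: list[list[str]]) -> TaskResult:
    # Grader type from has_tests: tasks that ship a runnable suite get the
    # code-based grader; tasks without one fall back to a model-based grade.
    grader = "code" if has_tests else "model"

    # Build success metric: the project must build cleanly before tests count.
    build = subprocess.run(["make", "build"], cwd=workdir, capture_output=True)
    build_ok = build.returncode == 0

    # Test passage rate metric: fraction of test commands that pass.
    passed = 0
    if has_tests and build_ok:
        for cmd in test_cmds:
            if subprocess.run(cmd, cwd=workdir, capture_output=True).returncode == 0:
                passed += 1
    rate = passed / len(test_cmds) if test_cmds else 0.0
    return TaskResult(grader=grader, build_ok=build_ok, test_pass_rate=rate)
```

The lint/style and diff-size criteria (0% in both runs) would extend the same pattern with additional subprocess checks; environment isolation would wrap each `grade_task` call in a fresh container or virtualenv so one task's side effects cannot leak into another's score.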
Conversational agent grading rubric

| Criterion | Without skill | With skill |
|---|---|---|
| Model-based grader chosen | 100% | 100% |
| Empathy dimension present | 100% | 100% |
| Resolution dimension present | 100% | 100% |
| Efficiency dimension present | 100% | 100% |
| Correct dimension weights | 50% | 70% |
| Empathy threshold | 37% | 0% |
| Resolution rate threshold | 25% | 0% |
| Average turns threshold | 37% | 0% |
| Escalation rate threshold | 37% | 0% |
| Calibration noted | 100% | 100% |
| Outcome-focused criteria | 100% | 100% |
| Sample tasks from transcripts | 100% | 100% |
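The rubric criteria describe a model-based grader that scores each transcript on empathy, resolution, and efficiency, combines the dimensions with fixed weights, and gates the result on outcome thresholds (resolution rate, average turns, escalation rate). A sketch of that aggregation; the weights and thresholds below are placeholders, since the eval checks for specific values that are not reproduced on this page:

```python
# Placeholder weights and thresholds; the eval expects specific values
# that are not given here.
WEIGHTS = {"empathy": 0.3, "resolution": 0.5, "efficiency": 0.2}
THRESHOLDS = {"empathy": 0.7, "resolution_rate": 0.8,
              "avg_turns": 10.0, "escalation_rate": 0.1}

def aggregate(dims: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each 0.0-1.0) from the judge model."""
    return sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)

def passes(dims: dict[str, float], resolution_rate: float,
           avg_turns: float, escalation_rate: float) -> bool:
    """Outcome-focused gate: dimension quality plus conversation-level thresholds."""
    return (dims["empathy"] >= THRESHOLDS["empathy"]
            and resolution_rate >= THRESHOLDS["resolution_rate"]
            and avg_turns <= THRESHOLDS["avg_turns"]
            and escalation_rate <= THRESHOLDS["escalation_rate"])
```

The "Calibration noted" criterion points at the usual guard for judge models: periodically checking their scores against human-graded transcripts so the rubric does not drift.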
Production monitoring and eval saturation

| Criterion | Without skill | With skill |
|---|---|---|
| CI/CD trigger events | 100% | 100% |
| Results uploaded as artifacts | 100% | 100% |
| 10% production sample rate | 100% | 100% |
| Low-score alert | 100% | 100% |
| Saturation sliding window | 37% | 100% |
| Variance calculated | 100% | 100% |
| Saturation recommendation | 87% | 100% |
| Saturation correctly identified | 100% | 100% |
| A/B winner determination | 100% | 100% |
| Weekly maintenance items | 100% | 100% |
| Monthly maintenance items | 71% | 100% |
| Quarterly maintenance items | 62% | 100% |
| No intermediate step grading | 100% | 100% |
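The saturation checks imply a sliding window over recent eval scores: compute the variance, and when the mean sits at the ceiling with near-zero variance, recommend refreshing the eval because it no longer discriminates between candidates. A sketch with assumed window size and cutoffs, since the criteria name the mechanism but not the exact parameters:

```python
from statistics import mean, pvariance

# Assumed parameters: the criteria require a sliding window and a variance
# check, but the exact cutoffs are not given on this page.
WINDOW = 10          # recent eval runs to inspect
MEAN_CEILING = 0.95  # a mean score this high leaves little headroom
VAR_FLOOR = 1e-3     # near-zero variance: every run scores the same

def is_saturated(scores: list[float]) -> bool:
    """Flag an eval whose recent scores no longer discriminate."""
    window = scores[-WINDOW:]
    if len(window) < WINDOW:
        return False  # not enough history for a verdict
    return mean(window) >= MEAN_CEILING and pvariance(window) <= VAR_FLOOR
```

The remaining criteria (CI/CD triggers, artifact uploads, 10% production sampling, low-score alerts, A/B winner determination) are the plumbing around this detector, feeding it fresh scores and acting on its recommendation.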