This skill should be used when the user asks to "create evals", "evaluate an agent", "build evaluation suite", or mentions agent testing, graders, or benchmarks. Also suggest this skill when the user is building coding agents, conversational agents, or research agents that need quality assurance.
## Coding agent grader selection

| Eval check | Baseline | With skill |
| --- | --- | --- |
| Deterministic tests primary | 66% | 100% |
| LLM rubric for quality | 100% | 100% |
| Static analysis: ruff | 0% | 100% |
| Static analysis: mypy | 0% | 100% |
| Static analysis: bandit | 0% | 100% |
| State check present | 30% | 100% |
| Transcript metrics tracked | 62% | 100% |
| Latency metrics tracked | 62% | 100% |
| tool_calls not over-specified | 100% | 70% |
| Rationale grader ordering | 100% | 100% |
| Rationale outcome grading | 71% | 85% |
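These checks correspond to a layered grader stack. The sketch below illustrates one way to assemble it, assuming a hypothetical harness in which each grader is a plain function over the agent's working directory and transcript; the `Transcript` shape, the `CHANGELOG.md` state-check target, and the `grade` entry point are illustrative assumptions, not any particular framework's API:

```python
import os
import subprocess
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Illustrative transcript record; the field names are assumptions."""
    tool_calls: list = field(default_factory=list)
    latency_s: float = 0.0


def deterministic_tests(workdir: str) -> bool:
    """Primary grader: the task's test suite must pass (exit code 0)."""
    return subprocess.run(["pytest", "-q"], cwd=workdir).returncode == 0


def static_analysis(workdir: str) -> bool:
    """Secondary graders: lint (ruff), types (mypy), security (bandit)."""
    checks = (["ruff", "check", "."], ["mypy", "."], ["bandit", "-q", "-r", "."])
    return all(subprocess.run(cmd, cwd=workdir).returncode == 0 for cmd in checks)


def state_check(workdir: str) -> bool:
    """Outcome check against the environment; the target file is hypothetical."""
    return os.path.exists(os.path.join(workdir, "CHANGELOG.md"))


def grade(workdir: str, transcript: Transcript) -> dict:
    """Run cheap deterministic graders first; record metrics without gating."""
    return {
        "tests": deterministic_tests(workdir),
        "static_analysis": static_analysis(workdir),
        "state_check": state_check(workdir),
        # Track, but do not assert, exact tool usage and latency: gating on
        # precise tool_calls over-specifies how the agent must do its work.
        "metrics": {
            "tool_calls": len(transcript.tool_calls),
            "latency_s": transcript.latency_s,
        },
    }
```

Deterministic graders come first because they are cheap and unambiguous; an LLM rubric for code quality would run last, and only on outputs that already pass them.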
## Conversational agent eval patterns

| Eval check | Baseline | With skill |
| --- | --- | --- |
| llm_rubric as primary grader | 100% | 100% |
| Natural language assertions | 100% | 100% |
| Simulated user persona | 100% | 100% |
| Transcript max_turns constraint | 100% | 100% |
| State check for outcome | 100% | 100% |
| Multi-dimensional success | 100% | 100% |
| Transcript metrics tracked | 71% | 100% |
| Latency metrics tracked | 0% | 100% |
| Design note: persona rationale | 100% | 100% |
| Design note: grader complementarity | 100% | 100% |
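As a sketch of how these patterns compose, consider a hypothetical conversational eval for a refund scenario. The scenario, the schema, and every key below are illustrative assumptions chosen to mirror the checks in the table (simulated user persona, max_turns budget, llm_rubric of natural-language assertions, complementary state check, multi-dimensional success):

```python
# Hypothetical eval definition; keys mirror the patterns in the table above.
refund_eval = {
    "simulated_user": {
        # Persona with a stated rationale: constraints shape realistic turns.
        "persona": "Frustrated customer; order #1234 arrived damaged; "
                   "wants a refund but will accept store credit",
        "max_turns": 10,  # transcript constraint on conversation length
    },
    "graders": {
        # Primary grader: LLM rubric built from natural-language assertions.
        "llm_rubric": [
            "Agent acknowledges the damaged order before proposing remedies",
            "Agent offers a refund or store credit without being asked twice",
            "Agent never promises actions outside the refund policy",
        ],
        # Complementary state check: did the outcome actually happen?
        "state_check": "refund_or_credit_issued(order_id=1234)",
    },
    # Multi-dimensional success: every dimension must hold, not one score.
    "success": ["llm_rubric >= 0.8", "state_check == True", "turns <= 10"],
    # Metrics tracked but not gated on.
    "metrics": ["transcript_length", "latency_per_turn"],
}
```

The rubric and the state check are complementary: the rubric grades how the conversation went, while the state check verifies that the promised outcome actually occurred.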
## Non-determinism metrics and eval classification

| Eval check | Baseline | With skill |
| --- | --- | --- |
| pass@k definition | 90% | 100% |
| pass^k definition | 40% | 100% |
| Numeric pass@k example | 100% | 100% |
| Numeric pass^k example | 25% | 100% |
| Appropriate use of pass@k | 87% | 100% |
| Appropriate use of pass^k | 75% | 100% |
| Capability vs regression distinction | 40% | 100% |
| Saturation response | 30% | 100% |
| Balanced problem set guidance | 30% | 100% |
| Domain-specific balanced example | 50% | 100% |
| LLM judge Unknown option | 0% | 100% |
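The metric rows reduce to two formulas for a task with per-trial pass probability p over k independent trials: pass@k = 1 - (1 - p)^k, the probability that at least one trial succeeds (natural for capability evals, where any success counts), and pass^k = p^k, the probability that every trial succeeds (natural for regression evals, where reliability is the point). A small worked example:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent trials passes): 1 - (1 - p)^k."""
    return 1 - (1 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """P(all k independent trials pass): p^k."""
    return p ** k


# Numeric examples: a task the agent passes 70% of the time, k = 3 trials.
print(pass_at_k(0.7, 3))   # ~0.973: looks strong as a capability eval
print(pass_pow_k(0.7, 3))  # ~0.343: the same agent is unreliable in regression
```

The same 70%-reliable agent scores roughly 97% under pass@k but only about 34% under pass^k, which is the substance of the capability-vs-regression distinction tracked above.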