
# anthropic-evaluations

This skill should be used when the user asks to "create evals", "evaluate an agent", "build evaluation suite", or mentions agent testing, graders, or benchmarks. Also suggest when building coding agents, conversational agents, or research agents that need quality assurance.

**Invalid:** this skill can't be scored yet. Validation errors are blocking scoring; review and fix them to unlock the Quality, Impact, and Security scores.

## Evaluation results

### Evaluation Suite for a Python Bug-Fix Agent

**Coding agent grader selection** · overall: 96% / 40%

| Criteria | Without context | With context |
| --- | --- | --- |
| Deterministic tests primary | 66% | 100% |
| LLM rubric for quality | 100% | 100% |
| Static analysis: ruff | 0% | 100% |
| Static analysis: mypy | 0% | 100% |
| Static analysis: bandit | 0% | 100% |
| State check present | 30% | 100% |
| Transcript metrics tracked | 62% | 100% |
| Latency metrics tracked | 62% | 100% |
| tool_calls not over-specified | 100% | 70% |
| Rationale grader ordering | 100% | 100% |
| Rationale outcome grading | 71% | 85% |
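The grader mix in this table (deterministic tests as the primary grader, with ruff, mypy, and bandit as secondary static-analysis checks) can be sketched roughly as follows. This is an illustrative sketch, not a real eval-framework API: `grade_patch`, the sample `add` function, and the grader-result dict are all hypothetical, and the static-analysis tools are only invoked if they happen to be installed.

```python
# Hedged sketch: deterministic tests run first as the primary grader for a
# Python bug-fix agent; static-analysis graders (ruff, mypy, bandit) run
# second and are skipped when the tool is not installed.
import shutil
import subprocess
import tempfile


def grade_patch(patched_source: str) -> dict:
    results = {}

    # 1. Primary grader: deterministic tests against the candidate fix.
    namespace = {}
    exec(patched_source, namespace)  # load the agent's patched code
    add = namespace["add"]
    results["tests_pass"] = add(2, 3) == 5 and add(-1, 1) == 0

    # 2. Secondary graders: static analysis on the patched file.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(patched_source)
        path = f.name
    commands = {
        "ruff": ["ruff", "check", path],
        "mypy": ["mypy", path],
        "bandit": [# bandit scans a single file when given its path
                   "bandit", path],
    }
    for tool, cmd in commands.items():
        if shutil.which(tool):
            proc = subprocess.run(cmd, capture_output=True)
            results[tool] = proc.returncode == 0
        else:
            results[tool] = None  # tool unavailable; grader skipped

    return results


if __name__ == "__main__":
    fixed = "def add(a, b):\n    return a + b\n"
    print(grade_patch(fixed)["tests_pass"])  # True
```

The ordering mirrors the "rationale grader ordering" criterion: cheap deterministic checks gate first, and static analysis supplements rather than replaces them.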

### Evaluation Design for a Subscription Cancellation Support Agent

**Conversational agent eval patterns** · overall: 100% / 9%

| Criteria | Without context | With context |
| --- | --- | --- |
| llm_rubric as primary grader | 100% | 100% |
| Natural language assertions | 100% | 100% |
| Simulated user persona | 100% | 100% |
| Transcript max_turns constraint | 100% | 100% |
| State check for outcome | 100% | 100% |
| Multi-dimensional success | 100% | 100% |
| Transcript metrics tracked | 71% | 100% |
| Latency metrics tracked | 0% | 100% |
| Design note: persona rationale | 100% | 100% |
| Design note: grader complementarity | 100% | 100% |
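The conversational pattern graded above (a simulated user persona driving the dialogue, a `max_turns` cap on the transcript, and a state check on the outcome) can be sketched as below. Every name here is illustrative: `simulated_user`, `agent_under_test`, and `run_eval` are hypothetical stand-ins, not any real framework's API, and the llm_rubric grading step is only indicated in a comment.

```python
# Hedged sketch: a scripted persona (frustrated customer who wants to
# cancel) talks to a stand-in agent; the loop enforces max_turns, and
# success is a state check on the final subscription status.
MAX_TURNS = 6


def simulated_user(turn: int) -> str:
    # Scripted persona: insists on cancelling, declines retention offers.
    script = [
        "I want to cancel my subscription.",
        "No, I don't want a discount, just cancel it.",
        "Yes, I'm sure. Please confirm.",
    ]
    return script[min(turn, len(script) - 1)]


def agent_under_test(message: str, state: dict) -> str:
    # Stand-in for the real agent: cancels once the user confirms.
    if "confirm" in message.lower() or "sure" in message.lower():
        state["subscription_active"] = False
        return "Your subscription has been cancelled."
    return "I can help with that. Are you sure you want to cancel?"


def run_eval() -> dict:
    state = {"subscription_active": True}
    transcript = []
    for turn in range(MAX_TURNS):  # transcript max_turns constraint
        user_msg = simulated_user(turn)
        reply = agent_under_test(user_msg, state)
        transcript.append((user_msg, reply))
        if not state["subscription_active"]:
            break
    return {
        # State check for outcome: the cancellation actually happened.
        "state_check": state["subscription_active"] is False,
        "turns_used": len(transcript),
        "within_max_turns": len(transcript) <= MAX_TURNS,
        # An llm_rubric grader scoring tone and empathy over the
        # transcript would complement this state check; omitted here.
    }


if __name__ == "__main__":
    print(run_eval())
```

The state check and the rubric grade complementary dimensions, which is what the "grader complementarity" design note in the table refers to: the state check proves the task outcome, the rubric judges how it was reached.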

### Eval Strategy Document for a High-Stakes Customer-Facing AI Assistant

**Non-determinism metrics and eval classification** · overall: 100% / 50%

| Criteria | Without context | With context |
| --- | --- | --- |
| pass@k definition | 90% | 100% |
| pass^k definition | 40% | 100% |
| Numeric pass@k example | 100% | 100% |
| Numeric pass^k example | 25% | 100% |
| Appropriate use of pass@k | 87% | 100% |
| Appropriate use of pass^k | 75% | 100% |
| Capability vs regression distinction | 40% | 100% |
| Saturation response | 30% | 100% |
| Balanced problem set guidance | 30% | 100% |
| Domain-specific balanced example | 50% | 100% |
| LLM judge Unknown option | 0% | 100% |
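For readers unfamiliar with the pass@k and pass^k criteria above: under the standard definitions, if each independent attempt succeeds with probability p, then pass@k is the probability that at least one of k attempts passes (a capability measure), while pass^k is the probability that all k attempts pass (a reliability measure, the stricter bar appropriate for a high-stakes customer-facing assistant). A minimal numeric sketch, assuming independent attempts:

```python
# Hedged sketch of the standard pass@k / pass^k formulas for independent
# attempts with per-attempt success probability p:
#   pass@k = 1 - (1 - p)**k   -> at least one of k attempts passes
#   pass^k = p**k             -> all k attempts pass


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


if __name__ == "__main__":
    p, k = 0.5, 3
    print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # 0.875
    print(f"pass^{k} = {pass_pow_k(p, k):.3f}")  # 0.125
```

The gap between the two numbers for the same p is why the classification matters: an agent that looks strong on pass@k can still be far too unreliable for production if pass^k is what the deployment actually requires.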

**Repository:** dwmkerr/claude-toolkit
**Evaluated agent:** Claude Code
**Model:** Claude Sonnet 4.6

