
# anthropic-evaluations

This skill should be used when the user asks to "create evals", "evaluate an agent", "build evaluation suite", or mentions agent testing, graders, or benchmarks. Also suggest when building coding agents, conversational agents, or research agents that need quality assurance.

**Invalid:** this skill can't be scored yet. Validation errors are blocking scoring; review and fix them to unlock the Quality, Impact, and Security scores.

## Evaluation results

### Evaluation Suite for a Python Bug-Fix Agent

**Coding agent grader selection** · overall: 96% / 40%

| Criteria | Without context | With context |
| --- | --- | --- |
| Deterministic tests primary | 66% | 100% |
| LLM rubric for quality | 100% | 100% |
| Static analysis: ruff | 0% | 100% |
| Static analysis: mypy | 0% | 100% |
| Static analysis: bandit | 0% | 100% |
| State check present | 30% | 100% |
| Transcript metrics tracked | 62% | 100% |
| Latency metrics tracked | 62% | 100% |
| tool_calls not over-specified | 100% | 70% |
| Rationale grader ordering | 100% | 100% |
| Rationale outcome grading | 71% | 85% |
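The grader mix in this table (deterministic tests as the primary grader, with ruff, mypy, and bandit as secondary static-analysis checks) can be sketched roughly as follows. This is an illustrative sketch, not a real eval-framework API: `grade_patch`, the sample `add` function, and the grader-result dict are all hypothetical, and the static-analysis tools are only invoked if they happen to be installed.

```python
# Hedged sketch: deterministic tests run first as the primary grader for a
# Python bug-fix agent; static-analysis graders (ruff, mypy, bandit) run
# second and are skipped when the tool is not installed.
import shutil
import subprocess
import tempfile


def grade_patch(patched_source: str) -> dict:
    results = {}

    # 1. Primary grader: deterministic tests against the candidate fix.
    namespace = {}
    exec(patched_source, namespace)  # load the agent's patched code
    add = namespace["add"]
    results["tests_pass"] = add(2, 3) == 5 and add(-1, 1) == 0

    # 2. Secondary graders: static analysis on the patched file.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(patched_source)
        path = f.name
    commands = {
        "ruff": ["ruff", "check", path],
        "mypy": ["mypy", path],
        "bandit": [# bandit scans a single file when given its path
                   "bandit", path],
    }
    for tool, cmd in commands.items():
        if shutil.which(tool):
            proc = subprocess.run(cmd, capture_output=True)
            results[tool] = proc.returncode == 0
        else:
            results[tool] = None  # tool unavailable; grader skipped

    return results


if __name__ == "__main__":
    fixed = "def add(a, b):\n    return a + b\n"
    print(grade_patch(fixed)["tests_pass"])  # True
```

The ordering mirrors the "rationale grader ordering" criterion: cheap deterministic checks gate first, and static analysis supplements rather than replaces them.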

### Evaluation Design for a Subscription Cancellation Support Agent

**Conversational agent eval patterns** · overall: 100% / 9%

| Criteria | Without context | With context |
| --- | --- | --- |
| llm_rubric as primary grader | 100% | 100% |
| Natural language assertions | 100% | 100% |
| Simulated user persona | 100% | 100% |
| Transcript max_turns constraint | 100% | 100% |
| State check for outcome | 100% | 100% |
| Multi-dimensional success | 100% | 100% |
| Transcript metrics tracked | 71% | 100% |
| Latency metrics tracked | 0% | 100% |
| Design note: persona rationale | 100% | 100% |
| Design note: grader complementarity | 100% | 100% |
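The conversational pattern graded above (a simulated user persona driving the dialogue, a `max_turns` cap on the transcript, and a state check on the outcome) can be sketched as below. Every name here is illustrative: `simulated_user`, `agent_under_test`, and `run_eval` are hypothetical stand-ins, not any real framework's API, and the llm_rubric grading step is only indicated in a comment.

```python
# Hedged sketch: a scripted persona (frustrated customer who wants to
# cancel) talks to a stand-in agent; the loop enforces max_turns, and
# success is a state check on the final subscription status.
MAX_TURNS = 6


def simulated_user(turn: int) -> str:
    # Scripted persona: insists on cancelling, declines retention offers.
    script = [
        "I want to cancel my subscription.",
        "No, I don't want a discount, just cancel it.",
        "Yes, I'm sure. Please confirm.",
    ]
    return script[min(turn, len(script) - 1)]


def agent_under_test(message: str, state: dict) -> str:
    # Stand-in for the real agent: cancels once the user confirms.
    if "confirm" in message.lower() or "sure" in message.lower():
        state["subscription_active"] = False
        return "Your subscription has been cancelled."
    return "I can help with that. Are you sure you want to cancel?"


def run_eval() -> dict:
    state = {"subscription_active": True}
    transcript = []
    for turn in range(MAX_TURNS):  # transcript max_turns constraint
        user_msg = simulated_user(turn)
        reply = agent_under_test(user_msg, state)
        transcript.append((user_msg, reply))
        if not state["subscription_active"]:
            break
    return {
        # State check for outcome: the cancellation actually happened.
        "state_check": state["subscription_active"] is False,
        "turns_used": len(transcript),
        "within_max_turns": len(transcript) <= MAX_TURNS,
        # An llm_rubric grader scoring tone and empathy over the
        # transcript would complement this state check; omitted here.
    }


if __name__ == "__main__":
    print(run_eval())
```

The state check and the rubric grade complementary dimensions, which is what the "grader complementarity" design note in the table refers to: the state check proves the task outcome, the rubric judges how it was reached.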

### Eval Strategy Document for a High-Stakes Customer-Facing AI Assistant

**Non-determinism metrics and eval classification** · overall: 100% / 50%

| Criteria | Without context | With context |
| --- | --- | --- |
| pass@k definition | 90% | 100% |
| pass^k definition | 40% | 100% |
| Numeric pass@k example | 100% | 100% |
| Numeric pass^k example | 25% | 100% |
| Appropriate use of pass@k | 87% | 100% |
| Appropriate use of pass^k | 75% | 100% |
| Capability vs regression distinction | 40% | 100% |
| Saturation response | 30% | 100% |
| Balanced problem set guidance | 30% | 100% |
| Domain-specific balanced example | 50% | 100% |
| LLM judge Unknown option | 0% | 100% |
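For readers unfamiliar with the pass@k and pass^k criteria above: under the standard definitions, if each independent attempt succeeds with probability p, then pass@k is the probability that at least one of k attempts passes (a capability measure), while pass^k is the probability that all k attempts pass (a reliability measure, the stricter bar appropriate for a high-stakes customer-facing assistant). A minimal numeric sketch, assuming independent attempts:

```python
# Hedged sketch of the standard pass@k / pass^k formulas for independent
# attempts with per-attempt success probability p:
#   pass@k = 1 - (1 - p)**k   -> at least one of k attempts passes
#   pass^k = p**k             -> all k attempts pass


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


if __name__ == "__main__":
    p, k = 0.5, 3
    print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # 0.875
    print(f"pass^{k} = {pass_pow_k(p, k):.3f}")  # 0.125
```

The gap between the two numbers for the same p is why the classification matters: an agent that looks strong on pass@k can still be far too unreliable for production if pass^k is what the deployment actually requires.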

**Repository:** dwmkerr/claude-toolkit
**Evaluated agent:** Claude Code
**Model:** Claude Sonnet 4.6

