
agent-evaluation

Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, 8-step roadmap, and production integration.

Quality: 88%
Does it follow best practices?

Impact: 81% (1.05x)
Average score across 3 eval scenarios

Security (by Snyk): Passed
No known issues


Quality

Discovery: 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

This is a well-crafted skill description that excels across all dimensions. It clearly specifies concrete capabilities (evaluation systems, graders, benchmarks), includes natural trigger terms users would actually say, explicitly states both what it does and when to use it, and occupies a distinct niche in AI agent evaluation that won't conflict with other skills.

Dimension scores

Specificity: 3 / 3
Lists multiple specific, concrete actions: 'Design and implement comprehensive evaluation systems', covering 'grader types, benchmarks, 8-step roadmap, and production integration'. These are concrete, actionable capabilities.

Completeness: 3 / 3
Clearly answers both what ('Design and implement comprehensive evaluation systems... Covers grader types, benchmarks, 8-step roadmap, and production integration') and when ('Use when building evals for coding agents, conversational agents, research agents, or computer-use agents').

Trigger Term Quality: 3 / 3
Includes natural keywords users would say: 'evals', 'coding agents', 'conversational agents', 'research agents', 'computer-use agents', 'benchmarks', 'grader'. These are terms practitioners naturally use when discussing AI evaluation.

Distinctiveness / Conflict Risk: 3 / 3
Clear niche focused specifically on AI agent evaluation systems, with distinct triggers like 'evals', 'grader types', and 'benchmarks'. Unlikely to conflict with general coding or documentation skills.

Total: 12 / 12 (Passed)

Implementation: 77%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This is a strong, actionable skill with excellent code examples and clear workflow guidance for building AI agent evaluation systems. The main weaknesses are moderate verbosity (some explanatory content Claude doesn't need) and a monolithic structure that could benefit from splitting detailed sections into separate files for better progressive disclosure.

Suggestions

Remove or condense the '7 Key Terms' and 'Eval Evolution' tables: Claude already understands these concepts, and they add token overhead without unique value

Split agent-type-specific strategies (coding, conversational, research, computer-use) into separate reference files linked from the main skill

Consider moving the extensive Examples section to a separate EXAMPLES.md file with a brief link from the main document

Dimension scores

Conciseness: 2 / 3
The skill is comprehensive but includes some redundancy (e.g., the 7 Key Terms table explains concepts Claude likely knows, and some sections repeat similar information). The tables and code examples are efficient, but the overall length could be tightened.

Actionability: 3 / 3
Excellent actionability, with fully executable Python code examples, concrete YAML configurations, specific grading functions, and copy-paste-ready implementations for each agent type. The examples are complete and practical.

Workflow Clarity: 3 / 3
The 8-step roadmap provides clear sequencing with explicit checkpoints. Steps are numbered, validation is addressed (e.g., 'check_saturation', 'analyze_transcript'), and the graduated-complexity pattern provides clear progression paths.

Progressive Disclosure: 2 / 3
Content is well structured with clear sections and headers, but it is a monolithic document that could benefit from splitting detailed agent-type strategies and examples into separate reference files. The References section links to external resources, but internal progressive disclosure is limited.

Total: 10 / 12 (Passed)
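The Actionability row above credits the skill with 'specific grading functions'. As a rough, hypothetical sketch of what such graders typically look like (function names, signatures, and the pass threshold are illustrative, not taken from the skill itself):

```python
def exact_match_grader(expected: str, actual: str) -> float:
    """Binary grader: 1.0 if the agent's output matches exactly, else 0.0."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

def keyword_grader(required_keywords: list[str], actual: str) -> float:
    """Partial-credit grader: fraction of required keywords present."""
    if not required_keywords:
        return 0.0
    hits = sum(1 for kw in required_keywords if kw.lower() in actual.lower())
    return hits / len(required_keywords)

def run_scenario(grader, expected, actual, threshold: float = 0.8) -> dict:
    """Apply a grader to one scenario and report pass/fail against a threshold."""
    score = grader(expected, actual)
    return {"score": score, "passed": score >= threshold}
```

A binary grader suits tasks with a single correct answer; partial-credit graders like the keyword check are common for open-ended agent transcripts where exact matching is too brittle.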

Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation checks: 10 / 11 passed (validation for skill structure)

metadata_version: Warning
'metadata.version' is missing

Total: 10 / 11 (Passed)
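The only validation warning is the missing 'metadata.version' field. Assuming the skill declares its metadata in SKILL.md YAML frontmatter (an assumption; the exact schema is defined by the skills spec, and the version value here is a placeholder), the fix would be a fragment along these lines:

```yaml
# Hypothetical SKILL.md frontmatter; only the metadata.version
# field is what the validator asked for, the rest is illustrative.
metadata:
  version: 1.0.0
```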

Repository: supercent-io/skills-template (Reviewed)
