Skill description under review:

> This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines.
Quality: 49% — Does it follow best practices?

Impact: Pending — no eval scenarios have been run.

Validation: Passed — no known issues.
Optimize this skill with Tessl:

```shell
npx tessl skill review --optimize ./skills/evaluation/SKILL.md
```

## Quality
### Discovery — 37%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This description is essentially a list of trigger phrases without any explanation of what the skill actually does. While it excels at providing natural keywords users might say, it completely fails to describe the skill's capabilities, making it impossible for Claude to understand what actions this skill enables.
#### Suggestions

- Add a clear 'what' statement at the beginning describing concrete capabilities, e.g., 'Creates evaluation rubrics, implements LLM-as-judge patterns, builds multi-dimensional scoring frameworks, and sets up quality gates for agent pipelines.'
- Restructure to lead with capabilities, then follow with a 'Use when...' clause containing the trigger terms.
- Include specific outputs or deliverables the skill produces, such as 'generates evaluation reports', 'produces scoring matrices', or 'configures automated test suites'.
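Applying these suggestions, a restructured frontmatter description might look like the following. This is an illustrative sketch only; the exact field names and wording are assumptions, not the skill's actual metadata:

```yaml
---
name: evaluation
description: >
  Creates evaluation rubrics, implements LLM-as-judge patterns, builds
  multi-dimensional scoring frameworks, and sets up quality gates for agent
  pipelines. Produces evaluation reports, scoring matrices, and automated
  test-suite configuration. Use when the user asks to "evaluate agent
  performance", "build test framework", "measure agent quality", "create
  evaluation rubrics", or mentions LLM-as-judge, agent testing, or quality
  gates.
---
```

Note how the capabilities come first and the trigger terms move into a trailing "Use when..." clause, addressing both the Specificity and Completeness findings below.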
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | The description lacks concrete actions — it only lists trigger phrases without explaining what the skill actually does. There are no specific capabilities like 'creates rubrics', 'runs evaluations', or 'generates reports'. | 1 / 3 |
| Completeness | The description only addresses 'when' (trigger conditions) but completely omits 'what' — there is no explanation of what capabilities or actions this skill provides. | 1 / 3 |
| Trigger Term Quality | Excellent coverage of natural trigger terms users would say: 'evaluate agent performance', 'build test framework', 'measure agent quality', 'create evaluation rubrics', 'LLM-as-judge', 'agent testing', 'quality gates'. | 3 / 3 |
| Distinctiveness / Conflict Risk | The trigger terms are fairly specific to the agent-evaluation domain, but without knowing what the skill actually does, it could overlap with general testing or evaluation skills. Terms like 'test framework' could conflict with other testing skills. | 2 / 3 |
| Total | | 7 / 12 — Passed |
### Implementation — 62%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill provides comprehensive coverage of agent evaluation concepts with good workflow structure and clear sequencing. However, it is verbose, explains concepts Claude already understands, and its code examples are incomplete pseudocode rather than executable snippets. The content would benefit from being more concise and from splitting detailed topics into referenced files.
#### Suggestions

- Replace pseudocode examples with complete, executable code that can be copy-pasted (e.g., implement an actual `assess_dimension` function or use a real evaluation library).
- Remove explanatory prose about why agents differ from traditional software and other concepts Claude already knows — jump directly to actionable guidance.
- Split detailed sections (Evaluation Challenges, Rubric Design, Methodologies) into separate reference files and keep SKILL.md as a concise overview with clear navigation.
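To make the first suggestion concrete, here is a minimal, model-agnostic sketch of what an executable `load_rubric` / `assess_dimension` pair could look like. The rubric contents and the injected `judge` callable are hypothetical illustrations, not the skill's actual API; in practice `judge` would wrap a real LLM call:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Dimension:
    name: str
    criteria: str
    max_score: int

def load_rubric() -> list[Dimension]:
    # Hypothetical rubric; in practice, load from a YAML/JSON file.
    return [
        Dimension("specificity", "Names concrete actions and outputs", 3),
        Dimension("completeness", "Covers both 'what' and 'when'", 3),
    ]

def assess_dimension(output: str, dim: Dimension,
                     judge: Callable[[str], int]) -> dict:
    # 'judge' is an injected callable that returns an integer score,
    # keeping the framework independent of any particular model API.
    prompt = (f"Score the following output from 0 to {dim.max_score} "
              f"for {dim.name} ({dim.criteria}):\n\n{output}")
    score = max(0, min(dim.max_score, judge(prompt)))  # clamp to rubric range
    return {"dimension": dim.name, "score": score, "max": dim.max_score}

def evaluate(output: str, judge: Callable[[str], int]) -> dict:
    results = [assess_dimension(output, d, judge) for d in load_rubric()]
    total = sum(r["score"] for r in results)
    possible = sum(r["max"] for r in results)
    # Example quality gate: pass at 50% of the possible score.
    return {"results": results, "total": total,
            "passed": total / possible >= 0.5}
```

Because the judge is injected, the same harness can be exercised deterministically in tests (e.g., `evaluate(text, lambda prompt: 2)`) before wiring in a real model.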
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill contains useful information but is verbose in places, explaining concepts Claude likely knows (e.g., what non-determinism means, why agents differ from traditional software). The 95% finding table and some explanatory prose could be tightened. | 2 / 3 |
| Actionability | Provides some concrete guidance with code examples, but the examples are incomplete and pseudocode-like (e.g., `load_rubric()` and `assess_dimension()` are undefined). The guidance is more conceptual than copy-paste executable. | 2 / 3 |
| Workflow Clarity | The 'Building Evaluation Frameworks' section provides a clear 8-step sequence with explicit ordering rationale. The workflow includes validation concepts and explains why each step matters before proceeding. | 3 / 3 |
| Progressive Disclosure | Has a References section pointing to external resources and one internal reference, but the main content is a monolithic document with many sections that could be split into separate files. The Integration section lists connections but doesn't provide clear navigation paths. | 2 / 3 |
| Total | | 9 / 12 — Passed |
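One possible split that would address the Progressive Disclosure finding is sketched below; the file names are illustrative, not prescribed by the skill:

```
skills/evaluation/
├── SKILL.md                      # concise overview with navigation links
└── references/
    ├── evaluation-challenges.md
    ├── rubric-design.md
    └── methodologies.md
```

SKILL.md would then link to each reference file where the detail is needed, keeping the top-level document short enough to load in full.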
### Validation — 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 checks passed — skill structure validation reported no warnings or errors.