Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
- Quality: 47% (Does it follow best practices?)
- Impact: Pending (no eval scenarios have been run)
- Validation: Passed (no known issues)
Optimize this skill with Tessl:

```
npx tessl skill review --optimize ./plugins/customaize-agent/skills/agent-evaluation/SKILL.md
```

Quality
Discovery: 67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description adequately covers both what the skill does and when to use it, earning strong marks on completeness. However, the specific actions remain somewhat abstract (evaluate, improve, test, validate, measure) without detailing concrete operations, and the trigger terms could be broader to capture more natural user phrasings. The skill's niche is reasonably distinct but could be sharpened to avoid overlap with general prompt engineering or code review skills.
Suggestions

- Add more concrete actions such as 'run prompt comparisons', 'score output quality', and 'benchmark skill performance' to increase specificity.
- Expand trigger terms to include natural variations like 'prompt testing', 'prompt quality', 'benchmark', 'skill validation', and 'agent evaluation' to improve discoverability (see the sketch after this list).
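Taken together, the two suggestions might yield frontmatter along these lines (an illustrative sketch only; the skill name is assumed from its path, and the wording simply merges the trigger terms proposed above into the existing description):

```yaml
---
name: agent-evaluation
description: >
  Evaluate and improve Claude Code commands, skills, and agents. Run prompt
  comparisons, score output quality, and benchmark skill performance. Use for
  prompt testing, prompt quality checks, skill validation, agent evaluation,
  and measuring improvement quality.
---
```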
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (Claude Code commands, skills, agents) and some actions (evaluate, improve, testing, validating, measuring), but the actions remain fairly high-level without listing concrete operations like 'run A/B comparisons', 'score prompt outputs', or 'generate improvement reports'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (evaluate and improve Claude Code commands, skills, and agents) and 'when' (use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality) with explicit trigger guidance. | 3 / 3 |
| Trigger Term Quality | Includes relevant terms like 'prompt effectiveness', 'context engineering', 'commands', 'skills', 'agents', but misses common natural variations users might say such as 'prompt testing', 'prompt evaluation', 'benchmark', 'prompt quality', or 'skill validation'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on Claude Code commands/skills/agents is somewhat distinctive, but terms like 'evaluate', 'improve', and 'prompt effectiveness' could overlap with general code review or prompt engineering skills. The niche is moderately clear but not sharply delineated. | 2 / 3 |
| Total | | 9 / 12 Passed |
Implementation: 27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads like a comprehensive textbook on LLM evaluation rather than a concise, actionable skill for Claude. Its greatest weakness is extreme verbosity — it explains concepts Claude already knows (statistical metrics, evaluation theory), repeats key points multiple times, and inlines what should be 3-4 separate reference files into one massive document. The prompt templates and workflow patterns have value but are buried in excessive explanatory content.
Suggestions

- Split into SKILL.md (overview + primary workflow, ~100 lines) with separate reference files: BIAS_MITIGATION.md, METRICS.md, PROMPT_PATTERNS.md, and EXAMPLES.md.
- Remove explanations of concepts Claude already knows: basic statistics (precision, recall, F1, Spearman's ρ, Cohen's κ), what PDF is, how libraries work, etc. Just reference them by name.
- Eliminate repeated content: the 'chain-of-thought improves reliability by 15-25%' claim appears three times, the multi-dimensional rubric is essentially presented twice, and the bias mitigation summaries are duplicated.
- Add a clear decision tree at the top ('What are you evaluating?' → specific workflow to follow) so Claude can quickly navigate to the relevant section rather than reading 800+ lines; see the sketch after this list.
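One possible shape for that restructuring (a hypothetical layout: the reference file names come from the first suggestion, and the decision-tree branches are guesses based on the workflows the review mentions below):

```text
agent-evaluation/
├── SKILL.md                 # overview + decision tree + primary workflow (~100 lines)
└── references/
    ├── BIAS_MITIGATION.md
    ├── METRICS.md
    ├── PROMPT_PATTERNS.md
    └── EXAMPLES.md

What are you evaluating?
├── A new command or skill        → primary workflow (SKILL.md)
├── Two prompt variants           → pairwise comparison (references/PROMPT_PATTERNS.md)
└── A change to an existing one   → regression testing (references/EXAMPLES.md)
```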
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | This skill is extremely verbose at ~800+ lines, with massive amounts of content that Claude already knows (what precision/recall are, how correlation works, basic evaluation concepts). It repeats itself frequently (e.g., 'chain-of-thought improves reliability by 15-25%' appears multiple times, the multi-dimensional rubric is essentially repeated, and the bias mitigation tables are duplicated). Entire sections like the Metric Selection Guide explain basic statistics concepts Claude already understands. | 1 / 3 |
| Actionability | The skill provides some concrete prompt templates and Python code examples, but much of the code is pseudocode-level (e.g., `await compare()` and `await extract_claims()` reference undefined functions). The prompt templates are useful and copy-paste ready, but the overall guidance is more conceptual and educational than executable: it describes evaluation approaches rather than providing a tight, actionable workflow Claude can follow. One way to make `compare()` concrete is sketched after this table. | 2 / 3 |
| Workflow Clarity | Several workflows are listed (testing a new command, comparing prompt variants, regression testing) with numbered steps, which is helpful. However, validation checkpoints are mostly implicit rather than explicit, and the sheer volume of content makes it hard to identify which workflow to follow when. There's no clear decision point for 'when to use this skill' or a primary workflow path; it reads more like a textbook than an operational guide. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no bundle files and no references to external documents. Content that should be split into separate reference files (bias mitigation techniques, metric selection guide, implementation patterns) is all inlined into a single massive document. The inline section headers suggest these were meant to be separate files but were collapsed into one, creating an overwhelming single document. | 1 / 3 |
| Total | | 6 / 12 Passed |
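As an example of tightening that pseudocode, the undefined `compare()` could become a small, self-contained helper. The sketch below is an assumption about what the skill intends, not its actual code: the `judge` callable, the prompt template, and the position-swap step (a standard bias-mitigation technique for pairwise judging) are all illustrative.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

# Hypothetical template; the skill's real prompt templates would slot in here.
_TEMPLATE = (
    "Task prompt:\n{prompt}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Which response better satisfies the task? Answer 'A', 'B', or 'tie'."
)


def compare(prompt: str, output_a: str, output_b: str,
            judge: Callable[[str], Verdict]) -> Verdict:
    """Pairwise comparison with position-swap debiasing.

    `judge` is any callable that sends one prompt to an LLM and parses the
    verdict; it is injected so this helper stays model-agnostic.
    """
    first = judge(_TEMPLATE.format(prompt=prompt, a=output_a, b=output_b))
    # Re-run with the candidates swapped, then undo the swap, so a judge
    # that favors one position cancels itself out.
    rerun = judge(_TEMPLATE.format(prompt=prompt, a=output_b, b=output_a))
    rerun = {"A": "B", "B": "A", "tie": "tie"}[rerun]
    return first if first == rerun else "tie"  # disagreement counts as a tie


if __name__ == "__main__":
    # Stub judge for demonstration; a real judge would call a model API.
    always_a = lambda _: "A"
    print(compare("Summarize X.", "terse summary", "thorough summary", always_a))
```

Injecting `judge` keeps the helper model-agnostic, so the same debiasing logic works whether verdicts come from an API call or a stub in tests.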
Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 10 / 11 passed
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (1711 lines); consider splitting into references/ and linking | Warning |
| Total | | 10 / 11 Passed |