Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
- Quality: 47% (Does it follow best practices?)
- Impact: Pending (no eval scenarios have been run)
- Validation: Passed (no known issues)
Optimize this skill with Tessl:

```
npx tessl skill review --optimize ./plugins/customaize-agent/skills/agent-evaluation/SKILL.md
```

Quality
Discovery: 67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
The description adequately covers both what the skill does and when to use it, earning strong marks on completeness. However, the specific actions remain somewhat abstract (evaluate, improve, test, validate, measure) without detailing concrete operations, and the trigger terms could be broader to capture more natural user phrasings. The skill's niche is reasonably distinct but could be sharpened to avoid overlap with general prompt engineering or code review skills.
Suggestions

- Add more concrete actions such as 'run prompt comparisons', 'score output quality', and 'benchmark skill performance' to increase specificity.
- Expand trigger terms to include natural variations like 'prompt testing', 'prompt quality', 'benchmark', 'skill validation', and 'agent evaluation' to improve discoverability (see the sketch after this list).
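Taken together, the two suggestions might yield frontmatter along these lines (an illustrative sketch only; the skill name is assumed from its path, and the wording simply merges the trigger terms proposed above into the existing description):

```yaml
---
name: agent-evaluation
description: >
  Evaluate and improve Claude Code commands, skills, and agents. Run prompt
  comparisons, score output quality, and benchmark skill performance. Use for
  prompt testing, prompt quality checks, skill validation, agent evaluation,
  and measuring improvement quality.
---
```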
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Names the domain (Claude Code commands, skills, agents) and some actions (evaluate, improve, testing, validating, measuring), but the actions remain fairly high-level without listing concrete operations like 'run A/B comparisons', 'score prompt outputs', or 'generate improvement reports'. | 2 / 3 |
| Completeness | Clearly answers both 'what' (evaluate and improve Claude Code commands, skills, and agents) and 'when' (use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality) with explicit trigger guidance. | 3 / 3 |
| Trigger Term Quality | Includes relevant terms like 'prompt effectiveness', 'context engineering', 'commands', 'skills', 'agents', but misses common natural variations users might say such as 'prompt testing', 'prompt evaluation', 'benchmark', 'prompt quality', or 'skill validation'. | 2 / 3 |
| Distinctiveness / Conflict Risk | The focus on Claude Code commands/skills/agents is somewhat distinctive, but terms like 'evaluate', 'improve', and 'prompt effectiveness' could overlap with general code review or prompt engineering skills. The niche is moderately clear but not sharply delineated. | 2 / 3 |
| Total | | 9 / 12 Passed |
Implementation: 27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill reads like a comprehensive textbook on LLM evaluation rather than a concise, actionable skill for Claude. Its greatest weakness is extreme verbosity — it explains concepts Claude already knows (statistical metrics, evaluation theory), repeats key points multiple times, and inlines what should be 3-4 separate reference files into one massive document. The prompt templates and workflow patterns have value but are buried in excessive explanatory content.
Suggestions

- Split into SKILL.md (overview + primary workflow, ~100 lines) with separate reference files: BIAS_MITIGATION.md, METRICS.md, PROMPT_PATTERNS.md, and EXAMPLES.md.
- Remove explanations of concepts Claude already knows: basic statistics (precision, recall, F1, Spearman's ρ, Cohen's κ), what PDF is, how libraries work, etc. Just reference them by name.
- Eliminate repeated content: the 'chain-of-thought improves reliability by 15-25%' claim appears three times, the multi-dimensional rubric is essentially presented twice, and the bias mitigation summaries are duplicated.
- Add a clear decision tree at the top ('What are you evaluating?' → specific workflow to follow) so Claude can quickly navigate to the relevant section rather than reading 800+ lines; see the sketch after this list.
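One possible shape for that restructuring (a hypothetical layout: the reference file names come from the first suggestion, and the decision-tree branches are guesses based on the workflows the review mentions below):

```text
agent-evaluation/
├── SKILL.md                 # overview + decision tree + primary workflow (~100 lines)
└── references/
    ├── BIAS_MITIGATION.md
    ├── METRICS.md
    ├── PROMPT_PATTERNS.md
    └── EXAMPLES.md

What are you evaluating?
├── A new command or skill        → primary workflow (SKILL.md)
├── Two prompt variants           → pairwise comparison (references/PROMPT_PATTERNS.md)
└── A change to an existing one   → regression testing (references/EXAMPLES.md)
```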
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | This skill is extremely verbose at ~800+ lines, with massive amounts of content that Claude already knows (what precision/recall are, how correlation works, basic evaluation concepts). It repeats itself frequently (e.g., 'chain-of-thought improves reliability by 15-25%' appears multiple times, the multi-dimensional rubric is essentially repeated, and the bias mitigation tables are duplicated). Entire sections like the Metric Selection Guide explain basic statistics concepts Claude already understands. | 1 / 3 |
| Actionability | The skill provides some concrete prompt templates and Python code examples, but much of the code is pseudocode-level (e.g., `await compare()` and `await extract_claims()` reference undefined functions). The prompt templates are useful and copy-paste ready, but the overall guidance is more conceptual and educational than executable: it describes evaluation approaches rather than providing a tight, actionable workflow Claude can follow. One way to make `compare()` concrete is sketched after this table. | 2 / 3 |
| Workflow Clarity | Several workflows are listed (testing a new command, comparing prompt variants, regression testing) with numbered steps, which is helpful. However, validation checkpoints are mostly implicit rather than explicit, and the sheer volume of content makes it hard to identify which workflow to follow when. There's no clear decision point for 'when to use this skill' or a primary workflow path; it reads more like a textbook than an operational guide. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no bundle files and no references to external documents. Content that should be split into separate reference files (bias mitigation techniques, metric selection guide, implementation patterns) is all inlined into a single massive document. The inline section headers suggest these were meant to be separate files but were collapsed into one, creating an overwhelming single document. | 1 / 3 |
| Total | | 6 / 12 Passed |
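As an example of tightening that pseudocode, the undefined `compare()` could become a small, self-contained helper. The sketch below is an assumption about what the skill intends, not its actual code: the `judge` callable, the prompt template, and the position-swap step (a standard bias-mitigation technique for pairwise judging) are all illustrative.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

# Hypothetical template; the skill's real prompt templates would slot in here.
_TEMPLATE = (
    "Task prompt:\n{prompt}\n\n"
    "Response A:\n{a}\n\n"
    "Response B:\n{b}\n\n"
    "Which response better satisfies the task? Answer 'A', 'B', or 'tie'."
)


def compare(prompt: str, output_a: str, output_b: str,
            judge: Callable[[str], Verdict]) -> Verdict:
    """Pairwise comparison with position-swap debiasing.

    `judge` is any callable that sends one prompt to an LLM and parses the
    verdict; it is injected so this helper stays model-agnostic.
    """
    first = judge(_TEMPLATE.format(prompt=prompt, a=output_a, b=output_b))
    # Re-run with the candidates swapped, then undo the swap, so a judge
    # that favors one position cancels itself out.
    rerun = judge(_TEMPLATE.format(prompt=prompt, a=output_b, b=output_a))
    rerun = {"A": "B", "B": "A", "tie": "tie"}[rerun]
    return first if first == rerun else "tie"  # disagreement counts as a tie


if __name__ == "__main__":
    # Stub judge for demonstration; a real judge would call a model API.
    always_a = lambda _: "A"
    print(compare("Summarize X.", "terse summary", "thorough summary", always_a))
```

Injecting `judge` keeps the helper model-agnostic, so the same debiasing logic works whether verdicts come from an API call or a stub in tests.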
Validation: 90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation for skill structure: 10 / 11 passed
| Criteria | Description | Result |
|---|---|---|
| skill_md_line_count | SKILL.md is long (1711 lines); consider splitting into references/ and linking | Warning |
| Total | | 10 / 11 Passed |