
agent-evaluation

Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.

45

Quality

47%

Does it follow best practices?

Impact

No eval scenarios have been run

Security by Snyk

Passed

No known issues

Fix and improve this skill with Tessl

tessl review fix ./plugins/customaize-agent/skills/agent-evaluation/SKILL.md

Quality

Content

27%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.

This skill reads more like a comprehensive textbook chapter on LLM evaluation than a concise, actionable skill for Claude Code. It suffers from extreme verbosity, explaining concepts Claude already knows (statistical metrics, basic ML evaluation), repeating key points multiple times, and inlining what should be 4-5 separate reference files into a single massive document. The actionable prompt templates and workflow patterns are buried under layers of explanatory content.

Suggestions

Reduce the main SKILL.md to ~100-150 lines covering core workflow steps and key prompt templates, moving bias mitigation techniques, metric definitions, and implementation patterns into separate referenced files (e.g., BIAS_MITIGATION.md, METRICS.md, PATTERNS.md).

Remove explanations of concepts Claude already knows: statistical metric definitions (precision, recall, F1, Spearman's rho, Cohen's kappa), what position bias is, how pairwise comparison works conceptually. Instead, just specify which metrics to use when.

Consolidate the repeated content (chain-of-thought requirement stated 3 times, position swapping explained in 4 different sections, anti-patterns listed twice) into single authoritative sections.

Make code examples fully executable or remove them—functions like `assess_relevance()`, `extract_claims()`, and `verify_claim()` are undefined and serve as pseudocode rather than actionable guidance.
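
One way to act on the last two suggestions is to give position swapping a single, self-contained helper instead of several partial explanations. The sketch below is illustrative only and is not taken from the reviewed SKILL.md: the `Judge` signature, the name `compare_with_position_swap`, and both stub judges are assumptions introduced for this example. In practice the judge would wrap an LLM call.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (task, first_response, second_response) -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def compare_with_position_swap(
    task: str, response_a: str, response_b: str, judge: Judge
) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    # Pass 1 shows A first; pass 2 shows B first.
    pass_one = {"first": "A", "second": "B", "tie": "tie"}[
        judge(task, response_a, response_b)
    ]
    pass_two = {"first": "B", "second": "A", "tie": "tie"}[
        judge(task, response_b, response_a)
    ]
    # Only a verdict that survives the swap counts as a win.
    return pass_one if pass_one == pass_two else "tie"


def always_first_judge(task: str, first: str, second: str) -> str:
    """Stub with pure position bias: prefers whatever is shown first."""
    return "first"


def longer_response_judge(task: str, first: str, second: str) -> str:
    """Stub that ignores position and prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With these stubs, the position-biased judge contradicts itself across the two orderings, so the helper reports a tie, while the length-based judge picks the same response both times. Passing the judge in as a parameter keeps the helper runnable without any undefined functions.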

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Conciseness | This skill is extremely verbose at ~800+ lines, with massive amounts of content that Claude already knows (what precision/recall are, how correlation works, basic evaluation concepts). It explains fundamental ML evaluation concepts, statistical metrics, and general software engineering practices at length. The repeated explanations of the same concepts (e.g., position bias mitigation appears multiple times, the chain-of-thought requirement is stated three times) and the inclusion of basic statistical definitions waste significant token budget. | 1 / 3 |
| Actionability | The skill provides some concrete prompt templates and Python code examples for bias mitigation and evaluation workflows, which is useful. However, much of the code is pseudocode-level (e.g., functions calling undefined helpers like `assess_relevance`, `extract_claims`, `verify_claim`), and the guidance is more conceptual/educational than directly executable in a Claude Code context. The evaluation prompt templates are the most actionable parts. | 2 / 3 |
| Workflow Clarity | Several workflows are listed (testing a new command, comparing prompt variants, regression testing) with numbered steps, which provides reasonable sequencing. However, validation checkpoints are mostly implicit rather than explicit, and the workflows lack concrete 'if X fails, do Y' feedback loops. The sheer volume of content also makes it hard to identify which workflow to follow for a given situation. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no references to external files despite being extremely long. Content that should be in separate reference files (bias mitigation techniques, metric selection guide, implementation patterns) is all inlined, with section headers that read like separate documents ('# Bias Mitigation Techniques for LLM Evaluation', '# LLM-as-Judge Implementation Patterns') but are crammed into one file. No bundle files are provided to offload this content. | 1 / 3 |
| Total | | 6 / 12 (Passed) |

Description

67%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.

The description has good structural completeness with a clear 'Use when' clause covering multiple trigger scenarios. However, it lacks concrete action specificity—the verbs 'evaluate and improve' are high-level, and the trigger terms, while relevant, could better cover natural user phrasings. The domain is reasonably distinct but could be sharpened to reduce overlap with general prompt engineering skills.

Suggestions

Add more concrete actions to increase specificity, e.g., 'Run comparative evaluations, score prompt outputs, generate test cases for commands and skills'.

Expand trigger terms with natural user phrasings like 'test my skill', 'debug my prompt', 'is my agent working well', 'prompt quality check'.
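
The trigger-term suggestion can be checked mechanically. The sketch below is a hypothetical helper, not part of Tessl or the reviewed skill: it does a naive case-insensitive substring match of the suggested user phrasings against the current description. The function name `missing_trigger_phrases` is an assumption. The description and phrasings are copied from this page.

```python
def missing_trigger_phrases(description: str, phrasings: list[str]) -> list[str]:
    """Return the phrasings that do not occur verbatim in the description."""
    lowered = description.lower()
    return [phrase for phrase in phrasings if phrase.lower() not in lowered]


# Current skill description, as shown at the top of this page.
DESCRIPTION = (
    "Evaluate and improve Claude Code commands, skills, and agents. "
    "Use when testing prompt effectiveness, validating context engineering "
    "choices, or measuring improvement quality."
)

# Natural user phrasings suggested by the review.
SUGGESTED_PHRASINGS = [
    "test my skill",
    "debug my prompt",
    "is my agent working well",
    "prompt quality check",
]

if __name__ == "__main__":
    for phrase in missing_trigger_phrases(DESCRIPTION, SUGGESTED_PHRASINGS):
        print(f"not covered: {phrase}")
```

Exact substring matching is deliberately crude: it only shows which phrasings are absent word for word, and says nothing about how an agent would actually match a request to the skill.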

| Dimension | Reasoning | Score |
| --- | --- | --- |
| Specificity | Names the domain (Claude Code commands, skills, agents) and some actions (evaluate, improve, testing, validating, measuring), but the actions remain somewhat abstract: 'evaluate and improve' is broad, and specific concrete actions like 'run A/B comparisons', 'score prompt outputs', or 'generate test cases' are missing. | 2 / 3 |
| Completeness | Clearly answers both 'what' (evaluate and improve Claude Code commands, skills, and agents) and 'when' (testing prompt effectiveness, validating context engineering choices, measuring improvement quality) with an explicit 'Use when' clause containing specific trigger scenarios. | 3 / 3 |
| Trigger Term Quality | Includes relevant terms like 'commands', 'skills', 'agents', 'prompt effectiveness', 'context engineering', and 'improvement quality', which are reasonably natural. However, it misses common user phrasings like 'test my skill', 'debug my prompt', 'prompt evaluation', 'skill quality', or 'agent testing'. | 2 / 3 |
| Distinctiveness Conflict Risk | The focus on evaluating Claude Code commands/skills/agents is a fairly specific niche, but terms like 'prompt effectiveness' and 'improvement quality' are broad enough that they could overlap with general prompt engineering or code review skills. | 2 / 3 |
| Total | | 9 / 12 (Passed) |

Validation

90%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.

Validation: 10 / 11 Passed

Validation for skill structure

| Criteria | Description | Result |
| --- | --- | --- |
| skill_md_line_count | SKILL.md is long (1711 lines); consider splitting into references/ and linking | Warning |
| Total | | 10 / 11 Passed |
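
A line-count check like `skill_md_line_count` is easy to reproduce locally before resubmitting. The sketch below is an assumption about how such a check could work, not Tessl's implementation: the function name `check_line_count` and the 500-line threshold are hypothetical, since the page does not state the validator's actual limit.

```python
def check_line_count(text: str, max_lines: int = 500) -> tuple[str, str]:
    """Return a (result, message) pair for the given SKILL.md contents.

    max_lines is a hypothetical threshold chosen for illustration.
    """
    line_count = len(text.splitlines())
    if line_count > max_lines:
        return (
            "Warning",
            f"SKILL.md is long ({line_count} lines); "
            "consider splitting into references/ and linking",
        )
    return ("Passed", f"SKILL.md has {line_count} lines")


if __name__ == "__main__":
    # Stand-in content with the line count reported on this page.
    sample = "line\n" * 1711
    result, message = check_line_count(sample)
    print(result, "-", message)
```

To check a real file, read it with `pathlib.Path("SKILL.md").read_text()` and pass the contents to the function.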

Repository
NeoLabHQ/context-engineering-kit
Reviewed
