Content
27%
Reviews the quality of instructions and guidance provided to agents. A good implementation is clear, handles edge cases, and produces reliable results.
This skill reads more like a comprehensive textbook chapter on LLM evaluation than a concise, actionable skill for Claude Code. It suffers from extreme verbosity, explaining concepts Claude already knows (statistical metrics, basic ML evaluation), repeating key points multiple times, and inlining what should be 4-5 separate reference files into a single massive document. The actionable prompt templates and workflow patterns are buried under layers of explanatory content.
Suggestions
Reduce the main SKILL.md to ~100-150 lines covering core workflow steps and key prompt templates, moving bias mitigation techniques, metric definitions, and implementation patterns into separate referenced files (e.g., BIAS_MITIGATION.md, METRICS.md, PATTERNS.md).
Remove explanations of concepts Claude already knows: statistical metric definitions (precision, recall, F1, Spearman's rho, Cohen's kappa), what position bias is, how pairwise comparison works conceptually. Instead, just specify which metrics to use when.
Consolidate the repeated content (chain-of-thought requirement stated 3 times, position swapping explained in 4 different sections, anti-patterns listed twice) into single authoritative sections.
Make code examples fully executable or remove them: functions like `assess_relevance()`, `extract_claims()`, and `verify_claim()` are called but never defined, so the examples read as pseudocode rather than actionable guidance.
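As a rough illustration of what a fully executable example might look like, here is a minimal sketch of position-swapped pairwise comparison in which every helper is defined. The names `Judge`, `compare_with_position_swap`, and `longer_answer_judge` are hypothetical and do not come from the reviewed skill. The stand-in judge exists only so the block runs without a model call.

```python
from typing import Callable

# Hypothetical judge signature (an assumption, not the skill's API):
# (prompt, first_response, second_response) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def compare_with_position_swap(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Ask the judge in both orders; return "a", "b", or "tie" if the verdicts disagree."""
    forward = judge(prompt, a, b)   # response a is shown first
    backward = judge(prompt, b, a)  # response b is shown first
    forward_winner = "a" if forward == "first" else "b"
    backward_winner = "b" if backward == "first" else "a"
    # A verdict that flips when the order flips is treated as position bias.
    return forward_winner if forward_winner == backward_winner else "tie"


def longer_answer_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in judge for demonstration only: prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"


def first_position_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in judge that always picks whatever is shown first (pure position bias)."""
    return "first"
```

In real use the stand-in judges would be replaced by a call to an actual model. The point of the sketch is only that a reader can run it as written.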
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | This skill is extremely verbose at ~800+ lines, with massive amounts of content that Claude already knows (what precision/recall are, how correlation works, basic evaluation concepts). It explains fundamental ML evaluation concepts, statistical metrics, and general software engineering practices at length. The repeated explanations of the same concepts (e.g., position bias mitigation appears multiple times, the chain-of-thought requirement is stated three times) and the inclusion of basic statistical definitions waste significant token budget. | 1 / 3 |
| Actionability | The skill provides some concrete prompt templates and Python code examples for bias mitigation and evaluation workflows, which is useful. However, much of the code is pseudocode-level (e.g., functions calling undefined helpers like `assess_relevance`, `extract_claims`, `verify_claim`), and the guidance is more conceptual/educational than directly executable in a Claude Code context. The evaluation prompt templates are the most actionable parts. | 2 / 3 |
| Workflow Clarity | Several workflows are listed (testing a new command, comparing prompt variants, regression testing) with numbered steps, which provides reasonable sequencing. However, validation checkpoints are mostly implicit rather than explicit, and the workflows lack concrete 'if X fails, do Y' feedback loops. The sheer volume of content also makes it hard to identify which workflow to follow for a given situation. | 2 / 3 |
| Progressive Disclosure | This is a monolithic wall of text with no references to external files despite being extremely long. Content that should be in separate reference files (bias mitigation techniques, metric selection guide, implementation patterns) is all inlined, with section headers that read like separate documents ('# Bias Mitigation Techniques for LLM Evaluation', '# LLM-as-Judge Implementation Patterns') but are crammed into one file. No bundle files are provided to offload this content. | 1 / 3 |
| Total | | 6 / 12 Passed |
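
The "if X fails, do Y" feedback loop noted as missing under Workflow Clarity could be made explicit along the following lines. This is only a sketch for a regression-testing checkpoint. The function name `next_action`, the action labels, and the `tolerance` default are all hypothetical and are not taken from the reviewed skill.

```python
def next_action(baseline_score: float, candidate_score: float, tolerance: float = 0.02) -> str:
    """Map a regression check onto an explicit next step.

    Hypothetical policy (an assumption for illustration):
    - no drop in score       -> "accept" the change
    - drop within tolerance  -> "rerun" the evaluation to rule out noise
    - larger drop            -> "reject" and revise the prompt
    """
    if candidate_score >= baseline_score:
        return "accept"
    if baseline_score - candidate_score <= tolerance:
        return "rerun"
    return "reject"
```

Stating the branch for each outcome in the workflow itself, in prose or in code, is what turns an implicit checkpoint into an explicit one.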