Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
- Overall score: 88
- Quality: 85% (does it follow best practices?)
- Impact: 88%; 1.87x average score across 3 eval scenarios
- Validation: Passed; no known issues
## Quality
### Discovery — 100%

Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong description that clearly defines a specific domain (skill management and optimization), lists concrete actions, and provides explicit trigger guidance via a 'Use when...' clause. The trigger terms are natural and cover multiple user intent variations. The skill's niche is distinct enough to avoid conflicts with other skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific, concrete actions: create new skills, modify/improve existing skills, measure skill performance, run evals, benchmark with variance analysis, and optimize descriptions for triggering accuracy. | 3 / 3 |
| Completeness | Clearly answers both 'what' (create, modify, improve, measure skills) and 'when', with an explicit 'Use when...' clause listing specific trigger scenarios such as creating from scratch, editing, running evals, benchmarking, and optimizing descriptions. | 3 / 3 |
| Trigger Term Quality | Includes strong, natural trigger terms users would say: 'create a skill', 'edit', 'optimize', 'run evals', 'test a skill', 'benchmark', 'skill performance', 'triggering accuracy', 'description'. These cover a good range of how users would phrase requests about skill management. | 3 / 3 |
| Distinctiveness / Conflict Risk | The domain of skill creation, editing, evaluation, and optimization is a clear niche. Terms like 'skill', 'evals', 'variance analysis', and 'triggering accuracy' are highly specific and unlikely to conflict with other skills. | 3 / 3 |
| **Total** | | **12 / 12 — Passed** |
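For context, a description like the one scored above typically lives in the skill's `SKILL.md` frontmatter. A hedged sketch of the shape (the `name` value is hypothetical, and the exact schema may vary by platform):

```yaml
---
name: skill-creator  # hypothetical name, for illustration only
description: >-
  Create new skills, modify and improve existing skills, and measure
  skill performance. Use when users want to create a skill from scratch,
  edit or optimize an existing skill, run evals to test a skill,
  benchmark skill performance with variance analysis, or optimize a
  skill's description for better triggering accuracy.
---
```

The discovery dimensions scored here (specificity, completeness, trigger terms, distinctiveness) all apply to that single `description` field.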
### Implementation — 70%

Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill excels at actionability, workflow clarity, and progressive disclosure — it provides a comprehensive, well-structured guide for creating and iterating on skills with concrete commands, clear sequencing, and appropriate references to external files. However, it is severely undermined by verbosity: conversational filler, repeated statements of the core loop (3 times), lengthy philosophical asides about communication style and improvement philosophy, and explanations of concepts Claude already understands. The content could likely be cut by 40-50% without losing any actionable information.
#### Suggestions

- Remove the three redundant restatements of the core loop: state it once at the top and reference it, rather than repeating it at the beginning, middle, and end.
- Cut conversational filler ('Cool? Cool.', 'Good luck!', 'Sorry in advance but I'm gonna go all caps here') and the extended discussion of user demographics and communication style; these consume tokens without adding actionable guidance.
- Consolidate the 'How to think about improvements' section: the philosophical guidance about generalization, lean prompts, and explaining 'why' could be reduced to 3-4 bullet points instead of multiple paragraphs with parenthetical asides.
- Move environment-specific instructions (Claude.ai, Cowork) into separate reference files rather than inlining them, since they add ~100 lines that are irrelevant in most contexts.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~500+ lines, with significant conversational padding ('Cool? Cool.', 'Good luck!'), repeated instructions (the core loop is stated 3 times), explanations of concepts Claude knows (what JSON is, how subagents work), and lengthy asides about communication style and user demographics that don't add actionable value. | 1 / 3 |
| Actionability | Despite the verbosity, the skill provides highly concrete, executable guidance: specific CLI commands, exact JSON schemas, file path conventions, script invocation patterns, and step-by-step procedures with copy-paste-ready code blocks for every major operation. | 3 / 3 |
| Workflow Clarity | The multi-step workflow is clearly sequenced with explicit validation checkpoints (run tests → grade → aggregate → launch viewer → collect feedback → iterate). Steps are numbered, feedback loops are built in (iterate until satisfied), and there are clear instructions for error recovery and environment-specific adaptations. | 3 / 3 |
| Progressive Disclosure | Content is well structured, with clear references to external files (agents/grader.md, agents/comparator.md, agents/analyzer.md, references/schemas.md) that are one level deep and clearly signaled with descriptions of when to read them. The skill appropriately separates the core workflow from advanced features (blind comparison) and environment-specific instructions. | 3 / 3 |
| **Total** | | **10 / 12 — Passed** |
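The run → grade → aggregate → iterate loop noted under Workflow Clarity can be sketched roughly as follows. Every name here (`run_scenario`, `grade`, the 0.0–1.0 scale) is a hypothetical stand-in for illustration, not the skill's actual scripts or schemas:

```python
"""Minimal sketch of an eval loop: run scenarios, grade transcripts,
aggregate scores. All names and the scoring scale are assumptions."""

from statistics import mean


def run_scenario(scenario: dict) -> str:
    # Stand-in: a real harness would invoke the skill against the
    # scenario and capture a transcript for grading.
    return f"transcript for {scenario['name']}"


def grade(transcript: str) -> float:
    # Stand-in grader: a real one would score the transcript against
    # a rubric on some fixed scale (here 0.0-1.0).
    return 1.0 if transcript else 0.0


def evaluate(scenarios: list[dict]) -> float:
    """Run every scenario, grade each transcript, return the mean score."""
    return mean(grade(run_scenario(s)) for s in scenarios)


scenarios = [{"name": f"scenario-{i}"} for i in range(1, 4)]
avg = evaluate(scenarios)
print(f"average score across {len(scenarios)} scenarios: {avg:.0%}")
```

A real implementation would add the feedback step (collect reviewer notes, revise the skill, re-run) around this core; the sketch only shows the run-grade-aggregate portion.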
### Validation — 100%

Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
11 / 11 checks passed. Skill structure validation produced no warnings or errors.
Revision: `636b862`