Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch for Claude Code or Cursor, update or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
68
81%
Does it follow best practices?
Impact
—
No eval scenarios have been run
Passed
No known issues
Quality
Discovery
100%
Based on the skill's description, can an agent find and select it at the right time? Clear, specific descriptions lead to better discovery.
This is a strong description that clearly articulates specific capabilities (creating, modifying, evaluating, and benchmarking skills) and provides explicit trigger guidance via a well-structured 'Use when...' clause. It uses third-person voice consistently, includes natural trigger terms users would employ, and occupies a distinct niche that minimizes conflict risk with other skills.
| Dimension | Reasoning | Score |
|---|---|---|
| Specificity | Lists multiple specific concrete actions: 'Create new skills', 'modify and improve existing skills', 'measure skill performance', 'run evals to test a skill', 'benchmark skill performance with variance analysis', 'optimize a skill's description for better triggering accuracy'. | 3 / 3 |
| Completeness | Clearly answers both 'what' (create, modify, measure, benchmark, optimize skills) and 'when' with an explicit 'Use when...' clause listing multiple specific trigger scenarios. | 3 / 3 |
| Trigger Term Quality | Includes strong natural keywords users would say: 'create a skill', 'Claude Code', 'Cursor', 'update or optimize', 'run evals', 'benchmark', 'variance analysis', 'triggering accuracy', 'skill description'. These cover a good range of terms a user working with skills would naturally use. | 3 / 3 |
| Distinctiveness / Conflict Risk | The description carves out a very clear niche around skill creation, modification, evaluation, and benchmarking specifically for Claude Code and Cursor. This is a distinct domain unlikely to conflict with other skills. | 3 / 3 |
| Total | | 12 / 12 (Passed) |
Implementation
62%
Reviews the quality of instructions and guidance provided to agents. Good implementation is clear, handles edge cases, and produces reliable results.
This skill provides an impressively thorough and actionable workflow for creating, testing, and iterating on skills, with concrete commands, JSON schemas, and clear step sequencing. However, it is severely undermined by excessive verbosity — conversational asides, motivational commentary, explanations of obvious concepts, and philosophical tangents inflate the token cost significantly. The content would benefit greatly from aggressive trimming and splitting large sections into referenced files.
Suggestions
Cut all conversational filler ('Cool? Cool.', 'Good luck!', motivational asides about billions in economic value) and meta-commentary about communication style — these waste tokens without adding actionable guidance.
Move the 'Communicating with the user' section, the 'How to think about improvements' philosophy section, and the 'Cursor-Specific Instructions' into separate reference files to keep SKILL.md under 300 lines.
Remove the repeated summary of the core loop at the end — it's already clear from the structure and adds ~15 lines of redundancy.
Tighten the 'How skill triggering works' explanation — Claude doesn't need a paragraph explaining that agents decide whether to consult skills; just state the implication for eval query design.
| Dimension | Reasoning | Score |
|---|---|---|
| Conciseness | The skill is extremely verbose at ~400+ lines with significant conversational padding ('Cool? Cool.'), unnecessary meta-commentary about communication style, extensive explanations of concepts Claude already knows (what JSON is, how to think about improvements), and motivational asides ('we are trying to create billions a year in economic value here!'). Much of this could be cut without losing actionable content. | 1 / 3 |
| Actionability | Despite the verbosity, the skill provides highly concrete, executable guidance: specific CLI commands for running scripts, exact JSON schemas for eval files and feedback, precise directory structures, copy-paste ready code blocks for spawning runs, grading, aggregation, and launching the viewer. The workflow is thoroughly specified with real commands. | 3 / 3 |
| Workflow Clarity | The multi-step workflow is clearly sequenced with explicit numbered steps, validation checkpoints (grade before generating viewer, validate assertions exist before proceeding), feedback loops (iterate until user is satisfied), and clear error recovery patterns. The iteration loop is well-defined with explicit stopping criteria. | 3 / 3 |
| Progressive Disclosure | The skill references external files appropriately (agents/grader.md, agents/comparator.md, agents/analyzer.md, references/schemas.md) with clear guidance on when to read them. However, the SKILL.md body itself is monolithic and contains substantial content that could be split into reference files: the description optimization section, Cursor-specific instructions, and the detailed improvement philosophy could all be separate documents to keep the main file leaner. | 2 / 3 |
| Total | | 9 / 12 (Passed) |
Validation
100%
Checks the skill against the spec for correct structure and formatting. All validation checks must pass before discovery and implementation can be scored.
Validation — 11 / 11 Passed
Validation for skill structure
No warnings or errors.
d6af887