Build new Claude skills from scratch or supercharge existing ones through rigorous evaluation and iterative improvement. Use when the user wants to create, build, improve, evaluate, audit, enhance, benchmark, test, or package a skill. Also trigger for "turn this into a skill", "make this reusable", "I keep repeating this workflow", or references to SKILL.md, skill frontmatter, description optimization, or skill packaging. Do NOT use for general coding tasks, document creation, or other non-skill workflows. Even if the user just says "skill" in the context of Claude capabilities, this is likely the right skill to load.
The power source for building and enhancing Claude skills. Two modes, one goal: skills that trigger reliably, execute flawlessly, and improve over time.
Start every session by determining the user's intent:
If the conversation already contains a workflow the user wants to capture ("turn this into a skill"), that's Create mode. If the user has an existing SKILL.md they want to improve, that's Enhance mode. Ask if it's not clear.
Understand what the skill needs to do before writing anything.
Read `references/skill-categories.md` for the 9 common skill types; naming the category helps set expectations about structure.

If the user is converting an existing conversation into a skill, mine the conversation history for: tools used, sequence of steps, corrections made, and input/output formats observed. Pre-fill answers from context and confirm with the user.
Before writing, do your homework:
Read `references/skill-anatomy.md` for structural rules and `references/writing-guide.md` for craft guidance. Sketch the structure for the user and get buy-in before writing.
Write the SKILL.md following these principles (detailed in references/writing-guide.md):
Move heavy reference material to `references/`, scripts to `scripts/`, and templates to `assets/`. Use the template in `templates/new-skill-template.md` as a starting point if helpful.
After drafting the skill, assess whether hooks would improve it. Read the hooks section in references/writing-guide.md for the full guidance. The quick heuristic:
If hooks would help, design them alongside the skill and include them in hooks/ with a settings.json registration file. If they wouldn't add clear value, skip them — hooks add complexity and maintenance burden.
Evaluate whether the skill should be complemented by an MCP server and/or always-on rules. Read references/mcp-rules-integration.md for the full methodology. The quick assessment:
Does the skill interact with external services? Look for CLI commands calling APIs, curl/wget requests, auth tokens. If yes, check whether an MCP server exists for that service. If an MCP server is available, design the skill to use MCP tools as the primary path with CLI as fallback.
Would the skill benefit from fetching service documentation? If the skill depends on an external tool or platform, fetch the MCP server's tool descriptions and/or API docs to ensure the skill uses correct parameters, handles all error cases, and leverages available capabilities fully.
Should always-on rules accompany this skill? If the skill uses MCP tools that could cause harm when misused, or if it establishes patterns that should apply even outside the skill's context (output formatting, safety checks, preferred tool variants), generate rules for .claude/rules/. Rules persist across sessions; skills only apply when loaded.
If MCP and rules are relevant, generate:
A `.claude/rules/{service}-usage.md` file with safety, data quality, tool usage, and fallback rules.

After drafting, create 3-5 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for review before running.
Design test cases with variety:
Save test cases to evals/evals.json. See references/schemas.md for the full schema. Don't write assertions yet — just prompts and expected output descriptions.
```json
{
  "skill_name": "your-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Realistic user prompt",
      "category": "core",
      "expected_output": "Description of expected result",
      "files": [],
      "expectations": []
    }
  ]
}
```

The run-and-evaluate flow below is one continuous sequence. Put results in `<skill-name>-workspace/`, organized by iteration (`iteration-1/`, `iteration-2/`).
Step 1 — Spawn all runs in the same turn. For each test case, spawn two subagents:
One run with the skill loaded, writing to `iteration-N/eval-ID/with_skill/outputs/`, and one baseline run without the skill, writing to `iteration-N/eval-ID/without_skill/outputs/`. Launch everything at once so it all finishes around the same time.
Step 2 — While runs are in progress, draft assertions. Don't sit idle: write verifiable assertions for each test case. Good assertions are specific, discriminating, and have descriptive names. Update `evals/evals.json` and write `eval_metadata.json` for each eval directory.
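If it helps to visualize Step 2's output, here is a rough sketch of what an eval directory's `eval_metadata.json` could contain — the field names below are illustrative assumptions, not the canonical schema, which lives in `references/schemas.md`:

```json
{
  "eval_id": 1,
  "prompt": "Realistic user prompt",
  "category": "core",
  "expected_output": "Description of expected result",
  "expectations": [
    {
      "name": "includes_required_sections",
      "assertion": "The output contains every section the prompt asked for"
    }
  ]
}
```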
Step 3 — Capture timing data. When subagent notifications arrive with total_tokens and duration_ms, save to timing.json immediately — this data isn't persisted elsewhere.
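A minimal sketch of one possible `timing.json` shape — `total_tokens` and `duration_ms` come straight from the subagent notification, while the per-eval, per-variant nesting and the numbers are assumptions for illustration:

```json
{
  "eval-1": {
    "with_skill": { "total_tokens": 48210, "duration_ms": 93500 },
    "without_skill": { "total_tokens": 61840, "duration_ms": 121400 }
  }
}
```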
Step 4 — Grade, aggregate, and review.
Grade each run following the `agents/grader.md` instructions and save `grading.json` in each run directory (a rough sketch of its shape appears below). Use programmatic checks where possible — scripts are faster and more reliable than eyeballing. Then aggregate with `python scripts/aggregate_benchmark.py <workspace>/iteration-N --skill-name <name>` and generate the report with `python scripts/generate_report.py <workspace>/iteration-N` — get results in front of the human ASAP.

Step 5 — Read feedback and improve. See the Improvement Philosophy section below.
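For orientation, an illustrative sketch of a per-run `grading.json` — the field names here are assumptions; the grader subagent's actual output format is defined by `agents/grader.md` and `references/schemas.md`:

```json
{
  "eval_id": 1,
  "variant": "with_skill",
  "results": [
    {
      "expectation": "includes_required_sections",
      "passed": true,
      "evidence": "All requested sections are present in the output"
    }
  ],
  "passed": 1,
  "failed": 0
}
```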
After improving the skill:
Re-run the eval suite in a fresh `iteration-N+1/` directory, with `--previous-iteration` pointing at the last one so improvements can be compared against the prior baseline.

Once the skill itself is solid, optimize the description for trigger accuracy. This is a separate step — don't optimize triggers before the skill works well.
Read the detailed process in references/writing-guide.md under "Description Optimization." The short version:
Run `python scripts/run_trigger_loop.py --eval-set <path> --skill-path <path> --model <model-id> --max-iterations 5`.

When the skill is ready to ship, run `python scripts/package_skill.py <path/to/skill-folder>`. This produces a `.skill` file ready for upload to Claude.ai Settings or distribution. Present it to the user.
Enhance mode follows a structured audit → plan → rewrite → validate cycle. Read references/evaluation-framework.md for the full methodology. Here's the workflow:
Copy the skill into a working directory (`cp -r <skill-path> /tmp/<skill-name>/`), then present a Skill Profile summary to the user before continuing.
Evaluate across these dimensions (see references/evaluation-framework.md for detailed criteria):
Hooks opportunity: see `references/writing-guide.md` section 8 for the category assessment. If the skill involves multi-step workflows, output validation, destructive operations, or user intent detection, hooks are likely worth recommending. If not, note that hooks aren't needed and move on.

MCP and rules opportunity: see `references/mcp-rules-integration.md` for the full assessment framework. If the skill uses MCP tools, fetch the tool descriptions and verify correct usage. If rules would help, draft them alongside the skill enhancement.

Rate each dimension: Strong / Adequate / Needs Work / Critical Gap / Not Applicable. Provide specific evidence.
Organize findings into prioritized recommendations:
Each recommendation needs: What, Why (concrete impact), How (show the diff), Risk, and Effort estimate.
Present the plan to the user and get buy-in before rewriting.
Produce a complete enhanced version incorporating all Critical and High-priority recommendations. This is a clean rewrite, not a patch. Preserve the original name.
After rewriting, provide a Change Summary — concise bullets of what changed and why.
Design 5-7 test prompts (for example: 2 core, 2 edge, 1 boundary, 1 stress, 1 minimal). Follow the same Run & Evaluate flow from Create Mode Phase 5.
If this is an improvement over an existing skill, use the original version as the baseline instead of no-skill.
Same as Create Mode Phase 7. Especially important in Enhance mode since trigger issues are one of the most common problems.
Same as Create Mode Phase 8. Preserve the original skill name in the output.
When iterating on a skill — whether in Create or Enhance mode — keep these principles front of mind:
Generalize, don't overfit. The skill will be used thousands of times across different prompts. You're iterating on a few examples because it's fast. Don't put in fiddly changes that only fix one test case. If something is stubborn, try different metaphors or patterns rather than adding more rigid rules.
Keep the prompt lean. Every line costs context window tokens on every invocation. Read the transcripts — if the skill makes the model waste time on unproductive steps, cut those sections.
Explain the why. LLMs have good theory of mind. When given a good understanding of why something matters, they can go beyond rote instructions and handle novel situations. This is more effective than ALWAYS/NEVER directives.
Look for repeated work. If all test runs independently write similar helper scripts, that's a strong signal the skill should bundle that script. Write it once, put it in scripts/.
The description is the front door. If the skill never triggers, nothing else matters.
Build the Gotchas section over time. The highest-signal content in any skill captures the failure modes Claude hits in practice. Treat it as a living document.
Read these as needed — don't load everything upfront:
| File | When to read |
|---|---|
references/skill-anatomy.md | When drafting structure or validating format |
references/skill-categories.md | During intent capture to identify skill type |
references/writing-guide.md | When writing or rewriting skill instructions |
references/evaluation-framework.md | During Enhance mode structural evaluation |
references/mcp-rules-integration.md | When assessing MCP opportunity or generating rules |
references/schemas.md | When creating eval JSON files or interpreting results |
Read when spawning the relevant subagent:
| File | Purpose |
|---|---|
agents/grader.md | Evaluate assertions against outputs |
agents/comparator.md | Blind A/B comparison between two outputs |
agents/analyzer.md | Analyze why one version beat another |
| Script | Usage |
|---|---|
scripts/validate_skill.py | Validate skill structure and frontmatter |
scripts/run_eval.py | Run a single eval against a skill |
scripts/aggregate_benchmark.py | Aggregate grading results into benchmark |
scripts/run_trigger_loop.py | Description optimization loop |
scripts/generate_report.py | Generate HTML eval report |
scripts/package_skill.py | Package skill as .skill file for distribution |
Skill-arc-reactor includes on-demand hooks that activate when the skill is loaded. They catch common mistakes at write-time instead of after packaging.
| Hook | Event | What It Does |
|---|---|---|
hooks/validate-frontmatter.sh | PostToolUse → Write/Edit | Validates SKILL.md frontmatter on every save: kebab-case name, description length and content, XML brackets, trigger phrases, boundary clarity |
hooks/check-skill-length.sh | PostToolUse → Write/Edit | Nudges when SKILL.md exceeds 500 lines — reminds to move content to references/ |
hooks/capture-timing.sh | Notification | Auto-captures subagent total_tokens and duration_ms into timing.json — this data is lost if not saved at completion time |
Hook registration config is in hooks/settings.json. To activate, merge into your .claude/settings.json.
All hooks follow the exit code contract: exit 0 with stdout injects context (advice/warnings), exit 0 with no output is silent pass. No hooks block (exit 2) — they advise, never prevent.
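As a reference point, the bundled hooks/settings.json likely resembles the standard Claude Code hook registration format sketched below — verify the paths and matchers against the actual file before merging into your .claude/settings.json:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "hooks/validate-frontmatter.sh" },
          { "type": "command", "command": "hooks/check-skill-length.sh" }
        ]
      }
    ],
    "Notification": [
      {
        "hooks": [
          { "type": "command", "command": "hooks/capture-timing.sh" }
        ]
      }
    ]
  }
}
```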