skill-arc-reactor

Build new Claude skills from scratch or supercharge existing ones through rigorous evaluation and iterative improvement. Use when the user wants to create, build, improve, evaluate, audit, enhance, benchmark, test, or package a skill. Also trigger for "turn this into a skill", "make this reusable", "I keep repeating this workflow", or references to SKILL.md, skill frontmatter, description optimization, or skill packaging. Do NOT use for general coding tasks, document creation, or other non-skill workflows. Even if the user just says "skill" in the context of Claude capabilities, this is likely the right skill to load.

Skill Arc Reactor

The power source for building and enhancing Claude skills. Two modes, one goal: skills that trigger reliably, execute flawlessly, and improve over time.

Mode Selection

Start every session by determining the user's intent:

  • Create Mode — Building a new skill from scratch
  • Enhance Mode — Auditing, evaluating, or improving an existing skill

If the conversation already contains a workflow the user wants to capture ("turn this into a skill"), that's Create mode. If the user has an existing SKILL.md they want to improve, that's Enhance mode. Ask if it's not clear.


Create Mode: Building a New Skill

Phase 1: Intent Capture

Understand what the skill needs to do before writing anything.

  1. What should this skill enable Claude to do? Get a one-sentence answer. If the user can't articulate it crisply, help them narrow it down.
  2. What category does this fall into? Consult references/skill-categories.md for the 9 common types. Naming the category helps set expectations about structure.
  3. When should this skill trigger? Collect specific phrases, contexts, and file types. Think about what users would actually say — casual, formal, abbreviated.
  4. What's the expected output? Files, reports, actions, workflows — be concrete.
  5. What environment constraints exist? Dependencies, MCP servers, network access, packages.

If the user is converting an existing conversation into a skill, mine the conversation history for: tools used, sequence of steps, corrections made, input/output formats observed. Pre-fill answers from context and confirm with the user.

Phase 2: Research & Design

Before writing, do your homework:

  • Check available MCPs and tools that could be relevant
  • If the skill domain is something you're less familiar with, research best practices
  • Read references/skill-anatomy.md for structural rules and references/writing-guide.md for craft guidance
  • Plan the file structure: will you need scripts, reference files, assets, templates?

Sketch the structure for the user and get buy-in before writing.

Phase 3: Draft the Skill

Write the SKILL.md following these principles (detailed in references/writing-guide.md):

  • Frontmatter first: Name (kebab-case) and description (specific, pushy, includes trigger phrases). The description is the front door — if the skill never triggers, nothing else matters. Give it disproportionate attention.
  • Keep SKILL.md under 500 lines. Move detailed docs to references/, scripts to scripts/, templates to assets/.
  • Use progressive disclosure: Tell Claude what files exist in the skill directory and when to read them. The model will explore on its own.
  • Write in imperative form: "Read the file" not "The file should be read."
  • Explain why over MUST: Today's LLMs are smart. Explain the reasoning behind instructions instead of relying on rigid directives. If you find yourself writing ALWAYS or NEVER in caps, reframe with reasoning.
  • Build a Gotchas section: The highest-signal content in any skill. Capture common failure points. Update this section over time.
  • Include examples: Concrete input/output pairs. Cover the happy path and at least one edge case.
  • Don't state the obvious: Claude already knows a lot. Focus on information that pushes Claude out of its normal patterns.
  • Error handling: Tell the model what to do when things go wrong.
  • Graceful degradation: If a dependency isn't available, what's the fallback?

Use the template in templates/new-skill-template.md as a starting point if helpful.
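
As a concrete illustration of the frontmatter guidance above (the skill name and trigger phrases below are hypothetical, not part of this skill):

---
name: release-notes-writer
description: Draft release notes from merged PRs and commit history. Use when the user asks to "write release notes", "summarize this release", or "draft the changelog", or when a CHANGELOG.md file is being edited. Do NOT use for commit message writing or general git questions.
---

Note how the description names concrete trigger phrases and states an explicit boundary, the same pattern this skill's own description follows.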

Phase 3b: Assess Hooks

After drafting the skill, assess whether hooks would improve it. Read the hooks section in references/writing-guide.md for the full guidance. The quick heuristic:

  • Does the skill involve multi-step workflows where one step's output should trigger the next? → PostToolUse hooks for workflow chaining.
  • Does the skill produce outputs that need validation? → PostToolUse hooks for auto-validation after writes.
  • Does the skill involve destructive or irreversible operations? → PreToolUse hooks for confirmation gates.
  • Would the skill benefit from detecting user intent before they explicitly invoke it? → UserPromptSubmit hooks for proactive suggestions.

If hooks would help, design them alongside the skill and include them in hooks/ with a settings.json registration file. If they wouldn't add clear value, skip them — hooks add complexity and maintenance burden.
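
For example, a post-write validation hook might be registered in hooks/settings.json roughly like this (the script name is hypothetical; confirm the exact schema against the current Claude Code hooks documentation):

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "hooks/validate-output.sh" }
        ]
      }
    ]
  }
}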

Phase 3c: Assess MCP & Rules

Evaluate whether the skill should be complemented by an MCP server and/or always-on rules. Read references/mcp-rules-integration.md for the full methodology. The quick assessment:

  1. Does the skill interact with external services? Look for CLI commands calling APIs, curl/wget requests, auth tokens. If yes, check whether an MCP server exists for that service. If an MCP server is available, design the skill to use MCP tools as the primary path with CLI as fallback.

  2. Would the skill benefit from fetching service documentation? If the skill depends on an external tool or platform, fetch the MCP server's tool descriptions and/or API docs to ensure the skill uses correct parameters, handles all error cases, and leverages available capabilities fully.

  3. Should always-on rules accompany this skill? If the skill uses MCP tools that could cause harm when misused, or if it establishes patterns that should apply even outside the skill's context (output formatting, safety checks, preferred tool variants), generate rules for .claude/rules/. Rules persist across sessions; skills only apply when loaded.

If MCP and rules are relevant, generate:

  • MCP tool usage recommendations in the skill
  • A .claude/rules/{service}-usage.md file with safety, data quality, tool usage, and fallback rules
  • Present both to the user alongside the skill draft
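
As an illustration, a rules file for a hypothetical issue-tracker MCP might look like this; the specific rules are placeholders, and the four groupings come from the rules file described above:

# issue-tracker usage rules (illustrative sketch)

Safety: Never close, delete, or bulk-edit issues without explicit user confirmation.
Data quality: Include a link to the originating conversation or commit when filing an issue.
Tool usage: Prefer the MCP issue tools over the CLI; they return structured results.
Fallback: If the MCP server is unavailable, fall back to the CLI and tell the user which path was used.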

Phase 4: Test Cases

After drafting, create 3-5 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for review before running.

Design test cases with variety:

  • 2 core cases (most common, straightforward usage)
  • 1 edge case (unusual input, ambiguous phrasing, missing files)
  • 1 stress case (most complex realistic input)
  • 1 minimal case (simplest valid input)

Save test cases to evals/evals.json. See references/schemas.md for the full schema. Don't write assertions yet — just prompts and expected output descriptions.

{
  "skill_name": "your-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Realistic user prompt",
      "category": "core",
      "expected_output": "Description of expected result",
      "files": [],
      "expectations": []
    }
  ]
}

Phase 5: Run & Evaluate

This is one continuous sequence. Put results in <skill-name>-workspace/ organized by iteration (iteration-1/, iteration-2/).

Step 1 — Spawn all runs in the same turn. For each test case, spawn two subagents:

  • With-skill run: Give the subagent the skill path and the eval prompt. Save outputs to iteration-N/eval-ID/with_skill/outputs/.
  • Baseline run: Same prompt, no skill. Save to iteration-N/eval-ID/without_skill/outputs/.

Launch everything at once so it all finishes around the same time.
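
For a single test case, the directory layout after Step 1 looks roughly like this:

<skill-name>-workspace/
  iteration-1/
    eval-1/
      with_skill/outputs/
      without_skill/outputs/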

Step 2 — While runs are in progress, draft assertions; don't sit idle. Write verifiable assertions for each test case: good assertions are specific, discriminating, and have descriptive names. Update evals/evals.json and write eval_metadata.json for each eval directory.
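
For example, the eval from Phase 4 might gain expectations along these lines (the assertion fields shown are illustrative; references/schemas.md has the authoritative shape):

{
  "id": 1,
  "prompt": "Realistic user prompt",
  "category": "core",
  "expected_output": "Description of expected result",
  "files": [],
  "expectations": [
    { "name": "report_file_created", "description": "A report file exists at outputs/report.md" },
    { "name": "covers_empty_input", "description": "The output explicitly handles the empty-input case" }
  ]
}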

Step 3 — Capture timing data. When subagent notifications arrive with total_tokens and duration_ms, save to timing.json immediately — this data isn't persisted elsewhere.
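
A minimal timing.json entry, assuming one record per eval (only total_tokens and duration_ms come from the notification; the wrapper structure is up to you):

{
  "eval_id": 1,
  "with_skill": { "total_tokens": 48000, "duration_ms": 93000 },
  "without_skill": { "total_tokens": 61000, "duration_ms": 128000 }
}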

Step 4 — Grade, aggregate, and review.

  1. Grade each run using agents/grader.md instructions. Save grading.json in each run directory. Use programmatic checks where possible — scripts are faster and more reliable than eyeballing.
  2. Aggregate into benchmark: python scripts/aggregate_benchmark.py <workspace>/iteration-N --skill-name <name>
  3. Generate the eval viewer: python scripts/generate_report.py <workspace>/iteration-N — get results in front of the human ASAP.
  4. Wait for user feedback before making changes yourself.

Step 5 — Read feedback and improve. See the Improvement Philosophy section below.

Phase 6: Iterate

After improving the skill:

  1. Apply improvements
  2. Rerun all test cases into iteration-N+1/
  3. Generate the report with --previous-iteration pointing at the last one
  4. Wait for user review
  5. Repeat until the user is happy or feedback is all empty
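
With the same arguments as Phase 5, the second-iteration report command would look something like:

python scripts/generate_report.py <workspace>/iteration-2 --previous-iteration <workspace>/iteration-1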

Phase 7: Description Optimization

Once the skill itself is solid, optimize the description for trigger accuracy. This is a separate step — don't optimize triggers before the skill works well.

Read the detailed process in references/writing-guide.md under "Description Optimization." The short version:

  1. Generate 20 trigger eval queries (10 should-trigger, 10 should-not-trigger). Realistic, detailed, with near-misses as negatives.
  2. Review with user.
  3. Run the optimization loop: python scripts/run_trigger_loop.py --eval-set <path> --skill-path <path> --model <model-id> --max-iterations 5
  4. Apply the best description and report scores.
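
The eval set might pair each query with the expected trigger decision, for example (field names are illustrative; see references/schemas.md for the real schema):

{
  "skill_name": "your-skill",
  "queries": [
    { "prompt": "I keep repeating this release checklist every sprint, can we make it reusable?", "should_trigger": true },
    { "prompt": "Write a bash script that archives old log files", "should_trigger": false }
  ]
}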

Phase 8: Package & Deliver

python scripts/package_skill.py <path/to/skill-folder>

This produces a .skill file ready for upload to Claude.ai Settings or for distribution. Present it to the user.


Enhance Mode: Auditing & Improving an Existing Skill

Enhance mode follows a structured audit → plan → rewrite → validate cycle. Read references/evaluation-framework.md for the full methodology. Here's the workflow:

Phase 1: Intake

  1. Read the complete SKILL.md — frontmatter to final line. Don't skim.
  2. Inventory all bundled resources. List every file. Note which are referenced from SKILL.md and which appear orphaned.
  3. State the skill's intent in one sentence. If you can't, that's your first finding.
  4. Identify the target environment and any assumptions about available tools.
  5. Copy the skill to a writable location before editing: cp -r <skill-path> /tmp/<skill-name>/

Present a Skill Profile summary to the user before continuing.
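
The profile can be short. A sketch, with a hypothetical skill under review:

Skill Profile: weekly-report-generator (hypothetical)
  Intent: Turn a week of Jira activity into a formatted status report.
  Bundled resources: SKILL.md, references/report-format.md, scripts/fetch_issues.py (not referenced from SKILL.md; possible orphan)
  Environment assumptions: Jira MCP server available, network access for API calls.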

Phase 2: Structural Evaluation

Evaluate across these dimensions (see references/evaluation-framework.md for detailed criteria):

  • Frontmatter & Triggering — Is the description specific, comprehensive in its trigger coverage, clear about boundaries, and pushy enough?
  • Progressive Disclosure — Is SKILL.md under 500 lines? Is there filler? Are resources referenced clearly?
  • Instruction Quality — Imperative form? Why over MUST? Good examples? Ambiguities?
  • Robustness — Error handling? Input variety? Graceful degradation?
  • Script & Resource Quality — Necessary? Correct? Documented?
  • Hooks Opportunity — Based on the skill's category and workflow, would hooks improve it? See references/writing-guide.md section 8 for the category assessment. If the skill involves multi-step workflows, output validation, destructive operations, or user intent detection, hooks are likely worth recommending. If not, note that hooks aren't needed and move on.
  • MCP & Rules Opportunity — Does the skill interact with external services? Is it using CLI where an MCP server could provide native access? Are there patterns that should be enforced as always-on rules even when the skill isn't loaded? See references/mcp-rules-integration.md for the full assessment framework. If the skill uses MCP tools, fetch the tool descriptions and verify correct usage. If rules would help, draft them alongside the skill enhancement.

Rate each: Strong / Adequate / Needs Work / Critical Gap / Not Applicable. Provide specific evidence.

Phase 3: Enhancement Plan

Organize findings into prioritized recommendations:

  1. Critical fixes — Things that cause failure or incorrect output
  2. High-impact improvements — Meaningfully improve quality, reliability, or trigger accuracy
  3. Medium optimizations — Efficiency, readability, maintainability
  4. Low-priority polish — Nice-to-haves

Each recommendation needs: What, Why (concrete impact), How (show the diff), Risk, and Effort estimate.

Present the plan to the user and get buy-in before rewriting.

Phase 4: Rewrite

Produce a complete enhanced version incorporating all Critical and High-priority recommendations. This is a clean rewrite, not a patch. Preserve the original name.

After rewriting, provide a Change Summary — concise bullets of what changed and why.

Phase 5: Test & Validate

Design 5-7 test prompts (2 core, 2 edge, 1 boundary, 1 stress, 1 minimal). Follow the same Run & Evaluate flow from Create Mode Phase 5.

If this is an improvement over an existing skill, use the original version as the baseline instead of a no-skill run.

Phase 6: Description Optimization

Same as Create Mode Phase 7. Especially important in Enhance mode since trigger issues are one of the most common problems.

Phase 7: Package & Deliver

Same as Create Mode Phase 8. Preserve the original skill name in the output.


Improvement Philosophy

When iterating on a skill — whether in Create or Enhance mode — keep these principles front of mind:

Generalize, don't overfit. The skill will be used thousands of times across different prompts. You're iterating on a few examples because it's fast. Don't put in fiddly changes that only fix one test case. If something is stubborn, try different metaphors or patterns rather than adding more rigid rules.

Keep the prompt lean. Every line costs context window tokens on every invocation. Read the transcripts — if the skill makes the model waste time on unproductive steps, cut those sections.

Explain the why. LLMs have good theory of mind. When given a good understanding of why something matters, they can go beyond rote instructions and handle novel situations. This is more effective than ALWAYS/NEVER directives.

Look for repeated work. If all test runs independently write similar helper scripts, that's a strong signal the skill should bundle that script. Write it once, put it in scripts/.

The description is the front door. If the skill never triggers, nothing else matters.

Build the Gotchas section over time. The highest-signal content in any skill captures the failure modes Claude hits in practice. Treat it as a living document.


Reference Files

Read these as needed — don't load everything upfront:

  • references/skill-anatomy.md — When drafting structure or validating format
  • references/skill-categories.md — During intent capture to identify skill type
  • references/writing-guide.md — When writing or rewriting skill instructions
  • references/evaluation-framework.md — During Enhance mode structural evaluation
  • references/mcp-rules-integration.md — When assessing MCP opportunity or generating rules
  • references/schemas.md — When creating eval JSON files or interpreting results

Agent Files

Read when spawning the relevant subagent:

  • agents/grader.md — Evaluate assertions against outputs
  • agents/comparator.md — Blind A/B comparison between two outputs
  • agents/analyzer.md — Analyze why one version beat another

Scripts

  • scripts/validate_skill.py — Validate skill structure and frontmatter
  • scripts/run_eval.py — Run a single eval against a skill
  • scripts/aggregate_benchmark.py — Aggregate grading results into benchmark
  • scripts/run_trigger_loop.py — Description optimization loop
  • scripts/generate_report.py — Generate HTML eval report
  • scripts/package_skill.py — Package skill as .skill file for distribution

Hooks

Skill-arc-reactor includes on-demand hooks that activate when the skill is loaded. They catch common mistakes at write-time instead of after packaging.

  • hooks/validate-frontmatter.sh (PostToolUse → Write/Edit) — Validates SKILL.md frontmatter on every save: kebab-case name, description length and content, XML brackets, trigger phrases, boundary clarity
  • hooks/check-skill-length.sh (PostToolUse → Write/Edit) — Nudges when SKILL.md exceeds 500 lines; reminds to move content to references/
  • hooks/capture-timing.sh (Notification) — Auto-captures subagent total_tokens and duration_ms into timing.json; this data is lost if not saved at completion time

Hook registration config is in hooks/settings.json. To activate, merge into your .claude/settings.json.

All hooks follow the same exit-code contract: exit 0 with stdout injects context (advice or warnings); exit 0 with no output is a silent pass. None of the hooks block (exit 2) — they advise, never prevent.
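
A minimal advisory hook following that contract might look like the sketch below; the stdin payload shape (.tool_input.file_path) is an assumption to verify against the current hooks documentation:

#!/usr/bin/env bash
# Sketch of the advisory contract: always exit 0, print only when there is advice.
file=$(jq -r '.tool_input.file_path // empty')   # assumed payload shape; verify against the hooks docs
[[ "$file" == *SKILL.md ]] || exit 0             # not a SKILL.md write: silent pass
lines=$(wc -l < "$file")
if (( lines > 500 )); then
  echo "SKILL.md is ${lines} lines; consider moving detail into references/."
fi
exit 0                                           # advise, never block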

Repository
endor-matt/Arc-Reactor-Skill-Evaluator