Stop guessing whether your Skill works: skill-optimizer measures and improves it
25 Mar 2026 · 9 minute read

I typed one sentence into Claude Code: "Please optimize the Fastify skill in this project", and then walked away to grab a coffee.
When I returned, I had a complete picture of how well Matteo Collina's fastify-best-practices skill was actually performing: five realistic eval scenarios, a baseline score for each, a full before/after comparison, a diagnosed regression, a proposed fix, and a rerun confirming the improvement. The skill went from an average success rate of 67% to 94% across real-world scenarios. I didn't write a single eval. I didn't design a single rubric. I typed one sentence and let skill-optimizer do the rest.
Introducing skill-optimizer
When you write a SKILL.md, you're essentially writing instructions for an AI agent. The problem is you're writing those instructions blindly. You don't know:
- Whether the agent actually follows them
- Which parts are redundant (the agent already knows how to do things without the skill)
- Which parts cause regressions (your instructions confuse the agent more than help)
- Whether it works on cheaper models (Haiku) or only on expensive ones (Opus)
The skill-optimizer plugin runs your skill through a judge-scored eval pipeline, testing the agent with and without your skill on real tasks, then scoring the delta. You're not guessing anymore; you have real numbers to back up your feelings, as all Jedi should.
How it works: two complementary approaches
The plugin combines two methods:
- Skill review (`tessl skill review`): a static analysis of your `SKILL.md` itself. It scores the skill on four dimensions: completeness, actionability, conciseness, and robustness. This phase quickly catches structural problems before you even run the agent.
- Task evals (`tessl eval run`): generates realistic task scenarios from your skill, runs an agent on each scenario twice (once without your skill as a baseline, and once using your skill), then has an LLM-as-a-judge score both outputs against a per-scenario rubric. The score delta tells you the skill's value-add.
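The core idea behind the eval half can be sketched in a few lines. This is my own illustrative sketch of the delta computation, not the plugin's actual code; the `Check` shape and function names are assumptions I've made up for the example:

```typescript
// Sketch of the eval-delta idea: score both transcripts against the same
// rubric, then compare. The numbers below are taken from the article's
// database-plugin scenario.
type Check = { name: string; score: number; max: number };

// Aggregate per-check points into a percentage, like the report shows.
function rubricScore(checks: Check[]): number {
  const total = checks.reduce((sum, c) => sum + c.score, 0);
  const max = checks.reduce((sum, c) => sum + c.max, 0);
  return Math.round((total / max) * 100);
}

// The delta is the skill's value-add: positive = the skill helped,
// negative = the skill made the agent worse (a regression).
function delta(baseline: Check[], withSkill: Check[]): number {
  return rubricScore(withSkill) - rubricScore(baseline);
}

const baseline: Check[] = [
  { name: "onClose hook for cleanup", score: 7, max: 10 },
  { name: "Async hooks used", score: 10, max: 10 },
  { name: "Structured logging in routes/hooks", score: 2, max: 10 },
];
const withSkill: Check[] = [
  { name: "onClose hook for cleanup", score: 6, max: 10 },
  { name: "Async hooks used", score: 7, max: 10 },
  { name: "Structured logging in routes/hooks", score: 0, max: 10 },
];

console.log(delta(baseline, withSkill)); // negative delta = regression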
The skill, optimize-skill-performance-and-instructions, combines both approaches into a single end-to-end cycle.
A real example: mcollina's fastify-best-practices skill
mcollina/skills is Matteo Collina's open-source collection of skills for modern Node.js development. It already has 1,200+ stars and 80+ forks. It covers Fastify, TypeScript, linting, documentation, and core Node.js patterns, with a SKILL.md per skill and shared rules files wiring it all together.
We ran skill-optimizer against the fastify-best-practices skill. Here's what I did, written up as a how-to so you can follow along if you like.
What actually happened
Step 1: Install the skill optimizer skill
In your skills project run:
```shell
tessl i tessl-labs/skill-optimizer
```

That's it! The skills become available to Claude Code the next time you start it.

Step 2: Kick off the full optimization cycle
From within Claude Code, I asked just one thing:
```
Please optimize the Fastify skill in this project
```

Remember, always say please! That triggered a skill called optimize-skill-performance-and-instructions, which is the top-level skill in the plugin and calls the others as needed. Claude Code took it from there. In the steps below, you'll see the full sequence that Claude ran automatically, and what happened at each stage.

Step 2a: Skill review (Stage 1)
Claude Code kicks off by performing a review of the Fastify skill using Tessl.
```shell
tessl skill review skills/fastify/SKILL.md
```

The result was encouraging:

```
Average Score: 100%
Description: 100%
  specificity: 3/3
  trigger_term_quality: 3/3
  completeness: 3/3
  distinctiveness_conflict_risk: 3/3
Content: 100%
  conciseness: 3/3
  actionability: 3/3
  workflow_clarity: 3/3
  progressive_disclosure: 3/3
✔ Skill evaluation completed successfully!
```

A perfect score. The description was praised for its explicit "Use when" guidance, natural trigger terms (Fastify, server.ts, app.ts, Pino), and clear Fastify-specific terminology that keeps it from conflicting with generic Node.js skills.
This wasn't a surprise to me, of course: I had already worked with Matteo in a previous PR to improve all of these.
Here's the important lesson though: a perfect review score doesn't mean your skill is actually working. The static review tells you the instructions are well-formed. It doesn't tell you whether the agent follows them. That's what the evals are for.
Step 2b: Generate eval scenarios (Stage 2)
Claude then generated five real-world scenarios for the skill with Tessl:

```shell
tessl scenario generate . --count=5
```

Here are the scenarios that were created.

Five realistic, well-scoped scenarios covering the core surface area of the skill: production config, schema validation, auth, database plugins, and file handling with tests.
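To give a feel for what a scenario plus its rubric might look like, here is a hypothetical shape I've sketched myself; Tessl's actual scenario file format may well differ, and the field names are my own invention:

```typescript
// Hypothetical scenario shape; not Tessl's real format.
type RubricCheck = { name: string; maxPoints: number };
type Scenario = { title: string; task: string; rubric: RubricCheck[] };

// Checks mirror the production-config scenario described in the article.
const productionConfig: Scenario = {
  title: "Production-ready server configuration",
  task: "Set up a Fastify server suitable for production deployment.",
  rubric: [
    { name: "Environment validated with env-schema", maxPoints: 10 },
    { name: "Graceful shutdown via close-with-grace", maxPoints: 10 },
    { name: "Load shedding via @fastify/under-pressure", maxPoints: 10 },
  ],
};

console.log(productionConfig.rubric.length); // 3
```

The judge scores each check independently, which is what makes the per-check regression analysis in the later stages possible.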
Step 2c: Run evals (Stage 3)
Following scenario generation, Claude ran each scenario as an eval with Tessl, using the claude-sonnet-4-6 model:

```shell
tessl eval run . --agent=claude:claude-sonnet-4-6
```

Claude Code shares a monitoring URL and polls every few minutes.
Step 2d: Analyze results (Stage 4)
Here's what came back:

Three scenarios with big gains, one modest gain, and one regression. The production config scenario is the standout: the skill took the agent from 41% to a perfect 100%. Without the skill, the agent didn't know to reach for env-schema, close-with-grace, or @fastify/under-pressure. With it, it nailed every check.
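To make that gap concrete, here's a minimal sketch of the production-config pattern the scenario rewards. This is my own illustration, not code from the skill, and it assumes fastify, env-schema, close-with-grace, and @fastify/under-pressure are installed:

```typescript
// Sketch only: names and options are illustrative, not the skill's wording.
import Fastify from "fastify";
import envSchema from "env-schema";
import closeWithGrace from "close-with-grace";
import underPressure from "@fastify/under-pressure";

// Validate environment variables at startup instead of reading
// process.env ad hoc throughout the codebase.
const config = envSchema<{ PORT: number; LOG_LEVEL: string }>({
  schema: {
    type: "object",
    required: ["PORT"],
    properties: {
      PORT: { type: "number", default: 3000 },
      LOG_LEVEL: { type: "string", default: "info" },
    },
  },
  dotenv: true,
});

const app = Fastify({ logger: { level: config.LOG_LEVEL } });

// Shed load when the event loop is under pressure.
app.register(underPressure, { maxEventLoopDelay: 1000 });

// Graceful shutdown on SIGINT/SIGTERM.
closeWithGrace(async ({ err }) => {
  if (err) app.log.error(err);
  await app.close();
});

app.listen({ port: config.PORT, host: "0.0.0.0" });
```

Nothing here is exotic; it's just a set of packages a baseline agent rarely reaches for unprompted, which is exactly why the skill's explicit guidance moves the score so much.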
The regression on the database scenario needs attention, but we wouldn't have known about it at all without the evals!
Step 2e: Diagnose and fix (Stage 5)
The regression: database-plugin-architecture
Drilling into the per-check breakdown reveals the problem:
Scenario 4: Database plugin architecture with official adapters

```
Baseline (without context)
  onClose hook for cleanup            7/10 (70%)
  Async hooks used                   10/10 (100%)
  Structured logging in routes/hooks  2/10 (20%)
With context
  onClose hook for cleanup            6/10 (60%)   ← got worse
  Async hooks used                    7/10 (70%)   ← got worse
  Structured logging in routes/hooks  0/10 (0%)    ← got worse
```

Two checks the agent handled fine without the skill actually got worse with it. Claude Code diagnosed the cause: hooks.md contained a callback-style AVOID example that was confusing the agent's async hook implementation. And database.md had no example of structured logging in route handlers, leaving a gap the baseline agent was partially filling on its own.
The gaps: TypeBox schema scenario
```
Shared schema with $id and $ref               0/8 (0%)   → same score both runs
additionalProperties: false on input schemas  0/8 (0%)   → skill not teaching this
@fastify/error used                           0/10 (0%)  → not mentioned in skill
```

So these weren't regressions; the skill just wasn't covering them at all.
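For context, here's what those three missing patterns look like in practice. Again, this is my own sketch (assuming fastify and @fastify/error are installed), not the content Claude added to the skill:

```typescript
// Illustrative only; schema ids, error codes and routes are made up.
import Fastify from "fastify";
import createError from "@fastify/error";

const app = Fastify();

// Shared schema: registered once with an $id, referenced via $ref elsewhere.
app.addSchema({
  $id: "createUserBody",
  type: "object",
  additionalProperties: false, // reject unknown input fields
  required: ["name"],
  properties: { name: { type: "string" } },
});

// Typed, reusable error with a stable code and HTTP status.
const UserExistsError = createError("USER_EXISTS", "User %s already exists", 409);

app.post(
  "/users",
  { schema: { body: { $ref: "createUserBody#" } } },
  async (request) => {
    const { name } = request.body as { name: string };
    if (name === "admin") throw new UserExistsError(name);
    return { name };
  }
);
```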
Here is the summary of fixes that Claude automatically went on to make:

Step 2f: Re-run and verify (Stage 6)
Claude then reran the evals to show the improvement after the fixes to the skill were made:

The regression is gone. The TypeBox scenario jumped from 82% to 92%. The file upload scenario went from 85% to 94%. Overall average moved from 89% to 94%.
One stubborn gap remains: Structured logging in routes/hooks is still scoring 0/10 even after the fixes. That's for the next iteration.
Step 2g (optional): Validate across models
Next, Claude ran the evals across multiple models using the following commands:

```shell
tessl eval run . --agent=claude:claude-haiku-4-5
tessl eval run . --agent=claude:claude-sonnet-4-6
tessl eval run . --agent=claude:claude-opus-4-6
```

If Haiku struggles on specific criteria, Claude Code will tell you, and the fix is usually simpler, more explicit phrasing rather than restructuring the whole skill.
Once all three models score well:
```shell
tessl skill publish ./skills/fastify
```

Summary: when to reach for each skill
| You want to... | Use this skill |
|---|---|
| Run a full skill optimization end-to-end | optimize-skill-performance-and-instructions |
| Generate scenarios + first baseline run | setup-skill-performance |
| You have eval results, want to fix and re-run | optimize-skill-performance |
| Quickly audit SKILL.md quality (no evals) | optimize-skill-instructions |
| Validate the skill works on Haiku/Sonnet/Opus | compare-skill-model-performance |
The fastify-best-practices skill scored a perfect 100% on static review: well-structured description, good trigger terms, clean layout. And it still had a regression in production.
That's the gap skill-optimizer closes. Static review tells you the instructions are well-formed. Evals tell you whether the agent actually follows them. For the production config scenario, the skill took the agent from 41% to 100% by teaching it to reach for things like env-schema, close-with-grace, and @fastify/under-pressure, which the agent simply doesn't use without explicit guidance. That gap is impossible to identify without measurement.
For anyone publishing skills to the Tessl registry, running this before you publish is the difference between shipping something that works and shipping something you hope works.



