Stop guessing whether your Skill works: skill-optimizer measures and improves it

25 Mar 2026 · 9 minute read

Simon Maple

Simon Maple is Tessl’s Founding Developer Advocate, a Java Champion, and former DevRel leader at Snyk, ZeroTurnaround, and IBM.

I typed one sentence into Claude Code: "Please optimize the Fastify skill in this project." Then I walked away to grab a coffee.

When I returned, I had a complete picture of how well Matteo Collina's fastify-best-practices skill was actually performing: five realistic eval scenarios, a baseline score for each, a full before/after comparison, a diagnosed regression, a proposed fix, and a rerun confirming the improvement. The skill went from an average success rate of 67% to 94% across real-world scenarios. I didn't write a single eval. I didn't design a single rubric. I typed one sentence and let skill-optimizer do the rest.

Introducing skill-optimizer

When you write a SKILL.md, you're essentially writing instructions for an AI agent. The problem is you're writing those instructions blindly. You don't know:

  • Whether the agent actually follows them
  • Which parts are redundant (the agent already knows how to do things without the skill)
  • Which parts cause regressions (your instructions confuse the agent more than help)
  • Whether it works on cheaper models (Haiku) or only on expensive ones (Opus)
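For context, a SKILL.md is a markdown file with frontmatter that describes when the skill applies, followed by the instructions themselves. A minimal sketch, using the real skill's name but with an invented, illustrative body:

```markdown
---
name: fastify-best-practices
description: Best practices for building Fastify servers. Use when working
  with Fastify, server.ts, app.ts, or Pino logging.
---

# Fastify best practices

- Prefer async hooks over callback-style hooks.
- Validate request bodies with JSON Schema.
```

Every line in that body is an instruction the agent may or may not actually follow, and that's exactly what you can't see from the file alone.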

The skill-optimizer plugin runs your skill through a judge-scored eval pipeline, testing the agent with and without your skill on real tasks, then scoring the delta. You're not guessing anymore: you have real numbers to back up your feelings, as all Jedi should.

How it works: two complementary approaches

The plugin combines two methods:

  1. Skill review (tessl skill review)
    A static analysis of your SKILL.md itself. It scores the description and the content on criteria like specificity, trigger-term quality, conciseness, and actionability. This phase quickly catches structural problems before you even run the agent.
  2. Task evals (tessl eval run)
    Generates realistic task scenarios from your skill, runs an agent on each scenario twice (once without your skill as a baseline, and once using your skill), then has an LLM as a judge score both outputs against a per-scenario rubric. The score delta tells you the skill's value-add.

The skill, optimize-skill-performance-and-instructions, combines both approaches into a single end-to-end cycle.
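The core idea behind the eval half is simple: run every scenario twice and compare. Here's a minimal sketch of that delta scoring in plain JavaScript; the scenario data is hypothetical, and Tessl's real pipeline uses an LLM judge with per-scenario rubrics rather than raw numbers like these.

```javascript
// Sketch of delta scoring: each scenario is run without the skill
// (baseline) and with it, and the difference is the skill's value-add.
function scoreDelta(results) {
  return results.map(({ name, baseline, withSkill }) => ({
    name,
    delta: withSkill - baseline, // positive = the skill helps
  }));
}

// Hypothetical per-scenario judge scores (0..1):
const results = [
  { name: 'production-config', baseline: 0.41, withSkill: 1.0 },
  { name: 'database-plugin', baseline: 0.63, withSkill: 0.55 },
];

const deltas = scoreDelta(results);
// A negative delta flags a regression worth diagnosing.
const regressions = deltas.filter((d) => d.delta < 0);
```

A negative delta is the interesting signal: it means your instructions made the agent *worse* than it was on its own, which is exactly the failure mode a static review can never surface.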

A real example: mcollina's fastify-best-practices skill

mcollina/skills is Matteo Collina's open-source collection of skills for modern Node.js development. It already has 1,200+ stars and 80+ forks. It covers Fastify, TypeScript, linting, documentation, and core Node.js patterns, with a SKILL.md per skill and shared rules files wiring it all together.

We ran skill-optimizer against the fastify-best-practices skill. Here's what I did, written as a how-to so you can follow along if you like.

What actually happened

Step 1: Install the skill optimizer skill

In your skills project run:

tessl i tessl-labs/skill-optimizer

That's it! The skills become available to Claude Code the next time you start it.

[Image 1]

Step 2: Kick off the full optimization cycle

From within Claude Code, I asked just one thing:

Please optimize the Fastify skill in this project

Remember, always say please! That triggered optimize-skill-performance-and-instructions, the top-level skill in the plugin that calls the others as needed. Claude Code took it from there. The sub-steps below show the full sequence Claude ran automatically, and what happened at each stage.

[Image 2]

Step 2a: Skill review (Stage 1)

Claude Code kicked things off by reviewing the Fastify skill with Tessl:

tessl skill review skills/fastify/SKILL.md

The result was encouraging:

Average Score: 100%

  Description: 100%
    specificity: 3/3
    trigger_term_quality: 3/3
    completeness: 3/3
    distinctiveness_conflict_risk: 3/3

  Content: 100%
    conciseness: 3/3
    actionability: 3/3
    workflow_clarity: 3/3
    progressive_disclosure: 3/3

✔ Skill evaluation completed successfully!

A perfect score. The description was praised for its explicit "Use when" guidance, natural trigger terms (Fastify, server.ts, app.ts, Pino), and clear Fastify-specific terminology that keeps it from conflicting with generic Node.js skills.

This wasn't a surprise to me, of course, as I'd already worked with Matteo in a previous PR to improve all of these.

Here's the important lesson though: a perfect review score doesn't mean your skill is actually working. The static review tells you the instructions are well-formed. It doesn't tell you whether the agent follows them. That's what the evals are for.

Step 2b: Generate eval scenarios (Stage 2)

Claude then generated five real-world scenarios for the skill with Tessl:

tessl scenario generate . --count=5

Here are the scenarios it created:

[Image 3]

Five realistic, well-scoped scenarios covering the core surface area of the skill: production config, schema validation, auth, database plugins, and file handling with tests.

Step 2c: Run evals (Stage 3)

Next, Claude ran each scenario as an eval using the claude-sonnet-4-6 model:

tessl eval run . --agent=claude:claude-sonnet-4-6

Claude Code shares a monitoring URL and polls every few minutes.

Step 2d: Analyze results (Stage 4)

Here's what came back:

[Image 4]

Three scenarios with big gains, one modest gain, and one regression. The production config scenario is the standout: the skill took the agent from 41% to a perfect 100%. Without the skill, the agent had no idea it should reach for env-schema, close-with-grace, or @fastify/under-pressure. With it, it nailed every check.
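To make that concrete, here's the kind of JSON Schema that env-schema validates process.env against at startup; the variable names and defaults are illustrative, not taken from the skill itself.

```javascript
// Hedged sketch: an env-schema-style JSON Schema for environment config.
// Field names and defaults are hypothetical.
const envJsonSchema = {
  type: 'object',
  required: ['PORT'],
  properties: {
    PORT: { type: 'number', default: 3000 },
    LOG_LEVEL: { type: 'string', default: 'info' },
  },
};

// With env-schema installed, loading config would look roughly like:
//   const envSchema = require('env-schema');
//   const config = envSchema({ schema: envJsonSchema, dotenv: true });
// close-with-grace then wraps graceful shutdown, and
// @fastify/under-pressure guards the event loop; both are wired in
// at server startup.
```

None of these packages is exotic, but the baseline agent never reached for them; that's the kind of knowledge a skill exists to inject.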

The regression on the database scenario needs attention, but we wouldn't have known it existed without the evals!

Step 2e: Diagnose and fix (Stage 5)

The regression: database-plugin-architecture

Drilling into the per-check breakdown reveals the problem:

  Scenario 4: Database plugin architecture with official adapters

  Baseline (without context)
    onClose hook for cleanup           7/10  (70%)
    Async hooks used                   10/10 (100%)
    Structured logging in routes/hooks 2/10  (20%)

  With context
    onClose hook for cleanup           6/10  (60%)   ← got worse
    Async hooks used                   7/10  (70%)   ← got worse
    Structured logging in routes/hooks 0/10  (0%)    ← got worse

Two checks the agent handled fine without the skill actually got worse with it. Claude Code diagnosed the cause: hooks.md contained a callback-style AVOID example that was confusing the agent's async hook implementation. And database.md had no example of structured logging in route handlers, leaving a gap the baseline agent was partially filling on its own.
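For illustration, here are the two Fastify hook styles at issue, written as plain functions rather than registered on a real Fastify instance; the body of each hook is hypothetical.

```javascript
// Async style — the pattern the skill recommends. Fastify awaits the
// returned promise; no completion callback is needed.
const asyncOnRequest = async (request, reply) => {
  request.startTime = Date.now();
};

// Callback style — the kind of AVOID example that leaked into the
// agent's output. Note the extra `done` parameter that must be called
// to hand control back to Fastify.
const callbackOnRequest = (request, reply, done) => {
  request.startTime = Date.now();
  done();
};
```

Showing the discouraged pattern in full in a skill file is a double-edged sword: the agent can pattern-match on the code itself and miss the AVOID label around it.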

The gaps: TypeBox schema scenario

  Shared schema with $id and $ref              0/8  (0%)   → same score both runs
  additionalProperties: false on input schemas 0/8  (0%)   → skill not teaching this
  @fastify/error used                          0/10  (0%)  → not mentioned in skill

It turns out these weren't regressions; the skill just wasn't covering these patterns at all.
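For reference, here's a sketch of the two patterns those checks were looking for; the field names are hypothetical. With Fastify, the shared schema would be registered via fastify.addSchema() and referenced with $ref.

```javascript
// A shared schema with an $id, registered once and reused across routes.
// additionalProperties: false rejects unexpected fields on input.
const userSchema = {
  $id: 'user',
  type: 'object',
  additionalProperties: false,
  required: ['name'],
  properties: {
    name: { type: 'string' },
    email: { type: 'string' },
  },
};

// A route body schema referencing the shared schema by its $id:
const createUserBody = { $ref: 'user#' };
```

These are exactly the kinds of conventions an agent won't apply unless the skill spells them out, which is why the check scored 0/8 on both runs.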

Here's the summary of fixes that Claude then made automatically:

[Image 5]

Step 2f: Re-run and verify (Stage 6)

Claude then reran the evals to confirm the improvement after the fixes were made:

[Image 6]

The regression is gone. The TypeBox scenario jumped from 82% to 92%. The file upload scenario went from 85% to 94%. Overall average moved from 89% to 94%.

One stubborn gap remains: Structured logging in routes/hooks is still scoring 0/10 even after the fixes. That's for the next iteration.

Step 2g (optional): Validate across models

Next came an optional step: Claude ran the evals across multiple models with the following commands:

tessl eval run . --agent=claude:claude-haiku-4-5
tessl eval run . --agent=claude:claude-sonnet-4-6
tessl eval run . --agent=claude:claude-opus-4-6

If Haiku struggles on specific criteria, Claude Code will tell you, and the fix is usually simpler, more explicit phrasing rather than restructuring the whole skill.

Once all three models score well:

tessl skill publish ./skills/fastify

Summary: when to reach for each skill

| You want to... | Use this skill |
| --- | --- |
| Run a full skill optimization end-to-end | optimize-skill-performance-and-instructions |
| Generate scenarios + a first baseline run | setup-skill-performance |
| Fix and re-run from existing eval results | optimize-skill-performance |
| Quickly audit SKILL.md quality (no evals) | optimize-skill-instructions |
| Validate the skill on Haiku/Sonnet/Opus | compare-skill-model-performance |

The fastify-best-practices skill scored a perfect 100% on static review: well-structured description, good trigger terms, clean layout. And it still had a regression hiding in its eval results.

That's the gap skill-optimizer closes. Static review tells you the instructions are well-formed. Evals tell you whether the agent actually follows them. For the production config scenario, the skill took the agent from 41% to 100% by teaching it things like env-schema, close-with-grace, and @fastify/under-pressure, packages the agent simply doesn't reach for without explicit guidance. That gap is impossible to identify without measurement.

For anyone publishing skills to the Tessl registry, running this before you publish is the difference between shipping something that works and shipping something you hope works.