Do Agent Skills Actually Help? A Controlled Experiment

18 Feb 2026 · 10 minute read

Rotem Tamir

Rotem Tamir is a Go developer, ex-CTO and co-founder of atlasgo.io, and a consultant at honeybadge labs.

This is Part 2 of our Harbor series. Part 1 covers what Harbor is and how to run your first benchmark.

I had a question I couldn’t answer by intuition: do Claude Code’s Agent Skills actually help on a domain-specific task?

Skills are markdown files that teach Claude how to handle specific situations. In theory, they should improve how well agents handle certain problems. But “should” isn’t data. I wanted to know: if I add a skill for database migrations, does the agent actually perform better? And if so, by how much?

To answer this, I used Harbor - a framework for evaluating and optimizing agents. The quick version: you define a task (instruction + success criteria), run it N times across different configurations, and compare pass rates. Harbor handles the containers, parallelization, and result aggregation.

This post walks through the experiment: designing the task, forming a hypothesis, running trials, and - most importantly - what we learned by digging into the data.

The Task: A Realistic Bug

Quick refresher: a Harbor task is a folder with an instruction (what you ask the agent), a test (how you verify success), and an environment (Dockerfile + workspace files). Harbor runs this N times and reports pass rates. (Part 1 has the full walkthrough.)

Here’s a scenario I’ve seen multiple times: a developer adds a field to an ORM model but forgets to generate the corresponding database migration. Tests fail because the schema doesn’t match the model. The fix is straightforward - run the migration tool (atlas migrate diff, in our case) - but the agent needs to figure that out from a vague error message.

This is a perfect benchmark candidate: realistic, common, and verifiable. Here’s the task:

tasks/go-bug/
├── instruction.md
├── task.toml
├── tests/
│   └── test.sh             # Verifies migrations + runs go test
├── environment/
│   ├── Dockerfile          # Go + Atlas pre-installed
│   ├── main.go
│   ├── main_test.go
│   ├── models/
│   │   └── todos.go        # Category model with Description field
│   ├── migrations/
│   │   └── ...             # Missing description column migration
│   └── ...
└── solution/
    └── solve.sh

Our todo app is a Go application that uses the popular GORM library as its ORM, with Atlas (a “Terraform for Databases”) managing database schema changes (migrations).

The Category model has a Description field:

type Category struct {
    ID          uint   `json:"id" gorm:"primaryKey"`
    Name        string `json:"name"`
    Description string `json:"description"`  // Added by developer
}

A migration created the categories table in a previous commit, but whoever added the Description field forgot to generate the migration for it.

To see how well Claude does on our benchmark task, we provide deliberately vague instructions. This more closely resembles how developers actually prompt coding agents. Spelling out exactly how to achieve the result would also defeat the purpose of the benchmark - seeing how well the agent can figure out how to approach the problem.

Our instruction.md file reads:

# Task: Fix the Todo App Bug

The `/workspace` directory contains a Go module with a Todo application
using GORM and SQLite.

There is a bug in the application that causes one of the tests to fail.
Your task is to find and fix the bug so that all tests pass.

## Requirements

- All tests must pass when running `go test ./...`
- Do not modify the test file
- MUST use applicable skills if relevant

The verifier is critical - it must be deterministically correct and capture all requirements for a valid solution. For this task, simply making tests pass isn’t enough:

An agent might “fix” the bug by editing a previous migration file to add the missing column. That would work locally but break migration history in production - existing databases would try to re-run a modified migration. Our verifier catches this by checking that original migration hashes are unchanged:

#!/bin/bash
set -e
cd /workspace

# Check that original migrations were not modified (linear history preserved)
EXPECTED_HASH="20260114133748_add_categories.sql h1:pgqzTNd3JvZJDwR24vt..."
if ! grep -qF "$EXPECTED_HASH" migrations/atlas.sum; then
    echo "0" > /logs/verifier/reward.txt
    exit 0
fi
# ... check other migration hashes ...

# Validate migrations with Atlas
atlas migrate validate --env gorm || { echo "0" > /logs/verifier/reward.txt; exit 0; }

# Run Go tests
if go test ./... -v; then
    echo "1" > /logs/verifier/reward.txt
else
    echo "0" > /logs/verifier/reward.txt
fi

This task is ideal for testing skills because success requires domain-specific knowledge: the agent needs to know Atlas’s migrate diff command exists and how to use it. Without that knowledge, agents try plausible-looking shortcuts that fail verification. See the full task definition.
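For reference, here is roughly what the intended fix looks like, written as a minimal sketch rather than the repo’s actual solve.sh. The migration name is made up for illustration; the --env gorm flag matches the environment the verifier uses.

cd /workspace

# Generate a new migration from the difference between the GORM models and
# the existing migration directory. Atlas will emit something like an
# ALTER TABLE `categories` ADD `description` statement and update atlas.sum.
atlas migrate diff add_category_description --env gorm

# Confirm the migration directory is still consistent and the original
# migration files are untouched.
atlas migrate validate --env gorm

# Verify the failing test now passes.
go test ./...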

The Hypothesis

Hypothesis: Adding a skill that documents our Atlas migration workflow will improve the agent’s pass rate on this task.

Configurations to compare:

  1. Vanilla Claude Code - baseline, no skills
  2. Official Atlas agent instructions - comprehensive instructions for agents from the Atlas docs (see skill)
  3. Custom project skill - a shorter, project-specific skill (see skill)

The Setup

Harbor’s built-in Claude Code agent doesn’t support controlling which skills are loaded into the environment. So we wrap it:

class ClaudeCodeWithSkills(ClaudeCode):
    """Wraps Claude Code to inject skills into the container before running."""

    def __init__(self, skill_dir: str = "skills", skills: str | None = None, **kwargs):
        # skill_dir: local folder containing skill subdirectories
        # skills: comma-separated filter, e.g. "atlas-full" or "db-schema-mgmt"
        ...

    async def setup(self, environment: BaseEnvironment):
        await super().setup(environment)
        # Create /workspace/.claude/skills in container
        # Upload SKILL.md files from matching skill directories
        ...

Key insight: Harbor’s agent setup hook gives us a clean way to inject different skill sets before execution, making it possible to run controlled A/B tests on skill configurations without modifying the agent itself.

Because setup() runs before the task begins, we can upload selected SKILL.md files directly into the container filesystem — exactly where Claude Code expects them — and compare performance across:

  • vanilla baseline
  • full Atlas instructions
  • lightweight project-specific guidance

See the full implementation.

Harbor supports YAML configs for multi-agent comparisons:

# configs/go-bug-skills-comparison.yaml
agents:
  - name: claude-code # Baseline

  - name: claude-with-skills-atlas-full
    import_path: "agents.claude_code_with_skills:ClaudeCodeWithSkills"
    kwargs:
      skill_dir: skills
      skills: atlas-full

  - name: claude-with-skills-db-schema-mgmt
    import_path: "agents.claude_code_with_skills:ClaudeCodeWithSkills"
    kwargs:
      skill_dir: skills
      skills: db-schema-mgmt

tasks:
  - path: tasks/go-bug

Run with 30 trials per configuration:

$ uv run harbor run --config configs/go-bug-skills-comparison.yaml -k 30

Running 90 trials sequentially would take hours. With cloud sandboxes like Daytona, we ran all 90 trials in parallel - completing the entire experiment in under 5 minutes:

$ uv run harbor run --config configs/go-bug-skills-comparison.yaml -k 30 --env daytona -n 50

The Results

| Configuration          | Trials | Pass Rate | Δ vs Baseline |
|------------------------|--------|-----------|---------------|
| Vanilla Claude Code    | 30     | 53%       |               |
| + Official Atlas Skill | 30     | 73%       | +20%          |
| + Custom Project Skill | 30     | 80%       | +27%          |

Skills improved pass rate by 20-27 percentage points. But the value of running this experiment isn’t just the headline number - it’s what we learned by digging deeper.

Going Deeper: Why Did It Work?

| Configuration          | Skill Invocations | Pass Rate When Invoked | Pass Rate When Not |
|------------------------|-------------------|------------------------|--------------------|
| + Official Atlas Skill | 17/30 (57%)       | 82% (14/17)            | 62% (8/13)         |
| + Custom Project Skill | 25/30 (83%)       | 96% (24/25)            | 0% (0/5)           |

(How do we measure activation? Harbor captures Claude’s full session logs. We use an LLM judge to analyze each log and determine whether the agent invoked the skill during that run.)

Two insights jump out:

  1. The custom skill has higher activation (83% vs 57%). It triggers more reliably because its description is tuned for this specific task.
  2. When the custom skill fires, it nearly always works (96%). When it doesn’t fire, the agent fails every time (0%). The skill isn’t just helpful - it’s essential for this task.

When a skill is invoked, the agent follows the documented workflow: run atlas migrate diff to generate a new migration, then atlas migrate validate, then run tests. Without the skill, agents take shortcuts - like editing existing migrations - that fail our verifier.
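As a rough illustration of the LLM-judge step mentioned above, the sketch below assumes each trial’s session log has been exported to a plain-text file (the results path is hypothetical) and uses Claude Code’s non-interactive print mode as the judge; Harbor’s actual analysis is more involved, but the shape is the same.

#!/bin/bash
# Ask an LLM judge whether each session log shows the skill being invoked,
# then tally the answers across trials.
invoked=0
total=0
for log in results/go-bug/*/session.log; do   # hypothetical log location
    total=$((total + 1))
    verdict=$(cat "$log" | claude -p \
      "This is a Claude Code session log. Did the agent invoke the db-schema-mgmt skill? Answer only YES or NO.")
    [[ "$verdict" == *YES* ]] && invoked=$((invoked + 1))
done
echo "Skill invoked in $invoked/$total trials"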

The Iteration Cycle: From Failure to Activation

The results above didn’t come from our first attempt. Here’s what actually happened.

Our initial experiment showed dismal skill activation rates - around 10%. We had skills available, but Claude wasn’t using them. The pass rates were barely better than baseline.

At this point, we could have concluded “skills don’t work” and moved on. But Harbor gave us the data to dig deeper. When we analyzed the logs, we saw a different story: the skill worked when invoked - it just wasn’t being invoked often enough.

So we formed a new hypothesis: the skill description wasn’t compelling enough. We tested two changes:

  1. Stronger skill description: Changed from “Best practices for Atlas schema management” to “Rules that MUST be followed when working on database schema related tasks”
  2. Explicit instruction nudge: Added “MUST use applicable skills if relevant” to the task instruction

After these changes, activation jumped to 57-83%, and pass rate when invoked reached 82-96%.
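Concretely, a Claude Code skill is a SKILL.md file whose YAML frontmatter carries the name and description the model uses to decide when to load it. The sketch below shows roughly what the strengthened custom skill looks like; the body is an illustrative summary of the workflow described above, and only the quoted description is the wording we actually changed to.

---
name: db-schema-mgmt
description: Rules that MUST be followed when working on database schema related tasks
---

# Database Schema Management

Never edit an existing migration file to change the schema. Instead:

1. Run `atlas migrate diff <name> --env gorm` to generate a new migration.
2. Run `atlas migrate validate --env gorm` to check migration history.
3. Run `go test ./...` to confirm the tests pass.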

This is the value of the method: each experiment generates data that informs the next experiment. We went from “skills seem useless” to “skills work but need activation help” to “here’s how to trigger them reliably.”

What We Learned

The experiment answered our original question: yes, skills help on this task. But the larger lesson is about how we arrived at that answer:

  1. Data beats intuition - We started with a hypothesis that skills would help. If we’d just tried it once and eyeballed the result, we might have gotten a failure (53% baseline is a coin flip) and abandoned the approach. Running 30 controlled trials per variant revealed the real picture.
  2. Dig past the headline metric - Pass rate alone would have been useful (73-80% vs 53%). But tracking skill invocation revealed the mechanism: 96% pass rate when invoked, 0% otherwise for the custom skill. That insight explains why the numbers differ and what to optimize next.
  3. Iterate based on evidence - The first experiment showed skills weren’t triggering. Instead of guessing at fixes, we formed a hypothesis about why, tested it, and confirmed with data.

This is the loop: hypothesis → experiment → analysis → new hypothesis. Harbor makes the mechanical part easy so you can focus on the thinking.

From Here

The experiment above is one example. The same method applies to any agent configuration question:

  • “Does my custom system prompt help or hurt?”
  • “Is MCP server X worth the latency?”
  • “Should I use a single detailed skill or multiple focused ones?”

You don’t have to guess. You can measure.

The full benchmark task and custom agent are in the hello-harbor repo.

Harbor is open source at github.com/laude-institute/harbor.
