Common Pitfalls of Skills Development (And How to Fix Them)

28 Apr 2026 · 13 minute read

Alan Pope

Building the AI Native Dev community. Self-taught coder, driven by curiosity and a love for problem-solving.

I recently gave a version of this talk at AI Engineer Europe in London. What follows is the fuller story — what we found when we looked at thousands of skills, what goes wrong, and how to fix it.

You know that scene in The Matrix? Neo gets a spike in the back of his head, they upload kung fu directly into his brain, and he just... knows it.


That's what a skill is for an AI coding agent. You write a markdown file — a SKILL.md — and the agent loads it when the task matches. Suddenly it knows your team's deployment process, or how your API handles pagination, or that you never use semicolons.

It's not code. It's context. Procedural knowledge, injected at the right moment.
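Concretely, a skill looks something like this. A minimal sketch, assuming the Claude-style convention of YAML frontmatter with a name and description; the skill name, commands, and conventions here are invented for illustration:

```markdown
---
name: deploy-process
description: Deploys our services to staging with the release CLI. Use when asked to deploy, roll back, or check release status.
---

# Deployment process

1. Run `make release-check` before any deploy.
2. All deploys go through the `release` CLI, never kubectl directly.
3. To roll back: `release rollback <service> --to <version>`.
```

The description is what the agent matches against; the body is what gets injected once it does.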

The thing is — Neo's upload worked perfectly. Ours? Not always.

Skills are everywhere now

We spent some time analysing essentially all of public GitHub. In November last year, 12 repos had SKILL.md files. By March, 5,460 did. That's 450x growth in fourteen weeks.


Skills went from zero to 27% of all agent config activity in three months. Faster adoption than CLAUDE.md, AGENTS.md, or any of the dotfile formats before them. And 1 in 12 merged PRs on GitHub now touches an agent config file — 8.4%, up from basically zero eighteen months ago.

This is not a niche thing anymore. This is how people are working.

But are they electrifying?

Ninety percent of agent config files are never updated after creation. Write once, forget forever.

Your codebase evolves every day. Your dependencies change. Your API contracts shift. But the instructions you gave your agent? Frozen in time.

For Gemini files it's even worse — 97% are write-once. And the purpose-built "skill-as-product" repos? Over half are under 50 kilobytes. Wrapper repos. Many are AI-generated. High churn, low staying power.

We have this explosion of skills, and most of them are going stale the moment they're committed.

What we did about it

The DevRel team at Tessl spent a couple of months doing something pretty hands-on. We went out and found open-source projects with SKILL.md files. We ran them through our review tooling. And where we could improve them, we opened pull requests. To strangers. On the internet.

622 PRs. 559 different repos. Nearly six thousand skills touched.

We weren't just theorising about what goes wrong. We were in the trenches, reading other people's skills, fixing them, and learning from the maintainer responses.

At the time of writing, 96 of those PRs got merged. 140 were closed. The rest were still open. That's a 15% merge rate on cold PRs to strangers' repos — which honestly isn't bad.

And along the way, we learned exactly where skills break.

Pitfall #1: Vague descriptions

Your description field is your activation signal. It's the if-clause the agent evaluates before it decides to load your skill. If it's generic, the agent has no signal. It either ignores you, or worse, activates on the wrong task.

Before:

"A helpful skill for code review and quality improvement"


After:

"Runs ESLint with project rules, flags type-safety violations, and suggests fixes. Use when reviewing TypeScript PRs or running pre-commit checks."

Across our outreach, 105 of our PRs specifically fixed descriptions. It was the single most common fix.


And our research team measured this. When skills are installed but the agent isn't forced to use them, activation drops to 41%. Less than half. The skill is right there, installed, ready to go — and the agent walks right past it.

The strongest predictor of activation is what we call "distinctiveness conflict risk" — does your description use terms unique enough that the agent can tell your skill apart from its own built-in behaviours?

Skills with strong domain-specific nouns — "Remotion", "Calendly", "path-traversal-finder" — those activate well. Skills described with generic terms like "API", "code", "debugging"? They compete with the agent's own capabilities and lose.

What matters isn't how detailed your skill is. It's whether the description signals a concrete, bounded task that doesn't overlap with what the agent already knows how to do.
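Put side by side, the difference is easy to see. A sketch of two description fields (the Remotion skill is hypothetical, borrowing the domain noun from above):

```yaml
# Generic: competes with the agent's built-in behaviours and loses
description: A skill for API development, code quality, and debugging

# Distinctive: concrete nouns, a bounded task, a clear trigger
description: Generates Remotion video compositions from React components. Use when creating or editing compositions under src/remotion/.
```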

Pitfall #2: God skills

We found a Microsoft Foundry skill with 50 files in it. Fifty. Even with progressive disclosure, no agent is loading all of that context effectively.

And our review scores said it was fine. The evals passed. But three scenarios can't cover the surface area of fifty files. There's more content in that skill than can possibly be tested.

This is the God Skill problem. A skill that tries to do everything produces a description so broad it either never activates, or activates for the wrong reason. One skill, one workflow. That's the rule.
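The cure is splitting, not trimming. A sketch of what that looks like at the description level (skill names and wording are illustrative, not the actual Foundry skill):

```yaml
# Before: one god skill with a catch-all description
# foundry/SKILL.md
description: Everything for working with Foundry, including setup, deployment, testing, and troubleshooting

# After: one skill per workflow, each with its own activation signal
# foundry-deploy/SKILL.md
description: Deploys Foundry agents to production. Use when asked to deploy or promote a Foundry agent.

# foundry-testing/SKILL.md
description: Writes and runs Foundry test suites. Use when creating or debugging Foundry tests.
```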

The SkillsBench paper from earlier this year confirmed it: 16 out of 84 tasks showed negative skill deltas. The skill actively made the agent worse. Usually because it introduced conflicting guidance or unnecessary complexity for something the model already handled well.

Pitfall #3: Context bloat

We know that leaner skills perform better. One of our users reported that after optimising their skill, it used 40% fewer tokens and finished in half the time compared to scanning source code directly.

But here's the irony: when we run our own optimiser, the output is on average 17% longer than the input. The machine adds examples, caveats, edge cases. It's thorough — but thoroughness burns context window.

Human-written skills often contain things the LLM already knows. You don't need to explain what a REST API is. You don't need to define what TypeScript generics are. The agent knows. What it doesn't know is your specific conventions.

The fix is progressive disclosure. Core instructions in the body. Detailed reference material in separate resource files, loaded on demand. Not upfront.
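In file terms, that might look like this: a short body carrying the conventions, plus reference files the agent reads only when the task demands it. File names and conventions are invented for illustration:

```markdown
---
name: payments-api
description: Integrates with our internal payments API. Use when writing or reviewing code that calls the /v2/payments endpoints.
---

# Payments API conventions

- Use the shared client in lib/payments.ts; never call fetch directly.
- Amounts are integer minor units (cents), never floats.

Read these only when needed:
- references/errors.md for error codes and retry rules
- references/pagination.md for cursors and idempotency keys
```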

There's a related subtlety that bit us too. When you generate eval scenarios automatically, there's a risk that the scenario description accidentally tells the agent what to do to score well. We call it criteria leakage. The task says "implement audit logging with structured JSON output" — and the scoring rubric checks for structured JSON output. The baseline scores 80% just from reading the task description, without the skill.


Our research team measured this: 30% of auto-generated scenarios had meaningful leakage. And when leakage is high but the scenario is generic, the skill can actually score worse than baseline. The leaked info is enough for the agent without the skill, and the skill just adds noise.

If your baseline scores are suspiciously high, your scenarios might be doing the agent's homework for it.
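A sketch of what leakage looks like in a scenario definition. The format here is hypothetical, not Tessl's actual schema; the point is the relationship between task and rubric:

```yaml
# Leaky: the task restates the rubric, so the baseline agent scores well without the skill
task: Implement audit logging with structured JSON output and a request_id on every entry
rubric:
  - log output is structured JSON
  - every entry includes a request_id

# Tighter: the task states the goal; only the skill knows the conventions the rubric checks
task: Add audit logging to the order service
rubric:
  - log output is structured JSON
  - every entry includes a request_id
```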

Pitfall #4: Activation varies by agent

Activation isn't just about your description. It varies dramatically by agent harness.

| Setup | Activation rate |
| --- | --- |
| Claude Code (forced) | 98% |
| Single skill installed | 62% |
| 10 skills installed | 58% |
| Claude Code (not forced) | 41% |

With a single skill installed in a controlled test, activation is 62%. Add nine more skills and it drops to 58%. And installing too many skills can mean they conflict — the agent gets confused about which one to use, picks the wrong one, or picks none.

One of our colleagues tested a security review skill via MCP and reported: "The agent took the hint and just carried on" — completely ignoring the skill instructions. It acknowledged the skill existed but didn't follow it.

The honest bit

"We disagree pretty strongly with some of Tessl's guidance. Please stop submitting automated rewrites of our skills." — Open source maintainer

Not everyone loved our pull requests. And that's fair.

Reviewing a skill isn't just checking markdown formatting. If a skill augments a library, there's institutional knowledge baked in. The skill might encode proprietary details about how an org operates. Running a review without access to the project's test suite, without the external APIs, without the full context — you can't prove the "improvement" actually improves anything.

We can tell you if the description follows best practices. We can tell you if the structure is right. But we can't tell you if the content is correct for your specific domain without running it against your actual workload.

That's why evals matter. Static review is necessary but not sufficient. It's like static analysis versus actually running your tests.

The fix: the Context Development Lifecycle

So how do you actually fix all of this? Our very own Patrick Debois wrote about this as the Context Development Lifecycle. The idea is that context needs engineering rigour. The same discipline you'd give a shared library.


Generate: Capture the implicit knowledge. Your conventions, your architecture decisions, your API quirks. The agent can draft, but the human decides what's true.

Evaluate: Test it. Reviews check structure. Task evals run the agent on real scenarios with and without the skill and measure the difference. That's the only way to know.

Distribute: Version it, publish it, secure it. Skills need owners, changelogs, and semver. A skill without version history is technical debt from the moment it's shared.

Observe: Watch what happens in production. Monitor activation. Check adherence. Close the loop.

The teams that win won't be the ones with the best models. They'll be the ones with the best context.

The numbers that matter

Our large-scale eval study across 1,200 skills showed roughly 20% absolute improvement in accuracy when the agent has skill access. Even more interesting: smaller, cheaper models remain competitive with larger models when given good skills. That's a direct cost saving.

And when you optimise properly — trim the fat, fix the description, use progressive disclosure — you get the same results with 40% fewer tokens in half the time.

But here's the caveat: human-curated skills improve performance by over 16 percentage points. Self-generated skills? Negligible or even negative. The quality of the skill matters enormously.

Skill adherence across projects ranges from 19% to 94%, with an average of 62%. The variance is huge — and that's the gap where good engineering practices make the difference.

Skills aren't just nice-to-have. They're a multiplier. But only if you treat them like software.

Start fixing your skills today

Submit for review: Send your skill to the Tessl registry for review and scoring.

Automate it: Add the tesslio/skill-review GitHub Action to your repo so every PR that touches a SKILL.md gets reviewed automatically.
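A minimal workflow sketch, assuming the action follows the usual GitHub Actions shape. The tesslio/skill-review name comes from above; the @v1 tag and trigger paths are assumptions, so check the action's README before copying:

```yaml
# .github/workflows/skill-review.yml
name: Skill review
on:
  pull_request:
    paths:
      - "**/SKILL.md"   # only run when a skill file changes
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: tesslio/skill-review@v1   # pin the version the action's docs recommend
```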

Run it locally:

```bash
npx tessl skill review ./SKILL.md
```

The review gives you a score and line-level suggestions. The --optimize flag applies them. Iterate until you're above 70% before publishing. And when you're ready to go further, generate eval scenarios and run task evals — that's where you move from "does this look right" to "does this actually help."

If you're looking to bring this rigour to your engineering team, we can help with that too.