A Proposed Framework For Evaluating Skills [Research Eng Blog]

15 Apr 2026 · 21 minute read

Maksim Shaposhnikov

ML Engineer at Tessl focused on AI-native development with LLMs. Previously at Amazon, built scalable multimodal, voice, NLP, and recommender systems.

Abstract

We study how to evaluate the practical value of skills, reusable instruction bundles that provide agents with task-specific knowledge and workflows. Although skills are easy to create, their effect on agent performance is not well understood.

We propose a large-scale evaluation framework that measures performance with and without access to a skill using realistic coding tasks derived from the skill’s content. Applying this framework to a large corpus of skills, we find that access to a relevant skill substantially improves solution quality, yielding an approximately 20% absolute accuracy gain over a no-skill baseline. We also find that smaller models with access to the right skill can perform similarly to larger models while being 3X cheaper.

In addition, we show that agents often fail to activate available skills, with activation rates dropping to about 40% in an unforced setting. Finally, we find that evaluation quality matters critically: around 30% of generated tasks exhibit issues such as leakage, which can lead to misleadingly optimistic conclusions if left unchecked. These results highlight both the promise of skills and the importance of careful evaluation design.

Based on this work, Tessl has released an evaluation framework that handles the complexity of evaluation design, allowing practitioners to focus on building high-quality skills rather than bespoke evaluation pipelines.

If you’re interested in this kind of content and want to hear more about evals, context management, and agentic development, join us at the AI Dev Conference in June.

Why You Want To Evaluate Skills

A skill is a set of instructions, packaged as a simple folder, that teaches an agent how to handle specific tasks or workflows, or how to make opinionated choices.

Skills are one of the most powerful ways to customize Claude Code, or any other agent, for your specific needs. Instead of re-explaining your preferences, processes, and domain expertise in every conversation, skills let you teach an agent once and benefit every time.
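Concretely, a skill is just a folder whose entry point is a `SKILL.md` file with a short frontmatter header and free-form instructions in the body. The sketch below is a hypothetical example (the skill name, description, and rules are all invented for illustration):

```markdown
---
name: payments-api-conventions
description: Conventions for calling the internal payments API, including retry
  policy and idempotency keys. Use when writing code that creates or refunds charges.
---

# Payments API conventions

- Always send an `Idempotency-Key` header on POST requests.
- Use exponential backoff with at most 3 retries on 429/5xx responses.
- Never log full card numbers; log the last four digits only.
```

The `description` field matters more than it looks: as the activation results later in this post show, it is the main signal the agent uses to decide whether the skill is relevant to a task.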

Creating a skill is easy and cheap. However, reliably answering the question “Does this skill do anything useful?” is much harder. In practice, this question is only the tip of the iceberg. Many other practical questions matter more:

  • Does the skill improve outcomes compared with having no skill at all?
  • Does the agent activate the skill at all?
  • Does it work with claude-haiku-4-5, or only with larger and more expensive claude-sonnet-4-6?
  • Are there behaviors the skill is meant to teach that no model actually picks up?
  • Is the skill making anything worse?

To answer these questions, we conducted a large-scale study of hundreds of real-world open-source skills from the Tessl Registry and quantified their impact on coding quality.

To do that, we designed an evaluation framework that generates realistic coding problems. It then measures how models perform with and without access to a skill. The gap between those two settings tells us whether a skill is actually adding value.

In this post, we describe the evaluation methodology and quantify the impact of skills as a concept across a large set of real skills.

We used this evaluation approach as a key building block for the Tessl Evaluation Platform, which allows you to evaluate any skill reliably. Check it out here.

Experimental Setup: How We Evaluated Large Corpora Of Agent Skills

To evaluate the practical value of skills at scale, we created a dataset of diverse skills. We downloaded public skills curated in the Tessl Registry (https://tessl.io/registry) and sampled a diverse subset. We then automatically moderated skill quality to filter out low-quality ones. Next, we generated up to five realistic tasks per skill using its content, each paired with evaluation criteria. We then solved each task with an agent under two conditions: with access to the skill and without access to the skill. The diagram below summarizes the pipeline. We cover each step in the following sections.

[Figure: overview of the evaluation pipeline]

What is the Tessl Registry?

The registry contains approximately 5K unique skills that pass formal review checks (you can review any skill using the Tessl reviewer functionality here to get a quick idea of how good your skill is).

Using this large dataset, we characterized skill content at a high level by computing embeddings for each skill and using them to derive a thematic clustering of the corpus.
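As an intuition pump for this step, here is a self-contained sketch of the embed-then-cluster analysis. It substitutes a hashed bag-of-words vector for a real embedding model and uses a tiny pure-Python k-means; the actual pipeline used proper embeddings, and all names and example skills below are illustrative.

```python
import hashlib
import math
import random

DIM = 64  # small hashed bag-of-words dimension (toy stand-in for a real embedding model)

def embed(text: str) -> list[float]:
    """Hash each token into a fixed-size vector; a crude proxy for an embedding."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to the nearest center, then recompute centers."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[best].append(p)
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

skill_descriptions = [
    "train a pytorch model with mixed precision",
    "fine-tune an llm with lora adapters",
    "deploy a docker container to kubernetes",
    "write terraform for an aws vpc",
]
centers, clusters = kmeans([embed(s) for s in skill_descriptions], k=2)
```

In practice the cluster count was chosen to expose thematic structure (88 themes in our case), and each cluster was labeled by inspecting its members.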

This analysis identified 88 themes across the registry. The largest groups were Web Development (562 skills), AI/ML (560), and DevOps/Infrastructure (555), alongside a meaningful long tail of more specialized domains, such as bioinformatics and mathematics.

Specifically, the distribution among the most popular themes is as follows:

[Figure: distribution of skills across the most popular registry themes]

Overall, the registry exhibits broad topical coverage across both common and niche areas of software development. We then used these clusters to construct a representative sample of 1,200 skills for the next stage of the evaluation.

Feasibility Checking: Filtering Skills That Can't Be Evaluated

We introduced a guardrail mechanism to determine whether a skill can, in principle, be evaluated through a meaningful synthetic task. Some skills can only be used in specific repositories and are not useful outside those brownfield environments. One example is the collection of skills from the PyTorch GitHub repository. These skills are useful for development within the PyTorch codebase, but they do not do anything outside that repository environment. For this study, we wanted to filter out such skills from our dataset and focus on skills that could be evaluated independently, such as skills that describe framework API documentation, tools, opinionated style guides, or a combination of these.

Specifically, our mechanism reads the skill content and assigns one of three labels:

  • INFEASIBLE: the skill depends on substantial pre-existing setup that cannot be realistically synthesized as part of the evaluation, such as an existing codebase, running database, Git history, multi-file project state, or provisioned infrastructure. In these cases, the core purpose of the skill is to operate on that pre-existing context.
  • FEASIBLE_BUT_NEEDS_INPUT_DATA: the skill can be evaluated in a synthetic setting, but requires a specific input artifact, such as a file, to be provided. The dependency is limited and does not amount to a full environment.
  • FEASIBLE: the skill can be evaluated directly in a greenfield setting, typically because it focuses on APIs, documentation, or similarly self-contained tasks.
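As a rough illustration of this triage, here is a keyword-based sketch. The actual mechanism reads the full skill content with a model; the patterns below are invented heuristics, shown only to make the three-label interface concrete.

```python
import re

# Assumption: these keyword lists are illustrative, not the real classifier's logic.
INFEASIBLE_HINTS = [
    r"existing codebase", r"git history", r"running database",
    r"this repository", r"provisioned infrastructure",
]
NEEDS_INPUT_HINTS = [r"input file", r"provided csv", r"uploaded document"]

def feasibility_label(skill_text: str) -> str:
    """Assign one of the three feasibility labels described above."""
    text = skill_text.lower()
    if any(re.search(p, text) for p in INFEASIBLE_HINTS):
        return "INFEASIBLE"
    if any(re.search(p, text) for p in NEEDS_INPUT_HINTS):
        return "FEASIBLE_BUT_NEEDS_INPUT_DATA"
    return "FEASIBLE"
```

For example, a skill that says "refactor the helpers in this repository" would be labeled INFEASIBLE, while a skill that documents a library API would be labeled FEASIBLE.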

The mechanism is fast, reliably identifies obvious mismatches between a skill and a greenfield evaluation setup, and can reduce unnecessary evaluation cost by filtering out skills that are not suitable for this setting.

The table below suggests that this distinction is important in practice: roughly 40% of skills either require some form of input data or are not feasible to evaluate in a greenfield environment at all.

| Category | Number of Skills | Percentage (%) |
| --- | --- | --- |
| Feasible | 726 | 63.5 |
| Needs Input Data | 134 | 11.7 |
| Infeasible | 283 | 24.8 |

We retained only the feasible skills and carried this filtered set forward to the next stage of the evaluation. It is worth noting that the Tessl product allows you to determine whether your skill is feasible to evaluate. The Tessl evaluation engine supports fixtures and artifacts, allowing skill owners to upload any content. This makes it practically possible to evaluate most skills.

Generating Evaluation Tasks for AI Agent Skills

This is the most interesting part of the pipeline, where all the magic happens.

For each skill, we generated a set of evaluation tasks (task evals) designed to measure whether access to the skill leads to meaningfully different agent behavior.

Task generation began with a structured analysis of the skill’s contents. We parsed each skill and extracted its actionable instructions, including recommended libraries, prescribed workflows, required conventions, prohibited patterns, and domain-specific implementation details. For each instruction, we also recorded the type of situation in which it would become relevant and whether it appeared to encode a reminder, new knowledge, or a specific preference among otherwise valid alternatives.

Using this representation, we then generated realistic coding tasks that collectively maximized coverage of the skill’s instructions. Each task was framed as a coherent user task rather than as a direct test of the instructions themselves. In particular, the task description was written to make the skill relevant without explicitly revealing the behaviors being evaluated.

Each task consisted of several components: a task specification describing the problem to be solved and a grading specification expressed as a collection of rubrics, with a score allocated to each rubric. The criteria were derived directly from the skill instructions and were designed to be as objective and binary as possible so that final artifacts could be graded reliably.
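The structure of a generated task might be sketched as follows. The field names and example content are hypothetical, not the platform's actual format; the point is the shape: one task specification plus a set of point-weighted, binary rubrics.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    criterion: str   # a binary, objectively checkable statement derived from the skill
    points: int      # score allocated to this rubric

@dataclass
class TaskEval:
    skill_name: str
    task_spec: str   # the user-facing coding problem; must not reveal the rubrics
    rubrics: list[Rubric] = field(default_factory=list)

    def max_score(self) -> int:
        return sum(r.points for r in self.rubrics)

# Hypothetical example of one generated task eval.
task = TaskEval(
    skill_name="http-client-conventions",
    task_spec="Write a small CLI that fetches a URL and prints the status code.",
    rubrics=[
        Rubric("Uses the HTTP library recommended by the skill", 2),
        Rubric("Sets a request timeout", 1),
    ],
)
```

A grader scores the final artifact against each rubric independently, and the task score is the fraction of points earned out of `max_score()`.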

Importantly, the criteria focused on whether skill-specific guidance had been followed rather than on generic software quality or overall task success.

We fixed the maximum number of tasks per skill at five, although fewer could be generated for simpler skills. With five tasks, we achieved high coverage of the instructions encoded in each skill, whereas one or two tasks were clearly insufficient, as the table below shows:

| Tasks per Skill | Mean Coverage (%) |
| --- | --- |
| 1 | 28 |
| 2 | 47 |
| 3 | 59 |
| 4 | 70 |
| 5 | 78 |
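The coverage number above can be operationalized as the fraction of a skill's extracted instructions exercised by at least one task. A minimal sketch, with invented instruction identifiers:

```python
def instruction_coverage(instructions: set[str], tasks: list[set[str]]) -> float:
    """Fraction of skill instructions exercised by at least one task."""
    covered = set().union(*tasks) if tasks else set()
    return len(covered & instructions) / len(instructions)

# Hypothetical instruction IDs extracted from a skill, and the subset each task touches.
instructions = {"use-lib-x", "set-timeout", "no-print-secrets", "retry-on-429"}
tasks = [{"use-lib-x", "set-timeout"}, {"retry-on-429"}]
coverage = instruction_coverage(instructions, tasks)  # 3 of 4 instructions covered
```

Adding tasks increases coverage with diminishing returns, which is the pattern the table above shows.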

The Problem Of Leakage In Evals

We found that it is extremely important to monitor the quality of generated tasks by assessing the amount of leakage from the skill content into the tasks. This is necessary to preserve the discriminative power of the evaluation setup.

To detect such “cheating” behaviors, each generated task was validated along three dimensions:

  • Criteria Leakage. This checked how much of the scoring criteria leaked into the task content. High leakage indicated that the task instructions explicitly told the agent what to do to achieve a high score.
  • Skill Leakage. This checked how much of the skill’s instructions leaked into the task content. High leakage indicated that the task instructions explicitly described how the skill worked, making the presence or absence of the skill less meaningful.
  • Task Value. This checked how tightly the task tested actual skill-specific instructions rather than generic practices that any competent agent would follow without the skill. In other words, it measured how well the task captured genuinely skill-dependent behavior.

These filters allowed us to remove low-quality tasks and prevented the agent from being able to “cheat.”
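As a rough intuition for what leakage detection looks for, here is a toy n-gram overlap proxy. Our actual reviewers are model-based judges; verbatim n-gram overlap is only an illustrative lower bound, since paraphrased leakage would evade it.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def leakage_score(criteria: str, task_text: str, n: int = 3) -> float:
    """Fraction of the criteria's n-grams that reappear verbatim in the task text."""
    crit = ngrams(criteria, n)
    return len(crit & ngrams(task_text, n)) / len(crit) if crit else 0.0

# Hypothetical example: the leaky task restates the grading criterion word for word.
criteria = "use the requests library with a timeout"
leaky_task = "please use the requests library with a timeout when fetching"
clean_task = "write a cli that downloads a page and prints its status code"
```

Here `leakage_score(criteria, leaky_task)` is 1.0 while `leakage_score(criteria, clean_task)` is 0.0: the leaky task tells the agent exactly what earns points, so the eval no longer discriminates.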

We found that a significant proportion of generated tasks, 30%, contained a meaningful amount of criteria leakage in the task description. The full breakdown is shown below:

[Figure: breakdown of criteria leakage levels across generated tasks]

These observations suggest that even when we explicitly prompt an agent generating tasks to avoid leakage, the generator is still prone to “cheating” without a separate verifier. Additional verification mechanisms are therefore required.

At this stage, we applied the following filters to remove low-quality tasks from the dataset:

  • We filtered out tasks with MEDIUM or HIGH criteria leakage.
  • We filtered out tasks with MEDIUM or HIGH skill leakage.
  • We retained only tasks with HIGH value.
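The three filter rules above amount to a simple predicate over the validators' labels. A sketch, assuming a NONE/LOW/MEDIUM/HIGH label scale:

```python
def keep_task(criteria_leakage: str, skill_leakage: str, value: str) -> bool:
    """Retain a task only if both leakage labels are low and the value label is high."""
    return (
        criteria_leakage in {"NONE", "LOW"}
        and skill_leakage in {"NONE", "LOW"}
        and value == "HIGH"
    )
```

A task with MEDIUM criteria leakage is dropped even if its value is HIGH, because a leaky task inflates scores regardless of how well-targeted it is.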

The result was a set of realistic, skill-sensitive evaluation tasks that could be used to compare agent performance with and without access to the skill.

Result: Skills Improve Agent Accuracy by 20%

We further selected approximately 500 skills and their corresponding tasks across a range of domains. We then evaluated coding agents on tasks derived from those skills, both with and without access to the corresponding skill.

In the skill-enabled setting, the agent was explicitly informed in the prompt that the skill was available for solving the task. This allowed us to isolate the effect of the skill itself rather than confounding the results with uncertainty about whether the agent would choose to activate it. We present the ablation results for experiments without forced skill usage later in the text.

We conducted these experiments using models from the Anthropic family, specifically the latest available Haiku and Sonnet variants at the time of evaluation.

We report the results across three primary metrics: solution quality, cost, and runtime.

[Figure: solution quality, cost, and runtime with and without access to the skill]

Several practical findings emerge from this study.

  • First, access to a skill is highly beneficial: across the evaluated tasks, we observe an approximately 20% absolute improvement in accuracy for Sonnet 4.6 relative to the no-skill setting.
  • Second, smaller and cheaper models can remain highly competitive with larger, more expensive models when they are given access to specialized context. In our experiments, Haiku performs well relative to larger models when the relevant skill is available.
  • Third, access to additional context increases both cost and runtime. This is expected, since the agent must spend tokens reading the skill content and incorporating its guidance into the solution process. In practice, however, the cost increase is modest and can be further mitigated through techniques such as progressive disclosure.
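The headline metric is a simple paired comparison: each task is solved once with the skill and once without, and we report the difference in mean accuracy. A sketch with toy per-task scores (the numbers below are illustrative, not our measured data):

```python
def absolute_gain(with_skill: list[float], without_skill: list[float]) -> float:
    """Mean per-task accuracy with the skill minus mean accuracy without it."""
    assert len(with_skill) == len(without_skill), "scores must be paired per task"
    n = len(with_skill)
    return sum(with_skill) / n - sum(without_skill) / n

# Toy paired scores in [0, 1]; a gain of 0.2 corresponds to a 20% absolute improvement.
gain = absolute_gain([0.9, 0.8, 0.7, 1.0], [0.6, 0.7, 0.5, 0.8])
```

Pairing matters: because both conditions see identical tasks, per-task difficulty cancels out and the gap isolates the skill's contribution.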

Claude Code Often Fails To Determine When To Activate A Skill

As mentioned at the beginning of this section, the agent was explicitly informed in the prompt that the skill was available for solving the task. The reason was to isolate the effect of the skill itself rather than confounding the results with uncertainty about whether the agent would choose to activate it.

We conducted an additional ablation experiment in which the skill was installed and visible to Claude Code, but the agent had to decide for itself whether to use it based on its understanding of the task.

In other words, we removed the following phrase from the instruction:

A proper skill must be available to help with the task. If the skill is not installed and not available to you, immediately abort the run. Always use the skill for solving the task as it's the source of valuable information. The fact that some skills would require an API key for some service to actually use it should not be a blocker for implementing the task.

Without forcing the agent to use the skill, we observe that the skill was activated only about 40% of the time, compared with near-deterministic activation when skill use is forced:

| | Skill installed but not forced | Skill installed and forced |
| --- | --- | --- |
| # Tasks evaluated | 2286 | 2286 |
| # Tasks where skill was activated | 929 | 2240 |
| Activation rate (%) | 41 | 98 |
[Figure: skill activation rates with and without forced usage]

This experiment suggests that skill activation is an additional challenge, and Claude Code is not always able to determine from the available context that it should use the skill. The most common reason is that the skill’s description field is unclear and confuses the model. This motivates further work in two areas:

  • Improving the post-training recipe of frontier models so they can pick up skills more efficiently. This can only be done by the model provider.
  • Improving skill descriptions by making them unique, clear, and precise.

Our manual analysis showed that many publicly available skills are poorly described and use too many generic words that are not relevant to the actual content of the skill.

Tessl offers a capability to review your skill description and SKILL.md and propose improvements based on best practices. Check it out here.

Low Quality Tasks Affect Evaluations

As mentioned above, we filtered out tasks that were flagged as “bad,” either because they contained a high degree of criteria or skill leakage, or because they were judged to have low value. We then explored what happened to the score assessment when we measured it specifically on these “bad” slices of data. The results are shown below.

[Figure: score gap between skill and no-skill settings across task-quality slices]

Specifically, we found the following:

  • The gap between agents with and without access to the skill is significant for high-quality tasks, that is, tasks with LOW or NO criteria leakage.
  • Tasks with a MEDIUM or HIGH degree of criteria leakage lead to smaller behavioral differences between the two setups.
  • Measuring quality only on LOW value tasks leads, unsurprisingly, to no difference between setups with and without access to the skill.

These observations highlight the importance of task validation and indicate that task quality is critical for drawing reliable conclusions about performance.

Conclusions: Skills Work, But Only If You Can Measure Them

Our results support three practical conclusions.

Skills are a meaningful lever on both quality and cost. A 20% accuracy gain is not marginal. Equally important, the finding that a smaller model with the right skill can match a larger model means skill quality is a real input to your inference budget.

Having a skill installed is not the same as having it used. In an unforced setting, agents activate the relevant skill only 41% of the time. The quality and uniqueness of the skill description increase the chances of activation. Running evals allows you to measure activation explicitly and adjust if needed.

Evaluation quality determines whether conclusions are trustworthy. Without task validation, roughly 30% of generated evaluation tasks contain leakage that inflates scores and obscures the true effect of a skill. An evaluation that looks good on a poorly designed benchmark is not evidence that a skill works. Designing rigorous evaluations is hard, and the temptation to skip validation is precisely where misleading conclusions come from.

These findings motivated the design of the Tessl Evaluation Platform, which handles the full pipeline: feasibility checking, task generation, leakage detection, and scoring, so that practitioners can focus on building high-quality skills rather than thinking about how to design evaluation methodology and maintain necessary infrastructure.

If you want to know whether your skills are actually working, try it here.