We ran Composer 2.5 and 2.5 Fast across 11 skills. Surprisingly, Fast won.

Discover why Composer 2.5 Fast outperforms Composer 2.5 on both speed and score across 11 skills. Switch for faster results at no extra cost.

Simon Maple

·28 May 2026·6 min read

Cursor just shipped Composer 2.5 and Composer 2.5 Fast. We benchmarked both across 11 engineering skills, 5 scenarios per skill, averaged across three independent LLM judges. The fast model scored higher, ran 32% quicker, and costs exactly the same. If you are reaching for Composer 2.5 over Composer 2.5 Fast, you are paying the same price for a slower, slightly worse model.

Here is the full picture.

TL;DR

  • Composer 2.5 Fast scores 92.7% with skill context. Composer 2.5 scores 92.1%. Fast wins.
  • Both are ahead of gpt-5.5, gpt-5.4, and the previous Composer 2.
  • The fast model completes scenarios in 59 seconds on average. The regular model takes 87 seconds.

Where They Land in the Benchmark

We ran 8 models across 11 skills, scoring each run with three independent judges and averaging the results. Here is where the full leaderboard sits:

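| Model | Avg baseline | Avg with-skill | Lift |
|---|---|---|---|
| opus-4-7 | 80.8% | 93.4% | +12.6 |
| composer-2.5-fast | 79.6% | 92.7% | +13.1 |
| composer-2.5 | 79.0% | 92.1% | +13.1 |
| composer-2 | 74.2% | 89.6% | +15.4 |
| gpt-5.5 | 75.5% | 89.4% | +13.9 |
| gpt-5.4 | 74.1% | 89.3% | +15.2 |
| gpt-5.3 | 65.5% | 83.9% | +18.4 |
| gpt-5-codex | 68.7% | 78.7% | +10.0 |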

Composer 2.5 Fast sits 0.7 points behind opus-4-7 and 3.1 points clear of the next model outside the 2.5 family, Composer 2. That is a meaningful gap. The previous Composer 2 sits alongside gpt-5.4 and gpt-5.5 at roughly 89-90%. Cursor has moved its own model up a full competitive tier in a single release.
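The gaps between models can be recomputed directly from the with-skill column of the leaderboard. The snippet below is a minimal sketch: the scores are copied from the table above, while the names `WITH_SKILL` and `gap` are illustrative and not part of any benchmark tooling.

```python
# With-skill averages copied from the leaderboard table (percent).
WITH_SKILL = {
    "opus-4-7": 93.4,
    "composer-2.5-fast": 92.7,
    "composer-2.5": 92.1,
    "composer-2": 89.6,
    "gpt-5.5": 89.4,
    "gpt-5.4": 89.3,
    "gpt-5.3": 83.9,
    "gpt-5-codex": 78.7,
}


def gap(model_a: str, model_b: str) -> float:
    """Return model_a's with-skill score minus model_b's, in points."""
    return round(WITH_SKILL[model_a] - WITH_SKILL[model_b], 1)


# Rank models from highest to lowest with-skill score.
ranked = sorted(WITH_SKILL, key=WITH_SKILL.get, reverse=True)

for name in ranked:
    print(f"{name:<18} {WITH_SKILL[name]:5.1f}  behind leader: {gap(ranked[0], name):4.1f}")
```

Calling `gap("composer-2.5-fast", "composer-2")` gives the distance from Composer 2.5 Fast to the best model outside the 2.5 family.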

The Fast model seems better.

Normally a "fast" variant trades quality for speed. Composer 2.5 Fast does not do that. It scores 0.6 points higher than the regular model while running 28 seconds faster per scenario (59s vs 87s on average across 110 scored runs).
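The speed figures are easy to misread, because a reduction in wall-clock time and a speedup factor are different numbers. A small sketch of the arithmetic, using only the two averages quoted above (the function name `compare_durations` is illustrative):

```python
def compare_durations(slow_seconds: float, fast_seconds: float) -> dict:
    """Summarize how much quicker the faster run is, in three common forms."""
    saved = slow_seconds - fast_seconds
    return {
        "seconds_saved": saved,
        # Share of the slower run's time that is eliminated.
        "time_reduction_pct": 100 * saved / slow_seconds,
        # How many times faster the quicker run completes.
        "speedup_factor": slow_seconds / fast_seconds,
    }


# Average seconds per scenario: Composer 2.5 vs Composer 2.5 Fast.
summary = compare_durations(slow_seconds=87, fast_seconds=59)
print(summary)
```

Cutting 28 of 87 seconds is roughly a 32% reduction in time per scenario, which corresponds to a speedup factor of about 1.47.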

The per-skill breakdown shows where the differences accumulate:

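| Skill | 2.5 with-skill | 2.5-fast with-skill | Winner |
|---|---|---|---|
| documentation | 97% | 98% | fast |
| fastify | 99% | 94% | 2.5 |
| init | 87% | 86% | 2.5 |
| linting | 98% | 99% | fast |
| node-best-practices | 95% | 95% | tie |
| nodejs-core | 98% | 98% | tie |
| oauth | 92% | 89% | 2.5 |
| octocat | 95% | 96% | fast |
| skill-optimizer | 98% | 98% | tie |
| snipgrapher | 93% | 93% | tie |
| typescript | 82% | 76% | 2.5 |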

The regular model wins on fastify (+5), oauth (+3), typescript (+6), and narrowly on init (+1). The fast model wins on documentation, linting, and octocat, each by a single point. For most skills they are within noise. The overall average breaks toward fast because it avoids some of the deeper failures the regular model hits on documentation and linting under stricter judges.
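The win counts and margins can be tallied mechanically from the per-skill table. This is a sketch under the assumption that the rounded percentages in the table are the only inputs; `SCORES`, `winner`, and the other names are illustrative.

```python
from collections import Counter

# (composer-2.5, composer-2.5-fast) with-skill scores from the per-skill table, in percent.
SCORES = {
    "documentation": (97, 98),
    "fastify": (99, 94),
    "init": (87, 86),
    "linting": (98, 99),
    "node-best-practices": (95, 95),
    "nodejs-core": (98, 98),
    "oauth": (92, 89),
    "octocat": (95, 96),
    "skill-optimizer": (98, 98),
    "snipgrapher": (93, 93),
    "typescript": (82, 76),
}


def winner(regular: int, fast: int) -> str:
    """Label which variant scored higher on a skill, or 'tie'."""
    if regular > fast:
        return "2.5"
    if fast > regular:
        return "fast"
    return "tie"


# Count wins per variant and compute the signed margin (regular minus fast).
tally = Counter(winner(r, f) for r, f in SCORES.values())
deltas = {skill: r - f for skill, (r, f) in SCORES.items()}

# The three skills with the largest absolute difference between variants.
largest = sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]

print(dict(tally))
print(largest)
```

Positive margins favor the regular model and negative margins favor Fast, so the largest entries show where the choice of variant matters most.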

The typescript result is worth flagging separately. Both models score lower with skill context than without it on typescript. The regular model drops from baseline to 82% with skill; the fast model drops further to 76%. Something about how these models interact with the typescript skill works against them. If typescript is central to your workflow, treat this as a yellow flag worth investigating.

The Cost Argument

Both Composer 2.5 variants are part of the Cursor subscription. The marginal cost of choosing one over the other is zero. There is no per-token bill that changes when you switch from the regular to the fast model.

This makes the benchmark result unusually clean: faster, no more expensive, and better. The only case where you might prefer the regular model is if you work heavily in fastify or oauth-heavy codebases, where it holds a 3-5 point lead in this benchmark. For everything else, the fast model is the better default.

Compare this to the OpenAI side of the leaderboard. gpt-5.5 and gpt-5.4 both land around 89%, behind both Composer 2.5 variants, and carry per-token API costs that accumulate with usage. The Cursor subscription gives you a stronger model at a fixed price, which changes the economics significantly if you are running agents at any kind of scale.

What Changed from Composer 2

The gap between Composer 2 and Composer 2.5 is larger than the leaderboard position suggests. The with-skill scores are 89.6% vs 92.1-92.7%, a 2.5-3.1 point jump. More importantly, the baseline scores tell a different story: Composer 2 sits at 74.2% without context, while Composer 2.5 sits at 79.0-79.6%. That roughly 5 point baseline improvement (4.8 for the regular model, 5.4 for Fast) means the new model is genuinely stronger at the task, not just better at following instructions when given them.

The lift numbers reinforce this. Composer 2 shows +15.4 points of lift from skill context. Both 2.5 variants show +13.1. A lower lift number means the model needs less scaffolding to perform well. Composer 2 was getting more out of the skill context because it needed it more. Composer 2.5 is a better baseline model that skills push even higher.
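Lift here is simply the with-skill average minus the baseline average. The sketch below recomputes it for the three Composer models from the leaderboard figures, alongside each 2.5 variant's baseline gain over Composer 2. The names `ROWS` and `lift` are illustrative.

```python
# (avg baseline, avg with-skill) in percent, copied from the leaderboard table.
ROWS = {
    "composer-2": (74.2, 89.6),
    "composer-2.5": (79.0, 92.1),
    "composer-2.5-fast": (79.6, 92.7),
}


def lift(baseline: float, with_skill: float) -> float:
    """Points gained when the skill context is supplied."""
    return round(with_skill - baseline, 1)


lifts = {model: lift(*scores) for model, scores in ROWS.items()}

# Baseline improvement of each 2.5 variant over Composer 2, with no skill context.
old_baseline = ROWS["composer-2"][0]
baseline_gain = {
    model: round(scores[0] - old_baseline, 1)
    for model, scores in ROWS.items()
    if model != "composer-2"
}

print(lifts)
print(baseline_gain)
```

Reading the two dictionaries together shows the pattern described above: the 2.5 variants start from a higher baseline and therefore have less room for the skill context to add.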

The One Caveat

These scores are averaged across three judges (Sonnet, GPT-5.5, Opus-4-7). The raw Sonnet-only scores for Composer 2.5 were 94% and 92%, which looked even better. After applying stricter judges, the numbers settled at 92.1% and 92.7%. That is the correct comparison to make against the other models in this benchmark, which went through the same three-judge process. A single-judge Sonnet score would have overstated the gap.
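To make the effect of a judge panel concrete, here is a minimal sketch of averaging across judges. It assumes each judge scores every scenario on a 0-100 scale and that the panel score is the mean over judges and scenarios; the actual aggregation used in the benchmark is not described here. The judge names and every number in `example` are invented for illustration and are not benchmark results.

```python
from statistics import mean


def panel_score(scores_by_judge: dict[str, list[float]]) -> float:
    """Average each scenario across judges, then average across scenarios."""
    per_scenario = [mean(column) for column in zip(*scores_by_judge.values())]
    return mean(per_scenario)


# Hypothetical scores for 5 scenarios; NOT data from the benchmark.
example = {
    "lenient_judge": [100, 90, 95, 100, 90],
    "strict_judge_1": [90, 80, 90, 95, 85],
    "strict_judge_2": [95, 85, 85, 90, 80],
}

single_judge = mean(example["lenient_judge"])
panel = panel_score(example)

print(f"single lenient judge: {single_judge}, three-judge panel: {panel}")
```

With these made-up inputs the lenient judge alone reports a higher number than the panel, which is the kind of inflation a single-judge score can introduce.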

Simon Maple

Simon Maple is Tessl’s Founding Developer Advocate, a Java Champion, and former DevRel leader at Snyk, ZeroTurnaround, and IBM.
