21 Apr 2026 · 15 minute read

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from running 880 evals (including Opus 4.7)

Claude Opus 4.7 shipped last week, and the first question any engineering team asks is how it compares to its peers.
It is the strongest frontier coding model we tested on the baseline leaderboard, and it will be the easy default a lot of teams reach for.
But in 2026, the model you reach for could matter less than the skill you load with it.
That is what 880 evals across eight models (Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku 4.5, three OpenAI variants, and Cursor's Composer-2) tell us. The leaderboard shuffles once agent skills get loaded. The cost math flips, and the weakest model with a skill can outperform the strongest model without one.
Let’s take a step back. It’s now 2026, and agent skills are spreading like wildfire… (even our favourite movies are catching up to them).
Every major agent ecosystem now has some version of them.
So the question worth asking, whether you are a dev, a platform engineer, or an engineering leader, is which skills actually earn their context weight, and which ones just add cost.
At Tessl, we believe context (particularly agent skills) and the broader concept of a context development lifecycle are where this space is heading (see also: Why the best AI coding teams will win on context). The results below add to a growing body of signals pointing to a shift that is already underway.
Top-line results
| Model | Without skill | With skill | Lift | $/run (with skill) | Avg time (with skill) |
|---|---|---|---|---|---|
| claude-opus-4-7 | 80.5% | 94.5% | +14.0 | $1.00 | 158.9s |
| claude-opus-4-6 | 77.1% | 93.8% | +16.7 | $0.53 | 126.6s |
| claude-sonnet-4-6 | 75.6% | 93.3% | +17.7 | $0.31 | 125.1s |
| claude-haiku-4-5 | 61.2% | 84.3% | +23.1 | $0.12 | 77.8s |
| gpt-5.4 | 75.9% | 92.7% | +16.8 | N/A* | 135.4s |
| gpt-5.3-codex | 75.8% | 91.9% | +16.1 | N/A* | 87.9s |
| gpt-5-codex | 73.8% | 85.1% | +11.3 | N/A* | 136.2s |
| cursor-composer-2 | 73.6% | 90.5% | +16.9 | N/A* | 152.0s |
We evaluated 11 skills and aggregated "with vs without" skill performance. For each skill we generated up to five realistic tasks from its content, each paired with its own evaluation criteria. We then solved each task with an agent under two conditions: with access to the skill and without it (full setup explained here). *Codex and Cursor per-run costs aren't reported by the eval platform; source those directly from OpenAI and Cursor pricing for now.
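The with/without loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual harness: `run_agent` and `score` stand in for the real agent runner and the per-task evaluation criteria.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Task:
    prompt: str      # realistic task generated from the skill's content
    criteria: str    # evaluation criteria paired with the task

def evaluate_skill(model, skill, tasks, run_agent, score):
    """Average score per condition: with the skill loaded vs without."""
    results = {"with_skill": [], "without_skill": []}
    for task in tasks:
        for condition, context in (("with_skill", [skill]), ("without_skill", [])):
            output = run_agent(model, task.prompt, context)
            results[condition].append(score(output, task.criteria))
    return {condition: mean(scores) for condition, scores in results.items()}
```

The lift numbers in the table are simply `with_skill - without_skill`, averaged over tasks and skills.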
Seven things the numbers told us
1. Every single configuration got positive lift
Eight models, 11 skills, 5 scenarios each. 88 configurations. Every single one posted a positive average lift with a skill loaded.
- Smallest: gpt-5-codex at +11.3 points.
- Largest: Haiku 4.5 at +23.1.
- Most configurations landed somewhere around +16 points.
This is not a story about Opus 4.7 winning and another model losing. Cursor's Composer-2 lifted from 73.6% to 90.5%, a +16.9 bump that puts it mid-pack alongside Sonnet and gpt-5.4. Across the Codex family alone, lifts ranged from +11.3 (gpt-5-codex) to +16.8 (gpt-5.4), so not every variant within a vendor benefits equally, but all of them benefited. Skills lifted scores across vendors, across tiers, across model generations. That is about as clean a signal as benchmark data gets.
2. Skills helped weaker models the most
Haiku 4.5 went from 61.2% to 84.3% with a skill loaded. That is a 23.1-point lift, the biggest gain of any configuration we tested. Opus 4.7 gained 14 percentage points. Sonnet gained 17.7.
The pattern holds across every model family in the set. If you are reaching for a smaller, cheaper model to control cost, skills could be where your accuracy is going to come from, not the next tier up.
3. A cheap model with a skill beats an expensive one without
Haiku 4.5 with a skill, at 84.3%, outperformed every single baseline configuration we tested, including Opus 4.7 at 80.5%. Meanwhile up at the frontier, Opus 4.7 (94.5%), Opus 4.6 (93.8%) and Sonnet 4.6 (93.3%) all landed within 1.2 points of each other with skills loaded. Without skills, that spread was closer to 5 points. Skills appear to compress the accuracy gap between model tiers. This echoes the result of deeper research we recently released: small models with context become powerful.
4. The biggest gains came from context no model was trained on
The single largest skill lift in the benchmark: snipgrapher at +36 points (51.9% → 88.0%). snipgrapher is a custom tool with opaque APIs that never made it into any public training corpus. Second place: node-best-practices at +29.2 points.
The skills that pulled the most weight were the ones encoding knowledge the model had no way to pick up from pretraining: private APIs, internal conventions, uncommon domains. Wrappers over material a frontier model already knows rarely justified their token cost. For anyone publishing skills, that looks like the bet worth making.
5. Loading a skill is a real context budget decision
We've seen that loading a skill can mean as much as a 3x cost increase for +2pp of performance. Here is what "add a skill" costs Opus 4.7 at the frontier:
- Input tokens: 557K → 1,016K per run. An 82% increase.
- Cost per run: $0.61 → $1.00. Two-thirds more per invocation.
- Turns taken: 17.5 → 24.4. A 40% jump. (Opus 4.6 went from 13.7 to 22, a 63% increase.)
Skills at this capability level look less like brief hints and more like importing a big dependency: they can buy you capability, and they can cost you size, speed, and complexity. If you are orchestrating agents at scale, plan the context weight as deliberately as you plan the skill.
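As a quick sanity check, the percentage deltas follow directly from the per-run figures quoted above (the post rounds the turns jump to 40%):

```python
# "Add a skill" deltas for Opus 4.7, per-run figures from the post.
baseline   = {"input_tokens": 557_000,   "cost_usd": 0.61, "turns": 17.5}
with_skill = {"input_tokens": 1_016_000, "cost_usd": 1.00, "turns": 24.4}

for metric in baseline:
    pct = (with_skill[metric] - baseline[metric]) / baseline[metric] * 100
    print(f"{metric}: +{pct:.0f}%")
# prints: input_tokens: +82%, cost_usd: +64%, turns: +39%
```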
One architectural outlier worth flagging. Cursor's Composer-2 achieved a 56.3% cache hit rate on these runs, versus 91-97% for every other model in the benchmark. That means Composer-2 is paying full prompt price on roughly half of its invocations, yet still produced competitive scores (90.5% with a skill). If you are running Cursor as your IDE default, the same skill-first strategy should still apply, but the economics of the context window appear fundamentally different from Claude and Codex.
6. Sonnet-plus-skill might be the Opus-replacer hiding in the numbers
Sonnet 4.6 with a skill: 93.3%.
Opus 4.7 with a skill: 94.5%.
A 1.2-point gap. At a third of the per-run cost ($0.31 vs $1.00) and around 34 seconds faster on average.
For teams already running Opus on every workload and wondering whether it is earning its keep, that gap looks slim on anything that is not the hardest 5% of tasks. On almost every scenario we tested, Sonnet with a skill produced an output a senior developer would be hard pressed to separate from Opus with a skill. And you get the change back on every invocation.
If your top constraint is latency rather than cost, gpt-5.3-codex with a skill is the other sweet spot worth stress-testing. It landed at 91.9% accuracy in 87.9 seconds, nearly half the run time of Opus-with-skill for a 2.6-point accuracy gap. For latency-sensitive agentic pipelines, that combination is arguably the speed champion of the benchmark.
7. Haiku-plus-skill is the most underrated production config we tested
Haiku 4.5 with a skill: 84.3% at $0.12 per run. Average run time: 77 seconds. Roughly half the latency of Opus-with-skill, and 12% of the cost.
Adding the skill to Haiku barely moved the cost needle either. The run went from $0.104 to $0.119, a 1.5-cent marginal increase. Compare that to Opus, where the same skill switch added 39 cents per run. The lift on Haiku is enormous. The cost of getting it is effectively free at scale.
For throughput-heavy workloads such as batch jobs, eval loops, retries, or anything running at volume, that looks like the ROI champion of this benchmark.
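One way to quantify "ROI champion" is lift points per dollar of per-run cost. The snippet below does that ranking with the with-skill numbers from the top-line table (the metric itself is our framing, not an industry standard):

```python
# Lift points per dollar of per-run cost, from the top-line table.
configs = {
    "claude-opus-4-7":   {"lift": 14.0, "cost_per_run": 1.00},
    "claude-sonnet-4-6": {"lift": 17.7, "cost_per_run": 0.31},
    "claude-haiku-4-5":  {"lift": 23.1, "cost_per_run": 0.12},
}

ranked = sorted(configs.items(),
                key=lambda kv: kv[1]["lift"] / kv[1]["cost_per_run"],
                reverse=True)
for name, c in ranked:
    print(f"{name}: {c['lift'] / c['cost_per_run']:.1f} lift points per $")
# claude-haiku-4-5 tops the ranking at 192.5 points per dollar,
# vs 57.1 for Sonnet and 14.0 for Opus.
```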
Disclaimer: We are not saying every team should default to Haiku. We are saying the question worth asking before reaching for the most expensive tier is a simple one: would 84% with a skill be good enough for this workload?
What this means for you
If you are a dev
The skills that pulled the most weight were the ones encoding context the model was never trained on. If you are building or choosing skills for your own workflow, the ones that will move your accuracy most are tied to your specific stack: your internal APIs, your company's style guide, the framework nobody outside your repo has ever seen. Thin wrappers over library docs the model already knows are rarely going to earn their token cost.
There is also a practical implication for day-to-day work. Your wallet may already know this: you do not need Opus for every task. For routine work such as code review, commit message generation, or refactor suggestions, Haiku 4.5 with a well-built skill is fast enough and accurate enough, and the round trip is roughly half the time.
If you are a platform engineer or DX lead
You are the one rolling agentic tooling out across developers at scale. The cost math changes when you multiply by team size.
Take 100 devs running their agent 20 times a day:
- Opus 4.7 with a skill at $1.00 per run: around $60,000 a month.
- Sonnet 4.6 with a skill at $0.31 per run: around $18,600 a month.
- Haiku 4.5 with a skill at $0.12 per run: around $7,200 a month.
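The arithmetic behind those monthly figures is simple to reproduce. A minimal sketch, assuming 30 days a month (`monthly_cost` is a hypothetical helper, not Tessl tooling):

```python
# Back-of-envelope monthly bill for the 100-dev, 20-runs-a-day scenario.
def monthly_cost(devs: int, runs_per_dev_per_day: int,
                 cost_per_run: float, days: int = 30) -> float:
    return devs * runs_per_dev_per_day * days * cost_per_run

for model, cost in [("Opus 4.7", 1.00), ("Sonnet 4.6", 0.31), ("Haiku 4.5", 0.12)]:
    print(f"{model}: ${monthly_cost(100, 20, cost):,.0f}/month")
# Opus 4.7: $60,000/month · Sonnet 4.6: $18,600/month · Haiku 4.5: $7,200/month
```

Swap in your own headcount and run rate; the per-run prices are the only benchmark-specific inputs.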
An 82% increase in input tokens when you switch skills on is not an edge case at scale, it is the main cost driver. Governance matters, context budgets matter, and the skills you bless need to earn their weight, not just their accuracy.
That is exactly the problem the Tessl registry aims to solve. Every skill in the Registry ships with eval scores, security scores, and impact metrics so you can see which skills actually earn their weight before you ship them to your org. Run evals against your own workloads to quantify the productivity you are losing on generic outputs versus what you can win back with the right skills in context. That is the kind of governance layer a platform team could take into a budget conversation.
If you are a VP of engineering
You may now have defensible data to make a tier-down case where it makes sense. Sonnet-with-skill delivers output within 1.2 points of Opus-with-skill at a third of the cost. For most workloads that are not the hardest 5% of tasks, that gap will not show up in the output quality your team is shipping.
Also worth knowing if you are picking a default for your org: skills lifted every configuration we tested, across Claude, Codex and Cursor. Your agent choice does not have to be locked to a single vendor to benefit from a skill-first strategy. That is useful leverage in procurement conversations and in any "should we standardise on X?" discussion.
If you want to run this decision with numbers for your own org, head over to your terminal, spin up an agent, and ask it to run evaluations with Tessl for your skill across different models. That could turn a procurement conversation into a data conversation.
Closing thoughts for AI enablement leads (even if your job title doesn't say so yet!)
This is the role that looks most directly at numbers like these. It doesn't always come with a standard title. Right now, the responsibility is sitting inside platform teams, developer experience functions, senior devs who have taken on the hat, and VPs of engineering wearing it as a second role.
What the role is responsible for: making sure hundreds of devs in an org have agentic tooling that is reliable, affordable, and performant. Which model a team defaults to, which skills are blessed, how much context a workload is allowed to pull, which workloads run where. These decisions land on whoever has that scope.
A few things to pay attention to in this data if that is you:
- The Sonnet-with-skill vs Opus-with-skill comparison could be a procurement conversation. At a 3x cost difference for effectively equivalent output on most tasks, this is the kind of number that should be going into your infra budget chats.
- The 82% token increase when you switch a skill on is the argument for context governance. Your skills need to be evaluated on what they lift, not just on whether they are available.
- Haiku with a skill is the config worth testing for internal, high-frequency workloads: running evals on your own skills, generating routine summaries, drafting internal docs. The output doesn't have to be Opus-grade. It has to be good enough, often enough, at a price your org can afford across hundreds of developers.
We believe the AI enablement lead will become a titled role inside engineering orgs over the next twelve to eighteen months, the same way DevOps lead and developer experience lead emerged before it. If you are that person in your org today, the above table is for you.
Opus 4.7 is a solid upgrade, but if you only take one thing from this piece: in 2026, picking the skill might matter more than picking the model.
Spin up your agent and request to leverage Tessl scenario evals for your skills, or speak to sales about Tessl for enterprise.