21 Apr 2026 · 15 minute read

Anthropic, OpenAI, or Cursor model for your agent skills? 7 learnings from running 880 evals (including Opus 4.7)

Claude Opus 4.7 shipped last week, and the first question any engineering team asks is how it compares to its peers.
It is the strongest frontier coding model we tested on the baseline leaderboard, and it will be the easy default a lot of teams reach for.
But in 2026, the model you reach for could matter less than the skill you load with it.
That is what 880 evals across eight models (Opus 4.7, Opus 4.6, Sonnet 4.6, Haiku 4.5, three OpenAI variants, and Cursor's Composer-2) tell us. The leaderboard shuffles once agent skills get loaded. The cost math flips, and the weakest model with a skill can outperform the strongest model without one.
Let’s take a step back. It’s now 2026, and agent skills are spreading like wildfire… (even our favourite movies are catching up to them).
Every major agent ecosystem now has some version of them.
So the question worth asking, whether you are a dev, a platform engineer, or an engineering leader, is which skills actually earn their context weight, and which ones just add cost.
At Tessl, we believe context (particularly agent skills) and the broader concept of a context development lifecycle are where this space is heading (see also: Why the best AI coding teams will win on context). The results below add to a growing body of signals pointing to a shift that is already underway.
Top-line results
| Model | Without skill | With skill | Lift | $/run (with skill) | Avg time (with skill) |
|---|---|---|---|---|---|
| claude-opus-4-7 | 80.5% | 94.5% | +14.0 | $1.00 | 158.9s |
| claude-opus-4-6 | 77.1% | 93.8% | +16.7 | $0.53 | 126.6s |
| claude-sonnet-4-6 | 75.6% | 93.3% | +17.7 | $0.31 | 125.1s |
| claude-haiku-4-5 | 61.2% | 84.3% | +23.1 | $0.12 | 77.8s |
| gpt-5.4 | 75.9% | 92.7% | +16.8 | N/A* | 135.4s |
| gpt-5.3-codex | 75.8% | 91.9% | +16.1 | N/A* | 87.9s |
| gpt-5-codex | 73.8% | 85.1% | +11.3 | N/A* | 136.2s |
| cursor-composer-2 | 73.6% | 90.5% | +16.9 | N/A* | 152.0s |
We evaluated 11 skills and aggregated "with vs without" skill performance. For each skill we generated up to five realistic tasks from its content, each paired with its own evaluation criteria. We then solved each task with an agent under two conditions: with access to the skill and without it (full setup explained here). *Codex and Cursor per-run costs aren't reported by the eval platform; source those directly from OpenAI and Cursor pricing for now.
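The with/without loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual harness: `run_agent` and `score` stand in for the real agent runner and the per-task evaluation criteria.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Task:
    prompt: str      # realistic task generated from the skill's content
    criteria: str    # evaluation criteria paired with the task

def evaluate_skill(model, skill, tasks, run_agent, score):
    """Average score per condition: with the skill loaded vs without."""
    results = {"with_skill": [], "without_skill": []}
    for task in tasks:
        for condition, context in (("with_skill", [skill]), ("without_skill", [])):
            output = run_agent(model, task.prompt, context)
            results[condition].append(score(output, task.criteria))
    return {condition: mean(scores) for condition, scores in results.items()}
```

The lift numbers in the table are simply `with_skill - without_skill`, averaged over tasks and skills.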
Seven things the numbers told us
1. Every single configuration got positive lift
Eight models, 11 skills, 5 scenarios each. 88 configurations. Every single one posted a positive average lift with a skill loaded.
- Smallest: gpt-5-codex at +11.3 points.
- Largest: Haiku 4.5 at +23.1.
- Most configurations landed somewhere around +16 points.
This is not a story about Opus 4.7 winning and another model losing. Cursor's Composer-2 lifted from 73.6% to 90.5%, a +16.9 bump that puts it mid-pack alongside Sonnet and gpt-5.4. Across the Codex family alone, lifts ranged from +11.3 (gpt-5-codex) to +16.8 (gpt-5.4), so not every variant within a vendor benefits equally, but all of them benefited. Skills lifted scores across vendors, across tiers, across model generations. That is about as clean a signal as benchmark data gets.
2. Skills helped weaker models the most
Haiku 4.5 went from 61.2% to 84.3% with a skill loaded. That is a 23.1-point lift, the biggest gain of any configuration we tested. Opus 4.7 gained 14 percentage points. Sonnet gained 17.7.
The pattern holds across every model family in the set. If you are reaching for a smaller, cheaper model to control cost, skills could be where your accuracy is going to come from, not the next tier up.
3. A cheap model with a skill beats an expensive one without
Haiku 4.5 with a skill, at 84.3%, outperformed every single baseline configuration we tested, including Opus 4.7 at 80.5%. Meanwhile up at the frontier, Opus 4.7 (94.5%), Opus 4.6 (93.8%) and Sonnet 4.6 (93.3%) all landed within 1.2 points of each other with skills loaded. Without skills, that spread was closer to 5 points. Skills appear to compress the accuracy gap between model tiers. This echoes the result of deeper research we recently released: small models with context become powerful.
4. The biggest gains came from context no model was trained on
The single largest skill lift in the benchmark: snipgrapher at +36 points (51.9% → 88.0%). snipgrapher is a custom tool with opaque APIs that never made it into any public training corpus. Second place: node-best-practices at +29.2 points.
The skills that pulled the most weight were the ones encoding knowledge the model had no way to pick up from pretraining: private APIs, internal conventions, uncommon domains. Wrappers over material a frontier model already knows rarely justified their token cost. For anyone publishing skills, that looks like the bet worth making.
5. Loading a skill is a real context budget decision
We've seen that loading a skill can mean as much as a 3x cost increase for +2pp of performance. Here is what "add a skill" costs Opus 4.7 at the frontier:
- Input tokens: 557K → 1,016K per run. An 82% increase.
- Cost per run: $0.61 → $1.00. Two-thirds more per invocation.
- Turns taken: 17.5 → 24.4. A 40% jump. (Opus 4.6 went from 13.7 to 22, a 63% increase.)
Skills at this capability level look less like brief hints and more like importing a big dependency: they can buy you capability, and they can cost you size, speed, and complexity. If you are orchestrating agents at scale, plan the context weight as deliberately as you plan the skill.
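As a quick sanity check, the percentage deltas follow directly from the per-run figures quoted above (the post rounds the turns jump to 40%):

```python
# "Add a skill" deltas for Opus 4.7, per-run figures from the post.
baseline   = {"input_tokens": 557_000,   "cost_usd": 0.61, "turns": 17.5}
with_skill = {"input_tokens": 1_016_000, "cost_usd": 1.00, "turns": 24.4}

for metric in baseline:
    pct = (with_skill[metric] - baseline[metric]) / baseline[metric] * 100
    print(f"{metric}: +{pct:.0f}%")
# prints: input_tokens: +82%, cost_usd: +64%, turns: +39%
```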
One architectural outlier worth flagging. Cursor's Composer-2 achieved a 56.3% cache hit rate on these runs, versus 91-97% for every other model in the benchmark. That means Composer-2 is paying full prompt price on roughly half of its invocations, yet still produced competitive scores (90.5% with a skill). If you are running Cursor as your IDE default, the same skill-first strategy should still apply, but the economics of the context window appear fundamentally different from Claude and Codex.
6. Sonnet-plus-skill might be the Opus-replacer hiding in the numbers
Sonnet 4.6 with a skill: 93.3%.
Opus 4.7 with a skill: 94.5%.
A 1.2-point gap. At a third of the per-run cost ($0.31 vs $1.00) and around 34 seconds faster on average.
For teams already running Opus on every workload and wondering whether it is earning its keep, that gap looks slim on anything that is not the hardest 5% of tasks. On almost every scenario we tested, Sonnet with a skill produced an output a senior developer would be hard pressed to separate from Opus with a skill. And you get the change back on every invocation.
If your top constraint is latency rather than cost, gpt-5.3-codex with a skill is the other sweet spot worth stress-testing. It landed at 91.9% accuracy in 87.9 seconds, nearly half the run time of Opus-with-skill for a 2.6-point accuracy gap. For latency-sensitive agentic pipelines, that combination is arguably the speed champion of the benchmark.
7. Haiku-plus-skill is the most underrated production config we tested
Haiku 4.5 with a skill: 84.3% at $0.12 per run. Average run time: 77 seconds. Roughly half the latency of Opus-with-skill, and 12% of the cost.
Adding the skill to Haiku barely moved the cost needle either. The run went from $0.104 to $0.119, a 1.5-cent marginal increase. Compare that to Opus, where the same skill switch added 39 cents per run. The lift on Haiku is enormous. The cost of getting it is effectively free at scale.
For throughput-heavy workloads such as batch jobs, eval loops, retries, or anything running at volume, that looks like the ROI champion of this benchmark.
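One way to quantify "ROI champion" is lift points per dollar of per-run cost. The snippet below does that ranking with the with-skill numbers from the top-line table (the metric itself is our framing, not an industry standard):

```python
# Lift points per dollar of per-run cost, from the top-line table.
configs = {
    "claude-opus-4-7":   {"lift": 14.0, "cost_per_run": 1.00},
    "claude-sonnet-4-6": {"lift": 17.7, "cost_per_run": 0.31},
    "claude-haiku-4-5":  {"lift": 23.1, "cost_per_run": 0.12},
}

ranked = sorted(configs.items(),
                key=lambda kv: kv[1]["lift"] / kv[1]["cost_per_run"],
                reverse=True)
for name, c in ranked:
    print(f"{name}: {c['lift'] / c['cost_per_run']:.1f} lift points per $")
# claude-haiku-4-5 tops the ranking at 192.5 points per dollar,
# vs 57.1 for Sonnet and 14.0 for Opus.
```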
Disclaimer: We are not saying every team should default to Haiku. We are saying the question worth asking before reaching for the most expensive tier is a simple one: would 84% with a skill be good enough for this workload?
What this means for you
If you are a dev
The skills that pulled the most weight were the ones encoding context the model was never trained on. If you are building or choosing skills for your own workflow, the ones that will move your accuracy most are tied to your specific stack: your internal APIs, your company's style guide, the framework nobody outside your repo has ever seen. Thin wrappers over library docs the model already knows are rarely going to earn their token cost.
There is also a practical implication for day-to-day work. Your wallet may already know this: you do not need Opus for every task. For routine work such as code review, commit message generation, or refactor suggestions, Haiku 4.5 with a well-built skill is fast enough and accurate enough, and the round trip is roughly half the time.
If you are a platform engineer or DX lead
You are the one rolling agentic tooling out across developers at scale. The cost math changes when you multiply by team size.
Take 100 devs running their agent 20 times a day:
- Opus 4.7 with a skill at $1.00 per run: around $60,000 a month.
- Sonnet 4.6 with a skill at $0.31 per run: around $18,600 a month.
- Haiku 4.5 with a skill at $0.12 per run: around $7,200 a month.
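The arithmetic behind those monthly figures is simple to reproduce. A minimal sketch, assuming 30 days a month (`monthly_cost` is a hypothetical helper, not Tessl tooling):

```python
# Back-of-envelope monthly bill for the 100-dev, 20-runs-a-day scenario.
def monthly_cost(devs: int, runs_per_dev_per_day: int,
                 cost_per_run: float, days: int = 30) -> float:
    return devs * runs_per_dev_per_day * days * cost_per_run

for model, cost in [("Opus 4.7", 1.00), ("Sonnet 4.6", 0.31), ("Haiku 4.5", 0.12)]:
    print(f"{model}: ${monthly_cost(100, 20, cost):,.0f}/month")
# Opus 4.7: $60,000/month · Sonnet 4.6: $18,600/month · Haiku 4.5: $7,200/month
```

Swap in your own headcount and run rate; the per-run prices are the only benchmark-specific inputs.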
An 82% increase in input tokens when you switch skills on is not an edge case at scale, it is the main cost driver. Governance matters, context budgets matter, and the skills you bless need to earn their weight, not just their accuracy.
That is exactly the problem the Tessl registry aims to solve. Every skill in the Registry ships with eval scores, security scores, and impact metrics so you can see which skills actually earn their weight before you ship them to your org. Run evals against your own workloads to quantify the productivity you are losing on generic outputs versus what you can win back with the right skills in context. That is the kind of governance layer a platform team could take into a budget conversation.
If you are a VP of engineering
You may now have defensible data to make a tier-down case where it makes sense. Sonnet-with-skill delivers output within 1.2 points of Opus-with-skill at a third of the cost. For most workloads that are not the hardest 5% of tasks, that gap will not show up in the output quality your team is shipping.
Also worth knowing if you are picking a default for your org: skills lifted every configuration we tested, across Claude, Codex and Cursor. Your agent choice does not have to be locked to a single vendor to benefit from a skill-first strategy. That is useful leverage in procurement conversations and in any "should we standardise on X?" discussion.
If you want to run this decision with numbers for your own org, head over to your terminal, spin up an agent, and ask it to run evaluations with Tessl for your skill across different models. That could turn a procurement conversation into a data conversation.
Closing thoughts for AI enablement leads (even if your job title doesn't say so yet!)
This is the role that looks most directly at numbers like these. It doesn't always come with a standard title. Right now, the responsibility is sitting inside platform teams, developer experience functions, senior devs who have taken on the hat, and VPs of engineering wearing it as a second role.
What the role is responsible for: making sure hundreds of devs in an org have agentic tooling that is reliable, affordable, and performant. Which model a team defaults to, which skills are blessed, how much context a workload is allowed to pull, which workloads run where. These decisions land on whoever has that scope.
A few things to pay attention to in this data if that is you:
- The Sonnet-with-skill vs Opus-with-skill comparison could be a procurement conversation. At a 3x cost difference for effectively equivalent output on most tasks, this is the kind of number that should be going into your infra budget chats.
- The 82% token increase when you switch a skill on is the argument for context governance. Your skills need to be evaluated on what they lift, not just on whether they are available.
- Haiku with a skill is the config worth testing for internal, high-frequency workloads: running evals on your own skills, generating routine summaries, drafting internal docs. The output doesn't have to be Opus-grade. It has to be good enough, often enough, at a price your org can afford across hundreds of developers.
We believe the AI enablement lead will become a titled role inside engineering orgs over the next twelve to eighteen months, the same way DevOps lead and developer experience lead emerged before it. If you are that person in your org today, the above table is for you.
Opus 4.7 is a solid upgrade, but if you only take one thing from this piece: in 2026, picking the skill might matter more than picking the model.
Spin up your agent and request to leverage Tessl scenario evals for your skills, or speak to sales about Tessl for enterprise.