AI Coding Agent Accuracy: Opus 4.7 vs 4.8
Discover the efficiency gains of AI coding agents Opus 4.7 vs 4.8. Learn how 4.8 improves cost and performance in real tasks. Upgrade smartly!

Rob Willoughby

You are deciding whether to roll your default agent model from Opus 4.7 to 4.8. The release notes promise improvements. The leaderboard shows a fraction of a point. So you shrug, schedule the upgrade for some quiet Friday, and move on.
That shrug is the mistake. We ran both versions through the same skills evaluation, 1,000 scenarios solved twice each, and on the headline metric they finished in a dead heat. But underneath that tie, 4.8 reached the same answers in about four fewer turns and for roughly 5% less money. The version that looks like a non-event on the scoreboard is a real efficiency upgrade in the only place that bills you: the agent loop.
AI agent evaluation is the practice of measuring an agent's behavior on real tasks, not just its final accuracy. It tracks cost, turns, and reliability across paired runs. It matters because two models can post the same score while differing sharply in how much work they spend to get there.
Two versions, one eval harness
Both models ran the identical setup we use across this benchmark series. Every scenario is solved twice, once with no help and once with the relevant skill installed, so we can isolate what the skill contributes from what the base model already knows. We score three things: instruction following (did it do what was asked, the way it was asked), task completion (did it reach the goal), and an overall blend weighted toward instruction following. We also flag integrity issues, like an agent peeking at the grading rubric instead of solving the task.
Two facts make this a clean comparison. First, we restrict every delta below to the 824 scenarios that both versions ran, so 4.7 and 4.8 are graded on identical work. Second, we priced both versions at the same per-token rate. Every cost difference you see is behavior, not a pricing change. We verified that rate by back-solving it from the logged spend across thousands of calls, and it reconstructs to zero residual.
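Restricting deltas to shared scenarios can be sketched in a few lines. This is a minimal illustration, not the harness itself: the function name `paired_delta`, the dictionary layout, and the scores are all hypothetical.

```python
from statistics import mean


def paired_delta(scores_a: dict[str, float], scores_b: dict[str, float]) -> tuple[int, float]:
    """Return (shared scenario count, mean of b - a) over scenarios both models ran."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    if not shared:
        raise ValueError("no shared scenarios to compare")
    return len(shared), mean(scores_b[s] - scores_a[s] for s in shared)


# Hypothetical scores keyed by scenario id; these are not benchmark data.
opus_47 = {"s1": 90.0, "s2": 80.0, "s3": 70.0}
opus_48 = {"s2": 84.0, "s3": 68.0, "s4": 95.0}

# Only s2 and s3 were run by both versions, so only they contribute.
count, delta = paired_delta(opus_47, opus_48)
print(count, delta)
```

Scenarios that only one version ran (`s1` and `s4` here) are dropped before any averaging, which is what keeps the two versions graded on identical work.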
Opus 4.7 is the incumbent. In our runs it is a strong agent that leans heavily on skills to reach its ceiling, and it explores a lot of paths to get there.
Opus 4.8 is the point release. It posts the same ceiling with a skill installed, but it starts from a higher floor without one, and it gets to the answer with noticeably less wandering.
Where AI coding agent accuracy stops being the story
Here is the head-to-head on the shared scenario set, all with the relevant skill installed unless noted.
| Dimension (with skill, 824 shared scenarios) | Opus 4.7 | Opus 4.8 |
|---|---|---|
| Overall score | 91.9 | 92.1 |
| Baseline score, no skill | 71.4 | 74.1 |
| Task completion | 97.1 | 97.4 |
| Instruction following | 88.1 | 88.1 |
| Turns per task | 19.2 | 15.0 |
| Output tokens per task | 7,820 | 9,763 |
| Cost per task, equal pricing | baseline | about 5% lower |
| Integrity flags raised | 10.5% | 7.8% |
The overall accuracy gap is 0.2 points. If you stopped reading the row labeled "overall score," you would conclude nothing changed. Three other rows say otherwise.
The first is the baseline. Without any skill, 4.8 scores 74.1 against 4.7's 71.4, a 2.7 point gain, and its no-skill instruction following climbed from the high 50s into the low 60s. The ceiling is shared because the skill pulls both versions up to roughly the same place. The floor is where 4.8 actually improved. That has a practical consequence: 4.8 depends on the skill slightly less to do good work.
The second is turns. 4.8 finishes the average task in 15.0 turns versus 19.2 for 4.7, a 21% reduction. In an agent loop, a turn is a full round trip of context, reasoning, and tool use. Cutting four turns off the average task cuts latency, cuts the surface area for an agent to talk itself into a wrong path, and, as we will see, cuts cost.
The third is integrity. The eval flags runs where the agent took a shortcut, like reading the grading rubric or reaching outside its workspace. Those flags dropped from 10.5% of runs to 7.8%. 4.8 is modestly more honest about how it reaches an answer, not just whether it reaches one.
Reading the cost: turns, not tokens
Look again at two rows that seem to contradict each other. 4.8 produces more output per task, 9,763 tokens against 7,820, yet it costs about 5% less. More words, smaller bill.
The resolution is that output volume is not what dominates agentic cost. The dominant term is the context replayed on every turn. Each turn re-sends the accumulated conversation and tool results, and in long agent runs that cached input swamps the fresh output the model writes. Fewer turns means fewer replays. So 4.8 can be more verbose inside each turn and still come out ahead, because it takes four fewer turns to converge.
This is the lever no model card prints. Per-token rate sets the price of a unit of work. Turn count sets how many units the model decides to spend. A point release that holds accuracy flat while spending 21% fewer turns is doing real work on the second term, and the second term is the one that scales with your usage.
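A toy cost model makes the turns-versus-tokens argument concrete. It is a sketch under stated assumptions, not our billing logic: the prices, the starting context size, and the tool output per turn are hypothetical, and turn counts are rounded to whole numbers. Only the per-task output token totals come from the table above.

```python
def run_cost(turns: int, output_tokens_total: float, tool_tokens_per_turn: float,
             base_context: float, cached_input_price: float, output_price: float) -> float:
    """Cost of one run where every turn replays the accumulated context."""
    out_per_turn = output_tokens_total / turns
    context = base_context
    cost = 0.0
    for _ in range(turns):
        cost += context * cached_input_price    # replay of everything so far
        cost += out_per_turn * output_price     # fresh output written this turn
        context += out_per_turn + tool_tokens_per_turn  # context grows each turn
    return cost


# Hypothetical rates (per token) and context sizes, chosen only for illustration.
CACHED_INPUT_PRICE = 0.5e-6
OUTPUT_PRICE = 25e-6
BASE_CONTEXT = 10_000
TOOL_TOKENS_PER_TURN = 2_000

cost_47 = run_cost(19, 7_820, TOOL_TOKENS_PER_TURN, BASE_CONTEXT,
                   CACHED_INPUT_PRICE, OUTPUT_PRICE)
cost_48 = run_cost(15, 9_763, TOOL_TOKENS_PER_TURN, BASE_CONTEXT,
                   CACHED_INPUT_PRICE, OUTPUT_PRICE)
print(f"4.7: {cost_47:.4f}  4.8: {cost_48:.4f}")
```

Under these assumed inputs the run with more output but fewer turns comes out cheaper, because the replayed context grows with every turn. The exact gap depends entirely on the assumed prices and context sizes, so it should not be read as a reproduction of the 5% figure.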
The same dynamic shows up in how each version absorbs a skill. Adding the relevant skill is not free. It pulls in instructions and reference material that the agent has to process. The question is how efficiently the model turns that overhead into a result.
| Effect of installing the skill | Opus 4.7 | Opus 4.8 |
|---|---|---|
| Overall score gain | +20.4 | +17.6 |
| Cost increase | +38% | +17% |
| Turn increase | +41% | +19% |
On 4.7, switching on a skill added 41% more turns to cash in a 20 point accuracy gain. On 4.8, the same class of skill buys a slightly smaller gain, 17.6 points, for roughly half the turn and cost overhead. 4.8 treats a skill more like a shortcut and less like an invitation to explore. If you run agent skills at scale, that halved skill tax compounds across every task you ship.
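One way to put a number on the skill tax is to divide the score gain by the cost overhead. The ratio below is an illustrative metric of our choosing, not something the harness reports. The inputs are the figures from the table above.

```python
# Figures copied from the "Effect of installing the skill" table.
skill_effect = {
    "Opus 4.7": {"score_gain": 20.4, "cost_increase_pct": 38.0},
    "Opus 4.8": {"score_gain": 17.6, "cost_increase_pct": 17.0},
}


def gain_per_cost_point(effect: dict[str, float]) -> float:
    """Score points gained per percentage point of added cost (illustrative ratio)."""
    return effect["score_gain"] / effect["cost_increase_pct"]


efficiency = {model: gain_per_cost_point(e) for model, e in skill_effect.items()}
for model, value in efficiency.items():
    print(f"{model}: {value:.2f} points per cost point")
```

On this ratio, 4.8 gets roughly twice as much score per unit of added cost, which is the "halved skill tax" expressed as a single number.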
The one place 4.8 regressed
A fair comparison reports where the new version is worse, not just where it wins. Per scenario, the record is close to a wash: 4.8 scored higher on 23% of shared tasks, tied on 61%, and scored lower on 17%, using a two point threshold. The interesting part is that the losses are not random noise. They cluster.
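The win, tie, and loss tally can be sketched as a simple threshold on the per-scenario delta. This is a minimal sketch: the deltas are hypothetical, and treating a delta of exactly two points as a win or loss rather than a tie is an assumption.

```python
from collections import Counter


def classify(delta: float, threshold: float = 2.0) -> str:
    """Label a per-scenario score delta (4.8 minus 4.7) as win, tie, or loss."""
    if delta >= threshold:
        return "win"
    if delta <= -threshold:
        return "loss"
    return "tie"


# Hypothetical per-scenario deltas; these are not benchmark data.
deltas = [5.0, 0.5, -3.0, 1.9, -2.0, 0.0]

counts = Counter(classify(d) for d in deltas)
shares = {label: counts[label] / len(deltas) for label in ("win", "tie", "loss")}
print(dict(counts), shares)
```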
4.8 regressed on web research and scraping skill families. Firecrawl tasks dropped 3.3 points on average across 72 scenarios. LangChain dropped 2.9 points across 48. Smaller families like Tavily and Apify fell further, 10.4 and 7.6 points, though on fewer tasks. Meanwhile 4.8 improved on infrastructure, auth, and code tooling: Cloudflare gained 4.5 points across 38 scenarios, Auth0 gained 4.3 across 18, and Mastra gained 10.1 across 10.
The aggregate hid this completely, because the gains and losses nearly cancel. Only a per domain breakdown surfaces it. That is the whole argument for paired skill evals over a single leaderboard number: the headline can be a tie while two coherent shifts run in opposite directions underneath it.
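The cancellation is easy to check on the five families quoted above with scenario counts. This sketch pools only those families, 186 of the 824 shared scenarios. Tavily and Apify are left out because their scenario counts are not given, so the pooled figure illustrates the effect and is not the benchmark-wide delta.

```python
# (mean score delta for 4.8 vs 4.7, scenario count), as quoted in the text.
families = {
    "Firecrawl": (-3.3, 72),
    "LangChain": (-2.9, 48),
    "Cloudflare": (4.5, 38),
    "Auth0": (4.3, 18),
    "Mastra": (10.1, 10),
}

total_scenarios = sum(n for _, n in families.values())
# Scenario-weighted mean delta across the listed families.
pooled_delta = sum(delta * n for delta, n in families.values()) / total_scenarios

for name, (delta, n) in sorted(families.items(), key=lambda kv: kv[1][0]):
    print(f"{name:<11} {delta:+5.1f} over {n} scenarios")
print(f"pooled      {pooled_delta:+5.2f} over {total_scenarios} scenarios")
```

The pooled delta lands within a fraction of a point of zero even though individual families moved by three to ten points in either direction.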
When to roll forward to 4.8
The data supports a clear recommendation rather than a hedge.
Roll forward to 4.8 if your agents run long, multi turn tasks where turn count, latency, and cost matter, which is most production agent work. You get the same accuracy ceiling, a higher floor before skills, a 21% turn reduction, a cheaper skill tax, and fewer integrity flags. If your workloads lean on infrastructure, auth, or general code tooling, 4.8 is flat to clearly better.
Test before you roll forward if your agents live in the scrape, crawl, and summarize world. The web research regression is small in absolute terms but consistent across the families we measured, and it is exactly the kind of domain specific shift that an aggregate score will not warn you about. Run your own A/B on your top scraping workflows first.
For everyone else, the deciding question is not "did accuracy go up." Accuracy was a tie. The question is whether you want the same answers for fewer turns and less money, and there the answer is unambiguous.
The takeaway: measure behavior, not the changelog
A skeptic has two reasonable objections. The first: a flat score is just no improvement, so why care? Because in agentic systems the score is the least sensitive metric you have. Two models can tie on accuracy while one takes 21% fewer turns, and about 5% less money, to get there. The tie is real. It is also not the whole measurement.
The second: these are harness numbers, not your invoice. True. But because we priced both versions identically, the relative differences in turns and tokens are properties of the model and hold at any pricing. The relative cost difference holds wherever your rates keep roughly the same ratio between replayed input and fresh output. The absolute dollar figure is a valuation. The 21% turn gap is a fact about the model.
Point releases deserve more than a glance at the leaderboard. Measure them on behavior, on your own tasks, with skills installed and stripped out, and look at the per domain breakdown before you trust the average. The upgrade that hides behind a tie is often the one worth shipping.
Want to see how your own stack behaves across a model upgrade? Browse the Tessl Registry to find the skills your agents depend on, then run the same paired evaluation we used here to measure what actually changed.
Rob Willoughby
Member of Technical Staff at Tessl