Anthropic postmortem shows how small changes compounded into Claude Code failure

28 Apr 2026 · 8 minute read

Paul Sawers

Freelance tech writer at Tessl, former TechCrunch senior writer covering startups and open source

A decline in Claude Code’s performance over the past month prompted Anthropic to publish a detailed breakdown of where things went wrong, tracing the issue not to the model itself but to a set of changes in the systems around it.

In the postmortem, Anthropic said the issues stemmed from a combination of prompt changes, caching bugs, and adjustments to how the system handled reasoning, all introduced within a short period of time.

The result was a noticeable decline in coding quality that users have been flagging en masse.

“In the past week or so I noticed a major decrease in [the] quality of Claude Code outputs -- no changes were introduced at my end,” one user on Reddit wrote. “It suddenly went from being the most capable piece of AI to ‘I don't trust it with the simplest stuff’.”

The issue, they said, was that even careful, experienced usage offered no protection — breaking tasks into smaller pieces, the usual fix for LLM quality dips, stopped working. Simple instructions that had previously served them well were being ignored mid-session, with the model admitting it had skipped them.

“I went from a feeling of having superpowers to thinking that this is an OK AI that requires a lot of handholding — feels like I'm back to 2024,” they wrote.

Others described a similar productivity drop: custom workflows, spec-driven development, and hooks all in place, and, until recently, a feeling of getting more done in a week than had previously been possible in months. Within days that had reversed entirely, with Opus requiring so much handholding that some tasks were simply faster to do manually. "Writing code by hand like it's 2024," as one put it, and even rolling back to shorter context windows did little to help.

The saga offers a window into how AI coding tools can fail in production — not through dramatic model changes, but through small, compounding adjustments to the systems around them that users feel long before anyone behind the scenes acknowledges it.

Multiple small changes, one visible failure

Anthropic’s account points to three separate changes from early March to mid-April that interacted in unexpected ways.

One reduced how much reasoning the system performed before responding: Anthropic lowered the default reasoning effort from high to medium to cut latency that was making the interface appear frozen, a trade-off it later reversed after users pushed back on the drop in output quality. Another was meant to clear older context from idle sessions, again to reduce latency, but a bug caused the system to wipe context repeatedly, making it appear to forget prior instructions. The third was a prompt change intended to make responses less verbose; it interacted poorly with other instructions and degraded coding quality before being rolled back days later.
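Anthropic has not published the code involved, but the context-clearing failure mode is easy to picture in the abstract. The sketch below is a hypothetical illustration in Python, not Anthropic's implementation: a pruning routine that was meant to fire once per idle session keeps firing because nothing records that it has already run, so each pass strips more of the transcript, eventually including the instructions the user set at the start.

```python
# Hypothetical illustration of the context-clearing bug described in the
# postmortem; this is not Anthropic's code. A prune meant to happen once
# per idle session keeps repeating because nothing records that it ran.

IDLE_THRESHOLD_S = 300  # assumption: treat a session as idle after 5 minutes

def prune_idle_session(session: dict, now: float) -> dict:
    """Drop the older half of an idle session's transcript to cut latency."""
    if now - session["last_active"] < IDLE_THRESHOLD_S:
        return session
    # BUG: there is no `if session.get("pruned"): return session` guard and
    # the flag is never set, so every sweep over idle sessions halves the
    # transcript again. After a few passes the standing instructions at the
    # top of the conversation are gone, and the model appears to have
    # "forgotten" them mid-session.
    keep_from = len(session["messages"]) // 2
    session["messages"] = session["messages"][keep_from:]
    return session
```

A one-line idempotency guard fixes it, which is part of why a bug like this can slip past review while still being very visible to users mid-session.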

“Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation,” Anthropic wrote.

In other words, each change was limited in scope, but in combination they produced a drop in performance that was visible to users working on real projects.

The company said it had “immediately confirm[ed] that [its] API and inference layer were unaffected,” attributing the issues to changes in reasoning settings, memory handling, and prompts.

The episode shows that AI coding tools can become unreliable even when the model itself is stable: outcomes are shaped by prompting and context management, and whether regressions are caught before release depends on the evaluation harness used to test changes.

Tools such as Tessl Evals are designed to measure exactly that: how different configurations, context files, and agent setups affect performance on real tasks, running scenarios with and without context to compare outcomes and isolate their impact.

Put simply, any company making changes to prompts, context handling, or agent configuration — whether tuning internal systems or building on top of foundation models — needs a way to test how those changes affect output before they reach users.
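What that testing might look like is straightforward to sketch. The snippet below is a minimal, hypothetical example: run_agent, the task list, and the configuration fields are stand-ins for whatever harness and task suite a team actually uses, whether Tessl Evals or an in-house runner. The point is only the shape of the check: run the same tasks under the current configuration and the proposed one, and block the release if the pass rate drops.

```python
# Minimal pre-release check for prompt/config changes. Hypothetical
# throughout: run_agent(), the task list, and the config fields are
# placeholders, not a real API.

TASKS = [
    {"prompt": "Fix the failing unit test in utils.py"},
    {"prompt": "Add input validation to the signup handler"},
]

def run_agent(task: dict, config: dict) -> bool:
    """Placeholder: invoke the coding agent with the given configuration
    (system prompt, reasoning effort, context policy) and report whether
    the task's checks passed."""
    raise NotImplementedError

def compare_configs(baseline: dict, candidate: dict, tasks=TASKS) -> dict:
    """Run the same tasks under both configurations and report the delta."""
    results = {}
    for name, config in (("baseline", baseline), ("candidate", candidate)):
        passed = sum(1 for task in tasks if run_agent(task, config))
        results[name] = passed / len(tasks)
    results["delta"] = results["candidate"] - results["baseline"]
    return results

# The kind of small change the postmortem describes, expressed as configs:
baseline = {"reasoning_effort": "high", "system_prompt": "v41"}
candidate = {"reasoning_effort": "medium", "system_prompt": "v42-terse"}
# compare_configs(baseline, candidate)  # gate the release on results["delta"]
```

A gate like this will not catch everything, but it turns "the prompt feels off" into a number that can block a release.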

The product layer as a point of failure

Zooming out, the episode highlights the growing importance of what sits around the model.

AI coding tools such as Claude Code and Cursor are not just interfaces to a model. They manage context windows, apply system prompts, handle memory, and decide how much reasoning effort to allocate to a task. Each of these elements can influence output quality.
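One way to make that concrete is to list the knobs the product layer controls. The sketch below is purely illustrative, with made-up field names rather than anything from Claude Code's or Cursor's actual configuration; the point is that the model is a single field among several, and each of the postmortem's three changes maps to flipping one of the others.

```python
from dataclasses import dataclass

# Illustrative only: invented field names for the settings an AI coding
# tool manages around the model. None of these come from Claude Code or
# Cursor; they just show where the product layer can change behaviour
# without the model changing at all.

@dataclass
class AgentConfig:
    model: str = "some-model"             # the one thing that did not change
    system_prompt_version: str = "v41"    # tone and verbosity instructions
    reasoning_effort: str = "high"        # thinking budget before answering
    context_window_tokens: int = 200_000  # how much history is sent along
    idle_context_policy: str = "keep"     # e.g. "keep" vs. "prune-once"
    memory_enabled: bool = True           # whether state carries across turns
```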

In this case, a prompt change designed to reduce verbosity had unintended consequences. Anthropic said the adjustment “hurt coding quality,” linking a surface-level change in tone to deeper issues in how the model approached tasks.

A separate caching issue compounded the problem. By mishandling conversation state, the system could fail to carry forward relevant information, giving users the impression that it was inconsistent or forgetful.

These aren’t failures of reasoning in isolation. They are failures in how reasoning is orchestrated.

That gap has led to more emphasis on testing changes against realistic tasks and scenarios before release, including approaches that compare performance with and without additional context or configuration.

Community reaction: ‘Everything points toward compute saving’

Much like the original Claude Code problems, Anthropic's detailed breakdown has itself generated significant online debate.

Some users welcomed the level of detail as unusual transparency from a company discussing decisions that had a negative impact on users.

“Rare to see a company be this transparent about shipping decisions that hurt users,” one Redditor wrote. Others were more sceptical, suggesting the explanation masked attempts to reduce compute costs. “Lmao crazy how they’re all claiming it’s ‘bugs’ and ‘to improve latency’ while everything points toward compute saving,” another opined.

A more balanced view suggested a combination of cost considerations and unintended bugs.

“I do think it’s both,” one user noted. “They definitely intentionally tried to save compute. But I have no doubt that they also introduced bugs that were unintended. I don’t think Anthropic ever shipped a feature that is not broken and unstable.”

Anthropic, for its part, said it has since rolled back the changes and introduced safeguards to prevent similar issues. The takeaway centres on how these systems are built and tested.

For teams adopting AI coding tools, attention turns to how changes to prompts, context, and configuration are managed and evaluated over time.