
Claude Opus 4.6 and GPT-5.3 Codex take aim at long-running AI work
10 Feb 2026 · 7 minute read

Not one but two big new models landed last week, as Anthropic and OpenAI each rolled out major updates to their flagship coding systems within hours of each other: Claude Opus 4.6, an LLM with expanded reasoning and coding chops; and GPT-5.3-Codex, OpenAI’s latest agentic iteration of its Codex line.
While the releases are not like-for-like in their technical design and priorities, they converge on a shared objective: enabling AI systems to sustain work over longer periods of time.
Despite steady gains in model performance over recent years, one limitation has persisted: models struggle to stay useful on longer jobs. Ask them to refactor a large codebase, investigate a bug across multiple files, or combine research with execution, and they tend to lose track of what they were doing, repeat work they’ve already done, or require frequent intervention to stay on task.
Over time, these behaviours limit how confidently teams can rely on agents inside real projects, particularly once work stretches beyond a few tightly scoped steps.
Model behaviour: Claude Opus 4.6 & GPT-5.3-Codex
Anthropic pitches Claude Opus 4.6 as a step forward in extended coding and agentic tasks. The company says the model plans more carefully, sustains autonomous activity for longer, and operates more reliably across large codebases. It also introduces support for a 1-million token context window in beta, a headline feature intended to expand how much material the model can reason over in a single session.
In reality, access to that full 1-million token window appears to be limited. As discussed by users in Reddit’s Claude developer community, the larger context window is currently exposed via the API rather than being enabled by default across all interfaces. For most users, context limits remain closer to those of earlier Opus releases, typically around 200,000 tokens.
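For developers who do want the larger window today, the opt-in happens at the API level. Below is a minimal sketch of what such a request might look like with the Anthropic Python SDK; the model ID and beta flag are assumptions rather than confirmed values for Opus 4.6, so check Anthropic’s documentation before relying on them.

```python
# Minimal sketch: requesting the beta 1M-token context window through
# the Anthropic Python SDK. The model ID and beta flag here are
# assumptions -- consult Anthropic's docs for the exact strings.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-opus-4-6",          # assumed model identifier
    betas=["context-1m-2025-08-07"],  # assumed opt-in flag for the 1M window
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Summarise the key modules in this codebase."},
    ],
)
print(response.content[0].text)
```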
But even without universal access to the maximum context size, the update points toward use cases where continuity really does matter: navigating unfamiliar repositories, tracking architectural decisions, or reviewing code without repeatedly reloading state. The emphasis isn’t just on writing code, but on reviewing, debugging, and catching its own mistakes — areas where earlier models often struggled once tasks spanned multiple steps.
According to the Claude Opus 4.6 system card, these improvements stem from additional training and evaluation on long-horizon, multi-step coding tasks, with stronger planning and self-review during execution. On Terminal-Bench 2.0, a benchmark designed to test sustained, agentic coding in terminal environments, Opus 4.6 performs more reliably across extended sequences of actions than earlier Claude models.

GPT-5.3-Codex, meanwhile, is positioned as an agent-oriented coding system. In its system card, OpenAI describes evaluating it on agent-style coding tasks run through the Codex CLI, where the model must edit files, invoke tools, run tests, and iterate until a task is complete.
Ultimately, GPT-5.3-Codex is pitched as an evolution of OpenAI’s earlier GPT-5.2-Codex and GPT-5.2 models, merging improvements in coding ability with reasoning and professional knowledge.
“This enables it to take on long-running tasks that involve research, tool use, and complex execution,” the card reads. “Much like a colleague, you can steer and interact with GPT-5.3-Codex while it’s working, without losing context.”
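For context, here is roughly what that workflow looks like from the outside. This is an illustrative sketch, not OpenAI’s evaluation harness: it drives a non-interactive Codex CLI run from a script, and the exact subcommand and flags may vary between CLI versions, so treat them as assumptions.

```python
# Illustrative sketch: kicking off a non-interactive Codex CLI run from
# a script -- the same shape of agent loop the system card describes
# (edit files, invoke tools, run tests, iterate). The `codex exec`
# subcommand is an assumption; verify against your installed CLI version.
import subprocess

result = subprocess.run(
    ["codex", "exec", "Run the test suite and fix any failures you find."],
    capture_output=True,
    text=True,
)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)
```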
Convergence without uniformity – and why this matters
What’s notable is not that Anthropic and OpenAI released major model updates in the same week, but that both are converging on the same underlying challenge from different directions. Neither release presents agentic coding as a solved problem; both pitch progress in terms of whether systems can remain useful as work unfolds over longer periods of time, carrying context, incorporating feedback, and following tasks through to completion.
That emphasis reflects a broader shift in how progress is being defined. Improvements are increasingly described in terms of behaviour under prolonged pressure: how systems handle iteration, recover from mistakes, and maintain coherence across multi-step work, rather than how they perform on isolated prompts.
As agents are trusted with longer-running responsibilities, consistency becomes part of that picture. Foundation models change frequently, and even subtle behavioural differences can affect how the same task is approached from one run to the next.
For teams embedding agents directly into real repositories, that variability can complicate repeatability and trust. One response has been to move more of an agent’s skills, expectations, and constraints into the project itself, treating them as part of the codebase rather than properties of a particular model. By defining how work should be carried out alongside the repository, teams can preserve a more consistent approach even as the underlying LLM evolves. Platforms such as Tessl take this approach by managing agent skills at the project level, helping decouple long-running behaviour from the black-box system beneath.
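What that looks like in practice varies by platform, but the underlying pattern is simple: keep the behavioural contract in the repository and inject it into every agent run. The sketch below is purely illustrative (the AGENTS.md filename follows one common convention and is not a Tessl-specific API), showing how project-level instructions can be prepended to a task regardless of which model executes it.

```python
# Purely illustrative: keeping agent instructions in the repository and
# prepending them to every task, so the "house rules" survive a swap of
# the underlying model. The AGENTS.md filename follows one common
# convention; it is an assumption, not any particular platform's API.
from pathlib import Path

def build_prompt(task: str, repo_root: Path = Path(".")) -> str:
    skills_file = repo_root / "AGENTS.md"
    conventions = skills_file.read_text() if skills_file.exists() else ""
    return f"{conventions}\n\nTask: {task}".strip()

print(build_prompt("Refactor the payment module without changing its public API."))
```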
For developers building with these systems, the takeaway is that long-running, context-aware work has become a first-class design target — one that spans models, tooling, and how projects themselves are structured. And that is a shift likely to shape how agents are built, tested, and trusted in the months ahead.