The hidden cost of agentic software development: why context engineering matters

AI token costs in software development are rising, impacting budgets. Context engineering is crucial for managing these expenses and ensuring efficient resource use.

Paul Sawers

9 Jun 2026 · 7 min read

AI token bills are becoming one of the fastest-growing line items in engineering budgets, and most teams have little visibility into where the money is actually going. GitHub recently abandoned flat-rate pricing for its Copilot coding agent in favour of token-based billing — a move that sent some subscribers' projected costs up tenfold overnight. Anthropic, too, is increasingly moving toward consumption-based API token pricing — a direction that has developers bracing for a potential cost surge, with many VPs of engineering already exploring whether open-weight models can absorb more of their workload.

However you slice and dice it, the message from the market is clear: token costs are now a governance problem. When consumption is opaque and billing is variable, engineering leaders lose the ability to forecast spend, set budgets, or hold teams accountable — the same control problems that plagued cloud costs a decade ago, before FinOps became a discipline in its own right.

Tessl has already run this experiment. When it switched its default eval solver from Claude Sonnet 4.6 to the open-weight GLM 5.1 — a lower-cost model it uses to measure whether agent skills are working — it found that skills-equipped agents agreed on the right outcome in 88.5% of tasks, at an overall eval cost roughly 28% lower.

Recent research from Concordia University puts some empirical weight behind the overarching concern, and its findings may surprise engineering leaders who assume they know where their agent spending is concentrated.

Context engineering is the cost lever

The paper, titled Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering, analysed execution traces from 30 software development tasks run through ChatDev, an open source multi-agent framework that simulates a software development team — complete with agents assigned roles such as programmer, tester, and code reviewer. The researchers, led by Emad Shihab at Concordia's Data-driven Analysis of Software lab, mapped token consumption across six development stages: design, coding, code completion, code review, testing, and documentation.

The headline finding is that code review accounts for an average of 59.4% of all token consumption — by far the largest single cost centre. Initial code generation, by contrast, comes in at just 8.6%.

ChatDev with GPT-5 reasoning (Credit: Concordia University)

The reason proffered in the report is structural: in a conversational multi-agent system, agents engaged in code review repeatedly pass the full codebase back and forth on every turn, accumulating what the researchers call a "communication tax." Across all tasks, input tokens — context being fed into the model — made up 53.9% of total consumption, compared to 24.4% for output tokens.

In other words, the agents are spending more tokens communicating context to each other than they are generating new work.
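To make the "communication tax" concrete, here is a minimal sketch of how input tokens pile up when every review turn re-sends the code along with all earlier replies. The function name, the token counts, and the assumption that history is carried forward verbatim are illustrative only; they are not taken from the paper or from ChatDev.

```python
def review_loop_tokens(context_tokens: int, reply_tokens: int, turns: int) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for a conversational review loop.

    Assumption: each turn re-sends `context_tokens` of code plus every
    earlier reply, and produces one new reply of `reply_tokens`.
    """
    input_total = 0
    output_total = 0
    history = 0  # tokens of earlier replies carried into the next turn
    for _ in range(turns):
        input_total += context_tokens + history
        output_total += reply_tokens
        history += reply_tokens
    return input_total, output_total


# Hypothetical numbers: a 4,000-token codebase, 300-token replies, 5 turns.
full = review_loop_tokens(context_tokens=4_000, reply_tokens=300, turns=5)
# Same loop, but each turn sends only a 400-token diff instead of the codebase.
diff_only = review_loop_tokens(context_tokens=400, reply_tokens=300, turns=5)

print(full)       # (23000, 1500)
print(diff_only)  # (5000, 1500)
```

In this toy setup the output is identical in both runs, yet the input bill differs by more than a factor of four, which is the shape of the imbalance the researchers describe.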

The coding stage is the one notable outlier: it runs output-heavy, with 58% output tokens versus just 6.9% input, which makes intuitive sense — a single instruction can yield hundreds of lines of code. Every other stage, including testing and documentation, is dominated by input tokens.

Phase-by-phase token ratio breakdown (Credit: Concordia University)

Know your cost map before the bill arrives

For teams running agents in production, the research offers a way to think about cost prediction based on the nature of the work. A greenfield project with substantial initial coding will look very different from a refactoring or review-heavy effort, which will be dominated by the expensive, input-heavy code review cycle. The researchers suggest that inserting a human checkpoint before the iterative code review loop begins could prevent a significant amount of unnecessary token burn, since that loop is where the real inefficiency lies.
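As a rough illustration of a cost map, the sketch below splits a token budget using the two stage shares quoted above. The remaining stages are not broken out in this article, so they are lumped into a single "other stages" bucket, which is an assumption made here. The shares are averages from one framework and one model, so treat the output as a starting estimate rather than a forecast.

```python
# Stage shares quoted in the article (averages across the study's tasks).
# "other stages" is simply the remainder, lumped together as an assumption.
STAGE_SHARE_PCT = {
    "code review": 59.4,
    "coding": 8.6,
    "other stages": 32.0,  # 100 - 59.4 - 8.6
}


def cost_map(total_tokens: int) -> dict[str, int]:
    """Split a total token budget across stages by their reported share."""
    return {
        stage: round(total_tokens * pct / 100)
        for stage, pct in STAGE_SHARE_PCT.items()
    }


# Hypothetical budget of one million tokens for a project.
estimate = cost_map(1_000_000)
for stage, tokens in estimate.items():
    print(f"{stage}: {tokens:,} tokens")
```

A review-heavy or refactoring project would likely skew further toward the code review bucket, so teams may want to replace these shares with figures measured from their own traces.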

“This suggests that the primary cost of agentic software engineering lies not in initial code generation but in the iterative, conversational process of refinement and verification,” the report notes.

There are important caveats. The study used a single framework and a single model — GPT-5 — across 30 tasks. ChatDev is primarily a research framework rather than a production tool, so the specific percentages may not map directly onto commercial agents. The authors are candid about these limitations. However, the underlying dynamic — that verification and refinement loops, where agents repeatedly ingest large amounts of existing code, are structurally more expensive than generation — is likely to hold across conversational multi-agent architectures more broadly.

The research also connects to a growing body of practitioner thinking on context engineering: keeping token costs down is less about the model and more about how carefully you manage what gets passed into it. A community-contributed skill already in the Tessl registry cites this line of research directly, framing context engineering — loading only what's needed, compressing history, applying strict retrieval thresholds — as the practical discipline for keeping agent costs under control.
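The three practices named above (loading only what is needed, compressing history, applying a strict retrieval threshold) can be sketched in a few lines. Everything here is hypothetical: the `Snippet` type, the relevance scores, the word-count stand-in for a tokenizer, and the default limits are illustrative choices, not the registry skill's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    score: float  # hypothetical relevance score in [0, 1]


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def build_context(snippets, history, threshold=0.7, keep_last=2, budget=50):
    """Assemble a trimmed context as a list of strings."""
    # 1. Strict retrieval threshold: drop low-relevance snippets outright.
    relevant = sorted(
        (s for s in snippets if s.score >= threshold),
        key=lambda s: s.score,
        reverse=True,
    )
    # 2. Compress history: keep the last few turns, summarise the rest
    #    with a placeholder (a real system might use a model-written summary).
    split = max(len(history) - keep_last, 0)
    older, recent = history[:split], history[split:]
    parts = []
    if older:
        parts.append(f"[{len(older)} earlier turns omitted]")
    parts.extend(recent)
    # 3. Load only what fits: add snippets until the token budget is spent.
    used = sum(approx_tokens(p) for p in parts)
    for s in relevant:
        cost = approx_tokens(s.text)
        if used + cost > budget:
            break
        parts.append(s.text)
        used += cost
    return parts


snippets = [
    Snippet("def add(a, b): return a + b", 0.9),
    Snippet("unrelated readme text", 0.2),
]
context = build_context(snippets, history=["t1", "t2", "t3"])
print(context)
```

The ordering of the steps is a design choice in this sketch: recent conversation is kept first and retrieved code fills whatever budget remains, which favours continuity over recall.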

Tessl's evals layer adds another dimension: by running paired evaluations across models and measuring turn count, cost per task, and skill performance side by side, engineering teams can make data-driven decisions about which model delivers the best results for their specific workloads, rather than relying on headline accuracy scores that can mask significant cost differences underneath.

As token-based billing becomes the norm, understanding where tokens actually go is a prerequisite for running agents responsibly.

Paul Sawers

Freelance tech writer at Tessl, former TechCrunch senior writer covering startups and open source
