11 Mar 2026 · 11 minute read

A ‘high blast radius’: Amazon probes surge in outages linked to AI coding tools

Recent outages at Amazon are drawing attention to a growing tension inside modern software development: how far engineering teams can push AI-assisted coding before the guardrails around production systems catch up.
Internal documents reviewed by the Financial Times (paywall) describe a “trend of incidents” affecting Amazon’s retail infrastructure that involved “Gen-AI assisted changes” and had a “high blast radius.”
The memo also cited “novel GenAI usage for which best practices and safeguards are not yet fully established.”
The document was circulated ahead of a mandatory internal engineering meeting called to examine the incidents and discuss potential safeguards.
Amazon has since tightened its development process: junior and mid-level engineers can no longer push AI-assisted code to production without approval from a senior engineer, according to the internal communication.
A string of reliability incidents
Earlier this month, Amazon’s online store experienced an outage lasting several hours that prevented customers from completing purchases or checking product prices. At the time, the company said the disruption stemmed from an erroneous software code deployment.
The internal documents indicate that some of the incidents involved changes generated or assisted by AI coding tools.
One episode inside Amazon Web Services (AWS) involved an internal AI coding assistant called Kiro. Engineers allowed the system to make changes to the infrastructure supporting a cost-calculation service. Instead of applying a small modification, the tool reportedly deleted and recreated an entire environment, leading to a mid-December disruption that took roughly 13 hours to resolve.
The Financial Times first reported the incident in February, saying Amazon had experienced at least two AWS outages in recent months involving its internal AI coding tools.
Amazon described the December incident as “an extremely limited event,” affecting a single service and customers primarily in mainland China.
While that earlier FT report focused on disruptions inside AWS, the latest report suggests the issue may be broader, with internal documents describing incidents affecting Amazon’s retail infrastructure.
Reports indicate that reliability issues tied to AI-assisted changes may stretch back to around the third quarter of 2025, with the latest episode prompting Amazon’s retail technology leadership to call engineers into a deeper review of operational performance and deployment practices.
Kiro and spec-driven AI coding
The outage ultimately shines a light on Kiro, the AI coding assistant launched by Amazon Web Services last July. The tool is designed to move AI-assisted development beyond so-called “vibe coding” (rapid prototyping driven by prompts) toward generating production code from structured specifications.
Instead of jumping straight from a prompt to code, Kiro follows a spec-driven development model, where developers define requirements, architecture and implementation tasks before the AI generates code. The specifications act as a shared source of truth between engineers and the AI system, guiding how changes are implemented and tested.
Amazon has been encouraging engineers to adopt the tool as part of a wider push to integrate AI into software development.
The December AWS incident illustrates the operational challenges that can emerge as AI systems gain the ability to modify complex infrastructure. Engineers allowed Kiro to apply changes intended to resolve an issue with a cost-calculation system, but the agent determined the best course of action was to delete and recreate the environment, resulting in a disruption that took roughly 13 hours to resolve.
Amazon, for its part, said the problem stemmed from user permissions rather than the AI tool itself, adding that the engineer involved had broader access than expected and that the same issue could occur with any developer tool. The company also noted that its coding agents typically request authorization before taking actions.
That explanation points to a broader issue around governance rather than code generation itself. As AI agents become capable of making infrastructure changes, engineering teams are increasingly introducing approval gates and peer review before those actions reach production systems.
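An approval gate of the kind described above can be sketched in a few lines. This is an illustrative sketch only, not a description of how Kiro or any Amazon tooling works: the `run_plan` function, the `DESTRUCTIVE_VERBS` set and the plain-string action format are all hypothetical. It shows one way an agent's plan could be halted when a human declines a destructive step.

```python
from typing import Callable, Iterable, List

# Hypothetical list of verbs treated as destructive; a real system would
# classify actions far more carefully than by their first word.
DESTRUCTIVE_VERBS = {"delete", "recreate", "drop"}


def run_plan(
    plan: Iterable[str],
    approve: Callable[[str], bool],
    execute: Callable[[str], None],
) -> List[str]:
    """Execute actions in order, pausing for approval on destructive ones.

    Returns the actions that were actually executed. If a destructive
    action is not approved, the rest of the plan is abandoned rather than
    skipped past, since later steps may depend on the rejected one.
    """
    executed: List[str] = []
    for action in plan:
        verb = action.split()[0].lower()
        if verb in DESTRUCTIVE_VERBS and not approve(action):
            break
        execute(action)
        executed.append(action)
    return executed
```

In this sketch, a plan that tries to delete and recreate an environment stops at the delete step unless the reviewer explicitly approves it, while non-destructive steps before it still run.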
Startups such as Tessl are exploring development models that structure how AI coding agents produce software. A spec-driven approach treats specifications, architecture and implementation tasks as first-class artifacts, allowing engineers to review the intent behind an AI-generated change before code is produced.
The incidents at Amazon suggest that as AI coding tools become more capable, the critical challenge may lie less in how those systems generate code than in the operational guardrails governing how those changes reach production, including permissions, staged deployments and human review.
Guardrails catch up with tools
Amazon’s immediate response has focused on tightening deployment controls.
Under the updated policy, engineers can continue using generative AI tools while developing software, but AI-assisted changes must now pass through an additional layer of review before reaching production systems. Junior and mid-level engineers are required to obtain approval from a senior engineer before those changes can be deployed.
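Expressed as a deployment check, the reported rule might look something like the sketch below. The level names, the `Change` record and the `may_deploy` function are hypothetical, and the handling of AI-assisted changes authored by senior engineers is an assumption, since the reporting only describes the requirement for junior and mid-level engineers.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical level names; the article only distinguishes junior and
# mid-level engineers from senior engineers.
SENIOR_LEVELS = {"senior", "principal"}


@dataclass
class Change:
    author_level: str                     # e.g. "junior", "mid", "senior"
    ai_assisted: bool                     # produced with a generative AI tool
    approver_level: Optional[str] = None  # level of the reviewer, if any


def may_deploy(change: Change) -> bool:
    """Return True if the change satisfies the sketched review rule."""
    # Changes written without AI assistance are outside the rule.
    if not change.ai_assisted:
        return True
    # Assumption: senior authors need no extra sign-off under this rule.
    if change.author_level in SENIOR_LEVELS:
        return True
    # Junior and mid-level authors need approval from a senior engineer.
    return change.approver_level in SENIOR_LEVELS
```

A junior engineer's AI-assisted change with no approver is blocked, while the same change approved by a senior engineer passes.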
Amazon said the review of site availability was part of routine operational oversight and that the company continually evaluates the reliability of its retail systems.
A lesson for AI-assisted development
The incidents at Amazon highlight a challenge emerging across the software industry as AI coding tools move from experimentation into everyday development.
Systems such as Amazon’s retail platform and AWS operate through thousands of interconnected services. Changes that appear small in isolation can ripple across infrastructure when deployed at scale.
The company’s response carries a certain irony, too. In earlier reporting on AWS disruptions, Amazon said the incidents were “user error, not AI error,” describing the involvement of AI tools as a coincidence. The latest internal memo cited by the Financial Times, however, points to “Gen-AI assisted changes” as a contributing factor in a broader pattern of outages, while the remedy now involves restoring a more traditional safeguard: senior human review before AI-assisted changes reach production.
Questions about engineering capacity
Some observers say the incidents may reflect broader operational pressures inside Amazon. The company has carried out sweeping workforce reductions in recent months, eliminating around 30,000 corporate roles across two layoff rounds in late 2025 and early 2026, with engineers among the groups heavily affected.
James Gosling — the creator of the Java programming language and a former Distinguished Engineer at AWS — wrote in a LinkedIn post Tuesday that the combination of engineering layoffs and “hype-driven” technology can undermine system stability.
“The ridiculous engineering layoffs I’ve witnessed and hype-driven technology choices all inevitably lead to system instability,” Gosling wrote, adding that he remained skeptical that the company would learn from the recent disruptions.
In an earlier post discussing a previous AWS outage, Gosling argued that internal restructuring had pushed teams to evaluate services largely on their direct return on investment.
“The only metric they cared about when measuring a service was ROI — how much money the service brought in from customers,” he wrote, adding that teams supporting lower-revenue systems were “decimated.”
“Many of the services have little to no direct revenue. And yet they are critical to the operation of the system.”
He pointed to internal infrastructure such as DNS as an example, warning that cutting those teams damages “the ability to improve, the ability to reduce technical debt, and the ability to respond to operational issues.”
Put simply, large platforms such as Amazon’s retail systems and AWS rely on numerous internal services that generate little direct revenue but remain critical to overall reliability.
Whether the recent outages stem primarily from AI-assisted coding, operational changes, or a combination of both, the incidents highlight how introducing new automation tools into large-scale systems can expose weaknesses in the engineering processes that support them.
However the issue is framed, AI-assisted coding tools aren’t going away. Incidents like these may instead become early case studies in how engineering teams govern systems that increasingly help write and modify production code.



