How enterprises are scaling AI: 5 patterns from OpenAI

OpenAI identifies five patterns for scaling AI in enterprises, focusing on operational integration, governance, and engineering ownership over model capabilities.

Paul Sawers

·19 May 2026·12 min read

For enterprises adopting AI, model capability is no longer the main bottleneck. The bigger challenge now lies in the infrastructure around the model itself: evaluation systems, context handling, memory, orchestration, and the operational controls needed to make these tools reliable inside large organizations.

A new report from OpenAI hints at the contours of a new engineering discipline beginning to form around those problems. Drawing on examples from companies including BBVA, Philips, Scania, and JetBrains, the report argues that long-term results now depend on how organizations manage trust, evaluation, governance, and integration into real team environments.

Sanj Bhayro, managing director for OpenAI in the EMEA region, argues that many companies are struggling to turn rapidly advancing model capabilities into dependable systems that fit into production environments. In practice, enterprises run into fragmented tooling, isolated pilots, and difficulty connecting experimentation to measurable operational outcomes.

“This is a leadership challenge, not a technical one,” Bhayro writes in the report.

The companies getting traction from AI, he says, are those focusing on operational integration, internal adoption, and engineering ownership across the organization.

“The organizations that win with AI won’t be the ones that tried it first – they’ll be the ones that operationalized it best,” Bhayro continues.

The TL;DR: A new operational layer emerges

OpenAI’s enterprise AI report suggests the main challenge facing large organizations isn’t model capability, but the systems surrounding it: evaluation, context management, orchestration, governance, and human review.

The companies seeing traction from AI are defining quality standards before deployment, embedding AI into day-to-day engineering work, involving governance teams early, giving teams controlled autonomy to experiment, and keeping human judgment close to production decisions.

The result is the outline of a new operational layer around enterprise AI engineering: eval systems running against real codebases, shared context and skills layers, harnesses coordinating agents and tools, observability systems tracking performance and cost, and governance infrastructure recording how AI systems are built, tested, and approved over time.

1. Defining quality standards before scaling AI

One of the strongest threads running through OpenAI’s report is the emphasis companies placed on defining quality standards before attempting broader rollout. German property marketplace Scout24, for example, invested heavily in evaluation while building its conversational real-estate search product, pairing custom testing frameworks inspired by OpenAI Evals with broad internal testing designed to surface edge cases and calibrate confidence ahead of launch.

Features were delayed when quality thresholds were not met.

The report repeatedly returns to the idea that organizations built confidence in AI systems by defining quality standards early, resisting pressure to widen deployment before those standards were met.

“We learned that defining ‘good’ before scaling AI is critical, because quality turns experimentation into something users can truly trust,” Scout24 CTO Gertrud Kolb said.
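To make the idea of a pre-launch quality bar concrete, here is a minimal sketch of a release gate in Python. It is not Scout24's framework or the OpenAI Evals API: the `EvalCase` type, the `release_gate` function, the substring check, the 0.9 threshold, and the stubbed assistant are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One hypothetical test case: a query and a phrase a good answer must contain."""
    query: str
    must_contain: str


def pass_rate(cases, answer_fn):
    """Fraction of cases whose answer contains the expected phrase (case-insensitive)."""
    if not cases:
        raise ValueError("at least one eval case is required")
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in answer_fn(case.query).lower()
    )
    return passed / len(cases)


def release_gate(cases, answer_fn, threshold):
    """Return (ship, score): ship is True only if the pass rate meets the threshold."""
    score = pass_rate(cases, answer_fn)
    return score >= threshold, score


# Stand-in for the system under test; a real gate would call the deployed model.
def stub_search_assistant(query):
    canned = {
        "2-bed flat in Berlin under 1500": "Found 12 flats in Berlin under 1500 EUR.",
        "house with garden near Munich": "Sorry, I could not understand the request.",
    }
    return canned.get(query, "")


CASES = [
    EvalCase("2-bed flat in Berlin under 1500", "Berlin"),
    EvalCase("house with garden near Munich", "Munich"),
]

ship, score = release_gate(CASES, stub_search_assistant, threshold=0.9)
print(f"pass rate {score:.2f}, ship: {ship}")  # pass rate 0.50, ship: False
```

The point of the sketch is the ordering: the threshold is fixed before the run, so a feature that scores below it is held back rather than the bar being lowered after the fact.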

That challenge is becoming more pronounced as AI systems grow more dependent on prompts, context windows, orchestration layers, and external tools surrounding the model itself. Small changes to those systems can create visible shifts in output quality, even when the underlying model remains unchanged.

Anthropic's recent Claude Code postmortem was a useful reminder that this problem extends all the way up the stack. Small changes to prompts, caching behaviour, and reasoning settings compounded into something that looked like broad, inconsistent degradation. It got past code reviews, automated tests, and internal dogfooding. But it didn’t get past the end-user.

If Anthropic can run into these problems, nobody further down the stack gets a free pass. Evaluation isn’t a nice-to-have.

Tessl is building evaluation tooling designed to measure how different skills, context files, and agent setups perform against real codebases over time, allowing engineering teams to compare outcomes, identify regressions, and refine how coding agents behave inside production environments. The underlying idea is context maturity: evaluation and the management of the context feeding agents need to advance together, not independently.
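As a rough illustration of what comparing outcomes and identifying regressions can look like, the sketch below compares per-task scores from two agent setups. It does not reflect Tessl's tooling or any real product API: the `find_regressions` function, the task names, the 0.05 tolerance, and the scores are all made up for the example.

```python
from statistics import mean


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return tasks whose mean candidate score fell more than `tolerance` below baseline.

    Both arguments map a task name to a list of scores in [0, 1] from repeated runs.
    Tasks missing from the candidate are skipped here; a real harness might flag them.
    """
    regressions = {}
    for task, base_scores in baseline.items():
        cand_scores = candidate.get(task)
        if not base_scores or not cand_scores:
            continue
        drop = mean(base_scores) - mean(cand_scores)
        if drop > tolerance:
            regressions[task] = round(drop, 3)
    return regressions


# Made-up scores for two hypothetical context setups on the same tasks.
baseline_runs = {
    "fix-failing-test": [0.9, 0.8, 1.0],
    "add-api-endpoint": [0.7, 0.7, 0.8],
}
candidate_runs = {
    "fix-failing-test": [0.9, 0.9, 0.9],
    "add-api-endpoint": [0.5, 0.6, 0.4],
}

print(find_regressions(baseline_runs, candidate_runs))  # {'add-api-endpoint': 0.233}
```

Averaging over repeated runs matters because agent output varies from run to run; a single run per task would make it hard to tell a real regression from noise.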

An operational stack is forming around enterprise AI engineering: evaluation systems running against real codebases, shared context and skills layers carrying institutional knowledge across teams, harnesses coordinating how agents interact with tools and environments, observability systems tracking cost and performance over time, and governance layers recording who changed what and why.

2. Giving engineering teams autonomy to build with AI

The flip side of that quality imperative is an organisational one: how do you let teams build and experiment freely while still gatekeeping what gets deployed? It's a tension most senior leaders will recognise.

OpenAI’s report points to Mirakl as one of the examples of companies moving beyond general AI usage into something more demanding. The French e-commerce software company gave employees tools and autonomy to create their own agents and rethink how AI could support day-to-day work. According to the report, that translated into faster creation of internal technical documentation, higher customer support efficiency while maintaining satisfaction, and quicker catalog onboarding with fewer errors.

The pattern here centres on giving teams enough autonomy to adapt AI systems around specific operational problems, while still keeping clear boundaries around where human review, judgment, and accountability remain essential.

“The lesson was not to deploy AI everywhere, but to use it deliberately as a system-level lever — while keeping human judgment firmly where it matters,” the report notes.

A shared infrastructure — evaluation tooling, context registries, skills layers — is part of what makes that balance possible: autonomy at the edges, consistency at the centre.

3. AI adoption starts with culture, not tooling

One of the more revealing points in OpenAI’s report is how companies approached AI adoption internally. Philips, for example, started by training senior leaders early, creating internal support and clearer expectations around how AI should be used across the organization. From there, teams working closer to day-to-day operational and engineering problems were encouraged to experiment and identify where AI could genuinely reduce friction or improve existing processes.

“Senior leaders were trained first – bottom-up ideas surfaced practical use cases,” the report notes.

It's a pattern echoed elsewhere. Jason Kellington, a trainer and technical writer at Microsoft, described the company's approach as "a phased journey grounded in people, culture, and continuous learning" — focused on helping engineers understand where AI could genuinely improve their work, rather than encouraging blanket adoption.

That emphasis on culture and operational habits sits at the centre of the Philips example in OpenAI’s report, which argues that organizations seeing traction from AI treat it as part of how teams work, collaborate, and make decisions day-to-day.

“By treating AI as an organizational capability — not a toolkit — Philips drove broad adoption aligned with patient care and clinical quality,” the report notes.

4. Building AI governance that accelerates, not blocks

Another key theme running through OpenAI’s report is that the companies seeing traction from AI involved governance and risk teams early, rather than treating them as approval gates at the end of the process.

Spanish banking giant BBVA, for example, brought security, legal, compliance, and IT teams into deployment discussions from the outset, helping the bank move faster once AI systems reached production environments.

“This foundation made it possible to move quickly without losing control – a non‑negotiable requirement in a global, regulated bank,” the report notes.

The broader point wasn’t simply about compliance oversight, but about creating enough internal confidence for teams to use AI systems in meaningful operational work without second-guessing whether the tools themselves had organizational backing.

“Governance worked because it reinforced trust,” the report concludes. “Employees had confidence that AI could be used for real thinking and everyday problem-solving – not just ‘safe’ or superficial tasks.”

The implication, ultimately, is that governance done well doesn't slow adoption — it's what makes meaningful adoption possible.

5. Protecting engineering judgment as AI scales

The final pattern in OpenAI’s report centres on keeping human judgment close to the work AI systems now support. In the JetBrains example, the focus was on using AI to help developers review, reason about, and design systems, rather than simply generate more code.

That becomes more important as AI tools produce larger volumes of plausible-looking output. The risk is that teams move faster while losing sight of architecture, maintainability, security, or production behaviour.

“It’s not just about generating code – it has to be safe, readable, and maintainable,” JetBrains’ CPO Kris Kang noted.

Microsoft’s internal AI rollout pointed in a similar direction, with the company emphasizing review culture, engineering standards, and preserving space for higher-value technical reasoning as adoption expanded across teams.

“Leaders played a critical role in that shift,” Kellington wrote. “Rather than positioning AI as a productivity shortcut, they framed it as a way to strengthen engineering fundamentals: clearer design discussions, better documentation, faster feedback loops, and more time for deep problem-solving.”

Paul Sawers

Freelance tech writer at Tessl, former TechCrunch senior writer covering startups and open source
