Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.
100
100%
Does it follow best practices?
Impact
100%
2.00xAverage score across 3 eval scenarios
Passed
No known issues
A SaaS company provides a customer experience analytics platform to enterprise clients. Each client streams English-language call-center audio to the platform for real-time transcription and sentiment tagging. The largest enterprise clients route up to 800 simultaneous call streams during peak hours. The company has committed to self-hosting Whisper rather than relying on third-party APIs due to data privacy concerns from their financial-services clients.
Their current setup uses a single-process Python wrapper around Whisper that worked fine at 20 streams but falls over completely above 80. The infrastructure team has procured three NVIDIA H100 SXM 80 GB GPUs and wants a proper architecture that can serve all 800 streams concurrently within a two-second end-to-end latency budget. They need guidance on what serving framework to adopt, how to configure it correctly for streaming output of partial transcripts, how to split load across the three GPUs, and how to scale automatically as client call volume rises and falls.
Produce a complete architecture blueprint covering the serving stack, multi-GPU layout, streaming configuration, runtime and precision choice, GPU capacity calculation, load distribution, and autoscaling policy. The blueprint will be reviewed by both the platform engineering team and the client data-security officer, so it must explain deliberate choices and name alternatives that were considered and rejected.
Write the blueprint as a single Markdown file named blueprint.md. It must include a hardware sizing section with explicit numeric calculations that the team can audit and adjust if GPU count changes. The autoscaling section must name the specific metric or metrics the team should instrument. Any special configuration requirements for the chosen serving framework should be described in terms of what settings to enable and why, not as code or configuration file contents.