sharaf/speech-recognition-architect

Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.

100

2.00x

Quality

100%

Does it follow best practices?

Impact

100%

2.00x

Average score across 3 eval scenarios

Securityby

Passed

No known issues

Customer Experience Platform: Large-Scale English Transcription Blueprint

Name: sharaf/speech-recognition-architect
Rating: 100 (1 reviews)
Author: sharaf

Problem/Feature Description

A SaaS company provides a customer experience analytics platform to enterprise clients. Each client streams English-language call-center audio to the platform for real-time transcription and sentiment tagging. The largest enterprise clients route up to 800 simultaneous call streams during peak hours. The company has committed to self-hosting Whisper rather than relying on third-party APIs due to data privacy concerns from their financial-services clients.

Their current setup uses a single-process Python wrapper around Whisper that worked fine at 20 streams but falls over completely above 80. The infrastructure team has procured three NVIDIA H100 SXM 80 GB GPUs and wants a proper architecture that can serve all 800 streams concurrently within a two-second end-to-end latency budget. They need guidance on what serving framework to adopt, how to configure it correctly for streaming output of partial transcripts, how to split load across the three GPUs, and how to scale automatically as client call volume rises and falls.

Produce a complete architecture blueprint covering the serving stack, multi-GPU layout, streaming configuration, runtime and precision choice, GPU capacity calculation, load distribution, and autoscaling policy. The blueprint will be reviewed by both the platform engineering team and the client data-security officer, so it must explain deliberate choices and name alternatives that were considered and rejected.

Output Specification

Write the blueprint as a single Markdown file named blueprint.md. It must include a hardware sizing section with explicit numeric calculations that the team can audit and adjust if GPU count changes. The autoscaling section must name the specific metric or metrics the team should instrument. Any special configuration requirements for the chosen serving framework should be described in terms of what settings to enable and why, not as code or configuration file contents.