Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.
Create local, on-prem, or self-hosted ASR architecture blueprints. Return a design document, not implementation code.
Use only matching rows; combine rows when requirements overlap.
| Need | Required picks and rejections |
|---|---|
| Romanian live/PSTN | Whisper-large-v3-turbo, MIT, SimulStreaming AlignAtt, Silero VAD before Whisper, ~1s chunks, ~30s buffer. Whisper alignment heads may be English-tuned; calibrate Romanian heads or use LocalAgreement-2. Name G.711/G.729 and normalize 8 kHz to 16 kHz. Reject Voxtral, Voxtral Realtime, Riva streaming Parakeet 1.1B RNNT, Parakeet-TDT-0.6B-v2, Nemotron Streaming for Romanian gaps. |
| High-concurrency Whisper | Use Triton Inference Server + TensorRT-LLM Whisper, not single-process FastAPI/faster-whisper. Require model_transaction_policy { decoupled: true }, bidirectional gRPC ModelStreamInfer; HTTP is insufficient. Use FP16, FP8 on H100 only after WER validation, inflight_fused_batching, separate self/cross-attention KV tuning, replica scaling with stream affinity via consistent/ring hash, sticky cookie/stick table, or session-to-replica map. Autoscale on queue duration/depth or GPU utilization, never CPU/memory. Start at 200-300 streams per H100. |
| RO+EN translation | Never force language="ro" on mixed audio. Add per-segment LID after VAD/diarization and before ASR/translation. Route confident ro to Romanian, en to English, mixed/low-confidence to dual-token concat prompt. Use Canary-1B-v2 with target_lang="en" and CC-BY-4.0 commercial-use-with-attribution. Reject Whisper-large-v3-turbo for translation because decoder pruning removed it; reject MMS and SeamlessM4T-v2 because CC-BY-NC. Treat Moldovan as a research gap requiring a representative evaluation set. |
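
The stream-affinity requirement in the high-concurrency row can be illustrated with a consistent-hash ring. This is a minimal sketch for readers of this skill, not something to include in a blueprint. The class name `HashRing`, the replica names, the stream ID format, and the virtual-node count are all hypothetical; a real deployment would normally rely on the load balancer's own ring-hash policy.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping stream IDs to replica names (illustrative)."""

    def __init__(self, replicas, vnodes=64):
        # Each replica gets `vnodes` points on the ring to even out load.
        self._points = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in replicas
            for i in range(vnodes)
        )
        self._keys = [point_hash for point_hash, _ in self._points]

    @staticmethod
    def _hash(value):
        # Stable 64-bit hash; Python's built-in hash() is salted per process.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def replica_for(self, stream_id):
        # First ring point clockwise from the stream's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, self._hash(stream_id))
        return self._points[idx % len(self._points)][1]


# Hypothetical replica names and stream ID.
ring = HashRing(["asr-0", "asr-1", "asr-2"])
print(ring.replica_for("call-42"))
```

Every chunk of a given stream resolves to the same replica, so decoder state stays on one GPU. When a replica is added, only the streams that land on the new replica's ring points move; the rest keep their assignment.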
Use visible math with 70% headroom:
safe_streams = 0.7 * streams_per_gpu
gpu_count = ceil(target_streams / safe_streams)

Use ~40 MB/stream for Whisper-large-v3-turbo.
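
The sizing rule can be checked with a short worked example. This is a sketch of the arithmetic only, for readers of this skill. The target of 1000 concurrent streams and the function name `size_cluster` are hypothetical; 250 streams per GPU is simply the midpoint of the 200-300 per H100 starting range in the table.

```python
import math

HEADROOM_FACTOR = 0.7       # the 0.7 multiplier from the sizing rule
VRAM_PER_STREAM_MB = 40     # Whisper-large-v3-turbo per-stream figure


def size_cluster(target_streams, streams_per_gpu):
    """Return (safe streams per GPU, GPU count, per-GPU stream VRAM in MB)."""
    safe_streams = HEADROOM_FACTOR * streams_per_gpu
    gpu_count = math.ceil(target_streams / safe_streams)
    stream_vram_mb = safe_streams * VRAM_PER_STREAM_MB
    return safe_streams, gpu_count, stream_vram_mb


# Hypothetical inputs: 1000 target streams, 250 streams per GPU.
safe, gpus, vram_mb = size_cluster(target_streams=1000, streams_per_gpu=250)
print(f"safe_streams={safe:.0f} gpu_count={gpus} stream_vram_mb={vram_mb:.0f}")
```

With these inputs the sketch gives 175 safe streams per GPU, 6 GPUs, and about 7000 MB of per-stream state on each GPU, which excludes model weights.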
For Romanian, include WER on at least two of CV-21 RO, FLEURS-RO, RSC, plus CER for morphology/diacritics and a domain-held-out set when feasible. Also include RTFx/RTF on target hardware, TTFT or TTCT, endpoint F1, cutoff rate, streaming-vs-offline WER, and a silence hallucination probe that must produce zero tokens.
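
To see why CER is tracked alongside WER for diacritics, consider a minimal sketch built on plain Levenshtein distance. The helper names, the example sentences, and the choice to count spaces as characters are assumptions; production evaluation would normally use an established scoring tool with an agreed text-normalization policy.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate after NFC normalization; spaces are counted."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    return edit_distance(ref, hyp) / len(ref)


def passes_silence_probe(tokens):
    """The silence probe passes only if the model emitted zero tokens."""
    return len(tokens) == 0


# Hypothetical pair: the hypothesis drops one diacritic.
reference = "țara are mere"
hypothesis = "tara are mere"
print(f"WER={wer(reference, hypothesis):.3f} CER={cer(reference, hypothesis):.3f}")
```

A single dropped diacritic costs one word out of three in WER but only one character out of thirteen in CER, so reporting both separates diacritic-level errors from whole-word misrecognitions.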
For any full blueprint, produce exactly:
# ASR Stack Blueprint - <project name>
## Requirements Recap
## Model Selection
## Streaming Architecture
## Serving Topology
## Hardware Sizing
## Quantization & Runtime
## Layered Processing
## Audio I/O & Protocols
## Evaluation Plan
## Rollout & Risks
## Open Questions

Every model pick must include model name, rationale, benchmark when available, and license. Every sizing recommendation must show streams/GPU or per-stream VRAM, 70% headroom, GPU count, and cost/audio-hour when pricing is available.
No implementation artifacts: Dockerfile, Helm chart, Triton config.pbtxt,
NeMo YAML, Python, or bash. Do not present a survey; make deliberate picks and
name rejected alternatives. Do not skip VAD upstream of Whisper for production.
Do not skip metrics tied to the deployment's audio.
Missing requirements are answered or listed in Open Questions. Model, serving, hardware, runtime, capacity math, and evaluation metrics are coherent and verifiable. Romanian coverage, code switching, Moldovan dialect, licensing, and vendor lock-in risks are surfaced when relevant.