Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.
{
"context": "Tests whether the agent designs a high-concurrency Whisper serving architecture correctly: recommending Triton+TensorRT-LLM over single-process wrappers, specifying Triton's decoupled mode and gRPC requirements, applying 70% headroom to H100 sizing, using replica scaling with stream affinity via consistent hashing, choosing queue-based autoscaling signals (not CPU/memory), and selecting TensorRT-LLM with FP16/FP8 as the runtime.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Triton + TensorRT-LLM recommended",
"description": "Recommends Triton Inference Server combined with TensorRT-LLM as the serving stack for high-concurrency Whisper (not a single-process wrapper like FastAPI or WhisperLive alone)",
"max_score": 10
},
{
"name": "Decoupled streaming mode",
"description": "States that Triton must enable decoupled mode (model_transaction_policy with decoupled: true) for streaming partial transcript output",
"max_score": 8
},
{
"name": "Bidirectional gRPC required",
"description": "States that bidirectional gRPC (ModelStreamInfer) must be used and that plain HTTP is insufficient for decoupled streaming",
"max_score": 7
},
{
"name": "Inflight fused batching",
"description": "Specifies inflight fused batching for TensorRT-LLM Whisper",
"max_score": 7
},
{
"name": "KV cache tuning note",
"description": "Mentions that cross-attention KV cache and self-attention KV cache must be tuned separately",
"max_score": 6
},
{
"name": "Replica scaling with stream affinity",
"description": "Specifies replica scaling (not tensor-parallel across all GPUs by default) and stream affinity so consecutive chunks from the same call hit the same GPU replica",
"max_score": 8
},
{
"name": "Consistent hashing for affinity",
"description": "Names consistent hashing (not plain modulo hashing or round-robin balancing) as the required mechanism for stream affinity, or names one of: ring hash, sticky cookie/stick table keyed by session ID, or session-to-replica map",
"max_score": 7
},
{
"name": "70% headroom GPU sizing",
"description": "Shows explicit sizing math with 70% headroom applied (safe_streams = 0.7 × streams_per_gpu) and derives GPU count using ceil(800 / safe_streams) for H100s",
"max_score": 8
},
{
"name": "H100 per-GPU concurrency anchor",
"description": "References approximately 200-300 streams per H100 as the practical starting concurrency for Whisper-large-v3-turbo streaming workloads",
"max_score": 6
},
{
"name": "TensorRT-LLM runtime with FP16 or FP8",
"description": "Specifies TensorRT-LLM with FP16 or FP8 precision as the runtime for NVIDIA GPU Whisper serving (not faster-whisper or CTranslate2)",
"max_score": 7
},
{
"name": "Queue-based autoscaling metric",
"description": "Names queue duration, queue depth, or GPU utilization as the autoscaling signal — does NOT recommend scaling on CPU utilization or memory",
"max_score": 10
},
{
"name": "No FastAPI as inference engine",
"description": "Does NOT propose hand-rolled FastAPI as the inference engine for high-concurrency GPU serving (may allow it as a gateway)",
"max_score": 6
},
{
"name": "No implementation artifacts",
"description": "Does NOT contain Dockerfile, Helm chart, Triton config.pbtxt, or Python/bash code blocks implementing the stack",
"max_score": 5
},
{
"name": "Evaluation plan present",
"description": "Includes an evaluation section with at least WER measurement, RTFx on the specified hardware, and a streaming latency metric (TTFT or TTCT)",
"max_score": 5
}
]
}
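
The sizing arithmetic that the "70% headroom GPU sizing" and "H100 per-GPU concurrency anchor" items look for can be sketched as follows. This is a minimal illustration, not part of the rubric: the function name `gpus_needed` and its parameters are hypothetical, the 800-stream target and the 0.7 factor come from the checklist text, and the 200 to 300 streams-per-H100 range is the rubric's starting anchor rather than a measured figure.

```python
import math


def gpus_needed(target_streams: int, streams_per_gpu: int, factor: float = 0.7) -> int:
    """Return the GPU count for a target number of concurrent streams.

    Mirrors the rubric's formula:
        safe_streams = factor * streams_per_gpu
        gpus         = ceil(target_streams / safe_streams)
    """
    if target_streams <= 0 or streams_per_gpu <= 0 or not 0 < factor <= 1:
        raise ValueError("inputs must be positive and factor must be in (0, 1]")
    safe_streams = factor * streams_per_gpu
    return math.ceil(target_streams / safe_streams)


if __name__ == "__main__":
    # Both ends of the rubric's 200-300 streams-per-H100 anchor (an assumption
    # to be replaced by a benchmark on the actual hardware and model).
    for per_gpu in (200, 300):
        print(per_gpu, "streams/GPU ->", gpus_needed(800, per_gpu), "H100s")
    # 200 streams/GPU -> 6 H100s
    # 300 streams/GPU -> 4 H100s
```

With these assumptions, a response that shows its work would land between 4 and 6 H100s for 800 streams, depending on which end of the per-GPU anchor it starts from.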