Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.
100
100%
Does it follow best practices?
Impact
100%
2.00xAverage score across 3 eval scenarios
Passed
No known issues
{
"context": "Tests whether the agent produces a correct ASR architecture blueprint for a Romanian real-time contact-center use case: choosing the right model for Romanian streaming, properly handling telephony audio, sizing GPUs with 70% headroom and visible math, including all required blueprint sections, specifying VAD for Whisper, naming the correct evaluation benchmarks for Romanian, and rejecting models excluded for Romanian.",
"type": "weighted_checklist",
"checklist": [
{
"name": "Blueprint sections present",
"description": "The blueprint contains all 11 required top-level sections using the exact names: Requirements Recap, Model Selection, Streaming Architecture, Serving Topology, Hardware Sizing, Quantization & Runtime, Layered Processing, Audio I/O & Protocols, Evaluation Plan, Rollout & Risks, Open Questions",
"max_score": 8
},
{
"name": "Romanian streaming model",
"description": "Recommends Whisper-large-v3-turbo (not a plain Whisper-large-v3 or another model family) as the primary model for Romanian real-time streaming",
"max_score": 8
},
{
"name": "SimulStreaming or AlignAtt named",
"description": "Names SimulStreaming or AlignAtt as the commit policy or streaming strategy for Romanian live transcription",
"max_score": 7
},
{
"name": "Silero VAD included",
"description": "Includes Silero VAD (or explicitly WebRTC VAD) upstream of the Whisper decoder in the pipeline description",
"max_score": 8
},
{
"name": "Romanian alignment heads note",
"description": "Mentions that Whisper alignment heads may be English-tuned and recommends either calibrating language-specific heads or falling back to LocalAgreement-2 for Romanian",
"max_score": 7
},
{
"name": "Chunk size specified",
"description": "Specifies approximately 1 second chunks and a rolling buffer around 30 seconds for Romanian Whisper streaming (or explicitly names these as the starting tuning point)",
"max_score": 6
},
{
"name": "Telephony normalization",
"description": "States the telephony codec (e.g., G.711, G.729) and the 8 kHz to 16 kHz upsampling/normalization step in the Audio I/O section",
"max_score": 7
},
{
"name": "70% headroom sizing math",
"description": "Shows explicit GPU sizing math using 70% headroom: safe_streams = 0.7 × streams_per_gpu, and derives GPU count using ceil(N / safe_streams)",
"max_score": 8
},
{
"name": "Per-stream VRAM stated",
"description": "States per-stream KV cache or VRAM usage for the chosen model (e.g., ~40 MB per stream for Whisper-large-v3-turbo)",
"max_score": 6
},
{
"name": "Romanian eval benchmarks",
"description": "The evaluation plan includes at least two of: CV-21 RO, FLEURS-RO, RSC as named benchmark datasets for Romanian WER measurement",
"max_score": 7
},
{
"name": "Hallucination probe",
"description": "The evaluation plan includes a hallucination probe (silence input should produce zero tokens)",
"max_score": 6
},
{
"name": "Rejected Romanian-excluded models",
"description": "Names at least one of the following as a rejected option with a reason: Voxtral, Riva streaming Parakeet 1.1B RNNT, Parakeet-TDT-0.6B-v2, Nemotron Streaming (specifically citing Romanian language exclusion)",
"max_score": 7
},
{
"name": "Model pick has license",
"description": "The recommended model entry states its license (e.g., MIT for Whisper-large-v3-turbo)",
"max_score": 6
},
{
"name": "No implementation artifacts",
"description": "Does NOT contain any of: Dockerfile, Helm chart, Triton config.pbtxt, NeMo YAML, or actual Python/bash code blocks implementing the stack",
"max_score": 5
},
{
"name": "No CPU/memory autoscaling",
"description": "Autoscaling section (if present) does NOT recommend scaling on CPU utilization or memory; uses queue-based or GPU metrics instead",
"max_score": 4
}
]
}