Name: sharaf/speech-recognition-architect
Rating: 100 (1 review)
Author: sharaf

sharaf/speech-recognition-architect

Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.

Quality: 100% (Does it follow best practices?)
Impact: 100%, 2.00x (average score across 3 eval scenarios)
Security: Passed, no known issues

{
  "context": "Tests whether the agent produces a correct ASR architecture blueprint for a Romanian real-time contact-center use case: choosing the right model for Romanian streaming, properly handling telephony audio, sizing GPUs with 70% headroom and visible math, including all required blueprint sections, specifying VAD for Whisper, naming the correct evaluation benchmarks for Romanian, and rejecting models excluded for Romanian.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Blueprint sections present",
      "description": "The blueprint contains all 11 required top-level sections using the exact names: Requirements Recap, Model Selection, Streaming Architecture, Serving Topology, Hardware Sizing, Quantization & Runtime, Layered Processing, Audio I/O & Protocols, Evaluation Plan, Rollout & Risks, Open Questions",
      "max_score": 8
    },
    {
      "name": "Romanian streaming model",
      "description": "Recommends Whisper-large-v3-turbo (not a plain Whisper-large-v3 or another model family) as the primary model for Romanian real-time streaming",
      "max_score": 8
    },
    {
      "name": "SimulStreaming or AlignAtt named",
      "description": "Names SimulStreaming or AlignAtt as the commit policy or streaming strategy for Romanian live transcription",
      "max_score": 7
    },
    {
      "name": "Silero VAD included",
      "description": "Includes Silero VAD (or explicitly WebRTC VAD) upstream of the Whisper decoder in the pipeline description",
      "max_score": 8
    },
    {
      "name": "Romanian alignment heads note",
      "description": "Mentions that Whisper alignment heads may be English-tuned and recommends either calibrating language-specific heads or falling back to LocalAgreement-2 for Romanian",
      "max_score": 7
    },
    {
      "name": "Chunk size specified",
      "description": "Specifies approximately 1 second chunks and a rolling buffer around 30 seconds for Romanian Whisper streaming (or explicitly names these as the starting tuning point)",
      "max_score": 6
    },
    {
      "name": "Telephony normalization",
      "description": "States the telephony codec (e.g., G.711, G.729) and the 8 kHz to 16 kHz upsampling/normalization step in the Audio I/O section",
      "max_score": 7
    },
    {
      "name": "70% headroom sizing math",
      "description": "Shows explicit GPU sizing math using 70% headroom: safe_streams = 0.7 × streams_per_gpu, and derives GPU count using ceil(N / safe_streams)",
      "max_score": 8
    },
    {
      "name": "Per-stream VRAM stated",
      "description": "States per-stream KV cache or VRAM usage for the chosen model (e.g., ~40 MB per stream for Whisper-large-v3-turbo)",
      "max_score": 6
    },
    {
      "name": "Romanian eval benchmarks",
      "description": "The evaluation plan includes at least two of: CV-21 RO, FLEURS-RO, RSC as named benchmark datasets for Romanian WER measurement",
      "max_score": 7
    },
    {
      "name": "Hallucination probe",
      "description": "The evaluation plan includes a hallucination probe (silence input should produce zero tokens)",
      "max_score": 6
    },
    {
      "name": "Rejected Romanian-excluded models",
      "description": "Names at least one of the following as a rejected option with a reason: Voxtral, Riva streaming Parakeet 1.1B RNNT, Parakeet-TDT-0.6B-v2, Nemotron Streaming (specifically citing Romanian language exclusion)",
      "max_score": 7
    },
    {
      "name": "Model pick has license",
      "description": "The recommended model entry states its license (e.g., MIT for Whisper-large-v3-turbo)",
      "max_score": 6
    },
    {
      "name": "No implementation artifacts",
      "description": "Does NOT contain any of: Dockerfile, Helm chart, Triton config.pbtxt, NeMo YAML, or actual Python/bash code blocks implementing the stack",
      "max_score": 5
    },
    {
      "name": "No CPU/memory autoscaling",
      "description": "Autoscaling section (if present) does NOT recommend scaling on CPU utilization or memory; uses queue-based or GPU metrics instead",
      "max_score": 4
    }
  ]
}
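
The "70% headroom sizing math" criterion above names two formulas, safe_streams = 0.7 × streams_per_gpu and GPU count = ceil(N / safe_streams). A minimal Python sketch of that arithmetic follows. The function name `gpus_needed` and the example stream counts are hypothetical, chosen only for illustration; the criteria file does not state a measured streams-per-GPU figure, so that input has to come from a benchmark of the chosen model and hardware.

```python
import math


def gpus_needed(concurrent_streams: int, streams_per_gpu: int, headroom: float = 0.7) -> int:
    """Return the GPU count for N concurrent streams using the checklist formulas.

    safe_streams = headroom * streams_per_gpu
    gpu_count    = ceil(N / safe_streams)

    streams_per_gpu is assumed to be a measured per-GPU capacity; it is an
    input here, not a value taken from the criteria file.
    """
    if concurrent_streams < 0:
        raise ValueError("concurrent_streams must be non-negative")
    if streams_per_gpu <= 0:
        raise ValueError("streams_per_gpu must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")

    # Only plan to use a fraction of the measured capacity on each GPU.
    safe_streams = headroom * streams_per_gpu
    return math.ceil(concurrent_streams / safe_streams)


if __name__ == "__main__":
    # Hypothetical inputs: 500 concurrent calls, 40 measured streams per GPU.
    # safe_streams = 0.7 * 40 = 28, and ceil(500 / 28) = 18.
    print(gpus_needed(500, 40))  # 18
```

Showing the intermediate safe_streams value alongside the final count is what the criterion means by visible math: a reviewer can check both steps rather than only the result.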


File: evals/scenario-1/criteria.json