CtrlK
BlogDocsLog inGet started
Tessl Logo

sharaf/speech-recognition-architect

Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.

100

2.00x
Quality

100%

Does it follow best practices?

Impact

100%

2.00x

Average score across 3 eval scenarios

SecuritybySnyk

Passed

No known issues

Overview
Quality
Evals
Security
Files

SKILL.mdskills/speech-recognition-architect/

name:
speech-recognition-architect
description:
Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.
metadata:
{"version":"0.1.21","source_domain":"local-multilingual-streaming-asr","source_sub_domains":"model-landscape-streaming-asr, romanian-asr-models-and-data, romanian-english-bilingual-coverage, streaming-architectures-and-chunked-attention, serving-infra-triton-riva-nim, multi-gpu-scaling-and-batching, hardware-sizing-and-cost, reference-architectures-oss, language-id-and-routing, vad-and-endpointing, audio-io-and-realtime-protocols, quantization-and-runtime-optimization, diarization-and-speaker-attribution, post-processing-and-formatting, benchmarks-and-evaluation-methodology","research_date":"2026-05-26"}

Speech Recognition Architect

Purpose

Create local, on-prem, or self-hosted ASR architecture blueprints. Return a design document, not implementation code.

Workflow

  1. Gate on languages, concurrency, latency, audio source, hardware/budget, translation, diarization, hotwords, license, and compliance constraints.
  2. Choose native streaming, windowed offline-model wrapper, or batch/offline.
  3. Pick model/runtime from language coverage, latency, concurrency, license, and hardware; show GPU math with 70% headroom.
  4. Add only needed layers: VAD, LID, diarization, punctuation, ITN, hotwords, telephony, WebRTC.
  5. Name rejected alternatives with reasons.
  6. Produce the exact Blueprint Output Contract with rejected alternatives and eval gates.

Decision Anchors

Use only matching rows; combine rows when requirements overlap.

NeedRequired picks and rejections
Romanian live/PSTNWhisper-large-v3-turbo, MIT, SimulStreaming AlignAtt, Silero VAD before Whisper, ~1s chunks, ~30s buffer. Whisper alignment heads may be English-tuned; calibrate Romanian heads or use LocalAgreement-2. Name G.711/G.729 and normalize 8 kHz to 16 kHz. Reject Voxtral, Voxtral Realtime, Riva streaming Parakeet 1.1B RNNT, Parakeet-TDT-0.6B-v2, Nemotron Streaming for Romanian gaps.
High-concurrency WhisperUse Triton Inference Server + TensorRT-LLM Whisper, not single-process FastAPI/faster-whisper. Require model_transaction_policy { decoupled: true }, bidirectional gRPC ModelStreamInfer; HTTP is insufficient. Use FP16, FP8 on H100 only after WER validation, inflight_fused_batching, separate self/cross-attention KV tuning, replica scaling with stream affinity via consistent/ring hash, sticky cookie/stick table, or session-to-replica map. Autoscale on queue duration/depth or GPU utilization, never CPU/memory. Start at 200-300 streams per H100.
RO+EN translationNever force language="ro" on mixed audio. Add per-segment LID after VAD/diarization and before ASR/translation. Route confident ro to Romanian, en to English, mixed/low-confidence to dual-token concat prompt. Use Canary-1B-v2 with target_lang="en" and CC-BY-4.0 commercial-use-with-attribution. Reject Whisper-large-v3-turbo for translation because decoder pruning removed it; reject MMS and SeamlessM4T-v2 because CC-BY-NC. Treat Moldovan as a research gap requiring a representative evaluation set.

Sizing Anchors

Use visible math with 70% headroom:

safe_streams = 0.7 * streams_per_gpu
gpu_count = ceil(target_streams / safe_streams)

Use ~40 MB/stream for Whisper-large-v3-turbo.

Evaluation Anchors

For Romanian, include WER on at least two of CV-21 RO, FLEURS-RO, RSC, plus CER for morphology/diacritics and a domain-held-out set when feasible. Also include RTFx/RTF on target hardware, TTFT or TTCT, endpoint F1, cutoff rate, streaming-vs-offline WER, and a silence hallucination probe that must produce zero tokens.

Blueprint Output Contract

For any full blueprint, produce exactly:

# ASR Stack Blueprint - <project name>

## Requirements Recap
## Model Selection
## Streaming Architecture
## Serving Topology
## Hardware Sizing
## Quantization & Runtime
## Layered Processing
## Audio I/O & Protocols
## Evaluation Plan
## Rollout & Risks
## Open Questions

Every model pick must include model name, rationale, benchmark when available, and license. Every sizing recommendation must show streams/GPU or per-stream VRAM, 70% headroom, GPU count, and cost/audio-hour when pricing is available.

Guardrails

No implementation artifacts: Dockerfile, Helm chart, Triton config.pbtxt, NeMo YAML, Python, or bash. Do not present a survey; make deliberate picks and name rejected alternatives. Do not skip VAD upstream of Whisper for production. Do not skip metrics tied to the deployment's audio.

Done When

Missing requirements are answered or listed in Open Questions. Model, serving, hardware, runtime, capacity math, and evaluation metrics are coherent and verifiable. Romanian coverage, code switching, Moldovan dialect, licensing, and vendor lock-in risks are surfaced when relevant.

skills

speech-recognition-architect

README.md

tile.json