CtrlK
BlogDocsLog inGet started
Tessl Logo

sharaf/speech-recognition-architect

Use when the user wants to design, size, audit, or choose a self-hosted speech recognition or streaming ASR stack, including Whisper, Parakeet, Canary, Riva, NIM, Triton ASR, faster-whisper, sherpa-onnx, voice-agent transcription, Romanian or Moldovan ASR, contact-center transcription, GPU sizing, latency budgets, multilingual routing, VAD, diarization, or production evaluation.

100

2.00x
Quality

100%

Does it follow best practices?

Impact

100%

2.00x

Average score across 3 eval scenarios

SecuritybySnyk

Passed

No known issues

Overview
Quality
Evals
Security
Files

criteria.jsonevals/scenario-2/

{
  "context": "Tests whether the agent designs a high-concurrency Whisper serving architecture correctly: recommending Triton+TensorRT-LLM over single-process wrappers, specifying Triton's decoupled mode and gRPC requirements, applying 70% headroom to H100 sizing, using replica scaling with stream affinity via consistent hashing, choosing queue-based autoscaling signals (not CPU/memory), and selecting TensorRT-LLM with FP16/FP8 as the runtime.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "Triton + TensorRT-LLM recommended",
      "description": "Recommends Triton Inference Server combined with TensorRT-LLM as the serving stack for high-concurrency Whisper (not a single-process wrapper like FastAPI or WhisperLive alone)",
      "max_score": 10
    },
    {
      "name": "Decoupled streaming mode",
      "description": "States that Triton must enable decoupled mode (model_transaction_policy with decoupled: true) for streaming partial transcript output",
      "max_score": 8
    },
    {
      "name": "Bidirectional gRPC required",
      "description": "States that bidirectional gRPC (ModelStreamInfer) must be used and that plain HTTP is insufficient for decoupled streaming",
      "max_score": 7
    },
    {
      "name": "Inflight fused batching",
      "description": "Specifies inflight fused batching for TensorRT-LLM Whisper",
      "max_score": 7
    },
    {
      "name": "KV cache tuning note",
      "description": "Mentions that cross-attention KV cache and self-attention KV cache must be tuned separately",
      "max_score": 6
    },
    {
      "name": "Replica scaling with stream affinity",
      "description": "Specifies replica scaling (not tensor-parallel across all GPUs by default) and stream affinity so consecutive chunks from the same call hit the same GPU replica",
      "max_score": 8
    },
    {
      "name": "Consistent hashing for affinity",
      "description": "Names consistent hashing (not standard/round-robin hashing) as the required mechanism for stream affinity, or names one of: ring hash, sticky cookie/stick table keyed by session ID, or session-to-replica map",
      "max_score": 7
    },
    {
      "name": "70% headroom GPU sizing",
      "description": "Shows explicit sizing math with 70% headroom applied (safe_streams = 0.7 × streams_per_gpu) and derives GPU count using ceil(800 / safe_streams) for H100s",
      "max_score": 8
    },
    {
      "name": "H100 per-GPU concurrency anchor",
      "description": "References approximately 200-300 streams per H100 as the practical starting concurrency for Whisper-large-v3-turbo streaming workloads",
      "max_score": 6
    },
    {
      "name": "TensorRT-LLM runtime with FP16 or FP8",
      "description": "Specifies TensorRT-LLM with FP16 or FP8 precision as the runtime for NVIDIA GPU Whisper serving (not faster-whisper or CTranslate2)",
      "max_score": 7
    },
    {
      "name": "Queue-based autoscaling metric",
      "description": "Names queue duration, queue depth, or GPU utilization as the autoscaling signal — does NOT recommend scaling on CPU utilization or memory",
      "max_score": 10
    },
    {
      "name": "No FastAPI as inference engine",
      "description": "Does NOT propose hand-rolled FastAPI as the inference engine for high-concurrency GPU serving (may allow it as a gateway)",
      "max_score": 6
    },
    {
      "name": "No implementation artifacts",
      "description": "Does NOT contain Dockerfile, Helm chart, Triton config.pbtxt, or Python/bash code blocks implementing the stack",
      "max_score": 5
    },
    {
      "name": "Evaluation plan present",
      "description": "Includes an evaluation section with at least WER measurement, RTFx on the specified hardware, and a streaming latency metric (TTFT or TTCT)",
      "max_score": 5
    }
  ]
}

evals

README.md

tile.json