deepgram-python-speech-to-text

Use when writing or reviewing Python code in this repo that calls Deepgram Speech-to-Text v1 (`/v1/listen`) for prerecorded or live audio transcription. Covers `client.listen.v1.media.transcribe_url` / `transcribe_file` (REST) and `client.listen.v1.connect` (WebSocket). Use this skill for basic ASR; use `deepgram-python-audio-intelligence` for summarize/sentiment/topics/diarize overlays, `deepgram-python-conversational-stt` for turn-taking v2/Flux, and `deepgram-python-voice-agent` for full-duplex assistants. Triggers include "transcribe", "live transcription", "speech to text", "STT", "listen endpoint", "nova-3", "listen.v1".

Using Deepgram Speech-to-Text (Python SDK)

Basic transcription (ASR) for prerecorded audio (REST) or live audio (WebSocket) via /v1/listen.

When to use this product

  • REST (transcribe_url / transcribe_file) — one-shot transcription of a complete file or URL. Use for batch jobs, captioning pipelines, offline analysis.
  • WebSocket (listen.v1.connect) — continuous streaming transcription. Use for live captions, real-time microphone input, phone audio.

Use a different skill when:

  • You want summaries, sentiment, topics, intents, diarization, or redaction on the audio → deepgram-python-audio-intelligence (same endpoint, different params).
  • You need turn-taking / end-of-turn events → deepgram-python-conversational-stt (v2 / Flux).
  • You need a full-duplex interactive assistant (STT + LLM + TTS + function calls) → deepgram-python-voice-agent.

Authentication

import os
from dotenv import load_dotenv
load_dotenv()

from deepgram import DeepgramClient

client = DeepgramClient()  # reads DEEPGRAM_API_KEY from env
# or: DeepgramClient(api_key=os.environ["DEEPGRAM_API_KEY"])

Header sent with API-key auth: Authorization: Token <api_key> (NOT Bearer; see Gotchas for access-token auth).
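
If you hold a short-lived access token rather than an API key, pass access_token= so the client sends the Bearer scheme instead. A minimal sketch; the grant() call is described in Gotchas below, and the access_token field name on its response is an assumption:

token = client.auth.v1.tokens.grant()                     # mint a short-lived token (requires API-key auth)
scoped = DeepgramClient(access_token=token.access_token)  # sends Authorization: Bearer <token>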

Quick start — REST (prerecorded URL)

response = client.listen.v1.media.transcribe_url(
    url="https://dpgr.am/spacewalk.wav",
    model="nova-3",
    smart_format=True,
    punctuate=True,
)
transcript = response.results.channels[0].alternatives[0].transcript

Quick start — REST (prerecorded file)

with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

response = client.listen.v1.media.transcribe_file(
    request=audio_bytes,
    model="nova-3",
)

request= accepts raw bytes or an iterator of bytes (stream large files chunk-by-chunk). Do NOT pass a file handle.
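
For large files, an iterator keeps memory flat. A minimal sketch of the chunked pattern; the 1 MiB chunk size is illustrative:

def read_chunks(path, chunk_size=1024 * 1024):
    # yield the file in 1 MiB pieces so it never sits fully in memory
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

response = client.listen.v1.media.transcribe_file(
    request=read_chunks("audio.wav"),  # iterator of bytes, per the note above
    model="nova-3",
)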

Quick start — WebSocket (live streaming with interim results)

Live transcription emits interim (partial) and final results. Pass interim_results=True and switch on is_final to display partial text in real time, then overwrite it with the final transcript when the speaker pauses.

import threading
from deepgram.core.events import EventType
from deepgram.listen.v1.types import (
    ListenV1Results, ListenV1Metadata,
    ListenV1SpeechStarted, ListenV1UtteranceEnd,
)

with client.listen.v1.connect(
    model="nova-3",
    interim_results=True,    # ← emit partial results while user is still speaking
    utterance_end_ms=1000,   # silence (ms) before server emits UtteranceEnd
    vad_events=True,         # SpeechStarted events
    smart_format=True,
) as conn:
    # Mutable container so the on_message closure can update state without `global`
    state = {"last_interim_len": 0}

    def on_message(m):
        if isinstance(m, ListenV1Results) and m.channel and m.channel.alternatives:
            transcript = m.channel.alternatives[0].transcript
            if not transcript:
                return
            if m.is_final:
                # Final segment: overwrite the running interim line, newline if utterance ended
                pad = " " * max(0, state["last_interim_len"] - len(transcript))
                end = "\n" if m.speech_final else ""
                print(f"\r{transcript}{pad}", end=end, flush=True)
                state["last_interim_len"] = 0
            else:
                # Interim: keep overwriting the same console line as the user speaks
                print(f"\r{transcript}", end="", flush=True)
                state["last_interim_len"] = len(transcript)
        elif isinstance(m, ListenV1UtteranceEnd):
            print()  # newline; UtteranceEnd fires after final results when audio goes silent
        elif isinstance(m, ListenV1SpeechStarted):
            pass  # optional: reset UI when a new utterance begins

    conn.on(EventType.OPEN,    lambda _: print("connected"))
    conn.on(EventType.MESSAGE, on_message)
    conn.on(EventType.CLOSE,   lambda _: print("\nclosed"))
    conn.on(EventType.ERROR,   lambda e: print(f"\nerr: {e}"))

    # Start receive loop in background so we can send concurrently
    threading.Thread(target=conn.start_listening, daemon=True).start()

    for chunk in audio_chunks:         # audio_chunks: your audio source; raw PCM matching the stream's declared encoding/sample_rate
        conn.send_media(chunk)

    conn.send_finalize()               # flush final partial before closing

Interim vs. final flag semantics

  • is_final = False — interim hypothesis. Will be revised. Display in a non-committal style (lighter colour, italic) and overwrite when the next message arrives.
  • is_final = True, speech_final = False — confirmed segment, but the speaker is still talking. Append to the transcript; another final will follow.
  • is_final = True, speech_final = True — confirmed segment AND the utterance ended (silence detected). Commit the line and start a new one.
  • from_finalize = True — this final was triggered by your explicit send_finalize() call (vs natural endpointing). Useful to distinguish "I asked for a flush" from "the speaker paused".

Call send_finalize() to force the server to emit final results immediately (e.g. when the user clicks "stop"). Call send_close_stream() after send_finalize() to terminate cleanly.
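
A typical shutdown sequence using those two calls:

conn.send_finalize()      # server flushes buffered audio as a final (from_finalize=True)
conn.send_close_stream()  # then signal end-of-audio so the connection closes cleanly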

WSS message types live under deepgram.listen.v1.types.

Async equivalents

from deepgram import AsyncDeepgramClient
client = AsyncDeepgramClient()

response = await client.listen.v1.media.transcribe_url(url=..., model="nova-3")

async with client.listen.v1.connect(model="nova-3") as conn:
    # same .on(...) handlers, then:
    await conn.start_listening()
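
A fuller async sketch combining the pieces above (and gotcha 8 below): run the receive loop as a task so sends proceed concurrently. Assumes the async connection mirrors the sync send_media / send_finalize / send_close_stream methods as coroutines:

import asyncio
from deepgram import AsyncDeepgramClient
from deepgram.core.events import EventType

async def stream(audio_chunks):
    client = AsyncDeepgramClient()
    async with client.listen.v1.connect(model="nova-3", interim_results=True) as conn:
        conn.on(EventType.MESSAGE, lambda m: print(m))
        listener = asyncio.create_task(conn.start_listening())  # receive loop as a task
        for chunk in audio_chunks:
            await conn.send_media(chunk)
        await conn.send_finalize()       # flush the last partial
        await conn.send_close_stream()   # tell the server we're done
        await listener                   # returns once the server closes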

Async / deferred result patterns

There are two distinct notions of "async" — don't confuse them.

1. Python async/await (sync-style, immediate result)

AsyncDeepgramClient returns Awaitable[<full response>]. The result is delivered when you await, not later. Use this when integrating with FastAPI, aiohttp, or any asyncio app.

import asyncio
from deepgram import AsyncDeepgramClient

client = AsyncDeepgramClient()

async def transcribe(url: str) -> str:
    response = await client.listen.v1.media.transcribe_url(
        url=url,
        model="nova-3",
        smart_format=True,
    )
    # `response` is the FULL transcription — no polling, no callback, just await.
    return response.results.channels[0].alternatives[0].transcript

text = asyncio.run(transcribe("https://dpgr.am/spacewalk.wav"))

2. Deferred via callback URL (webhook, results posted later)

Pass callback="https://your.app/webhook" and the request returns immediately with a request_id. Deepgram processes the audio in the background and POSTs the final result to your webhook URL. There is no polling endpoint — your server must be reachable to receive the result.

response = client.listen.v1.media.transcribe_url(
    url="https://dpgr.am/spacewalk.wav",
    callback="https://your.app/deepgram-webhook",
    callback_method="POST",         # or "PUT"
    model="nova-3",
    smart_format=True,
)
print(f"Accepted; tracking id: {response.request_id}")
# response is a "listen accepted" — NOT the transcript. Wait for your webhook.

The webhook receives the same JSON body you would have received from a synchronous transcribe_url call. Use this for very long files or when you don't want the request hanging open.
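
A minimal receiver sketch (FastAPI here; the route matches the callback URL above, and the body shape is the standard transcription response). Matching on metadata.request_id is an assumption about how you track pending jobs:

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/deepgram-webhook")
async def deepgram_webhook(req: Request):
    body = await req.json()  # same JSON as a synchronous transcribe_url response
    transcript = body["results"]["channels"][0]["alternatives"][0]["transcript"]
    request_id = body.get("metadata", {}).get("request_id")
    # ... persist the transcript, keyed by request_id ...
    return {"ok": True}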

Pattern, what it returns, and when to use it:

  • client.listen.v1.media.transcribe_url(...) — full transcription synchronously. Use for files up to ~10 min; HTTP timeout-bound.
  • await AsyncDeepgramClient().listen.v1.media.transcribe_url(...) — full transcription, non-blocking. Use inside asyncio apps.
  • transcribe_url(..., callback="https://...") — {request_id} immediately; the transcription POSTs to your webhook later. Use for very long files, or to avoid a long-lived HTTP connection.
  • client.listen.v1.connect(...) (WebSocket) — streaming events as audio is sent. Use for live audio (mic, telephony).

See examples/12-transcription-prerecorded-callback.py for a working callback example.

Key parameters

model, language, encoding, sample_rate, channels, multichannel, punctuate, smart_format, diarize, endpointing, interim_results, utterance_end_ms, vad_events, keywords, search, redact, numerals, paragraphs, utterances.
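
For raw streaming audio the format parameters matter most. A sketch using parameter names from the list above; the 16 kHz mono linear16 values are illustrative:

with client.listen.v1.connect(
    model="nova-3",
    encoding="linear16",  # must match the bytes you actually send (see Gotchas)
    sample_rate=16000,
    channels=1,
) as conn:
    ...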

API reference (layered)

  1. In-repo Fern-generated reference: reference.md — sections "Listen V1 Media" (REST) and "Listen V1 Connect" (WSS).
  2. Canonical OpenAPI (REST): https://developers.deepgram.com/openapi.yaml
  3. Canonical AsyncAPI (WSS): https://developers.deepgram.com/asyncapi.yaml
  4. Context7 — natural-language queries over the full Deepgram docs corpus. Library ID: /llmstxt/developers_deepgram_llms_txt.
  5. Product docs:

Gotchas

  1. Use the right auth scheme for the credential type. API keys use Authorization: Token <api_key>. Temporary / access tokens (from client.auth.v1.tokens.grant() or an equivalent server) use Authorization: Bearer <access_token> — the custom DeepgramClient installs a Bearer override when you pass access_token=... (see src/deepgram/client.py). Sending Bearer <api_key> with a long-lived API key is what fails.
  2. Encoding must match the audio. Declaring encoding="linear16" but sending Opus → garbage output or 400.
  3. Close streams cleanly. Call send_finalize() before exiting the WSS context — otherwise the last partial is dropped.
  4. Keepalive on long WSS sessions. If the connection is idle for more than ~10 s, the server closes it. Send KeepAlive messages or audio chunks (see the sketch after this list).
  5. Intelligence features are REST-only. summarize, topics, intents, sentiment, detect_language do NOT work over WSS — see deepgram-python-audio-intelligence.
  6. transcribe_file(request=...) takes bytes or an iterator, not a file handle.
  7. nova-3 is the current flagship STT model. Check client.manage.v1.models.list() for the live set.
  8. Sync connection.start_listening() blocks. Run it in a thread (sync) or as a task (async) so you can send audio concurrently.
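
A keepalive sketch for gotcha 4. The wire-level control message is {"type": "KeepAlive"}; the send_keep_alive() helper name here is an assumption, so check your SDK version for the exact method:

import threading

def keepalive_loop(conn, stop: threading.Event, interval: float = 5.0):
    # ping the server while no audio is flowing so the session stays open
    while not stop.wait(interval):
        conn.send_keep_alive()  # ASSUMED helper name; wire message is {"type": "KeepAlive"}

stop = threading.Event()
threading.Thread(target=keepalive_loop, args=(conn, stop), daemon=True).start()
# ... later, before closing the stream: stop.set()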

Example files in this repo

  • examples/10-transcription-prerecorded-url.py
  • examples/11-transcription-prerecorded-file.py
  • examples/12-transcription-prerecorded-callback.py
  • examples/13-transcription-live-websocket.py
  • tests/wire/test_listen_v1_media.py — wire-level fixtures
  • tests/manual/listen/v1/connect/main.py — live WSS connection test

Central product skills

For cross-language Deepgram product knowledge — the consolidated API reference, documentation finder, focused runnable recipes, third-party integration examples, and MCP setup — install the central skills:

npx skills add deepgram/skills

This SDK ships language-idiomatic code skills; deepgram/skills ships cross-language product knowledge (see api, docs, recipes, examples, starters, setup-mcp).
