# Audio

Convert audio to text (transcription and translation) and text to speech using Whisper, GPT-4o audio, and TTS models. Supports multiple audio formats and languages.

## Capabilities

### Transcription

Convert audio to text in the original language using Whisper or a GPT-4o transcription model.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: str | AudioModel,
    chunking_strategy: dict | str | Omit = omit,
    include: list[str] | Omit = omit,
    known_speaker_names: list[str] | Omit = omit,
    known_speaker_references: list[str] | Omit = omit,
    language: str | Omit = omit,
    prompt: str | Omit = omit,
    response_format: Literal["json", "text", "srt", "verbose_json", "vtt", "diarized_json"] | Omit = omit,
    stream: bool | Omit = omit,
    temperature: float | Omit = omit,
    timestamp_granularities: list[Literal["word", "segment"]] | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Transcription | TranscriptionVerbose | str:
    """
    Transcribe audio to text in the original language.

    Args:
        file: Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg,
            mpga, m4a, ogg, wav, webm. Max file size: 25 MB.
            Can be a file path string, file object, or tuple.

        model: Model ID. Options:
            - "gpt-4o-transcribe": Advanced transcription with streaming support
            - "gpt-4o-mini-transcribe": Faster, cost-effective transcription
            - "gpt-4o-transcribe-diarize": Speaker diarization model
            - "whisper-1": Powered by the open source Whisper V2 model

        chunking_strategy: Controls how audio is cut into chunks. Options:
            - "auto": Server normalizes loudness and uses voice activity detection (VAD)
            - {"type": "server_vad", ...}: Manually configure VAD parameters
            - If unset: Audio is transcribed as a single block
            - Required for gpt-4o-transcribe-diarize with inputs longer than 30 seconds

        include: Additional information to include. Options:
            - "logprobs": Returns log probabilities for confidence analysis
            - Only works with response_format="json"
            - Only supported for gpt-4o-transcribe and gpt-4o-mini-transcribe
            - Not supported with gpt-4o-transcribe-diarize

        known_speaker_names: List of speaker names for diarization (e.g., ["customer", "agent"]).
            Corresponds to audio samples in known_speaker_references. Up to 4 speakers.
            Used with the gpt-4o-transcribe-diarize model.

        known_speaker_references: List of audio samples (as data URLs) containing known speaker
            references. Each sample must be 2-10 seconds long. Matches known_speaker_names.
            Used with the gpt-4o-transcribe-diarize model.

        language: Language of the audio in ISO-639-1 format (e.g., "en", "fr", "de").
            Providing the language improves accuracy and latency.

        prompt: Optional text to guide the model's style or continue a previous segment.
            Should match the audio language.

        response_format: Output format. Options:
            - "json": JSON with text (default)
            - "text": Plain text only
            - "srt": SubRip subtitle format
            - "verbose_json": JSON with segments, timestamps, confidence
            - "vtt": WebVTT subtitle format
            - "diarized_json": JSON with speaker annotations (for gpt-4o-transcribe-diarize)
            Note: gpt-4o-transcribe/mini only support "json". gpt-4o-transcribe-diarize
            supports "json", "text", and "diarized_json" (required for speaker annotations).

        stream: If true, the model response is streamed using server-sent events.
            Returns Stream[TranscriptionStreamEvent]. Not supported for whisper-1.

        temperature: Sampling temperature between 0 and 1. Higher values increase
            randomness. Default is 0.

        timestamp_granularities: Timestamp precision options. Requires
            response_format="verbose_json".
            - ["segment"]: Segment-level timestamps (default)
            - ["word"]: Word-level timestamps
            - ["segment", "word"]: Both levels

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        Transcription: Basic response with text (for json format)
        TranscriptionVerbose: Detailed response with segments and timestamps
        str: Plain text string (for text, srt, vtt formats)

    Raises:
        BadRequestError: Invalid file format or size
        AuthenticationError: Invalid API key
    """
```
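For manual chunking, `chunking_strategy` takes a `server_vad` object instead of `"auto"`. A minimal sketch of such a configuration — the field names `prefix_padding_ms`, `silence_duration_ms`, and `threshold` are assumptions borrowed from the server-side VAD settings used elsewhere in the API, so verify them against the current reference before relying on them:

```python
# Hypothetical server_vad configuration for chunking_strategy.
# Field names are assumptions; check the current API reference.
vad_chunking = {
    "type": "server_vad",
    "prefix_padding_ms": 300,    # audio kept before detected speech
    "silence_duration_ms": 500,  # silence needed to close a chunk
    "threshold": 0.5,            # VAD sensitivity, 0.0-1.0
}

# Passed like any other parameter value:
# client.audio.transcriptions.create(
#     model="gpt-4o-transcribe",
#     file=audio_file,
#     chunking_strategy=vad_chunking,
# )
```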

Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Basic transcription
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
print(transcript.text)

# With language hint for better accuracy
with open("french_audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="fr"
    )

# Verbose JSON with detailed information
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Duration: {transcript.duration}")
print(f"Language: {transcript.language}")

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

for word in transcript.words:
    print(f"{word.word} ({word.start:.2f}s)")

# SRT subtitle format (returned as a plain string)
with open("video_audio.mp3", "rb") as audio_file:
    srt = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )
# Save to file
with open("subtitles.srt", "w") as f:
    f.write(srt)

# With prompt for context/style
with open("continuation.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Previous text for context..."
    )

# Using the file_from_path helper
from openai import file_from_path

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=file_from_path("audio.mp3")
)

# Advanced: Using gpt-4o-transcribe with streaming
with open("audio.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)

# Advanced: Speaker diarization with gpt-4o-transcribe-diarize
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        chunking_strategy="auto"
    )
for segment in transcript.segments:
    print(f"[{segment.speaker}]: {segment.text}")

# Advanced: With known speaker references
with open("call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        known_speaker_names=["customer", "agent"],
        known_speaker_references=[
            "data:audio/mp3;base64,...",  # Customer voice sample
            "data:audio/mp3;base64,..."   # Agent voice sample
        ]
    )

# Advanced: Using include for confidence scores
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="json",
        include=["logprobs"]
    )
# Access logprobs for confidence analysis
```
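The speaker-reference samples above are passed as base64 data URLs. A small helper for building one from a local file — `to_data_url` is a name invented here for illustration, not part of the SDK:

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime: str = "audio/mp3") -> str:
    """Encode a short audio sample (2-10 s) as a data URL for known_speaker_references."""
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

With this helper, the references become `known_speaker_references=[to_data_url("customer.mp3"), to_data_url("agent.mp3")]`, matching the order of `known_speaker_names`.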

### Translation

Translate audio to English text using the Whisper model.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: str | AudioModel,
    prompt: str | Omit = omit,
    response_format: Literal["json", "text", "srt", "verbose_json", "vtt"] | Omit = omit,
    temperature: float | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Translation | TranslationVerbose | str:
    """
    Translate audio to English text.

    Args:
        file: Audio file to translate. Supported formats: flac, mp3, mp4, mpeg,
            mpga, m4a, ogg, wav, webm. Max file size: 25 MB.

        model: Model ID. Currently only "whisper-1" is available.

        prompt: Optional text to guide the model's style. Should be in English.

        response_format: Output format. Options:
            - "json": JSON with text (default)
            - "text": Plain text only
            - "srt": SubRip subtitle format
            - "verbose_json": JSON with segments and details
            - "vtt": WebVTT subtitle format

        temperature: Sampling temperature between 0 and 1.

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        Translation: Basic response with English text (for json format)
        TranslationVerbose: Detailed response with segments (for verbose_json format)
        str: Plain text string (for text, srt, vtt formats)

    Raises:
        BadRequestError: Invalid file format or size
    """
```

Usage example:

```python
from openai import OpenAI

client = OpenAI()

# Translate French audio to English
with open("french_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )
print(translation.text)

# Verbose format with segments
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

for segment in translation.segments:
    print(f"[{segment.start:.2f}s]: {segment.text}")
```

### Text-to-Speech

Convert text to spoken audio using TTS models.

```python { .api }
def create(
    self,
    *,
    input: str,
    model: str | SpeechModel,
    voice: Literal["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"],
    instructions: str | Omit = omit,
    response_format: Literal["mp3", "opus", "aac", "flac", "wav", "pcm"] | Omit = omit,
    speed: float | Omit = omit,
    stream_format: Literal["sse", "audio"] | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> HttpxBinaryResponseContent:
    """
    Convert text to spoken audio.

    Args:
        input: Text to convert to audio. Max length: 4096 characters.

        model: TTS model to use. Options:
            - "tts-1": Standard quality, faster, lower cost
            - "tts-1-hd": High definition quality, slower, higher cost
            - "gpt-4o-mini-tts": Advanced model with instruction support

        voice: Voice to use for generation. Options:
            - "alloy": Neutral, balanced
            - "ash": Clear and articulate
            - "ballad": Warm and expressive
            - "coral": Bright and engaging
            - "echo": Calm and measured
            - "sage": Wise and authoritative
            - "shimmer": Soft and gentle
            - "verse": Dynamic and versatile
            - "marin": Smooth and professional
            - "cedar": Rich and grounded

        instructions: Control the voice with additional instructions.
            Does not work with tts-1 or tts-1-hd; only supported by gpt-4o-mini-tts.

        response_format: Audio format. Options:
            - "mp3": Default, good compression
            - "opus": Best for streaming, lower latency
            - "aac": Good compression, widely supported
            - "flac": Lossless compression
            - "wav": Uncompressed
            - "pcm": Raw 16-bit PCM audio

        speed: Playback speed between 0.25 and 4.0. Default 1.0.

        stream_format: Format to stream the audio in. Options:
            - "sse": Server-sent events streaming
            - "audio": Raw audio streaming
            Note: "sse" is not supported for tts-1 or tts-1-hd.

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        HttpxBinaryResponseContent: Audio file content. Use .content for bytes,
            .read() for streaming, .stream_to_file(path) for direct save.

    Raises:
        BadRequestError: Invalid parameters or text too long
    """
```

Usage examples:

```python
from openai import OpenAI
from pathlib import Path

client = OpenAI()

# Basic TTS
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text to speech."
)

# Save to file
speech_file = Path("output.mp3")
response.stream_to_file(speech_file)

# Different voices
voices = ["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"]
for voice in voices:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input="Testing different voices."
    )
    response.stream_to_file(f"voice_{voice}.mp3")

# High quality audio (marin or cedar recommended for best quality)
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="marin",
    input="High definition audio output."
)
response.stream_to_file("hd_output.mp3")

# Streaming optimized format (Opus)
response = client.audio.speech.create(
    model="tts-1",
    voice="shimmer",
    input="Optimized for streaming.",
    response_format="opus"
)
response.stream_to_file("output.opus")

# Adjust playback speed
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="This will play faster.",
    speed=1.5
)
response.stream_to_file("fast_speech.mp3")

# Get raw bytes
response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input="Getting raw audio bytes."
)
audio_bytes = response.content
# Process bytes as needed

# Streaming response
response = client.audio.speech.create(
    model="tts-1",
    voice="ballad",
    input="Streaming audio data."
)

with open("streaming.mp3", "wb") as f:
    for chunk in response.iter_bytes():
        f.write(chunk)

# Advanced: Using gpt-4o-mini-tts with instructions
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="sage",
    input="This is a test of voice control.",
    instructions="Speak in a warm, friendly tone with slight enthusiasm."
)
response.stream_to_file("instructed_speech.mp3")

# Advanced: Server-sent events streaming
# Note: with stream_format="sse" the response body contains SSE events,
# not a playable audio file, so don't save it directly as .mp3.
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Real-time audio streaming.",
    stream_format="sse"
)
```
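With `response_format="pcm"` the response is raw 16-bit, 24 kHz, mono samples with no container, so properties like duration follow directly from the byte count. A sketch — the 24 kHz/mono/16-bit figures match the documented TTS PCM output, but adjust the defaults if that changes:

```python
def pcm_duration_seconds(pcm: bytes, sample_rate: int = 24000,
                         sample_width: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio: bytes / (rate * width * channels)."""
    return len(pcm) / (sample_rate * sample_width * channels)

# One second of 16-bit mono audio at 24 kHz is 48,000 bytes:
# pcm_duration_seconds(b"\x00" * 48000)  ->  1.0
```

The same arithmetic works in reverse for chunked playback: a 4,800-byte chunk is 100 ms of audio at these settings.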

## Types

```python { .api }
from pathlib import Path
from typing import Iterator, Literal, Union

from pydantic import BaseModel

# Transcription types
class Transcription(BaseModel):
    text: str

class TranscriptionVerbose(BaseModel):
    text: str
    language: str
    duration: float
    segments: list[TranscriptionSegment] | None
    words: list[TranscriptionWord] | None

class TranscriptionSegment(BaseModel):
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: list[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float

class TranscriptionWord(BaseModel):
    word: str
    start: float
    end: float

class TranscriptionDiarized(BaseModel):
    """Transcription with speaker diarization."""
    text: str
    language: str
    duration: float
    segments: list[DiarizedSegment]

class DiarizedSegment(BaseModel):
    """Segment with speaker information."""
    speaker: str  # Speaker identifier
    start: float
    end: float
    text: str

# Translation types
class Translation(BaseModel):
    text: str

class TranslationVerbose(BaseModel):
    text: str
    language: str
    duration: float
    segments: list[TranscriptionSegment] | None

# Model types
AudioModel = Literal["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-4o-transcribe-diarize", "whisper-1"]
SpeechModel = Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts"]

# File types
FileTypes = Union[
    FileContent,                                 # File-like object
    tuple[str | None, FileContent],              # (filename, content)
    tuple[str | None, FileContent, str | None],  # (filename, content, content_type)
]

# Response type for TTS
class HttpxBinaryResponseContent:
    content: bytes
    def read(self) -> bytes: ...
    def iter_bytes(self, chunk_size: int | None = None) -> Iterator[bytes]: ...
    def stream_to_file(self, file_path: str | Path) -> None: ...
```

## Async Usage

```python
import asyncio
from openai import AsyncOpenAI

async def transcribe_audio():
    client = AsyncOpenAI()

    with open("audio.mp3", "rb") as audio_file:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcript.text

async def generate_speech():
    client = AsyncOpenAI()

    response = await client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input="Async text to speech"
    )
    response.stream_to_file("async_output.mp3")

# Run async operations
text = asyncio.run(transcribe_audio())
asyncio.run(generate_speech())
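The async client pays off when several files are processed at once: awaitables can be fanned out with `asyncio.gather`. The pattern is shown with stand-in coroutines so the control flow is clear on its own — replace `fake_transcribe` with real `await client.audio.transcriptions.create(...)` calls:

```python
import asyncio

async def fake_transcribe(name: str) -> str:
    # Stand-in for an await client.audio.transcriptions.create(...) call.
    await asyncio.sleep(0)
    return f"text of {name}"

async def transcribe_many(paths: list[str]) -> list[str]:
    # gather() runs the awaitables concurrently and preserves input order.
    return await asyncio.gather(*(fake_transcribe(p) for p in paths))

results = asyncio.run(transcribe_many(["a.mp3", "b.mp3"]))
# results == ["text of a.mp3", "text of b.mp3"]
```

Order is preserved regardless of which request finishes first, so results can be zipped back onto the input paths.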