# Audio

Convert audio to text (transcription and translation) and text to speech using Whisper, GPT-4o audio, and TTS models. Supports multiple audio formats and languages.

## Capabilities

### Transcription

Convert audio to text in the original language using Whisper or a GPT-4o transcription model.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: str | AudioModel,
    chunking_strategy: dict | str | Omit = omit,
    include: list[str] | Omit = omit,
    known_speaker_names: list[str] | Omit = omit,
    known_speaker_references: list[str] | Omit = omit,
    language: str | Omit = omit,
    prompt: str | Omit = omit,
    response_format: Literal["json", "text", "srt", "verbose_json", "vtt", "diarized_json"] | Omit = omit,
    stream: bool | Omit = omit,
    temperature: float | Omit = omit,
    timestamp_granularities: list[Literal["word", "segment"]] | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Transcription | TranscriptionVerbose | str:
    """
    Transcribe audio to text in the original language.

    Args:
        file: Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg,
            mpga, m4a, ogg, wav, webm. Max file size: 25 MB.
            Can be a file path string, file object, or tuple.

        model: Model ID. Options:
            - "gpt-4o-transcribe": Advanced transcription with streaming support
            - "gpt-4o-mini-transcribe": Faster, cost-effective transcription
            - "gpt-4o-transcribe-diarize": Speaker diarization model
            - "whisper-1": Powered by the open source Whisper V2 model

        chunking_strategy: Controls how audio is cut into chunks. Options:
            - "auto": Server normalizes loudness and uses voice activity detection (VAD)
            - {"type": "server_vad", ...}: Manually configure VAD parameters
            - If unset: Audio is transcribed as a single block
            - Required for gpt-4o-transcribe-diarize with inputs longer than 30 seconds

        include: Additional information to include. Options:
            - "logprobs": Returns log probabilities for confidence analysis
            - Only works with response_format="json"
            - Only supported for gpt-4o-transcribe and gpt-4o-mini-transcribe
            - Not supported with gpt-4o-transcribe-diarize

        known_speaker_names: List of speaker names for diarization (e.g., ["customer", "agent"]).
            Corresponds to audio samples in known_speaker_references. Up to 4 speakers.
            Used with the gpt-4o-transcribe-diarize model.

        known_speaker_references: List of audio samples (as data URLs) containing known speaker
            references. Each sample must be 2-10 seconds long. Matches known_speaker_names.
            Used with the gpt-4o-transcribe-diarize model.

        language: Language of the audio in ISO-639-1 format (e.g., "en", "fr", "de").
            Providing the language improves accuracy and latency.

        prompt: Optional text to guide the model's style or continue a previous segment.
            Should match the audio language.

        response_format: Output format. Options:
            - "json": JSON with text (default)
            - "text": Plain text only
            - "srt": SubRip subtitle format
            - "verbose_json": JSON with segments, timestamps, confidence
            - "vtt": WebVTT subtitle format
            - "diarized_json": JSON with speaker annotations (for gpt-4o-transcribe-diarize)
            Note: gpt-4o-transcribe/mini only support "json". gpt-4o-transcribe-diarize
            supports "json", "text", and "diarized_json" (required for speaker annotations).

        stream: If true, the model response is streamed using server-sent events.
            Returns Stream[TranscriptionStreamEvent]. Not supported for whisper-1.

        temperature: Sampling temperature between 0 and 1. Higher values increase
            randomness. Default is 0.

        timestamp_granularities: Timestamp precision options. Requires
            response_format="verbose_json".
            - ["segment"]: Segment-level timestamps (default)
            - ["word"]: Word-level timestamps
            - ["segment", "word"]: Both levels

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        Transcription: Basic response with text (for json format)
        TranscriptionVerbose: Detailed response with segments and timestamps
        str: Plain text string (for text, srt, vtt formats)

    Raises:
        BadRequestError: Invalid file format or size
        AuthenticationError: Invalid API key
    """
```
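For manual chunking, `chunking_strategy` takes a `server_vad` object instead of `"auto"`. A minimal sketch of such a configuration — the field names `prefix_padding_ms`, `silence_duration_ms`, and `threshold` are assumptions borrowed from the server-side VAD settings used elsewhere in the API, so verify them against the current reference before relying on them:

```python
# Hypothetical server_vad configuration for chunking_strategy.
# Field names are assumptions; check the current API reference.
vad_chunking = {
    "type": "server_vad",
    "prefix_padding_ms": 300,    # audio kept before detected speech
    "silence_duration_ms": 500,  # silence needed to close a chunk
    "threshold": 0.5,            # VAD sensitivity, 0.0-1.0
}

# Passed like any other parameter value:
# client.audio.transcriptions.create(
#     model="gpt-4o-transcribe",
#     file=audio_file,
#     chunking_strategy=vad_chunking,
# )
```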

Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Basic transcription
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
print(transcript.text)

# With language hint for better accuracy
with open("french_audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="fr"
    )

# Verbose JSON with detailed information
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Duration: {transcript.duration}")
print(f"Language: {transcript.language}")

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

for word in transcript.words:
    print(f"{word.word} ({word.start:.2f}s)")

# SRT subtitle format (returned as a plain string)
with open("video_audio.mp3", "rb") as audio_file:
    srt = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )
# Save to file
with open("subtitles.srt", "w") as f:
    f.write(srt)

# With prompt for context/style
with open("continuation.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Previous text for context..."
    )

# Using the file_from_path helper
from openai import file_from_path

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=file_from_path("audio.mp3")
)

# Advanced: Using gpt-4o-transcribe with streaming
with open("audio.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)

# Advanced: Speaker diarization with gpt-4o-transcribe-diarize
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        chunking_strategy="auto"
    )
for segment in transcript.segments:
    print(f"[{segment.speaker}]: {segment.text}")

# Advanced: With known speaker references
with open("call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        known_speaker_names=["customer", "agent"],
        known_speaker_references=[
            "data:audio/mp3;base64,...",  # Customer voice sample
            "data:audio/mp3;base64,..."   # Agent voice sample
        ]
    )

# Advanced: Using include for confidence scores
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="json",
        include=["logprobs"]
    )
# Access logprobs for confidence analysis
```
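The speaker-reference samples above are passed as base64 data URLs. A small helper for building one from a local file — `to_data_url` is a name invented here for illustration, not part of the SDK:

```python
import base64
from pathlib import Path

def to_data_url(path: str, mime: str = "audio/mp3") -> str:
    """Encode a short audio sample (2-10 s) as a data URL for known_speaker_references."""
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

With this helper, the references become `known_speaker_references=[to_data_url("customer.mp3"), to_data_url("agent.mp3")]`, matching the order of `known_speaker_names`.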

### Translation

Translate audio to English text using the Whisper model.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: str | AudioModel,
    prompt: str | Omit = omit,
    response_format: Literal["json", "text", "srt", "verbose_json", "vtt"] | Omit = omit,
    temperature: float | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Translation | TranslationVerbose | str:
    """
    Translate audio to English text.

    Args:
        file: Audio file to translate. Supported formats: flac, mp3, mp4, mpeg,
            mpga, m4a, ogg, wav, webm. Max file size: 25 MB.

        model: Model ID. Currently only "whisper-1" is available.

        prompt: Optional text to guide the model's style. Should be in English.

        response_format: Output format. Options:
            - "json": JSON with text (default)
            - "text": Plain text only
            - "srt": SubRip subtitle format
            - "verbose_json": JSON with segments and details
            - "vtt": WebVTT subtitle format

        temperature: Sampling temperature between 0 and 1.

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        Translation: Basic response with English text (for json format)
        TranslationVerbose: Detailed response with segments (for verbose_json format)
        str: Plain text string (for text, srt, vtt formats)

    Raises:
        BadRequestError: Invalid file format or size
    """
```

Usage example:

```python
from openai import OpenAI

client = OpenAI()

# Translate French audio to English
with open("french_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )
print(translation.text)

# Verbose format with segments
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

for segment in translation.segments:
    print(f"[{segment.start:.2f}s]: {segment.text}")
```

### Text-to-Speech

Convert text to spoken audio using TTS models.

```python { .api }
def create(
    self,
    *,
    input: str,
    model: str | SpeechModel,
    voice: Literal["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"],
    instructions: str | Omit = omit,
    response_format: Literal["mp3", "opus", "aac", "flac", "wav", "pcm"] | Omit = omit,
    speed: float | Omit = omit,
    stream_format: Literal["sse", "audio"] | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> HttpxBinaryResponseContent:
    """
    Convert text to spoken audio.

    Args:
        input: Text to convert to audio. Max length: 4096 characters.

        model: TTS model to use. Options:
            - "tts-1": Standard quality, faster, lower cost
            - "tts-1-hd": High definition quality, slower, higher cost
            - "gpt-4o-mini-tts": Advanced model with instruction support

        voice: Voice to use for generation. Options:
            - "alloy": Neutral, balanced
            - "ash": Clear and articulate
            - "ballad": Warm and expressive
            - "coral": Bright and engaging
            - "echo": Calm and measured
            - "sage": Wise and authoritative
            - "shimmer": Soft and gentle
            - "verse": Dynamic and versatile
            - "marin": Smooth and professional
            - "cedar": Rich and grounded

        instructions: Control the voice with additional instructions.
            Does not work with tts-1 or tts-1-hd; only supported by gpt-4o-mini-tts.

        response_format: Audio format. Options:
            - "mp3": Default, good compression
            - "opus": Best for streaming, lower latency
            - "aac": Good compression, widely supported
            - "flac": Lossless compression
            - "wav": Uncompressed
            - "pcm": Raw 16-bit PCM audio

        speed: Playback speed between 0.25 and 4.0. Default 1.0.

        stream_format: Format to stream the audio in. Options:
            - "sse": Server-sent events streaming
            - "audio": Raw audio streaming
            Note: "sse" is not supported for tts-1 or tts-1-hd.

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        HttpxBinaryResponseContent: Audio file content. Use .content for bytes,
            .read() for streaming, .stream_to_file(path) for direct save.

    Raises:
        BadRequestError: Invalid parameters or text too long
    """
```

Usage examples:

```python
from openai import OpenAI
from pathlib import Path

client = OpenAI()

# Basic TTS
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text to speech."
)

# Save to file
speech_file = Path("output.mp3")
response.stream_to_file(speech_file)

# Different voices
voices = ["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"]
for voice in voices:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input="Testing different voices."
    )
    response.stream_to_file(f"voice_{voice}.mp3")

# High quality audio (marin or cedar recommended for best quality)
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="marin",
    input="High definition audio output."
)
response.stream_to_file("hd_output.mp3")

# Streaming optimized format (Opus)
response = client.audio.speech.create(
    model="tts-1",
    voice="shimmer",
    input="Optimized for streaming.",
    response_format="opus"
)
response.stream_to_file("output.opus")

# Adjust playback speed
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="This will play faster.",
    speed=1.5
)
response.stream_to_file("fast_speech.mp3")

# Get raw bytes
response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input="Getting raw audio bytes."
)
audio_bytes = response.content
# Process bytes as needed

# Streaming response
response = client.audio.speech.create(
    model="tts-1",
    voice="ballad",
    input="Streaming audio data."
)

with open("streaming.mp3", "wb") as f:
    for chunk in response.iter_bytes():
        f.write(chunk)

# Advanced: Using gpt-4o-mini-tts with instructions
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="sage",
    input="This is a test of voice control.",
    instructions="Speak in a warm, friendly tone with slight enthusiasm."
)
response.stream_to_file("instructed_speech.mp3")

# Advanced: Server-sent events streaming
# Note: with stream_format="sse" the response body contains SSE events,
# not a playable audio file, so don't save it directly as .mp3.
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Real-time audio streaming.",
    stream_format="sse"
)
```
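With `response_format="pcm"` the response is raw 16-bit, 24 kHz, mono samples with no container, so properties like duration follow directly from the byte count. A sketch — the 24 kHz/mono/16-bit figures match the documented TTS PCM output, but adjust the defaults if that changes:

```python
def pcm_duration_seconds(pcm: bytes, sample_rate: int = 24000,
                         sample_width: int = 2, channels: int = 1) -> float:
    """Duration of raw PCM audio: bytes / (rate * width * channels)."""
    return len(pcm) / (sample_rate * sample_width * channels)

# One second of 16-bit mono audio at 24 kHz is 48,000 bytes:
# pcm_duration_seconds(b"\x00" * 48000)  ->  1.0
```

The same arithmetic works in reverse for chunked playback: a 4,800-byte chunk is 100 ms of audio at these settings.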

## Types

```python { .api }
from pathlib import Path
from typing import Iterator, Literal, Union

from pydantic import BaseModel

# Transcription types
class Transcription(BaseModel):
    text: str

class TranscriptionVerbose(BaseModel):
    text: str
    language: str
    duration: float
    segments: list[TranscriptionSegment] | None
    words: list[TranscriptionWord] | None

class TranscriptionSegment(BaseModel):
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: list[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float

class TranscriptionWord(BaseModel):
    word: str
    start: float
    end: float

class TranscriptionDiarized(BaseModel):
    """Transcription with speaker diarization."""
    text: str
    language: str
    duration: float
    segments: list[DiarizedSegment]

class DiarizedSegment(BaseModel):
    """Segment with speaker information."""
    speaker: str  # Speaker identifier
    start: float
    end: float
    text: str

# Translation types
class Translation(BaseModel):
    text: str

class TranslationVerbose(BaseModel):
    text: str
    language: str
    duration: float
    segments: list[TranscriptionSegment] | None

# Model types
AudioModel = Literal["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-4o-transcribe-diarize", "whisper-1"]
SpeechModel = Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts"]

# File types
FileTypes = Union[
    FileContent,                                 # File-like object
    tuple[str | None, FileContent],              # (filename, content)
    tuple[str | None, FileContent, str | None],  # (filename, content, content_type)
]

# Response type for TTS
class HttpxBinaryResponseContent:
    content: bytes
    def read(self) -> bytes: ...
    def iter_bytes(self, chunk_size: int | None = None) -> Iterator[bytes]: ...
    def stream_to_file(self, file_path: str | Path) -> None: ...
```

## Async Usage

```python
import asyncio
from openai import AsyncOpenAI

async def transcribe_audio():
    client = AsyncOpenAI()

    with open("audio.mp3", "rb") as audio_file:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcript.text

async def generate_speech():
    client = AsyncOpenAI()

    response = await client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input="Async text to speech"
    )
    response.stream_to_file("async_output.mp3")

# Run async operations
text = asyncio.run(transcribe_audio())
asyncio.run(generate_speech())
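The async client pays off when several files are processed at once: awaitables can be fanned out with `asyncio.gather`. The pattern is shown with stand-in coroutines so the control flow is clear on its own — replace `fake_transcribe` with real `await client.audio.transcriptions.create(...)` calls:

```python
import asyncio

async def fake_transcribe(name: str) -> str:
    # Stand-in for an await client.audio.transcriptions.create(...) call.
    await asyncio.sleep(0)
    return f"text of {name}"

async def transcribe_many(paths: list[str]) -> list[str]:
    # gather() runs the awaitables concurrently and preserves input order.
    return await asyncio.gather(*(fake_transcribe(p) for p in paths))

results = asyncio.run(transcribe_many(["a.mp3", "b.mp3"]))
# results == ["text of a.mp3", "text of b.mp3"]
```

Order is preserved regardless of which request finishes first, so results can be zipped back onto the input paths.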