pypi-openai

Description
Official Python library for the OpenAI API, providing chat completions, embeddings, audio, images, and more
Author
tessl
Last updated

How to use

npx @tessl/cli registry install tessl/pypi-openai@1.106.0

docs/audio.md

# Audio APIs

Comprehensive audio processing, including text-to-speech synthesis, speech-to-text transcription, and audio translation, using Whisper and TTS models.

## Capabilities

### Text-to-Speech (TTS)

Generate high-quality audio from text input using various voice options and audio formats.

```python { .api }
def create(
    self,
    *,
    input: str,
    model: Union[str, SpeechModel],
    voice: Union[str, Literal["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"]],
    instructions: str | NotGiven = NOT_GIVEN,
    response_format: Literal["mp3", "opus", "aac", "flac", "wav", "pcm"] | NotGiven = NOT_GIVEN,
    speed: float | NotGiven = NOT_GIVEN,
    stream_format: Literal["sse", "audio"] | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> HttpxBinaryResponseContent: ...
```
Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Basic text-to-speech
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a text-to-speech example using OpenAI's API."
)

# Save to file
response.stream_to_file("speech.mp3")

# Different voices
voices = ["alloy", "ash", "ballad", "coral", "echo", "sage"]
text = "This is a voice comparison test."

for voice in voices:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=text
    )
    response.stream_to_file(f"voice_{voice}.mp3")
    print(f"Generated audio with {voice} voice")

# High-quality TTS model
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="This is high-definition text-to-speech synthesis.",
    response_format="wav"
)

response.stream_to_file("hd_speech.wav")

# Custom speed and format
response = client.audio.speech.create(
    model="tts-1",
    voice="shimmer",
    input="This speech will be faster than normal.",
    speed=1.25,  # 25% faster
    response_format="opus"
)

response.stream_to_file("fast_speech.opus")
```
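
For long inputs, audio can also be written to disk as it is generated instead of buffered whole. A minimal sketch, assuming the standard openai-python `with_streaming_response` wrapper (not shown in the signature above):

```python
# Sketch: stream synthesized audio to disk chunk by chunk.
# Assumes the client from the examples above.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Streaming the audio avoids holding the whole file in memory.",
) as response:
    response.stream_to_file("streamed_speech.mp3")
```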
### Audio Transcription

Convert audio files to text using Whisper models with support for multiple languages and formats.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: Union[str, AudioModel],
    chunking_strategy: Optional[transcription_create_params.ChunkingStrategy] | NotGiven = NOT_GIVEN,
    include: List[TranscriptionInclude] | NotGiven = NOT_GIVEN,
    language: str | NotGiven = NOT_GIVEN,
    prompt: str | NotGiven = NOT_GIVEN,
    response_format: Union[AudioResponseFormat, NotGiven] = NOT_GIVEN,
    stream: Optional[Literal[False]] | Literal[True] | NotGiven = NOT_GIVEN,
    temperature: float | NotGiven = NOT_GIVEN,
    timestamp_granularities: List[Literal["word", "segment"]] | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> str | Transcription | TranscriptionVerbose | Stream[TranscriptionStreamEvent]: ...
```
Usage examples:

```python
# Basic transcription
with open("audio_file.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcription.text)

# Specify language for better accuracy
with open("french_audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="fr"  # French
    )

print("French transcription:", transcription.text)

# Detailed transcription with timestamps
with open("interview.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Duration: {transcription.duration} seconds")
print(f"Language: {transcription.language}")

# Print segments with timestamps
for segment in transcription.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

# Print words with timestamps
for word in transcription.words:
    print(f"{word.word} ({word.start:.2f}s - {word.end:.2f}s)")

# SRT subtitle format
with open("video_audio.mp4", "rb") as audio_file:
    srt_transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )

# Save as subtitle file
with open("subtitles.srt", "w") as srt_file:
    srt_file.write(srt_transcription)

# With context prompt for technical terms
with open("technical_presentation.m4a", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="This presentation discusses machine learning, neural networks, and artificial intelligence."
    )
```
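
The `stream` parameter switches the return value to an iterator of `TranscriptionStreamEvent`s. A minimal sketch, assuming a streaming-capable model such as `gpt-4o-mini-transcribe` (`whisper-1` does not stream) and the `transcript.text.delta` / `transcript.text.done` event types:

```python
# Sketch: print transcription text incrementally as events arrive.
# Event type names are assumptions; check TranscriptionStreamEvent.
with open("audio_file.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True
    )
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "transcript.text.done":
            print()
```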
### Audio Translation

Translate audio in any language to English using Whisper's translation capabilities.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: Union[str, AudioModel],
    prompt: str | NotGiven = NOT_GIVEN,
    response_format: Union[Literal["json", "text", "srt", "verbose_json", "vtt"], NotGiven] = NOT_GIVEN,
    temperature: float | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Translation | TranslationVerbose | str: ...
```
Usage examples:

```python
# Basic translation (any language to English)
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print("English translation:", translation.text)

# Translation with detailed output
with open("german_podcast.wav", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

print(f"Original language detected: {translation.language}")
print(f"Translation: {translation.text}")
print(f"Duration: {translation.duration} seconds")

# Translation for subtitles
with open("french_movie.mp4", "rb") as audio_file:
    vtt_translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt"
    )

# Save VTT subtitle file
with open("english_subtitles.vtt", "w") as vtt_file:
    vtt_file.write(vtt_translation)

# Translation with context
with open("japanese_lecture.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        prompt="This is a university lecture about physics and quantum mechanics."
    )
```
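
Note that the return type follows `response_format`: the default `"json"` yields a `Translation`, `"verbose_json"` a `TranslationVerbose`, and `"text"`, `"srt"`, or `"vtt"` a plain string. A small sketch of handling both shapes (the helper is illustrative, not part of the library):

```python
# Sketch: normalize the translation result regardless of response_format.
def translation_text(result):
    # "text", "srt", and "vtt" come back as str;
    # "json" and "verbose_json" come back as objects with .text.
    if isinstance(result, str):
        return result
    return result.text

with open("spanish_audio.mp3", "rb") as audio_file:
    result = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="text"
    )

print(translation_text(result))
```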
### Advanced Audio Processing

Handle various audio formats, file sizes, and processing options for optimal results.

Usage examples:

```python
import os
from pathlib import Path

# Handle multiple audio formats
audio_formats = [".mp3", ".wav", ".m4a", ".flac", ".ogg"]
audio_dir = Path("audio_files/")

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() in audio_formats:
        print(f"Processing {audio_file.name}...")

        with open(audio_file, "rb") as file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=file
            )

        # Save transcription
        output_file = audio_dir / f"{audio_file.stem}_transcription.txt"
        with open(output_file, "w") as f:
            f.write(transcription.text)

# Handle large audio files (split if necessary)
def transcribe_large_audio(file_path, max_size_mb=25):
    """Transcribe audio files, splitting if they exceed the size limit."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

    if file_size_mb <= max_size_mb:
        # File is small enough, transcribe directly
        with open(file_path, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )
        return transcription.text
    else:
        print(f"File too large ({file_size_mb:.1f}MB), please split first")
        return None

# Temperature control for consistency
audio_files = ["recording1.mp3", "recording2.mp3", "recording3.mp3"]

for audio_file in audio_files:
    with open(audio_file, "rb") as file:
        # Low temperature for consistent output
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=file,
            temperature=0.0  # Most deterministic
        )

    print(f"{audio_file}: {transcription.text}")

# Chunked TTS for dynamically generated text
def stream_tts(text_generator, voice="alloy"):
    """Generate one TTS audio file per chunk of dynamically generated text."""

    for text_chunk in text_generator:
        if text_chunk.strip():  # Skip empty chunks
            response = client.audio.speech.create(
                model="tts-1",
                voice=voice,
                input=text_chunk,
                response_format="mp3"
            )

            # Stream or save each chunk
            chunk_filename = f"chunk_{hash(text_chunk)}.mp3"
            response.stream_to_file(chunk_filename)

            yield chunk_filename

# Example text generator
def generate_story():
    sentences = [
        "Once upon a time, in a distant galaxy.",
        "There lived a brave astronaut named Alex.",
        "Alex discovered a mysterious planet.",
        "The planet was filled with strange creatures."
    ]
    for sentence in sentences:
        yield sentence

# Generate chunked TTS
for audio_file in stream_tts(generate_story()):
    print(f"Generated: {audio_file}")
```
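
When throughput matters more than simplicity, the transcriptions above can run concurrently. A sketch assuming `AsyncOpenAI`, the async mirror of the client used throughout this page:

```python
import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def transcribe(path: str) -> str:
    # Each request awaits independently, so gather() overlaps them.
    with open(path, "rb") as audio_file:
        transcription = await async_client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcription.text

async def main() -> None:
    paths = ["recording1.mp3", "recording2.mp3", "recording3.mp3"]
    texts = await asyncio.gather(*(transcribe(p) for p in paths))
    for path, text in zip(paths, texts):
        print(f"{path}: {text}")

asyncio.run(main())
```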
### File Handling and Utilities

Efficient file management and audio processing utilities for various use cases.

```python { .api }
FileTypes = Union[
    bytes,             # Raw audio bytes
    IO[bytes],         # File-like object
    str,               # File path
    os.PathLike[str],  # Path object
]
```
Usage examples:

```python
import io
import base64
from pathlib import Path

# File path transcription
audio_path = Path("meeting_recording.wav")
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

# Bytes transcription
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_bytes
)

# In-memory audio processing
audio_buffer = io.BytesIO()

# Generate TTS to buffer
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="This will be stored in memory."
)

# Write to buffer
for chunk in response.iter_bytes():
    audio_buffer.write(chunk)

# Reset buffer position for reading
audio_buffer.seek(0)

# Give the buffer a filename so the audio format can be inferred
# (BytesIO has no name by default)
audio_buffer.name = "speech.mp3"

# Transcribe from buffer
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_buffer
)

print("Round-trip transcription:", transcription.text)

# Base64 audio handling
def audio_to_base64(file_path):
    """Convert an audio file to a base64 string."""
    with open(file_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def base64_to_audio(base64_str, output_path):
    """Convert a base64 string to an audio file."""
    audio_bytes = base64.b64decode(base64_str)
    with open(output_path, "wb") as f:
        f.write(audio_bytes)

# Example usage
base64_audio = audio_to_base64("original.mp3")
base64_to_audio(base64_audio, "restored.mp3")

# Batch processing utility
def process_audio_batch(audio_files, operation="transcribe"):
    """Process multiple audio files in batch."""
    results = []

    for audio_file in audio_files:
        try:
            with open(audio_file, "rb") as file:
                if operation == "transcribe":
                    result = client.audio.transcriptions.create(
                        model="whisper-1",
                        file=file
                    )
                    results.append({
                        "file": audio_file,
                        "text": result.text
                    })
                elif operation == "translate":
                    result = client.audio.translations.create(
                        model="whisper-1",
                        file=file
                    )
                    results.append({
                        "file": audio_file,
                        "translation": result.text
                    })
        except Exception as e:
            results.append({
                "file": audio_file,
                "error": str(e)
            })

    return results

# Process multiple files
audio_files = ["file1.mp3", "file2.wav", "file3.m4a"]
batch_results = process_audio_batch(audio_files, "transcribe")

for result in batch_results:
    if "error" in result:
        print(f"Error processing {result['file']}: {result['error']}")
    else:
        print(f"{result['file']}: {result['text']}")
```
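
When the input is raw bytes or a nameless buffer, the API has no filename from which to infer the audio format. As an assumption beyond the simplified `FileTypes` union above, openai-python also accepts a `(filename, contents)` tuple, which keeps the format information attached:

```python
# Sketch: pair raw bytes with a filename so the format can be detected.
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=("interview.mp3", audio_bytes)
)
print(transcription.text)
```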
## Types

### Core Response Types

```python { .api }
class Transcription(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    words: Optional[List[TranscriptionWord]]
    segments: Optional[List[TranscriptionSegment]]

class TranscriptionWord(BaseModel):
    word: str
    start: float
    end: float

class TranscriptionSegment(BaseModel):
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: List[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float

class Translation(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    segments: Optional[List[TranscriptionSegment]]

class TranslationVerbose(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    segments: Optional[List[TranscriptionSegment]]

class HttpxBinaryResponseContent:
    def stream_to_file(self, file: Union[str, os.PathLike[str]]) -> None: ...
    def iter_bytes(self, chunk_size: int = 1024) -> Iterator[bytes]: ...
```
### Parameter Types

```python { .api }
# Speech synthesis parameters
SpeechCreateParams = TypedDict('SpeechCreateParams', {
    'input': Required[str],
    'model': Required[Union[str, SpeechModel]],
    'voice': Required[Union[str, AudioVoice]],
    'instructions': NotRequired[str],
    'response_format': NotRequired[AudioFormat],
    'speed': NotRequired[float],
    'stream_format': NotRequired[Literal["sse", "audio"]],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)

# Transcription parameters
TranscriptionCreateParams = TypedDict('TranscriptionCreateParams', {
    'file': Required[FileTypes],
    'model': Required[Union[str, AudioModel]],
    'chunking_strategy': NotRequired[Optional[ChunkingStrategy]],
    'include': NotRequired[List[TranscriptionInclude]],
    'language': NotRequired[str],
    'prompt': NotRequired[str],
    'response_format': NotRequired[AudioResponseFormat],
    'stream': NotRequired[bool],
    'temperature': NotRequired[float],
    'timestamp_granularities': NotRequired[List[TimestampGranularity]],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)

# Translation parameters
TranslationCreateParams = TypedDict('TranslationCreateParams', {
    'file': Required[FileTypes],
    'model': Required[Union[str, AudioModel]],
    'prompt': NotRequired[str],
    'response_format': NotRequired[AudioResponseFormat],
    'temperature': NotRequired[float],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)
```
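
Because these are TypedDicts, shared options can be assembled once as a plain dictionary and splatted into each call; a small sketch (the keys mirror `TranscriptionCreateParams`):

```python
# Sketch: reuse one set of transcription options across many files.
common_options = {
    "model": "whisper-1",
    "language": "en",
    "temperature": 0.0,
    "response_format": "verbose_json",
}

with open("meeting_recording.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=audio_file,
        **common_options
    )
```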
554
555
### Model and Format Types
556
557
```python { .api }
558
# TTS Models
559
SpeechModel = Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts"]
560
561
# Audio processing models
562
AudioModel = Literal["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "whisper-1"]
563
564
# Voice options
565
AudioVoice = Literal[
566
"alloy", "ash", "ballad", "coral",
567
"echo", "sage", "shimmer", "verse", "marin", "cedar"
568
]
569
570
# Audio formats
571
AudioFormat = Literal["mp3", "opus", "aac", "flac", "wav", "pcm"]
572
573
# Response formats
574
AudioResponseFormat = Literal["json", "text", "srt", "verbose_json", "vtt"]
575
576
# Timestamp and streaming options
577
TimestampGranularity = Literal["word", "segment"]
578
TranscriptionInclude = Literal["logprobs"]
579
580
# Chunking strategy types
581
ChunkingStrategy = Union[Literal["auto"], Dict[str, Any]] # server_vad object
582
583
# Streaming support
584
TranscriptionStreamEvent = Dict[str, Any]
585
Stream = Iterator[TranscriptionStreamEvent]
586
587
# File type union
588
FileTypes = Union[
589
bytes, # Raw bytes
590
IO[bytes], # File-like object
591
str, # File path string
592
os.PathLike[str] # Path object
593
]
594
```
### Configuration Types

```python { .api }
# Parameter ranges and limits
class AudioLimits:
    # File size limit for uploads
    max_file_size: int = 25 * 1024 * 1024  # 25MB

    # Supported input formats
    supported_formats: List[str] = [
        "flac", "m4a", "mp3", "mp4", "mpeg", "mpga",
        "oga", "ogg", "wav", "webm"
    ]

    # TTS speed range
    speed_range: Tuple[float, float] = (0.25, 4.0)

    # Temperature range
    temperature_range: Tuple[float, float] = (0.0, 1.0)

    # Max input text length for TTS
    max_tts_input: int = 4096  # characters
```
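
These limits lend themselves to a pre-flight check so invalid uploads fail fast, before any network call. A hypothetical helper (not part of the library) mirroring the values above:

```python
import os
from pathlib import Path

# Values mirror AudioLimits above.
MAX_FILE_SIZE = 25 * 1024 * 1024
SUPPORTED_FORMATS = {
    "flac", "m4a", "mp3", "mp4", "mpeg", "mpga",
    "oga", "ogg", "wav", "webm",
}

def validate_audio_upload(path: str) -> None:
    """Raise ValueError if the file would be rejected by the API."""
    suffix = Path(path).suffix.lstrip(".").lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {suffix!r}")
    if os.path.getsize(path) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 25MB limit; split it first")

validate_audio_upload("meeting_recording.wav")
```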
## Best Practices

### Text-to-Speech

- Choose an appropriate voice for your use case (`alloy` for general use, `nova` for conversational content)
- Use `tts-1-hd` for higher quality when latency is less important
- Adjust speed to the content type (slower for technical content)
- Break long text into chunks for better processing (see the chunking sketch after this list)
- Use an appropriate audio format (`mp3` for the web, `wav` for further processing)

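
A minimal chunking sketch that keeps each request under the 4096-character TTS input limit by splitting on sentence boundaries (the splitting heuristic is illustrative):

```python
# Sketch: split long text into sentence-aligned chunks under the
# 4096-character TTS limit, then synthesize one file per chunk.
def chunk_text(text, max_chars=4096):
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        candidate = f"{current}{sentence}. "
        if len(candidate) > max_chars and current:
            chunks.append(current.strip())
            current = f"{sentence}. "
        else:
            current = candidate
    if current.strip():
        chunks.append(current.strip())
    return chunks

long_text = "A long document that needs narrating. " * 500
for i, chunk in enumerate(chunk_text(long_text)):
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=chunk
    )
    response.stream_to_file(f"part_{i:03d}.mp3")
```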
### Transcription

- Provide a language hint when the language is known, for better accuracy
- Use context prompts for technical terms or proper nouns
- Choose an appropriate response format (`verbose_json` for detailed analysis)
- Ensure good audio quality (clear speech, minimal background noise)
- Split large files before uploading (25MB limit; see the segmenting sketch after this list)

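
One way to stay under the 25MB limit is to cut a recording into timed segments before uploading. A sketch using the third-party pydub package (an assumption here; any audio tool, such as ffmpeg, works equally well):

```python
# Sketch: split a long recording into 10-minute segments, transcribe
# each, and join the text. Requires `pip install pydub` plus ffmpeg.
from pydub import AudioSegment

def transcribe_in_segments(path, segment_minutes=10):
    audio = AudioSegment.from_file(path)
    segment_ms = segment_minutes * 60 * 1000
    texts = []
    for start in range(0, len(audio), segment_ms):
        audio[start:start + segment_ms].export("segment.mp3", format="mp3")
        with open("segment.mp3", "rb") as f:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=f
            )
        texts.append(transcription.text)
    return " ".join(texts)

print(transcribe_in_segments("long_interview.mp3"))
```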
### Translation

- Whisper automatically detects the source language
- Works best with clear, well-enunciated speech
- Context prompts help with domain-specific terminology
- Consider transcription plus a separate translation step for very long content

### Performance and Cost

- Batch similar requests when possible
- Cache results for repeated content (see the caching sketch after this list)
- Use the appropriate model (`tts-1` vs `tts-1-hd`) for your quality needs
- Consider preprocessing audio (noise reduction, normalization)
- Monitor usage and implement rate limiting in production applications

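
A simple content-addressed cache makes repeated transcriptions of identical audio free. The sketch below keys on a SHA-256 hash of the file bytes (the cache layout and helper name are illustrative):

```python
import hashlib
import json
from pathlib import Path

# Sketch: cache transcriptions on disk, keyed by content hash.
CACHE_DIR = Path(".transcription_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_transcribe(path):
    audio_bytes = Path(path).read_bytes()
    key = hashlib.sha256(audio_bytes).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())["text"]

    with open(path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    cache_file.write_text(json.dumps({"text": transcription.text}))
    return transcription.text
```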