Describes: pkg:pypi/openai@1.106.x
# Audio APIs

Comprehensive audio processing including text-to-speech synthesis, speech-to-text transcription, and audio translation capabilities using Whisper and TTS models.

## Capabilities

### Text-to-Speech (TTS)

Generate high-quality audio from text input using various voice options and audio formats.

```python { .api }
def create(
    self,
    *,
    input: str,
    model: Union[str, SpeechModel],
    voice: Union[str, Literal["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"]],
    instructions: str | NotGiven = NOT_GIVEN,
    response_format: Literal["mp3", "opus", "aac", "flac", "wav", "pcm"] | NotGiven = NOT_GIVEN,
    speed: float | NotGiven = NOT_GIVEN,
    stream_format: Literal["sse", "audio"] | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> HttpxBinaryResponseContent: ...
```

Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Basic text-to-speech
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a text-to-speech example using OpenAI's API."
)

# Save to file
response.stream_to_file("speech.mp3")

# Different voices
voices = ["alloy", "ash", "ballad", "coral", "echo", "sage"]
text = "This is a voice comparison test."

for voice in voices:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=text
    )
    response.stream_to_file(f"voice_{voice}.mp3")
    print(f"Generated audio with {voice} voice")

# High-quality TTS model
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="This is high-definition text-to-speech synthesis.",
    response_format="wav"
)

response.stream_to_file("hd_speech.wav")

# Custom speed and format
response = client.audio.speech.create(
    model="tts-1",
    voice="shimmer",
    input="This speech will be faster than normal.",
    speed=1.25,  # 25% faster
    response_format="opus"
)

response.stream_to_file("fast_speech.opus")
```
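For larger outputs, the client also exposes a streaming variant that writes audio to disk as it arrives instead of buffering the whole response. A minimal sketch using the client's `with_streaming_response` helper (the exact helper surface may vary across versions):

```python
from openai import OpenAI

client = OpenAI()

# Stream the generated audio to disk chunk by chunk instead of
# holding the entire binary response in memory.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Streaming this sentence to a file chunk by chunk.",
) as response:
    response.stream_to_file("streamed_speech.mp3")
```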
### Audio Transcription

Convert audio files to text using Whisper models with support for multiple languages and formats.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: Union[str, AudioModel],
    chunking_strategy: Optional[transcription_create_params.ChunkingStrategy] | NotGiven = NOT_GIVEN,
    include: List[TranscriptionInclude] | NotGiven = NOT_GIVEN,
    language: str | NotGiven = NOT_GIVEN,
    prompt: str | NotGiven = NOT_GIVEN,
    response_format: Union[AudioResponseFormat, NotGiven] = NOT_GIVEN,
    stream: Optional[Literal[False]] | Literal[True] | NotGiven = NOT_GIVEN,
    temperature: float | NotGiven = NOT_GIVEN,
    timestamp_granularities: List[Literal["word", "segment"]] | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> str | Transcription | TranscriptionVerbose | Stream[TranscriptionStreamEvent]: ...
```

Usage examples:

```python
# Basic transcription
with open("audio_file.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcription.text)

# Specify language for better accuracy
with open("french_audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="fr"  # French
    )

print("French transcription:", transcription.text)

# Detailed transcription with timestamps
with open("interview.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Duration: {transcription.duration} seconds")
print(f"Language: {transcription.language}")

# Print segments with timestamps
for segment in transcription.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

# Print words with timestamps
for word in transcription.words:
    print(f"{word.word} ({word.start:.2f}s - {word.end:.2f}s)")

# SRT subtitle format
with open("video_audio.mp4", "rb") as audio_file:
    srt_transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )

# Save as subtitle file
with open("subtitles.srt", "w") as srt_file:
    srt_file.write(srt_transcription)

# With context prompt for technical terms
with open("technical_presentation.m4a", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="This presentation discusses machine learning, neural networks, and artificial intelligence."
    )
```
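The `stream` parameter in the signature above enables incremental results. A hedged sketch: streaming is supported by the gpt-4o transcribe models (not `whisper-1`), and the event type names used below (`transcript.text.delta`, `transcript.text.done`) reflect the API at the time of writing and may differ in your version:

```python
# Stream partial transcripts as they are produced.
# Note: whisper-1 does not support stream=True; use a gpt-4o transcribe model.
with open("audio_file.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True,
    )

    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)  # incremental text
        elif event.type == "transcript.text.done":
            print()  # final transcript is complete
```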
### Audio Translation

Translate audio in any language to English using Whisper's translation capabilities.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: Union[str, AudioModel],
    prompt: str | NotGiven = NOT_GIVEN,
    response_format: Union[Literal["json", "text", "srt", "verbose_json", "vtt"], NotGiven] = NOT_GIVEN,
    temperature: float | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Translation | TranslationVerbose | str: ...
```

Usage examples:

```python
# Basic translation (any language to English)
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print("English translation:", translation.text)

# Translation with detailed output
with open("german_podcast.wav", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

print(f"Original language detected: {translation.language}")
print(f"Translation: {translation.text}")
print(f"Duration: {translation.duration} seconds")

# Translation for subtitles
with open("french_movie.mp4", "rb") as audio_file:
    vtt_translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt"
    )

# Save VTT subtitle file
with open("english_subtitles.vtt", "w") as vtt_file:
    vtt_file.write(vtt_translation)

# Translation with context
with open("japanese_lecture.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        prompt="This is a university lecture about physics and quantum mechanics."
    )
```
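All of these endpoints are also available on `AsyncOpenAI` with the same method signatures, which is convenient for transcribing or translating many files concurrently. A minimal sketch (the file names are placeholders):

```python
import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def translate_file(path: str) -> str:
    # Each call awaits its own HTTP request; gather runs them concurrently.
    with open(path, "rb") as audio_file:
        translation = await async_client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
        )
    return translation.text

async def main() -> None:
    paths = ["spanish_audio.mp3", "german_podcast.wav"]
    texts = await asyncio.gather(*(translate_file(p) for p in paths))
    for path, text in zip(paths, texts):
        print(f"{path}: {text}")

asyncio.run(main())
```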
### Advanced Audio Processing

Handle various audio formats, file sizes, and processing options for optimal results.

Usage examples:

```python
import os
from pathlib import Path

# Handle multiple audio formats
audio_formats = [".mp3", ".wav", ".m4a", ".flac", ".ogg"]
audio_dir = Path("audio_files/")

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() in audio_formats:
        print(f"Processing {audio_file.name}...")

        with open(audio_file, "rb") as file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=file
            )

        # Save transcription
        output_file = audio_dir / f"{audio_file.stem}_transcription.txt"
        with open(output_file, "w") as f:
            f.write(transcription.text)

# Handle large audio files (split if necessary)
def transcribe_large_audio(file_path, max_size_mb=25):
    """Transcribe audio files, splitting if they exceed the size limit."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

    if file_size_mb <= max_size_mb:
        # File is small enough, transcribe directly
        with open(file_path, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )
        return transcription.text
    else:
        print(f"File too large ({file_size_mb:.1f}MB), please split first")
        return None

# Temperature control for consistency
audio_files = ["recording1.mp3", "recording2.mp3", "recording3.mp3"]

for audio_file in audio_files:
    with open(audio_file, "rb") as file:
        # Low temperature for consistent output
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=file,
            temperature=0.0  # Most deterministic
        )

    print(f"{audio_file}: {transcription.text}")

# Chunked TTS for dynamically generated text
def stream_tts(text_generator, voice="alloy"):
    """Generate TTS audio chunk by chunk for dynamically produced text."""
    for text_chunk in text_generator:
        if text_chunk.strip():  # Skip empty chunks
            response = client.audio.speech.create(
                model="tts-1",
                voice=voice,
                input=text_chunk,
                response_format="mp3"
            )

            # Stream or save each chunk
            chunk_filename = f"chunk_{hash(text_chunk)}.mp3"
            response.stream_to_file(chunk_filename)

            yield chunk_filename

# Example text generator
def generate_story():
    sentences = [
        "Once upon a time, in a distant galaxy.",
        "There lived a brave astronaut named Alex.",
        "Alex discovered a mysterious planet.",
        "The planet was filled with strange creatures."
    ]
    for sentence in sentences:
        yield sentence

# Generate chunked TTS
for audio_file in stream_tts(generate_story()):
    print(f"Generated: {audio_file}")
```

### File Handling and Utilities

Efficient file management and audio processing utilities for various use cases.

```python { .api }
FileTypes = Union[
    bytes,              # Raw audio bytes
    IO[bytes],          # File-like object
    str,                # File path
    os.PathLike[str]    # Path object
]
```

Usage examples:

```python
import io
import base64
from pathlib import Path

# File path transcription
audio_path = Path("meeting_recording.wav")
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

# Bytes transcription
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_bytes
)

# In-memory audio processing
audio_buffer = io.BytesIO()

# Generate TTS to buffer
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="This will be stored in memory."
)

# Write to buffer
for chunk in response.iter_bytes():
    audio_buffer.write(chunk)

# Reset buffer position for reading
audio_buffer.seek(0)

# Give the buffer a filename so the server can infer the audio format
audio_buffer.name = "speech.mp3"

# Transcribe from buffer
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_buffer
)

print("Round-trip transcription:", transcription.text)

# Base64 audio handling
def audio_to_base64(file_path):
    """Convert an audio file to a base64 string."""
    with open(file_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def base64_to_audio(base64_str, output_path):
    """Convert a base64 string to an audio file."""
    audio_bytes = base64.b64decode(base64_str)
    with open(output_path, "wb") as f:
        f.write(audio_bytes)

# Example usage
base64_audio = audio_to_base64("original.mp3")
base64_to_audio(base64_audio, "restored.mp3")

# Batch processing utility
def process_audio_batch(audio_files, operation="transcribe"):
    """Process multiple audio files in batch."""
    results = []

    for audio_file in audio_files:
        try:
            with open(audio_file, "rb") as file:
                if operation == "transcribe":
                    result = client.audio.transcriptions.create(
                        model="whisper-1",
                        file=file
                    )
                    results.append({
                        "file": audio_file,
                        "text": result.text
                    })
                elif operation == "translate":
                    result = client.audio.translations.create(
                        model="whisper-1",
                        file=file
                    )
                    results.append({
                        "file": audio_file,
                        "translation": result.text
                    })
        except Exception as e:
            results.append({
                "file": audio_file,
                "error": str(e)
            })

    return results

# Process multiple files
audio_files = ["file1.mp3", "file2.wav", "file3.m4a"]
batch_results = process_audio_batch(audio_files, "transcribe")

for result in batch_results:
    if "error" in result:
        print(f"Error processing {result['file']}: {result['error']}")
    else:
        print(f"{result['file']}: {result['text']}")
```
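In addition to the forms listed above, the client's `FileTypes` also accepts a `(filename, contents)` tuple, which is useful when you only have raw bytes but want to preserve the format hint the service uses. A brief sketch, assuming the tuple form supported by the openai client:

```python
# Pass raw bytes together with an explicit filename so the service
# can infer the audio format from the extension.
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=("interview.mp3", audio_bytes),
)
print(transcription.text)
```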
## Types

### Core Response Types

```python { .api }
class Transcription(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    words: Optional[List[TranscriptionWord]]
    segments: Optional[List[TranscriptionSegment]]

class TranscriptionWord(BaseModel):
    word: str
    start: float
    end: float

class TranscriptionSegment(BaseModel):
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: List[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float

class Translation(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    segments: Optional[List[TranscriptionSegment]]

class TranslationVerbose(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    segments: Optional[List[TranscriptionSegment]]

class HttpxBinaryResponseContent:
    def stream_to_file(self, file: Union[str, os.PathLike[str]]) -> None: ...
    def iter_bytes(self, chunk_size: int = 1024) -> Iterator[bytes]: ...
```

### Parameter Types

```python { .api }
# Speech synthesis parameters
SpeechCreateParams = TypedDict('SpeechCreateParams', {
    'input': Required[str],
    'model': Required[Union[str, SpeechModel]],
    'voice': Required[Union[str, AudioVoice]],
    'instructions': NotRequired[str],
    'response_format': NotRequired[AudioFormat],
    'speed': NotRequired[float],
    'stream_format': NotRequired[Literal["sse", "audio"]],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)

# Transcription parameters
TranscriptionCreateParams = TypedDict('TranscriptionCreateParams', {
    'file': Required[FileTypes],
    'model': Required[Union[str, AudioModel]],
    'chunking_strategy': NotRequired[Optional[ChunkingStrategy]],
    'include': NotRequired[List[TranscriptionInclude]],
    'language': NotRequired[str],
    'prompt': NotRequired[str],
    'response_format': NotRequired[AudioResponseFormat],
    'stream': NotRequired[bool],
    'temperature': NotRequired[float],
    'timestamp_granularities': NotRequired[List[TimestampGranularity]],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)

# Translation parameters
TranslationCreateParams = TypedDict('TranslationCreateParams', {
    'file': Required[FileTypes],
    'model': Required[Union[str, AudioModel]],
    'prompt': NotRequired[str],
    'response_format': NotRequired[AudioResponseFormat],
    'temperature': NotRequired[float],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)
```

### Model and Format Types

```python { .api }
# TTS models
SpeechModel = Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts"]

# Audio processing models
AudioModel = Literal["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "whisper-1"]

# Voice options
AudioVoice = Literal[
    "alloy", "ash", "ballad", "coral",
    "echo", "sage", "shimmer", "verse", "marin", "cedar"
]

# Audio formats
AudioFormat = Literal["mp3", "opus", "aac", "flac", "wav", "pcm"]

# Response formats
AudioResponseFormat = Literal["json", "text", "srt", "verbose_json", "vtt"]

# Timestamp and streaming options
TimestampGranularity = Literal["word", "segment"]
TranscriptionInclude = Literal["logprobs"]

# Chunking strategy types
ChunkingStrategy = Union[Literal["auto"], Dict[str, Any]]  # server_vad object

# Streaming support
TranscriptionStreamEvent = Dict[str, Any]
Stream = Iterator[TranscriptionStreamEvent]

# File type union
FileTypes = Union[
    bytes,              # Raw bytes
    IO[bytes],          # File-like object
    str,                # File path string
    os.PathLike[str]    # Path object
]
```

### Configuration Types

```python { .api }
# Parameter ranges and limits
class AudioLimits:
    # File size limit
    max_file_size: int = 25 * 1024 * 1024  # 25MB

    # Supported input formats
    supported_formats: List[str] = [
        "flac", "m4a", "mp3", "mp4", "mpeg", "mpga",
        "oga", "ogg", "wav", "webm"
    ]

    # TTS speed range
    speed_range: Tuple[float, float] = (0.25, 4.0)

    # Temperature range
    temperature_range: Tuple[float, float] = (0.0, 1.0)

    # Max input text length for TTS
    max_tts_input: int = 4096  # characters
```
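These limits can be enforced client-side before calling the API. A minimal, illustrative sketch that splits long text at sentence boundaries so each TTS request stays under the 4096-character input limit (the `chunk_text` helper and its splitting heuristic are not part of the library):

```python
from openai import OpenAI

client = OpenAI()

MAX_TTS_INPUT = 4096  # characters, per the limits above

def chunk_text(text: str, max_chars: int = MAX_TTS_INPUT) -> list[str]:
    """Split text into chunks under max_chars, preferring sentence boundaries.

    Assumes no single sentence exceeds max_chars.
    """
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        candidate = f"{current}{sentence}. "
        if len(candidate) > max_chars and current:
            chunks.append(current.strip())
            current = f"{sentence}. "
        else:
            current = candidate
    if current.strip():
        chunks.append(current.strip())
    return chunks

long_text = "A very long document. " * 500
for i, chunk in enumerate(chunk_text(long_text)):
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=chunk,
    )
    response.stream_to_file(f"part_{i:03d}.mp3")
```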
## Best Practices

### Text-to-Speech

- Choose an appropriate voice for your use case (alloy for general use, nova for conversational)
- Use `tts-1-hd` for higher quality when latency is less important
- Adjust speed based on content type (slower for technical content)
- Break long text into chunks for better processing (see the chunking sketch above)
- Use an appropriate audio format (mp3 for web, wav for processing)

### Transcription

- Provide a language hint when known for better accuracy
- Use context prompts for technical terms or proper nouns
- Choose an appropriate response format (verbose_json for detailed analysis)
- Ensure audio quality is good (clear speech, minimal background noise)
- Split large files before uploading (25MB limit)

### Translation

- Whisper automatically detects the source language
- Works best with clear, well-enunciated speech
- Context prompts help with domain-specific terminology
- Consider transcription plus separate translation for very long content

### Performance and Cost

- Batch similar requests when possible
- Cache results for repeated content (see the caching sketch below)
- Use the appropriate model (tts-1 vs tts-1-hd) based on quality needs
- Consider preprocessing audio (noise reduction, normalization)
- Monitor usage and implement rate limiting for production applications
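A minimal caching sketch for the point above: key transcriptions by a hash of the audio bytes so repeated files skip the API call (the cache file and scheme are illustrative, not part of the library):

```python
import hashlib
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
CACHE_PATH = Path("transcription_cache.json")

def transcribe_cached(file_path: str) -> str:
    """Return a cached transcription if the same audio was seen before."""
    audio_bytes = Path(file_path).read_bytes()
    key = hashlib.sha256(audio_bytes).hexdigest()

    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if key in cache:
        return cache[key]  # cache hit: no API call

    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=(Path(file_path).name, audio_bytes),
    )
    cache[key] = transcription.text
    CACHE_PATH.write_text(json.dumps(cache))
    return transcription.text

print(transcribe_cached("meeting_recording.wav"))
```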