Describes: pkg:pypi/openai@1.106.x
# Audio APIs

Comprehensive audio processing including text-to-speech synthesis, speech-to-text transcription, and audio translation capabilities using Whisper and TTS models.

## Capabilities

### Text-to-Speech (TTS)

Generate high-quality audio from text input using various voice options and audio formats.

```python { .api }
def create(
    self,
    *,
    input: str,
    model: Union[str, SpeechModel],
    voice: Union[str, Literal["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"]],
    instructions: str | NotGiven = NOT_GIVEN,
    response_format: Literal["mp3", "opus", "aac", "flac", "wav", "pcm"] | NotGiven = NOT_GIVEN,
    speed: float | NotGiven = NOT_GIVEN,
    stream_format: Literal["sse", "audio"] | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> HttpxBinaryResponseContent: ...
```

Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Basic text-to-speech
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a text-to-speech example using OpenAI's API."
)

# Save to file
response.stream_to_file("speech.mp3")

# Different voices
voices = ["alloy", "ash", "ballad", "coral", "echo", "sage"]
text = "This is a voice comparison test."

for voice in voices:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input=text
    )
    response.stream_to_file(f"voice_{voice}.mp3")
    print(f"Generated audio with {voice} voice")

# High-quality TTS model
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="nova",
    input="This is high-definition text-to-speech synthesis.",
    response_format="wav"
)

response.stream_to_file("hd_speech.wav")

# Custom speed and format
response = client.audio.speech.create(
    model="tts-1",
    voice="shimmer",
    input="This speech will be faster than normal.",
    speed=1.25,  # 25% faster
    response_format="opus"
)

response.stream_to_file("fast_speech.opus")
```
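For larger outputs, the client also exposes a streaming variant that writes audio to disk as it arrives instead of buffering the whole response. A minimal sketch using the client's `with_streaming_response` helper (the exact helper surface may vary across versions):

```python
from openai import OpenAI

client = OpenAI()

# Stream the generated audio to disk chunk by chunk instead of
# holding the entire binary response in memory.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Streaming this sentence to a file chunk by chunk.",
) as response:
    response.stream_to_file("streamed_speech.mp3")
```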
### Audio Transcription

Convert audio files to text using Whisper models with support for multiple languages and formats.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: Union[str, AudioModel],
    chunking_strategy: Optional[transcription_create_params.ChunkingStrategy] | NotGiven = NOT_GIVEN,
    include: List[TranscriptionInclude] | NotGiven = NOT_GIVEN,
    language: str | NotGiven = NOT_GIVEN,
    prompt: str | NotGiven = NOT_GIVEN,
    response_format: Union[AudioResponseFormat, NotGiven] = NOT_GIVEN,
    stream: Optional[Literal[False]] | Literal[True] | NotGiven = NOT_GIVEN,
    temperature: float | NotGiven = NOT_GIVEN,
    timestamp_granularities: List[Literal["word", "segment"]] | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> str | Transcription | TranscriptionVerbose | Stream[TranscriptionStreamEvent]: ...
```

Usage examples:

```python
# Basic transcription
with open("audio_file.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

print(transcription.text)

# Specify language for better accuracy
with open("french_audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="fr"  # French
    )

print("French transcription:", transcription.text)

# Detailed transcription with timestamps
with open("interview.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Duration: {transcription.duration} seconds")
print(f"Language: {transcription.language}")

# Print segments with timestamps
for segment in transcription.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

# Print words with timestamps
for word in transcription.words:
    print(f"{word.word} ({word.start:.2f}s - {word.end:.2f}s)")

# SRT subtitle format
with open("video_audio.mp4", "rb") as audio_file:
    srt_transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )

# Save as subtitle file
with open("subtitles.srt", "w") as srt_file:
    srt_file.write(srt_transcription)

# With context prompt for technical terms
with open("technical_presentation.m4a", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="This presentation discusses machine learning, neural networks, and artificial intelligence."
    )
```
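The `stream` parameter in the signature above enables incremental results. A hedged sketch: streaming is supported by the gpt-4o transcribe models (not `whisper-1`), and the event type names used below (`transcript.text.delta`, `transcript.text.done`) reflect the API at the time of writing and may differ in your version:

```python
# Stream partial transcripts as they are produced.
# Note: whisper-1 does not support stream=True; use a gpt-4o transcribe model.
with open("audio_file.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
        stream=True,
    )

    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)  # incremental text
        elif event.type == "transcript.text.done":
            print()  # final transcript is complete
```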
### Audio Translation

Translate audio in any language to English using Whisper's translation capabilities.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: Union[str, AudioModel],
    prompt: str | NotGiven = NOT_GIVEN,
    response_format: Union[Literal["json", "text", "srt", "verbose_json", "vtt"], NotGiven] = NOT_GIVEN,
    temperature: float | NotGiven = NOT_GIVEN,
    # Use the following arguments if you need to pass additional parameters to the API that aren't available via kwargs.
    # The extra values given here take precedence over values defined on the client or passed to this method.
    extra_headers: Headers | None = None,
    extra_query: Query | None = None,
    extra_body: Body | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Translation | TranslationVerbose | str: ...
```

Usage examples:

```python
# Basic translation (any language to English)
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )

print("English translation:", translation.text)

# Translation with detailed output
with open("german_podcast.wav", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

print(f"Original language detected: {translation.language}")
print(f"Translation: {translation.text}")
print(f"Duration: {translation.duration} seconds")

# Translation for subtitles
with open("french_movie.mp4", "rb") as audio_file:
    vtt_translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="vtt"
    )

# Save VTT subtitle file
with open("english_subtitles.vtt", "w") as vtt_file:
    vtt_file.write(vtt_translation)

# Translation with context
with open("japanese_lecture.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        prompt="This is a university lecture about physics and quantum mechanics."
    )
```
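All of these endpoints are also available on `AsyncOpenAI` with the same method signatures, which is convenient for transcribing or translating many files concurrently. A minimal sketch (the file names are placeholders):

```python
import asyncio

from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def translate_file(path: str) -> str:
    # Each call awaits its own HTTP request; gather runs them concurrently.
    with open(path, "rb") as audio_file:
        translation = await async_client.audio.translations.create(
            model="whisper-1",
            file=audio_file,
        )
    return translation.text

async def main() -> None:
    paths = ["spanish_audio.mp3", "german_podcast.wav"]
    texts = await asyncio.gather(*(translate_file(p) for p in paths))
    for path, text in zip(paths, texts):
        print(f"{path}: {text}")

asyncio.run(main())
```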
### Advanced Audio Processing

Handle various audio formats, file sizes, and processing options for optimal results.

Usage examples:

```python
import os
from pathlib import Path

# Handle multiple audio formats
audio_formats = [".mp3", ".wav", ".m4a", ".flac", ".ogg"]
audio_dir = Path("audio_files/")

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() in audio_formats:
        print(f"Processing {audio_file.name}...")

        with open(audio_file, "rb") as file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=file
            )

        # Save transcription
        output_file = audio_dir / f"{audio_file.stem}_transcription.txt"
        with open(output_file, "w") as f:
            f.write(transcription.text)

# Handle large audio files (split if necessary)
def transcribe_large_audio(file_path, max_size_mb=25):
    """Transcribe audio files, splitting if they exceed the size limit."""
    file_size_mb = os.path.getsize(file_path) / (1024 * 1024)

    if file_size_mb <= max_size_mb:
        # File is small enough, transcribe directly
        with open(file_path, "rb") as audio_file:
            transcription = client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )
        return transcription.text
    else:
        print(f"File too large ({file_size_mb:.1f}MB), please split first")
        return None

# Temperature control for consistency
audio_files = ["recording1.mp3", "recording2.mp3", "recording3.mp3"]

for audio_file in audio_files:
    with open(audio_file, "rb") as file:
        # Low temperature for consistent output
        transcription = client.audio.transcriptions.create(
            model="whisper-1",
            file=file,
            temperature=0.0  # Most deterministic
        )

    print(f"{audio_file}: {transcription.text}")

# Chunked TTS for dynamically generated text
def stream_tts(text_generator, voice="alloy"):
    """Generate TTS audio chunk by chunk for dynamically produced text."""
    for text_chunk in text_generator:
        if text_chunk.strip():  # Skip empty chunks
            response = client.audio.speech.create(
                model="tts-1",
                voice=voice,
                input=text_chunk,
                response_format="mp3"
            )

            # Stream or save each chunk
            chunk_filename = f"chunk_{hash(text_chunk)}.mp3"
            response.stream_to_file(chunk_filename)

            yield chunk_filename

# Example text generator
def generate_story():
    sentences = [
        "Once upon a time, in a distant galaxy.",
        "There lived a brave astronaut named Alex.",
        "Alex discovered a mysterious planet.",
        "The planet was filled with strange creatures."
    ]
    for sentence in sentences:
        yield sentence

# Generate chunked TTS
for audio_file in stream_tts(generate_story()):
    print(f"Generated: {audio_file}")
```

### File Handling and Utilities

Efficient file management and audio processing utilities for various use cases.

```python { .api }
FileTypes = Union[
    bytes,              # Raw audio bytes
    IO[bytes],          # File-like object
    str,                # File path
    os.PathLike[str]    # Path object
]
```

Usage examples:

```python
import io
import base64
from pathlib import Path

# File path transcription
audio_path = Path("meeting_recording.wav")
with open(audio_path, "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )

# Bytes transcription
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_bytes
)

# In-memory audio processing
audio_buffer = io.BytesIO()

# Generate TTS to buffer
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="This will be stored in memory."
)

# Write to buffer
for chunk in response.iter_bytes():
    audio_buffer.write(chunk)

# Reset buffer position for reading
audio_buffer.seek(0)

# Give the buffer a filename so the server can infer the audio format
audio_buffer.name = "speech.mp3"

# Transcribe from buffer
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=audio_buffer
)

print("Round-trip transcription:", transcription.text)

# Base64 audio handling
def audio_to_base64(file_path):
    """Convert an audio file to a base64 string."""
    with open(file_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def base64_to_audio(base64_str, output_path):
    """Convert a base64 string to an audio file."""
    audio_bytes = base64.b64decode(base64_str)
    with open(output_path, "wb") as f:
        f.write(audio_bytes)

# Example usage
base64_audio = audio_to_base64("original.mp3")
base64_to_audio(base64_audio, "restored.mp3")

# Batch processing utility
def process_audio_batch(audio_files, operation="transcribe"):
    """Process multiple audio files in batch."""
    results = []

    for audio_file in audio_files:
        try:
            with open(audio_file, "rb") as file:
                if operation == "transcribe":
                    result = client.audio.transcriptions.create(
                        model="whisper-1",
                        file=file
                    )
                    results.append({
                        "file": audio_file,
                        "text": result.text
                    })
                elif operation == "translate":
                    result = client.audio.translations.create(
                        model="whisper-1",
                        file=file
                    )
                    results.append({
                        "file": audio_file,
                        "translation": result.text
                    })
        except Exception as e:
            results.append({
                "file": audio_file,
                "error": str(e)
            })

    return results

# Process multiple files
audio_files = ["file1.mp3", "file2.wav", "file3.m4a"]
batch_results = process_audio_batch(audio_files, "transcribe")

for result in batch_results:
    if "error" in result:
        print(f"Error processing {result['file']}: {result['error']}")
    else:
        print(f"{result['file']}: {result['text']}")
```
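In addition to the forms listed above, the client's `FileTypes` also accepts a `(filename, contents)` tuple, which is useful when you only have raw bytes but want to preserve the format hint the service uses. A brief sketch, assuming the tuple form supported by the openai client:

```python
# Pass raw bytes together with an explicit filename so the service
# can infer the audio format from the extension.
with open("interview.mp3", "rb") as f:
    audio_bytes = f.read()

transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=("interview.mp3", audio_bytes),
)
print(transcription.text)
```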
## Types

### Core Response Types

```python { .api }
class Transcription(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    words: Optional[List[TranscriptionWord]]
    segments: Optional[List[TranscriptionSegment]]

class TranscriptionWord(BaseModel):
    word: str
    start: float
    end: float

class TranscriptionSegment(BaseModel):
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: List[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float

class Translation(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    segments: Optional[List[TranscriptionSegment]]

class TranslationVerbose(BaseModel):
    text: str
    language: Optional[str]
    duration: Optional[float]
    segments: Optional[List[TranscriptionSegment]]

class HttpxBinaryResponseContent:
    def stream_to_file(self, file: Union[str, os.PathLike[str]]) -> None: ...
    def iter_bytes(self, chunk_size: int = 1024) -> Iterator[bytes]: ...
```

### Parameter Types

```python { .api }
# Speech synthesis parameters
SpeechCreateParams = TypedDict('SpeechCreateParams', {
    'input': Required[str],
    'model': Required[Union[str, SpeechModel]],
    'voice': Required[Union[str, AudioVoice]],
    'instructions': NotRequired[str],
    'response_format': NotRequired[AudioFormat],
    'speed': NotRequired[float],
    'stream_format': NotRequired[Literal["sse", "audio"]],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)

# Transcription parameters
TranscriptionCreateParams = TypedDict('TranscriptionCreateParams', {
    'file': Required[FileTypes],
    'model': Required[Union[str, AudioModel]],
    'chunking_strategy': NotRequired[Optional[ChunkingStrategy]],
    'include': NotRequired[List[TranscriptionInclude]],
    'language': NotRequired[str],
    'prompt': NotRequired[str],
    'response_format': NotRequired[AudioResponseFormat],
    'stream': NotRequired[bool],
    'temperature': NotRequired[float],
    'timestamp_granularities': NotRequired[List[TimestampGranularity]],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)

# Translation parameters
TranslationCreateParams = TypedDict('TranslationCreateParams', {
    'file': Required[FileTypes],
    'model': Required[Union[str, AudioModel]],
    'prompt': NotRequired[str],
    'response_format': NotRequired[AudioResponseFormat],
    'temperature': NotRequired[float],
    'extra_headers': NotRequired[Headers],
    'extra_query': NotRequired[Query],
    'extra_body': NotRequired[Body],
    'timeout': NotRequired[float],
}, total=False)
```

### Model and Format Types

```python { .api }
# TTS models
SpeechModel = Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts"]

# Audio processing models
AudioModel = Literal["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "whisper-1"]

# Voice options
AudioVoice = Literal[
    "alloy", "ash", "ballad", "coral",
    "echo", "sage", "shimmer", "verse", "marin", "cedar"
]

# Audio formats
AudioFormat = Literal["mp3", "opus", "aac", "flac", "wav", "pcm"]

# Response formats
AudioResponseFormat = Literal["json", "text", "srt", "verbose_json", "vtt"]

# Timestamp and streaming options
TimestampGranularity = Literal["word", "segment"]
TranscriptionInclude = Literal["logprobs"]

# Chunking strategy types
ChunkingStrategy = Union[Literal["auto"], Dict[str, Any]]  # server_vad object

# Streaming support
TranscriptionStreamEvent = Dict[str, Any]
Stream = Iterator[TranscriptionStreamEvent]

# File type union
FileTypes = Union[
    bytes,              # Raw bytes
    IO[bytes],          # File-like object
    str,                # File path string
    os.PathLike[str]    # Path object
]
```

### Configuration Types

```python { .api }
# Parameter ranges and limits
class AudioLimits:
    # File size limit
    max_file_size: int = 25 * 1024 * 1024  # 25MB

    # Supported input formats
    supported_formats: List[str] = [
        "flac", "m4a", "mp3", "mp4", "mpeg", "mpga",
        "oga", "ogg", "wav", "webm"
    ]

    # TTS speed range
    speed_range: Tuple[float, float] = (0.25, 4.0)

    # Temperature range
    temperature_range: Tuple[float, float] = (0.0, 1.0)

    # Max input text length for TTS
    max_tts_input: int = 4096  # characters
```
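These limits can be enforced client-side before calling the API. A minimal, illustrative sketch that splits long text at sentence boundaries so each TTS request stays under the 4096-character input limit (the `chunk_text` helper and its splitting heuristic are not part of the library):

```python
from openai import OpenAI

client = OpenAI()

MAX_TTS_INPUT = 4096  # characters, per the limits above

def chunk_text(text: str, max_chars: int = MAX_TTS_INPUT) -> list[str]:
    """Split text into chunks under max_chars, preferring sentence boundaries.

    Assumes no single sentence exceeds max_chars.
    """
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        candidate = f"{current}{sentence}. "
        if len(candidate) > max_chars and current:
            chunks.append(current.strip())
            current = f"{sentence}. "
        else:
            current = candidate
    if current.strip():
        chunks.append(current.strip())
    return chunks

long_text = "A very long document. " * 500
for i, chunk in enumerate(chunk_text(long_text)):
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=chunk,
    )
    response.stream_to_file(f"part_{i:03d}.mp3")
```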
## Best Practices

### Text-to-Speech

- Choose an appropriate voice for your use case (alloy for general use, nova for conversational)
- Use `tts-1-hd` for higher quality when latency is less important
- Adjust speed based on content type (slower for technical content)
- Break long text into chunks for better processing (see the chunking sketch above)
- Use an appropriate audio format (mp3 for web, wav for processing)

### Transcription

- Provide a language hint when known for better accuracy
- Use context prompts for technical terms or proper nouns
- Choose an appropriate response format (verbose_json for detailed analysis)
- Ensure audio quality is good (clear speech, minimal background noise)
- Split large files before uploading (25MB limit)

### Translation

- Whisper automatically detects the source language
- Works best with clear, well-enunciated speech
- Context prompts help with domain-specific terminology
- Consider transcription plus separate translation for very long content

### Performance and Cost

- Batch similar requests when possible
- Cache results for repeated content (see the caching sketch below)
- Use the appropriate model (tts-1 vs tts-1-hd) based on quality needs
- Consider preprocessing audio (noise reduction, normalization)
- Monitor usage and implement rate limiting for production applications
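A minimal caching sketch for the point above: key transcriptions by a hash of the audio bytes so repeated files skip the API call (the cache file and scheme are illustrative, not part of the library):

```python
import hashlib
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()
CACHE_PATH = Path("transcription_cache.json")

def transcribe_cached(file_path: str) -> str:
    """Return a cached transcription if the same audio was seen before."""
    audio_bytes = Path(file_path).read_bytes()
    key = hashlib.sha256(audio_bytes).hexdigest()

    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if key in cache:
        return cache[key]  # cache hit: no API call

    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=(Path(file_path).name, audio_bytes),
    )
    cache[key] = transcription.text
    CACHE_PATH.write_text(json.dumps(cache))
    return transcription.text

print(transcribe_cached("meeting_recording.wav"))
```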