The official TypeScript library for the OpenAI API
Core audio capabilities: text-to-speech generation, speech-to-text transcription (including speaker diarization), and audio translation to English.
The Audio resource is organized into three sub-resources, each serving distinct audio processing needs:
Generate natural-sounding audio from text input with configurable voices and audio formats.
client.audio.speech.create(params: SpeechCreateParams): Promise<Response>;

Convert audio to text with support for multiple languages, speaker diarization, streaming, and detailed metadata including timestamps and confidence scores.

client.audio.transcriptions.create(params: TranscriptionCreateParams): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent>>;

Translate audio in any language to English text with optional detailed segment information.

client.audio.translations.create(params: TranslationCreateParams): Promise<Translation | TranslationVerbose>;

Text-to-speech audio generation with multiple voice options and configurable audio formats.
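The parameter docs below state two hard limits: `input` is capped at 4096 characters and `speed` must fall in 0.25 to 4.0. Both can be enforced client-side before a request is made. The sketch below is illustrative and not part of the SDK; the constants simply restate the documented limits:

```typescript
// Illustrative client-side guards for the documented speech limits.
// Not part of the openai SDK; the values come from the parameter docs below.
const MAX_INPUT_CHARS = 4096; // documented input cap
const MIN_SPEED = 0.25;
const MAX_SPEED = 4.0;

// Clamp speed into the supported 0.25-4.0 range.
function clampSpeed(speed: number): number {
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

// Split long text into chunks that fit the input cap, preferring sentence
// boundaries so each chunk can be sent as its own speech request.
function chunkInput(text: string, maxChars: number = MAX_INPUT_CHARS): string[] {
  const chunks: string[] = [];
  let remaining = text.trim();
  while (remaining.length > maxChars) {
    const window = remaining.slice(0, maxChars);
    // Break after the last sentence end inside the window, if any.
    const breakAt = Math.max(
      window.lastIndexOf('. '),
      window.lastIndexOf('! '),
      window.lastIndexOf('? '),
    );
    const cut = breakAt > 0 ? breakAt + 1 : maxChars;
    chunks.push(remaining.slice(0, cut).trim());
    remaining = remaining.slice(cut).trim();
  }
  if (remaining.length > 0) chunks.push(remaining);
  return chunks;
}

console.log(clampSpeed(9)); // 4
```

Each chunk returned by `chunkInput` can then be passed as `input` to a separate `speech.create` call and the resulting audio concatenated.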
Generates audio from text input with configurable voice, format, speed, and model selection.
/**
* Generates audio from the input text
* @param params - Configuration for audio generation
* @returns Response containing audio data as binary stream
*/
speech.create(params: SpeechCreateParams): Promise<Response>;

Parameters:
interface SpeechCreateParams {
/** The text to generate audio for (max 4096 characters) */
input: string;
/** TTS model: 'tts-1', 'tts-1-hd', or 'gpt-4o-mini-tts' */
model: SpeechModel;
/** Voice to use: 'alloy', 'ash', 'ballad', 'cedar', 'coral', 'echo', 'fable', 'marin', 'nova', 'onyx', 'sage', 'shimmer', 'verse' */
voice: 'alloy' | 'ash' | 'ballad' | 'cedar' | 'coral' | 'echo' | 'fable' | 'marin' | 'nova' | 'onyx' | 'sage' | 'shimmer' | 'verse';
/** Audio format: 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm' (default: 'mp3') */
response_format?: 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';
/** Speed from 0.25 to 4.0 (default: 1.0) */
speed?: number;
/** Voice control instructions (not supported with tts-1 or tts-1-hd) */
instructions?: string;
/** Stream format: 'sse' or 'audio' ('sse' not supported for tts-1/tts-1-hd) */
stream_format?: 'sse' | 'audio';
}

Union type for available text-to-speech models:

type SpeechModel = 'tts-1' | 'tts-1-hd' | 'gpt-4o-mini-tts';

tts-1 - Low latency, natural sounding (default for real-time applications)
tts-1-hd - Higher quality audio with increased latency
gpt-4o-mini-tts - Latest TTS model with advanced voice control

Generate audio in MP3 format with the default voice:
import fs from 'fs';
const response = await client.audio.speech.create({
model: 'tts-1-hd',
voice: 'alloy',
input: 'The quick brown fox jumps over the lazy dog.',
});
const audioBuffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('output.mp3', audioBuffer);

Generate the same text with different voices to find the best fit:
const text = 'Welcome to our audio service.';
const voices = ['alloy', 'echo', 'sage', 'shimmer', 'nova'] as const;
for (const voice of voices) {
const response = await client.audio.speech.create({
model: 'tts-1-hd',
voice: voice,
input: text,
response_format: 'mp3',
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync(`voice_${voice}.mp3`, buffer);
}

Generate high-fidelity audio at a slower pace:
const response = await client.audio.speech.create({
model: 'tts-1-hd',
voice: 'sage',
input: 'This is a carefully paced announcement.',
response_format: 'flac', // Lossless format for best quality
speed: 0.8, // Slower than normal
});
const audioFile = await response.arrayBuffer();
fs.writeFileSync('announcement.flac', Buffer.from(audioFile));

Generate audio in different formats for various use cases:
const formats = ['mp3', 'opus', 'aac', 'wav'] as const;
const input = 'Testing different audio formats.';
for (const format of formats) {
const response = await client.audio.speech.create({
model: 'tts-1',
voice: 'shimmer',
input: input,
response_format: format, // no cast needed: the const tuple narrows to valid formats
});
const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync(`output.${format}`, buffer);
}

Use advanced voice control (requires gpt-4o-mini-tts):
const response = await client.audio.speech.create({
model: 'gpt-4o-mini-tts',
voice: 'sage',
input: 'This announcement should sound urgent and professional.',
instructions: 'Speak with urgency and authority, using a professional tone.',
speed: 1.1,
});
const buffer = Buffer.from(await response.arrayBuffer());

Convert audio to text with support for speaker diarization, streaming, and detailed metadata.
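The verbose and diarized responses documented below expose per-segment timing (`start`, `end`, `text`), which is enough to build subtitles locally without asking the API for `srt` output. The sketch below is illustrative, not part of the SDK; `TimedSegment` merely mirrors the timing fields of the documented segment shapes:

```typescript
// Mirrors the timing fields of the documented TranscriptionSegment shape.
// This local interface is for illustration only; it is not an SDK export.
interface TimedSegment {
  start: number; // start time in seconds
  end: number;   // end time in seconds
  text: string;
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTimestamp(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w: number) => String(n).padStart(w, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(rem, 3)}`;
}

// Build an SRT document from ordered segments.
function segmentsToSrt(segments: TimedSegment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTimestamp(seg.start)} --> ${srtTimestamp(seg.end)}\n${seg.text.trim()}\n`)
    .join('\n');
}

console.log(srtTimestamp(3725.5)); // "01:02:05,500"
```

The same helper works for `TranscriptionVerbose.segments`, and for diarized segments if the `speaker` label is folded into `text` first.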
Transcribes audio to text with options for verbose output, diarization, and real-time streaming.
/**
* Transcribes audio into the input language
* @param params - Transcription configuration
* @returns Transcribed text or detailed transcription object, optionally streamed
*/
transcriptions.create(
params: TranscriptionCreateParams
): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent> | string>;

Parameters:
interface TranscriptionCreateParamsBase {
/** Audio file to transcribe (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
file: Uploadable;
/** Model: 'gpt-4o-transcribe', 'gpt-4o-mini-transcribe', 'whisper-1', or 'gpt-4o-transcribe-diarize' */
model: AudioModel;
/** Response format: 'json', 'verbose_json', 'diarized_json', 'text', 'srt', or 'vtt' */
response_format?: AudioResponseFormat;
/** Enable streaming (not supported for whisper-1) */
stream?: boolean;
/** Language code in ISO-639-1 format (e.g., 'en', 'fr', 'es') */
language?: string;
/** Text to guide style or continue previous segment */
prompt?: string;
/** Sampling temperature 0-1 (default: 0, uses log probability) */
temperature?: number;
/** Chunking strategy: 'auto' or manual VAD configuration */
chunking_strategy?: 'auto' | VadConfig | null;
/** Include additional information: 'logprobs' */
include?: Array<'logprobs'>;
/** Timestamp granularities: 'word', 'segment', or both */
timestamp_granularities?: Array<'word' | 'segment'>;
/** Speaker names for diarization (up to 4 speakers) */
known_speaker_names?: Array<string>;
/** Audio samples of known speakers (2-10 seconds each) */
known_speaker_references?: Array<string>;
}
interface TranscriptionCreateParamsNonStreaming extends TranscriptionCreateParamsBase {
stream?: false | null;
}
interface TranscriptionCreateParamsStreaming extends TranscriptionCreateParamsBase {
stream: true;
}

VAD Configuration:
interface VadConfig {
type: 'server_vad';
prefix_padding_ms?: number; // Audio to include before VAD detection (default: 300ms)
silence_duration_ms?: number; // Duration of silence to detect stop (default: 1800ms)
threshold?: number; // Sensitivity 0.0-1.0 (default: 0.5)
}

Basic transcription response with text content:
interface Transcription {
text: string;
logprobs?: Array<Transcription.Logprob>; // Only with logprobs include
usage?: Transcription.Tokens | Transcription.Duration;
}

Detailed transcription with timestamps, segments, and word-level timing:
interface TranscriptionVerbose {
text: string;
duration: number; // Duration in seconds
language: string; // Detected language code
segments?: Array<TranscriptionSegment>; // Segment details with timestamps
words?: Array<TranscriptionWord>; // Word-level timing information
usage?: TranscriptionVerbose.Usage;
}

Individual segment with detailed timing and confidence metrics:
interface TranscriptionSegment {
id: number;
start: number; // Start time in seconds
end: number; // End time in seconds
text: string;
temperature: number;
avg_logprob: number; // Average log probability
compression_ratio: number;
no_speech_prob: number; // Probability of silence
tokens: Array<number>;
seek: number;
}

Word-level timing information for precise synchronization:
interface TranscriptionWord {
word: string;
start: number; // Start time in seconds
end: number; // End time in seconds
}

Speaker-identified transcription with segment attribution:
interface TranscriptionDiarized {
text: string;
duration: number;
task: 'transcribe';
segments: Array<TranscriptionDiarizedSegment>; // Annotated with speaker labels
usage?: TranscriptionDiarized.Tokens | TranscriptionDiarized.Duration;
}

Segment with speaker identification:
interface TranscriptionDiarizedSegment {
id: string;
text: string;
start: number; // Start time in seconds
end: number; // End time in seconds
speaker: string; // Speaker label ('A', 'B', etc., or known speaker name)
type: 'transcript.text.segment';
}

Union type for streaming transcription events:
type TranscriptionStreamEvent =
| TranscriptionTextDeltaEvent
| TranscriptionTextDoneEvent
| TranscriptionTextSegmentEvent;

Streaming event with incremental text:
interface TranscriptionTextDeltaEvent {
type: 'transcript.text.delta';
delta: string; // Incremental text
logprobs?: Array<TranscriptionTextDeltaEvent.Logprob>;
segment_id?: string; // For diarized segments
}

Final completion event with full transcription:
interface TranscriptionTextDoneEvent {
type: 'transcript.text.done';
text: string; // Complete transcription
logprobs?: Array<TranscriptionTextDoneEvent.Logprob>;
usage?: TranscriptionTextDoneEvent.Usage;
}

Diarized segment completion event:
interface TranscriptionTextSegmentEvent {
type: 'transcript.text.segment';
id: string;
text: string;
speaker: string; // Speaker label
start: number;
end: number;
}

Transcribe an audio file to text:
import fs from 'fs';
const audioFile = fs.createReadStream('speech.mp3');
const transcription = await client.audio.transcriptions.create({
file: audioFile,
model: 'gpt-4o-transcribe',
});
console.log('Transcribed text:', transcription.text);

Improve accuracy by specifying the language:
const frenchAudio = fs.createReadStream('french_speech.mp3');
const transcription = await client.audio.transcriptions.create({
file: frenchAudio,
model: 'gpt-4o-transcribe',
language: 'fr', // ISO-639-1 language code
prompt: 'This is a technical discussion about software development.', // Style guide
});
console.log('French transcription:', transcription.text);

Get detailed segment and word-level timing information:
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('podcast.mp3'),
model: 'gpt-4o-transcribe',
response_format: 'verbose_json',
timestamp_granularities: ['word', 'segment'],
});
// Access word-level timing
if (transcription.words) {
transcription.words.forEach(word => {
console.log(`${word.word}: ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s`);
});
}
// Access segments
if (transcription.segments) {
transcription.segments.forEach(segment => {
console.log(`[${segment.start.toFixed(2)}s - ${segment.end.toFixed(2)}s] ${segment.text}`);
console.log(` Confidence: ${(1 - segment.no_speech_prob).toFixed(3)}`);
});
}

Identify and separate different speakers:
const audioFile = fs.createReadStream('multi_speaker_audio.mp3');
const diarization = await client.audio.transcriptions.create({
file: audioFile,
model: 'gpt-4o-transcribe-diarize',
response_format: 'diarized_json',
});
// View segments with speaker identification
diarization.segments.forEach(segment => {
console.log(`[${segment.start.toFixed(2)}s] Speaker ${segment.speaker}: ${segment.text}`);
});

Provide reference audio for known speakers to improve identification:
const mainAudio = fs.createReadStream('meeting_recording.mp3');
const diarization = await client.audio.transcriptions.create({
file: mainAudio,
model: 'gpt-4o-transcribe-diarize',
response_format: 'diarized_json',
known_speaker_names: ['John', 'Sarah', 'Mike'],
known_speaker_references: [
'data:audio/mp3;base64,//NExAAR...', // John's voice sample
'data:audio/mp3;base64,//NExAAR...', // Sarah's voice sample
'data:audio/mp3;base64,//NExAAR...', // Mike's voice sample
],
});
// Speakers now labeled by name instead of letters
diarization.segments.forEach(segment => {
console.log(`${segment.speaker}: "${segment.text}"`);
});

Real-time transcription as audio arrives:
const audioStream = fs.createReadStream('live_audio.mp3');
const stream = await client.audio.transcriptions.create({
file: audioStream,
model: 'gpt-4o-transcribe',
stream: true,
response_format: 'json',
});
let fullText = '';
for await (const event of stream) {
if (event.type === 'transcript.text.delta') {
// Incremental text arrives
process.stdout.write(event.delta);
fullText += event.delta;
} else if (event.type === 'transcript.text.done') {
// Final complete text
console.log('\nFinal transcription:', event.text);
}
}

Real-time speaker-identified transcription:
const audioStream = fs.createReadStream('streaming_meeting.mp3');
const stream = await client.audio.transcriptions.create({
file: audioStream,
model: 'gpt-4o-transcribe-diarize',
stream: true,
response_format: 'diarized_json',
});
const speakers: { [key: string]: string } = {};
for await (const event of stream) {
if (event.type === 'transcript.text.segment') {
// Complete segment with speaker information
if (!speakers[event.speaker]) {
console.log(`\n[New Speaker: ${event.speaker}]`);
speakers[event.speaker] = event.speaker;
}
console.log(`${event.speaker} [${event.start.toFixed(2)}s-${event.end.toFixed(2)}s]: ${event.text}`);
}
}

Analyze transcription confidence and quality:
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('audio.mp3'),
model: 'gpt-4o-transcribe',
response_format: 'verbose_json',
include: ['logprobs'],
});
// Analyze segment quality
if (transcription.segments) {
transcription.segments.forEach(segment => {
const confidence = 1 - segment.no_speech_prob;
const quality = segment.compression_ratio < 2.4 ? 'good' : 'degraded';
console.log(`Segment "${segment.text}"`);
console.log(` Confidence: ${(confidence * 100).toFixed(1)}%`);
console.log(` Quality: ${quality}`);
console.log(` Compression ratio: ${segment.compression_ratio.toFixed(3)}`);
});
}

Configure voice activity detection for better segment boundaries:
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('long_audio.mp3'),
model: 'gpt-4o-transcribe-diarize',
response_format: 'diarized_json',
chunking_strategy: {
type: 'server_vad',
threshold: 0.6, // Higher threshold = less sensitive; fewer false triggers in noisy audio
silence_duration_ms: 1200, // Shorter pause = new segment
prefix_padding_ms: 500,
},
});
console.log('Segments:', transcription.segments.length);

Translate audio from any language to English text.
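This endpoint can return subtitle strings directly (`response_format: 'srt'`, shown in the examples below). When those strings need post-editing, it helps to parse them back into structured cues. The sketch below is illustrative; the `SrtCue` shape is not an SDK type, and only the SRT block layout itself is standard:

```typescript
// A minimal cue shape for parsed SRT; not part of the openai SDK.
interface SrtCue {
  index: number;
  start: string; // "HH:MM:SS,mmm"
  end: string;
  text: string;
}

// Parse an SRT document into cues. Handles the standard block layout:
// index line, timing line, one or more text lines, blank line separator.
function parseSrt(srt: string): SrtCue[] {
  return srt
    .trim()
    .split(/\r?\n\r?\n/) // blocks are separated by blank lines
    .filter(block => block.trim().length > 0)
    .map(block => {
      const lines = block.split(/\r?\n/);
      const [start, end] = lines[1].split(' --> ');
      return {
        index: Number(lines[0]),
        start: start.trim(),
        end: end.trim(),
        text: lines.slice(2).join('\n'),
      };
    });
}

const sample = '1\n00:00:00,000 --> 00:00:02,500\nHello there.\n\n2\n00:00:02,500 --> 00:00:04,000\nWelcome.';
console.log(parseSrt(sample).length); // 2
```

After editing, cues can be re-serialized in the same block layout and written back out as a subtitle file.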
Translates audio to English with optional detailed segment information.
/**
* Translates audio into English
* @param params - Translation configuration
* @returns Translated English text or detailed translation object
*/
translations.create(
params: TranslationCreateParams
): Promise<Translation | TranslationVerbose | string>;

Parameters:
interface TranslationCreateParams {
/** Audio file to translate (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
file: Uploadable;
/** Model: 'whisper-1' (currently the only available translation model) */
model: AudioModel;
/** Response format: 'json', 'verbose_json', 'text', 'srt', or 'vtt' (default: 'json') */
response_format?: 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt';
/** Text to guide style or continue previous segment (should be in English) */
prompt?: string;
/** Sampling temperature 0-1 (default: 0, uses log probability) */
temperature?: number;
}

Basic translation response with English text:
interface Translation {
text: string;
}

Detailed translation with segment information:
interface TranslationVerbose {
text: string; // Translated English text
duration: number; // Duration in seconds
language: string; // Always 'english' for output
segments?: Array<TranscriptionSegment>; // Segment details with timestamps
}

Translate audio to English:
import fs from 'fs';
const spanishAudio = fs.createReadStream('spanish_interview.mp3');
const translation = await client.audio.translations.create({
file: spanishAudio,
model: 'whisper-1',
});
console.log('English translation:', translation.text);

Guide the translation style with a prompt:
const frenchPodcast = fs.createReadStream('french_podcast.mp3');
const translation = await client.audio.translations.create({
file: frenchPodcast,
model: 'whisper-1',
prompt: 'This is a formal technical discussion. Use precise technical terminology.',
response_format: 'json',
});
console.log('Professional translation:', translation.text);

Get segment-level information for synchronized translation display:
const italianAudio = fs.createReadStream('italian_video.mp3');
const translation = await client.audio.translations.create({
file: italianAudio,
model: 'whisper-1',
response_format: 'verbose_json',
});
console.log(`Full translation: ${translation.text}`);
console.log(`Duration: ${translation.duration} seconds`);
// Display segments with timing for subtitle generation
if (translation.segments) {
translation.segments.forEach(segment => {
const start = segment.start.toFixed(2);
const end = segment.end.toFixed(2);
console.log(`[${start}s - ${end}s] ${segment.text}`);
});
}

Export translations in subtitle formats for video:
const germanAudio = fs.createReadStream('german_video.mp3');
// Get SRT format (SubRip)
const srtTranslation = await client.audio.translations.create({
file: germanAudio,
model: 'whisper-1',
response_format: 'srt',
});
fs.writeFileSync('english_subtitles.srt', srtTranslation);
// Get VTT format (WebVTT)
const vttTranslation = await client.audio.translations.create({
file: germanAudio,
model: 'whisper-1',
response_format: 'vtt',
});
fs.writeFileSync('english_subtitles.vtt', vttTranslation);

Translate multiple audio files:
const files = ['french_audio.mp3', 'spanish_audio.mp3', 'german_audio.mp3'];
const translations: Record<string, string> = {};
for (const file of files) {
const audioStream = fs.createReadStream(file);
const result = await client.audio.translations.create({
file: audioStream,
model: 'whisper-1',
});
translations[file] = result.text;
}
// Save all translations
fs.writeFileSync('translations.json', JSON.stringify(translations, null, 2));

Supported audio models for transcription and translation:
type AudioModel =
| 'whisper-1'
| 'gpt-4o-transcribe'
| 'gpt-4o-mini-transcribe'
| 'gpt-4o-transcribe-diarize';

whisper-1 - Reliable transcription and translation model, optimized for various audio qualities
gpt-4o-transcribe - Advanced transcription with improved accuracy and language detection
gpt-4o-mini-transcribe - Lightweight variant for efficient transcription
gpt-4o-transcribe-diarize - Speaker identification and diarization capabilities

Output format options for transcriptions and translations:
type AudioResponseFormat =
| 'json'
| 'text'
| 'srt'
| 'verbose_json'
| 'vtt'
| 'diarized_json';

json - Structured JSON response with text content (default)
text - Plain text without additional metadata
srt - SubRip subtitle format (timing + text)
verbose_json - Detailed JSON with segments, timing, and confidence scores
vtt - WebVTT subtitle format (timing + text)
diarized_json - JSON with speaker identification and segment timing

Handle common audio processing errors:
import { BadRequestError, APIError } from 'openai';
try {
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('audio.mp3'),
model: 'gpt-4o-transcribe',
});
} catch (error) {
if (error instanceof BadRequestError) {
console.error('Invalid file format or parameters:', error.message);
} else if (error instanceof APIError) {
console.error('API error:', error.message);
}
}

Work with different file input types:
import { toFile } from 'openai';
// From file system
const fromDisk = fs.createReadStream('audio.mp3');
// From Buffer
const buffer = await fs.promises.readFile('audio.mp3');
const fromBuffer = await toFile(buffer, 'audio.mp3', { type: 'audio/mpeg' });
// From URL (requires fetch)
const response = await fetch('https://example.com/audio.mp3');
const blob = await response.blob();
const fromUrl = await toFile(blob, 'audio.mp3', { type: 'audio/mpeg' });
// Use with any sub-resource
const transcription = await client.audio.transcriptions.create({
file: fromBuffer,
model: 'gpt-4o-transcribe',
});

Control request behavior and timeouts:
const transcription = await client.audio.transcriptions.create(
{
file: fs.createReadStream('audio.mp3'),
model: 'gpt-4o-transcribe',
},
{
timeout: 30000, // 30 second timeout
maxRetries: 2,
}
);

Chain multiple audio operations for complete audio processing:
// 1. Transcribe audio
const transcription = await client.audio.transcriptions.create({
file: fs.createReadStream('mixed_language.mp3'),
model: 'gpt-4o-transcribe',
response_format: 'verbose_json',
timestamp_granularities: ['word'],
});
// 2. Translate the transcribed content using chat completion
const translation = await client.chat.completions.create({
model: 'gpt-4o',
messages: [{
role: 'user',
content: `Translate to Spanish:\n\n${transcription.text}`,
}],
});
// 3. Generate speech from translated text
const speech = await client.audio.speech.create({
model: 'tts-1-hd',
voice: 'nova',
input: translation.choices[0].message.content || '',
});
const audioBuffer = Buffer.from(await speech.arrayBuffer());
fs.writeFileSync('translated_speech.mp3', audioBuffer);

Install with Tessl CLI
npx tessl i tessl/npm-openai