# Audio

Core audio capabilities providing text-to-speech generation, speech-to-text transcription with advanced features including speaker diarization, and audio translation to English.

## Capabilities

The Audio resource is organized into three sub-resources, each serving distinct audio processing needs:

### Speech Generation

Generate natural-sounding audio from text input with configurable voices and audio formats.

```typescript { .api }
client.audio.speech.create(params: SpeechCreateParams): Promise<Response>;
```

[Speech](#speech-sub-resource)

### Transcription

Convert audio to text with support for multiple languages, speaker diarization, streaming, and detailed metadata including timestamps and confidence scores.

```typescript { .api }
client.audio.transcriptions.create(params: TranscriptionCreateParams): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent> | string>;
```

[Transcriptions](#transcriptions-sub-resource)

### Translation

Translate audio in any language to English text with optional detailed segment information.

```typescript { .api }
client.audio.translations.create(params: TranslationCreateParams): Promise<Translation | TranslationVerbose | string>;
```

[Translations](#translations-sub-resource)
---

## Speech Sub-Resource

Text-to-speech audio generation with multiple voice options and configurable audio formats.

### Methods

#### speech.create()

Generates audio from text input with configurable voice, format, speed, and model selection.

```typescript { .api }
/**
 * Generates audio from the input text
 * @param params - Configuration for audio generation
 * @returns Response containing audio data as a binary stream
 */
speech.create(params: SpeechCreateParams): Promise<Response>;
```

**Parameters:**

```typescript { .api }
interface SpeechCreateParams {
  /** The text to generate audio for (max 4096 characters) */
  input: string;

  /** TTS model: 'tts-1', 'tts-1-hd', or 'gpt-4o-mini-tts' */
  model: SpeechModel;

  /** Voice to use */
  voice: 'alloy' | 'ash' | 'ballad' | 'cedar' | 'coral' | 'echo' | 'fable' | 'marin' | 'nova' | 'onyx' | 'sage' | 'shimmer' | 'verse';

  /** Audio format: 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm' (default: 'mp3') */
  response_format?: 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';

  /** Speed from 0.25 to 4.0 (default: 1.0) */
  speed?: number;

  /** Voice control instructions (not supported with tts-1 or tts-1-hd) */
  instructions?: string;

  /** Stream format: 'sse' or 'audio' ('sse' not supported for tts-1/tts-1-hd) */
  stream_format?: 'sse' | 'audio';
}
```

### Types

#### SpeechModel { .api }

Union type for available text-to-speech models:

```typescript { .api }
type SpeechModel = 'tts-1' | 'tts-1-hd' | 'gpt-4o-mini-tts';
```

- `tts-1` - Low-latency, natural-sounding speech, suited to real-time applications
- `tts-1-hd` - Higher-quality audio at the cost of increased latency
- `gpt-4o-mini-tts` - Latest TTS model with advanced voice control via `instructions`
### Examples

#### Basic Text-to-Speech

Generate MP3 audio (the default format) with the `alloy` voice:

```typescript
import fs from 'fs';

const response = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'alloy',
  input: 'The quick brown fox jumps over the lazy dog.',
});

// The Response body is binary audio; buffer it and write to disk
const audioBuffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('output.mp3', audioBuffer);
```
#### Different Voice Options

Generate the same text with different voices to find the best fit:

```typescript
const text = 'Welcome to our audio service.';
const voices = ['alloy', 'echo', 'sage', 'shimmer', 'nova'] as const;

for (const voice of voices) {
  const response = await client.audio.speech.create({
    model: 'tts-1-hd',
    voice: voice,
    input: text,
    response_format: 'mp3',
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(`voice_${voice}.mp3`, buffer);
}
```
#### High-Quality Audio with Custom Speed

Generate high-fidelity audio at a slower pace:

```typescript
const response = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'sage',
  input: 'This is a carefully paced announcement.',
  response_format: 'flac', // Lossless format for best quality
  speed: 0.8, // Slower than normal
});

const audioFile = await response.arrayBuffer();
fs.writeFileSync('announcement.flac', Buffer.from(audioFile));
```
#### Multiple Audio Formats

Generate audio in different formats for various use cases:

```typescript
const formats = ['mp3', 'opus', 'aac', 'wav'] as const;
const input = 'Testing different audio formats.';

for (const format of formats) {
  const response = await client.audio.speech.create({
    model: 'tts-1',
    voice: 'shimmer',
    input: input,
    response_format: format, // Already narrowed to valid literals by 'as const'
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(`output.${format}`, buffer);
}
```
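
#### Raw PCM Output

The `pcm` format returns bare samples with no container, so most players cannot open the file directly. A sketch that prepends a standard 44-byte WAV header, assuming the commonly documented 24 kHz, 16-bit, mono output (verify against the current API reference):

```typescript
const response = await client.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'Raw PCM needs a header before most players can open it.',
  response_format: 'pcm',
});
const pcm = Buffer.from(await response.arrayBuffer());

// Build a 44-byte WAV header for 24 kHz, 16-bit, mono audio
const sampleRate = 24000;
const channels = 1;
const bytesPerSample = 2;
const header = Buffer.alloc(44);
header.write('RIFF', 0);
header.writeUInt32LE(36 + pcm.length, 4);
header.write('WAVE', 8);
header.write('fmt ', 12);
header.writeUInt32LE(16, 16);                                     // fmt chunk size
header.writeUInt16LE(1, 20);                                      // audio format: PCM
header.writeUInt16LE(channels, 22);
header.writeUInt32LE(sampleRate, 24);
header.writeUInt32LE(sampleRate * channels * bytesPerSample, 28); // byte rate
header.writeUInt16LE(channels * bytesPerSample, 32);              // block align
header.writeUInt16LE(8 * bytesPerSample, 34);                     // bits per sample
header.write('data', 36);
header.writeUInt32LE(pcm.length, 40);

fs.writeFileSync('output.wav', Buffer.concat([header, pcm]));
```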
#### Voice Control with Instructions

Use advanced voice control (requires gpt-4o-mini-tts):

```typescript
const response = await client.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'sage',
  input: 'This announcement should sound urgent and professional.',
  instructions: 'Speak with urgency and authority, using a professional tone.',
  speed: 1.1,
});

const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('urgent_announcement.mp3', buffer);
```
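
#### Streaming Speech Output

`stream_format: 'audio'` exposes the audio bytes as they are generated, so writing or playback can begin before synthesis finishes. A minimal sketch, assuming Node 18+ where the returned `Response` body is an async-iterable web stream (filename illustrative):

```typescript
const response = await client.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'coral',
  input: 'Streaming playback can begin before synthesis finishes.',
  stream_format: 'audio', // raw audio chunks rather than SSE events
});

// Write each chunk as it arrives instead of buffering the whole file
const out = fs.createWriteStream('streamed.mp3');
for await (const chunk of response.body as unknown as AsyncIterable<Uint8Array>) {
  out.write(chunk);
}
out.end();
```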
---

## Transcriptions Sub-Resource

Convert audio to text with support for speaker diarization, streaming, and detailed metadata.

### Methods

#### transcriptions.create()

Transcribes audio to text with options for verbose output, diarization, and real-time streaming.

```typescript { .api }
/**
 * Transcribes audio into the input language
 * @param params - Transcription configuration
 * @returns Transcribed text or detailed transcription object, optionally streamed
 */
transcriptions.create(
  params: TranscriptionCreateParams
): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent> | string>;
```
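
The concrete return type follows from the request: `stream: true` yields the event stream, the plain-text formats (`'text'`, `'srt'`, `'vtt'`) yield a raw string, and the JSON formats yield the objects documented under Types below. When the response format is only known at runtime, the union can be narrowed by shape — a sketch (the type import path reflects the SDK layout and may vary by version):

```typescript
import type {
  Transcription,
  TranscriptionVerbose,
  TranscriptionDiarized,
} from 'openai/resources/audio/transcriptions';

function readResult(
  result: Transcription | TranscriptionVerbose | TranscriptionDiarized | string,
): string {
  if (typeof result === 'string') {
    return result; // 'text', 'srt', and 'vtt' come back as the raw document
  }
  if ('segments' in result && result.segments) {
    // verbose_json and diarized_json carry per-segment detail
    console.log(`Segments: ${result.segments.length}`);
  }
  return result.text; // every JSON variant exposes the full text
}
```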
**Parameters:**

```typescript { .api }
interface TranscriptionCreateParamsBase {
  /** Audio file to transcribe (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
  file: Uploadable;

  /** Model: 'gpt-4o-transcribe', 'gpt-4o-mini-transcribe', 'whisper-1', or 'gpt-4o-transcribe-diarize' */
  model: AudioModel;

  /** Response format: 'json', 'verbose_json', 'diarized_json', 'text', 'srt', or 'vtt' */
  response_format?: AudioResponseFormat;

  /** Enable streaming (not supported for whisper-1) */
  stream?: boolean;

  /** Language code in ISO-639-1 format (e.g., 'en', 'fr', 'es') */
  language?: string;

  /** Text to guide style or continue a previous segment */
  prompt?: string;

  /** Sampling temperature 0-1 (default: 0; at 0 the model auto-adjusts temperature based on log probability) */
  temperature?: number;

  /** Chunking strategy: 'auto' or manual VAD configuration */
  chunking_strategy?: 'auto' | VadConfig | null;

  /** Include additional information: 'logprobs' */
  include?: Array<'logprobs'>;

  /** Timestamp granularities: 'word', 'segment', or both */
  timestamp_granularities?: Array<'word' | 'segment'>;

  /** Speaker names for diarization (up to 4 speakers) */
  known_speaker_names?: Array<string>;

  /** Audio samples of known speakers (2-10 seconds each) */
  known_speaker_references?: Array<string>;
}

interface TranscriptionCreateParamsNonStreaming extends TranscriptionCreateParamsBase {
  stream?: false | null;
}

interface TranscriptionCreateParamsStreaming extends TranscriptionCreateParamsBase {
  stream: true;
}
```
**VAD Configuration:**

```typescript { .api }
interface VadConfig {
  type: 'server_vad';
  prefix_padding_ms?: number;   // Audio to include before VAD detection (default: 300ms)
  silence_duration_ms?: number; // Duration of silence that ends a segment (default: 1800ms)
  threshold?: number;           // Speech detection threshold 0.0-1.0 (default: 0.5); higher values require louder speech
}
```
### Types

#### Transcription { .api }

Basic transcription response with text content:

```typescript { .api }
interface Transcription {
  text: string;
  logprobs?: Array<Transcription.Logprob>; // Present only when include: ['logprobs'] is requested
  usage?: Transcription.Tokens | Transcription.Duration;
}
```
#### TranscriptionVerbose { .api }

Detailed transcription with timestamps, segments, and word-level timing:

```typescript { .api }
interface TranscriptionVerbose {
  text: string;
  duration: number;                        // Duration in seconds
  language: string;                        // Detected language code
  segments?: Array<TranscriptionSegment>;  // Segment details with timestamps
  words?: Array<TranscriptionWord>;        // Word-level timing information
  usage?: TranscriptionVerbose.Usage;
}
```
#### TranscriptionSegment { .api }

Individual segment with detailed timing and confidence metrics:

```typescript { .api }
interface TranscriptionSegment {
  id: number;
  start: number;             // Start time in seconds
  end: number;               // End time in seconds
  text: string;
  temperature: number;
  avg_logprob: number;       // Average log probability
  compression_ratio: number;
  no_speech_prob: number;    // Probability that the segment contains no speech
  tokens: Array<number>;
  seek: number;
}
```
#### TranscriptionWord { .api }

Word-level timing information for precise synchronization:

```typescript { .api }
interface TranscriptionWord {
  word: string;
  start: number; // Start time in seconds
  end: number;   // End time in seconds
}
```
#### TranscriptionDiarized { .api }

Speaker-identified transcription with segment attribution:

```typescript { .api }
interface TranscriptionDiarized {
  text: string;
  duration: number;
  task: 'transcribe';
  segments: Array<TranscriptionDiarizedSegment>; // Annotated with speaker labels
  usage?: TranscriptionDiarized.Tokens | TranscriptionDiarized.Duration;
}
```
#### TranscriptionDiarizedSegment { .api }

Segment with speaker identification:

```typescript { .api }
interface TranscriptionDiarizedSegment {
  id: string;
  text: string;
  start: number;   // Start time in seconds
  end: number;     // End time in seconds
  speaker: string; // Speaker label ('A', 'B', etc., or a known speaker name)
  type: 'transcript.text.segment';
}
```
#### TranscriptionStreamEvent { .api }

Union type for streaming transcription events:

```typescript { .api }
type TranscriptionStreamEvent =
  | TranscriptionTextDeltaEvent
  | TranscriptionTextDoneEvent
  | TranscriptionTextSegmentEvent;
```
#### TranscriptionTextDeltaEvent { .api }

Streaming event with incremental text:

```typescript { .api }
interface TranscriptionTextDeltaEvent {
  type: 'transcript.text.delta';
  delta: string; // Incremental text
  logprobs?: Array<TranscriptionTextDeltaEvent.Logprob>;
  segment_id?: string; // For diarized segments
}
```
#### TranscriptionTextDoneEvent { .api }

Final completion event with full transcription:

```typescript { .api }
interface TranscriptionTextDoneEvent {
  type: 'transcript.text.done';
  text: string; // Complete transcription
  logprobs?: Array<TranscriptionTextDoneEvent.Logprob>;
  usage?: TranscriptionTextDoneEvent.Usage;
}
```
#### TranscriptionTextSegmentEvent { .api }

Diarized segment completion event:

```typescript { .api }
interface TranscriptionTextSegmentEvent {
  type: 'transcript.text.segment';
  id: string;
  text: string;
  speaker: string; // Speaker label
  start: number;
  end: number;
}
```
### Examples

#### Basic Transcription

Transcribe an audio file to text:

```typescript
import fs from 'fs';

const audioFile = fs.createReadStream('speech.mp3');

const transcription = await client.audio.transcriptions.create({
  file: audioFile,
  model: 'gpt-4o-transcribe',
});

console.log('Transcribed text:', transcription.text);
```
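
The optional `usage` field reports billing detail when present. A short sketch, assuming the SDK's discriminated usage shapes (token counts for the token-based models, billed seconds for `whisper-1`):

```typescript
// usage is optional; check the discriminant before reading either shape
if (transcription.usage?.type === 'tokens') {
  console.log(`Tokens used: ${transcription.usage.total_tokens}`);
} else if (transcription.usage?.type === 'duration') {
  console.log(`Audio billed: ${transcription.usage.seconds}s`);
}
```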
#### Transcription with Language Specification

Improve accuracy by specifying the language:

```typescript
const frenchAudio = fs.createReadStream('french_speech.mp3');

const transcription = await client.audio.transcriptions.create({
  file: frenchAudio,
  model: 'gpt-4o-transcribe',
  language: 'fr', // ISO-639-1 language code
  prompt: 'This is a technical discussion about software development.', // Style guide
});

console.log('French transcription:', transcription.text);
```
#### Verbose Output with Timestamps

Get detailed segment and word-level timing information:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('podcast.mp3'),
  model: 'gpt-4o-transcribe',
  response_format: 'verbose_json',
  timestamp_granularities: ['word', 'segment'],
});

// Access word-level timing
if (transcription.words) {
  transcription.words.forEach(word => {
    console.log(`${word.word}: ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s`);
  });
}

// Access segments
if (transcription.segments) {
  transcription.segments.forEach(segment => {
    console.log(`[${segment.start.toFixed(2)}s - ${segment.end.toFixed(2)}s] ${segment.text}`);
    console.log(`  Confidence: ${(1 - segment.no_speech_prob).toFixed(3)}`);
  });
}
```
#### Speaker Diarization

Identify and separate different speakers:

```typescript
const audioFile = fs.createReadStream('multi_speaker_audio.mp3');

const diarization = await client.audio.transcriptions.create({
  file: audioFile,
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
});

// View segments with speaker identification
diarization.segments.forEach(segment => {
  console.log(`[${segment.start.toFixed(2)}s] Speaker ${segment.speaker}: ${segment.text}`);
});
```
#### Diarization with Known Speakers

Provide reference audio for known speakers to improve identification:

```typescript
const mainAudio = fs.createReadStream('meeting_recording.mp3');

const diarization = await client.audio.transcriptions.create({
  file: mainAudio,
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  known_speaker_names: ['John', 'Sarah', 'Mike'],
  known_speaker_references: [
    'data:audio/mp3;base64,//NExAAR...', // John's voice sample
    'data:audio/mp3;base64,//NExAAR...', // Sarah's voice sample
    'data:audio/mp3;base64,//NExAAR...', // Mike's voice sample
  ],
});

// Speakers are now labeled by name instead of letters
diarization.segments.forEach(segment => {
  console.log(`${segment.speaker}: "${segment.text}"`);
});
```
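
The reference samples above are base64 data URLs. A sketch of building one from a short local clip of 2-10 seconds (`voiceRef` is a hypothetical helper, not part of the SDK):

```typescript
// Hypothetical helper: encode a short local sample as a base64 data URL
function voiceRef(path: string): string {
  const b64 = fs.readFileSync(path).toString('base64');
  return `data:audio/mp3;base64,${b64}`;
}

const diarization = await client.audio.transcriptions.create({
  file: fs.createReadStream('meeting_recording.mp3'),
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  known_speaker_names: ['John', 'Sarah'],
  known_speaker_references: [voiceRef('john_sample.mp3'), voiceRef('sarah_sample.mp3')],
});
```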
#### Streaming Transcription

Receive transcription text incrementally as the model processes the audio:

```typescript
const audioStream = fs.createReadStream('live_audio.mp3');

const stream = await client.audio.transcriptions.create({
  file: audioStream,
  model: 'gpt-4o-transcribe',
  stream: true,
  response_format: 'json',
});

let fullText = '';

for await (const event of stream) {
  if (event.type === 'transcript.text.delta') {
    // Incremental text arrives
    process.stdout.write(event.delta);
    fullText += event.delta;
  } else if (event.type === 'transcript.text.done') {
    // Final complete text
    console.log('\nFinal transcription:', event.text);
  }
}
```
#### Streaming with Diarization

Real-time speaker-identified transcription:

```typescript
const audioStream = fs.createReadStream('streaming_meeting.mp3');

const stream = await client.audio.transcriptions.create({
  file: audioStream,
  model: 'gpt-4o-transcribe-diarize',
  stream: true,
  response_format: 'diarized_json',
});

const seenSpeakers = new Set<string>();

for await (const event of stream) {
  if (event.type === 'transcript.text.segment') {
    // Complete segment with speaker information
    if (!seenSpeakers.has(event.speaker)) {
      console.log(`\n[New Speaker: ${event.speaker}]`);
      seenSpeakers.add(event.speaker);
    }
    console.log(`${event.speaker} [${event.start.toFixed(2)}s-${event.end.toFixed(2)}s]: ${event.text}`);
  }
}
```
#### Confidence Scores and Quality Metrics

Analyze transcription confidence and quality:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('audio.mp3'),
  model: 'gpt-4o-transcribe',
  response_format: 'verbose_json',
  include: ['logprobs'],
});

// Analyze segment quality
if (transcription.segments) {
  transcription.segments.forEach(segment => {
    const confidence = 1 - segment.no_speech_prob;
    const quality = segment.compression_ratio < 2.4 ? 'good' : 'degraded';

    console.log(`Segment "${segment.text}"`);
    console.log(`  Confidence: ${(confidence * 100).toFixed(1)}%`);
    console.log(`  Quality: ${quality}`);
    console.log(`  Compression ratio: ${segment.compression_ratio.toFixed(3)}`);
  });
}
```
#### Custom Chunking Strategy

Configure voice activity detection for better segment boundaries:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('long_audio.mp3'),
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  chunking_strategy: {
    type: 'server_vad',
    threshold: 0.6,            // Higher threshold ignores quieter sounds (helps in noisy audio)
    silence_duration_ms: 1200, // Shorter silence window splits segments sooner
    prefix_padding_ms: 500,
  },
});

console.log('Segments:', transcription.segments.length);
```
---

## Translations Sub-Resource

Translate audio from any language to English text.

### Methods

#### translations.create()

Translates audio to English with optional detailed segment information.

```typescript { .api }
/**
 * Translates audio into English
 * @param params - Translation configuration
 * @returns Translated English text or detailed translation object
 */
translations.create(
  params: TranslationCreateParams
): Promise<Translation | TranslationVerbose | string>;
```
**Parameters:**

```typescript { .api }
interface TranslationCreateParams {
  /** Audio file to translate (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
  file: Uploadable;

  /** Model: 'whisper-1' (currently the only available translation model) */
  model: AudioModel;

  /** Response format: 'json', 'verbose_json', 'text', 'srt', or 'vtt' (default: 'json') */
  response_format?: 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt';

  /** Text to guide style or continue a previous segment (should be in English) */
  prompt?: string;

  /** Sampling temperature 0-1 (default: 0; at 0 the model auto-adjusts temperature based on log probability) */
  temperature?: number;
}
```
### Types

#### Translation { .api }

Basic translation response with English text:

```typescript { .api }
interface Translation {
  text: string;
}
```
#### TranslationVerbose { .api }

Detailed translation with segment information:

```typescript { .api }
interface TranslationVerbose {
  text: string;                            // Translated English text
  duration: number;                        // Duration in seconds
  language: string;                        // Always 'english' for output
  segments?: Array<TranscriptionSegment>;  // Segment details with timestamps
}
```
### Examples

#### Basic Translation

Translate audio to English:

```typescript
import fs from 'fs';

const spanishAudio = fs.createReadStream('spanish_interview.mp3');

const translation = await client.audio.translations.create({
  file: spanishAudio,
  model: 'whisper-1',
});

console.log('English translation:', translation.text);
```
#### Translation with Style Guidance

Guide the translation style with a prompt:

```typescript
const frenchPodcast = fs.createReadStream('french_podcast.mp3');

const translation = await client.audio.translations.create({
  file: frenchPodcast,
  model: 'whisper-1',
  prompt: 'This is a formal technical discussion. Use precise technical terminology.',
  response_format: 'json',
});

console.log('Professional translation:', translation.text);
```
#### Verbose Translation with Segments

Get segment-level information for synchronized translation display:

```typescript
const italianAudio = fs.createReadStream('italian_video.mp3');

const translation = await client.audio.translations.create({
  file: italianAudio,
  model: 'whisper-1',
  response_format: 'verbose_json',
});

console.log(`Full translation: ${translation.text}`);
console.log(`Duration: ${translation.duration} seconds`);

// Display segments with timing for subtitle generation
if (translation.segments) {
  translation.segments.forEach(segment => {
    const start = segment.start.toFixed(2);
    const end = segment.end.toFixed(2);
    console.log(`[${start}s - ${end}s] ${segment.text}`);
  });
}
```
759
760
#### Translation to Subtitle Formats
761
762
Export translations in subtitle formats for video:
763
764
```typescript
765
const germanAudio = fs.createReadStream('german_video.mp3');
766
767
// Get SRT format (SubRip)
768
const srtTranslation = await client.audio.translations.create({
769
file: germanAudio,
770
model: 'whisper-1',
771
response_format: 'srt',
772
});
773
774
fs.writeFileSync('english_subtitles.srt', srtTranslation);
775
776
// Get VTT format (WebVTT)
777
const vttTranslation = await client.audio.translations.create({
778
file: germanAudio,
779
model: 'whisper-1',
780
response_format: 'vtt',
781
});
782
783
fs.writeFileSync('english_subtitles.vtt', vttTranslation);
784
```
#### Batch Audio Translation

Translate multiple audio files:

```typescript
const files = ['french_audio.mp3', 'spanish_audio.mp3', 'german_audio.mp3'];
const translations: Record<string, string> = {};

for (const file of files) {
  const audioStream = fs.createReadStream(file);

  const result = await client.audio.translations.create({
    file: audioStream,
    model: 'whisper-1',
  });

  translations[file] = result.text;
}

// Save all translations
fs.writeFileSync('translations.json', JSON.stringify(translations, null, 2));
```
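
Each file is an independent request, so the batch can also run concurrently. The same loop with `Promise.all` — mind your account's rate limits before widening concurrency:

```typescript
const files = ['french_audio.mp3', 'spanish_audio.mp3', 'german_audio.mp3'];

// Fire all requests at once and collect [file, text] pairs
const entries = await Promise.all(
  files.map(async (file) => {
    const result = await client.audio.translations.create({
      file: fs.createReadStream(file),
      model: 'whisper-1',
    });
    return [file, result.text] as const;
  }),
);

fs.writeFileSync('translations.json', JSON.stringify(Object.fromEntries(entries), null, 2));
```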
---

## AudioModel { .api }

Supported audio models for transcription and translation:

```typescript { .api }
type AudioModel =
  | 'whisper-1'
  | 'gpt-4o-transcribe'
  | 'gpt-4o-mini-transcribe'
  | 'gpt-4o-transcribe-diarize';
```

- `whisper-1` - Reliable transcription and translation model, robust across varied audio quality
- `gpt-4o-transcribe` - Advanced transcription with improved accuracy and language detection
- `gpt-4o-mini-transcribe` - Lightweight variant for efficient transcription
- `gpt-4o-transcribe-diarize` - Transcription with speaker identification (diarization)
## AudioResponseFormat { .api }

Output format options for transcriptions and translations:

```typescript { .api }
type AudioResponseFormat =
  | 'json'
  | 'text'
  | 'srt'
  | 'verbose_json'
  | 'vtt'
  | 'diarized_json';
```

- `json` - Structured JSON response with text content (default)
- `text` - Plain text without additional metadata
- `srt` - SubRip subtitle format (timing + text; see the sketch after this list)
- `verbose_json` - Detailed JSON with segments, timing, and confidence scores
- `vtt` - WebVTT subtitle format (timing + text)
- `diarized_json` - JSON with speaker identification and segment timing
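
The subtitle formats work for transcription as well as translation, which makes same-language captions a one-call job. A minimal sketch with `whisper-1` (filenames illustrative; `'srt'` returns the subtitle file as a raw string, per the method signature above):

```typescript
const srt = await client.audio.transcriptions.create({
  file: fs.createReadStream('lecture.mp3'),
  model: 'whisper-1',
  response_format: 'srt',
});

fs.writeFileSync('lecture.srt', srt as string);
```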
---

## Common Patterns

### Error Handling

Handle common audio processing errors:

```typescript
import { BadRequestError, APIError } from 'openai';

try {
  const transcription = await client.audio.transcriptions.create({
    file: fs.createReadStream('audio.mp3'),
    model: 'gpt-4o-transcribe',
  });
} catch (error) {
  if (error instanceof BadRequestError) {
    console.error('Invalid file format or parameters:', error.message);
  } else if (error instanceof APIError) {
    console.error('API error:', error.message);
  }
}
```
### File Handling

Work with different file input types:

```typescript
import fs from 'fs';
import { toFile } from 'openai';

// From file system
const fromDisk = fs.createReadStream('audio.mp3');

// From Buffer
const buffer = await fs.promises.readFile('audio.mp3');
const fromBuffer = await toFile(buffer, 'audio.mp3', { type: 'audio/mpeg' });

// From URL (requires fetch)
const response = await fetch('https://example.com/audio.mp3');
const blob = await response.blob();
const fromUrl = await toFile(blob, 'audio.mp3', { type: 'audio/mpeg' });

// Use with any sub-resource
const transcription = await client.audio.transcriptions.create({
  file: fromBuffer,
  model: 'gpt-4o-transcribe',
});
```
### Request Options

Control request behavior and timeouts:

```typescript
const transcription = await client.audio.transcriptions.create(
  {
    file: fs.createReadStream('audio.mp3'),
    model: 'gpt-4o-transcribe',
  },
  {
    timeout: 30000, // 30 second timeout
    maxRetries: 2,
  }
);
```
### Combining Audio Operations

Chain multiple audio operations for complete audio processing:

```typescript
// 1. Transcribe audio
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('mixed_language.mp3'),
  model: 'gpt-4o-transcribe',
  response_format: 'verbose_json',
  timestamp_granularities: ['word'],
});

// 2. Translate the transcribed content using a chat completion
const translation = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Translate to Spanish:\n\n${transcription.text}`,
  }],
});

// 3. Generate speech from the translated text
const speech = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'nova',
  input: translation.choices[0].message.content || '',
});

const audioBuffer = Buffer.from(await speech.arrayBuffer());
fs.writeFileSync('translated_speech.mp3', audioBuffer);
```