# Audio

Core audio capabilities providing text-to-speech generation, speech-to-text transcription with advanced features including speaker diarization, and audio translation to English.

## Capabilities

The Audio resource is organized into three sub-resources, each serving distinct audio processing needs:

### Speech Generation

Generate natural-sounding audio from text input with configurable voices and audio formats.

```typescript { .api }
client.audio.speech.create(params: SpeechCreateParams): Promise<Response>;
```

[Speech](#speech-sub-resource)

### Transcription

Convert audio to text with support for multiple languages, speaker diarization, streaming, and detailed metadata including timestamps and confidence scores.

```typescript { .api }
client.audio.transcriptions.create(params: TranscriptionCreateParams): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent> | string>;
```

[Transcriptions](#transcriptions-sub-resource)

### Translation

Translate audio in any language to English text with optional detailed segment information.

```typescript { .api }
client.audio.translations.create(params: TranslationCreateParams): Promise<Translation | TranslationVerbose | string>;
```

[Translations](#translations-sub-resource)
---

## Speech Sub-Resource

Text-to-speech audio generation with multiple voice options and configurable audio formats.

### Methods

#### speech.create()

Generates audio from text input with configurable voice, format, speed, and model selection.

```typescript { .api }
/**
 * Generates audio from the input text
 * @param params - Configuration for audio generation
 * @returns Response containing audio data as a binary stream
 */
speech.create(params: SpeechCreateParams): Promise<Response>;
```

**Parameters:**

```typescript { .api }
interface SpeechCreateParams {
  /** The text to generate audio for (max 4096 characters) */
  input: string;

  /** TTS model: 'tts-1', 'tts-1-hd', or 'gpt-4o-mini-tts' */
  model: SpeechModel;

  /** Voice to use */
  voice: 'alloy' | 'ash' | 'ballad' | 'cedar' | 'coral' | 'echo' | 'fable' | 'marin' | 'nova' | 'onyx' | 'sage' | 'shimmer' | 'verse';

  /** Audio format: 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm' (default: 'mp3') */
  response_format?: 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';

  /** Speed from 0.25 to 4.0 (default: 1.0) */
  speed?: number;

  /** Voice control instructions (not supported with tts-1 or tts-1-hd) */
  instructions?: string;

  /** Stream format: 'sse' or 'audio' ('sse' not supported for tts-1/tts-1-hd) */
  stream_format?: 'sse' | 'audio';
}
```

### Types

#### SpeechModel { .api }

Union type for available text-to-speech models:

```typescript { .api }
type SpeechModel = 'tts-1' | 'tts-1-hd' | 'gpt-4o-mini-tts';
```

- `tts-1` - Low-latency, natural-sounding speech, suited to real-time applications
- `tts-1-hd` - Higher-quality audio at the cost of increased latency
- `gpt-4o-mini-tts` - Latest TTS model with advanced voice control via `instructions`
### Examples

#### Basic Text-to-Speech

Generate MP3 audio (the default format) with the `alloy` voice:

```typescript
import fs from 'fs';

const response = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'alloy',
  input: 'The quick brown fox jumps over the lazy dog.',
});

// The Response body is binary audio; buffer it and write to disk
const audioBuffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('output.mp3', audioBuffer);
```
#### Different Voice Options

Generate the same text with different voices to find the best fit:

```typescript
const text = 'Welcome to our audio service.';
const voices = ['alloy', 'echo', 'sage', 'shimmer', 'nova'] as const;

for (const voice of voices) {
  const response = await client.audio.speech.create({
    model: 'tts-1-hd',
    voice: voice,
    input: text,
    response_format: 'mp3',
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(`voice_${voice}.mp3`, buffer);
}
```
#### High-Quality Audio with Custom Speed

Generate high-fidelity audio at a slower pace:

```typescript
const response = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'sage',
  input: 'This is a carefully paced announcement.',
  response_format: 'flac', // Lossless format for best quality
  speed: 0.8, // Slower than normal
});

const audioFile = await response.arrayBuffer();
fs.writeFileSync('announcement.flac', Buffer.from(audioFile));
```
#### Multiple Audio Formats

Generate audio in different formats for various use cases:

```typescript
const formats = ['mp3', 'opus', 'aac', 'wav'] as const;
const input = 'Testing different audio formats.';

for (const format of formats) {
  const response = await client.audio.speech.create({
    model: 'tts-1',
    voice: 'shimmer',
    input: input,
    response_format: format, // Already narrowed to valid literals by 'as const'
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(`output.${format}`, buffer);
}
```
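
#### Raw PCM Output

The `pcm` format returns bare samples with no container, so most players cannot open the file directly. A sketch that prepends a standard 44-byte WAV header, assuming the commonly documented 24 kHz, 16-bit, mono output (verify against the current API reference):

```typescript
const response = await client.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'Raw PCM needs a header before most players can open it.',
  response_format: 'pcm',
});
const pcm = Buffer.from(await response.arrayBuffer());

// Build a 44-byte WAV header for 24 kHz, 16-bit, mono audio
const sampleRate = 24000;
const channels = 1;
const bytesPerSample = 2;
const header = Buffer.alloc(44);
header.write('RIFF', 0);
header.writeUInt32LE(36 + pcm.length, 4);
header.write('WAVE', 8);
header.write('fmt ', 12);
header.writeUInt32LE(16, 16);                                     // fmt chunk size
header.writeUInt16LE(1, 20);                                      // audio format: PCM
header.writeUInt16LE(channels, 22);
header.writeUInt32LE(sampleRate, 24);
header.writeUInt32LE(sampleRate * channels * bytesPerSample, 28); // byte rate
header.writeUInt16LE(channels * bytesPerSample, 32);              // block align
header.writeUInt16LE(8 * bytesPerSample, 34);                     // bits per sample
header.write('data', 36);
header.writeUInt32LE(pcm.length, 40);

fs.writeFileSync('output.wav', Buffer.concat([header, pcm]));
```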
#### Voice Control with Instructions

Use advanced voice control (requires gpt-4o-mini-tts):

```typescript
const response = await client.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'sage',
  input: 'This announcement should sound urgent and professional.',
  instructions: 'Speak with urgency and authority, using a professional tone.',
  speed: 1.1,
});

const buffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('urgent_announcement.mp3', buffer);
```
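
#### Streaming Speech Output

`stream_format: 'audio'` exposes the audio bytes as they are generated, so writing or playback can begin before synthesis finishes. A minimal sketch, assuming Node 18+ where the returned `Response` body is an async-iterable web stream (filename illustrative):

```typescript
const response = await client.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'coral',
  input: 'Streaming playback can begin before synthesis finishes.',
  stream_format: 'audio', // raw audio chunks rather than SSE events
});

// Write each chunk as it arrives instead of buffering the whole file
const out = fs.createWriteStream('streamed.mp3');
for await (const chunk of response.body as unknown as AsyncIterable<Uint8Array>) {
  out.write(chunk);
}
out.end();
```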
---

## Transcriptions Sub-Resource

Convert audio to text with support for speaker diarization, streaming, and detailed metadata.

### Methods

#### transcriptions.create()

Transcribes audio to text with options for verbose output, diarization, and real-time streaming.

```typescript { .api }
/**
 * Transcribes audio into the input language
 * @param params - Transcription configuration
 * @returns Transcribed text or detailed transcription object, optionally streamed
 */
transcriptions.create(
  params: TranscriptionCreateParams
): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent> | string>;
```
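
The concrete return type follows from the request: `stream: true` yields the event stream, the plain-text formats (`'text'`, `'srt'`, `'vtt'`) yield a raw string, and the JSON formats yield the objects documented under Types below. When the response format is only known at runtime, the union can be narrowed by shape — a sketch (the type import path reflects the SDK layout and may vary by version):

```typescript
import type {
  Transcription,
  TranscriptionVerbose,
  TranscriptionDiarized,
} from 'openai/resources/audio/transcriptions';

function readResult(
  result: Transcription | TranscriptionVerbose | TranscriptionDiarized | string,
): string {
  if (typeof result === 'string') {
    return result; // 'text', 'srt', and 'vtt' come back as the raw document
  }
  if ('segments' in result && result.segments) {
    // verbose_json and diarized_json carry per-segment detail
    console.log(`Segments: ${result.segments.length}`);
  }
  return result.text; // every JSON variant exposes the full text
}
```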
**Parameters:**

```typescript { .api }
interface TranscriptionCreateParamsBase {
  /** Audio file to transcribe (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
  file: Uploadable;

  /** Model: 'gpt-4o-transcribe', 'gpt-4o-mini-transcribe', 'whisper-1', or 'gpt-4o-transcribe-diarize' */
  model: AudioModel;

  /** Response format: 'json', 'verbose_json', 'diarized_json', 'text', 'srt', or 'vtt' */
  response_format?: AudioResponseFormat;

  /** Enable streaming (not supported for whisper-1) */
  stream?: boolean;

  /** Language code in ISO-639-1 format (e.g., 'en', 'fr', 'es') */
  language?: string;

  /** Text to guide style or continue a previous segment */
  prompt?: string;

  /** Sampling temperature 0-1 (default: 0; at 0 the model auto-adjusts temperature based on log probability) */
  temperature?: number;

  /** Chunking strategy: 'auto' or manual VAD configuration */
  chunking_strategy?: 'auto' | VadConfig | null;

  /** Include additional information: 'logprobs' */
  include?: Array<'logprobs'>;

  /** Timestamp granularities: 'word', 'segment', or both */
  timestamp_granularities?: Array<'word' | 'segment'>;

  /** Speaker names for diarization (up to 4 speakers) */
  known_speaker_names?: Array<string>;

  /** Audio samples of known speakers (2-10 seconds each) */
  known_speaker_references?: Array<string>;
}

interface TranscriptionCreateParamsNonStreaming extends TranscriptionCreateParamsBase {
  stream?: false | null;
}

interface TranscriptionCreateParamsStreaming extends TranscriptionCreateParamsBase {
  stream: true;
}
```
**VAD Configuration:**

```typescript { .api }
interface VadConfig {
  type: 'server_vad';
  prefix_padding_ms?: number;   // Audio to include before VAD detection (default: 300ms)
  silence_duration_ms?: number; // Duration of silence that ends a segment (default: 1800ms)
  threshold?: number;           // Speech detection threshold 0.0-1.0 (default: 0.5); higher values require louder speech
}
```
### Types

#### Transcription { .api }

Basic transcription response with text content:

```typescript { .api }
interface Transcription {
  text: string;
  logprobs?: Array<Transcription.Logprob>; // Present only when include: ['logprobs'] is requested
  usage?: Transcription.Tokens | Transcription.Duration;
}
```
#### TranscriptionVerbose { .api }

Detailed transcription with timestamps, segments, and word-level timing:

```typescript { .api }
interface TranscriptionVerbose {
  text: string;
  duration: number;                        // Duration in seconds
  language: string;                        // Detected language code
  segments?: Array<TranscriptionSegment>;  // Segment details with timestamps
  words?: Array<TranscriptionWord>;        // Word-level timing information
  usage?: TranscriptionVerbose.Usage;
}
```
#### TranscriptionSegment { .api }

Individual segment with detailed timing and confidence metrics:

```typescript { .api }
interface TranscriptionSegment {
  id: number;
  start: number;             // Start time in seconds
  end: number;               // End time in seconds
  text: string;
  temperature: number;
  avg_logprob: number;       // Average log probability
  compression_ratio: number;
  no_speech_prob: number;    // Probability that the segment contains no speech
  tokens: Array<number>;
  seek: number;
}
```
#### TranscriptionWord { .api }

Word-level timing information for precise synchronization:

```typescript { .api }
interface TranscriptionWord {
  word: string;
  start: number; // Start time in seconds
  end: number;   // End time in seconds
}
```
#### TranscriptionDiarized { .api }

Speaker-identified transcription with segment attribution:

```typescript { .api }
interface TranscriptionDiarized {
  text: string;
  duration: number;
  task: 'transcribe';
  segments: Array<TranscriptionDiarizedSegment>; // Annotated with speaker labels
  usage?: TranscriptionDiarized.Tokens | TranscriptionDiarized.Duration;
}
```
#### TranscriptionDiarizedSegment { .api }

Segment with speaker identification:

```typescript { .api }
interface TranscriptionDiarizedSegment {
  id: string;
  text: string;
  start: number;   // Start time in seconds
  end: number;     // End time in seconds
  speaker: string; // Speaker label ('A', 'B', etc., or a known speaker name)
  type: 'transcript.text.segment';
}
```
#### TranscriptionStreamEvent { .api }

Union type for streaming transcription events:

```typescript { .api }
type TranscriptionStreamEvent =
  | TranscriptionTextDeltaEvent
  | TranscriptionTextDoneEvent
  | TranscriptionTextSegmentEvent;
```
#### TranscriptionTextDeltaEvent { .api }

Streaming event with incremental text:

```typescript { .api }
interface TranscriptionTextDeltaEvent {
  type: 'transcript.text.delta';
  delta: string; // Incremental text
  logprobs?: Array<TranscriptionTextDeltaEvent.Logprob>;
  segment_id?: string; // For diarized segments
}
```
#### TranscriptionTextDoneEvent { .api }

Final completion event with full transcription:

```typescript { .api }
interface TranscriptionTextDoneEvent {
  type: 'transcript.text.done';
  text: string; // Complete transcription
  logprobs?: Array<TranscriptionTextDoneEvent.Logprob>;
  usage?: TranscriptionTextDoneEvent.Usage;
}
```
#### TranscriptionTextSegmentEvent { .api }

Diarized segment completion event:

```typescript { .api }
interface TranscriptionTextSegmentEvent {
  type: 'transcript.text.segment';
  id: string;
  text: string;
  speaker: string; // Speaker label
  start: number;
  end: number;
}
```
### Examples

#### Basic Transcription

Transcribe an audio file to text:

```typescript
import fs from 'fs';

const audioFile = fs.createReadStream('speech.mp3');

const transcription = await client.audio.transcriptions.create({
  file: audioFile,
  model: 'gpt-4o-transcribe',
});

console.log('Transcribed text:', transcription.text);
```
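
The optional `usage` field reports billing detail when present. A short sketch, assuming the SDK's discriminated usage shapes (token counts for the token-based models, billed seconds for `whisper-1`):

```typescript
// usage is optional; check the discriminant before reading either shape
if (transcription.usage?.type === 'tokens') {
  console.log(`Tokens used: ${transcription.usage.total_tokens}`);
} else if (transcription.usage?.type === 'duration') {
  console.log(`Audio billed: ${transcription.usage.seconds}s`);
}
```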
#### Transcription with Language Specification

Improve accuracy by specifying the language:

```typescript
const frenchAudio = fs.createReadStream('french_speech.mp3');

const transcription = await client.audio.transcriptions.create({
  file: frenchAudio,
  model: 'gpt-4o-transcribe',
  language: 'fr', // ISO-639-1 language code
  prompt: 'This is a technical discussion about software development.', // Style guide
});

console.log('French transcription:', transcription.text);
```
#### Verbose Output with Timestamps

Get detailed segment and word-level timing information:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('podcast.mp3'),
  model: 'gpt-4o-transcribe',
  response_format: 'verbose_json',
  timestamp_granularities: ['word', 'segment'],
});

// Access word-level timing
if (transcription.words) {
  transcription.words.forEach(word => {
    console.log(`${word.word}: ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s`);
  });
}

// Access segments
if (transcription.segments) {
  transcription.segments.forEach(segment => {
    console.log(`[${segment.start.toFixed(2)}s - ${segment.end.toFixed(2)}s] ${segment.text}`);
    console.log(`  Confidence: ${(1 - segment.no_speech_prob).toFixed(3)}`);
  });
}
```
#### Speaker Diarization

Identify and separate different speakers:

```typescript
const audioFile = fs.createReadStream('multi_speaker_audio.mp3');

const diarization = await client.audio.transcriptions.create({
  file: audioFile,
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
});

// View segments with speaker identification
diarization.segments.forEach(segment => {
  console.log(`[${segment.start.toFixed(2)}s] Speaker ${segment.speaker}: ${segment.text}`);
});
```
#### Diarization with Known Speakers

Provide reference audio for known speakers to improve identification:

```typescript
const mainAudio = fs.createReadStream('meeting_recording.mp3');

const diarization = await client.audio.transcriptions.create({
  file: mainAudio,
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  known_speaker_names: ['John', 'Sarah', 'Mike'],
  known_speaker_references: [
    'data:audio/mp3;base64,//NExAAR...', // John's voice sample
    'data:audio/mp3;base64,//NExAAR...', // Sarah's voice sample
    'data:audio/mp3;base64,//NExAAR...', // Mike's voice sample
  ],
});

// Speakers are now labeled by name instead of letters
diarization.segments.forEach(segment => {
  console.log(`${segment.speaker}: "${segment.text}"`);
});
```
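
The reference samples above are base64 data URLs. A sketch of building one from a short local clip of 2-10 seconds (`voiceRef` is a hypothetical helper, not part of the SDK):

```typescript
// Hypothetical helper: encode a short local sample as a base64 data URL
function voiceRef(path: string): string {
  const b64 = fs.readFileSync(path).toString('base64');
  return `data:audio/mp3;base64,${b64}`;
}

const diarization = await client.audio.transcriptions.create({
  file: fs.createReadStream('meeting_recording.mp3'),
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  known_speaker_names: ['John', 'Sarah'],
  known_speaker_references: [voiceRef('john_sample.mp3'), voiceRef('sarah_sample.mp3')],
});
```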
#### Streaming Transcription

Receive transcription text incrementally as the model processes the audio:

```typescript
const audioStream = fs.createReadStream('live_audio.mp3');

const stream = await client.audio.transcriptions.create({
  file: audioStream,
  model: 'gpt-4o-transcribe',
  stream: true,
  response_format: 'json',
});

let fullText = '';

for await (const event of stream) {
  if (event.type === 'transcript.text.delta') {
    // Incremental text arrives
    process.stdout.write(event.delta);
    fullText += event.delta;
  } else if (event.type === 'transcript.text.done') {
    // Final complete text
    console.log('\nFinal transcription:', event.text);
  }
}
```
#### Streaming with Diarization

Real-time speaker-identified transcription:

```typescript
const audioStream = fs.createReadStream('streaming_meeting.mp3');

const stream = await client.audio.transcriptions.create({
  file: audioStream,
  model: 'gpt-4o-transcribe-diarize',
  stream: true,
  response_format: 'diarized_json',
});

const seenSpeakers = new Set<string>();

for await (const event of stream) {
  if (event.type === 'transcript.text.segment') {
    // Complete segment with speaker information
    if (!seenSpeakers.has(event.speaker)) {
      console.log(`\n[New Speaker: ${event.speaker}]`);
      seenSpeakers.add(event.speaker);
    }
    console.log(`${event.speaker} [${event.start.toFixed(2)}s-${event.end.toFixed(2)}s]: ${event.text}`);
  }
}
```
#### Confidence Scores and Quality Metrics

Analyze transcription confidence and quality:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('audio.mp3'),
  model: 'gpt-4o-transcribe',
  response_format: 'verbose_json',
  include: ['logprobs'],
});

// Analyze segment quality
if (transcription.segments) {
  transcription.segments.forEach(segment => {
    const confidence = 1 - segment.no_speech_prob;
    const quality = segment.compression_ratio < 2.4 ? 'good' : 'degraded';

    console.log(`Segment "${segment.text}"`);
    console.log(`  Confidence: ${(confidence * 100).toFixed(1)}%`);
    console.log(`  Quality: ${quality}`);
    console.log(`  Compression ratio: ${segment.compression_ratio.toFixed(3)}`);
  });
}
```
#### Custom Chunking Strategy

Configure voice activity detection for better segment boundaries:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('long_audio.mp3'),
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  chunking_strategy: {
    type: 'server_vad',
    threshold: 0.6,            // Higher threshold ignores quieter sounds (helps in noisy audio)
    silence_duration_ms: 1200, // Shorter silence window splits segments sooner
    prefix_padding_ms: 500,
  },
});

console.log('Segments:', transcription.segments.length);
```
---

## Translations Sub-Resource

Translate audio from any language to English text.

### Methods

#### translations.create()

Translates audio to English with optional detailed segment information.

```typescript { .api }
/**
 * Translates audio into English
 * @param params - Translation configuration
 * @returns Translated English text or detailed translation object
 */
translations.create(
  params: TranslationCreateParams
): Promise<Translation | TranslationVerbose | string>;
```
**Parameters:**

```typescript { .api }
interface TranslationCreateParams {
  /** Audio file to translate (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
  file: Uploadable;

  /** Model: 'whisper-1' (currently the only available translation model) */
  model: AudioModel;

  /** Response format: 'json', 'verbose_json', 'text', 'srt', or 'vtt' (default: 'json') */
  response_format?: 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt';

  /** Text to guide style or continue a previous segment (should be in English) */
  prompt?: string;

  /** Sampling temperature 0-1 (default: 0; at 0 the model auto-adjusts temperature based on log probability) */
  temperature?: number;
}
```
### Types

#### Translation { .api }

Basic translation response with English text:

```typescript { .api }
interface Translation {
  text: string;
}
```
#### TranslationVerbose { .api }

Detailed translation with segment information:

```typescript { .api }
interface TranslationVerbose {
  text: string;                            // Translated English text
  duration: number;                        // Duration in seconds
  language: string;                        // Always 'english' for output
  segments?: Array<TranscriptionSegment>;  // Segment details with timestamps
}
```
### Examples

#### Basic Translation

Translate audio to English:

```typescript
import fs from 'fs';

const spanishAudio = fs.createReadStream('spanish_interview.mp3');

const translation = await client.audio.translations.create({
  file: spanishAudio,
  model: 'whisper-1',
});

console.log('English translation:', translation.text);
```
#### Translation with Style Guidance

Guide the translation style with a prompt:

```typescript
const frenchPodcast = fs.createReadStream('french_podcast.mp3');

const translation = await client.audio.translations.create({
  file: frenchPodcast,
  model: 'whisper-1',
  prompt: 'This is a formal technical discussion. Use precise technical terminology.',
  response_format: 'json',
});

console.log('Professional translation:', translation.text);
```
#### Verbose Translation with Segments

Get segment-level information for synchronized translation display:

```typescript
const italianAudio = fs.createReadStream('italian_video.mp3');

const translation = await client.audio.translations.create({
  file: italianAudio,
  model: 'whisper-1',
  response_format: 'verbose_json',
});

console.log(`Full translation: ${translation.text}`);
console.log(`Duration: ${translation.duration} seconds`);

// Display segments with timing for subtitle generation
if (translation.segments) {
  translation.segments.forEach(segment => {
    const start = segment.start.toFixed(2);
    const end = segment.end.toFixed(2);
    console.log(`[${start}s - ${end}s] ${segment.text}`);
  });
}
```
759
760
#### Translation to Subtitle Formats
761
762
Export translations in subtitle formats for video:
763
764
```typescript
765
const germanAudio = fs.createReadStream('german_video.mp3');
766
767
// Get SRT format (SubRip)
768
const srtTranslation = await client.audio.translations.create({
769
file: germanAudio,
770
model: 'whisper-1',
771
response_format: 'srt',
772
});
773
774
fs.writeFileSync('english_subtitles.srt', srtTranslation);
775
776
// Get VTT format (WebVTT)
777
const vttTranslation = await client.audio.translations.create({
778
file: germanAudio,
779
model: 'whisper-1',
780
response_format: 'vtt',
781
});
782
783
fs.writeFileSync('english_subtitles.vtt', vttTranslation);
784
```
#### Batch Audio Translation

Translate multiple audio files:

```typescript
const files = ['french_audio.mp3', 'spanish_audio.mp3', 'german_audio.mp3'];
const translations: Record<string, string> = {};

for (const file of files) {
  const audioStream = fs.createReadStream(file);

  const result = await client.audio.translations.create({
    file: audioStream,
    model: 'whisper-1',
  });

  translations[file] = result.text;
}

// Save all translations
fs.writeFileSync('translations.json', JSON.stringify(translations, null, 2));
```
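
Each file is an independent request, so the batch can also run concurrently. The same loop with `Promise.all` — mind your account's rate limits before widening concurrency:

```typescript
const files = ['french_audio.mp3', 'spanish_audio.mp3', 'german_audio.mp3'];

// Fire all requests at once and collect [file, text] pairs
const entries = await Promise.all(
  files.map(async (file) => {
    const result = await client.audio.translations.create({
      file: fs.createReadStream(file),
      model: 'whisper-1',
    });
    return [file, result.text] as const;
  }),
);

fs.writeFileSync('translations.json', JSON.stringify(Object.fromEntries(entries), null, 2));
```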
---

## AudioModel { .api }

Supported audio models for transcription and translation:

```typescript { .api }
type AudioModel =
  | 'whisper-1'
  | 'gpt-4o-transcribe'
  | 'gpt-4o-mini-transcribe'
  | 'gpt-4o-transcribe-diarize';
```

- `whisper-1` - Reliable transcription and translation model, robust across varied audio quality
- `gpt-4o-transcribe` - Advanced transcription with improved accuracy and language detection
- `gpt-4o-mini-transcribe` - Lightweight variant for efficient transcription
- `gpt-4o-transcribe-diarize` - Transcription with speaker identification (diarization)
## AudioResponseFormat { .api }

Output format options for transcriptions and translations:

```typescript { .api }
type AudioResponseFormat =
  | 'json'
  | 'text'
  | 'srt'
  | 'verbose_json'
  | 'vtt'
  | 'diarized_json';
```

- `json` - Structured JSON response with text content (default)
- `text` - Plain text without additional metadata
- `srt` - SubRip subtitle format (timing + text; see the sketch after this list)
- `verbose_json` - Detailed JSON with segments, timing, and confidence scores
- `vtt` - WebVTT subtitle format (timing + text)
- `diarized_json` - JSON with speaker identification and segment timing
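
The subtitle formats work for transcription as well as translation, which makes same-language captions a one-call job. A minimal sketch with `whisper-1` (filenames illustrative; `'srt'` returns the subtitle file as a raw string, per the method signature above):

```typescript
const srt = await client.audio.transcriptions.create({
  file: fs.createReadStream('lecture.mp3'),
  model: 'whisper-1',
  response_format: 'srt',
});

fs.writeFileSync('lecture.srt', srt as string);
```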
---

## Common Patterns

### Error Handling

Handle common audio processing errors:

```typescript
import { BadRequestError, APIError } from 'openai';

try {
  const transcription = await client.audio.transcriptions.create({
    file: fs.createReadStream('audio.mp3'),
    model: 'gpt-4o-transcribe',
  });
} catch (error) {
  if (error instanceof BadRequestError) {
    console.error('Invalid file format or parameters:', error.message);
  } else if (error instanceof APIError) {
    console.error('API error:', error.message);
  }
}
```
### File Handling

Work with different file input types:

```typescript
import fs from 'fs';
import { toFile } from 'openai';

// From file system
const fromDisk = fs.createReadStream('audio.mp3');

// From Buffer
const buffer = await fs.promises.readFile('audio.mp3');
const fromBuffer = await toFile(buffer, 'audio.mp3', { type: 'audio/mpeg' });

// From URL (requires fetch)
const response = await fetch('https://example.com/audio.mp3');
const blob = await response.blob();
const fromUrl = await toFile(blob, 'audio.mp3', { type: 'audio/mpeg' });

// Use with any sub-resource
const transcription = await client.audio.transcriptions.create({
  file: fromBuffer,
  model: 'gpt-4o-transcribe',
});
```
### Request Options

Control request behavior and timeouts:

```typescript
const transcription = await client.audio.transcriptions.create(
  {
    file: fs.createReadStream('audio.mp3'),
    model: 'gpt-4o-transcribe',
  },
  {
    timeout: 30000, // 30 second timeout
    maxRetries: 2,
  }
);
```
### Combining Audio Operations

Chain multiple audio operations for complete audio processing:

```typescript
// 1. Transcribe audio
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('mixed_language.mp3'),
  model: 'gpt-4o-transcribe',
  response_format: 'verbose_json',
  timestamp_granularities: ['word'],
});

// 2. Translate the transcribed content using a chat completion
const translation = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Translate to Spanish:\n\n${transcription.text}`,
  }],
});

// 3. Generate speech from the translated text
const speech = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'nova',
  input: translation.choices[0].message.content || '',
});

const audioBuffer = Buffer.from(await speech.arrayBuffer());
fs.writeFileSync('translated_speech.mp3', audioBuffer);
```