# Audio

Core audio capabilities covering text-to-speech generation, speech-to-text transcription with advanced features such as speaker diarization, and translation of audio into English.
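
The examples on this page assume an initialized `client`. A minimal setup sketch (the SDK reads the API key from the `OPENAI_API_KEY` environment variable by default):

```typescript
import OpenAI from 'openai';

// API key is read from the OPENAI_API_KEY environment variable by default
const client = new OpenAI();
```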

## Capabilities

The Audio resource is organized into three sub-resources, each serving distinct audio processing needs:

### Speech Generation

Generate natural-sounding audio from text input with configurable voices and audio formats.

```typescript { .api }
client.audio.speech.create(params: SpeechCreateParams): Promise<Response>;
```

[Speech](#speech-sub-resource)

### Transcription

Convert audio to text with support for multiple languages, speaker diarization, streaming, and detailed metadata including timestamps and confidence scores.

```typescript { .api }
client.audio.transcriptions.create(params: TranscriptionCreateParams): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent>>;
```

[Transcriptions](#transcriptions-sub-resource)

### Translation

Translate audio in any language to English text with optional detailed segment information.

```typescript { .api }
client.audio.translations.create(params: TranslationCreateParams): Promise<Translation | TranslationVerbose>;
```

[Translations](#translations-sub-resource)

---

## Speech Sub-Resource

Text-to-speech audio generation with multiple voice options and configurable audio formats.

### Methods

#### speech.create()

Generates audio from text input with configurable voice, format, speed, and model selection.

```typescript { .api }
/**
 * Generates audio from the input text
 * @param params - Configuration for audio generation
 * @returns Response containing audio data as binary stream
 */
speech.create(params: SpeechCreateParams): Promise<Response>;
```

**Parameters:**

```typescript { .api }
interface SpeechCreateParams {
  /** The text to generate audio for (max 4096 characters) */
  input: string;

  /** TTS model: 'tts-1', 'tts-1-hd', or 'gpt-4o-mini-tts' */
  model: SpeechModel;

  /** Voice to use: 'alloy', 'ash', 'ballad', 'cedar', 'coral', 'echo', 'fable', 'marin', 'nova', 'onyx', 'sage', 'shimmer', 'verse' */
  voice: 'alloy' | 'ash' | 'ballad' | 'cedar' | 'coral' | 'echo' | 'fable' | 'marin' | 'nova' | 'onyx' | 'sage' | 'shimmer' | 'verse';

  /** Audio format: 'mp3', 'opus', 'aac', 'flac', 'wav', 'pcm' (default: 'mp3') */
  response_format?: 'mp3' | 'opus' | 'aac' | 'flac' | 'wav' | 'pcm';

  /** Speed from 0.25 to 4.0 (default: 1.0) */
  speed?: number;

  /** Voice control instructions (not supported with tts-1 or tts-1-hd) */
  instructions?: string;

  /** Stream format: 'sse' or 'audio' ('sse' not supported for tts-1/tts-1-hd) */
  stream_format?: 'sse' | 'audio';
}
```

### Types

#### SpeechModel { .api }

Union type for available text-to-speech models:

```typescript { .api }
type SpeechModel = 'tts-1' | 'tts-1-hd' | 'gpt-4o-mini-tts';
```

- `tts-1` - Low latency, natural sounding (the default choice for real-time applications)
- `tts-1-hd` - Higher quality audio with increased latency
- `gpt-4o-mini-tts` - Latest TTS model with advanced voice control

### Examples

#### Basic Text-to-Speech

Generate MP3 audio (the default format) with the `alloy` voice:

```typescript
import fs from 'fs';

const response = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'alloy',
  input: 'The quick brown fox jumps over the lazy dog.',
});

const audioBuffer = Buffer.from(await response.arrayBuffer());
fs.writeFileSync('output.mp3', audioBuffer);
```

#### Different Voice Options

Generate the same text with different voices to find the best fit:

```typescript
const text = 'Welcome to our audio service.';
const voices = ['alloy', 'echo', 'sage', 'shimmer', 'nova'] as const;

for (const voice of voices) {
  const response = await client.audio.speech.create({
    model: 'tts-1-hd',
    voice: voice,
    input: text,
    response_format: 'mp3',
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(`voice_${voice}.mp3`, buffer);
}
```

#### High-Quality Audio with Custom Speed

Generate high-fidelity audio at a slower pace:

```typescript
const response = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'sage',
  input: 'This is a carefully paced announcement.',
  response_format: 'flac', // Lossless format for best quality
  speed: 0.8, // Slower than normal
});

const audioFile = await response.arrayBuffer();
fs.writeFileSync('announcement.flac', Buffer.from(audioFile));
```

#### Multiple Audio Formats

Generate audio in different formats for various use cases:

```typescript
const formats = ['mp3', 'opus', 'aac', 'wav'] as const;
const input = 'Testing different audio formats.';

for (const format of formats) {
  const response = await client.audio.speech.create({
    model: 'tts-1',
    voice: 'shimmer',
    input: input,
    response_format: format,
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(`output.${format}`, buffer);
}
```

#### Voice Control with Instructions

Use advanced voice control (requires gpt-4o-mini-tts):

```typescript
const response = await client.audio.speech.create({
  model: 'gpt-4o-mini-tts',
  voice: 'sage',
  input: 'This announcement should sound urgent and professional.',
  instructions: 'Speak with urgency and authority, using a professional tone.',
  speed: 1.1,
});

const buffer = Buffer.from(await response.arrayBuffer());
```
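
#### Streaming the Audio Response to Disk

The examples above buffer the whole audio payload in memory. Because `speech.create()` resolves to a standard `Response`, its `body` can also be consumed as a stream. A minimal sketch, assuming a Node 18+ runtime where `Readable.fromWeb()` is available (the cast works around the DOM vs. `node:stream/web` typing mismatch):

```typescript
import fs from 'fs';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';
import type { ReadableStream } from 'stream/web';

const response = await client.audio.speech.create({
  model: 'tts-1',
  voice: 'alloy',
  input: 'Streaming this audio straight to disk avoids buffering it in memory.',
});

// response.body is a web ReadableStream; bridge it to a Node stream and pipe it to a file
await pipeline(
  Readable.fromWeb(response.body as unknown as ReadableStream),
  fs.createWriteStream('streamed_output.mp3'),
);
```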

---

## Transcriptions Sub-Resource

Convert audio to text with support for speaker diarization, streaming, and detailed metadata.

### Methods

#### transcriptions.create()

Transcribes audio to text with options for verbose output, diarization, and real-time streaming.

```typescript { .api }
/**
 * Transcribes audio into the input language
 * @param params - Transcription configuration
 * @returns Transcribed text or detailed transcription object, optionally streamed
 */
transcriptions.create(
  params: TranscriptionCreateParams
): Promise<Transcription | TranscriptionVerbose | TranscriptionDiarized | Stream<TranscriptionStreamEvent> | string>;
```

**Parameters:**

```typescript { .api }
interface TranscriptionCreateParamsBase {
  /** Audio file to transcribe (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
  file: Uploadable;

  /** Model: 'gpt-4o-transcribe', 'gpt-4o-mini-transcribe', 'whisper-1', or 'gpt-4o-transcribe-diarize' */
  model: AudioModel;

  /** Response format: 'json', 'verbose_json', 'diarized_json', 'text', 'srt', or 'vtt' */
  response_format?: AudioResponseFormat;

  /** Enable streaming (not supported for whisper-1) */
  stream?: boolean;

  /** Language code in ISO-639-1 format (e.g., 'en', 'fr', 'es') */
  language?: string;

  /** Text to guide style or continue previous segment */
  prompt?: string;

  /** Sampling temperature 0-1 (default: 0; at 0 the model auto-tunes temperature using log probabilities) */
  temperature?: number;

  /** Chunking strategy: 'auto' or manual VAD configuration */
  chunking_strategy?: 'auto' | VadConfig | null;

  /** Include additional information: 'logprobs' */
  include?: Array<'logprobs'>;

  /** Timestamp granularities: 'word', 'segment', or both */
  timestamp_granularities?: Array<'word' | 'segment'>;

  /** Speaker names for diarization (up to 4 speakers) */
  known_speaker_names?: Array<string>;

  /** Audio samples of known speakers (2-10 seconds each) */
  known_speaker_references?: Array<string>;
}

interface TranscriptionCreateParamsNonStreaming extends TranscriptionCreateParamsBase {
  stream?: false | null;
}

interface TranscriptionCreateParamsStreaming extends TranscriptionCreateParamsBase {
  stream: true;
}
```

**VAD Configuration:**

```typescript { .api }
interface VadConfig {
  type: 'server_vad';
  prefix_padding_ms?: number; // Audio to include before VAD detection (default: 300ms)
  silence_duration_ms?: number; // Duration of silence to detect stop (default: 1800ms)
  threshold?: number; // Activation threshold 0.0-1.0; higher values require louder speech (default: 0.5)
}
```

### Types

#### Transcription { .api }

Basic transcription response with text content:

```typescript { .api }
interface Transcription {
  text: string;
  logprobs?: Array<Transcription.Logprob>; // Only with logprobs include
  usage?: Transcription.Tokens | Transcription.Duration;
}
```

#### TranscriptionVerbose { .api }

Detailed transcription with timestamps, segments, and word-level timing:

```typescript { .api }
interface TranscriptionVerbose {
  text: string;
  duration: number; // Duration in seconds
  language: string; // Detected language code
  segments?: Array<TranscriptionSegment>; // Segment details with timestamps
  words?: Array<TranscriptionWord>; // Word-level timing information
  usage?: TranscriptionVerbose.Usage;
}
```

#### TranscriptionSegment { .api }

Individual segment with detailed timing and confidence metrics:

```typescript { .api }
interface TranscriptionSegment {
  id: number;
  start: number; // Start time in seconds
  end: number; // End time in seconds
  text: string;
  temperature: number;
  avg_logprob: number; // Average log probability
  compression_ratio: number;
  no_speech_prob: number; // Probability of silence
  tokens: Array<number>;
  seek: number;
}
```

#### TranscriptionWord { .api }

Word-level timing information for precise synchronization:

```typescript { .api }
interface TranscriptionWord {
  word: string;
  start: number; // Start time in seconds
  end: number; // End time in seconds
}
```

#### TranscriptionDiarized { .api }

Speaker-identified transcription with segment attribution:

```typescript { .api }
interface TranscriptionDiarized {
  text: string;
  duration: number;
  task: 'transcribe';
  segments: Array<TranscriptionDiarizedSegment>; // Annotated with speaker labels
  usage?: TranscriptionDiarized.Tokens | TranscriptionDiarized.Duration;
}
```

#### TranscriptionDiarizedSegment { .api }

Segment with speaker identification:

```typescript { .api }
interface TranscriptionDiarizedSegment {
  id: string;
  text: string;
  start: number; // Start time in seconds
  end: number; // End time in seconds
  speaker: string; // Speaker label ('A', 'B', etc., or known speaker name)
  type: 'transcript.text.segment';
}
```

#### TranscriptionStreamEvent { .api }

Union type for streaming transcription events:

```typescript { .api }
type TranscriptionStreamEvent =
  | TranscriptionTextDeltaEvent
  | TranscriptionTextDoneEvent
  | TranscriptionTextSegmentEvent;
```

#### TranscriptionTextDeltaEvent { .api }

Streaming event with incremental text:

```typescript { .api }
interface TranscriptionTextDeltaEvent {
  type: 'transcript.text.delta';
  delta: string; // Incremental text
  logprobs?: Array<TranscriptionTextDeltaEvent.Logprob>;
  segment_id?: string; // For diarized segments
}
```

#### TranscriptionTextDoneEvent { .api }

Final completion event with full transcription:

```typescript { .api }
interface TranscriptionTextDoneEvent {
  type: 'transcript.text.done';
  text: string; // Complete transcription
  logprobs?: Array<TranscriptionTextDoneEvent.Logprob>;
  usage?: TranscriptionTextDoneEvent.Usage;
}
```

#### TranscriptionTextSegmentEvent { .api }

Diarized segment completion event:

```typescript { .api }
interface TranscriptionTextSegmentEvent {
  type: 'transcript.text.segment';
  id: string;
  text: string;
  speaker: string; // Speaker label
  start: number;
  end: number;
}
```

### Examples

#### Basic Transcription

Transcribe an audio file to text:

```typescript
import fs from 'fs';

const audioFile = fs.createReadStream('speech.mp3');

const transcription = await client.audio.transcriptions.create({
  file: audioFile,
  model: 'gpt-4o-transcribe',
});

console.log('Transcribed text:', transcription.text);
```

#### Transcription with Language Specification

Improve accuracy by specifying the language:

```typescript
const frenchAudio = fs.createReadStream('french_speech.mp3');

const transcription = await client.audio.transcriptions.create({
  file: frenchAudio,
  model: 'gpt-4o-transcribe',
  language: 'fr', // ISO-639-1 language code
  prompt: 'This is a technical discussion about software development.', // Style guide
});

console.log('French transcription:', transcription.text);
```

#### Verbose Output with Timestamps

Get detailed segment and word-level timing information:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('podcast.mp3'),
  model: 'whisper-1', // verbose_json and timestamp_granularities require whisper-1
  response_format: 'verbose_json',
  timestamp_granularities: ['word', 'segment'],
});

// Access word-level timing
if (transcription.words) {
  transcription.words.forEach(word => {
    console.log(`${word.word}: ${word.start.toFixed(2)}s - ${word.end.toFixed(2)}s`);
  });
}

// Access segments
if (transcription.segments) {
  transcription.segments.forEach(segment => {
    console.log(`[${segment.start.toFixed(2)}s - ${segment.end.toFixed(2)}s] ${segment.text}`);
    console.log(` Confidence: ${(1 - segment.no_speech_prob).toFixed(3)}`);
  });
}
```

#### Speaker Diarization

Identify and separate different speakers:

```typescript
const audioFile = fs.createReadStream('multi_speaker_audio.mp3');

const diarization = await client.audio.transcriptions.create({
  file: audioFile,
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
});

// View segments with speaker identification
diarization.segments.forEach(segment => {
  console.log(`[${segment.start.toFixed(2)}s] Speaker ${segment.speaker}: ${segment.text}`);
});
```

#### Diarization with Known Speakers

Provide reference audio for known speakers to improve identification:

```typescript
const mainAudio = fs.createReadStream('meeting_recording.mp3');

const diarization = await client.audio.transcriptions.create({
  file: mainAudio,
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  known_speaker_names: ['John', 'Sarah', 'Mike'],
  known_speaker_references: [
    'data:audio/mp3;base64,//NExAAR...', // John's voice sample
    'data:audio/mp3;base64,//NExAAR...', // Sarah's voice sample
    'data:audio/mp3;base64,//NExAAR...', // Mike's voice sample
  ],
});

// Speakers now labeled by name instead of letters
diarization.segments.forEach(segment => {
  console.log(`${segment.speaker}: "${segment.text}"`);
});
```

#### Streaming Transcription

Real-time transcription as audio arrives:

```typescript
const audioStream = fs.createReadStream('live_audio.mp3');

const stream = await client.audio.transcriptions.create({
  file: audioStream,
  model: 'gpt-4o-transcribe',
  stream: true,
  response_format: 'json',
});

let fullText = '';

for await (const event of stream) {
  if (event.type === 'transcript.text.delta') {
    // Incremental text arrives
    process.stdout.write(event.delta);
    fullText += event.delta;
  } else if (event.type === 'transcript.text.done') {
    // Final complete text
    console.log('\nFinal transcription:', event.text);
  }
}
```

#### Streaming with Diarization

Real-time speaker-identified transcription:

```typescript
const audioStream = fs.createReadStream('streaming_meeting.mp3');

const stream = await client.audio.transcriptions.create({
  file: audioStream,
  model: 'gpt-4o-transcribe-diarize',
  stream: true,
  response_format: 'diarized_json',
});

const speakers: { [key: string]: string } = {};

for await (const event of stream) {
  if (event.type === 'transcript.text.segment') {
    // Complete segment with speaker information
    if (!speakers[event.speaker]) {
      console.log(`\n[New Speaker: ${event.speaker}]`);
      speakers[event.speaker] = event.speaker;
    }
    console.log(`${event.speaker} [${event.start.toFixed(2)}s-${event.end.toFixed(2)}s]: ${event.text}`);
  }
}
```

#### Confidence Scores and Quality Metrics

Analyze transcription confidence and quality:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('audio.mp3'),
  model: 'whisper-1', // segment-level metrics come from whisper-1 verbose_json output
  response_format: 'verbose_json',
});

// Analyze segment quality
if (transcription.segments) {
  transcription.segments.forEach(segment => {
    const confidence = 1 - segment.no_speech_prob;
    const quality = segment.compression_ratio < 2.4 ? 'good' : 'degraded';

    console.log(`Segment "${segment.text}"`);
    console.log(` Confidence: ${(confidence * 100).toFixed(1)}%`);
    console.log(` Quality: ${quality}`);
    console.log(` Compression ratio: ${segment.compression_ratio.toFixed(3)}`);
  });
}
```

#### Custom Chunking Strategy

Configure voice activity detection for better segment boundaries:

```typescript
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('long_audio.mp3'),
  model: 'gpt-4o-transcribe-diarize',
  response_format: 'diarized_json',
  chunking_strategy: {
    type: 'server_vad',
    threshold: 0.6, // Higher threshold requires louder speech; useful in noisy environments
    silence_duration_ms: 1200, // Close a chunk after 1.2s of silence
    prefix_padding_ms: 500,
  },
});

console.log('Segments:', transcription.segments.length);
```

---

## Translations Sub-Resource

Translate audio from any language to English text.

### Methods

#### translations.create()

Translates audio to English with optional detailed segment information.

```typescript { .api }
/**
 * Translates audio into English
 * @param params - Translation configuration
 * @returns Translated English text or detailed translation object
 */
translations.create(
  params: TranslationCreateParams
): Promise<Translation | TranslationVerbose | string>;
```

**Parameters:**

```typescript { .api }
interface TranslationCreateParams {
  /** Audio file to translate (flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm) */
  file: Uploadable;

  /** Model: 'whisper-1' (currently the only available translation model) */
  model: AudioModel;

  /** Response format: 'json', 'verbose_json', 'text', 'srt', or 'vtt' (default: 'json') */
  response_format?: 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt';

  /** Text to guide style or continue previous segment (should be in English) */
  prompt?: string;

  /** Sampling temperature 0-1 (default: 0; at 0 the model auto-tunes temperature using log probabilities) */
  temperature?: number;
}
```

### Types

#### Translation { .api }

Basic translation response with English text:

```typescript { .api }
interface Translation {
  text: string;
}
```

#### TranslationVerbose { .api }

Detailed translation with segment information:

```typescript { .api }
interface TranslationVerbose {
  text: string; // Translated English text
  duration: number; // Duration in seconds
  language: string; // Always 'english' for output
  segments?: Array<TranscriptionSegment>; // Segment details with timestamps
}
```

### Examples

#### Basic Translation

Translate audio to English:

```typescript
import fs from 'fs';

const spanishAudio = fs.createReadStream('spanish_interview.mp3');

const translation = await client.audio.translations.create({
  file: spanishAudio,
  model: 'whisper-1',
});

console.log('English translation:', translation.text);
```

#### Translation with Style Guidance

Guide the translation style with a prompt:

```typescript
const frenchPodcast = fs.createReadStream('french_podcast.mp3');

const translation = await client.audio.translations.create({
  file: frenchPodcast,
  model: 'whisper-1',
  prompt: 'This is a formal technical discussion. Use precise technical terminology.',
  response_format: 'json',
});

console.log('Professional translation:', translation.text);
```

#### Verbose Translation with Segments

Get segment-level information for synchronized translation display:

```typescript
const italianAudio = fs.createReadStream('italian_video.mp3');

const translation = await client.audio.translations.create({
  file: italianAudio,
  model: 'whisper-1',
  response_format: 'verbose_json',
});

console.log(`Full translation: ${translation.text}`);
console.log(`Duration: ${translation.duration} seconds`);

// Display segments with timing for subtitle generation
if (translation.segments) {
  translation.segments.forEach(segment => {
    const start = segment.start.toFixed(2);
    const end = segment.end.toFixed(2);
    console.log(`[${start}s - ${end}s] ${segment.text}`);
  });
}
```

#### Translation to Subtitle Formats

Export translations in subtitle formats for video:

```typescript
const germanAudio = fs.createReadStream('german_video.mp3');

// Get SRT format (SubRip)
const srtTranslation = await client.audio.translations.create({
  file: germanAudio,
  model: 'whisper-1',
  response_format: 'srt',
});

fs.writeFileSync('english_subtitles.srt', srtTranslation);

// Get VTT format (WebVTT)
const vttTranslation = await client.audio.translations.create({
  file: fs.createReadStream('german_video.mp3'), // read streams are consumed; open a fresh one per request
  model: 'whisper-1',
  response_format: 'vtt',
});

fs.writeFileSync('english_subtitles.vtt', vttTranslation);
```

#### Batch Audio Translation

Translate multiple audio files:

```typescript
const files = ['french_audio.mp3', 'spanish_audio.mp3', 'german_audio.mp3'];
const translations: Record<string, string> = {};

for (const file of files) {
  const audioStream = fs.createReadStream(file);

  const result = await client.audio.translations.create({
    file: audioStream,
    model: 'whisper-1',
  });

  translations[file] = result.text;
}

// Save all translations
fs.writeFileSync('translations.json', JSON.stringify(translations, null, 2));
```

---

## AudioModel { .api }

Supported audio models for transcription and translation:

```typescript { .api }
type AudioModel =
  | 'whisper-1'
  | 'gpt-4o-transcribe'
  | 'gpt-4o-mini-transcribe'
  | 'gpt-4o-transcribe-diarize';
```

- `whisper-1` - Reliable transcription and translation model, optimized for various audio qualities
- `gpt-4o-transcribe` - Advanced transcription with improved accuracy and language detection
- `gpt-4o-mini-transcribe` - Lightweight variant for efficient transcription
- `gpt-4o-transcribe-diarize` - Speaker identification and diarization capabilities

## AudioResponseFormat { .api }

Output format options for transcriptions and translations:

```typescript { .api }
type AudioResponseFormat =
  | 'json'
  | 'text'
  | 'srt'
  | 'verbose_json'
  | 'vtt'
  | 'diarized_json';
```

- `json` - Structured JSON response with text content (default)
- `text` - Plain text without additional metadata
- `srt` - SubRip subtitle format (timing + text)
- `verbose_json` - Detailed JSON with segments, timing, and confidence scores
- `vtt` - WebVTT subtitle format (timing + text)
- `diarized_json` - JSON with speaker identification and segment timing
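
The subtitle formats apply to transcriptions as well as translations. A minimal sketch using `whisper-1`; the `srt` and `vtt` formats come back as plain subtitle text rather than a JSON object, hence the cast against the union return type:

```typescript
import fs from 'fs';

// Request SubRip output directly from the transcription endpoint
const srt = await client.audio.transcriptions.create({
  file: fs.createReadStream('podcast.mp3'),
  model: 'whisper-1',
  response_format: 'srt',
});

// Subtitle responses are plain strings, ready to write to disk
fs.writeFileSync('podcast.srt', srt as string);
```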

---

## Common Patterns

### Error Handling

Handle common audio processing errors:

```typescript
import { BadRequestError, APIError } from 'openai';

try {
  const transcription = await client.audio.transcriptions.create({
    file: fs.createReadStream('audio.mp3'),
    model: 'gpt-4o-transcribe',
  });
} catch (error) {
  if (error instanceof BadRequestError) {
    console.error('Invalid file format or parameters:', error.message);
  } else if (error instanceof APIError) {
    console.error('API error:', error.message);
  }
}
```

### File Handling

Work with different file input types:

```typescript
import { toFile } from 'openai';

// From file system
const fromDisk = fs.createReadStream('audio.mp3');

// From Buffer
const buffer = await fs.promises.readFile('audio.mp3');
const fromBuffer = await toFile(buffer, 'audio.mp3', { type: 'audio/mpeg' });

// From URL (requires fetch)
const response = await fetch('https://example.com/audio.mp3');
const blob = await response.blob();
const fromUrl = await toFile(blob, 'audio.mp3', { type: 'audio/mpeg' });

// Use with any sub-resource
const transcription = await client.audio.transcriptions.create({
  file: fromBuffer,
  model: 'gpt-4o-transcribe',
});
```

### Request Options

Control request behavior and timeouts:

```typescript
const transcription = await client.audio.transcriptions.create(
  {
    file: fs.createReadStream('audio.mp3'),
    model: 'gpt-4o-transcribe',
  },
  {
    timeout: 30000, // 30 second timeout
    maxRetries: 2,
  }
);
```

### Combining Audio Operations

Chain multiple audio operations for complete audio processing:

```typescript
// 1. Transcribe audio
const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream('mixed_language.mp3'),
  model: 'whisper-1', // verbose_json and timestamp_granularities require whisper-1
  response_format: 'verbose_json',
  timestamp_granularities: ['word'],
});

// 2. Translate the transcribed content using chat completion
const translation = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{
    role: 'user',
    content: `Translate to Spanish:\n\n${transcription.text}`,
  }],
});

// 3. Generate speech from translated text
const speech = await client.audio.speech.create({
  model: 'tts-1-hd',
  voice: 'nova',
  input: translation.choices[0].message.content || '',
});

const audioBuffer = Buffer.from(await speech.arrayBuffer());
fs.writeFileSync('translated_speech.mp3', audioBuffer);
```
949