# Audio

Convert audio to text (transcription and translation) and text to speech using Whisper and TTS models. Supports multiple audio formats and languages.

## Capabilities

### Transcription

Convert audio to text in the original language using the Whisper or GPT-4o transcription models.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: str | AudioModel,
    chunking_strategy: dict | str | Omit = omit,
    include: list[str] | Omit = omit,
    known_speaker_names: list[str] | Omit = omit,
    known_speaker_references: list[str] | Omit = omit,
    language: str | Omit = omit,
    prompt: str | Omit = omit,
    response_format: Literal["json", "text", "srt", "verbose_json", "vtt", "diarized_json"] | Omit = omit,
    stream: bool | Omit = omit,
    temperature: float | Omit = omit,
    timestamp_granularities: list[Literal["word", "segment"]] | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Transcription | TranscriptionVerbose:
    """
    Transcribe audio to text in the original language.

    Args:
        file: Audio file to transcribe. Supported formats: flac, mp3, mp4, mpeg,
            mpga, m4a, ogg, wav, webm. Max file size: 25 MB.
            Can be a file path string, a file object, or a tuple.

        model: Model ID. Options:
            - "gpt-4o-transcribe": Advanced transcription with streaming support
            - "gpt-4o-mini-transcribe": Faster, cost-effective transcription
            - "gpt-4o-transcribe-diarize": Speaker diarization model
            - "whisper-1": Powered by the open source Whisper V2 model

        chunking_strategy: Controls how audio is cut into chunks. Options:
            - "auto": Server normalizes loudness and uses voice activity detection (VAD)
            - {"type": "server_vad", ...}: Manually configure VAD parameters
            - If unset: audio is transcribed as a single block
            - Required for gpt-4o-transcribe-diarize with inputs longer than 30 seconds

        include: Additional information to include. Options:
            - "logprobs": Returns log probabilities for confidence analysis
            - Only works with response_format="json"
            - Only supported for gpt-4o-transcribe and gpt-4o-mini-transcribe
            - Not supported with gpt-4o-transcribe-diarize

        known_speaker_names: List of speaker names for diarization (e.g., ["customer", "agent"]).
            Corresponds to audio samples in known_speaker_references. Up to 4 speakers.
            Used with the gpt-4o-transcribe-diarize model.

        known_speaker_references: List of audio samples (as data URLs) containing known
            speaker references. Each sample must be 2-10 seconds long. Matches
            known_speaker_names. Used with the gpt-4o-transcribe-diarize model.

        language: Language of the audio in ISO-639-1 format (e.g., "en", "fr", "de").
            Providing the language improves accuracy and latency.

        prompt: Optional text to guide the model's style or continue a previous segment.
            Should match the audio language.

        response_format: Output format. Options:
            - "json": JSON with text (default)
            - "text": Plain text only
            - "srt": SubRip subtitle format
            - "verbose_json": JSON with segments, timestamps, confidence
            - "vtt": WebVTT subtitle format
            - "diarized_json": JSON with speaker annotations (for gpt-4o-transcribe-diarize)
            Note: gpt-4o-transcribe and gpt-4o-mini-transcribe only support "json".
            gpt-4o-transcribe-diarize supports "json", "text", and "diarized_json"
            (required for speaker annotations).

        stream: If true, the model response is streamed using server-sent events.
            Returns Stream[TranscriptionStreamEvent]. Not supported for whisper-1.

        temperature: Sampling temperature between 0 and 1. Higher values increase
            randomness. Default is 0.

        timestamp_granularities: Timestamp precision options.
            - ["segment"]: Segment-level timestamps (default)
            - ["word"]: Word-level timestamps
            - ["segment", "word"]: Both levels

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        Transcription: Basic response with text
        TranscriptionVerbose: Detailed response with segments and timestamps

    Raises:
        BadRequestError: Invalid file format or size
        AuthenticationError: Invalid API key
    """
```

Usage examples:

```python
from openai import OpenAI

client = OpenAI()

# Basic transcription
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file
    )
print(transcript.text)

# With language hint for better accuracy
with open("french_audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="fr"
    )

# Verbose JSON with detailed information
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

print(f"Duration: {transcript.duration}")
print(f"Language: {transcript.language}")

for segment in transcript.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s]: {segment.text}")

for word in transcript.words:
    print(f"{word.word} ({word.start:.2f}s)")

# SRT subtitle format
with open("video_audio.mp3", "rb") as audio_file:
    srt = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="srt"
    )
# Save to file
with open("subtitles.srt", "w") as f:
    f.write(srt)

# With prompt for context/style
with open("continuation.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        prompt="Previous text for context..."
    )

# Using the file_from_path helper
from openai import file_from_path

transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=file_from_path("audio.mp3")
)

# Advanced: gpt-4o-transcribe with streaming
with open("audio.mp3", "rb") as audio_file:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        stream=True
    )
for event in stream:
    print(event.text, end="", flush=True)

# Advanced: speaker diarization with gpt-4o-transcribe-diarize
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        chunking_strategy="auto"
    )
for segment in transcript.segments:
    print(f"[{segment.speaker}]: {segment.text}")

# Advanced: with known speaker references
with open("call.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe-diarize",
        file=audio_file,
        response_format="diarized_json",
        known_speaker_names=["customer", "agent"],
        known_speaker_references=[
            "data:audio/mp3;base64,...",  # Customer voice sample
            "data:audio/mp3;base64,..."   # Agent voice sample
        ]
    )

# Advanced: using include for confidence scores
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
        response_format="json",
        include=["logprobs"]
    )
# Access logprobs for confidence analysis
```

### Translation

Translate audio to English text using the Whisper model.

```python { .api }
def create(
    self,
    *,
    file: FileTypes,
    model: str | AudioModel,
    prompt: str | Omit = omit,
    response_format: Literal["json", "text", "srt", "verbose_json", "vtt"] | Omit = omit,
    temperature: float | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> Translation | TranslationVerbose | str:
    """
    Translate audio to English text.

    Args:
        file: Audio file to translate. Supported formats: flac, mp3, mp4, mpeg,
            mpga, m4a, ogg, wav, webm. Max file size: 25 MB.

        model: Model ID. Currently only "whisper-1" is available.

        prompt: Optional text to guide the model's style.

        response_format: Output format. Options:
            - "json": JSON with text (default)
            - "text": Plain text only
            - "srt": SubRip subtitle format
            - "verbose_json": JSON with segments and details
            - "vtt": WebVTT subtitle format

        temperature: Sampling temperature between 0 and 1.

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        Translation: Basic response with English text (for json format)
        TranslationVerbose: Detailed response with segments (for verbose_json format)
        str: Plain text string (for text, srt, vtt formats)

    Raises:
        BadRequestError: Invalid file format or size
    """
```

Usage example:

```python
from openai import OpenAI

client = OpenAI()

# Translate French audio to English
with open("french_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file
    )
print(translation.text)

# Verbose format with segments
with open("spanish_audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json"
    )

for segment in translation.segments:
    print(f"[{segment.start:.2f}s]: {segment.text}")
```

### Text-to-Speech

Convert text to spoken audio using TTS models.

```python { .api }
def create(
    self,
    *,
    input: str,
    model: str | SpeechModel,
    voice: Literal["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"],
    instructions: str | Omit = omit,
    response_format: Literal["mp3", "opus", "aac", "flac", "wav", "pcm"] | Omit = omit,
    speed: float | Omit = omit,
    stream_format: Literal["sse", "audio"] | Omit = omit,
    extra_headers: dict[str, str] | None = None,
    extra_query: dict[str, object] | None = None,
    extra_body: dict[str, object] | None = None,
    timeout: float | httpx.Timeout | None | NotGiven = NOT_GIVEN,
) -> HttpxBinaryResponseContent:
    """
    Convert text to spoken audio.

    Args:
        input: Text to convert to audio. Max length: 4096 characters.

        model: TTS model to use. Options:
            - "tts-1": Standard quality, faster, lower cost
            - "tts-1-hd": High definition quality, slower, higher cost
            - "gpt-4o-mini-tts": Advanced model with instruction support

        voice: Voice to use for generation. Options:
            - "alloy": Neutral, balanced
            - "ash": Clear and articulate
            - "ballad": Warm and expressive
            - "coral": Bright and engaging
            - "echo": Calm and measured
            - "sage": Wise and authoritative
            - "shimmer": Soft and gentle
            - "verse": Dynamic and versatile
            - "marin": Smooth and professional
            - "cedar": Rich and grounded

        instructions: Control the voice with additional instructions.
            Does not work with tts-1 or tts-1-hd. Only supported by gpt-4o-mini-tts.

        response_format: Audio format. Options:
            - "mp3": Default, good compression
            - "opus": Best for streaming, lower latency
            - "aac": Good compression, widely supported
            - "flac": Lossless compression
            - "wav": Uncompressed
            - "pcm": Raw 16-bit PCM audio

        speed: Playback speed between 0.25 and 4.0. Default 1.0.

        stream_format: Format to stream the audio in. Options:
            - "sse": Server-sent events streaming
            - "audio": Raw audio streaming
            Note: "sse" is not supported for tts-1 or tts-1-hd.

        extra_headers: Additional HTTP headers.
        extra_query: Additional query parameters.
        extra_body: Additional JSON fields.
        timeout: Request timeout in seconds.

    Returns:
        HttpxBinaryResponseContent: Audio file content. Use .content for bytes,
            .read() for streaming, .stream_to_file(path) for direct save.

    Raises:
        BadRequestError: Invalid parameters or text too long
    """
```

Usage examples:

```python
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# Basic TTS
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! This is a test of text to speech."
)

# Save to file
speech_file = Path("output.mp3")
response.stream_to_file(speech_file)

# Different voices
voices = ["alloy", "ash", "ballad", "coral", "echo", "sage", "shimmer", "verse", "marin", "cedar"]
for voice in voices:
    response = client.audio.speech.create(
        model="tts-1",
        voice=voice,
        input="Testing different voices."
    )
    response.stream_to_file(f"voice_{voice}.mp3")

# High quality audio (marin or cedar recommended for best quality)
response = client.audio.speech.create(
    model="tts-1-hd",
    voice="marin",
    input="High definition audio output."
)
response.stream_to_file("hd_output.mp3")

# Streaming-optimized format (Opus)
response = client.audio.speech.create(
    model="tts-1",
    voice="shimmer",
    input="Optimized for streaming.",
    response_format="opus"
)
response.stream_to_file("output.opus")

# Adjust playback speed
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="This will play faster.",
    speed=1.5
)
response.stream_to_file("fast_speech.mp3")

# Get raw bytes
response = client.audio.speech.create(
    model="tts-1",
    voice="echo",
    input="Getting raw audio bytes."
)
audio_bytes = response.content
# Process bytes as needed

# Streaming response
response = client.audio.speech.create(
    model="tts-1",
    voice="ballad",
    input="Streaming audio data."
)

with open("streaming.mp3", "wb") as f:
    for chunk in response.iter_bytes():
        f.write(chunk)

# Advanced: gpt-4o-mini-tts with instructions
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="sage",
    input="This is a test of voice control.",
    instructions="Speak in a warm, friendly tone with slight enthusiasm."
)
response.stream_to_file("instructed_speech.mp3")

# Advanced: server-sent events streaming
response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Real-time audio streaming.",
    stream_format="sse"
)
response.stream_to_file("sse_output.mp3")
```
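
Because `input` is capped at 4096 characters, longer text has to be split client-side before generating speech. One way to do this is to pack whole sentences greedily under the limit; `split_for_tts` is a hypothetical helper sketched here, not part of the SDK:

```python
MAX_TTS_CHARS = 4096  # documented input limit for the speech endpoint

def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than `limit`.

    A single sentence longer than `limit` is kept as its own (oversized) chunk.
    """
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as `input` to a separate `speech.create` call and the resulting audio files concatenated.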

## Types

```python { .api }
from pathlib import Path
from typing import Iterator, Literal, Union

from pydantic import BaseModel

# Transcription types
class Transcription(BaseModel):
    text: str

class TranscriptionVerbose(BaseModel):
    text: str
    language: str
    duration: float
    segments: list[TranscriptionSegment] | None
    words: list[TranscriptionWord] | None

class TranscriptionSegment(BaseModel):
    id: int
    seek: int
    start: float
    end: float
    text: str
    tokens: list[int]
    temperature: float
    avg_logprob: float
    compression_ratio: float
    no_speech_prob: float

class TranscriptionWord(BaseModel):
    word: str
    start: float
    end: float

class TranscriptionDiarized(BaseModel):
    """Transcription with speaker diarization."""
    text: str
    language: str
    duration: float
    segments: list[DiarizedSegment]

class DiarizedSegment(BaseModel):
    """Segment with speaker information."""
    speaker: str  # Speaker identifier
    start: float
    end: float
    text: str

# Translation types
class Translation(BaseModel):
    text: str

class TranslationVerbose(BaseModel):
    text: str
    language: str
    duration: float
    segments: list[TranscriptionSegment] | None

# Model types
AudioModel = Literal["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "gpt-4o-transcribe-diarize", "whisper-1"]
SpeechModel = Literal["tts-1", "tts-1-hd", "gpt-4o-mini-tts"]

# File types
FileTypes = Union[
    FileContent,  # file-like object
    tuple[str | None, FileContent],  # (filename, content)
    tuple[str | None, FileContent, str | None],  # (filename, content, content_type)
]

# Response type for TTS
class HttpxBinaryResponseContent:
    content: bytes
    def read(self) -> bytes: ...
    def iter_bytes(self, chunk_size: int | None = None) -> Iterator[bytes]: ...
    def stream_to_file(self, file_path: str | Path) -> None: ...
```
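
Since `FileTypes` accepts tuples as well as file objects, audio that arrives as raw bytes (for example, from an upload handler) can be passed with an explicit filename and content type. The bytes and names below are placeholders for illustration:

```python
# Raw bytes with an explicit filename and MIME type, matching the
# (filename, content, content_type) form of FileTypes above
audio_bytes = b"\xff\xfb\x90\x00"  # placeholder bytes standing in for real MP3 data

file_arg = ("recording.mp3", audio_bytes, "audio/mpeg")

# The tuple is then passed directly as the `file` argument:
# client.audio.transcriptions.create(model="whisper-1", file=file_arg)
```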

## Async Usage

```python
import asyncio

from openai import AsyncOpenAI

async def transcribe_audio():
    client = AsyncOpenAI()

    with open("audio.mp3", "rb") as audio_file:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file
        )
    return transcript.text

async def generate_speech():
    client = AsyncOpenAI()

    response = await client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input="Async text to speech"
    )
    response.stream_to_file("async_output.mp3")

# Run async operations
text = asyncio.run(transcribe_audio())
asyncio.run(generate_speech())
```