# Audio Transcription

Transcribe audio files to text with support for various audio formats and streaming. The audio API provides accurate speech-to-text conversion with language detection and formatting options.

## Capabilities

### Audio Transcription

Convert audio files to text with customizable options.

```python { .api }
def transcribe(
    file: Union[str, BinaryIO],
    model: str,
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: Optional[str] = None,
    temperature: Optional[float] = None,
    timestamp_granularities: Optional[List[str]] = None,
    **kwargs
) -> TranscriptionResponse:
    """
    Transcribe audio to text.

    Parameters:
    - file: Audio file path (string) or file-like object (BinaryIO)
    - model: Transcription model identifier
    - language: Optional language code (e.g., "en", "fr", "es")
    - prompt: Optional prompt to guide transcription
    - response_format: Output format ("json", "text", "srt", "vtt")
    - temperature: Sampling temperature for transcription
    - timestamp_granularities: Timestamp precision levels

    Returns:
        TranscriptionResponse with transcribed text and metadata
    """
```

### Streaming Transcription

Transcribe audio in real-time from streaming input.

```python { .api }
def transcribe_stream(
    stream: Iterator[bytes],
    model: str,
    language: Optional[str] = None,
    **kwargs
) -> Iterator[TranscriptionStreamEvents]:
    """
    Transcribe streaming audio.

    Parameters:
    - stream: Iterator of audio bytes
    - model: Transcription model identifier
    - language: Optional language code

    Returns:
        Iterator of transcription events with partial and final results
    """
```

## Usage Examples

### Basic Audio Transcription

```python
from mistralai import Mistral

client = Mistral(api_key="your-api-key")

# Transcribe an audio file
with open("recording.mp3", "rb") as audio_file:
    response = client.audio.transcribe(
        file=audio_file,
        model="whisper-1",
        language="en",
        response_format="json"
    )

print("Transcription:")
print(response.text)
print(f"Language detected: {response.language}")
print(f"Duration: {response.duration} seconds")
```

### Transcription with Timestamps

```python
# Get detailed transcription with timestamps
response = client.audio.transcribe(
    file="meeting_recording.wav",
    model="whisper-1",
    response_format="json",
    timestamp_granularities=["word", "segment"]
)

print("Detailed transcription:")
for segment in response.segments:
    start_time = segment.start
    end_time = segment.end
    text = segment.text
    print(f"[{start_time:.2f}s - {end_time:.2f}s]: {text}")

# Word-level timestamps (words is None unless "word" granularity was requested)
if response.words:
    print("\nWord-level timing:")
    for word in response.words[:10]:  # First 10 words
        print(f"'{word.word}' at {word.start:.2f}s")
```

### Multiple Format Output

```python
# Get transcription in different formats
formats = ["json", "text", "srt", "vtt"]

for fmt in formats:
    response = client.audio.transcribe(
        file="presentation.m4a",
        model="whisper-1",
        response_format=fmt
    )

    # Save to file; the JSON response exposes .text, while the
    # text/srt/vtt formats are returned as plain strings
    extension = "txt" if fmt == "text" else fmt
    with open(f"transcription.{extension}", "w") as f:
        if fmt == "json":
            f.write(response.text)
        else:
            f.write(response)

    print(f"Saved transcription in {fmt} format")
```

### Streaming Transcription

```python
import pyaudio

# Yield raw microphone audio in fixed-size chunks
def audio_stream_generator():
    audio = pyaudio.PyAudio()
    stream = audio.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1024
    )

    try:
        while True:
            data = stream.read(1024)
            yield data
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()

# Transcribe streaming audio
print("Starting real-time transcription...")
events = client.audio.transcribe_stream(
    stream=audio_stream_generator(),
    model="whisper-1",
    language="en"
)

for event in events:
    if event.type == "transcription.partial":
        print(f"Partial: {event.text}", end="\r")
    elif event.type == "transcription.completed":
        print(f"\nFinal: {event.text}")
```

### Batch Audio Processing

```python
import json
import os

# Process multiple audio files
audio_files = ["interview1.mp3", "interview2.wav", "lecture.m4a"]
transcriptions = {}

for audio_file in audio_files:
    if os.path.exists(audio_file):
        print(f"Processing {audio_file}...")

        # Omitting the language parameter lets the API auto-detect it
        response = client.audio.transcribe(
            file=audio_file,
            model="whisper-1",
            response_format="json"
        )

        transcriptions[audio_file] = {
            "text": response.text,
            "language": response.language,
            "duration": response.duration
        }

        print(f"  Completed: {len(response.text)} characters")

# Save all transcriptions
with open("all_transcriptions.json", "w") as f:
    json.dump(transcriptions, f, indent=2)
```
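
For larger batches, the per-file calls can also run concurrently, since each transcription is network-bound. A minimal sketch with a thread pool; the `transcribe_one` helper is illustrative, standing in for the `client.audio.transcribe` call above:

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_one(path):
    # Placeholder for a real client.audio.transcribe call;
    # returns the path paired with its transcription text
    return path, f"transcript of {path}"

audio_files = ["interview1.mp3", "interview2.wav", "lecture.m4a"]

# Network-bound calls overlap well in threads; cap concurrency to stay
# within API rate limits
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(transcribe_one, audio_files))

for path, text in results.items():
    print(f"{path}: {text}")
```

Threads (rather than processes) are sufficient here because the work is dominated by waiting on the network, not CPU.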

## Types

### Request Types

```python { .api }
class AudioTranscriptionRequest:
    file: Union[str, BinaryIO]
    model: str
    language: Optional[str]
    prompt: Optional[str]
    response_format: Optional[str]
    temperature: Optional[float]
    timestamp_granularities: Optional[List[str]]

class AudioTranscriptionRequestStream:
    stream: Iterator[bytes]
    model: str
    language: Optional[str]
```

### Response Types

```python { .api }
class TranscriptionResponse:
    text: str
    language: Optional[str]
    duration: Optional[float]
    segments: Optional[List[TranscriptionSegment]]
    words: Optional[List[TranscriptionWord]]

class TranscriptionSegment:
    id: int
    start: float
    end: float
    text: str
    temperature: Optional[float]
    avg_logprob: Optional[float]
    compression_ratio: Optional[float]
    no_speech_prob: Optional[float]

class TranscriptionWord:
    word: str
    start: float
    end: float

class TranscriptionStreamEvents:
    type: str  # "transcription.partial", "transcription.completed", "error"
    text: Optional[str]
    language: Optional[str]
    timestamp: Optional[float]
```

### Stream Event Types

```python { .api }
class TranscriptionStreamEventTypes:
    PARTIAL = "transcription.partial"
    COMPLETED = "transcription.completed"
    ERROR = "error"
    DONE = "done"
```
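
One way to consume these events is a small loop that accumulates partials and keeps the final transcript. A sketch using plain dicts in place of the event objects (the helper name is illustrative):

```python
def handle_events(events):
    # Collect partial texts and return (partials, final_transcript)
    partials, final = [], None
    for event in events:
        if event["type"] == "transcription.partial":
            partials.append(event["text"])
        elif event["type"] == "transcription.completed":
            final = event["text"]
        elif event["type"] == "error":
            # Surface stream errors to the caller rather than swallowing them
            raise RuntimeError(event.get("text") or "transcription error")
    return partials, final
```

Keeping the partials around is useful for live display, while the completed event is the authoritative result to persist.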

## Supported Formats

### Audio Formats

- **MP3**: MPEG Audio Layer III
- **WAV**: Waveform Audio File Format
- **M4A**: MPEG-4 Audio
- **FLAC**: Free Lossless Audio Codec
- **OGG**: Ogg Vorbis
- **WEBM**: WebM Audio
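
When processing files in bulk, a cheap client-side extension check against this set can skip obviously unsupported files before uploading. A small helper (names are illustrative; the check is no substitute for the API's own validation):

```python
SUPPORTED_EXTENSIONS = {"mp3", "wav", "m4a", "flac", "ogg", "webm"}

def is_supported_audio(filename: str) -> bool:
    # Compare the lowercased file extension against the supported set
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return ext in SUPPORTED_EXTENSIONS
```

For example, `is_supported_audio("Lecture.M4A")` returns `True`, while `is_supported_audio("notes.txt")` returns `False`.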

### Response Formats

- **json**: Structured JSON with metadata
- **text**: Plain text transcription only
- **srt**: SubRip subtitle format with timestamps
- **vtt**: WebVTT subtitle format
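
If you request `json` but later need subtitles, segment timestamps can be rendered as SRT locally. A sketch assuming segments shaped like `TranscriptionSegment` above (dicts are used here to keep the example self-contained):

```python
def to_srt(segments):
    # Format seconds as the SRT timestamp "HH:MM:SS,mmm"
    def stamp(t):
        ms = int(round(t * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    # Each SRT cue is: index, time range, text, separated by blank lines
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks) + "\n"

print(to_srt([{"start": 0.0, "end": 2.5, "text": "Hello there."}]))
# 1
# 00:00:00,000 --> 00:00:02,500
# Hello there.
```

The same timestamps can be reformatted as WebVTT by swapping the comma for a period and prepending a `WEBVTT` header.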

### Language Support

Supports many languages including:

- English (en)
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- And many more...

## Best Practices

### Audio Quality

- Use clear, high-quality audio recordings
- Minimize background noise and echo
- Ensure consistent volume levels
- Use appropriate sample rates (16kHz or higher)

### Performance Optimization

- Use appropriate models for your use case
- Consider batch processing for multiple files
- Implement proper error handling for network issues
- Cache results for repeated transcriptions
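
The error-handling and caching points combine naturally into one small wrapper. A sketch with a hypothetical `do_transcribe` callable standing in for the real API call, and `OSError` standing in for whatever transient network exception your client raises:

```python
import time

_cache = {}

def transcribe_with_retry(do_transcribe, path, retries=3, backoff=1.0):
    # Serve repeated requests for the same file from the cache
    if path in _cache:
        return _cache[path]

    for attempt in range(retries):
        try:
            result = do_transcribe(path)
            _cache[path] = result
            return result
        except OSError:
            # Treat the error as transient: back off exponentially, then retry;
            # re-raise once the retry budget is exhausted
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2 ** attempt)
```

In production you would likely bound the cache and catch the SDK's specific exception types rather than `OSError`.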

### Accuracy Improvement

- Provide context through prompts when helpful
- Specify the language when known for better accuracy
- Use temperature settings to control consistency
- Review and correct transcriptions for critical applications