# Audio Processing

Audio decoding, format conversion, and preprocessing utilities for preparing audio data for transcription. These functions handle the conversion of various audio formats into the numpy arrays required by the Whisper models.

## Capabilities

### Audio Decoding

Decode audio files from various formats into numpy arrays suitable for speech recognition processing.

```python { .api }
def decode_audio(
    input_file: str | BinaryIO,
    sampling_rate: int = 16000,
    split_stereo: bool = False,
) -> np.ndarray | tuple[np.ndarray, np.ndarray]:
    """
    Decode audio from file or file-like object.

    Uses PyAV library to decode audio with FFmpeg backend, supporting most audio formats
    without requiring system FFmpeg installation.

    Args:
        input_file: Path to audio file or file-like object containing audio data
        sampling_rate: Target sample rate for resampling (default: 16000 Hz)
        split_stereo: If True, return separate left and right channels for stereo audio

    Returns:
        - If split_stereo=False: Single numpy array of shape (samples,) containing mono audio
        - If split_stereo=True: Tuple of (left_channel, right_channel) numpy arrays

    Notes:
        - Output is always float32 normalized to [-1.0, 1.0] range
        - Stereo audio is automatically converted to mono unless split_stereo=True
        - Automatically handles resampling to target sampling rate
        - Supports all formats supported by FFmpeg/PyAV
    """
```
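
The normalization and channel handling described in the notes can be illustrated with plain NumPy. This is a minimal sketch, not the library's code: it assumes the decoder yields interleaved signed 16-bit PCM, and the helper name `pcm16_to_float32` is hypothetical.

```python
import numpy as np

def pcm16_to_float32(pcm: np.ndarray, split_stereo: bool = False):
    """Scale int16 samples to float32 in [-1.0, 1.0) (hypothetical helper)."""
    audio = pcm.astype(np.float32) / 32768.0
    if split_stereo:
        # Assumption: samples are interleaved as L, R, L, R, ...
        return audio[0::2], audio[1::2]
    return audio

samples = np.array([0, 16384, -32768, 32767], dtype=np.int16)
mono = pcm16_to_float32(samples)
left, right = pcm16_to_float32(samples, split_stereo=True)
```

Dividing by 32768 maps the most negative int16 value to exactly -1.0, so the largest positive sample lands just below 1.0.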

### Array Padding and Trimming

Utility function for padding or trimming arrays to specific lengths, commonly used for feature processing.

```python { .api }
def pad_or_trim(
    array: np.ndarray,
    length: int = 3000,
    *,
    axis: int = -1
) -> np.ndarray:
    """
    Pad or trim array to specified length along given axis.

    Used internally for preparing mel-spectrogram features to expected input size
    for the encoder (typically 3000 frames for 30-second audio chunks).

    Args:
        array: Input numpy array to pad or trim
        length: Target length for the specified axis
        axis: Axis along which to pad or trim (default: last axis)

    Returns:
        Array padded or trimmed to specified length

    Notes:
        - If array is longer than length, it's trimmed from the end
        - If array is shorter than length, it's zero-padded at the end
        - Padding uses numpy's pad function with zeros
    """
```
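
The trim-or-pad behaviour described in the notes can be reproduced with NumPy alone. The following is an illustrative re-implementation under the documented semantics, not the library's source; the name `pad_or_trim_sketch` and the 80-bin feature shape are assumptions made for the example.

```python
import numpy as np

def pad_or_trim_sketch(array: np.ndarray, length: int = 3000, *, axis: int = -1) -> np.ndarray:
    """Illustrative stand-in for pad_or_trim (hypothetical helper)."""
    current = array.shape[axis]
    if current > length:
        # Keep only the first `length` entries along `axis`.
        array = array.take(indices=range(length), axis=axis)
    elif current < length:
        # Zero-pad at the end of `axis` only; other axes are untouched.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - current)
        array = np.pad(array, pad_widths)
    return array

features = np.ones((80, 2500), dtype=np.float32)   # assumed (mel bins, frames) layout
padded = pad_or_trim_sketch(features)               # shape (80, 3000)
trimmed = pad_or_trim_sketch(features, 1000)        # shape (80, 1000)
```

Because `axis` is keyword-only, a call such as `pad_or_trim_sketch(x, 3000, axis=0)` makes the padded dimension explicit at the call site.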

## Usage Examples

### Basic Audio Decoding

```python
from faster_whisper import decode_audio
import numpy as np

# Decode audio file to mono
audio = decode_audio("speech.mp3")
print(f"Audio shape: {audio.shape}")
print(f"Audio dtype: {audio.dtype}")
print(f"Duration: {len(audio) / 16000:.2f} seconds")

# Decode with custom sample rate
audio_8k = decode_audio("speech.mp3", sampling_rate=8000)
print(f"8kHz audio shape: {audio_8k.shape}")
```

### Stereo Audio Processing

```python
from faster_whisper import decode_audio

# Decode stereo audio as separate channels
left_channel, right_channel = decode_audio("stereo_audio.wav", split_stereo=True)

print(f"Left channel shape: {left_channel.shape}")
print(f"Right channel shape: {right_channel.shape}")

# Process each channel separately or combine them
combined = (left_channel + right_channel) / 2  # Simple averaging
```

### Working with File-like Objects

```python
from faster_whisper import decode_audio
import io
import requests

# Download and decode audio from URL
response = requests.get("https://example.com/audio.wav", timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of decoding an error page
audio_bytes = io.BytesIO(response.content)

# Decode from memory
audio = decode_audio(audio_bytes, sampling_rate=16000)
print(f"Downloaded audio duration: {len(audio) / 16000:.2f}s")
```

### Pre-processing for Transcription

```python
from faster_whisper import WhisperModel, decode_audio
import numpy as np

model = WhisperModel("base")

# Decode audio manually
audio = decode_audio("long_audio.mp3", sampling_rate=16000)

# Split long audio into chunks for processing
chunk_duration = 30  # seconds
chunk_samples = chunk_duration * 16000
chunks = []

for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    if len(chunk) < chunk_samples:
        # Pad last chunk if necessary
        chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
    chunks.append(chunk)

# Process each chunk
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}")
    segments, info = model.transcribe(chunk)

    for segment in segments:
        # Adjust timestamps for chunk offset
        start_time = segment.start + (i * chunk_duration)
        end_time = segment.end + (i * chunk_duration)
        print(f"[{start_time:.2f}s -> {end_time:.2f}s] {segment.text}")
```
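
The chunking arithmetic above can be isolated into a small generator, which makes the offset bookkeeping easier to check without loading a model. This is a sketch using only NumPy; `iter_chunks` is a hypothetical helper and not part of faster-whisper.

```python
import numpy as np

def iter_chunks(audio: np.ndarray, chunk_seconds: float = 30.0, sampling_rate: int = 16000):
    """Yield (offset_seconds, chunk) pairs, zero-padding the final chunk."""
    chunk_samples = int(chunk_seconds * sampling_rate)
    for start in range(0, len(audio), chunk_samples):
        chunk = audio[start:start + chunk_samples]
        if len(chunk) < chunk_samples:
            # Pad the tail so every chunk has the same length.
            chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
        yield start / sampling_rate, chunk

audio = np.zeros(16000 * 70, dtype=np.float32)  # 70 s of silence as stand-in input
chunks = list(iter_chunks(audio))
offsets = [offset for offset, _ in chunks]
```

Each yielded offset can be added to `segment.start` and `segment.end` to recover timestamps relative to the full recording. Note that fixed 30-second boundaries may cut a word in half, so text near chunk edges can be less reliable.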

### Audio Quality Validation

```python
from faster_whisper import decode_audio
import numpy as np

def validate_audio_quality(audio_path):
    """Validate audio quality for speech recognition."""
    audio = decode_audio(audio_path)

    # Guard against empty input: np.max raises on a zero-length array
    if audio.size == 0:
        raise ValueError(f"No audio samples decoded from {audio_path}")

    # Basic quality checks (16000 matches decode_audio's default sampling rate)
    duration = len(audio) / 16000
    rms_level = np.sqrt(np.mean(audio**2))
    max_amplitude = np.max(np.abs(audio))

    print(f"Duration: {duration:.2f}s")
    print(f"RMS level: {rms_level:.4f}")
    print(f"Max amplitude: {max_amplitude:.4f}")

    # Quality warnings
    if duration < 1.0:
        print("WARNING: Audio is very short (< 1s)")
    if rms_level < 0.01:
        print("WARNING: Audio level is very low")
    if max_amplitude > 0.95:
        print("WARNING: Audio may be clipped")

    return audio

# Validate before transcription
audio = validate_audio_quality("input.wav")
```

## Supported Audio Formats

The `decode_audio` function supports all formats handled by FFmpeg/PyAV, including:

- **Common formats**: MP3, WAV, FLAC, AAC, OGG, M4A
- **Video formats**: MP4, AVI, MKV, WebM (audio track extracted)
- **Professional formats**: AIFF, AU, VOC
- **Compressed formats**: WMA, APE, TTA

## Technical Notes

- **Memory Management**: The `decode_audio` function includes garbage collection to prevent memory leaks with the resampler
- **Performance**: For very large audio files, consider processing in chunks to manage memory usage
- **Precision**: Audio is converted to float32 format normalized to [-1.0, 1.0] range
- **Resampling**: Uses PyAV's high-quality resampling for sample rate conversion
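
To put the memory and precision notes in concrete terms, the size of a decoded array follows directly from duration, sample rate, and the 4-byte float32 sample width. The helper below is a hypothetical back-of-the-envelope estimate; it covers only the output array and ignores any intermediate buffers the decoder may hold.

```python
def decoded_size_bytes(duration_seconds: float, sampling_rate: int = 16000, channels: int = 1) -> int:
    """Estimate the size of the decoded float32 array (hypothetical helper)."""
    bytes_per_sample = 4  # float32
    return int(duration_seconds * sampling_rate * channels) * bytes_per_sample

one_hour = decoded_size_bytes(3600)
print(f"{one_hour / 1e6:.1f} MB")  # prints "230.4 MB"
```

At the default 16 kHz mono setting this works out to 64 kB per second of audio, which is a useful rule of thumb when deciding whether a file needs to be processed in pieces.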