Tessl Tile for pypi/faster-whisper@1.2.0

# Audio Processing

Audio decoding, format conversion, and preprocessing utilities for preparing audio data for transcription. These functions convert various audio formats into the NumPy arrays required by the Whisper models.

## Capabilities

### Audio Decoding

Decode audio files from various formats into NumPy arrays suitable for speech recognition.

```python { .api }
def decode_audio(
    input_file: str | BinaryIO,
    sampling_rate: int = 16000,
    split_stereo: bool = False,
) -> np.ndarray | tuple[np.ndarray, np.ndarray]:
    """
    Decode audio from a file or file-like object.

    Uses the PyAV library to decode audio with an FFmpeg backend, supporting most
    audio formats without requiring a system FFmpeg installation.

    Args:
        input_file: Path to an audio file, or a file-like object containing audio data
        sampling_rate: Target sample rate for resampling (default: 16000 Hz)
        split_stereo: If True, return separate left and right channels for stereo audio

    Returns:
        - If split_stereo=False: Single numpy array of shape (samples,) containing mono audio
        - If split_stereo=True: Tuple of (left_channel, right_channel) numpy arrays

    Notes:
        - Output is always float32 normalized to the [-1.0, 1.0] range
        - Stereo audio is automatically converted to mono unless split_stereo=True
        - Audio is automatically resampled to the target sampling rate
        - Supports all formats supported by FFmpeg/PyAV
    """
```
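
The normalization and channel-splitting notes above can be illustrated with plain NumPy. This is a minimal sketch, not the library's code: it assumes the decoded samples arrive as interleaved signed 16-bit PCM (an assumption about the intermediate format), and the helper names `to_float32` and `split_interleaved` are hypothetical and not part of faster-whisper.

```python
import numpy as np


def to_float32(pcm16: np.ndarray) -> np.ndarray:
    """Scale signed 16-bit PCM samples to float32 values in [-1.0, 1.0)."""
    return pcm16.astype(np.float32) / 32768.0


def split_interleaved(samples: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split interleaved stereo samples (L, R, L, R, ...) into two channels."""
    return samples[0::2], samples[1::2]


# Four stereo frames: left channel at full negative scale, right channel silent.
pcm = np.array([-32768, 0, -32768, 0, -32768, 0, -32768, 0], dtype=np.int16)
left, right = split_interleaved(to_float32(pcm))
print(left.dtype, left.min(), right.max())
```

Dividing by 32768 maps the full int16 range into the unit interval, which is why downstream code can treat amplitudes close to 1.0 as near full scale regardless of the source file's bit depth.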

### Array Padding and Trimming

Utility function for padding or trimming arrays to a specific length, commonly used for feature processing.

```python { .api }
def pad_or_trim(
    array: np.ndarray,
    length: int = 3000,
    *,
    axis: int = -1
) -> np.ndarray:
    """
    Pad or trim array to the specified length along the given axis.

    Used internally to prepare mel-spectrogram features at the input size expected
    by the encoder (typically 3000 frames for 30-second audio chunks).

    Args:
        array: Input numpy array to pad or trim
        length: Target length for the specified axis
        axis: Axis along which to pad or trim (default: last axis)

    Returns:
        Array padded or trimmed to the specified length

    Notes:
        - If array is longer than length, it is trimmed from the end
        - If array is shorter than length, it is zero-padded at the end
        - Padding uses numpy's pad function with zeros
    """
```
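
The behavior described in the notes above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under the documented semantics (trim from the end, zero-pad at the end), not the library's source; the name `pad_or_trim_sketch` is hypothetical.

```python
import numpy as np


def pad_or_trim_sketch(array: np.ndarray, length: int = 3000, *, axis: int = -1) -> np.ndarray:
    """Trim from the end or zero-pad at the end so array.shape[axis] == length."""
    current = array.shape[axis]
    if current > length:
        # Keep only the first `length` entries along `axis`.
        array = array.take(indices=np.arange(length), axis=axis)
    elif current < length:
        # Pad only the trailing side of `axis`; leave the other axes unchanged.
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - current)
        array = np.pad(array, pad_widths)
    return array


features = np.ones((80, 1200), dtype=np.float32)    # e.g. 80 mel bins, 1200 frames
padded = pad_or_trim_sketch(features)                # shape (80, 3000)
trimmed = pad_or_trim_sketch(features, length=500)   # shape (80, 500)
```

With the default `length=3000`, a feature matrix covering less than 30 seconds is extended with zero frames, while a longer one loses its trailing frames.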

## Usage Examples

### Basic Audio Decoding

```python
from faster_whisper import decode_audio
import numpy as np

# Decode audio file to mono
audio = decode_audio("speech.mp3")
print(f"Audio shape: {audio.shape}")
print(f"Audio dtype: {audio.dtype}")
print(f"Duration: {len(audio) / 16000:.2f} seconds")

# Decode with custom sample rate
audio_8k = decode_audio("speech.mp3", sampling_rate=8000)
print(f"8kHz audio shape: {audio_8k.shape}")
```

### Stereo Audio Processing

```python
from faster_whisper import decode_audio

# Decode stereo audio as separate channels
left_channel, right_channel = decode_audio("stereo_audio.wav", split_stereo=True)

print(f"Left channel shape: {left_channel.shape}")
print(f"Right channel shape: {right_channel.shape}")

# Process each channel separately or combine them
combined = (left_channel + right_channel) / 2  # Simple averaging
```

### Working with File-like Objects

```python
from faster_whisper import decode_audio
import io
import requests

# Download audio from a URL; fail early on HTTP errors or a stalled connection
response = requests.get("https://example.com/audio.wav", timeout=30)
response.raise_for_status()
audio_bytes = io.BytesIO(response.content)

# Decode from memory
audio = decode_audio(audio_bytes, sampling_rate=16000)
print(f"Downloaded audio duration: {len(audio) / 16000:.2f}s")
```

### Pre-processing for Transcription

```python
from faster_whisper import WhisperModel, decode_audio
import numpy as np

model = WhisperModel("base")

# Decode audio manually
audio = decode_audio("long_audio.mp3", sampling_rate=16000)

# Split long audio into chunks for processing
chunk_duration = 30  # seconds
chunk_samples = chunk_duration * 16000
chunks = []

for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    if len(chunk) < chunk_samples:
        # Zero-pad the last chunk to the full chunk length
        chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
    chunks.append(chunk)

# Process each chunk
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}")
    # `segments` is a generator; transcription runs as it is iterated
    segments, _ = model.transcribe(chunk)

    for segment in segments:
        # Segment times are relative to the chunk, so add the chunk offset
        start_time = segment.start + (i * chunk_duration)
        end_time = segment.end + (i * chunk_duration)
        print(f"[{start_time:.2f}s -> {end_time:.2f}s] {segment.text}")
```
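
The chunk arithmetic in this example can be isolated and checked without a model or an audio file. The sketch below is illustrative only: `chunk_bounds` is a hypothetical helper, not part of faster-whisper. It derives each chunk's time offset from its starting sample index, which gives the same result as `i * chunk_duration` for fixed-size chunks.

```python
def chunk_bounds(total_samples: int, chunk_seconds: int = 30, sampling_rate: int = 16000):
    """Return (start_sample, end_sample, offset_seconds) for each chunk."""
    chunk_samples = chunk_seconds * sampling_rate
    bounds = []
    for start in range(0, total_samples, chunk_samples):
        # The last chunk may be shorter than chunk_samples.
        end = min(start + chunk_samples, total_samples)
        bounds.append((start, end, start / sampling_rate))
    return bounds


# 70 seconds of 16 kHz audio: two full 30 s chunks and one 10 s remainder.
for start, end, offset in chunk_bounds(70 * 16000):
    print(f"samples [{start}:{end}] start at {offset:.1f}s")
```

Adding `offset` to `segment.start` and `segment.end` converts chunk-relative timestamps into positions in the original recording.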

### Audio Quality Validation

```python
from faster_whisper import decode_audio
import numpy as np

def validate_audio_quality(audio_path):
    """Validate audio quality for speech recognition."""
    audio = decode_audio(audio_path)

    # np.max raises on an empty array, so reject empty input up front
    if audio.size == 0:
        raise ValueError(f"No audio samples decoded from {audio_path}")

    # Basic quality checks (decode_audio defaults to 16000 Hz)
    duration = len(audio) / 16000
    rms_level = np.sqrt(np.mean(audio**2))
    max_amplitude = np.max(np.abs(audio))

    print(f"Duration: {duration:.2f}s")
    print(f"RMS level: {rms_level:.4f}")
    print(f"Max amplitude: {max_amplitude:.4f}")

    # Quality warnings
    if duration < 1.0:
        print("WARNING: Audio is very short (< 1s)")
    if rms_level < 0.01:
        print("WARNING: Audio level is very low")
    if max_amplitude > 0.95:
        print("WARNING: Audio may be clipped")

    return audio

# Validate before transcription
audio = validate_audio_quality("input.wav")
```

## Supported Audio Formats

The `decode_audio` function supports all formats handled by FFmpeg/PyAV, including:

- **Common formats**: MP3, WAV, FLAC, AAC, OGG, M4A
- **Video formats**: MP4, AVI, MKV, WebM (audio track extracted)
- **Professional formats**: AIFF, AU, VOC
- **Other compressed formats**: WMA, APE, TTA

## Technical Notes

- **Memory Management**: `decode_audio` runs garbage collection after decoding to prevent memory leaks from the resampler
- **Performance**: For very large audio files, consider processing in chunks to manage memory usage
- **Precision**: Audio is converted to float32 and normalized to the [-1.0, 1.0] range
- **Resampling**: Sample rate conversion uses PyAV's resampler
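
To judge when chunked processing is worthwhile, it helps to estimate how large a decoded array will be. The sketch below is a back-of-the-envelope calculation based only on the float32 output format (4 bytes per sample); `decoded_size_bytes` is a hypothetical helper, and the figure excludes any temporary memory the decoder itself may use.

```python
def decoded_size_bytes(duration_seconds: float, sampling_rate: int = 16000, channels: int = 1) -> int:
    """Estimate the size of decoded float32 audio (4 bytes per sample)."""
    bytes_per_sample = 4  # float32
    return int(duration_seconds * sampling_rate * channels * bytes_per_sample)


one_hour = decoded_size_bytes(3600)
print(f"1 hour of mono 16 kHz float32 audio: {one_hour / 1e6:.1f} MB")
```

At the default 16000 Hz, each second of mono audio occupies 64,000 bytes, so an hour-long recording comes to roughly 230 MB once decoded.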
