# Audio Processing

Audio decoding, format conversion, and preprocessing utilities for preparing audio data for transcription. These functions convert audio in various formats into the NumPy arrays required by the Whisper models.

## Capabilities

### Audio Decoding

Decode audio files from various formats into NumPy arrays suitable for speech recognition processing.

```python { .api }
def decode_audio(
    input_file: str | BinaryIO,
    sampling_rate: int = 16000,
    split_stereo: bool = False,
) -> np.ndarray | tuple[np.ndarray, np.ndarray]:
    """
    Decode audio from file or file-like object.

    Uses PyAV library to decode audio with FFmpeg backend, supporting most audio formats
    without requiring system FFmpeg installation.

    Args:
        input_file: Path to audio file or file-like object containing audio data
        sampling_rate: Target sample rate for resampling (default: 16000 Hz)
        split_stereo: If True, return separate left and right channels for stereo audio

    Returns:
        - If split_stereo=False: Single numpy array of shape (samples,) containing mono audio
        - If split_stereo=True: Tuple of (left_channel, right_channel) numpy arrays

    Notes:
        - Output is always float32 normalized to [-1.0, 1.0] range
        - Stereo audio is automatically converted to mono unless split_stereo=True
        - Automatically handles resampling to target sampling rate
        - Supports all formats supported by FFmpeg/PyAV
    """
```
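The normalization and channel splitting described in the notes can be illustrated with plain NumPy. This is a minimal sketch, assuming the decoder hands back interleaved signed 16-bit PCM samples; the helper names `pcm16_to_float32` and `split_interleaved_stereo` are hypothetical and are not part of the faster-whisper API.

```python
import numpy as np


def pcm16_to_float32(samples: np.ndarray) -> np.ndarray:
    """Scale signed 16-bit PCM samples to float32 in [-1.0, 1.0)."""
    return samples.astype(np.float32) / 32768.0


def split_interleaved_stereo(samples: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split interleaved samples (L0, R0, L1, R1, ...) into left and right arrays."""
    return samples[0::2], samples[1::2]


# Hypothetical interleaved stereo buffer holding two frames
raw = np.array([-32768, 32767, 0, 16384], dtype=np.int16)
left, right = split_interleaved_stereo(pcm16_to_float32(raw))
print(left, right)
```

Dividing by 32768 maps the most negative 16-bit value to exactly -1.0, while the most positive value lands just below 1.0.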

### Array Padding and Trimming

Utility function for padding or trimming arrays to a specific length, commonly used for feature processing.

```python { .api }
def pad_or_trim(
    array: np.ndarray,
    length: int = 3000,
    *,
    axis: int = -1
) -> np.ndarray:
    """
    Pad or trim array to specified length along given axis.

    Used internally for preparing mel-spectrogram features to expected input size
    for the encoder (typically 3000 frames for 30-second audio chunks).

    Args:
        array: Input numpy array to pad or trim
        length: Target length for the specified axis
        axis: Axis along which to pad or trim (default: last axis)

    Returns:
        Array padded or trimmed to specified length

    Notes:
        - If array is longer than length, the excess at the end is dropped
        - If array is shorter than length, it's zero-padded at the end
        - Padding uses numpy's pad function with zeros
    """
```
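The padding and trimming rules above can be sketched in a few lines of NumPy. This is an illustrative re-implementation under the documented behavior, not the library's source; `pad_or_trim_sketch` is a hypothetical name, and the 80-bin feature shape is an assumption chosen for the demo. The 3000-frame default is consistent with 30 seconds of 16 kHz audio at a 160-sample frame hop (30 × 16000 / 160 = 3000), assuming Whisper's usual feature settings.

```python
import numpy as np


def pad_or_trim_sketch(array: np.ndarray, length: int = 3000, *, axis: int = -1) -> np.ndarray:
    """Illustrative version of the documented behavior (not the library code)."""
    current = array.shape[axis]
    if current > length:
        # Keep only the first `length` entries along `axis`
        array = np.take(array, np.arange(length), axis=axis)
    elif current < length:
        # Zero-pad at the end of `axis`; leave all other axes untouched
        pad_widths = [(0, 0)] * array.ndim
        pad_widths[axis] = (0, length - current)
        array = np.pad(array, pad_widths)
    return array


# Hypothetical mel-spectrogram features: 80 mel bins x N frames
short = np.ones((80, 1200), dtype=np.float32)
long = np.ones((80, 4500), dtype=np.float32)

padded = pad_or_trim_sketch(short)
trimmed = pad_or_trim_sketch(long)
print(padded.shape, trimmed.shape)
```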

## Usage Examples

### Basic Audio Decoding

```python
from faster_whisper import decode_audio

# Decode audio file to mono (resampled to 16 kHz by default)
audio = decode_audio("speech.mp3")
print(f"Audio shape: {audio.shape}")
print(f"Audio dtype: {audio.dtype}")
print(f"Duration: {len(audio) / 16000:.2f} seconds")

# Decode with custom sample rate
audio_8k = decode_audio("speech.mp3", sampling_rate=8000)
print(f"8kHz audio shape: {audio_8k.shape}")
```

### Stereo Audio Processing

```python
from faster_whisper import decode_audio

# Decode stereo audio as separate channels
left_channel, right_channel = decode_audio("stereo_audio.wav", split_stereo=True)

print(f"Left channel shape: {left_channel.shape}")
print(f"Right channel shape: {right_channel.shape}")

# Process each channel separately or combine them
combined = (left_channel + right_channel) / 2  # Simple averaging
```

### Working with File-like Objects

```python
from faster_whisper import decode_audio
import io
import requests

# Download audio from URL; fail early on HTTP errors instead of decoding an error page
response = requests.get("https://example.com/audio.wav", timeout=30)
response.raise_for_status()
audio_bytes = io.BytesIO(response.content)

# Decode from memory
audio = decode_audio(audio_bytes, sampling_rate=16000)
print(f"Downloaded audio duration: {len(audio) / 16000:.2f}s")
```
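To try the file-like path without a network request, an in-memory WAV file can be built with the standard library alone. The sketch below is only an illustration: `make_test_tone` is a hypothetical helper, and the tone parameters are arbitrary.

```python
import io
import math
import struct
import wave


def make_test_tone(duration_s: float = 1.0, sample_rate: int = 16000,
                   freq_hz: float = 440.0) -> io.BytesIO:
    """Return an in-memory mono 16-bit WAV file containing a sine tone."""
    n_samples = int(duration_s * sample_rate)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_samples)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frames)
    buffer.seek(0)  # rewind so a reader starts at the WAV header
    return buffer


wav_buffer = make_test_tone()
print(f"In-memory WAV size: {len(wav_buffer.getvalue())} bytes")
```

The resulting `wav_buffer` is a binary file-like object, so with faster-whisper installed it can be passed to `decode_audio` in place of the downloaded bytes.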

### Pre-processing for Transcription

```python
from faster_whisper import WhisperModel, decode_audio
import numpy as np

model = WhisperModel("base")

# Decode audio manually
audio = decode_audio("long_audio.mp3", sampling_rate=16000)

# Split long audio into chunks for processing
chunk_duration = 30  # seconds
chunk_samples = chunk_duration * 16000
chunks = []

for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    if len(chunk) < chunk_samples:
        # Pad last chunk if necessary
        chunk = np.pad(chunk, (0, chunk_samples - len(chunk)))
    chunks.append(chunk)

# Process each chunk
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}")
    segments, info = model.transcribe(chunk)

    for segment in segments:
        # Adjust timestamps for chunk offset
        start_time = segment.start + (i * chunk_duration)
        end_time = segment.end + (i * chunk_duration)
        print(f"[{start_time:.2f}s -> {end_time:.2f}s] {segment.text}")
```
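The chunk and offset bookkeeping in the example above can also be factored into a small generator. This is a sketch, not part of faster-whisper: `iter_chunks` is a hypothetical helper, and it leaves the last chunk unpadded. Deriving the offset from the sample index keeps the timestamps correct even when the chunk length is not a whole number of seconds.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed to match the rate passed to decode_audio


def iter_chunks(audio: np.ndarray, chunk_seconds: float = 30.0,
                sample_rate: int = SAMPLE_RATE):
    """Yield (offset_seconds, chunk) pairs that together cover the whole array."""
    chunk_samples = int(chunk_seconds * sample_rate)
    for start in range(0, len(audio), chunk_samples):
        yield start / sample_rate, audio[start:start + chunk_samples]


# Hypothetical 70-second signal (silence) standing in for decoded audio
audio = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = list(iter_chunks(audio))
for offset, chunk in chunks:
    print(f"offset={offset:.1f}s samples={len(chunk)}")
```

Each yielded offset can then be added to `segment.start` and `segment.end` in place of `i * chunk_duration`.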

### Audio Quality Validation

```python
from faster_whisper import decode_audio
import numpy as np

def validate_audio_quality(audio_path):
    """Validate audio quality for speech recognition."""
    audio = decode_audio(audio_path)

    # Basic quality checks
    duration = len(audio) / 16000
    rms_level = np.sqrt(np.mean(audio**2))
    max_amplitude = np.max(np.abs(audio))

    print(f"Duration: {duration:.2f}s")
    print(f"RMS level: {rms_level:.4f}")
    print(f"Max amplitude: {max_amplitude:.4f}")

    # Quality warnings
    if duration < 1.0:
        print("WARNING: Audio is very short (< 1s)")
    if rms_level < 0.01:
        print("WARNING: Audio level is very low")
    if max_amplitude > 0.95:
        print("WARNING: Audio may be clipped")

    return audio

# Validate before transcription
audio = validate_audio_quality("input.wav")
```

## Supported Audio Formats

The `decode_audio` function supports all formats handled by FFmpeg/PyAV, including:

- **Common formats**: MP3, WAV, FLAC, AAC, OGG, M4A
- **Video formats**: MP4, AVI, MKV, WebM (audio track extracted)
- **Professional formats**: AIFF, AU, VOC
- **Compressed formats**: WMA, APE, TTA

## Technical Notes

- **Memory Management**: `decode_audio` runs garbage collection after decoding to prevent memory leaks from the resampler
- **Performance**: For very large audio files, consider processing in chunks to manage memory usage
- **Precision**: Audio is converted to float32 and normalized to the [-1.0, 1.0] range
- **Resampling**: Sample rate conversion uses PyAV's FFmpeg-backed `AudioResampler`
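
As a rough guide to the memory note above, the size of the decoded array follows directly from the sample rate and the 4-byte float32 sample width. The helper below is a back-of-the-envelope sketch with a hypothetical name; it counts only the final array, not any temporary buffers used while decoding.

```python
def decoded_size_bytes(duration_s: float, sampling_rate: int = 16000,
                       channels: int = 1) -> int:
    """Approximate size of the decoded float32 array (4 bytes per sample)."""
    return int(duration_s * sampling_rate * channels) * 4


one_hour = decoded_size_bytes(3600)
print(f"One hour of 16 kHz mono audio: {one_hour / 1e6:.1f} MB")  # 230.4 MB
```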