# Voice Activity Detection

Voice activity detection using Silero VAD for automatic silence detection and audio segmentation. VAD improves transcription accuracy by filtering out silence so that processing focuses on speech segments.

## Capabilities

### VAD Configuration

Configure voice activity detection parameters for different audio scenarios and quality requirements.
```python { .api }
@dataclass
class VadOptions:
    """
    Voice Activity Detection options for Silero VAD.

    Attributes:
        threshold: Speech threshold (0-1). Probabilities above this are considered
            speech. Higher values are more conservative. Default: 0.5
        neg_threshold: Silence threshold for speech end detection. If None, uses
            threshold. Values below this are always silence; values above are
            speech only if the previous sample was speech. Default: None
        min_speech_duration_ms: Minimum speech segment duration in milliseconds.
            Shorter segments are discarded. Default: 0
        max_speech_duration_s: Maximum speech segment duration in seconds. Longer
            segments are split at silence gaps > 100 ms, or aggressively if there
            is no suitable split point. Default: inf
        min_silence_duration_ms: Minimum silence duration before ending a speech
            segment. Audio must be silent this long to end a segment. Default: 2000
        speech_pad_ms: Padding added to both ends of speech segments in
            milliseconds. Helps avoid cutting off speech edges. Default: 400
    """

    threshold: float = 0.5
    neg_threshold: float | None = None
    min_speech_duration_ms: int = 0
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    speech_pad_ms: int = 400
```
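The `threshold`/`neg_threshold` pair describes a simple hysteresis: a frame's speech probability must rise to `threshold` to enter the speech state, and fall below `neg_threshold` to leave it; in between, the previous state is kept. A minimal sketch of that decision rule (illustrative only, not the library's internal code — `label_frames` is our name):

```python
def label_frames(probs, threshold=0.5, neg_threshold=None):
    """Label per-frame speech probabilities as speech (True) or silence (False)
    using the hysteresis rule described in the VadOptions docstring.
    Illustrative sketch only."""
    if neg_threshold is None:
        neg_threshold = threshold
    labels = []
    speaking = False
    for p in probs:
        if p >= threshold:
            speaking = True   # clearly speech: enter or stay in the speech state
        elif p < neg_threshold:
            speaking = False  # clearly silence: leave the speech state
        # between neg_threshold and threshold: keep the previous state
        labels.append(speaking)
    return labels
```

With `neg_threshold` below `threshold`, brief probability dips during a word no longer toggle the segment off, which is exactly why a separate end threshold helps on choppy audio.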
### Speech Timestamp Detection

Extract speech timestamps from audio using the Silero VAD model for automatic speech segmentation.
```python { .api }
def get_speech_timestamps(
    audio: np.ndarray,
    vad_options: VadOptions | None = None,
    sampling_rate: int = 16000,
    **kwargs,
) -> list[dict]:
    """
    Get speech timestamps using Silero VAD.

    Args:
        audio: Audio data as a numpy array (mono, float32)
        vad_options: VAD configuration options. If None, uses defaults
        sampling_rate: Audio sample rate in Hz
        **kwargs: Additional arguments passed to Silero VAD

    Returns:
        List of dictionaries with speech segments:
        [
            {"start": start_sample, "end": end_sample},
            {"start": start_sample, "end": end_sample},
            ...
        ]

    Notes:
        - Timestamps are in sample indices, not seconds
        - Convert to seconds by dividing by sampling_rate
        - An empty list is returned if no speech is detected
    """
```
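Because the returned timestamps are sample indices, converting them for display or downstream tools is a per-segment division by the sample rate. A small helper (the function name is ours, not part of the library):

```python
def timestamps_to_seconds(speech_timestamps, sampling_rate=16000):
    """Convert sample-index segments from get_speech_timestamps
    into second-based segments."""
    return [
        {"start": seg["start"] / sampling_rate, "end": seg["end"] / sampling_rate}
        for seg in speech_timestamps
    ]
```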
### Speech Chunk Collection

Collect and process audio chunks based on detected speech timestamps.
```python { .api }
def collect_chunks(
    audio: np.ndarray,
    chunks: list[dict],
    sampling_rate: int = 16000,
    max_duration: float = float("inf"),
) -> tuple[list[np.ndarray], list[dict[str, float]]]:
    """
    Collect and merge audio chunks based on speech timestamps.

    Args:
        audio: Original audio array
        chunks: List of timestamp dictionaries from get_speech_timestamps
        sampling_rate: Audio sampling rate in Hz (default: 16000)
        max_duration: Maximum duration in seconds for merged chunks (default: inf)

    Returns:
        Tuple of (audio_chunks, chunks_metadata)
        - audio_chunks: List of audio chunk arrays corresponding to speech segments
        - chunks_metadata: List of metadata dictionaries with offset, duration,
          and segments info

    Notes:
        - Merges speech chunks that would otherwise exceed max_duration
        - Returns an empty chunk if no speech timestamps are provided
        - Metadata includes timing information for each merged chunk
    """
```
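Conceptually, chunk collection slices the audio array at each segment's start/end boundaries (the library additionally merges neighboring segments up to `max_duration`). A simplified pure-Python sketch of just the slicing step — not the library's actual merging logic, and `slice_speech` is our name:

```python
def slice_speech(audio, chunks):
    """Extract one sub-sequence per speech segment.
    `audio` is any sliceable sequence (e.g. a numpy array);
    `chunks` is the output of get_speech_timestamps."""
    return [audio[c["start"]:c["end"]] for c in chunks]
```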
## Usage Examples

### Basic VAD Usage
```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps

# Decode audio at the 16 kHz rate the VAD expects
audio = decode_audio("interview.mp3", sampling_rate=16000)

# Get speech timestamps with default settings
speech_timestamps = get_speech_timestamps(audio)

# Convert sample indices to seconds and display
for i, segment in enumerate(speech_timestamps):
    start_sec = segment["start"] / 16000
    end_sec = segment["end"] / 16000
    duration = end_sec - start_sec
    print(f"Speech segment {i+1}: {start_sec:.2f}s - {end_sec:.2f}s ({duration:.2f}s)")
```
### Custom VAD Configuration
```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps, VadOptions

audio = decode_audio("noisy_audio.wav")

# Configure VAD for a noisy environment
vad_options = VadOptions(
    threshold=0.6,                 # Higher threshold for noisy audio
    min_speech_duration_ms=500,    # Ignore very short speech
    min_silence_duration_ms=1000,  # End segments after shorter silences
    speech_pad_ms=200,             # Less padding for tight segments
)

speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)

print(f"Found {len(speech_timestamps)} speech segments")
for segment in speech_timestamps:
    start_sec = segment["start"] / 16000
    end_sec = segment["end"] / 16000
    print(f"  {start_sec:.2f}s - {end_sec:.2f}s")
```
### VAD with Transcription
```python
from faster_whisper import WhisperModel
from faster_whisper.vad import VadOptions

model = WhisperModel("base")

# Use VAD filtering during transcription
vad_options = VadOptions(
    threshold=0.5,
    min_speech_duration_ms=1000,
    max_speech_duration_s=30,
)

segments, info = model.transcribe(
    "lecture.mp3",
    vad_filter=True,
    vad_parameters=vad_options,
    word_timestamps=True,
)

print(f"Duration before VAD: {info.duration:.2f}s")
print(f"Duration after VAD: {info.duration_after_vad:.2f}s")
print(f"VAD filtered out {info.duration - info.duration_after_vad:.2f}s of silence")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
### Processing Long Audio with VAD
```python
from faster_whisper import WhisperModel, decode_audio
from faster_whisper.vad import get_speech_timestamps, collect_chunks, VadOptions

# Process a very long audio file efficiently
audio = decode_audio("long_podcast.mp3")
print(f"Total audio duration: {len(audio) / 16000 / 60:.1f} minutes")

# Configure VAD for podcast content
vad_options = VadOptions(
    threshold=0.4,                 # Lower threshold for clear speech
    min_speech_duration_ms=2000,   # Ignore short utterances
    max_speech_duration_s=60,      # Split very long segments
    min_silence_duration_ms=3000,  # Allow longer pauses
    speech_pad_ms=500,             # More padding for natural speech
)

# Get speech segments
speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)
speech_chunks, chunks_metadata = collect_chunks(audio, speech_timestamps)

print(f"Found {len(speech_chunks)} speech segments")

# Transcribe only the speech chunks
model = WhisperModel("medium")
all_segments = []

for i, (chunk, chunk_metadata) in enumerate(zip(speech_chunks, chunks_metadata)):
    print(f"Processing speech chunk {i+1}/{len(speech_chunks)}")

    # Transcribe chunk
    segments, info = model.transcribe(chunk)

    # Shift timestamps onto the global timeline. Segment is an immutable
    # NamedTuple, so build shifted copies with _replace instead of
    # assigning to its fields.
    chunk_start_sec = chunk_metadata["offset"]

    for segment in segments:
        all_segments.append(
            segment._replace(
                start=segment.start + chunk_start_sec,
                end=segment.end + chunk_start_sec,
            )
        )

# Display results
for segment in all_segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
### VAD Quality Analysis
```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps, VadOptions
import numpy as np

def analyze_vad_quality(audio_path, vad_options=None):
    """Analyze VAD performance on an audio file."""
    audio = decode_audio(audio_path)
    total_duration = len(audio) / 16000

    speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)

    if not speech_timestamps:
        print("No speech detected!")
        return

    # Calculate statistics
    speech_samples = sum(seg["end"] - seg["start"] for seg in speech_timestamps)
    speech_duration = speech_samples / 16000
    silence_duration = total_duration - speech_duration

    segment_durations = [(seg["end"] - seg["start"]) / 16000 for seg in speech_timestamps]
    avg_segment_duration = np.mean(segment_durations)

    print(f"Audio Analysis for {audio_path}:")
    print(f"  Total duration: {total_duration:.2f}s")
    print(f"  Speech duration: {speech_duration:.2f}s ({speech_duration/total_duration*100:.1f}%)")
    print(f"  Silence duration: {silence_duration:.2f}s ({silence_duration/total_duration*100:.1f}%)")
    print(f"  Number of segments: {len(speech_timestamps)}")
    print(f"  Average segment duration: {avg_segment_duration:.2f}s")
    print(f"  Shortest segment: {min(segment_durations):.2f}s")
    print(f"  Longest segment: {max(segment_durations):.2f}s")

# Test different VAD configurations
analyze_vad_quality("meeting.wav")

# More aggressive VAD
strict_options = VadOptions(threshold=0.7, min_speech_duration_ms=1500)
analyze_vad_quality("meeting.wav", strict_options)
```
## VAD Parameter Tuning Guidelines
### Threshold Selection
- **0.3-0.4**: Sensitive; good for quiet or distant speech
- **0.5**: Balanced; good for most scenarios (default)
- **0.6-0.7**: Conservative; good for noisy environments
- **0.8+**: Very conservative; may miss quiet speech

### Duration Parameters
- **min_speech_duration_ms**: Filter out mouth sounds and very short utterances
- **max_speech_duration_s**: Prevent excessively long segments that hurt transcription
- **min_silence_duration_ms**: Control sensitivity to brief pauses in speech
- **speech_pad_ms**: Ensure speech edges aren't cut off

### Use Cases
- **Interviews/Meetings**: Lower threshold (0.4), longer min_speech_duration_ms
- **Phone Calls**: Higher threshold (0.6), more padding
- **Lectures**: Lower threshold, longer max_speech_duration_s
- **Noisy Environments**: Higher threshold, more filtering
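One way to apply these guidelines in code is a small preset table. The preset names and exact values below are our suggestions derived from the bullet points above, not library defaults; pass a preset to `VadOptions(**preset)` or directly as `vad_parameters` in `model.transcribe`:

```python
# Hypothetical presets based on the tuning guidelines above.
VAD_PRESETS = {
    "interview": {"threshold": 0.4, "min_speech_duration_ms": 1000},
    "phone_call": {"threshold": 0.6, "speech_pad_ms": 600},
    "lecture": {"threshold": 0.4, "max_speech_duration_s": 60},
    "noisy": {"threshold": 0.7, "min_speech_duration_ms": 500},
}

def get_preset(name):
    """Return a copy of the named preset so callers can tweak it
    without mutating the shared table."""
    return dict(VAD_PRESETS[name])
```

For example, `VadOptions(**get_preset("noisy"))` starts from the noisy-environment settings, and individual fields can be overridden before use.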