# Voice Activity Detection

Voice activity detection using Silero VAD for automatic silence detection and audio segmentation. VAD improves transcription accuracy by filtering out silence so that processing focuses on speech segments.

## Capabilities

### VAD Configuration

Configure voice activity detection parameters for different audio scenarios and quality requirements.
```python { .api }
@dataclass
class VadOptions:
    """
    Voice Activity Detection options for Silero VAD.

    Attributes:
        threshold: Speech threshold (0-1). Probabilities above this are considered
            speech. Higher values are more conservative. Default: 0.5
        neg_threshold: Silence threshold for speech end detection. If None, uses
            threshold. Values below this are always silence; values above are
            speech only if the previous sample was speech. Default: None
        min_speech_duration_ms: Minimum speech segment duration in milliseconds.
            Shorter segments are discarded. Default: 0
        max_speech_duration_s: Maximum speech segment duration in seconds. Longer
            segments are split at silence gaps > 100 ms, or aggressively if there
            is no suitable split point. Default: inf
        min_silence_duration_ms: Minimum silence duration before ending a speech
            segment. Audio must be silent this long to end a segment. Default: 2000
        speech_pad_ms: Padding added to both ends of speech segments in
            milliseconds. Helps avoid cutting off speech edges. Default: 400
    """

    threshold: float = 0.5
    neg_threshold: float | None = None
    min_speech_duration_ms: int = 0
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    speech_pad_ms: int = 400
```
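The `threshold`/`neg_threshold` pair describes a simple hysteresis: a frame's speech probability must rise to `threshold` to enter the speech state, and fall below `neg_threshold` to leave it; in between, the previous state is kept. A minimal sketch of that decision rule (illustrative only, not the library's internal code — `label_frames` is our name):

```python
def label_frames(probs, threshold=0.5, neg_threshold=None):
    """Label per-frame speech probabilities as speech (True) or silence (False)
    using the hysteresis rule described in the VadOptions docstring.
    Illustrative sketch only."""
    if neg_threshold is None:
        neg_threshold = threshold
    labels = []
    speaking = False
    for p in probs:
        if p >= threshold:
            speaking = True   # clearly speech: enter or stay in the speech state
        elif p < neg_threshold:
            speaking = False  # clearly silence: leave the speech state
        # between neg_threshold and threshold: keep the previous state
        labels.append(speaking)
    return labels
```

With `neg_threshold` below `threshold`, brief probability dips during a word no longer toggle the segment off, which is exactly why a separate end threshold helps on choppy audio.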
### Speech Timestamp Detection

Extract speech timestamps from audio using the Silero VAD model for automatic speech segmentation.
```python { .api }
def get_speech_timestamps(
    audio: np.ndarray,
    vad_options: VadOptions | None = None,
    sampling_rate: int = 16000,
    **kwargs,
) -> list[dict]:
    """
    Get speech timestamps using Silero VAD.

    Args:
        audio: Audio data as a numpy array (mono, float32)
        vad_options: VAD configuration options. If None, uses defaults
        sampling_rate: Audio sample rate in Hz
        **kwargs: Additional arguments passed to Silero VAD

    Returns:
        List of dictionaries with speech segments:
        [
            {"start": start_sample, "end": end_sample},
            {"start": start_sample, "end": end_sample},
            ...
        ]

    Notes:
        - Timestamps are in sample indices, not seconds
        - Convert to seconds by dividing by sampling_rate
        - An empty list is returned if no speech is detected
    """
```
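Because the returned timestamps are sample indices, converting them for display or downstream tools is a per-segment division by the sample rate. A small helper (the function name is ours, not part of the library):

```python
def timestamps_to_seconds(speech_timestamps, sampling_rate=16000):
    """Convert sample-index segments from get_speech_timestamps
    into second-based segments."""
    return [
        {"start": seg["start"] / sampling_rate, "end": seg["end"] / sampling_rate}
        for seg in speech_timestamps
    ]
```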
### Speech Chunk Collection

Collect and process audio chunks based on detected speech timestamps.
```python { .api }
def collect_chunks(
    audio: np.ndarray,
    chunks: list[dict],
    sampling_rate: int = 16000,
    max_duration: float = float("inf"),
) -> tuple[list[np.ndarray], list[dict[str, float]]]:
    """
    Collect and merge audio chunks based on speech timestamps.

    Args:
        audio: Original audio array
        chunks: List of timestamp dictionaries from get_speech_timestamps
        sampling_rate: Audio sampling rate in Hz (default: 16000)
        max_duration: Maximum duration in seconds for merged chunks (default: inf)

    Returns:
        Tuple of (audio_chunks, chunks_metadata)
        - audio_chunks: List of audio chunk arrays corresponding to speech segments
        - chunks_metadata: List of metadata dictionaries with offset, duration,
          and segments info

    Notes:
        - Merges speech chunks that would otherwise exceed max_duration
        - Returns an empty chunk if no speech timestamps are provided
        - Metadata includes timing information for each merged chunk
    """
```
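Conceptually, chunk collection slices the audio array at each segment's start/end boundaries (the library additionally merges neighboring segments up to `max_duration`). A simplified pure-Python sketch of just the slicing step — not the library's actual merging logic, and `slice_speech` is our name:

```python
def slice_speech(audio, chunks):
    """Extract one sub-sequence per speech segment.
    `audio` is any sliceable sequence (e.g. a numpy array);
    `chunks` is the output of get_speech_timestamps."""
    return [audio[c["start"]:c["end"]] for c in chunks]
```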
## Usage Examples

### Basic VAD Usage
```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps

# Decode audio at the 16 kHz rate the VAD expects
audio = decode_audio("interview.mp3", sampling_rate=16000)

# Get speech timestamps with default settings
speech_timestamps = get_speech_timestamps(audio)

# Convert sample indices to seconds and display
for i, segment in enumerate(speech_timestamps):
    start_sec = segment["start"] / 16000
    end_sec = segment["end"] / 16000
    duration = end_sec - start_sec
    print(f"Speech segment {i+1}: {start_sec:.2f}s - {end_sec:.2f}s ({duration:.2f}s)")
```
### Custom VAD Configuration
```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps, VadOptions

audio = decode_audio("noisy_audio.wav")

# Configure VAD for a noisy environment
vad_options = VadOptions(
    threshold=0.6,                 # Higher threshold for noisy audio
    min_speech_duration_ms=500,    # Ignore very short speech
    min_silence_duration_ms=1000,  # End segments after shorter silences
    speech_pad_ms=200,             # Less padding for tight segments
)

speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)

print(f"Found {len(speech_timestamps)} speech segments")
for segment in speech_timestamps:
    start_sec = segment["start"] / 16000
    end_sec = segment["end"] / 16000
    print(f"  {start_sec:.2f}s - {end_sec:.2f}s")
```
### VAD with Transcription
```python
from faster_whisper import WhisperModel
from faster_whisper.vad import VadOptions

model = WhisperModel("base")

# Use VAD filtering during transcription
vad_options = VadOptions(
    threshold=0.5,
    min_speech_duration_ms=1000,
    max_speech_duration_s=30,
)

segments, info = model.transcribe(
    "lecture.mp3",
    vad_filter=True,
    vad_parameters=vad_options,
    word_timestamps=True,
)

print(f"Duration before VAD: {info.duration:.2f}s")
print(f"Duration after VAD: {info.duration_after_vad:.2f}s")
print(f"VAD filtered out {info.duration - info.duration_after_vad:.2f}s of silence")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
### Processing Long Audio with VAD
```python
from faster_whisper import WhisperModel, decode_audio
from faster_whisper.vad import get_speech_timestamps, collect_chunks, VadOptions

# Process a very long audio file efficiently
audio = decode_audio("long_podcast.mp3")
print(f"Total audio duration: {len(audio) / 16000 / 60:.1f} minutes")

# Configure VAD for podcast content
vad_options = VadOptions(
    threshold=0.4,                 # Lower threshold for clear speech
    min_speech_duration_ms=2000,   # Ignore short utterances
    max_speech_duration_s=60,      # Split very long segments
    min_silence_duration_ms=3000,  # Allow longer pauses
    speech_pad_ms=500,             # More padding for natural speech
)

# Get speech segments
speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)
speech_chunks, chunks_metadata = collect_chunks(audio, speech_timestamps)

print(f"Found {len(speech_chunks)} speech segments")

# Transcribe only the speech chunks
model = WhisperModel("medium")
all_segments = []

for i, (chunk, chunk_metadata) in enumerate(zip(speech_chunks, chunks_metadata)):
    print(f"Processing speech chunk {i+1}/{len(speech_chunks)}")

    # Transcribe chunk
    segments, info = model.transcribe(chunk)

    # Shift timestamps onto the global timeline. Segment is an immutable
    # NamedTuple, so build shifted copies with _replace instead of
    # assigning to its fields.
    chunk_start_sec = chunk_metadata["offset"]

    for segment in segments:
        all_segments.append(
            segment._replace(
                start=segment.start + chunk_start_sec,
                end=segment.end + chunk_start_sec,
            )
        )

# Display results
for segment in all_segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
### VAD Quality Analysis
```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps, VadOptions
import numpy as np

def analyze_vad_quality(audio_path, vad_options=None):
    """Analyze VAD performance on an audio file."""
    audio = decode_audio(audio_path)
    total_duration = len(audio) / 16000

    speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)

    if not speech_timestamps:
        print("No speech detected!")
        return

    # Calculate statistics
    speech_samples = sum(seg["end"] - seg["start"] for seg in speech_timestamps)
    speech_duration = speech_samples / 16000
    silence_duration = total_duration - speech_duration

    segment_durations = [(seg["end"] - seg["start"]) / 16000 for seg in speech_timestamps]
    avg_segment_duration = np.mean(segment_durations)

    print(f"Audio Analysis for {audio_path}:")
    print(f"  Total duration: {total_duration:.2f}s")
    print(f"  Speech duration: {speech_duration:.2f}s ({speech_duration/total_duration*100:.1f}%)")
    print(f"  Silence duration: {silence_duration:.2f}s ({silence_duration/total_duration*100:.1f}%)")
    print(f"  Number of segments: {len(speech_timestamps)}")
    print(f"  Average segment duration: {avg_segment_duration:.2f}s")
    print(f"  Shortest segment: {min(segment_durations):.2f}s")
    print(f"  Longest segment: {max(segment_durations):.2f}s")

# Test different VAD configurations
analyze_vad_quality("meeting.wav")

# More aggressive VAD
strict_options = VadOptions(threshold=0.7, min_speech_duration_ms=1500)
analyze_vad_quality("meeting.wav", strict_options)
```
## VAD Parameter Tuning Guidelines
### Threshold Selection
- **0.3-0.4**: Sensitive; good for quiet or distant speech
- **0.5**: Balanced; good for most scenarios (default)
- **0.6-0.7**: Conservative; good for noisy environments
- **0.8+**: Very conservative; may miss quiet speech

### Duration Parameters
- **min_speech_duration_ms**: Filter out mouth sounds and very short utterances
- **max_speech_duration_s**: Prevent excessively long segments that hurt transcription
- **min_silence_duration_ms**: Control sensitivity to brief pauses in speech
- **speech_pad_ms**: Ensure speech edges aren't cut off

### Use Cases
- **Interviews/Meetings**: Lower threshold (0.4), longer min_speech_duration_ms
- **Phone Calls**: Higher threshold (0.6), more padding
- **Lectures**: Lower threshold, longer max_speech_duration_s
- **Noisy Environments**: Higher threshold, more filtering
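One way to apply these guidelines in code is a small preset table. The preset names and exact values below are our suggestions derived from the bullet points above, not library defaults; pass a preset to `VadOptions(**preset)` or directly as `vad_parameters` in `model.transcribe`:

```python
# Hypothetical presets based on the tuning guidelines above.
VAD_PRESETS = {
    "interview": {"threshold": 0.4, "min_speech_duration_ms": 1000},
    "phone_call": {"threshold": 0.6, "speech_pad_ms": 600},
    "lecture": {"threshold": 0.4, "max_speech_duration_s": 60},
    "noisy": {"threshold": 0.7, "min_speech_duration_ms": 500},
}

def get_preset(name):
    """Return a copy of the named preset so callers can tweak it
    without mutating the shared table."""
    return dict(VAD_PRESETS[name])
```

For example, `VadOptions(**get_preset("noisy"))` starts from the noisy-environment settings, and individual fields can be overridden before use.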