
# Voice Activity Detection

Voice activity detection functionality using Silero VAD for automatic silence detection and audio segmentation. VAD helps improve transcription accuracy by filtering out silence and focusing processing on speech segments.

## Capabilities

### VAD Configuration

Configure voice activity detection parameters for different audio scenarios and quality requirements.

```python { .api }
@dataclass
class VadOptions:
    """
    Voice Activity Detection options for Silero VAD.

    Attributes:
        threshold: Speech threshold (0-1). Probabilities above this are considered speech.
            Higher values are more conservative. Default: 0.5
        neg_threshold: Silence threshold for speech end detection. If None, uses threshold.
            Values below this are always silence. Values above are speech only if the
            previous sample was speech. Default: None
        min_speech_duration_ms: Minimum speech segment duration in milliseconds.
            Shorter segments are discarded. Default: 0
        max_speech_duration_s: Maximum speech segment duration in seconds.
            Longer segments are split at silence gaps > 100 ms, or
            aggressively if there is no suitable split point. Default: inf
        min_silence_duration_ms: Minimum silence duration before ending a speech segment.
            Audio must be silent this long to end a segment. Default: 2000
        speech_pad_ms: Padding added to both ends of speech segments in milliseconds.
            Helps avoid cutting off speech edges. Default: 400
    """

    threshold: float = 0.5
    neg_threshold: float | None = None
    min_speech_duration_ms: int = 0
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    speech_pad_ms: int = 400
```
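
The `threshold`/`neg_threshold` pair acts as a hysteresis: probabilities must cross the higher threshold to start a speech segment, but only need to stay above the lower one to continue it. A minimal sketch using only the fields documented above:

```python
from faster_whisper.vad import VadOptions

# Hysteresis-style configuration: frames need probability > 0.55 to start
# a speech segment, but only need to stay above 0.35 to keep it going.
options = VadOptions(threshold=0.55, neg_threshold=0.35)
```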

### Speech Timestamp Detection

Extract speech timestamps from audio using the Silero VAD model for automatic speech segmentation.

```python { .api }
def get_speech_timestamps(
    audio: np.ndarray,
    vad_options: VadOptions | None = None,
    sampling_rate: int = 16000,
    **kwargs,
) -> list[dict]:
    """
    Get speech timestamps using Silero VAD.

    Args:
        audio: Audio data as numpy array (mono, float32)
        vad_options: VAD configuration options. If None, uses defaults
        sampling_rate: Audio sample rate in Hz
        **kwargs: Additional arguments passed to Silero VAD

    Returns:
        List of dictionaries with speech segments:
        [
            {"start": start_sample, "end": end_sample},
            {"start": start_sample, "end": end_sample},
            ...
        ]

    Notes:
        - Timestamps are in sample indices, not seconds
        - Convert to seconds by dividing by sampling_rate
        - Empty list returned if no speech detected
    """
```
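
Since timestamps come back as sample indices, a common first step is converting them to seconds. A minimal sketch (the file name is a placeholder):

```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps

sampling_rate = 16000
audio = decode_audio("speech.wav", sampling_rate=sampling_rate)  # placeholder file

segments = get_speech_timestamps(audio, sampling_rate=sampling_rate)

# Divide sample indices by the sampling rate to get seconds.
segments_sec = [
    {"start": seg["start"] / sampling_rate, "end": seg["end"] / sampling_rate}
    for seg in segments
]
print(segments_sec)  # [] when no speech is detected
```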

### Speech Chunk Collection

Collect and process audio chunks based on detected speech timestamps.

```python { .api }
def collect_chunks(
    audio: np.ndarray,
    chunks: list[dict],
    sampling_rate: int = 16000,
    max_duration: float = float("inf"),
) -> tuple[list[np.ndarray], list[dict[str, float]]]:
    """
    Collect and merge audio chunks based on speech timestamps.

    Args:
        audio: Original audio array
        chunks: List of timestamp dictionaries from get_speech_timestamps
        sampling_rate: Audio sampling rate in Hz (default: 16000)
        max_duration: Maximum duration in seconds for merged chunks (default: inf)

    Returns:
        Tuple of (audio_chunks, chunks_metadata)
        - audio_chunks: List of audio chunk arrays corresponding to speech segments
        - chunks_metadata: List of metadata dictionaries with offset, duration, and segments info

    Notes:
        - Merges consecutive speech chunks, starting a new chunk once max_duration would be exceeded
        - Returns an empty chunk if no speech timestamps are provided
        - Metadata includes timing information for each merged chunk
    """
```
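
The `max_duration` parameter bounds how much speech is merged into a single chunk, which helps when downstream processing works best on fixed-size windows. A sketch using the metadata keys described above (`offset`, `duration`); the file name is a placeholder:

```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps, collect_chunks

audio = decode_audio("speech.wav")  # placeholder file, decoded at 16 kHz
timestamps = get_speech_timestamps(audio)

# Cap each merged chunk at 30 seconds of speech.
audio_chunks, chunks_metadata = collect_chunks(audio, timestamps, max_duration=30.0)

for chunk, meta in zip(audio_chunks, chunks_metadata):
    print(f"offset={meta['offset']:.2f}s duration={meta['duration']:.2f}s samples={len(chunk)}")
```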

## Usage Examples

### Basic VAD Usage

```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps

# Decode audio
audio = decode_audio("interview.mp3", sampling_rate=16000)

# Get speech timestamps with default settings
speech_timestamps = get_speech_timestamps(audio)

# Convert sample indices to seconds and display
for i, segment in enumerate(speech_timestamps):
    start_sec = segment["start"] / 16000
    end_sec = segment["end"] / 16000
    duration = end_sec - start_sec
    print(f"Speech segment {i + 1}: {start_sec:.2f}s - {end_sec:.2f}s ({duration:.2f}s)")
```

### Custom VAD Configuration

```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps, VadOptions

audio = decode_audio("noisy_audio.wav")

# Configure VAD for a noisy environment
vad_options = VadOptions(
    threshold=0.6,                 # Higher threshold for noisy audio
    min_speech_duration_ms=500,    # Ignore very short speech
    min_silence_duration_ms=1000,  # End segments after shorter silence gaps
    speech_pad_ms=200,             # Less padding for tight segments
)

speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)

print(f"Found {len(speech_timestamps)} speech segments")
for segment in speech_timestamps:
    start_sec = segment["start"] / 16000
    end_sec = segment["end"] / 16000
    print(f"  {start_sec:.2f}s - {end_sec:.2f}s")
```

### VAD with Transcription

```python
from faster_whisper import WhisperModel
from faster_whisper.vad import VadOptions

model = WhisperModel("base")

# Use VAD filtering during transcription
vad_options = VadOptions(
    threshold=0.5,
    min_speech_duration_ms=1000,
    max_speech_duration_s=30,
)

segments, info = model.transcribe(
    "lecture.mp3",
    vad_filter=True,
    vad_parameters=vad_options,
    word_timestamps=True,
)

print(f"Duration before VAD: {info.duration:.2f}s")
print(f"Duration after VAD: {info.duration_after_vad:.2f}s")
print(f"VAD filtered out {info.duration - info.duration_after_vad:.2f}s of silence")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
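
`vad_parameters` also accepts a plain dict with the same field names, which avoids importing `VadOptions`. A minimal equivalent of the configuration above:

```python
from faster_whisper import WhisperModel

model = WhisperModel("base")

# Same VAD configuration, passed as a plain dict
segments, info = model.transcribe(
    "lecture.mp3",
    vad_filter=True,
    vad_parameters={"threshold": 0.5, "min_speech_duration_ms": 1000},
)
```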

### Processing Long Audio with VAD

```python
from faster_whisper import WhisperModel, decode_audio
from faster_whisper.vad import get_speech_timestamps, collect_chunks, VadOptions

# Process a very long audio file efficiently
audio = decode_audio("long_podcast.mp3")
print(f"Total audio duration: {len(audio) / 16000 / 60:.1f} minutes")

# Configure VAD for podcast content
vad_options = VadOptions(
    threshold=0.4,                 # Lower threshold for clear speech
    min_speech_duration_ms=2000,   # Ignore short utterances
    max_speech_duration_s=60,      # Split very long segments
    min_silence_duration_ms=3000,  # Allow longer pauses
    speech_pad_ms=500,             # More padding for natural speech
)

# Get speech segments
speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)
speech_chunks, chunks_metadata = collect_chunks(audio, speech_timestamps)

print(f"Found {len(speech_chunks)} speech segments")

# Transcribe only speech chunks
model = WhisperModel("medium")
all_segments = []

for i, (chunk, chunk_metadata) in enumerate(zip(speech_chunks, chunks_metadata)):
    print(f"Processing speech chunk {i + 1}/{len(speech_chunks)}")

    # Transcribe chunk
    segments, info = model.transcribe(chunk)

    # Shift timestamps onto the global timeline. Segment is an immutable
    # NamedTuple, so build shifted copies with _replace rather than
    # assigning to its fields.
    chunk_start_sec = chunk_metadata["offset"]
    for segment in segments:
        all_segments.append(
            segment._replace(
                start=segment.start + chunk_start_sec,
                end=segment.end + chunk_start_sec,
            )
        )

# Display results
for segment in all_segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

### VAD Quality Analysis

```python
from faster_whisper import decode_audio
from faster_whisper.vad import get_speech_timestamps, VadOptions
import numpy as np

def analyze_vad_quality(audio_path, vad_options=None):
    """Analyze VAD performance on an audio file."""
    audio = decode_audio(audio_path)
    total_duration = len(audio) / 16000

    speech_timestamps = get_speech_timestamps(audio, vad_options=vad_options)

    if not speech_timestamps:
        print("No speech detected!")
        return

    # Calculate statistics
    speech_samples = sum(seg["end"] - seg["start"] for seg in speech_timestamps)
    speech_duration = speech_samples / 16000
    silence_duration = total_duration - speech_duration

    segment_durations = [(seg["end"] - seg["start"]) / 16000 for seg in speech_timestamps]
    avg_segment_duration = np.mean(segment_durations)

    print(f"Audio Analysis for {audio_path}:")
    print(f"  Total duration: {total_duration:.2f}s")
    print(f"  Speech duration: {speech_duration:.2f}s ({speech_duration / total_duration * 100:.1f}%)")
    print(f"  Silence duration: {silence_duration:.2f}s ({silence_duration / total_duration * 100:.1f}%)")
    print(f"  Number of segments: {len(speech_timestamps)}")
    print(f"  Average segment duration: {avg_segment_duration:.2f}s")
    print(f"  Shortest segment: {min(segment_durations):.2f}s")
    print(f"  Longest segment: {max(segment_durations):.2f}s")

# Test different VAD configurations
analyze_vad_quality("meeting.wav")

# More aggressive VAD
strict_options = VadOptions(threshold=0.7, min_speech_duration_ms=1500)
analyze_vad_quality("meeting.wav", strict_options)
```

## VAD Parameter Tuning Guidelines

### Threshold Selection

- **0.3-0.4**: Sensitive, good for quiet/distant speech
- **0.5**: Balanced, good for most scenarios (default)
- **0.6-0.7**: Conservative, good for noisy environments
- **0.8+**: Very conservative, may miss quiet speech

### Duration Parameters

- **min_speech_duration_ms**: Filter out mouth sounds and very short utterances
- **max_speech_duration_s**: Prevent excessively long segments that hurt transcription
- **min_silence_duration_ms**: Control sensitivity to brief pauses in speech
- **speech_pad_ms**: Ensure speech edges aren't cut off

### Use Cases

- **Interviews/Meetings**: Lower threshold (0.4), longer min_speech_duration_ms
- **Phone Calls**: Higher threshold (0.6), more padding
- **Lectures**: Lower threshold, longer max_speech_duration_s
- **Noisy Environments**: Higher threshold, more aggressive filtering (see the preset sketch below)
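
One way to keep these recommendations at hand is a small preset table. A sketch built only on the `VadOptions` fields defined above; the specific values follow the guidelines rather than any library defaults, so treat them as starting points:

```python
from faster_whisper.vad import VadOptions

# Presets derived from the tuning guidelines above; adjust to taste.
VAD_PRESETS = {
    "interview": VadOptions(threshold=0.4, min_speech_duration_ms=1000),
    "phone_call": VadOptions(threshold=0.6, speech_pad_ms=600),
    "lecture": VadOptions(threshold=0.4, max_speech_duration_s=60),
    "noisy": VadOptions(threshold=0.7, min_speech_duration_ms=500),
}

vad_options = VAD_PRESETS["interview"]
```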