# Signal Processing Functions

A collection of stateless audio processing functions for spectral analysis, filtering, resampling, pitch manipulation, beamforming, and related signal processing algorithms. They operate directly on tensors, and most of them support PyTorch's autograd, so they can be used in gradient-based optimization.

## Capabilities

### Spectral Analysis

Core spectral analysis functions for converting between the time and frequency domains.

```python { .api }
from typing import Optional, Union

import torch


def spectrogram(waveform: torch.Tensor, pad: int, window: torch.Tensor,
                n_fft: int, hop_length: int, win_length: int,
                power: Optional[float], normalized: Union[bool, str],
                center: bool = True, pad_mode: str = "reflect",
                onesided: bool = True, return_complex: Optional[bool] = None) -> torch.Tensor:
    """
    Create a spectrogram, or a batch of spectrograms, from a raw audio signal.

    Args:
        waveform: Tensor of audio of dimension (..., time)
        pad: Two-sided padding of the signal
        window: Window tensor that is applied/multiplied to each frame
        n_fft: Size of FFT
        hop_length: Length of hop between STFT windows
        win_length: Window size
        power: Exponent for the magnitude spectrogram (must be > 0), e.g. 1 for magnitude, 2 for power. If None, the complex spectrum is returned instead.
        normalized: Whether to normalize by magnitude after the STFT. If a str, choices are "window" and "frame_length", selecting the normalization type; True maps to "window".
        center: Whether to pad the waveform on both sides so that the t-th frame is centered at time t × hop_length
        pad_mode: Padding method used when center is True
        onesided: Whether to return only half of the result to avoid redundancy
        return_complex: Deprecated and unused; pass power=None to obtain a complex spectrum

    Returns:
        Tensor: Spectrogram of dimension (..., freq, time), where freq is n_fft // 2 + 1 when onesided is True and time is the number of frames
    """

def inverse_spectrogram(spectrogram: torch.Tensor, length: Optional[int], pad: int,
                        window: torch.Tensor, n_fft: int, hop_length: int,
                        win_length: int, normalized: Union[bool, str],
                        center: bool = True, pad_mode: str = "reflect",
                        onesided: bool = True) -> torch.Tensor:
    """
    Reconstruct a waveform from a complex-valued spectrogram using the inverse STFT.

    Args:
        spectrogram: Complex-valued spectrogram of dimension (..., freq, time)
        length: Expected length of the output waveform (optional)
        (the remaining parameters have the same meaning as in spectrogram)

    Returns:
        Tensor: Reconstructed waveform of dimension (..., time)
    """

def griffinlim(specgram: torch.Tensor, window: torch.Tensor, n_fft: int,
               hop_length: int, win_length: int, power: float, n_iter: int,
               momentum: float, length: Optional[int], rand_init: bool) -> torch.Tensor:
    """
    Estimate a waveform from a magnitude-based spectrogram using the Griffin-Lim algorithm.

    Args:
        specgram: Magnitude-based spectrogram of dimension (..., freq, time)
        window: Window tensor applied to each frame
        n_fft: Size of FFT
        hop_length: Length of hop between STFT windows
        win_length: Window size
        power: Exponent that was applied to the magnitude when specgram was computed (1 for magnitude, 2 for power)
        n_iter: Number of Griffin-Lim iterations
        momentum: Momentum parameter for fast Griffin-Lim; 0 recovers the original algorithm
        length: Expected output length
        rand_init: Initialize the phase randomly if True, with zeros otherwise

    Returns:
        Tensor: Reconstructed waveform of dimension (..., time)
    """
```
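The output shape of a spectrogram follows from how the STFT frames the signal. The minimal sketch below shows that arithmetic. It assumes `torch.stft`'s framing rules, where `center=True` pads the input by `n_fft // 2` samples on each side. `spectrogram_shape` and `bin_frequencies` are hypothetical helpers, not library functions.

```python
def spectrogram_shape(num_samples, n_fft, hop_length, pad=0, center=True, onesided=True):
    """Return (n_freqs, n_frames) for a 1-D input of num_samples samples."""
    length = num_samples + 2 * pad       # explicit two-sided `pad` argument
    if center:
        length += 2 * (n_fft // 2)       # STFT pads n_fft // 2 samples on each side
    if length < n_fft:
        raise ValueError("padded input is shorter than n_fft")
    n_frames = 1 + (length - n_fft) // hop_length
    n_freqs = n_fft // 2 + 1 if onesided else n_fft
    return n_freqs, n_frames


def bin_frequencies(n_fft, sample_rate):
    """Frequency in Hz of each one-sided FFT bin: k * sample_rate / n_fft."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]


# One second of 16 kHz audio with 25 ms windows (n_fft=400) and a 10 ms hop (160 samples)
print(spectrogram_shape(16000, n_fft=400, hop_length=160))  # (201, 101)
print(bin_frequencies(400, 16000)[:3])                       # [0.0, 40.0, 80.0]
```

With `center=False` no extra padding is added, so the same input yields 98 frames instead of 101.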

### Mel-Scale Processing

Functions for mel-scale analysis, commonly used in speech and music processing.

```python { .api }
from typing import Optional

import torch


def melscale_fbanks(n_freqs: int, f_min: float, f_max: float, n_mels: int,
                    sample_rate: int, norm: Optional[str] = None,
                    mel_scale: str = "htk") -> torch.Tensor:
    """
    Create a frequency-bin conversion matrix of triangular mel filter banks.

    Args:
        n_freqs: Number of frequency bins (typically n_fft // 2 + 1)
        f_min: Minimum frequency (Hz)
        f_max: Maximum frequency (Hz)
        n_mels: Number of mel filter banks
        sample_rate: Sample rate of the audio
        norm: If "slaney", divide the triangular mel weights by the width of the mel band (area normalization); if None, no normalization
        mel_scale: Scale to use ("htk" or "slaney")

    Returns:
        Tensor: Mel filter bank matrix of dimension (n_freqs, n_mels)
    """

def linear_fbanks(n_freqs: int, f_min: float, f_max: float, n_filter: int,
                  sample_rate: int) -> torch.Tensor:
    """
    Create a matrix of linearly spaced triangular filter banks.

    Args:
        n_freqs: Number of frequency bins
        f_min: Minimum frequency (Hz)
        f_max: Maximum frequency (Hz)
        n_filter: Number of linear filter banks
        sample_rate: Sample rate of the audio

    Returns:
        Tensor: Linear filter bank matrix of dimension (n_freqs, n_filter)
    """
```
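A mel filter bank is a set of overlapping triangles whose edges are evenly spaced on the mel scale. The minimal pure-Python sketch below builds one. It assumes the HTK formula mel = 2595 * log10(1 + f / 700) and applies no normalization, roughly the `norm=None, mel_scale="htk"` case. It is an illustration, not the library's implementation, and the helper names are hypothetical. The sketch lays the matrix out with one row per frequency bin and one column per mel band, so a mel frame is a weighted sum over bins.

```python
import math


def hz_to_mel(freq):
    """HTK mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def linspace(start, stop, num):
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def mel_fbanks_sketch(n_freqs, f_min, f_max, n_mels, sample_rate):
    """Return an n_freqs x n_mels list of lists of triangular filter weights."""
    bin_freqs = linspace(0.0, sample_rate / 2, n_freqs)  # center frequency of each FFT bin
    # n_mels + 2 points evenly spaced in mel give each filter a left edge, a peak, and a right edge.
    edges = [mel_to_hz(m) for m in linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2)]
    fb = []
    for f in bin_freqs:
        row = []
        for m in range(n_mels):
            left, peak, right = edges[m], edges[m + 1], edges[m + 2]
            rising = (f - left) / (peak - left)
            falling = (right - f) / (right - peak)
            row.append(max(0.0, min(rising, falling)))
        fb.append(row)
    return fb


fb = mel_fbanks_sketch(n_freqs=201, f_min=0.0, f_max=8000.0, n_mels=40, sample_rate=16000)
power_frame = [1.0] * 201                   # one frame of a power spectrogram (placeholder values)
mel_frame = [sum(power_frame[k] * fb[k][m] for k in range(201)) for m in range(40)]
print(len(fb), len(fb[0]), len(mel_frame))  # 201 40 40
```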

### Amplitude and Decibel Conversion

Functions for converting between linear amplitude/power and logarithmic decibel scales.

```python { .api }
from typing import Optional

import torch


def amplitude_to_DB(x: torch.Tensor, multiplier: float, amin: float,
                    db_multiplier: float, top_db: Optional[float] = None) -> torch.Tensor:
    """
    Convert a spectrogram from the power/amplitude scale to the decibel scale.

    Args:
        x: Input power or amplitude spectrogram
        multiplier: Use 10.0 for power and 20.0 for amplitude
        amin: Number to clamp x to, which avoids taking log10 of zero
        db_multiplier: log10(max(reference value, amin)); the result is shifted by -multiplier * db_multiplier
        top_db: Minimum negative cut-off in decibels: values more than top_db below the maximum are clamped (a reasonable value is 80)

    Returns:
        Tensor: Spectrogram in the decibel scale
    """

def DB_to_amplitude(x: torch.Tensor, ref: float, power: float) -> torch.Tensor:
    """
    Convert a tensor from the decibel scale back to the power/amplitude scale.

    Args:
        x: Input tensor in the decibel scale
        ref: Reference value by which the output is scaled
        power: 1.0 converts dB to power, 0.5 converts dB to amplitude

    Returns:
        Tensor: ref * (10 ** (x / 10)) ** power
    """
```
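The two conversions are simple elementwise formulas. The minimal pure-Python sketch below works on a flat list and assumes the following:

- Forward: x_db = multiplier * log10(max(x, amin)) - multiplier * log10(max(ref, amin)), followed by an optional floor at max(x_db) - top_db. Here the floor is taken over the whole list.
- Inverse: ref * (10 ** (x_db / 10)) ** power.

The helper names and the `ref` argument are illustrative assumptions. In this sketch, `power=1.0` maps dB back to power values and `power=0.5` maps dB back to amplitudes.

```python
import math


def amplitude_to_db_sketch(values, multiplier=10.0, amin=1e-10, ref=1.0, top_db=None):
    """Elementwise dB conversion of a flat list of power or amplitude values."""
    db_multiplier = math.log10(max(ref, amin))
    out = [multiplier * math.log10(max(v, amin)) - multiplier * db_multiplier for v in values]
    if top_db is not None:
        floor = max(out) - top_db          # clamp everything more than top_db below the peak
        out = [max(v, floor) for v in out]
    return out


def db_to_amplitude_sketch(values, ref=1.0, power=1.0):
    """Inverse mapping: ref * (10 ** (x / 10)) ** power."""
    return [ref * (10.0 ** (0.1 * v)) ** power for v in values]


power_spec = [1.0, 0.1, 0.001]
db = amplitude_to_db_sketch(power_spec, multiplier=10.0)                   # about [0, -10, -30]
clipped = amplitude_to_db_sketch(power_spec, multiplier=10.0, top_db=20.0)  # about [0, -10, -20]
back = db_to_amplitude_sketch(db, power=1.0)                               # about [1.0, 0.1, 0.001]

# Amplitude (magnitude) values use multiplier=20 going in and power=0.5 coming back.
amp_db = amplitude_to_db_sketch([0.1], multiplier=20.0)                    # about [-20]
amp_back = db_to_amplitude_sketch(amp_db, power=0.5)                       # about [0.1]
```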

### Resampling

Audio resampling for sample rate conversion.

```python { .api }
from typing import Optional

import torch


def resample(waveform: torch.Tensor, orig_freq: int, new_freq: int,
             lowpass_filter_width: int = 6, rolloff: float = 0.99,
             resampling_method: str = "sinc_interp_hann",
             beta: Optional[float] = None) -> torch.Tensor:
    """
    Resample a waveform to a new sample rate using band-limited (windowed-sinc) interpolation.

    Args:
        waveform: Input waveform of dimension (..., time)
        orig_freq: Original sample rate
        new_freq: Target sample rate
        lowpass_filter_width: Controls the sharpness of the filter; larger values give a sharper filter but are less efficient
        rolloff: Roll-off frequency of the lowpass filter, as a fraction of the Nyquist frequency; lower values reduce aliasing but also attenuate the highest frequencies
        resampling_method: Resampling method ("sinc_interp_hann" or "sinc_interp_kaiser")
        beta: Shape parameter of the Kaiser window (used only with "sinc_interp_kaiser")

    Returns:
        Tensor: Waveform at the new sample rate, of dimension (..., time)
    """
```
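Rational-ratio resamplers work with the two sample rates divided by their greatest common divisor, so a ratio that reduces to small integers is typically cheaper to process. The sketch below only does the bookkeeping and performs no filtering. It rests on two assumptions:

- The output length is ceil(new_freq * num_samples / orig_freq).
- `rolloff` is a fraction of the Nyquist frequency of the lower of the two rates.

`resample_plan` is a hypothetical helper.

```python
import math


def resample_plan(orig_freq, new_freq, num_samples, rolloff=0.99):
    """Return the reduced rate ratio, the expected output length, and the lowpass cutoff in Hz."""
    g = math.gcd(orig_freq, new_freq)
    orig, new = orig_freq // g, new_freq // g           # 44100 -> 16000 becomes 441 -> 160
    out_len = math.ceil(new * num_samples / orig)       # assumed output-length rule
    cutoff_hz = rolloff * min(orig_freq, new_freq) / 2  # fraction of the lower Nyquist frequency
    return (orig, new), out_len, cutoff_hz


ratio, out_len, cutoff_hz = resample_plan(44100, 16000, num_samples=44100)
print(ratio, out_len)  # (441, 160) 16000
```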

### Audio Effects and Filtering

Audio filters and effects; many of them reproduce the behavior of the corresponding SoX effects.

```python { .api }
from typing import Optional

import torch


def biquad(waveform: torch.Tensor, b0: float, b1: float, b2: float,
           a0: float, a1: float, a2: float) -> torch.Tensor:
    """
    Apply a biquad (second-order IIR) filter. Initial conditions are set to 0.

    Args:
        waveform: Input audio of dimension (..., time)
        b0, b1, b2: Numerator (feed-forward) coefficients
        a0, a1, a2: Denominator (feedback) coefficients

    Returns:
        Tensor: Filtered audio of dimension (..., time)
    """

def allpass_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float,
                   Q: float = 0.707) -> torch.Tensor:
    """
    Design a two-pole all-pass filter and apply it. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform, e.g. 44100 (Hz)
        central_freq: Central frequency (in Hz)
        Q: Q factor (Default: 0.707)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def band_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float,
                Q: float = 0.707, noise: bool = False) -> torch.Tensor:
    """
    Design a two-pole band filter and apply it. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform, e.g. 44100 (Hz)
        central_freq: Central frequency (in Hz)
        Q: Q factor (Default: 0.707)
        noise: If True, use the alternate mode for un-pitched audio (e.g. percussion); if False, use the mode oriented to pitched audio (voice, singing, instrumental music)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def bandpass_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float,
                    Q: float = 0.707, const_skirt_gain: bool = False) -> torch.Tensor:
    """
    Design a two-pole band-pass filter and apply it. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform, e.g. 44100 (Hz)
        central_freq: Central frequency (in Hz)
        Q: Q factor (Default: 0.707)
        const_skirt_gain: If True, use a constant skirt gain (peak gain = Q); if False, use a constant 0 dB peak gain

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def bandreject_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float,
                      Q: float = 0.707) -> torch.Tensor:
    """
    Design a two-pole band-reject filter and apply it. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform, e.g. 44100 (Hz)
        central_freq: Central frequency (in Hz)
        Q: Q factor (Default: 0.707)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def bass_biquad(waveform: torch.Tensor, sample_rate: int, gain: float,
                central_freq: float = 100, Q: float = 0.707) -> torch.Tensor:
    """
    Design a bass tone-control effect. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform, e.g. 44100 (Hz)
        gain: Desired gain at the boost (or attenuation) in dB
        central_freq: Central frequency (in Hz, default: 100)
        Q: Q factor (Default: 0.707)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def contrast(waveform: torch.Tensor, enhancement_amount: float = 75.0) -> torch.Tensor:
    """
    Apply a contrast effect. Similar to SoX implementation. Comparable with compression,
    this effect modifies an audio signal to make it sound louder.

    Args:
        waveform: Audio waveform of dimension (..., time)
        enhancement_amount: Amount of enhancement, in the range 0 to 100 (default: 75.0); note that 0 still gives a significant contrast enhancement

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def dcshift(waveform: torch.Tensor, shift: float,
            limiter_gain: Optional[float] = None) -> torch.Tensor:
    """
    Apply a DC shift to the audio. Similar to SoX implementation. This can be useful to
    remove a DC offset from the audio.

    Args:
        waveform: Audio waveform of dimension (..., time)
        shift: Amount to shift the audio, in the range -2.0 to +2.0
        limiter_gain: Gain applied only on peaks to prevent clipping; it should be much less than 1 (e.g. 0.05 or 0.02)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def deemph_biquad(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """
    Apply ISO 908 CD de-emphasis (shelving) IIR filter. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform; must be 44100 or 48000

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def dither(waveform: torch.Tensor, density_function: str = "TPDF",
           noise_shaping: bool = False) -> torch.Tensor:
    """
    Apply dither. Dither increases the perceived dynamic range of audio stored at a
    particular bit-depth by masking quantization (truncation) distortion with minimally
    perceptible noise.

    Args:
        waveform: Audio waveform of dimension (..., time)
        density_function: Density function of the added noise: "TPDF" (triangular), "RPDF" (rectangular), or "GPDF" (Gaussian)
        noise_shaping: Whether to apply noise shaping, which shapes the spectral energy of the quantization error

    Returns:
        Tensor: Dithered waveform
    """

def equalizer_biquad(waveform: torch.Tensor, sample_rate: int, center_freq: float,
                     gain: float, Q: float = 0.707) -> torch.Tensor:
    """
    Design a biquad peaking equalizer filter and apply it. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform
        center_freq: Center frequency (in Hz)
        gain: Gain in dB
        Q: Q factor (Default: 0.707)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def filtfilt(waveform: torch.Tensor, a_coeffs: torch.Tensor, b_coeffs: torch.Tensor,
             clamp: bool = True) -> torch.Tensor:
    """
    Apply an IIR filter forward and then backward to a waveform, which yields zero phase
    distortion. Inspired by scipy.signal.filtfilt.

    Args:
        waveform: Input waveform of dimension (..., time)
        a_coeffs: Denominator coefficients of the difference equation, of dimension (num_order + 1) or (num_filters, num_order + 1); lower-delay coefficients come first, e.g. [a0, a1, a2, ...]
        b_coeffs: Numerator coefficients of the difference equation, with the same size as a_coeffs (pad with zeros as necessary)
        clamp: If True, clamp the output signal to the range [-1, 1]

    Returns:
        Tensor: Zero-phase filtered waveform of dimension (..., num_filters, time) if the coefficients are 2D, or (..., time) otherwise
    """

def flanger(waveform: torch.Tensor, sample_rate: int, delay: float = 0.0,
            depth: float = 2.0, regen: float = 0.0, width: float = 71.0,
            speed: float = 0.5, phase: float = 25.0, modulation: str = "sinusoidal",
            interpolation: str = "linear") -> torch.Tensor:
    """
    Apply a flanger effect to the audio. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., channel, time)
        sample_rate: Sampling rate of the waveform
        delay: Base delay in milliseconds
        depth: Delay depth in milliseconds
        regen: Regeneration (feedback) in percent
        width: Delay line width in percent
        speed: Modulation speed in Hz
        phase: Phase in percent
        modulation: Modulation type ("sinusoidal" or "triangular")
        interpolation: Interpolation type ("linear" or "quadratic")

    Returns:
        Tensor: Waveform of dimension (..., channel, time)
    """

def gain(waveform: torch.Tensor, gain_db: float = 1.0) -> torch.Tensor:
    """
    Apply amplification or attenuation to the whole waveform.

    Args:
        waveform: Audio waveform of dimension (..., time)
        gain_db: Gain in decibels (default: 1.0)

    Returns:
        Tensor: Waveform multiplied by 10 ** (gain_db / 20)
    """

def highpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float,
                    Q: float = 0.707) -> torch.Tensor:
    """
    Design a biquad high-pass filter and apply it. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform
        cutoff_freq: Cutoff frequency (in Hz)
        Q: Q factor (Default: 0.707)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def lfilter(waveform: torch.Tensor, a_coeffs: torch.Tensor, b_coeffs: torch.Tensor,
            clamp: bool = True, batching: bool = True) -> torch.Tensor:
    """
    Apply an IIR filter by evaluating its difference equation.

    Args:
        waveform: Input waveform of dimension (..., time)
        a_coeffs: Denominator coefficients of the difference equation, of dimension (num_order + 1) or (num_filters, num_order + 1); lower-delay coefficients come first, e.g. [a0, a1, a2, ...]
        b_coeffs: Numerator coefficients of the difference equation, with the same size as a_coeffs (pad with zeros as necessary)
        clamp: If True, clamp the output signal to the range [-1, 1]
        batching: Effective only when the coefficients are 2D. If True, the waveform must be at least 2D with num_filters as the size of its second-to-last axis, and filter i is applied to waveform[..., i, :]

    Returns:
        Tensor: Filtered waveform of dimension (..., num_filters, time) if the coefficients are 2D, or (..., time) otherwise
    """

def lowpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float,
                   Q: float = 0.707) -> torch.Tensor:
    """
    Design a biquad low-pass filter and apply it. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform
        cutoff_freq: Cutoff frequency (in Hz)
        Q: Q factor (Default: 0.707)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def overdrive(waveform: torch.Tensor, gain: float = 20,
              colour: float = 20) -> torch.Tensor:
    """
    Apply an overdrive effect to the audio. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        gain: Desired gain at the boost (or attenuation) in dB, in the range 0 to 100
        colour: Amount of even harmonic content in the overdriven output, in the range 0 to 100

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def phaser(waveform: torch.Tensor, sample_rate: int, gain_in: float = 0.4,
           gain_out: float = 0.74, delay_ms: float = 3.0, decay: float = 0.4,
           mod_speed: float = 0.5, sinusoidal: bool = True) -> torch.Tensor:
    """
    Apply a phasing effect to the audio. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform
        gain_in: Input gain
        gain_out: Output gain
        delay_ms: Delay in milliseconds
        decay: Decay amount
        mod_speed: Modulation speed in Hz
        sinusoidal: If True, use sinusoidal modulation; if False, use triangular modulation

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def riaa_biquad(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """
    Apply RIAA vinyl playback equalization. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform; must be 44100, 48000, 88200, or 96000

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def treble_biquad(waveform: torch.Tensor, sample_rate: int, gain: float,
                  central_freq: float = 3000, Q: float = 0.707) -> torch.Tensor:
    """
    Design a treble tone-control effect. Similar to SoX implementation.

    Args:
        waveform: Audio waveform of dimension (..., time)
        sample_rate: Sampling rate of the waveform
        gain: Gain in dB
        central_freq: Central frequency (in Hz, default: 3000)
        Q: Q factor (Default: 0.707)

    Returns:
        Tensor: Waveform of dimension (..., time)
    """

def vad(waveform: torch.Tensor, sample_rate: int, trigger_level: float = 7.0,
        trigger_time: float = 0.25, search_time: float = 1.0,
        allowed_gap: float = 0.25, pre_trigger_time: float = 0.0,
        boot_time: float = 0.35, noise_up_time: float = 0.1,
        noise_down_time: float = 0.01, noise_reduction_amount: float = 1.35,
        measure_freq: float = 20.0, measure_duration: Optional[float] = None,
        measure_smooth_time: float = 0.4, hp_filter_freq: float = 50.0,
        lp_filter_freq: float = 6000.0, hp_lifter_freq: float = 150.0,
        lp_lifter_freq: float = 2000.0) -> torch.Tensor:
    """
    Voice Activity Detector. Similar to SoX implementation. Attempts to trim silence and
    quiet background sounds from the start of a speech recording. It trims only from the
    front; to trim the end, reverse the waveform, apply vad, and reverse the result again.

    Args:
        waveform: Tensor of audio of dimension (..., time)
        sample_rate: Sample rate of audio
        trigger_level: Measurement level used to trigger activity detection (default: 7.0)
        trigger_time: Time constant in seconds used to help ignore short bursts of sound (default: 0.25)
        search_time: Amount of audio in seconds to search for quieter/shorter bursts to include before the detected trigger point (default: 1.0)
        allowed_gap: Allowed gap in seconds between quieter/shorter bursts to include before the detected trigger point (default: 0.25)
        pre_trigger_time: Amount of audio in seconds to preserve before the trigger point and any found quieter/shorter bursts (default: 0.0)
        boot_time: Time in seconds used for the initial noise estimate (default: 0.35)
        noise_up_time: Time constant used by the adaptive noise estimator when the noise level is increasing (default: 0.1)
        noise_down_time: Time constant used by the adaptive noise estimator when the noise level is decreasing (default: 0.01)
        noise_reduction_amount: Amount of noise reduction used in the detection algorithm (default: 1.35)
        measure_freq: Frequency in Hz of the algorithm's processing/measurements (default: 20.0)
        measure_duration: Measurement duration in seconds (default: twice the measurement period, i.e. with overlap)
        measure_smooth_time: Time constant used to smooth spectral measurements (default: 0.4)
        hp_filter_freq: "Brick-wall" frequency in Hz of the high-pass filter applied at the detector input (default: 50.0)
        lp_filter_freq: "Brick-wall" frequency in Hz of the low-pass filter applied at the detector input (default: 6000.0)
        hp_lifter_freq: "Brick-wall" frequency in Hz of the high-pass lifter used in the detector (default: 150.0)
        lp_lifter_freq: "Brick-wall" frequency in Hz of the low-pass lifter used in the detector (default: 2000.0)

    Returns:
        Tensor: Audio with leading silence trimmed
    """
```
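All the `*_biquad` effects share one second-order difference equation and differ only in how they compute the six coefficients. The minimal pure-Python sketch below makes that concrete with a low-pass design. It assumes the Audio EQ Cookbook (Robert Bristow-Johnson) low-pass formulas, which are the usual basis for SoX-style biquads. `lowpass_coeffs` and `biquad_filter` are hypothetical helpers, and unlike a clamping filter, the sketch leaves its output unbounded.

```python
import math


def lowpass_coeffs(sample_rate, cutoff_freq, Q=0.707):
    """Audio EQ Cookbook low-pass coefficients, returned as (b0, b1, b2, a0, a1, a2)."""
    w0 = 2 * math.pi * cutoff_freq / sample_rate
    alpha = math.sin(w0) / (2 * Q)
    cos_w0 = math.cos(w0)
    b0 = (1 - cos_w0) / 2
    b1 = 1 - cos_w0
    b2 = (1 - cos_w0) / 2
    return b0, b1, b2, 1 + alpha, -2 * cos_w0, 1 - alpha


def biquad_filter(x, b0, b1, b2, a0, a1, a2):
    """Direct-form I difference equation with zero initial conditions (no output clamping)."""
    b0, b1, b2, a1, a2 = b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0
    x1 = x2 = y1 = y2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y.append(yn)
        x1, x2 = xn, x1
        y1, y2 = yn, y1
    return y


sr = 16000
coeffs = lowpass_coeffs(sr, cutoff_freq=1000.0)
step = biquad_filter([1.0] * 2000, *coeffs)   # DC passes with unity gain
tone = [math.sin(2 * math.pi * 7000 * n / sr) for n in range(2000)]
high = biquad_filter(tone, *coeffs)           # 7 kHz is strongly attenuated
print(round(step[-1], 6))                     # 1.0
```

Swapping in a different set of cookbook formulas (high-pass, band-pass, peaking EQ) changes only `lowpass_coeffs`; the filtering loop stays the same.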

### Beamforming and Array Processing

Beamforming functions for multi-channel audio: power spectral density (PSD) estimation, relative transfer function (RTF) estimation, MVDR weight computation, and application of the resulting weights.

```python { .api }
from typing import Optional, Union

import torch


def apply_beamforming(beamform_weights: torch.Tensor,
                      specgram: torch.Tensor) -> torch.Tensor:
    """
    Apply beamforming weights to a multi-channel complex-valued spectrogram.

    Args:
        beamform_weights: Complex-valued beamforming weights of dimension (..., freq, channel)
        specgram: Multi-channel complex-valued spectrogram of dimension (..., channel, freq, time)

    Returns:
        Tensor: Single-channel enhanced complex spectrum of dimension (..., freq, time)
    """

def mvdr_weights_souden(psd_s: torch.Tensor, psd_n: torch.Tensor,
                        reference_channel: Union[int, torch.Tensor],
                        diagonal_loading: bool = True,
                        diag_eps: float = 1e-7) -> torch.Tensor:
    """
    Compute MVDR (Minimum Variance Distortionless Response) beamforming weights using Souden's method.

    Args:
        psd_s: PSD matrix of the target speech, of dimension (..., freq, channel, channel)
        psd_n: PSD matrix of the noise, of dimension (..., freq, channel, channel)
        reference_channel: Reference channel index, or a one-hot tensor of dimension (..., channel)
        diagonal_loading: Whether to apply diagonal loading to psd_n
        diag_eps: Coefficient of the identity matrix added for diagonal loading

    Returns:
        Tensor: Complex-valued MVDR beamforming weights of dimension (..., freq, channel)
    """

def mvdr_weights_rtf(rtf: torch.Tensor, psd_n: torch.Tensor,
                     reference_channel: Optional[Union[int, torch.Tensor]] = None,
                     diagonal_loading: bool = True,
                     diag_eps: float = 1e-7) -> torch.Tensor:
    """
    Compute MVDR beamforming weights from the relative transfer function (RTF).

    Args:
        rtf: Complex-valued RTF of dimension (..., freq, channel)
        psd_n: PSD matrix of the noise, of dimension (..., freq, channel, channel)
        reference_channel: Reference channel index, or a one-hot tensor of dimension (..., channel) (optional)
        diagonal_loading: Whether to apply diagonal loading to psd_n
        diag_eps: Coefficient of the identity matrix added for diagonal loading

    Returns:
        Tensor: Complex-valued MVDR beamforming weights of dimension (..., freq, channel)
    """

def rtf_evd(psd_s: torch.Tensor) -> torch.Tensor:
    """
    Estimate the relative transfer function (RTF), or steering vector, by eigenvalue
    decomposition of the target-speech PSD matrix.

    Args:
        psd_s: PSD matrix of the target speech, of dimension (..., freq, channel, channel)

    Returns:
        Tensor: Complex-valued RTF of dimension (..., freq, channel)
    """

def rtf_power(psd_s: torch.Tensor, psd_n: torch.Tensor,
              reference_channel: int = 0) -> torch.Tensor:
    """
    Estimate the relative transfer function (RTF) using the power iteration method.

    Args:
        psd_s: PSD matrix of the target speech, of dimension (..., freq, channel, channel)
        psd_n: PSD matrix of the noise, of dimension (..., freq, channel, channel)
        reference_channel: Reference channel index

    Returns:
        Tensor: Complex-valued RTF of dimension (..., freq, channel)
    """

def psd(specgrams: torch.Tensor, mask: Optional[torch.Tensor] = None,
        normalize: bool = True, eps: float = 1e-15) -> torch.Tensor:
    """
    Compute the cross-channel power spectral density (PSD) matrix.

    Args:
        specgrams: Multi-channel complex-valued spectrograms of dimension (..., channel, freq, time)
        mask: Optional time-frequency mask for PSD estimation, of dimension (..., freq, time)
        normalize: Whether to normalize the mask along the time dimension
        eps: Small value added to the denominator for numerical stability

    Returns:
        Tensor: Complex-valued PSD matrix of dimension (..., freq, channel, channel)
    """
```
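For a single frequency bin, beamforming reduces to small complex linear algebra. A PSD matrix is a (mask-weighted) average of x x^H over frames, and applying weights is a conjugate-weighted sum over channels. The minimal pure-Python sketch below assumes a two-microphone steering vector `steer` and simple delay-and-sum weights w = d / M; both are illustrative choices. An MVDR design would instead derive `w` from the speech and noise PSD matrices. The helper names are hypothetical.

```python
import cmath


def psd_one_bin(frames, mask=None):
    """PSD matrix for one frequency bin: (mask-weighted) average of x x^H over frames."""
    weights = mask if mask is not None else [1.0] * len(frames)
    total = sum(weights)
    n_ch = len(frames[0])
    return [[sum(w * x[i] * x[j].conjugate() for w, x in zip(weights, frames)) / total
             for j in range(n_ch)]
            for i in range(n_ch)]


def apply_weights_one_bin(weights, frame):
    """Beamformer output for one bin and one frame: sum over channels of conj(w_c) * x_c."""
    return sum(w.conjugate() * x for w, x in zip(weights, frame))


# Two microphones: the second one sees the source with a 90-degree phase lag.
steer = [1.0 + 0j, cmath.exp(-0.5j * cmath.pi)]
source = [1.0 + 0j, 0.5 - 0.25j, -0.3 + 0.8j]
frames = [[s * d for d in steer] for s in source]    # frames[t][c]

phi = psd_one_bin(frames)                            # 2 x 2 Hermitian matrix
w = [d / len(steer) for d in steer]                  # delay-and-sum weights w = d / M
out = [apply_weights_one_bin(w, x) for x in frames]  # recovers the source values
```

The tensor functions perform the same contraction over every frequency bin and frame at once.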

### Pitch and Speed Manipulation

Functions for pitch shifting and time-scale modification.

```python { .api }
from typing import Optional, Tuple

import torch


def pitch_shift(waveform: torch.Tensor, sample_rate: int, n_steps: float,
                bins_per_octave: int = 12, n_fft: int = 512,
                win_length: Optional[int] = None, hop_length: Optional[int] = None,
                window: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Shift the pitch of a waveform by n_steps steps without changing its duration.

    Args:
        waveform: Input waveform of dimension (..., time)
        sample_rate: Sample rate of the waveform
        n_steps: Number of (possibly fractional) steps to shift; positive values raise the pitch
        bins_per_octave: Number of steps per octave (12 makes each step a semitone)
        n_fft: Size of FFT
        win_length: Window size (default: n_fft)
        hop_length: Length of hop between STFT windows (default: win_length // 4)
        window: Window tensor applied to each frame (default: Hann window of size win_length)

    Returns:
        Tensor: Pitch-shifted waveform of dimension (..., time)
    """

def speed(waveform: torch.Tensor, orig_freq: int, factor: float,
          lengths: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    """
    Adjust the speed of a waveform by a factor. This changes its duration and, unlike
    pitch_shift, also its pitch.

    Args:
        waveform: Input waveform of dimension (..., time)
        orig_freq: Original sample rate of the waveform
        factor: Speed factor; values above 1.0 speed the audio up (shorter output), values below 1.0 slow it down
        lengths: Valid lengths of the signals in waveform, of dimension (...); if None, all elements are treated as valid

    Returns:
        Tuple: (speed-adjusted waveform, adjusted valid lengths, or None if lengths is None)
    """

def detect_pitch_frequency(waveform: torch.Tensor, sample_rate: int,
                           frame_time: float = 10 ** (-2), win_length: int = 30,
                           freq_low: int = 85, freq_high: int = 3400) -> torch.Tensor:
    """
    Detect pitch frequency using the normalized cross-correlation function (NCCF) and
    median smoothing.

    Args:
        waveform: Input waveform of dimension (..., time)
        sample_rate: Sample rate of the waveform (Hz)
        frame_time: Duration of a frame in seconds
        win_length: Length of the median-smoothing window, in frames
        freq_low: Lowest detectable frequency (Hz)
        freq_high: Highest detectable frequency (Hz)

    Returns:
        Tensor: Detected pitch frequencies of dimension (..., frame)
    """
```
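Pitch detection by normalized cross-correlation compares a frame with lagged copies of itself and picks the lag that matches best. The minimal single-frame sketch below is pure Python. It assumes candidate lags between `sample_rate / freq_high` and `sample_rate / freq_low` and a plain argmax, and it omits framing and median smoothing across frames. `nccf_pitch` is a hypothetical helper. Because of the plain argmax, it can lock onto a multiple of the true period when several periods fit inside the lag range.

```python
import math


def nccf_pitch(frame, sample_rate, freq_low=85, freq_high=3400):
    """Estimate the pitch (Hz) of one frame as the lag with the highest NCCF score."""
    lag_min = max(1, int(sample_rate / freq_high))              # shortest period considered
    lag_max = min(len(frame) - 1, int(sample_rate / freq_low))  # longest period considered
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        a, b = frame[:-lag], frame[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0                   # normalized to [-1, 1]
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


sr = 8000
frame = [math.sin(2 * math.pi * 100 * n / sr) for n in range(800)]  # 100 Hz tone, 0.1 s
print(nccf_pitch(frame, sr))  # 100.0
```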

### Codec and Format Processing

Functions for codec simulation and audio format processing.

```python { .api }
from typing import Optional

import torch


def apply_codec(waveform: torch.Tensor, sample_rate: int, format: str,
                encoder: Optional[str] = None, encoder_config: Optional[dict] = None,
                decoder: Optional[str] = None,
                decoder_config: Optional[dict] = None) -> torch.Tensor:
    """
    Encode and then decode a waveform with a codec, e.g. to simulate compression artifacts.

    Args:
        waveform: Input waveform of dimension (..., time)
        sample_rate: Sample rate of the waveform
        format: Audio format ("wav", "mp3", "ogg", etc.)
        encoder: Encoder name
        encoder_config: Encoder configuration
        decoder: Decoder name
        decoder_config: Decoder configuration

    Returns:
        Tensor: Codec-processed waveform
    """

def mu_law_encoding(x: torch.Tensor, quantization_channels: int = 256) -> torch.Tensor:
    """
    Encode a signal based on mu-law companding. The input is expected to be scaled to
    [-1, 1]; the output contains integer codes from 0 to quantization_channels - 1.

    Args:
        x: Input tensor of dimension (..., time), scaled to [-1, 1]
        quantization_channels: Number of quantization channels

    Returns:
        Tensor: Mu-law encoded tensor of integer codes
    """

def mu_law_decoding(x_mu: torch.Tensor, quantization_channels: int = 256) -> torch.Tensor:
    """
    Decode a mu-law encoded signal. The input is expected to contain integer codes from
    0 to quantization_channels - 1; the output is scaled to [-1, 1].

    Args:
        x_mu: Mu-law encoded input of dimension (..., time)
        quantization_channels: Number of quantization channels

    Returns:
        Tensor: Decoded tensor scaled to [-1, 1]
    """
```
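Mu-law companding compresses amplitudes logarithmically before uniform quantization, so quiet samples keep more resolution than loud ones. The minimal single-sample sketch below is pure Python. It assumes the standard formulation F(x) = sign(x) * ln(1 + mu * |x|) / ln(1 + mu) with mu = quantization_channels - 1, followed by rounding to the nearest code. The helper names are hypothetical.

```python
import math


def mu_law_encode(x, quantization_channels=256):
    """Map a sample in [-1, 1] to an integer code in [0, quantization_channels - 1]."""
    mu = quantization_channels - 1
    sign = (x > 0) - (x < 0)
    y = sign * math.log1p(mu * abs(x)) / math.log1p(mu)  # companded value in [-1, 1]
    return int((y + 1) / 2 * mu + 0.5)                   # shift to [0, mu] and round


def mu_law_decode(code, quantization_channels=256):
    """Map an integer code back to a sample in [-1, 1]."""
    mu = quantization_channels - 1
    y = code / mu * 2 - 1                                # undo the shift to [0, mu]
    sign = (y > 0) - (y < 0)
    return sign * (math.exp(abs(y) * math.log1p(mu)) - 1) / mu


samples = [-1.0, -0.5, 0.0, 0.25, 0.9, 1.0]
codes = [mu_law_encode(s) for s in samples]
recon = [mu_law_decode(c) for c in codes]
print(codes[0], codes[2], codes[-1])  # 0 128 255
```

The round-trip error grows toward full scale (a few hundredths near ±1), which is the intended trade-off of logarithmic quantization.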

### Advanced Signal Processing

Additional signal processing utilities and analysis functions.

```python { .api }
from typing import List, Optional

import torch


def preemphasis(waveform: torch.Tensor, coeff: float = 0.97) -> torch.Tensor:
    """
    Pre-emphasize a waveform along its time dimension: y[i] = x[i] - coeff * x[i - 1].

    Args:
        waveform: Input waveform of dimension (..., time)
        coeff: Pre-emphasis coefficient, typically between 0.0 and 1.0

    Returns:
        Tensor: Pre-emphasized waveform of dimension (..., time)
    """

def deemphasis(waveform: torch.Tensor, coeff: float = 0.97) -> torch.Tensor:
    """
    De-emphasize a waveform along its time dimension: y[i] = x[i] + coeff * y[i - 1].

    Args:
        waveform: Input waveform of dimension (..., time)
        coeff: De-emphasis coefficient, typically between 0.0 and 1.0

    Returns:
        Tensor: De-emphasized waveform of dimension (..., time)
    """

def phase_vocoder(complex_specgrams: torch.Tensor, rate: float,
                  phase_advance: torch.Tensor) -> torch.Tensor:
    """
    Given an STFT tensor, speed it up in time without modifying pitch by applying a phase vocoder.

    Args:
        complex_specgrams: Complex-valued spectrogram of dimension (..., freq, time)
        rate: Speed-up factor; values above 1 shorten the output, values below 1 lengthen it
        phase_advance: Expected phase advance in each bin, of dimension (freq, 1)

    Returns:
        Tensor: Time-stretched complex spectrogram of dimension (..., freq, ceil(time / rate))
    """

def mask_along_axis(specgram: torch.Tensor, mask_param: int, mask_value: float,
                    axis: int) -> torch.Tensor:
    """
    Apply a mask along the given axis. The mask covers indices [v_0, v_0 + v), where v is
    sampled uniformly from [0, mask_param) and v_0 from [0, specgram.size(axis) - v).
    All channels share the same mask.

    Args:
        specgram: Real spectrogram of dimension (channel, freq, time)
        mask_param: Maximum mask length; the actual length is sampled uniformly from [0, mask_param)
        mask_value: Value assigned to the masked columns
        axis: Axis to apply masking on (1 for freq, 2 for time)

    Returns:
        Tensor: Masked spectrogram of dimension (channel, freq, time)
    """

def mask_along_axis_iid(specgrams: torch.Tensor, mask_param: int, mask_value: float,
                        axis: int) -> torch.Tensor:
    """
    Apply a mask along the given axis, sampling the mask interval independently for each
    example and channel. The mask covers indices [v_0, v_0 + v), with v sampled uniformly
    from [0, mask_param) and v_0 from [0, specgrams.size(axis) - v).

    Args:
        specgrams: Real spectrograms of dimension (batch, channel, freq, time)
        mask_param: Maximum mask length
        mask_value: Value assigned to the masked columns
        axis: Axis to apply masking on (2 for freq, 3 for time)

    Returns:
        Tensor: Masked spectrograms of dimension (batch, channel, freq, time)
    """

def compute_deltas(specgram: torch.Tensor, win_length: int = 5,
                   mode: str = "replicate") -> torch.Tensor:
    """
    Compute delta coefficients of a tensor, usually a spectrogram, along its last dimension:
    d_t = sum_{n=1}^{N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1}^{N} n^2), where N = (win_length - 1) // 2.

    Args:
        specgram: Input tensor of dimension (..., freq, time)
        win_length: Window length used for computing the deltas
        mode: Padding mode ("replicate", "constant", etc.)

    Returns:
        Tensor: Delta coefficients with the same shape as specgram
    """

def create_dct(n_mfcc: int, n_mels: int, norm: Optional[str] = None) -> torch.Tensor:
    """
    Create a DCT transformation matrix, normalized according to norm.

    Args:
        n_mfcc: Number of MFCC coefficients to keep
        n_mels: Number of mel filter banks
        norm: Normalization mode ("ortho" or None)

    Returns:
        Tensor: DCT transformation matrix of dimension (n_mels, n_mfcc), to be right-multiplied to row-wise data
    """

def sliding_window_cmn(specgram: torch.Tensor, cmn_window: int = 600,
                       min_cmn_window: int = 100, center: bool = False,
                       norm_vars: bool = False) -> torch.Tensor:
    """
    Apply sliding-window cepstral mean (and optionally variance) normalization per utterance.

    Args:
        specgram: Input tensor of dimension (..., time, freq)
        cmn_window: Window length in frames for the running-average CMN computation
        min_cmn_window: Minimum CMN window used at the start of decoding; applies only when center is False
        center: If True, use a window centered on the current frame; if False, the window lies to the left
        norm_vars: If True, also normalize the variance to one

    Returns:
        Tensor: Normalized tensor with the same shape as specgram
    """

def spectral_centroid(waveform: torch.Tensor, sample_rate: int, pad: int,
                      window: torch.Tensor, n_fft: int, hop_length: int,
                      win_length: int) -> torch.Tensor:
    """
    Compute the spectral centroid for each channel along the time axis: the average of the
    frequency values weighted by their magnitudes.

    Args:
        waveform: Input tensor of dimension (..., time)
        sample_rate: Sample rate of the waveform
        pad: Two-sided padding of the signal
        window: Window tensor applied to each frame
        n_fft: Size of FFT
        hop_length: Length of hop between STFT windows
        win_length: Window size

    Returns:
        Tensor: Spectral centroid of dimension (..., time)
    """

def add_noise(waveform: torch.Tensor, noise: torch.Tensor, snr: torch.Tensor,
              lengths: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Scale the noise and add it to the waveform so that the result has the given
    signal-to-noise ratio (SNR).

    Args:
        waveform: Input waveform of dimension (..., time)
        noise: Noise of dimension (..., time), the same shape as waveform
        snr: Signal-to-noise ratios in dB, of dimension (...)
        lengths: Valid lengths of the signals in waveform and noise, of dimension (...); if None, all elements are treated as valid

    Returns:
        Tensor: Noisy waveform of dimension (..., time)
    """

def convolve(waveform: torch.Tensor, kernel: torch.Tensor,
             mode: str = "full") -> torch.Tensor:
    """
    Convolve two inputs along their last dimension using direct (time-domain) convolution.

    Args:
        waveform: Input of dimension (..., N)
        kernel: Convolution kernel of dimension (..., M); leading dimensions must be broadcast-compatible with waveform
        mode: "full" returns length N + M - 1, "valid" returns max(N, M) - min(N, M) + 1, "same" returns length N

    Returns:
        Tensor: Convolved waveform
    """

def fftconvolve(waveform: torch.Tensor, kernel: torch.Tensor,
                mode: str = "full") -> torch.Tensor:
    """
    Convolve two inputs along their last dimension using the FFT, which is generally
    faster than direct convolution for long inputs.

    Args:
        waveform: Input of dimension (..., N)
        kernel: Convolution kernel of dimension (..., M); leading dimensions must be broadcast-compatible with waveform
        mode: "full" returns length N + M - 1, "valid" returns max(N, M) - min(N, M) + 1, "same" returns length N

    Returns:
        Tensor: Convolved waveform
    """

def loudness(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """
    Measure audio loudness according to ITU-R BS.1770-4.

    Args:
        waveform: Audio waveform of dimension (..., channel, time)
        sample_rate: Sample rate of the waveform

    Returns:
        Tensor: Loudness in LKFS
    """

def edit_distance(seq1: List[int], seq2: List[int]) -> int:
    """
    Calculate the edit (Levenshtein) distance between two sequences: the minimum number
    of insertions, deletions, and substitutions that turn seq1 into seq2.

    Args:
        seq1: First sequence
        seq2: Second sequence

    Returns:
        int: Edit distance
    """

def rnnt_loss(logits: torch.Tensor, targets: torch.Tensor, logit_lengths: torch.Tensor,
              target_lengths: torch.Tensor, blank: int = -1, clamp: float = -1) -> torch.Tensor:
    """
    Compute the RNN-Transducer loss.

    Args:
        logits: Joiner output of dimension (batch, max_time, max_target_length + 1, n_class)
        targets: Zero-padded target sequences of dimension (batch, max_target_length)
        logit_lengths: Length of each encoder output sequence, of dimension (batch)
        target_lengths: Length of each target sequence, of dimension (batch)
        blank: Blank label index (default: -1)
        clamp: Clamp for gradients (default: -1)

    Returns:
        Tensor: RNN-T loss
    """

def frechet_distance(mu_x: torch.Tensor, sigma_x: torch.Tensor,
                     mu_y: torch.Tensor, sigma_y: torch.Tensor) -> torch.Tensor:
    """
    Compute the Fréchet distance between two multivariate normal distributions:
    ||mu_x - mu_y||^2 + Tr(sigma_x + sigma_y - 2 * sqrt(sigma_x @ sigma_y)).

    Args:
        mu_x: Mean of the first distribution, of dimension (N,)
        sigma_x: Covariance of the first distribution, of dimension (N, N)
        mu_y: Mean of the second distribution, of dimension (N,)
        sigma_y: Covariance of the second distribution, of dimension (N, N)

    Returns:
        Tensor: Fréchet distance (scalar)
    """
```
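Adding noise at a target SNR comes down to one scale factor. Compare the signal and noise energies, then rescale the noise amplitude by the decibel difference divided by 20. The minimal pure-Python sketch below covers one unbatched signal. It assumes energy is the sum of squared samples over the whole signal and omits `lengths` handling. `add_noise_sketch` is a hypothetical helper.

```python
import math
import random


def add_noise_sketch(signal, noise, snr_db):
    """Scale `noise` so the signal-to-noise energy ratio equals snr_db, then add it."""
    energy_signal = sum(s * s for s in signal)
    energy_noise = sum(n * n for n in noise)
    original_snr_db = 10.0 * math.log10(energy_signal / energy_noise)
    # Energy scales with amplitude squared, hence the division by 20 rather than 10.
    scale = 10.0 ** ((original_snr_db - snr_db) / 20.0)
    return [s + scale * n for s, n in zip(signal, noise)], scale


rng = random.Random(0)
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy, scale = add_noise_sketch(signal, noise, snr_db=10.0)

achieved = 10.0 * math.log10(sum(s * s for s in signal) / sum((scale * n) ** 2 for n in noise))
print(round(achieved, 6))  # 10.0
```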

```python { .api }
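def lfilter(waveform: torch.Tensor, a_coeffs: torch.Tensor, b_coeffs: torch.Tensor,
            clamp: bool = True, batching: bool = True) -> torch.Tensor:
    """
    Apply an IIR filter by evaluating its difference equation (differentiable).

    Args:
        waveform: Input signal (..., time), expected in the range [-1, 1]
        a_coeffs: Denominator (autoregressive) coefficients [a0, a1, ...], shape (n_order + 1,)
            or (num_filters, n_order + 1); must match the size of b_coeffs (zero-pad if needed)
        b_coeffs: Numerator (moving-average) coefficients [b0, b1, ...], same shape as a_coeffs
        clamp: Whether to clamp the output to the range [-1, 1]
        batching: Only used with 2D coefficients; if True, filter i is applied to waveform[..., i, :]

    Returns:
        Tensor: Filtered signal (..., time), or (..., num_filters, time) for 2D coefficients
    """
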
def filtfilt(waveform: torch.Tensor, a_coeffs: torch.Tensor, b_coeffs: torch.Tensor,
             clamp: bool = True) -> torch.Tensor:
    """
    Apply zero-phase filtering by running an IIR filter forward and then backward.

    Args:
        waveform: Input signal (..., time), expected in the range [-1, 1]
        a_coeffs: Denominator coefficients (same layout as in lfilter)
        b_coeffs: Numerator coefficients (same layout as in lfilter)
        clamp: Whether to clamp the output to the range [-1, 1]

    Returns:
        Tensor: Zero-phase filtered signal
    """
```
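
The difference equation that `lfilter` evaluates, and the forward-backward scheme behind `filtfilt`, can be sketched in pure Python. This is an illustrative, non-differentiable sketch with zero initial conditions and no output clamping; the helper names `iir_filter` and `forward_backward` are hypothetical, and torchaudio's own implementation differs.

```python
from typing import List, Sequence


def iir_filter(x: Sequence[float], a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Evaluate a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]
    sample by sample, assuming zero initial conditions."""
    y: List[float] = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


def forward_backward(x: Sequence[float], a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Zero-phase filtering: filter, reverse, filter again, reverse back."""
    once = iir_filter(x, a, b)
    twice = iir_filter(once[::-1], a, b)
    return twice[::-1]


# One-pole low-pass: y[n] = 0.5*x[n] + 0.5*y[n-1]
print(iir_filter([1.0, 0.0, 0.0, 0.0], [1.0, -0.5], [0.5, 0.0]))
# [0.5, 0.25, 0.125, 0.0625]

# Two-tap average run forward and backward: the peak stays at index 2 (no phase shift)
print(forward_backward([0.0, 0.0, 1.0, 0.0, 0.0], [1.0, 0.0], [0.5, 0.5]))
# [0.0, 0.25, 0.5, 0.25, 0.0]
```

A single forward pass of the two-tap average would shift the peak by half a sample; the reverse pass cancels that delay, which is what "zero-phase" refers to.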

### Pitch and Time Manipulation

Functions for manipulating pitch and temporal characteristics of audio.

```python { .api }
def pitch_shift(waveform: torch.Tensor, sample_rate: int, n_steps: float,
                bins_per_octave: int = 12, n_fft: int = 512,
                win_length: Optional[int] = None, hop_length: Optional[int] = None,
                window: Optional[torch.Tensor] = None) -> torch.Tensor:
    """
    Shift the pitch of a waveform by n_steps steps while preserving its duration.

    Args:
        waveform: Input audio (..., time)
        sample_rate: Sample rate of the waveform
        n_steps: Number of steps to shift (positive = higher, negative = lower);
            with the default bins_per_octave=12, one step is a semitone
        bins_per_octave: Number of steps per octave
        n_fft: FFT size for the STFT
        win_length: Window length (defaults to n_fft)
        hop_length: Hop length (defaults to win_length // 4)
        window: Window tensor (defaults to a Hann window of size win_length)

    Returns:
        Tensor: Pitch-shifted audio (..., time)
    """

def speed(waveform: torch.Tensor, orig_freq: int, factor: float,
          lengths: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
    """
    Adjust playback speed by resampling. Because the signal is resampled,
    pitch changes together with speed.

    Args:
        waveform: Input audio (..., time)
        orig_freq: Original sample rate
        factor: Speed factor (>1.0 = faster, <1.0 = slower)
        lengths: Valid length of each signal in the batch (...,), or None

    Returns:
        Tensor: Speed-adjusted audio (..., new_time)
        Tensor or None: Valid lengths of the speed-adjusted signals if lengths was given, otherwise None
    """

def phase_vocoder(complex_specgrams: torch.Tensor, rate: float,
                  phase_advance: torch.Tensor) -> torch.Tensor:
    """
    Time-stretch a complex STFT by rate without changing its pitch.

    Args:
        complex_specgrams: Complex STFT (..., freq, num_frames)
        rate: Speed-up factor (>1.0 = faster/shorter, <1.0 = slower/longer)
        phase_advance: Expected phase advance of each frequency bin per hop (freq, 1)

    Returns:
        Tensor: Time-stretched complex spectrogram (..., freq, ceil(num_frames / rate))
    """
```
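
The numeric relationships behind these functions can be written down directly. The sketch below assumes a one-sided STFT with `n_fft // 2 + 1` bins: a stationary sinusoid centred on bin k advances by 2·pi·k·hop_length/n_fft radians per hop (the quantity `phase_advance` holds), a shift of n_steps steps scales frequency by 2^(n_steps / bins_per_octave), and stretching by `rate` yields ceil(num_frames / rate) frames. The helper names are hypothetical; to pass the advances to `phase_vocoder` they would need to become a tensor of shape (freq, 1).

```python
import math
from typing import List


def expected_phase_advance(n_fft: int, hop_length: int) -> List[float]:
    """Phase (radians) by which a stationary sinusoid centred on each
    one-sided STFT bin k advances per hop: 2*pi*k*hop_length/n_fft."""
    n_freq = n_fft // 2 + 1
    return [2 * math.pi * k * hop_length / n_fft for k in range(n_freq)]


def pitch_ratio(n_steps: float, bins_per_octave: int = 12) -> float:
    """Frequency ratio for a shift of n_steps, each step 1/bins_per_octave octave."""
    return 2.0 ** (n_steps / bins_per_octave)


def stretched_frames(num_frames: int, rate: float) -> int:
    """Frame count after time-stretching by rate (rate > 1 shortens)."""
    return math.ceil(num_frames / rate)


advance = expected_phase_advance(n_fft=512, hop_length=128)
print(len(advance))                        # 257 bins
print(pitch_ratio(12), pitch_ratio(-12))   # 2.0 0.5
print(stretched_frames(100, 1.25))         # 80
```

A shift of +12 steps with the default `bins_per_octave` doubles every frequency (one octave up), whereas `speed` with `factor=2.0` would also halve the duration.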

### Audio Analysis

Functions for analyzing audio characteristics and extracting features.

```python { .api }
def spectral_centroid(waveform: torch.Tensor, sample_rate: int, pad: int,
                      window: torch.Tensor, n_fft: int, hop_length: int,
                      win_length: int) -> torch.Tensor:
    """
    Compute the spectral centroid (magnitude-weighted mean frequency) of each frame.

    Args:
        waveform: Input audio (..., time)
        sample_rate: Sample rate
        pad, window, n_fft, hop_length, win_length: Same as in spectrogram

    Returns:
        Tensor: Spectral centroid in Hz over time (..., time)
    """

def detect_pitch_frequency(waveform: torch.Tensor, sample_rate: int, frame_time: float = 10**(-2),
                           win_length: int = 30, freq_low: int = 85, freq_high: int = 3400) -> torch.Tensor:
    """
    Detect pitch frequency using the normalized cross-correlation function (NCCF)
    followed by median smoothing.

    Args:
        waveform: Input audio (..., time)
        sample_rate: Sample rate
        frame_time: Frame duration in seconds
        win_length: Length, in frames, of the median filter window
        freq_low: Lowest detectable frequency in Hz
        freq_high: Highest detectable frequency in Hz

    Returns:
        Tensor: Detected pitch frequency in Hz per frame (..., frame)
    """

def loudness(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """
    Compute loudness according to the ITU-R BS.1770-4 standard.

    Args:
        waveform: Input audio (..., channel, time)
        sample_rate: Sample rate

    Returns:
        Tensor: Loudness in LUFS (Loudness Units relative to Full Scale)
    """
```
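
The spectral centroid of a frame is the magnitude-weighted mean of the bin frequencies, sum_k f_k·|X_k| / sum_k |X_k|, with f_k = k·sample_rate/n_fft. The sketch below computes it for a single frame in pure Python, using a naive DFT and a rectangular window; torchaudio instead applies the `window` you pass and works on whole spectrograms. The helper names `magnitude_spectrum` and `centroid` are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """|X[k]| for one-sided DFT bins k = 0..N//2 (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def centroid(mags: Sequence[float], sample_rate: int, n_fft: int) -> float:
    """Magnitude-weighted mean of the bin centre frequencies (Hz).
    Raises ZeroDivisionError for an all-zero (silent) frame."""
    freqs = [k * sample_rate / n_fft for k in range(len(mags))]
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)


sample_rate, n_fft = 8000, 64
# A cosine sitting exactly on bin 8, i.e. 8 * 8000 / 64 = 1000 Hz
frame = [math.cos(2 * math.pi * 1000 * t / sample_rate) for t in range(n_fft)]
print(round(centroid(magnitude_spectrum(frame), sample_rate, n_fft), 3))  # 1000.0
```

For a pure tone the centroid equals the tone frequency; broadband or noisy frames pull it upward, which is why it is often read as a "brightness" measure.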

### Convolution Operations

Convolution-based processing for impulse response application and acoustic modeling.

```python { .api }
def convolve(x: torch.Tensor, y: torch.Tensor, mode: str = "full") -> torch.Tensor:
    """
    Convolve two tensors along their last dimension (direct method).

    Args:
        x: First input tensor (..., N)
        y: Second input tensor (..., M); leading dimensions must be broadcastable with those of x
        mode: Convolution mode: "full" returns N + M - 1 samples, "valid" returns
            max(N, M) - min(N, M) + 1 samples, and "same" returns N samples

    Returns:
        Tensor: Convolved signal
    """

def fftconvolve(x: torch.Tensor, y: torch.Tensor, mode: str = "full") -> torch.Tensor:
    """
    Convolve two tensors along their last dimension using FFT, which is more
    efficient than convolve for long signals.

    Args:
        x: First input tensor (..., N)
        y: Second input tensor (..., M); leading dimensions must be broadcastable with those of x
        mode: Convolution mode ("full", "valid", "same"), as in convolve

    Returns:
        Tensor: Convolved signal
    """
```
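
The three `mode` options differ only in how much of the full convolution they keep. This pure-Python sketch computes the full direct convolution and then crops a centered slice; the cropping rule is our reading of the mode semantics rather than a copy of torchaudio's code, and `convolve_1d` is a hypothetical name. `fftconvolve` should agree with direct convolution up to floating-point error; it pays off for long kernels such as room impulse responses because its cost grows as O((N + M) log(N + M)) rather than O(N·M).

```python
from typing import List, Sequence


def convolve_1d(x: Sequence[float], y: Sequence[float], mode: str = "full") -> List[float]:
    """Direct 1D convolution with "full", "valid" or "same" output."""
    n, m = len(x), len(y)
    full = [0.0] * (n + m - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            full[i + j] += xi * yj
    if mode == "full":
        return full
    if mode == "valid":
        target = max(n, m) - min(n, m) + 1  # positions where the signals fully overlap
    elif mode == "same":
        target = n                          # same length as the first input
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    start = (len(full) - target) // 2       # keep the centered slice
    return full[start:start + target]


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 1.0]
print(convolve_1d(x, y, "full"))   # [1.0, 3.0, 6.0, 9.0, 7.0, 4.0]
print(convolve_1d(x, y, "valid"))  # [6.0, 9.0]
print(convolve_1d(x, y, "same"))   # [3.0, 6.0, 9.0, 7.0]
```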

### Mu-Law Encoding/Decoding

Logarithmic quantization commonly used in telecommunications.

```python { .api }
def mu_law_encoding(x: torch.Tensor, quantization_channels: int) -> torch.Tensor:
    """
    Encode a waveform using mu-law companding and quantization.

    Args:
        x: Input waveform (..., time), expected in the range [-1, 1]
        quantization_channels: Number of quantization levels (e.g., 256)

    Returns:
        Tensor: Mu-law encoded signal, integers in [0, quantization_channels - 1]
    """

def mu_law_decoding(x_mu: torch.Tensor, quantization_channels: int) -> torch.Tensor:
    """
    Decode a mu-law encoded signal back to a waveform.

    Args:
        x_mu: Mu-law encoded signal (..., time), integers in [0, quantization_channels - 1]
        quantization_channels: Number of quantization levels used for encoding

    Returns:
        Tensor: Decoded waveform in the range [-1, 1]
    """
```
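
Mu-law companding maps x in [-1, 1] through F(x) = sign(x)·ln(1 + mu·|x|) / ln(1 + mu) with mu = quantization_channels - 1, then quantizes F(x) uniformly; decoding inverts both steps. The scalar pure-Python sketch below follows that recipe. The rounding convention and the helper names `mu_law_encode` and `mu_law_decode` are assumptions for illustration, not torchaudio's code.

```python
import math


def mu_law_encode(x: float, quantization_channels: int = 256) -> int:
    """Compress x in [-1, 1] logarithmically, then quantize to 0..mu."""
    mu = quantization_channels - 1
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)  # y in [-1, 1]
    return int((y + 1) / 2 * mu + 0.5)  # round to the nearest level


def mu_law_decode(q: int, quantization_channels: int = 256) -> float:
    """Map a level back to [-1, 1] and undo the logarithmic compression."""
    mu = quantization_channels - 1
    y = q / mu * 2 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


for v in (-1.0, -0.25, 0.0, 0.25, 1.0):
    q = mu_law_encode(v)
    print(v, q, round(mu_law_decode(q), 4))
```

Because the compression is logarithmic, the reconstruction error is smallest near zero and grows toward full scale, which suits speech-like signals dominated by low amplitudes.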

### Feature Processing

Functions for processing extracted audio features.

```python { .api }
def compute_deltas(specgram: torch.Tensor, win_length: int = 5, mode: str = "replicate") -> torch.Tensor:
    """
    Compute delta coefficients (a smoothed first-order time derivative) of a spectrogram.

    Args:
        specgram: Input spectrogram (..., freq, time)
        win_length: Window length for delta computation (at least 3)
        mode: Padding mode applied at the time edges

    Returns:
        Tensor: Delta features with the same shape as the input
    """

def create_dct(n_mfcc: int, n_mels: int, norm: Optional[str]) -> torch.Tensor:
    """
    Create a DCT-II transformation matrix for MFCC computation.

    Args:
        n_mfcc: Number of MFCC coefficients to keep
        n_mels: Number of mel filterbanks
        norm: Normalization mode ("ortho" or None)

    Returns:
        Tensor: DCT matrix of shape (n_mels, n_mfcc), to be right-multiplied with row-wise mel features
    """

def sliding_window_cmn(specgram: torch.Tensor, cmn_window: int = 600, min_cmn_window: int = 100,
                       center: bool = False, norm_vars: bool = False) -> torch.Tensor:
    """
    Apply sliding-window cepstral mean (and optionally variance) normalization.

    Args:
        specgram: Input spectrogram (..., time, freq)
        cmn_window: Window size, in frames, for the running-average CMN computation
        min_cmn_window: Minimum window size used at the start of the signal; only applies when center is False
        center: If True, use a window centered on the current frame; otherwise the window lies to its left
        norm_vars: If True, also normalize the variance to one

    Returns:
        Tensor: Normalized spectrogram with the same shape as the input
    """
```
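
Delta coefficients are commonly computed with the regression formula d_t = sum_{n=1}^{N} n·(c_{t+n} - c_{t-n}) / (2·sum_{n=1}^{N} n^2), where N = (win_length - 1) // 2 and frames beyond the edges repeat the first or last value (replicate padding). The sketch below applies that formula to a single coefficient track; `deltas` is a hypothetical helper, and to our understanding `compute_deltas` applies the same idea along the time axis of every frequency row.

```python
from typing import List, Sequence


def deltas(c: Sequence[float], win_length: int = 5) -> List[float]:
    """Regression-based delta of one coefficient track with replicate padding."""
    if win_length < 3:
        raise ValueError("win_length must be at least 3")
    n_half = (win_length - 1) // 2
    denom = 2 * sum(n * n for n in range(1, n_half + 1))
    last = len(c) - 1

    def at(i: int) -> float:
        # Replicate padding: indices outside [0, last] are clamped to the edge
        return c[min(max(i, 0), last)]

    return [
        sum(n * (at(t + n) - at(t - n)) for n in range(1, n_half + 1)) / denom
        for t in range(len(c))
    ]


ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(deltas(ramp))  # [0.5, 0.8, 1.0, 1.0, 0.8, 0.5]
```

On a linear ramp the interior deltas equal the slope (1.0), while replicate padding pulls the values at the edges down.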

These sections cover TorchAudio's functional API: stateless functions for audio processing, from basic spectral analysis to effects and feature extraction.