# TorchAudio

An audio processing library for PyTorch that provides audio I/O, GPU-capable signal processing transforms, and machine learning utilities designed for audio data. TorchAudio loads and saves common audio formats, offers dataset classes for widely used audio corpora, implements standard audio transforms, and provides pre-trained models for speech recognition, speech synthesis, and source separation.

## Package Information

- **Package Name**: torchaudio
- **Language**: Python
- **Installation**: `pip install torchaudio`
- **PyTorch Integration**: Requires PyTorch to be installed

## Core Imports

```python
import torchaudio
```

Common import patterns:

```python
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T
from torchaudio.models import Wav2Vec2Model
from torchaudio.datasets import LIBRISPEECH
```

## Basic Usage

```python
import torchaudio
import torch

# Load an audio file; waveform has shape (channels, frames)
waveform, sample_rate = torchaudio.load("audio.wav")
print(f"Shape: {waveform.shape}, Sample rate: {sample_rate}")

# Apply spectrogram transform
spectrogram_transform = torchaudio.transforms.Spectrogram(
    n_fft=1024,
    hop_length=512
)
spectrogram = spectrogram_transform(waveform)

# Apply mel spectrogram
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_mels=80
)
mel_spectrogram = mel_transform(waveform)

# Save the waveform (unchanged in this example) back to disk
torchaudio.save("processed_audio.wav", waveform, sample_rate)
```
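The shape of the spectrogram above follows from `n_fft` and `hop_length`. As a rough sketch in plain Python (not a TorchAudio call), assuming the transform's default centered, one-sided STFT, the number of frequency bins is `n_fft // 2 + 1` and the number of frames is `num_samples // hop_length + 1`. The helper name `spectrogram_shape` is hypothetical.

```python
def spectrogram_shape(num_channels: int, num_samples: int,
                      n_fft: int, hop_length: int) -> tuple:
    """Expected (channels, freq_bins, frames) for a centered, one-sided STFT."""
    freq_bins = n_fft // 2 + 1              # one-sided spectrum keeps non-negative frequencies
    frames = num_samples // hop_length + 1  # centered framing pads both ends of the signal
    return (num_channels, freq_bins, frames)


# One second of mono audio at 16 kHz with the settings used above
shape = spectrogram_shape(num_channels=1, num_samples=16000, n_fft=1024, hop_length=512)
print(shape)
```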

## Architecture

TorchAudio is built around several key architectural components:

- **I/O Operations**: Unified interface for loading/saving audio across multiple backends (FFmpeg, SoX, SoundFile)
- **Functional Processing**: Stateless functions for signal processing and audio transformations
- **Transform Objects**: Stateful, PyTorch-compatible transforms that integrate with neural networks
- **Pre-trained Models**: Ready-to-use models for speech recognition, synthesis, and source separation
- **Dataset Loaders**: Standard interfaces for common audio datasets with consistent preprocessing
- **Pipeline Bundles**: Pre-configured model pipelines with preprocessing and post-processing

Because the functional and transform layers operate on `torch.Tensor` objects, they integrate with PyTorch's autograd system, which makes end-to-end differentiable audio processing pipelines possible.

## Capabilities

### Audio I/O Operations

Core functionality for loading, saving, and managing audio files with support for multiple backends and formats. Includes metadata extraction and backend management.

```python { .api }
def load(filepath: str, frame_offset: int = 0, num_frames: int = -1,
         normalize: bool = True, channels_first: bool = True,
         format: Optional[str] = None) -> Tuple[torch.Tensor, int]: ...

def save(filepath: str, src: torch.Tensor, sample_rate: int,
         channels_first: bool = True, compression: Optional[float] = None) -> None: ...

def info(filepath: str, format: Optional[str] = None) -> AudioMetaData: ...

class AudioMetaData:
    sample_rate: int
    num_frames: int
    num_channels: int
    bits_per_sample: int
    encoding: str
```
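To make the `AudioMetaData` fields and the `frame_offset` / `num_frames` arguments concrete without depending on an audio backend, the sketch below uses Python's standard `wave` module rather than TorchAudio itself. It writes one second of 16-bit mono silence, reads back the same kind of metadata that `info` reports, and converts a time window into frame counts. The file name and the helper `window_to_frames` are illustrative assumptions.

```python
import os
import tempfile
import wave


def window_to_frames(start_sec: float, duration_sec: float, sample_rate: int) -> tuple:
    """Convert a time window to (frame_offset, num_frames) at the given sample rate."""
    return int(start_sec * sample_rate), int(duration_sec * sample_rate)


sample_rate = 16000
path = os.path.join(tempfile.mkdtemp(), "silence.wav")

# Write one second of 16-bit mono silence (2 bytes per frame)
with wave.open(path, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)
    wav_out.setframerate(sample_rate)
    wav_out.writeframes(b"\x00\x00" * sample_rate)

# Read back the fields that correspond to AudioMetaData
with wave.open(path, "rb") as wav_in:
    metadata = {
        "sample_rate": wav_in.getframerate(),
        "num_frames": wav_in.getnframes(),
        "num_channels": wav_in.getnchannels(),
        "bits_per_sample": wav_in.getsampwidth() * 8,
    }

# Frames covering 0.25 s to 0.75 s of the file
frame_offset, num_frames = window_to_frames(0.25, 0.5, metadata["sample_rate"])
print(metadata, frame_offset, num_frames)
```

With TorchAudio, the same window would be requested as `torchaudio.load(path, frame_offset=frame_offset, num_frames=num_frames)`.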

[Audio I/O Operations](./audio-io.md)

### Signal Processing Functions

Stateless audio processing functions covering spectral analysis, filtering, resampling, and pitch manipulation.

```python { .api }
def spectrogram(waveform: torch.Tensor, pad: int, window: torch.Tensor,
                n_fft: int, hop_length: int, win_length: int,
                power: Optional[float], normalized: bool) -> torch.Tensor: ...

def melscale_fbanks(n_freqs: int, f_min: float, f_max: float, n_mels: int,
                    sample_rate: int, norm: Optional[str] = None,
                    mel_scale: str = "htk") -> torch.Tensor: ...

def resample(waveform: torch.Tensor, orig_freq: int, new_freq: int,
             resampling_method: str = "sinc_interp_hann") -> torch.Tensor: ...
```
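The `mel_scale="htk"` default in `melscale_fbanks` refers to the HTK mel formula, m = 2595 * log10(1 + f / 700). The sketch below, in plain Python with hypothetical helper names, shows the conversion and how `n_mels` triangular filters get their center frequencies: equally spaced in mel, then mapped back to Hz. It illustrates the scale only and is not the library's implementation.

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """HTK mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(f_min: float, f_max: float, n_mels: int) -> list:
    """Center frequencies (Hz) of n_mels filters equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    # n_mels filters need n_mels + 2 edge points; the interior points are the centers
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz_htk(m_min + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies(f_min=0.0, f_max=8000.0, n_mels=4)
print([round(c, 1) for c in centers])
```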

[Signal Processing Functions](./functional.md)

### Audio Transforms

PyTorch-compatible transform classes for building differentiable audio processing pipelines. Includes spectral transforms, data augmentation, and preprocessing transforms.

```python { .api }
class Spectrogram(torch.nn.Module):
    def __init__(self, n_fft: int = 400, win_length: Optional[int] = None,
                 hop_length: Optional[int] = None, pad: int = 0,
                 window_fn: Callable = torch.hann_window, power: Optional[float] = 2.0): ...
    def forward(self, waveform: torch.Tensor) -> torch.Tensor: ...

class MelSpectrogram(torch.nn.Module):
    def __init__(self, sample_rate: int = 16000, n_fft: int = 400,
                 win_length: Optional[int] = None, hop_length: Optional[int] = None,
                 f_min: float = 0., f_max: Optional[float] = None, n_mels: int = 128): ...
    def forward(self, waveform: torch.Tensor) -> torch.Tensor: ...

class MFCC(torch.nn.Module):
    def __init__(self, sample_rate: int = 16000, n_mfcc: int = 40,
                 dct_type: int = 2, norm: str = "ortho", log_mels: bool = False): ...
    def forward(self, waveform: torch.Tensor) -> torch.Tensor: ...
```

[Audio Transforms](./transforms.md)

### Pre-trained Models

Neural network model classes for speech recognition, synthesis, and source separation. Includes Wav2Vec2, HuBERT, Tacotron2, WaveRNN, and source separation models. Pre-trained weights for these architectures are distributed through the pipeline bundles.

```python { .api }
class Wav2Vec2Model(torch.nn.Module):
    def __init__(self, feature_extractor: torch.nn.Module, encoder: torch.nn.Module,
                 aux: Optional[torch.nn.Module] = None): ...
    def forward(self, waveforms: torch.Tensor,
                lengths: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, Optional[torch.Tensor]]: ...

def wav2vec2_base(aux_num_out: Optional[int] = None) -> Wav2Vec2Model: ...
def wav2vec2_large(aux_num_out: Optional[int] = None) -> Wav2Vec2Model: ...

class Tacotron2(torch.nn.Module):
    def __init__(self, mask_padding: bool = False, n_mels: int = 80): ...
    # Training-time forward pass (teacher forcing with target mel spectrograms)
    def forward(self, tokens: torch.Tensor, token_lengths: torch.Tensor,
                mel_specgram: torch.Tensor, mel_specgram_lengths: torch.Tensor
                ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: ...
    # Inference from tokens only
    def infer(self, tokens: torch.Tensor,
              lengths: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: ...

class ConvTasNet(torch.nn.Module):
    def __init__(self, num_sources: int = 2, enc_kernel_size: int = 16): ...
    def forward(self, waveforms: torch.Tensor) -> torch.Tensor: ...
```

[Pre-trained Models](./models.md)

### Model Pipelines

Pre-configured model bundles with preprocessing, inference, and post-processing for production-ready audio applications. Includes ASR, TTS, and source separation pipelines.

```python { .api }
class Wav2Vec2Bundle:
    def get_model(self) -> Wav2Vec2Model: ...
    sample_rate: int

class Wav2Vec2ASRBundle(Wav2Vec2Bundle):
    def get_model(self) -> Wav2Vec2Model: ...
    def get_labels(self) -> Tuple[str, ...]: ...

# Pre-trained bundle instances
WAV2VEC2_ASR_BASE_960H: Wav2Vec2ASRBundle
HUBERT_ASR_LARGE: Wav2Vec2ASRBundle
TACOTRON2_WAVERNN_CHAR_LJSPEECH: Tacotron2TTSBundle
```
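An ASR bundle's model emits per-frame scores over the label set returned by `get_labels()`, and turning those scores into text requires a CTC decoding step. A minimal greedy decoder in plain Python is sketched below. The label tuple and the frame scores are made up for illustration, with `-` as the blank token and `|` as the word separator; a real bundle's labels should be taken from the bundle itself.

```python
from typing import List, Sequence


def greedy_ctc_decode(emission: Sequence[Sequence[float]], labels: Sequence[str],
                      blank: str = "-", word_sep: str = "|") -> str:
    """Pick the best label per frame, collapse repeats, then drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in emission]
    collapsed: List[int] = [idx for i, idx in enumerate(best) if i == 0 or idx != best[i - 1]]
    chars = [labels[idx] for idx in collapsed if labels[idx] != blank]
    return "".join(chars).replace(word_sep, " ").strip()


# Hypothetical label set and per-frame scores (one row per frame)
labels = ("-", "|", "H", "I")
emission = [
    [0.1, 0.0, 0.8, 0.1],  # H
    [0.1, 0.0, 0.8, 0.1],  # H again (repeat, collapsed)
    [0.9, 0.0, 0.0, 0.1],  # blank
    [0.1, 0.0, 0.1, 0.8],  # I
    [0.1, 0.7, 0.1, 0.1],  # word separator
    [0.1, 0.0, 0.8, 0.1],  # H
    [0.1, 0.0, 0.1, 0.8],  # I
]
transcript = greedy_ctc_decode(emission, labels)
print(transcript)
```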

[Model Pipelines](./pipelines.md)

### Audio Datasets

Standard dataset loaders for common audio datasets with consistent interfaces and preprocessing. Supports speech recognition, synthesis, music analysis, and source separation datasets.

```python { .api }
class LIBRISPEECH(torch.utils.data.Dataset):
    def __init__(self, root: str, url: str = "train-clean-100",
                 folder_in_archive: str = "LibriSpeech", download: bool = False): ...
    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int, str, int, int, int]: ...

class SPEECHCOMMANDS(torch.utils.data.Dataset):
    def __init__(self, root: str, url: str = "speech_commands_v0.02",
                 folder_in_archive: str = "SpeechCommands", download: bool = False): ...
    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int, str, str, int]: ...

class LJSPEECH(torch.utils.data.Dataset):
    def __init__(self, root: str, url: str = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2",
                 folder_in_archive: str = "wavs", download: bool = False): ...
    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int, str, str]: ...
```
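Each dataset returns a plain tuple, so the position of every field matters. One way to keep downstream code readable is to wrap items in a `NamedTuple`. The sketch below does this for the `LIBRISPEECH` item layout; the field names are assumptions chosen to match the tuple order (waveform, sample rate, transcript, speaker ID, chapter ID, utterance ID), and a list of floats stands in for the waveform tensor so the example runs without downloading data.

```python
from typing import List, NamedTuple


class LibriSpeechItem(NamedTuple):
    waveform: List[float]  # a torch.Tensor of shape (channels, frames) in the real dataset
    sample_rate: int
    transcript: str
    speaker_id: int
    chapter_id: int
    utterance_id: int


def duration_seconds(item: LibriSpeechItem) -> float:
    """Clip length in seconds for the list-based stand-in waveform."""
    return len(item.waveform) / item.sample_rate


# Fabricated item with the same field order as dataset[n]
raw_item = ([0.0] * 32000, 16000, "HELLO WORLD", 1, 2, 3)
item = LibriSpeechItem(*raw_item)
print(item.transcript, duration_seconds(item))
```

With a real tensor, the frame count would come from `item.waveform.shape[-1]` instead of `len(item.waveform)`.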

[Audio Datasets](./datasets.md)

### Streaming I/O

Streaming capabilities for real-time audio processing, media encoding/decoding, and chunk-by-chunk handling of large audio files.

```python { .api }
# Both classes live in torchaudio.io
class StreamReader:
    def __init__(self, src: str, format: Optional[str] = None, option: Optional[Dict[str, str]] = None): ...
    def add_basic_audio_stream(self, frames_per_chunk: int, buffer_chunk_size: int = 3,
                               stream_index: Optional[int] = None, decoder: Optional[str] = None) -> int: ...
    def process_packet(self, timeout: Optional[float] = None, backoff: float = 10.) -> int: ...

class StreamWriter:
    def __init__(self, dst: str, format: Optional[str] = None, buffer_size: int = 4096): ...
    def add_audio_stream(self, sample_rate: int, num_channels: int, format: str = "flt",
                         encoder: Optional[str] = None, codec_config: Optional[CodecConfig] = None) -> int: ...
    def write_audio_chunk(self, stream_index: int, chunk: torch.Tensor, pts: Optional[float] = None) -> None: ...
```
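`frames_per_chunk` controls how many frames each decoded chunk holds. The generator below is a plain-Python sketch of that chunking behaviour, not the `StreamReader` implementation: it yields fixed-size slices of a sequence and, by assumption, a shorter final chunk when the length is not a multiple of the chunk size. The name `iter_chunks` is hypothetical.

```python
from typing import Iterator, List, Sequence


def iter_chunks(samples: Sequence[float], frames_per_chunk: int) -> Iterator[List[float]]:
    """Yield consecutive slices of at most frames_per_chunk samples."""
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be positive")
    for start in range(0, len(samples), frames_per_chunk):
        yield list(samples[start:start + frames_per_chunk])


samples = [float(i) for i in range(10)]
chunk_lengths = [len(chunk) for chunk in iter_chunks(samples, frames_per_chunk=4)]
print(chunk_lengths)
```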

[Streaming I/O](./streaming.md)

### Audio Effects and Filtering

Audio effects including biquad filters, equalization, and modulation effects such as flanger and phaser.

```python { .api }
def biquad(waveform: torch.Tensor, b0: float, b1: float, b2: float,
           a0: float, a1: float, a2: float) -> torch.Tensor: ...
def lowpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float, Q: float = 0.707) -> torch.Tensor: ...
def highpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float, Q: float = 0.707) -> torch.Tensor: ...
def flanger(waveform: torch.Tensor, sample_rate: int, delay: float = 0.0,
            depth: float = 2.0, regen: float = 0.0, width: float = 71.0) -> torch.Tensor: ...
def phaser(waveform: torch.Tensor, sample_rate: int, gain_in: float = 0.4,
           gain_out: float = 0.74, delay_ms: float = 3.0, decay: float = 0.4) -> torch.Tensor: ...
```
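`biquad` applies a second-order IIR filter defined by six coefficients, and `lowpass_biquad` derives those coefficients from a cutoff frequency and Q. The plain-Python sketch below uses the widely known Audio EQ Cookbook low-pass formulas and a direct-form I difference equation. It is meant to show the arithmetic, with hypothetical helper names, and is not a copy of the library code.

```python
import math
from typing import List, Sequence, Tuple


def lowpass_coefficients(sample_rate: int, cutoff_freq: float,
                         Q: float = 0.707) -> Tuple[float, float, float, float, float, float]:
    """Audio EQ Cookbook low-pass coefficients (b0, b1, b2, a0, a1, a2)."""
    w0 = 2.0 * math.pi * cutoff_freq / sample_rate
    alpha = math.sin(w0) / (2.0 * Q)
    cos_w0 = math.cos(w0)
    b0 = (1.0 - cos_w0) / 2.0
    b1 = 1.0 - cos_w0
    b2 = b0
    a0 = 1.0 + alpha
    a1 = -2.0 * cos_w0
    a2 = 1.0 - alpha
    return b0, b1, b2, a0, a1, a2


def biquad_filter(samples: Sequence[float], b0: float, b1: float, b2: float,
                  a0: float, a1: float, a2: float) -> List[float]:
    """y[n] = (b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]) / a0"""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = (b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out


coeffs = lowpass_coefficients(sample_rate=16000, cutoff_freq=1000.0)

# A constant (DC) input passes through; an alternating (Nyquist) input is suppressed
dc_out = biquad_filter([1.0] * 500, *coeffs)
nyquist_out = biquad_filter([(-1.0) ** n for n in range(500)], *coeffs)
print(round(dc_out[-1], 3), round(abs(nyquist_out[-1]), 3))
```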

[Audio Effects and Filtering](./effects.md)

### Utility Functions

Helper functions for audio file management, format conversion, backend configuration, and integration with other audio processing libraries.

```python { .api }
# Backend management (top-level torchaudio namespace)
def list_audio_backends() -> List[str]: ...
def get_audio_backend() -> Optional[str]: ...
def set_audio_backend(backend: Optional[str]) -> None: ...

# torchaudio.utils
def download_asset(key: str, hash: str = "", path: str = "", *, progress: bool = True) -> str: ...

# SoX utilities (torchaudio.sox_effects)
def init_sox_effects() -> None: ...
def shutdown_sox_effects() -> None: ...
def effect_names() -> List[str]: ...
```

[Utility Functions](./utils.md)