# TorchAudio

A comprehensive audio processing library for PyTorch that provides GPU-accelerated audio I/O operations, signal processing transforms, and machine learning utilities specifically designed for audio data. TorchAudio supports loading and saving various audio formats, offers dataloaders for common audio datasets, implements essential audio transforms, and provides pre-trained models for speech recognition, synthesis, and source separation.

## Package Information

- **Package Name**: torchaudio
- **Language**: Python
- **Installation**: `pip install torchaudio`
- **PyTorch Integration**: Requires PyTorch to be installed

## Core Imports

```python
import torchaudio
```

Common import patterns:

```python
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T
from torchaudio.models import Wav2Vec2Model
from torchaudio.datasets import LIBRISPEECH
```

## Basic Usage

```python
import torchaudio
import torch

# Load audio file
waveform, sample_rate = torchaudio.load("audio.wav")
print(f"Shape: {waveform.shape}, Sample rate: {sample_rate}")

# Apply spectrogram transform
spectrogram_transform = torchaudio.transforms.Spectrogram(
    n_fft=1024,
    hop_length=512
)
spectrogram = spectrogram_transform(waveform)

# Apply mel spectrogram
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_mels=80
)
mel_spectrogram = mel_transform(waveform)

# Save processed audio
torchaudio.save("processed_audio.wav", waveform, sample_rate)
```

## Architecture

TorchAudio is built around several key architectural components:

- **I/O Operations**: Unified interface for loading/saving audio across multiple backends (FFmpeg, SoX, SoundFile)
- **Functional Processing**: Stateless functions for signal processing and audio transformations
- **Transform Objects**: Stateful, PyTorch-compatible transforms that integrate with neural networks
- **Pre-trained Models**: Ready-to-use models for speech recognition, synthesis, and source separation
- **Dataset Loaders**: Standard interfaces for common audio datasets with consistent preprocessing
- **Pipeline Bundles**: Pre-configured model pipelines with preprocessing and post-processing

This design ensures seamless integration with PyTorch's autograd system and enables end-to-end differentiable audio processing pipelines.

## Capabilities

### Audio I/O Operations

Core functionality for loading, saving, and managing audio files with support for multiple backends and formats. Includes metadata extraction and backend management.

```python { .api }
def load(filepath: str, frame_offset: int = 0, num_frames: int = -1,
         normalize: bool = True, channels_first: bool = True,
         format: Optional[str] = None) -> Tuple[torch.Tensor, int]: ...

def save(filepath: str, src: torch.Tensor, sample_rate: int,
         channels_first: bool = True, compression: Optional[float] = None) -> None: ...

def info(filepath: str, format: Optional[str] = None) -> AudioMetaData: ...

class AudioMetaData:
    sample_rate: int
    num_frames: int
    num_channels: int
    bits_per_sample: int
    encoding: str
```
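
`frame_offset` and `num_frames` are counted in sample frames, not seconds. The helper below is a hypothetical convenience (the name `frame_window` is not part of torchaudio) that converts a time window to those two arguments, assuming the sample rate has already been read, for example from `info(...)`.

```python
from typing import Tuple


def frame_window(start_s: float, duration_s: float, sample_rate: int) -> Tuple[int, int]:
    """Convert a window given in seconds to (frame_offset, num_frames)."""
    frame_offset = int(round(start_s * sample_rate))
    num_frames = int(round(duration_s * sample_rate))
    return frame_offset, num_frames


# 2 seconds of audio starting at 1.5 s, for a 16 kHz file.
frame_offset, num_frames = frame_window(1.5, 2.0, 16000)
print(frame_offset, num_frames)  # 24000 32000

# Intended use (not executed here, needs a real file and an I/O backend):
# waveform, sr = torchaudio.load("audio.wav", frame_offset=frame_offset, num_frames=num_frames)
```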

[Audio I/O Operations](./audio-io.md)

### Signal Processing Functions

Extensive collection of stateless audio processing functions including spectral analysis, filtering, resampling, pitch manipulation, and advanced signal processing algorithms.

```python { .api }
def spectrogram(waveform: torch.Tensor, pad: int = 0, window: torch.Tensor = None,
                n_fft: int = 400, hop_length: Optional[int] = None,
                win_length: Optional[int] = None, power: Optional[float] = 2.0,
                normalized: bool = False) -> torch.Tensor: ...

def melscale_fbanks(n_freqs: int, f_min: float, f_max: float, n_mels: int,
                    sample_rate: int, norm: Optional[str] = None,
                    mel_scale: str = "htk") -> torch.Tensor: ...

def resample(waveform: torch.Tensor, orig_freq: int, new_freq: int,
             resampling_method: str = "sinc_interp_hann") -> torch.Tensor: ...
```

[Signal Processing Functions](./functional.md)

### Audio Transforms

PyTorch-compatible transform classes for building differentiable audio processing pipelines. Includes spectral transforms, data augmentation, and preprocessing transforms.

```python { .api }
class Spectrogram(torch.nn.Module):
    def __init__(self, n_fft: int = 400, win_length: Optional[int] = None,
                 hop_length: Optional[int] = None, pad: int = 0,
                 window_fn: Callable = torch.hann_window, power: Optional[float] = 2.0): ...
    def forward(self, waveform: torch.Tensor) -> torch.Tensor: ...

class MelSpectrogram(torch.nn.Module):
    def __init__(self, sample_rate: int = 16000, n_fft: int = 400,
                 win_length: Optional[int] = None, hop_length: Optional[int] = None,
                 f_min: float = 0., f_max: Optional[float] = None, n_mels: int = 128): ...
    def forward(self, waveform: torch.Tensor) -> torch.Tensor: ...

class MFCC(torch.nn.Module):
    def __init__(self, sample_rate: int = 16000, n_mfcc: int = 40,
                 dct_type: int = 2, norm: str = "ortho", log_mels: bool = False): ...
    def forward(self, waveform: torch.Tensor) -> torch.Tensor: ...
```

[Audio Transforms](./transforms.md)

### Pre-trained Models

Ready-to-use neural network models for speech recognition, synthesis, and source separation. Includes Wav2Vec2, HuBERT, Tacotron2, WaveRNN, and source separation models.

```python { .api }
class Wav2Vec2Model(torch.nn.Module):
    def __init__(self, feature_extractor: torch.nn.Module, encoder: torch.nn.Module): ...
    def forward(self, waveforms: torch.Tensor) -> torch.Tensor: ...

def wav2vec2_base(num_out: Optional[int] = None) -> Wav2Vec2Model: ...
def wav2vec2_large(num_out: Optional[int] = None) -> Wav2Vec2Model: ...

class Tacotron2(torch.nn.Module):
    def __init__(self, mask_padding: bool = False, n_mels: int = 80): ...
    def forward(self, tokens: torch.Tensor, token_lengths: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: ...

class ConvTasNet(torch.nn.Module):
    def __init__(self, num_sources: int = 2, enc_kernel_size: int = 16): ...
    def forward(self, waveforms: torch.Tensor) -> torch.Tensor: ...
```

[Pre-trained Models](./models.md)

### Model Pipelines

Pre-configured model bundles with preprocessing, inference, and post-processing for production-ready audio applications. Includes ASR, TTS, and source separation pipelines.

```python { .api }
class Wav2Vec2Bundle:
    def get_model(self) -> Wav2Vec2Model: ...
    def get_labels(self) -> List[str]: ...
    sample_rate: int

class Wav2Vec2ASRBundle(Wav2Vec2Bundle):
    def get_model(self) -> Wav2Vec2Model: ...
    def get_decoder(self) -> torch.nn.Module: ...

# Pre-trained bundle instances
WAV2VEC2_ASR_BASE_960H: Wav2Vec2ASRBundle
HUBERT_ASR_LARGE: Wav2Vec2ASRBundle
TACOTRON2_WAVERNN_CHAR_LJSPEECH: Tacotron2TTSBundle
```
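
An ASR bundle pairs a model with the label set its outputs index into. The sketch below is a stand-alone greedy CTC decoder written in plain Python to show how per-frame label indices become text; it is not the bundle's own decoder. The label tuple, the blank at index 0, and `|` as a word separator are assumptions made for this example.

```python
from typing import List, Sequence


def greedy_ctc_decode(indices: Sequence[int], labels: Sequence[str], blank: int = 0) -> str:
    """Collapse repeated indices, drop blanks, and map the rest to characters."""
    chars: List[str] = []
    previous = None
    for index in indices:
        if index != previous and index != blank:
            chars.append(labels[index])
        previous = index
    # Assumption: "|" marks a word boundary in the label set.
    return "".join(chars).replace("|", " ").strip()


# Hypothetical label set and per-frame argmax output.
labels = ("-", "|", "H", "E", "L", "O")
frame_indices = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 1]
print(greedy_ctc_decode(frame_indices, labels))  # HELLO
```

In practice the indices would come from taking the argmax over the label dimension of the model's per-frame output, with `labels` supplied by the bundle.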

[Model Pipelines](./pipelines.md)

### Audio Datasets

Standard dataset loaders for common audio datasets with consistent interfaces and preprocessing. Supports speech recognition, synthesis, music analysis, and source separation datasets.

```python { .api }
class LIBRISPEECH(torch.utils.data.Dataset):
    def __init__(self, root: str, url: str = "train-clean-100",
                 folder_in_archive: str = "LibriSpeech", download: bool = False): ...
    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int, str, int, int, int]: ...

class SPEECHCOMMANDS(torch.utils.data.Dataset):
    def __init__(self, root: str, url: str = "speech_commands_v0.02",
                 folder_in_archive: str = "SpeechCommands", download: bool = False): ...
    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int, str, str, int]: ...

class LJSPEECH(torch.utils.data.Dataset):
    def __init__(self, root: str, url: str = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2",
                 folder_in_archive: str = "LJSpeech-1.1", download: bool = False): ...
    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int, str, str]: ...
```
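
Dataset items hold waveforms of different lengths, so batching them through a `DataLoader` usually needs a custom `collate_fn`. The sketch below pads items shaped like the `LIBRISPEECH` tuple above; the function name `collate_librispeech` and the fake items are assumptions, used so the example runs without downloading the dataset.

```python
from typing import List, Tuple

import torch
from torch.nn.utils.rnn import pad_sequence

# Item layout assumed from the LIBRISPEECH signature above; the waveform is
# first, the sample rate second, and the transcript third.
Item = Tuple[torch.Tensor, int, str, int, int, int]


def collate_librispeech(batch: List[Item]):
    """Zero-pad mono waveforms to a common length and keep the true lengths."""
    waveforms = [item[0].squeeze(0) for item in batch]   # each (time,)
    lengths = torch.tensor([w.shape[0] for w in waveforms])
    padded = pad_sequence(waveforms, batch_first=True)   # (batch, max_time)
    transcripts = [item[2] for item in batch]
    return padded, lengths, transcripts


# Fake items standing in for dataset[i].
fake_batch = [
    (torch.randn(1, 16000), 16000, "HELLO", 1, 1, 0),
    (torch.randn(1, 12000), 16000, "WORLD", 1, 1, 1),
]
padded, lengths, transcripts = collate_librispeech(fake_batch)
print(padded.shape, lengths.tolist(), transcripts)
```

The function would be passed as `collate_fn=collate_librispeech` to `torch.utils.data.DataLoader`.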

[Audio Datasets](./datasets.md)

### Streaming I/O

Advanced streaming capabilities for real-time audio processing, media encoding/decoding, and efficient handling of large audio files.

```python { .api }
class StreamReader:
    def __init__(self, src: str, format: Optional[str] = None, option: Optional[Dict[str, str]] = None): ...
    def add_basic_audio_stream(self, frames_per_chunk: int, buffer_chunk_size: int = 3,
                               stream_index: Optional[int] = None, decoder: Optional[str] = None) -> int: ...
    def process_packet(self, timeout: Optional[float] = None, backoff: float = 10.) -> int: ...

class StreamWriter:
    def __init__(self, dst: str, format: Optional[str] = None, option: Optional[Dict[str, str]] = None): ...
    def add_audio_stream(self, sample_rate: int, num_channels: int, format: str = "fltp",
                         encoder: Optional[str] = None, codec_config: Optional[CodecConfig] = None) -> int: ...
    def write_audio_chunk(self, stream_index: int, chunk: torch.Tensor, pts: Optional[int] = None) -> None: ...
```

[Streaming I/O](./streaming.md)

### Audio Effects and Filtering

Comprehensive audio effects processing including filters, EQ, dynamic effects, and spatial audio processing capabilities.

```python { .api }
def biquad(waveform: torch.Tensor, b0: float, b1: float, b2: float,
           a0: float, a1: float, a2: float) -> torch.Tensor: ...
def lowpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float, Q: float = 0.707) -> torch.Tensor: ...
def highpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float, Q: float = 0.707) -> torch.Tensor: ...
def flanger(waveform: torch.Tensor, sample_rate: int, delay: float = 0.0,
            depth: float = 2.0, regen: float = 0.0, width: float = 71.0) -> torch.Tensor: ...
def phaser(waveform: torch.Tensor, sample_rate: int, gain_in: float = 0.4,
           gain_out: float = 0.74, delay_ms: float = 3.0, decay: float = 0.4) -> torch.Tensor: ...
```
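
The six coefficients of `biquad` describe a second-order recursive filter. The plain-Python reference below spells out the standard difference equation those coefficients correspond to, as a way to read the signature; it is a sketch for small lists of samples, not the library's implementation, and the name `biquad_reference` is hypothetical.

```python
from typing import List, Sequence


def biquad_reference(x: Sequence[float], b0: float, b1: float, b2: float,
                     a0: float, a1: float, a2: float) -> List[float]:
    """y[n] = (b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]) / a0"""
    y: List[float] = []
    x1 = x2 = y1 = y2 = 0.0  # filter state: previous two inputs and outputs
    for xn in x:
        yn = (b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


# A two-tap moving average (b0 = b1 = 0.5, no feedback) applied to an impulse.
impulse = [1.0, 0.0, 0.0, 0.0]
print(biquad_reference(impulse, 0.5, 0.5, 0.0, 1.0, 0.0, 0.0))  # [0.5, 0.5, 0.0, 0.0]
```

Helpers such as `lowpass_biquad` and `highpass_biquad` take a cutoff frequency and `Q` instead of raw coefficients, which is usually the more convenient entry point.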

[Audio Effects and Filtering](./effects.md)

### Utility Functions

Helper functions for audio file management, format conversion, backend configuration, and integration with other audio processing libraries.

```python { .api }
def list_audio_backends() -> List[str]: ...
def get_audio_backend() -> Optional[str]: ...
def set_audio_backend(backend: Optional[str]) -> None: ...

def download_asset(filename: str, subfolder: str = "") -> str: ...

# SoX utilities
def init_sox_effects() -> None: ...
def shutdown_sox_effects() -> None: ...
def effect_names() -> List[str]: ...
```

[Utility Functions](./utils.md)