
**tessl/pypi-datasets**

HuggingFace community-driven open-source library of datasets for machine learning with one-line dataloaders, efficient preprocessing, and multi-framework support.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: pkg:pypi/datasets@4.0.x

To install, run:

```
npx @tessl/cli install tessl/pypi-datasets@4.0.0
```

# HuggingFace Datasets

A comprehensive dataset management library for loading, processing, and working with machine learning datasets. It offers one-line dataloaders for thousands of public datasets from the HuggingFace Hub, efficient pre-processing backed by memory-mapped Apache Arrow storage so large datasets are not limited by available RAM, and built-in interoperability with major ML frameworks including NumPy, pandas, PyTorch, TensorFlow, and JAX.

## Package Information

- **Package Name**: datasets
- **Language**: Python
- **Installation**: `pip install datasets`

## Core Imports

```python
import datasets
```

Common patterns for loading and working with datasets:

```python
from datasets import load_dataset, Dataset, DatasetDict
from datasets import concatenate_datasets, interleave_datasets
```

## Basic Usage

```python
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset("squad", split="train")

# Access dataset features and data
print(dataset.features)
print(len(dataset))
print(dataset[0])

# Apply transformations
def preprocess(example):
    example["question_length"] = len(example["question"])
    return example

dataset = dataset.map(preprocess)

# Convert to different formats
torch_dataset = dataset.with_format("torch")
pandas_df = dataset.to_pandas()

# Save to disk
dataset.save_to_disk("./my_dataset")
```

## Architecture

The datasets library is built around these key components:

- **Dataset Classes**: `Dataset` for map-style access and `IterableDataset` for streaming large datasets
- **Loading System**: `load_dataset()` function with automatic discovery of dataset builders
- **Features System**: Comprehensive type definitions for structured data (text, audio, images, etc.)
- **Arrow Backend**: Memory-mapped storage using Apache Arrow for efficient data handling
- **Caching System**: Fingerprint-based caching for reproducible data processing
- **Hub Integration**: Direct access to thousands of datasets from the HuggingFace Hub

This design enables efficient processing of datasets ranging from small research datasets to massive production corpora, with seamless integration into popular ML frameworks and automatic optimization through caching and memory mapping.
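
For example, the same `load_dataset()` call can return a streaming `IterableDataset` that reads records lazily instead of materializing the full dataset first (a minimal sketch; "squad" is just an illustrative dataset name):

```python
from datasets import load_dataset

# Stream records instead of downloading and caching the whole dataset
streamed = load_dataset("squad", split="train", streaming=True)

# Rows are produced lazily as the iterator advances
for example in streamed.take(3):
    print(example["question"])
```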

## Capabilities

### Data Loading and Discovery

The primary interface for loading datasets from the HuggingFace Hub, local files, or custom data sources. Supports automatic format detection, streaming for large datasets, and flexible data splitting.

```python { .api }
def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[Dict] = None,
    trust_remote_code: bool = None,
    **config_kwargs,
) -> Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict]:
    """Load a dataset from the HuggingFace Hub, local files, or custom sources."""

def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[Dict] = None,
    trust_remote_code: bool = None,
    **config_kwargs,
) -> DatasetBuilder:
    """Load a dataset builder without building the dataset."""

def load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] = None) -> Union[Dataset, DatasetDict]:
    """Load a dataset that was previously saved using save_to_disk."""
```
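
As a quick illustration, a builder can be used to inspect a dataset's metadata before downloading any data (a minimal sketch; the dataset name is illustrative):

```python
from datasets import load_dataset_builder

# Fetch only the metadata, not the data files
builder = load_dataset_builder("squad")
print(builder.info.description)
print(builder.info.features)
print(builder.info.splits)
```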

[Data Loading](./data-loading.md)

### Core Dataset Classes

The fundamental dataset classes that provide different access patterns and capabilities for working with dataset collections.

```python { .api }
class Dataset:
    """Map-style dataset backed by Apache Arrow for efficient random access."""

    def __getitem__(self, key): ...
    def __len__(self) -> int: ...
    def map(self, function, **kwargs) -> "Dataset": ...
    def filter(self, function, **kwargs) -> "Dataset": ...
    def select(self, indices) -> "Dataset": ...
    def with_format(self, type: Optional[str] = None, **kwargs) -> "Dataset": ...
    def to_pandas(self) -> "pandas.DataFrame": ...
    def save_to_disk(self, dataset_path: str) -> None: ...

class DatasetDict(dict):
    """Dictionary of Dataset objects, typically for train/validation/test splits."""

    def map(self, function, **kwargs) -> "DatasetDict": ...
    def filter(self, function, **kwargs) -> "DatasetDict": ...
    def with_format(self, type: Optional[str] = None, **kwargs) -> "DatasetDict": ...
    def save_to_disk(self, dataset_dict_path: str) -> None: ...

class IterableDataset:
    """Iterable-style dataset for streaming large datasets without loading into memory."""

    def __iter__(self): ...
    def map(self, function, **kwargs) -> "IterableDataset": ...
    def filter(self, function, **kwargs) -> "IterableDataset": ...
    def take(self, n: int) -> "IterableDataset": ...
    def skip(self, n: int) -> "IterableDataset": ...
```
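
A short sketch of how these classes compose in practice, reusing the illustrative "squad" dataset from above:

```python
from datasets import load_dataset

dataset = load_dataset("squad", split="train")

# Keep only short questions, then take the first few matching rows
short = dataset.filter(lambda ex: len(ex["question"]) < 80)
subset = short.select(range(min(100, len(short))))
print(len(subset), subset[0]["question"])
```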

[Core Dataset Classes](./core-dataset-classes.md)

### Dataset Operations

Functions for combining, transforming, and manipulating datasets, including concatenation, interleaving, and caching control.

```python { .api }
def concatenate_datasets(
    dsets: List[Dataset],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> Dataset:
    """Concatenate multiple Dataset objects."""

def interleave_datasets(
    datasets: List[Union[Dataset, IterableDataset]],
    probabilities: Optional[List[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: str = "first_exhausted",
) -> Union[Dataset, IterableDataset]:
    """Interleave multiple datasets."""

def enable_caching() -> None:
    """Enable caching of dataset operations."""

def disable_caching() -> None:
    """Disable caching of dataset operations."""

def is_caching_enabled() -> bool:
    """Check if caching is currently enabled."""
```
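
A brief sketch of combining two small in-memory datasets with these helpers (the data is illustrative):

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

a = Dataset.from_dict({"text": ["a1", "a2"], "label": [0, 0]})
b = Dataset.from_dict({"text": ["b1", "b2"], "label": [1, 1]})

# Stack rows from both datasets (schemas must match)
combined = concatenate_datasets([a, b])

# Alternate between sources, sampling with the given probabilities
mixed = interleave_datasets([a, b], probabilities=[0.5, 0.5], seed=42)
print(len(combined), mixed[0])
```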

[Dataset Operations](./dataset-operations.md)

### Features and Type System

Comprehensive type system for defining and validating dataset schemas, supporting primitive types, complex nested structures, and multimedia data.

```python { .api }
class Features(dict):
    """Dictionary-like container for dataset features with type validation."""

    def encode_example(self, example: dict) -> dict: ...
    def decode_example(self, example: dict) -> dict: ...

class Value:
    """Feature for primitive data types (int32, float64, string, bool, etc.)."""

    def __init__(self, dtype: str, id: Optional[str] = None): ...

class ClassLabel:
    """Feature for categorical/classification labels."""

    def __init__(
        self,
        num_classes: Optional[int] = None,
        names: Optional[List[str]] = None,
        names_file: Optional[str] = None,
        id: Optional[str] = None,
    ): ...

class Audio:
    """Feature for audio data with automatic format handling."""

    def __init__(self, sampling_rate: Optional[int] = None, mono: bool = True, decode: bool = True): ...

class Image:
    """Feature for image data with automatic format handling."""

    def __init__(self, decode: bool = True, id: Optional[str] = None): ...
```
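
For example, a schema for a small classification dataset might be declared like this (a minimal sketch with illustrative data):

```python
from datasets import Dataset, Features, Value, ClassLabel

features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["negative", "positive"]),
})

# The "label" column is validated and encoded against the schema
ds = Dataset.from_dict(
    {"text": ["great movie", "terrible plot"], "label": [1, 0]},
    features=features,
)
print(ds.features["label"].int2str(ds[0]["label"]))  # "positive"
```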

[Features and Types](./features-and-types.md)

### Dataset Building

Classes and utilities for creating custom dataset builders and configurations for new datasets.

```python { .api }
class DatasetBuilder(ABC):
    """Abstract base class for dataset builders."""

    def download_and_prepare(self, download_config: Optional[DownloadConfig] = None, **kwargs) -> None: ...
    def as_dataset(self, split: Optional[Split] = None, **kwargs) -> Union[Dataset, DatasetDict]: ...

class GeneratorBasedBuilder(DatasetBuilder):
    """Dataset builder for datasets generated from Python generators."""

    def _generate_examples(self, **kwargs): ...

class BuilderConfig:
    """Configuration class for dataset builders."""

    def __init__(
        self,
        name: str = "default",
        version: Optional[Union[str, Version]] = "0.0.0",
        data_dir: Optional[str] = None,
        data_files: Optional[DataFilesDict] = None,
        description: Optional[str] = None,
    ): ...
```
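
A minimal sketch of a custom generator-based builder; the class name and example data are illustrative, not part of the library:

```python
import datasets

class MyCorpus(datasets.GeneratorBasedBuilder):
    """Toy builder that yields a handful of hard-coded examples."""

    def _info(self):
        return datasets.DatasetInfo(
            description="Toy corpus",
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN)]

    def _generate_examples(self):
        for idx, text in enumerate(["hello", "world"]):
            yield idx, {"text": text}

builder = MyCorpus()
builder.download_and_prepare()
dataset = builder.as_dataset(split="train")
```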

[Dataset Building](./dataset-building.md)

### Dataset Information and Inspection

Functions and classes for inspecting dataset metadata, configurations, and available splits.

```python { .api }
class DatasetInfo:
    """Container for dataset metadata and information."""

    description: str
    features: Optional[Features]
    total_num_examples: Optional[int]
    splits: Optional[SplitDict]
    supervised_keys: Optional[SupervisedKeysData]
    version: Optional[Version]
    license: Optional[str]
    citation: Optional[str]

def get_dataset_config_names(path: str, **kwargs) -> List[str]:
    """Get available configuration names for a dataset."""

def get_dataset_split_names(path: str, config_name: Optional[str] = None, **kwargs) -> List[str]:
    """Get available split names for a dataset."""

def get_dataset_infos(path: str, **kwargs) -> Dict[str, DatasetInfo]:
    """Get information about all configurations of a dataset."""
```
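
For instance, available configurations and splits can be listed before loading anything (the "glue" dataset name is illustrative):

```python
from datasets import get_dataset_config_names, get_dataset_split_names

print(get_dataset_config_names("glue"))         # e.g. ["cola", "sst2", ...]
print(get_dataset_split_names("glue", "sst2"))  # e.g. ["train", "validation", "test"]
```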

[Dataset Information](./dataset-information.md)

## Types

```python { .api }
class Split:
    """Standard dataset splits."""
    TRAIN: str = "train"
    TEST: str = "test"
    VALIDATION: str = "validation"

class DownloadMode:
    """Download behavior modes."""
    REUSE_DATASET_IF_EXISTS: str = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS: str = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD: str = "force_redownload"

class VerificationMode:
    """Dataset verification modes."""
    BASIC_CHECKS: str = "basic_checks"
    ALL_CHECKS: str = "all_checks"
    NO_CHECKS: str = "no_checks"
```
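
These values are typically passed straight to `load_dataset()` (a minimal sketch; the dataset name is illustrative):

```python
from datasets import load_dataset, DownloadMode, VerificationMode

dataset = load_dataset(
    "squad",
    split="train",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
    verification_mode=VerificationMode.BASIC_CHECKS,
)
```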