HuggingFace's community-driven open-source library of datasets for machine learning, with one-line dataloaders, efficient preprocessing, and multi-framework support
npx @tessl/cli install tessl/pypi-datasets@4.0.0
# HuggingFace Datasets

A comprehensive dataset management library that makes it easy to load, process, and work with machine learning datasets. It offers one-line dataloaders for thousands of public datasets from the HuggingFace Datasets Hub, efficient preprocessing backed by memory-mapped Apache Arrow storage so large datasets are not constrained by available RAM, and built-in interoperability with major ML frameworks, including NumPy, PyTorch, TensorFlow, JAX, and Pandas.

## Package Information

- **Package Name**: datasets
- **Language**: Python
- **Installation**: `pip install datasets`

## Core Imports

```python
import datasets
```

Common patterns for loading and working with datasets:

```python
from datasets import load_dataset, Dataset, DatasetDict
from datasets import concatenate_datasets, interleave_datasets
```

## Basic Usage

```python
from datasets import load_dataset

# Load a dataset from the Hub
dataset = load_dataset("squad", split="train")

# Access dataset features and data
print(dataset.features)
print(len(dataset))
print(dataset[0])

# Apply transformations
def preprocess(example):
    example["question_length"] = len(example["question"])
    return example

dataset = dataset.map(preprocess)

# Convert to different formats
torch_dataset = dataset.with_format("torch")
pandas_df = dataset.to_pandas()

# Save to disk
dataset.save_to_disk("./my_dataset")
```
## Architecture

The datasets library is built around these key components:

- **Dataset Classes**: `Dataset` for map-style access and `IterableDataset` for streaming large datasets
- **Loading System**: `load_dataset()` function with automatic discovery of dataset builders
- **Features System**: Comprehensive type definitions for structured data (text, audio, images, etc.)
- **Arrow Backend**: Memory-mapped storage using Apache Arrow for efficient data handling
- **Caching System**: Fingerprint-based caching for reproducible data processing
- **Hub Integration**: Direct access to thousands of datasets from the HuggingFace Hub

This design enables efficient processing of datasets ranging from small research datasets to massive production corpora, with seamless integration into popular ML frameworks and automatic optimization through caching and memory mapping.
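A minimal sketch of how these pieces fit together in practice (the toy data below is illustrative): datasets live in Arrow tables, and `map()` results are cached under a fingerprint derived from the input data and the transform, so re-running the same call reuses the cached result.

```python
from datasets import Dataset, is_caching_enabled

# Build a small in-memory dataset; on disk, datasets are memory-mapped Arrow tables.
ds = Dataset.from_dict({"text": ["hello", "world"]})

# The transform's result is cached by fingerprint for reproducible reprocessing.
ds = ds.map(lambda example: {"n_chars": len(example["text"])})

print(ds[0])                 # {'text': 'hello', 'n_chars': 5}
print(is_caching_enabled())  # caching is on by default
```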
## Capabilities

### Data Loading and Discovery

The primary interface for loading datasets from the HuggingFace Hub, local files, or custom data sources. Supports automatic format detection, streaming for large datasets, and flexible data splitting.
```python { .api }
def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[Dict] = None,
    trust_remote_code: Optional[bool] = None,
    **config_kwargs,
) -> Union[Dataset, DatasetDict, IterableDataset, IterableDatasetDict]:
    """Load a dataset from the HuggingFace Hub, local files, or custom sources."""

def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[Dict] = None,
    trust_remote_code: Optional[bool] = None,
    **config_kwargs,
) -> DatasetBuilder:
    """Load a dataset builder without building the dataset."""

def load_from_disk(dataset_path: str, fs=None, keep_in_memory: Optional[bool] = None) -> Union[Dataset, DatasetDict]:
    """Load a dataset that was previously saved using save_to_disk."""
```
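A short usage sketch (the CSV file names are placeholders for illustration): loading local files through the generic `csv` builder, and streaming a Hub dataset instead of downloading it in full.

```python
from datasets import load_dataset

# Local files: the "csv" builder infers the schema from the files;
# "train.csv" / "test.csv" are hypothetical paths.
local = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

# Streaming: returns an IterableDataset and avoids downloading the
# full dataset up front.
streamed = load_dataset("squad", split="train", streaming=True)
print(next(iter(streamed)))
```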
[Data Loading](./data-loading.md)

### Core Dataset Classes

The fundamental dataset classes that provide different access patterns and capabilities for working with dataset collections.
```python { .api }
class Dataset:
    """Map-style dataset backed by Apache Arrow for efficient random access."""

    def __getitem__(self, key): ...
    def __len__(self) -> int: ...
    def map(self, function, **kwargs) -> "Dataset": ...
    def filter(self, function, **kwargs) -> "Dataset": ...
    def select(self, indices) -> "Dataset": ...
    def with_format(self, type: Optional[str] = None, **kwargs) -> "Dataset": ...
    def to_pandas(self) -> "pandas.DataFrame": ...
    def save_to_disk(self, dataset_path: str) -> None: ...

class DatasetDict(dict):
    """Dictionary of Dataset objects, typically for train/validation/test splits."""

    def map(self, function, **kwargs) -> "DatasetDict": ...
    def filter(self, function, **kwargs) -> "DatasetDict": ...
    def with_format(self, type: Optional[str] = None, **kwargs) -> "DatasetDict": ...
    def save_to_disk(self, dataset_dict_path: str) -> None: ...

class IterableDataset:
    """Iterable-style dataset for streaming large datasets without loading into memory."""

    def __iter__(self): ...
    def map(self, function, **kwargs) -> "IterableDataset": ...
    def filter(self, function, **kwargs) -> "IterableDataset": ...
    def take(self, n: int) -> "IterableDataset": ...
    def skip(self, n: int) -> "IterableDataset": ...
```
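A brief sketch contrasting the two access patterns; it assumes the library's `Dataset.to_iterable_dataset()` helper to derive a streaming view from a small map-style dataset.

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "bb", "ccc"], "label": [0, 1, 0]})

# Map-style: eager transforms with random access by index.
long_rows = ds.filter(lambda ex: len(ex["text"]) > 1)
print(long_rows.select([0])[0])  # {'text': 'bb', 'label': 1}

# Iterable-style: lazy evaluation; take(n) yields only the first n rows.
for row in ds.to_iterable_dataset().take(2):
    print(row)
```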
[Core Dataset Classes](./core-dataset-classes.md)

### Dataset Operations

Functions for combining, transforming, and manipulating datasets, including concatenation, interleaving, and caching control.
```python { .api }
def concatenate_datasets(
    dsets: List[Dataset],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> Dataset:
    """Concatenate multiple Dataset objects."""

def interleave_datasets(
    datasets: List[Union[Dataset, IterableDataset]],
    probabilities: Optional[List[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: str = "first_exhausted",
) -> Union[Dataset, IterableDataset]:
    """Interleave multiple datasets."""

def enable_caching() -> None:
    """Enable caching of dataset operations."""

def disable_caching() -> None:
    """Disable caching of dataset operations."""

def is_caching_enabled() -> bool:
    """Check if caching is currently enabled."""
```
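For example, with two toy datasets (a minimal sketch):

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

a = Dataset.from_dict({"x": [1, 2]})
b = Dataset.from_dict({"x": [3, 4]})

# Row-wise concatenation (axis=0, the default).
print(concatenate_datasets([a, b])["x"])  # [1, 2, 3, 4]

# Weighted interleaving: rows are sampled from each source with the given
# probabilities until one source is exhausted (the default stopping strategy).
mixed = interleave_datasets([a, b], probabilities=[0.5, 0.5], seed=42)
print(mixed["x"])
```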
[Dataset Operations](./dataset-operations.md)

### Features and Type System

Comprehensive type system for defining and validating dataset schemas, supporting primitive types, complex nested structures, and multimedia data.
```python { .api }
class Features(dict):
    """Dictionary-like container for dataset features with type validation."""

    def encode_example(self, example: dict) -> dict: ...
    def decode_example(self, example: dict) -> dict: ...

class Value:
    """Feature for primitive data types (int32, float64, string, bool, etc.)."""

    def __init__(self, dtype: str, id: Optional[str] = None): ...

class ClassLabel:
    """Feature for categorical/classification labels."""

    def __init__(
        self,
        num_classes: Optional[int] = None,
        names: Optional[List[str]] = None,
        names_file: Optional[str] = None,
        id: Optional[str] = None,
    ): ...

class Audio:
    """Feature for audio data with automatic format handling."""

    def __init__(self, sampling_rate: Optional[int] = None, mono: bool = True, decode: bool = True): ...

class Image:
    """Feature for image data with automatic format handling."""

    def __init__(self, decode: bool = True, id: Optional[str] = None): ...
```
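A small sketch of schema-driven encoding: with an explicit `Features` schema, `ClassLabel` maps between string names and integer ids (the toy rows are illustrative).

```python
from datasets import ClassLabel, Dataset, Features, Value

features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["neg", "pos"]),
})

ds = Dataset.from_dict(
    {"text": ["great", "awful"], "label": [1, 0]},
    features=features,
)

# encode_example converts string labels to class ids per the schema;
# int2str converts back.
print(features.encode_example({"text": "great", "label": "pos"}))  # {'text': 'great', 'label': 1}
print(ds.features["label"].int2str(ds[0]["label"]))                # 'pos'
```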
[Features and Types](./features-and-types.md)

### Dataset Building

Classes and utilities for creating custom dataset builders and configurations for new datasets.
```python { .api }
class DatasetBuilder(ABC):
    """Abstract base class for dataset builders."""

    def download_and_prepare(self, download_config: Optional[DownloadConfig] = None, **kwargs) -> None: ...
    def as_dataset(self, split: Optional[Split] = None, **kwargs) -> Union[Dataset, DatasetDict]: ...

class GeneratorBasedBuilder(DatasetBuilder):
    """Dataset builder for datasets generated from Python generators."""

    def _generate_examples(self, **kwargs): ...

class BuilderConfig:
    """Configuration class for dataset builders."""

    def __init__(
        self,
        name: str = "default",
        version: Optional[Union[str, Version]] = "0.0.0",
        data_dir: Optional[str] = None,
        data_files: Optional[DataFilesDict] = None,
        description: Optional[str] = None,
    ): ...
```
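A minimal, hypothetical builder sketch (the class name and hard-coded examples are illustrative, not part of the library): a `GeneratorBasedBuilder` subclass implements `_info`, `_split_generators`, and `_generate_examples`.

```python
import datasets

class ToyTextBuilder(datasets.GeneratorBasedBuilder):
    """Illustrative builder that yields two hard-coded examples."""

    def _info(self):
        return datasets.DatasetInfo(
            description="Toy dataset",
            features=datasets.Features({"text": datasets.Value("string")}),
        )

    def _split_generators(self, dl_manager):
        # A real builder would use dl_manager to download and extract files.
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={})]

    def _generate_examples(self):
        for idx, text in enumerate(["hello", "world"]):
            yield idx, {"text": text}

builder = ToyTextBuilder()
builder.download_and_prepare()
ds = builder.as_dataset(split="train")
```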
[Dataset Building](./dataset-building.md)

### Dataset Information and Inspection

Functions and classes for inspecting dataset metadata, configurations, and available splits.
```python { .api }
class DatasetInfo:
    """Container for dataset metadata and information."""

    description: str
    features: Optional[Features]
    dataset_size: Optional[int]
    splits: Optional[SplitDict]
    supervised_keys: Optional[SupervisedKeysData]
    version: Optional[Version]
    license: Optional[str]
    citation: Optional[str]

def get_dataset_config_names(path: str, **kwargs) -> List[str]:
    """Get available configuration names for a dataset."""

def get_dataset_split_names(path: str, config_name: Optional[str] = None, **kwargs) -> List[str]:
    """Get available split names for a dataset."""

def get_dataset_infos(path: str, **kwargs) -> Dict[str, DatasetInfo]:
    """Get information about all configurations of a dataset."""
```
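For example, inspecting a Hub dataset's configurations and splits without downloading the data itself (the printed values are indicative):

```python
from datasets import get_dataset_config_names, get_dataset_split_names

print(get_dataset_config_names("glue"))         # ['cola', 'sst2', ...]
print(get_dataset_split_names("glue", "sst2"))  # ['train', 'validation', 'test']
```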
[Dataset Information](./dataset-information.md)

## Types
```python { .api }
class Split:
    """Standard dataset splits."""
    TRAIN: str = "train"
    TEST: str = "test"
    VALIDATION: str = "validation"

class DownloadMode:
    """Download behavior modes."""
    REUSE_DATASET_IF_EXISTS: str = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS: str = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD: str = "force_redownload"

class VerificationMode:
    """Dataset verification modes."""
    BASIC_CHECKS: str = "basic_checks"
    ALL_CHECKS: str = "all_checks"
    NO_CHECKS: str = "no_checks"
```
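These string-backed enums are passed to `load_dataset()`; a small sketch:

```python
from datasets import DownloadMode, VerificationMode, load_dataset

# Force a fresh download and run the full verification suite,
# ignoring any previously cached copy.
ds = load_dataset(
    "squad",
    split="train",
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
    verification_mode=VerificationMode.ALL_CHECKS,
)
```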