# Data Loading
The primary interface for loading datasets from the HuggingFace Hub, local files, or custom data sources. This module provides functions for automatic format detection, streaming for large datasets, and flexible data splitting.
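The extension-based part of that automatic format detection can be sketched as follows. This is an illustrative approximation only: the builder names mirror the packaged modules (`csv`, `json`, `parquet`, `text`), but the library's real resolution logic is more involved and also inspects Hub metadata.

```python
from pathlib import Path

# Illustrative mapping from file extension to the packaged builder name
# that load_dataset would typically select for local files (assumption:
# simplified from the library's actual resolution rules).
EXTENSION_TO_BUILDER = {
    ".csv": "csv",
    ".tsv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".txt": "text",
}

def infer_builder(filename: str) -> str:
    """Return the builder name for a data file, defaulting to 'text'."""
    return EXTENSION_TO_BUILDER.get(Path(filename).suffix.lower(), "text")
```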
## Capabilities
### Loading Datasets from Hub and Files
The main entry point for loading datasets, supporting thousands of datasets from the HuggingFace Hub as well as local files in various formats (CSV, JSON, Parquet, etc.).
```python { .api }
def load_dataset(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    split: Optional[Union[str, Split, list[str], list[Split]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    verification_mode: Optional[Union[VerificationMode, str]] = None,
    keep_in_memory: Optional[bool] = None,
    save_infos: bool = False,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    streaming: bool = False,
    num_proc: Optional[int] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]:
    """
    Load a dataset from the Hugging Face Hub, or a local dataset.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - split (str, Split, list[str], list[Split], optional): Which split of the data to load
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the features type to use for this dataset
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - verification_mode (VerificationMode or str, optional): Select the verification mode
    - keep_in_memory (bool, optional): Whether to copy the dataset in-memory
    - save_infos (bool): Save the dataset information (checksums/size/splits/...)
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): Optional string or boolean to use as Bearer token for remote files
    - streaming (bool): If True, the data files are not downloaded; the dataset is streamed progressively while iterating
    - num_proc (int, optional): Number of processes when downloading and generating the dataset locally
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder

    Returns:
    - Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]: Depending on the split and streaming parameters
    """
```
**Usage Examples:**
```python
# Load a dataset from the Hub
dataset = load_dataset("squad")

# Load a specific split
train_dataset = load_dataset("squad", split="train")

# Load with streaming for large datasets
streaming_dataset = load_dataset("oscar", "unshuffled_deduplicated_en", streaming=True)

# Load local CSV files
dataset = load_dataset("csv", data_files="my_file.csv")

# Load multiple files with different splits
dataset = load_dataset("csv", data_files={
    "train": ["train1.csv", "train2.csv"],
    "test": "test.csv",
})
```
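The `data_files` argument accepts three shapes: a single path, a list of paths, or a mapping from split names to either. How these shapes collapse to a common form can be sketched as below; this is an illustrative normalization (assumption: a bare string or list is assigned to a single `train` split, as in the examples above), not the library's internal code.

```python
from collections.abc import Mapping, Sequence
from typing import Union

DataFiles = Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]

def normalize_data_files(data_files: DataFiles) -> dict:
    """Normalize the accepted data_files shapes into {split_name: [paths]}.

    Illustrative sketch: a bare string or list of paths is treated as the
    'train' split; a mapping keeps its split names, with string values
    promoted to one-element lists.
    """
    if isinstance(data_files, str):
        return {"train": [data_files]}
    if isinstance(data_files, Mapping):
        return {
            split: [files] if isinstance(files, str) else list(files)
            for split, files in data_files.items()
        }
    return {"train": list(data_files)}
```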
### Loading Dataset Builders
Load a dataset builder without building the dataset, useful for inspecting dataset information before downloading.
```python { .api }
def load_dataset_builder(
    path: str,
    name: Optional[str] = None,
    data_dir: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    cache_dir: Optional[str] = None,
    features: Optional[Features] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    storage_options: Optional[dict] = None,
    **config_kwargs,
) -> DatasetBuilder:
    """
    Load a dataset builder which can be used to inspect dataset information.

    Parameters:
    - path (str): Path or name of the dataset
    - name (str, optional): Name of the dataset configuration
    - data_dir (str, optional): data_dir of the dataset configuration
    - data_files (str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], optional): Path(s) to source data file(s)
    - cache_dir (str, optional): Directory to read/write data
    - features (Features, optional): Set the features type to use for this dataset
    - download_config (DownloadConfig, optional): Specific download configuration parameters
    - download_mode (DownloadMode or str, optional): Select the download/generation mode
    - revision (str, Version, optional): Version of the dataset script to load
    - token (bool or str, optional): Optional string or boolean to use as Bearer token
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend
    - **config_kwargs: Keyword arguments to be passed to the BuilderConfig

    Returns:
    - DatasetBuilder: A DatasetBuilder instance
    """
```
### Loading from Disk
Load datasets that were previously saved to disk using the save_to_disk method.
```python { .api }
def load_from_disk(
    dataset_path: PathLike,
    keep_in_memory: Optional[bool] = None,
    storage_options: Optional[dict] = None,
) -> Union[Dataset, DatasetDict]:
    """
    Load a dataset that was previously saved using save_to_disk, from a local path or remote URI.

    Parameters:
    - dataset_path (PathLike): Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train")
    - keep_in_memory (bool, optional): Whether to copy the dataset in-memory
    - storage_options (dict, optional): Key/value pairs to be passed on to the file-system backend

    Returns:
    - Union[Dataset, DatasetDict]: If the saved dataset is a Dataset, returns Dataset; if it is a DatasetDict, returns DatasetDict
    """
```
**Usage Examples:**
```python
# Inspect dataset without downloading
builder = load_dataset_builder("squad")
print(builder.info.description)
print(builder.info.features)

# Load previously saved dataset
dataset = load_from_disk("./my_saved_dataset")
```
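How `load_from_disk` decides whether to return a `Dataset` or a `DatasetDict` can be sketched via the marker files that `save_to_disk` writes. Treat the file names below as an assumption about the on-disk layout (a `DatasetDict` directory carries a `dataset_dict.json` listing its splits, while a single `Dataset` directory carries `dataset_info.json`); this is an illustration, not the library's implementation.

```python
import json
import os
import tempfile

def saved_dataset_kind(dataset_path: str) -> str:
    """Illustrative sketch: inspect marker files to classify a saved dataset.

    Assumed layout: DatasetDict directories contain dataset_dict.json,
    single Dataset directories contain dataset_info.json.
    """
    if os.path.isfile(os.path.join(dataset_path, "dataset_dict.json")):
        return "DatasetDict"
    if os.path.isfile(os.path.join(dataset_path, "dataset_info.json")):
        return "Dataset"
    raise FileNotFoundError(f"No saved dataset found at {dataset_path}")

# Demo on a fake directory layout:
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "dataset_dict.json"), "w") as f:
        json.dump({"splits": ["train", "test"]}, f)
    kind = saved_dataset_kind(root)
    print(kind)  # DatasetDict
```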
## Types
### Path Types
```python { .api }
from os import PathLike
```
### Download and Verification Modes
```python { .api }
class DownloadMode:
    """Download behavior modes."""
    REUSE_DATASET_IF_EXISTS: str = "reuse_dataset_if_exists"
    REUSE_CACHE_IF_EXISTS: str = "reuse_cache_if_exists"
    FORCE_REDOWNLOAD: str = "force_redownload"

class VerificationMode:
    """Dataset verification modes."""
    BASIC_CHECKS: str = "basic_checks"
    ALL_CHECKS: str = "all_checks"
    NO_CHECKS: str = "no_checks"

class DownloadConfig:
    """Configuration for download operations."""

    def __init__(
        self,
        cache_dir: Optional[Union[str, Path]] = None,
        force_download: bool = False,
        resume_download: bool = False,
        proxies: Optional[Dict[str, str]] = None,
        token: Optional[Union[str, bool]] = None,
        use_etag: bool = True,
        num_proc: Optional[int] = None,
        max_retries: int = 1,
        **kwargs,
    ): ...

class ReadInstruction:
    """Reading instruction for specifying dataset subsets and splits."""

    def __init__(
        self,
        split_name: str,
        from_: Optional[int] = None,
        to: Optional[int] = None,
        unit: str = "abs",
    ): ...

    @classmethod
    def from_spec(cls, spec: str) -> "ReadInstruction": ...
```
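The spec strings accepted by `ReadInstruction.from_spec` (and by the `split` argument of `load_dataset`) look like `"train"`, `"test[:100]"`, `"train[25%:75%]"`, with `"+"` joining multiple instructions. A minimal sketch of parsing that notation is shown below; it is an illustrative approximation written for this document, not the library's parser, and handles only the simple forms just listed.

```python
import re

# Matches one instruction: a split name with an optional [from:to] slice,
# where the bounds may be absolute row counts or percentages.
_SPEC_RE = re.compile(r"^(?P<split>\w+)(\[(?P<from>-?\d+%?)?:(?P<to>-?\d+%?)?\])?$")

def parse_spec(spec: str) -> list:
    """Parse a split spec string into a list of instruction dicts (sketch)."""
    parts = []
    for piece in spec.split("+"):  # "+" joins multiple instructions
        m = _SPEC_RE.match(piece.strip())
        if m is None:
            raise ValueError(f"bad spec: {piece!r}")
        bounds = (m.group("from") or "") + (m.group("to") or "")
        parts.append({
            "split_name": m.group("split"),
            "from_": m.group("from"),
            "to": m.group("to"),
            "unit": "%" if "%" in bounds else "abs",
        })
    return parts
```

For example, `parse_spec("train[25%:75%]")` yields one instruction with `unit` set to `"%"`, mirroring the `unit` parameter of `ReadInstruction.__init__` above.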