# Dataset Operations

Functions for combining, transforming, and manipulating datasets, including concatenation, interleaving, and caching control. These operations enable composition of multiple datasets and fine-grained control over dataset processing behavior.

## Capabilities

### Dataset Combination

Functions for combining multiple datasets into unified collections, supporting both vertical (row-wise) and horizontal (column-wise) concatenation, as well as sophisticated interleaving patterns.
```python { .api }
def concatenate_datasets(
    dsets: List[Union[Dataset, IterableDataset]],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> Union[Dataset, IterableDataset]:
    """
    Converts a list of datasets with the same schema into a single dataset.

    Parameters:
    - dsets (List[Dataset] or List[IterableDataset]): List of datasets to concatenate
    - info (DatasetInfo, optional): Dataset information, like description, citation, etc.
    - split (NamedSplit, optional): Name of the dataset split
    - axis (int): Axis to concatenate over, where 0 means over rows (vertically) and 1 means over columns (horizontally)

    Returns:
    - Union[Dataset, IterableDataset]: Concatenated dataset of the same type as the input datasets
    """

def interleave_datasets(
    datasets: List[Union[Dataset, IterableDataset]],
    probabilities: Optional[List[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: str = "first_exhausted",
) -> Union[Dataset, IterableDataset]:
    """
    Interleave several datasets (sources) into a single dataset by alternating between sources.

    Parameters:
    - datasets (List[Dataset] or List[IterableDataset]): List of datasets to interleave
    - probabilities (List[float], optional): If specified, examples are sampled from sources according to these probabilities
    - seed (int, optional): Random seed used to choose a source for each example
    - info (DatasetInfo, optional): Dataset information, like description, citation, etc.
    - split (NamedSplit, optional): Name of the dataset split
    - stopping_strategy (str): Either "first_exhausted" (stop when the first dataset is exhausted) or "all_exhausted" (oversample until all datasets are exhausted)

    Returns:
    - Union[Dataset, IterableDataset]: Interleaved dataset of the same type as the input datasets
    """
```
**Usage Examples:**

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

# Create sample datasets
ds1 = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds2 = Dataset.from_dict({"text": ["foo", "bar"], "label": [1, 0]})
ds3 = Dataset.from_dict({"text": ["alice", "bob"], "label": [0, 1]})

# Concatenate datasets vertically (append rows)
combined = concatenate_datasets([ds1, ds2, ds3])
print(len(combined))  # 6

# Interleave datasets by alternating between sources
interleaved = interleave_datasets([ds1, ds2, ds3])
print(interleaved["text"])  # ['hello', 'foo', 'alice', 'world', 'bar', 'bob']

# Interleave with custom sampling probabilities
weighted = interleave_datasets([ds1, ds2, ds3], probabilities=[0.7, 0.2, 0.1], seed=42)

# Different stopping strategies
all_exhausted = interleave_datasets([ds1, ds2, ds3], stopping_strategy="all_exhausted")
```
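The two stopping strategies mainly affect how many examples from each source end up in the result when the sources have different sizes. The sketch below (toy datasets with made-up contents, not taken from the examples above) compares the resulting lengths.

```python
from datasets import Dataset, interleave_datasets

# Toy sources of unequal size (illustrative values)
short_ds = Dataset.from_dict({"text": ["a", "b"]})
long_ds = Dataset.from_dict({"text": ["c", "d", "e", "f"]})

# "first_exhausted" (default): stop as soon as the smallest source runs out
first = interleave_datasets([short_ds, long_ds], stopping_strategy="first_exhausted")

# "all_exhausted": keep going (reusing the smaller source) until every source is exhausted
every = interleave_datasets([short_ds, long_ds], stopping_strategy="all_exhausted")

print(len(first) < len(every))  # True: "all_exhausted" yields the longer result
```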
### Caching Control

Global functions for controlling the caching behavior of dataset operations. By default, dataset transformations are cached for reproducibility and performance.

```python { .api }
def enable_caching() -> None:
    """
    Enable caching of dataset operations.

    When enabled (default), data transformations are stored in cache files named using
    dataset fingerprints. This allows reloading existing cache files if they've already
    been computed, improving performance for repeated operations.
    """

def disable_caching() -> None:
    """
    Disable caching of dataset operations.

    When disabled, cache files are always recreated and existing cache files are ignored.
    This forces recomputation of all transformations but ensures fresh processing of data.
    """

def is_caching_enabled() -> bool:
    """
    Check if caching is currently enabled.

    Returns:
    - bool: True if caching is enabled, False otherwise
    """
```
**Usage Examples:**

```python
from datasets import disable_caching, enable_caching, is_caching_enabled, load_dataset

# Check current caching status
print(f"Caching enabled: {is_caching_enabled()}")  # True by default

# Disable caching for fresh processing
disable_caching()
dataset = load_dataset("squad", split="train[:100]")
processed = dataset.map(lambda x: {"length": len(x["question"])})  # Always recomputed

# Re-enable caching
enable_caching()
cached_processed = dataset.map(lambda x: {"length": len(x["question"])})  # Uses cache if available
```
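The global switches affect every subsequent operation. For one-off control over a single transformation, `Dataset.map` also accepts a `load_from_cache_file` argument; the sketch below assumes that keyword is available in your version of the library.

```python
from datasets import Dataset

ds = Dataset.from_dict({"question": ["What is a dataset?", "Why cache results?"]})

# Force recomputation for this call only; the global caching setting is untouched.
fresh = ds.map(lambda x: {"length": len(x["question"])}, load_from_cache_file=False)
```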
### Progress Bar Control

Functions for controlling the display of progress bars during dataset operations, particularly useful for long-running transformations.

```python { .api }
def enable_progress_bar() -> None:
    """Enable progress bar display during dataset operations."""

def disable_progress_bar() -> None:
    """Disable progress bar display during dataset operations."""

def is_progress_bar_enabled() -> bool:
    """
    Check if progress bars are currently enabled.

    Returns:
    - bool: True if progress bars are enabled, False otherwise
    """

def enable_progress_bars() -> None:
    """Enable progress bars (plural form for consistency)."""

def disable_progress_bars() -> None:
    """Disable progress bars (plural form for consistency)."""

def are_progress_bars_disabled() -> bool:
    """
    Check if progress bars are currently disabled.

    Returns:
    - bool: True if progress bars are disabled, False otherwise
    """
```
**Usage Examples:**

```python
from datasets import disable_progress_bar, enable_progress_bar, load_dataset

# Disable progress bars for cleaner output
disable_progress_bar()
dataset = load_dataset("squad", split="train")
processed = dataset.map(lambda x: {"length": len(x["question"])})  # No progress bar shown

# Re-enable progress bars
enable_progress_bar()
filtered = processed.filter(lambda x: x["length"] > 10)  # Progress bar displayed
```
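When progress bars only need to be silenced for part of a script, a small context manager keeps the previous state intact. The helper below is a sketch of that pattern; `quiet_progress` is a hypothetical name, not part of the library.

```python
from contextlib import contextmanager
from datasets import disable_progress_bar, enable_progress_bar, is_progress_bar_enabled

@contextmanager
def quiet_progress():
    """Hypothetical helper: disable progress bars inside the block, then restore the old state."""
    was_enabled = is_progress_bar_enabled()
    disable_progress_bar()
    try:
        yield
    finally:
        if was_enabled:
            enable_progress_bar()

# Usage:
# with quiet_progress():
#     dataset.map(...)  # runs without progress output
```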
### Experimental Features

Decorator for marking experimental functionality that may change in future versions.

```python { .api }
def experimental(fn):
    """
    Decorator to mark experimental features.

    Features marked as experimental may have their API changed or removed in future versions
    without a deprecation cycle. Use with caution in production code.
    """
```
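The decorator is applied like any other function decorator. The sketch below assumes it can be imported from `datasets.utils` (the exact import path may vary by version); `my_new_feature` is a made-up function used only for illustration.

```python
from datasets.utils import experimental  # import path assumed, may differ by version

@experimental
def my_new_feature(batch):
    """A function whose API may change or be removed without a deprecation cycle."""
    return batch
```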
## Advanced Dataset Operations

### Column-wise Concatenation

```python
from datasets import Dataset, concatenate_datasets

# Concatenate datasets horizontally (add columns)
# Note: datasets must have the same number of rows
ds1 = Dataset.from_dict({"text": ["hello", "world"]})
ds2 = Dataset.from_dict({"label": [0, 1]})

# Horizontal concatenation (axis=1)
combined = concatenate_datasets([ds1, ds2], axis=1)
print(combined.column_names)  # ['text', 'label']
```
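Horizontal concatenation composes column sets, so the inputs should not share column names (to the best of my knowledge, overlapping names are rejected). Renaming a column first sidesteps the clash; `rename_column` is a standard `Dataset` method, while the column names below are illustrative.

```python
from datasets import Dataset, concatenate_datasets

left = Dataset.from_dict({"text": ["hello", "world"]})
right = Dataset.from_dict({"text": ["bonjour", "monde"]})

# Both inputs define "text"; rename one side before concatenating along columns.
right = right.rename_column("text", "text_fr")
combined = concatenate_datasets([left, right], axis=1)
print(combined.column_names)  # ['text', 'text_fr']
```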
### Complex Interleaving Patterns

```python
from datasets import Dataset, interleave_datasets

# Create datasets of different sizes
small_ds = Dataset.from_dict({"text": ["a", "b"]})
medium_ds = Dataset.from_dict({"text": ["c", "d", "e"]})
large_ds = Dataset.from_dict({"text": ["f", "g", "h", "i"]})

# Use probabilities to control sampling
# Higher probability = more examples from that dataset
interleaved = interleave_datasets(
    [small_ds, medium_ds, large_ds],
    probabilities=[0.1, 0.3, 0.6],  # Favor the large dataset
    seed=42,
    stopping_strategy="all_exhausted",  # Ensure all data is used
)
```
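Since `interleave_datasets` also accepts `IterableDataset` inputs, the same pattern fits streaming pipelines. The sketch below builds small streaming sources with `IterableDataset.from_generator` (available in recent library versions); the generator contents are illustrative.

```python
from datasets import IterableDataset, interleave_datasets

def gen_a():
    for text in ["a", "b"]:
        yield {"text": text}

def gen_b():
    for text in ["c", "d", "e"]:
        yield {"text": text}

stream_a = IterableDataset.from_generator(gen_a)
stream_b = IterableDataset.from_generator(gen_b)

# Same API as with map-style datasets; examples are produced lazily on iteration.
mixed = interleave_datasets([stream_a, stream_b], probabilities=[0.5, 0.5], seed=0)
for example in mixed:
    print(example["text"])
```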
### Performance Considerations

- **Caching**: Enabled by default; provides a significant speedup for repeated operations
- **Memory Usage**: `concatenate_datasets` creates a new dataset that references the original data rather than copying it
- **Streaming**: Both operations work with `IterableDataset` for memory-efficient processing
- **Fingerprinting**: Each operation updates the dataset fingerprint used for cache invalidation (see the sketch after this list)
- **Multiprocessing**: Operations inherit multiprocessing settings from their constituent datasets
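The fingerprint is what cache files are keyed on, so every transformation produces a new one. The sketch below peeks at `Dataset._fingerprint`, an internal attribute (underscore-prefixed, so not a stable public API), purely to illustrate the mechanism.

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"]})
mapped = ds.map(lambda x: {"upper": x["text"].upper()})

# Internal attribute used here only for illustration; the transform yields a new fingerprint.
print(ds._fingerprint != mapped._fingerprint)  # True
```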
### Error Handling

Common error scenarios and their solutions:

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

# Schema mismatch in concatenation
try:
    ds1 = Dataset.from_dict({"text": ["hello"]})
    ds2 = Dataset.from_dict({"label": [0]})  # Different columns
    concatenate_datasets([ds1, ds2])  # Will fail
except ValueError:
    print("Schema mismatch - ensure datasets have compatible features")

# Empty dataset list
try:
    concatenate_datasets([])  # Will fail
except ValueError:
    print("Cannot concatenate an empty list of datasets")

# Probability mismatch in interleaving
ds3 = Dataset.from_dict({"text": ["foo"]})
ds4 = Dataset.from_dict({"text": ["bar"]})
try:
    interleave_datasets([ds3, ds4], probabilities=[0.5])  # Wrong length
except ValueError:
    print("Probabilities list must match the number of datasets")
```
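When many shards are combined programmatically, a quick pre-flight check can turn a vague alignment error into a precise message. `check_same_columns` below is a hypothetical helper written for this example, not part of the library.

```python
from datasets import Dataset, concatenate_datasets

def check_same_columns(dsets):
    """Hypothetical helper: raise with a precise message if column sets differ."""
    expected = set(dsets[0].column_names)
    for i, ds in enumerate(dsets[1:], start=1):
        if set(ds.column_names) != expected:
            raise ValueError(
                f"Dataset {i} has columns {sorted(ds.column_names)}, expected {sorted(expected)}"
            )

parts = [
    Dataset.from_dict({"text": ["hello"], "label": [0]}),
    Dataset.from_dict({"text": ["world"], "label": [1]}),
]
check_same_columns(parts)
combined = concatenate_datasets(parts)
```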