# Dataset Operations

Functions for combining, transforming, and manipulating datasets, including concatenation, interleaving, and caching control. These operations enable composition of multiple datasets and fine-grained control over dataset processing behavior.

## Capabilities

### Dataset Combination

Functions for combining multiple datasets into unified collections, supporting both vertical (row-wise) and horizontal (column-wise) concatenation, as well as sophisticated interleaving patterns.
```python { .api }
def concatenate_datasets(
    dsets: List[Union[Dataset, IterableDataset]],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> Union[Dataset, IterableDataset]:
    """
    Converts a list of datasets with the same schema into a single dataset.

    Parameters:
    - dsets (List[Dataset] or List[IterableDataset]): List of datasets to concatenate
    - info (DatasetInfo, optional): Dataset information, like description, citation, etc.
    - split (NamedSplit, optional): Name of the dataset split
    - axis (int): Axis to concatenate over, where 0 means over rows (vertically) and 1 means over columns (horizontally)

    Returns:
    - Union[Dataset, IterableDataset]: Concatenated dataset of the same type as the input datasets
    """

def interleave_datasets(
    datasets: List[Union[Dataset, IterableDataset]],
    probabilities: Optional[List[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: str = "first_exhausted",
) -> Union[Dataset, IterableDataset]:
    """
    Interleave several datasets (sources) into a single dataset by alternating between sources.

    Parameters:
    - datasets (List[Dataset] or List[IterableDataset]): List of datasets to interleave
    - probabilities (List[float], optional): If specified, examples are sampled from sources according to these probabilities
    - seed (int, optional): Random seed used to choose a source for each example
    - info (DatasetInfo, optional): Dataset information, like description, citation, etc.
    - split (NamedSplit, optional): Name of the dataset split
    - stopping_strategy (str): Either "first_exhausted" (stop when the first dataset is exhausted) or "all_exhausted" (oversample until all datasets are exhausted)

    Returns:
    - Union[Dataset, IterableDataset]: Interleaved dataset of the same type as the input datasets
    """
```
**Usage Examples:**

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

# Create sample datasets
ds1 = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds2 = Dataset.from_dict({"text": ["foo", "bar"], "label": [1, 0]})
ds3 = Dataset.from_dict({"text": ["alice", "bob"], "label": [0, 1]})

# Concatenate datasets vertically (append rows)
combined = concatenate_datasets([ds1, ds2, ds3])
print(len(combined))  # 6

# Interleave datasets by alternating between sources
interleaved = interleave_datasets([ds1, ds2, ds3])
print(interleaved["text"])  # ['hello', 'foo', 'alice', 'world', 'bar', 'bob']

# Interleave with custom sampling probabilities
weighted = interleave_datasets([ds1, ds2, ds3], probabilities=[0.7, 0.2, 0.1], seed=42)

# Different stopping strategies
all_exhausted = interleave_datasets([ds1, ds2, ds3], stopping_strategy="all_exhausted")
```
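The two stopping strategies mainly affect how many examples from each source end up in the result when the sources have different sizes. The sketch below (toy datasets with made-up contents, not taken from the examples above) compares the resulting lengths.

```python
from datasets import Dataset, interleave_datasets

# Toy sources of unequal size (illustrative values)
short_ds = Dataset.from_dict({"text": ["a", "b"]})
long_ds = Dataset.from_dict({"text": ["c", "d", "e", "f"]})

# "first_exhausted" (default): stop as soon as the smallest source runs out
first = interleave_datasets([short_ds, long_ds], stopping_strategy="first_exhausted")

# "all_exhausted": keep going (reusing the smaller source) until every source is exhausted
every = interleave_datasets([short_ds, long_ds], stopping_strategy="all_exhausted")

print(len(first) < len(every))  # True: "all_exhausted" yields the longer result
```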
### Caching Control

Global functions for controlling the caching behavior of dataset operations. By default, dataset transformations are cached for reproducibility and performance.

```python { .api }
def enable_caching() -> None:
    """
    Enable caching of dataset operations.

    When enabled (default), data transformations are stored in cache files named using
    dataset fingerprints. This allows reloading existing cache files if they've already
    been computed, improving performance for repeated operations.
    """

def disable_caching() -> None:
    """
    Disable caching of dataset operations.

    When disabled, cache files are always recreated and existing cache files are ignored.
    This forces recomputation of all transformations but ensures fresh processing of data.
    """

def is_caching_enabled() -> bool:
    """
    Check if caching is currently enabled.

    Returns:
    - bool: True if caching is enabled, False otherwise
    """
```
**Usage Examples:**

```python
from datasets import disable_caching, enable_caching, is_caching_enabled, load_dataset

# Check current caching status
print(f"Caching enabled: {is_caching_enabled()}")  # True by default

# Disable caching for fresh processing
disable_caching()
dataset = load_dataset("squad", split="train[:100]")
processed = dataset.map(lambda x: {"length": len(x["question"])})  # Always recomputed

# Re-enable caching
enable_caching()
cached_processed = dataset.map(lambda x: {"length": len(x["question"])})  # Uses cache if available
```
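The global switches affect every subsequent operation. For one-off control over a single transformation, `Dataset.map` also accepts a `load_from_cache_file` argument; the sketch below assumes that keyword is available in your version of the library.

```python
from datasets import Dataset

ds = Dataset.from_dict({"question": ["What is a dataset?", "Why cache results?"]})

# Force recomputation for this call only; the global caching setting is untouched.
fresh = ds.map(lambda x: {"length": len(x["question"])}, load_from_cache_file=False)
```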
### Progress Bar Control

Functions for controlling the display of progress bars during dataset operations, particularly useful for long-running transformations.

```python { .api }
def enable_progress_bar() -> None:
    """Enable progress bar display during dataset operations."""

def disable_progress_bar() -> None:
    """Disable progress bar display during dataset operations."""

def is_progress_bar_enabled() -> bool:
    """
    Check if progress bars are currently enabled.

    Returns:
    - bool: True if progress bars are enabled, False otherwise
    """

def enable_progress_bars() -> None:
    """Enable progress bars (plural form for consistency)."""

def disable_progress_bars() -> None:
    """Disable progress bars (plural form for consistency)."""

def are_progress_bars_disabled() -> bool:
    """
    Check if progress bars are currently disabled.

    Returns:
    - bool: True if progress bars are disabled, False otherwise
    """
```
**Usage Examples:**

```python
from datasets import disable_progress_bar, enable_progress_bar, load_dataset

# Disable progress bars for cleaner output
disable_progress_bar()
dataset = load_dataset("squad", split="train")
processed = dataset.map(lambda x: {"length": len(x["question"])})  # No progress bar shown

# Re-enable progress bars
enable_progress_bar()
filtered = processed.filter(lambda x: x["length"] > 10)  # Progress bar displayed
```
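When progress bars only need to be silenced for part of a script, a small context manager keeps the previous state intact. The helper below is a sketch of that pattern; `quiet_progress` is a hypothetical name, not part of the library.

```python
from contextlib import contextmanager
from datasets import disable_progress_bar, enable_progress_bar, is_progress_bar_enabled

@contextmanager
def quiet_progress():
    """Hypothetical helper: disable progress bars inside the block, then restore the old state."""
    was_enabled = is_progress_bar_enabled()
    disable_progress_bar()
    try:
        yield
    finally:
        if was_enabled:
            enable_progress_bar()

# Usage:
# with quiet_progress():
#     dataset.map(...)  # runs without progress output
```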
### Experimental Features

Decorator for marking experimental functionality that may change in future versions.

```python { .api }
def experimental(fn):
    """
    Decorator to mark experimental features.

    Features marked as experimental may have their API changed or removed in future versions
    without a deprecation cycle. Use with caution in production code.
    """
```
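The decorator is applied like any other function decorator. The sketch below assumes it can be imported from `datasets.utils` (the exact import path may vary by version); `my_new_feature` is a made-up function used only for illustration.

```python
from datasets.utils import experimental  # import path assumed, may differ by version

@experimental
def my_new_feature(batch):
    """A function whose API may change or be removed without a deprecation cycle."""
    return batch
```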
## Advanced Dataset Operations

### Column-wise Concatenation

```python
from datasets import Dataset, concatenate_datasets

# Concatenate datasets horizontally (add columns)
# Note: datasets must have the same number of rows
ds1 = Dataset.from_dict({"text": ["hello", "world"]})
ds2 = Dataset.from_dict({"label": [0, 1]})

# Horizontal concatenation (axis=1)
combined = concatenate_datasets([ds1, ds2], axis=1)
print(combined.column_names)  # ['text', 'label']
```
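Horizontal concatenation composes column sets, so the inputs should not share column names (to the best of my knowledge, overlapping names are rejected). Renaming a column first sidesteps the clash; `rename_column` is a standard `Dataset` method, while the column names below are illustrative.

```python
from datasets import Dataset, concatenate_datasets

left = Dataset.from_dict({"text": ["hello", "world"]})
right = Dataset.from_dict({"text": ["bonjour", "monde"]})

# Both inputs define "text"; rename one side before concatenating along columns.
right = right.rename_column("text", "text_fr")
combined = concatenate_datasets([left, right], axis=1)
print(combined.column_names)  # ['text', 'text_fr']
```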
### Complex Interleaving Patterns

```python
from datasets import Dataset, interleave_datasets

# Create datasets of different sizes
small_ds = Dataset.from_dict({"text": ["a", "b"]})
medium_ds = Dataset.from_dict({"text": ["c", "d", "e"]})
large_ds = Dataset.from_dict({"text": ["f", "g", "h", "i"]})

# Use probabilities to control sampling
# Higher probability = more examples from that dataset
interleaved = interleave_datasets(
    [small_ds, medium_ds, large_ds],
    probabilities=[0.1, 0.3, 0.6],  # Favor the large dataset
    seed=42,
    stopping_strategy="all_exhausted",  # Ensure all data is used
)
```
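Since `interleave_datasets` also accepts `IterableDataset` inputs, the same pattern fits streaming pipelines. The sketch below builds small streaming sources with `IterableDataset.from_generator` (available in recent library versions); the generator contents are illustrative.

```python
from datasets import IterableDataset, interleave_datasets

def gen_a():
    for text in ["a", "b"]:
        yield {"text": text}

def gen_b():
    for text in ["c", "d", "e"]:
        yield {"text": text}

stream_a = IterableDataset.from_generator(gen_a)
stream_b = IterableDataset.from_generator(gen_b)

# Same API as with map-style datasets; examples are produced lazily on iteration.
mixed = interleave_datasets([stream_a, stream_b], probabilities=[0.5, 0.5], seed=0)
for example in mixed:
    print(example["text"])
```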
### Performance Considerations

- **Caching**: Enabled by default; provides a significant speedup for repeated operations
- **Memory Usage**: `concatenate_datasets` creates a new dataset that references the original data rather than copying it
- **Streaming**: Both operations work with `IterableDataset` for memory-efficient processing
- **Fingerprinting**: Each operation updates the dataset fingerprint used for cache invalidation (see the sketch after this list)
- **Multiprocessing**: Operations inherit multiprocessing settings from their constituent datasets
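The fingerprint is what cache files are keyed on, so every transformation produces a new one. The sketch below peeks at `Dataset._fingerprint`, an internal attribute (underscore-prefixed, so not a stable public API), purely to illustrate the mechanism.

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": ["hello", "world"]})
mapped = ds.map(lambda x: {"upper": x["text"].upper()})

# Internal attribute used here only for illustration; the transform yields a new fingerprint.
print(ds._fingerprint != mapped._fingerprint)  # True
```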
### Error Handling

Common error scenarios and their solutions:

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

# Schema mismatch in concatenation
try:
    ds1 = Dataset.from_dict({"text": ["hello"]})
    ds2 = Dataset.from_dict({"label": [0]})  # Different columns
    concatenate_datasets([ds1, ds2])  # Will fail
except ValueError:
    print("Schema mismatch - ensure datasets have compatible features")

# Empty dataset list
try:
    concatenate_datasets([])  # Will fail
except ValueError:
    print("Cannot concatenate an empty list of datasets")

# Probability mismatch in interleaving
ds3 = Dataset.from_dict({"text": ["foo"]})
ds4 = Dataset.from_dict({"text": ["bar"]})
try:
    interleave_datasets([ds3, ds4], probabilities=[0.5])  # Wrong length
except ValueError:
    print("Probabilities list must match the number of datasets")
```
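When many shards are combined programmatically, a quick pre-flight check can turn a vague alignment error into a precise message. `check_same_columns` below is a hypothetical helper written for this example, not part of the library.

```python
from datasets import Dataset, concatenate_datasets

def check_same_columns(dsets):
    """Hypothetical helper: raise with a precise message if column sets differ."""
    expected = set(dsets[0].column_names)
    for i, ds in enumerate(dsets[1:], start=1):
        if set(ds.column_names) != expected:
            raise ValueError(
                f"Dataset {i} has columns {sorted(ds.column_names)}, expected {sorted(expected)}"
            )

parts = [
    Dataset.from_dict({"text": ["hello"], "label": [0]}),
    Dataset.from_dict({"text": ["world"], "label": [1]}),
]
check_same_columns(parts)
combined = concatenate_datasets(parts)
```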