# Dataset Operations

Functions for combining and transforming datasets, including concatenation and interleaving, together with global controls for caching, progress bars, and experimental features. These operations enable composition of multiple datasets and fine-grained control over dataset processing behavior.

## Capabilities

### Dataset Combination

Functions for combining multiple datasets into unified collections, supporting both vertical (row-wise) and horizontal (column-wise) concatenation, as well as sophisticated interleaving patterns.

```python { .api }
def concatenate_datasets(
    dsets: List[Union[Dataset, IterableDataset]],
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    axis: int = 0,
) -> Union[Dataset, IterableDataset]:
    """
    Converts a list of datasets with the same schema into a single dataset.

    Parameters:
    - dsets (List[Dataset] or List[IterableDataset]): List of datasets to concatenate
    - info (DatasetInfo, optional): Dataset information, like description, citation, etc.
    - split (NamedSplit, optional): Name of the dataset split
    - axis (int): Axis to concatenate over, where 0 means over rows (vertically) and 1 means over columns (horizontally)

    Returns:
    - Union[Dataset, IterableDataset]: Concatenated dataset of the same type as input datasets
    """

def interleave_datasets(
    datasets: List[Union[Dataset, IterableDataset]],
    probabilities: Optional[List[float]] = None,
    seed: Optional[int] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
    stopping_strategy: str = "first_exhausted",
) -> Union[Dataset, IterableDataset]:
    """
    Interleave several datasets (sources) into a single dataset by alternating between sources.

    Parameters:
    - datasets (List[Dataset] or List[IterableDataset]): List of datasets to interleave
    - probabilities (List[float], optional): If specified, examples are sampled from sources according to these probabilities
    - seed (int, optional): Random seed used to choose a source for each example
    - info (DatasetInfo, optional): Dataset information, like description, citation, etc.
    - split (NamedSplit, optional): Name of the dataset split
    - stopping_strategy (str): Either "first_exhausted" (stop when first dataset is exhausted) or "all_exhausted" (oversample until all datasets are exhausted)

    Returns:
    - Union[Dataset, IterableDataset]: Interleaved dataset of the same type as input datasets
    """
```

**Usage Examples:**

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

# Create sample datasets
ds1 = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds2 = Dataset.from_dict({"text": ["foo", "bar"], "label": [1, 0]})
ds3 = Dataset.from_dict({"text": ["alice", "bob"], "label": [0, 1]})

# Concatenate datasets vertically (append rows)
combined = concatenate_datasets([ds1, ds2, ds3])
print(len(combined))  # 6

# Interleave datasets by alternating between sources (round-robin when no probabilities are given)
interleaved = interleave_datasets([ds1, ds2, ds3])
print(interleaved["text"])  # ['hello', 'foo', 'alice', 'world', 'bar', 'bob']

# Interleave with custom sampling probabilities
weighted = interleave_datasets([ds1, ds2, ds3], probabilities=[0.7, 0.2, 0.1], seed=42)

# Different stopping strategies
all_exhausted = interleave_datasets([ds1, ds2, ds3], stopping_strategy="all_exhausted")
```

### Caching Control

Global functions for controlling the caching behavior of dataset operations. By default, dataset transformations are cached for reproducibility and performance.

```python { .api }
def enable_caching() -> None:
    """
    Enable caching of dataset operations.

    When enabled (default), data transformations are stored in cache files named using
    dataset fingerprints. This allows reloading existing cache files if they've already
    been computed, improving performance for repeated operations.
    """

def disable_caching() -> None:
    """
    Disable caching of dataset operations.

    When disabled, cache files are always recreated and existing cache files are ignored.
    This forces recomputation of all transformations but ensures fresh processing of data.
    """

def is_caching_enabled() -> bool:
    """
    Check if caching is currently enabled.

    Returns:
    - bool: True if caching is enabled, False otherwise
    """
```

**Usage Examples:**

```python
from datasets import disable_caching, enable_caching, is_caching_enabled, load_dataset

# Check current caching status
print(f"Caching enabled: {is_caching_enabled()}")  # True by default

# Disable caching for fresh processing
disable_caching()
dataset = load_dataset("squad", split="train[:100]")
processed = dataset.map(lambda x: {"length": len(x["question"])})  # Always recomputed

# Re-enable caching
enable_caching()
cached_processed = dataset.map(lambda x: {"length": len(x["question"])})  # Uses cache if available
```
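
Because the caching setting is global, it is often worth restoring the previous state after bypassing the cache for a single transformation. The snippet below is a minimal sketch using only the functions documented above; `dataset` and the mapped function are placeholders carried over from the previous example.

```python
from datasets import disable_caching, enable_caching, is_caching_enabled

# Remember the current global setting, force one transformation to be
# recomputed, then restore whatever was enabled before.
was_enabled = is_caching_enabled()
disable_caching()
try:
    fresh = dataset.map(lambda x: {"length": len(x["question"])})  # cache ignored
finally:
    if was_enabled:
        enable_caching()
```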

### Progress Bar Control

Functions for controlling the display of progress bars during dataset operations, particularly useful for long-running transformations.

```python { .api }
def enable_progress_bar() -> None:
    """Enable progress bar display during dataset operations."""

def disable_progress_bar() -> None:
    """Disable progress bar display during dataset operations."""

def is_progress_bar_enabled() -> bool:
    """
    Check if progress bars are currently enabled.

    Returns:
    - bool: True if progress bars are enabled, False otherwise
    """

def enable_progress_bars() -> None:
    """Enable progress bars (plural form for consistency)."""

def disable_progress_bars() -> None:
    """Disable progress bars (plural form for consistency)."""

def are_progress_bars_disabled() -> bool:
    """
    Check if progress bars are currently disabled.

    Returns:
    - bool: True if progress bars are disabled, False otherwise
    """
```

**Usage Examples:**

```python
from datasets import disable_progress_bar, enable_progress_bar, load_dataset

# Disable progress bars for cleaner output
disable_progress_bar()
dataset = load_dataset("squad", split="train")
processed = dataset.map(lambda x: {"length": len(x["question"])})  # No progress bar shown

# Re-enable progress bars
enable_progress_bar()
filtered = processed.filter(lambda x: x["length"] > 10)  # Progress bar displayed
```
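
A common pattern is to show progress bars only in interactive sessions and silence them in batch jobs or CI logs. The snippet below is an illustrative sketch, not part of the library API; it relies only on `enable_progress_bar`/`disable_progress_bar` and the standard library.

```python
import sys

from datasets import disable_progress_bar, enable_progress_bar

# Show progress bars only when stderr is attached to a terminal
# (an interactive session); keep batch/CI logs clean otherwise.
if sys.stderr.isatty():
    enable_progress_bar()
else:
    disable_progress_bar()
```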

### Experimental Features

Decorator for marking experimental functionality that may change in future versions.

```python { .api }
def experimental(fn):
    """
    Decorator to mark experimental features.

    Features marked as experimental may have their API changed or removed in future versions
    without a deprecation cycle. Use with caution in production code.
    """
```
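
The decorator is applied like any other function decorator. The sketch below shows one plausible implementation of the documented contract (a warning emitted on call) together with a hypothetical function it decorates; it is illustrative only, not the library's actual implementation or import path.

```python
import functools
import warnings

# Illustrative sketch: a decorator matching the documented contract might
# simply warn whenever the experimental function is called.
def experimental(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{fn.__name__} is experimental and may change or be removed "
            "without a deprecation cycle.",
            FutureWarning,
        )
        return fn(*args, **kwargs)
    return wrapper

@experimental
def my_new_loader(path: str):
    """A hypothetical user-defined function marked as experimental."""
    return open(path).read()
```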

## Advanced Dataset Operations

### Column-wise Concatenation

```python
from datasets import Dataset, concatenate_datasets

# Concatenate datasets horizontally (add columns)
# Note: datasets must have the same number of rows
ds1 = Dataset.from_dict({"text": ["hello", "world"]})
ds2 = Dataset.from_dict({"label": [0, 1]})

# Horizontal concatenation (axis=1)
combined = concatenate_datasets([ds1, ds2], axis=1)
print(combined.column_names)  # ['text', 'label']
```

### Complex Interleaving Patterns

```python
from datasets import Dataset, interleave_datasets

# Create datasets of different sizes
small_ds = Dataset.from_dict({"text": ["a", "b"]})
medium_ds = Dataset.from_dict({"text": ["c", "d", "e"]})
large_ds = Dataset.from_dict({"text": ["f", "g", "h", "i"]})

# Use probabilities to control sampling
# Higher probability = more examples from that dataset
interleaved = interleave_datasets(
    [small_ds, medium_ds, large_ds],
    probabilities=[0.1, 0.3, 0.6],  # Favor the large dataset
    seed=42,
    stopping_strategy="all_exhausted",  # Ensure all data is used
)
```

### Performance Considerations

- **Caching**: Enabled by default, provides significant speedup for repeated operations
- **Memory Usage**: `concatenate_datasets` creates a new dataset referencing original data
- **Streaming**: Both operations work with `IterableDataset` for memory-efficient processing (see the sketch after this list)
- **Fingerprinting**: Each operation updates the dataset fingerprint for cache invalidation
- **Multiprocessing**: Operations inherit multiprocessing settings from constituent datasets
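
Because both combination functions accept `IterableDataset` inputs, the same composition works in streaming mode without materializing the data. A minimal sketch, assuming `Dataset.to_iterable_dataset()` is available in your installed version (older releases may require `load_dataset(..., streaming=True)` instead):

```python
from datasets import Dataset, interleave_datasets

# Assumption: Dataset.to_iterable_dataset() exists in the installed version.
stream_a = Dataset.from_dict({"text": ["a", "b"]}).to_iterable_dataset()
stream_b = Dataset.from_dict({"text": ["c", "d", "e"]}).to_iterable_dataset()

# Interleaving IterableDatasets returns an IterableDataset: nothing is
# materialized until you iterate over it.
mixed = interleave_datasets([stream_a, stream_b], probabilities=[0.5, 0.5], seed=0)
for example in mixed:
    print(example["text"])
```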

### Error Handling

Common error scenarios and their solutions:

```python
from datasets import Dataset, concatenate_datasets, interleave_datasets

# Schema mismatch in concatenation
try:
    ds1 = Dataset.from_dict({"text": ["hello"]})
    ds2 = Dataset.from_dict({"label": [0]})  # Different columns
    concatenate_datasets([ds1, ds2])  # Will fail
except ValueError as e:
    print("Schema mismatch - ensure datasets have compatible features")

# Empty dataset list
try:
    concatenate_datasets([])  # Will fail
except ValueError as e:
    print("Cannot concatenate empty list of datasets")

# Probability mismatch in interleaving
try:
    interleave_datasets([ds1, ds2], probabilities=[0.5])  # Wrong length
except ValueError as e:
    print("Probabilities list must match number of datasets")
```
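
A related failure mode, not shown above, is column-wise concatenation of datasets whose row counts differ. As noted in the column-wise example, the row counts must match for `axis=1`, so a mismatch is expected to raise an error; the exact exception type is an assumption here, which is why the sketch catches a broad `Exception`.

```python
from datasets import Dataset, concatenate_datasets

# Row-count mismatch for axis=1 (column-wise) concatenation.
ds_text = Dataset.from_dict({"text": ["hello", "world"]})  # 2 rows
ds_label = Dataset.from_dict({"label": [0]})               # 1 row
try:
    concatenate_datasets([ds_text, ds_label], axis=1)  # Expected to fail
except Exception as e:
    print("Row counts must match for axis=1 concatenation")
```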