# Data Utilities

Data handling utilities including streaming datasets, combined data loaders, and data processing functions for efficient data pipeline management in large-scale training.

## Capabilities

### Streaming Datasets

High-performance streaming datasets for large-scale data processing.

```python { .api }
from typing import List

class StreamingDataset:
    def __init__(self, input_dir: str, **kwargs):
        """
        Initialize streaming dataset.

        Args:
            input_dir: Directory containing streaming data
        """

class CombinedStreamingDataset:
    def __init__(self, datasets: List[StreamingDataset], **kwargs):
        """
        Initialize combined streaming dataset.

        Args:
            datasets: List of streaming datasets to combine
        """

class StreamingDataLoader:
    def __init__(self, dataset: StreamingDataset, **kwargs):
        """
        Initialize streaming data loader.

        Args:
            dataset: Streaming dataset to load
        """

# Aliases for convenience
LightningDataset = StreamingDataset
LightningIterableDataset = StreamingDataset
```
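
The idea of combining several streaming datasets into one stream can be illustrated with a small, library-independent sketch. It is not the implementation behind `CombinedStreamingDataset`, and this section does not specify how that class orders or weights samples. The helper name `interleave` and the round-robin policy are assumptions chosen only to make the concept concrete.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def interleave(datasets: List[Iterable[T]]) -> Iterator[T]:
    """Hypothetical helper: yield one sample from each dataset in turn.

    Datasets that run out are dropped; iteration stops when all are exhausted.
    """
    iterators = [iter(d) for d in datasets]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                sample = next(it)
            except StopIteration:
                # This dataset is exhausted, so skip it in later rounds.
                continue
            still_active.append(it)
            yield sample
        iterators = still_active
```

Because the helper is a generator, samples are produced lazily, so no input dataset has to fit in memory. A production combiner might instead sample by weight or stop at the shortest dataset.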

### Data Processing Functions

Functions for optimizing and processing data for efficient loading.

```python { .api }
from typing import Callable, List

def optimize(
    data_dir: str,
    output_dir: str,
    chunk_size: int = 1024 * 1024,
    **kwargs
) -> None:
    """
    Optimize data for streaming.

    Args:
        data_dir: Input data directory
        output_dir: Output directory for optimized data
        chunk_size: Size of data chunks
    """

def map(
    function: Callable,
    inputs: List[str],
    output_dir: str,
    **kwargs
) -> None:
    """
    Apply function to data inputs.

    Args:
        function: Function to apply to data
        inputs: List of input files/directories
        output_dir: Output directory for processed data
    """

def walk(data_dir: str, extensions: List[str] = None) -> List[str]:
    """
    Walk directory and find files with specified extensions.

    Args:
        data_dir: Directory to walk
        extensions: File extensions to include

    Returns:
        List of found file paths
    """
```
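
To make the documented behavior of `walk` and `map` more concrete, the following is a minimal standard-library sketch of what such functions could do. The names `walk_sketch` and `map_sketch` are hypothetical and are not part of the library. The case-insensitive extension match, the sorted result, and the `function(input_path, output_dir)` calling convention are all assumptions, not confirmed details of the real API.

```python
import os
from typing import Callable, List, Optional


def walk_sketch(data_dir: str, extensions: Optional[List[str]] = None) -> List[str]:
    """Hypothetical stand-in for `walk`: collect file paths under data_dir."""
    # Assumption: extensions are matched case-insensitively, e.g. ".JPG" == ".jpg".
    wanted = tuple(ext.lower() for ext in extensions) if extensions else None
    found: List[str] = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if wanted is None or name.lower().endswith(wanted):
                found.append(os.path.join(root, name))
    # Sorted so the result is deterministic across file systems.
    return sorted(found)


def map_sketch(
    function: Callable[[str, str], None],
    inputs: List[str],
    output_dir: str,
) -> None:
    """Hypothetical stand-in for `map`: call function once per input, in order."""
    os.makedirs(output_dir, exist_ok=True)
    for input_path in inputs:
        # Assumption: the function receives the input path and the output directory.
        function(input_path, output_dir)
```

A typical pattern under these assumptions is to pass the result of `walk_sketch(data_dir, [".txt"])` as `inputs` to `map_sketch`, with a function that reads one file and writes its processed result into `output_dir`. The sketch runs sequentially; any parallelism the real functions provide is outside its scope.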