# Data Utilities

Data handling utilities for large-scale training: streaming datasets, combined streaming datasets, a streaming data loader, and data processing functions for building efficient data pipelines.

## Capabilities

### Streaming Datasets

High-performance streaming datasets for large-scale data processing.

```python { .api }
class StreamingDataset:
    def __init__(self, input_dir: str, **kwargs):
        """
        Initialize streaming dataset.

        Args:
            input_dir: Directory containing streaming data
        """

class CombinedStreamingDataset:
    def __init__(self, datasets: List[StreamingDataset], **kwargs):
        """
        Initialize combined streaming dataset.

        Args:
            datasets: List of streaming datasets to combine
        """

class StreamingDataLoader:
    def __init__(self, dataset: StreamingDataset, **kwargs):
        """
        Initialize streaming data loader.

        Args:
            dataset: Streaming dataset to load
        """

# Aliases for convenience
LightningDataset = StreamingDataset
LightningIterableDataset = StreamingDataset
```
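The signatures above do not show how samples flow through a dataset, a combined dataset, and a loader. The following is a minimal plain-Python sketch of that idea, not the library's implementation: `DirectoryStream`, `CombinedStream`, and `batched` are hypothetical stand-ins that only mirror the constructor shapes above, and the round-robin interleaving and fixed-size batching are assumptions chosen for illustration.

```python
import itertools
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List


class DirectoryStream:
    """Hypothetical stand-in: lazily yields the bytes of each file in input_dir."""

    def __init__(self, input_dir: str):
        self.input_dir = Path(input_dir)

    def __iter__(self) -> Iterator[bytes]:
        # Files are read one at a time, so the directory is never loaded whole.
        for path in sorted(self.input_dir.iterdir()):
            if path.is_file():
                yield path.read_bytes()


class CombinedStream:
    """Hypothetical stand-in: interleaves several streams in round-robin order."""

    def __init__(self, datasets: List[DirectoryStream]):
        self.datasets = datasets

    def __iter__(self) -> Iterator[bytes]:
        iterators = [iter(dataset) for dataset in self.datasets]
        while iterators:
            # Iterate over a copy so exhausted streams can be dropped safely.
            for iterator in list(iterators):
                try:
                    yield next(iterator)
                except StopIteration:
                    iterators.remove(iterator)


def batched(stream: Iterable[bytes], batch_size: int) -> Iterator[List[bytes]]:
    """Group consecutive samples into lists of at most batch_size items."""
    iterator = iter(stream)
    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch


def demo() -> List[List[bytes]]:
    """Build two tiny sample directories and stream them in batches of two."""
    with tempfile.TemporaryDirectory() as root:
        for name, samples in {"a": [b"a0", b"a1", b"a2"], "b": [b"b0"]}.items():
            folder = Path(root, name)
            folder.mkdir()
            for index, payload in enumerate(samples):
                (folder / f"{index}.bin").write_bytes(payload)
        streams = [DirectoryStream(str(Path(root, name))) for name in ("a", "b")]
        return list(batched(CombinedStream(streams), batch_size=2))
```

With the real classes, the corresponding composition would presumably be a `StreamingDataset` per input directory, wrapped in a `CombinedStreamingDataset` and passed to a `StreamingDataLoader`; how samples are actually mixed and batched depends on the keyword arguments the library accepts, which are not listed here.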

### Data Processing Functions

Functions for optimizing and processing data for efficient loading.

```python { .api }
def optimize(
    data_dir: str,
    output_dir: str,
    chunk_size: int = 1024 * 1024,
    **kwargs
) -> None:
    """
    Optimize data for streaming.

    Args:
        data_dir: Input data directory
        output_dir: Output directory for optimized data
        chunk_size: Size of data chunks
    """

def map(
    function: Callable,
    inputs: List[str],
    output_dir: str,
    **kwargs
) -> None:
    """
    Apply function to data inputs.

    Args:
        function: Function to apply to data
        inputs: List of input files/directories
        output_dir: Output directory for processed data
    """

def walk(data_dir: str, extensions: List[str] = None) -> List[str]:
    """
    Walk directory and find files with specified extensions.

    Args:
        data_dir: Directory to walk
        extensions: File extensions to include

    Returns:
        List of found file paths
    """
```
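To make the contract of `walk` concrete, here is a standard-library sketch of the behavior its docstring describes. `walk_sketch` is a hypothetical stand-in, not the library function. It assumes that extensions include the leading dot, that matching is case-insensitive, and that results come back sorted; none of these details are confirmed by the signature above.

```python
import os
import tempfile
from typing import List, Optional


def walk_sketch(data_dir: str, extensions: Optional[List[str]] = None) -> List[str]:
    """Hypothetical stand-in for walk: collect file paths under data_dir."""
    # Assumption: extensions carry a leading dot and match case-insensitively.
    wanted = tuple(ext.lower() for ext in extensions) if extensions else None
    found = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            # With no extensions given, every file is included.
            if wanted is None or name.lower().endswith(wanted):
                found.append(os.path.join(root, name))
    return sorted(found)


def demo() -> List[str]:
    """Create a small directory tree and list its .jpg files as relative paths."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "train"))
        for rel in ("train/0.jpg", "train/1.JPG", "train/labels.csv", "readme.txt"):
            with open(os.path.join(root, *rel.split("/")), "w") as handle:
                handle.write("x")
        paths = walk_sketch(root, extensions=[".jpg"])
        return [os.path.relpath(path, root).replace(os.sep, "/") for path in paths]
```

A list produced this way has the same shape as the `inputs` argument of `map`, so a typical pipeline would presumably pass the result of `walk` straight into `map` along with an `output_dir`.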