# Data Transformation Functions

Data transformation functions convert DataVec `Writable` collections into `DataSet` objects suitable for deep learning training and inference. These functions handle both classification and regression tasks, with optional support for data preprocessing and type conversion.

## DataVecDataSetFunction

The primary transformation function for converting `List<Writable>` collections to `DataSet` objects in Spark environments.

```java { .api }
public class DataVecDataSetFunction implements Function<List<Writable>, DataSet>, Serializable {
    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                  DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataVecDataSetFunction(int labelIndexFrom, int labelIndexTo, int numPossibleLabels,
                                  boolean regression, DataSetPreProcessor preProcessor,
                                  WritableConverter converter);

    public DataSet call(List<Writable> currList) throws Exception;
}
```

### Parameters

- **labelIndex** / **labelIndexFrom**: Column index where the labels begin (0-based). Use -1 to treat the last column as the label.
- **labelIndexTo**: Column index where the labels end (inclusive). For a single label, equal to `labelIndexFrom`.
- **numPossibleLabels**: Number of classes for classification tasks (ignored for regression).
- **regression**: `false` for classification (creates one-hot encoded labels), `true` for regression.
- **preProcessor**: Optional `DataSetPreProcessor` for data normalization/transformation.
- **converter**: Optional `WritableConverter` for custom data type conversions.
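To make the index parameters concrete, here is a minimal sketch of how a label column range partitions a record into features and labels, using plain `double[]` arrays rather than the actual `List<Writable>` types (the class and method names are illustrative, not part of DataVec):

```java
public class LabelSplitSketch {
    // Illustrative only: split a record into features and labels given the
    // label column range [labelFrom, labelTo] (both inclusive). The real
    // function performs the equivalent walk over a List<Writable>.
    static double[][] split(double[] record, int labelFrom, int labelTo) {
        double[] labels = new double[labelTo - labelFrom + 1];
        double[] features = new double[record.length - labels.length];
        int f = 0;
        for (int i = 0; i < record.length; i++) {
            if (i >= labelFrom && i <= labelTo) labels[i - labelFrom] = record[i];
            else features[f++] = record[i];
        }
        return new double[][]{features, labels};
    }

    public static void main(String[] args) {
        // 5-column record with labels in columns 3-4 (inclusive)
        double[][] parts = split(new double[]{1, 2, 3, 10, 20}, 3, 4);
        System.out.println(java.util.Arrays.toString(parts[0])); // [1.0, 2.0, 3.0]
        System.out.println(java.util.Arrays.toString(parts[1])); // [10.0, 20.0]
    }
}
```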

### Usage Examples

#### Basic Classification

```java
// For classification with 10 classes, label in column 4
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);

JavaRDD<List<Writable>> records = // ... your input RDD
JavaRDD<DataSet> datasets = records.map(transformer);
```

#### Regression with Preprocessing

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// For regression with data normalization
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        -1,         // labelIndex: -1 treats the last column as the label
        -1,         // numPossibleLabels: ignored for regression
        true,       // regression
        normalizer,
        null        // no converter needed
);

JavaRDD<DataSet> datasets = records.map(transformer);
```

#### Multi-Label Regression

```java
// For multi-output regression with labels in columns 5-7
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        5,      // labelIndexFrom: start of label columns
        7,      // labelIndexTo: end of label columns (inclusive)
        -1,     // numPossibleLabels: ignored for regression
        true,   // regression
        null,   // no preprocessing
        null    // no converter
);
```

#### Custom Data Conversion

```java
import org.datavec.api.io.converters.SelfWritableConverter;

// With a custom writable converter
WritableConverter converter = new SelfWritableConverter();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        0,          // labelIndex
        2,          // numPossibleLabels
        false,      // classification
        null,       // no preprocessing
        converter
);
```

### Behavior Details

#### Label Inference

When `labelIndex` is -1 and `numPossibleLabels >= 1`, the function automatically uses the last column as the label column.
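The inference rule amounts to a simple fallback, sketched below (a hypothetical helper for illustration; the real logic lives inside `call`):

```java
public class LabelIndexSketch {
    // Illustrative: when labelIndex is -1, fall back to the record's last column.
    static int resolveLabelIndex(int labelIndex, int recordSize) {
        return labelIndex >= 0 ? labelIndex : recordSize - 1;
    }

    public static void main(String[] args) {
        System.out.println(resolveLabelIndex(-1, 5)); // prints 4: last of 5 columns
        System.out.println(resolveLabelIndex(2, 5));  // prints 2: explicit index wins
    }
}
```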

#### NDArray Support

The function can handle `NDArrayWritable` objects directly:

- If the input contains two elements and both are `NDArrayWritable` wrapping the same array reference, the record is treated as feature-only data.
- If the first element is an `NDArrayWritable` and the second is a scalar, the array is used as the features and the scalar as the label.

#### Error Handling

- Throws `IllegalStateException` if `numPossibleLabels < 1` for classification.
- Throws `IllegalStateException` if a label value exceeds `numPossibleLabels - 1` for classification.
- Skips empty `Writable` values during processing.
- Recovers from `UnsupportedOperationException` when handling `NDArrayWritable` objects.
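The two `IllegalStateException` conditions can be sketched as a standalone check (illustrative only; not the actual DataVec source):

```java
public class LabelCheckSketch {
    // Mirrors the classification checks listed above.
    static void validateLabel(int label, int numPossibleLabels) {
        if (numPossibleLabels < 1)
            throw new IllegalStateException("numPossibleLabels must be >= 1 for classification");
        if (label < 0 || label > numPossibleLabels - 1)
            throw new IllegalStateException(
                    "label " + label + " outside valid range [0, " + (numPossibleLabels - 1) + "]");
    }

    public static void main(String[] args) {
        validateLabel(9, 10); // highest valid label for 10 classes: OK
        try {
            validateLabel(10, 10); // exceeds numPossibleLabels - 1
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```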

#### Data Processing Flow

1. Determines the actual label index (handling -1 inference).
2. Processes the special cases for `NDArrayWritable` inputs.
3. Iterates through the input list, separating features and labels.
4. Creates the appropriate label vector (one-hot for classification; scalar or multi-dimensional for regression).
5. Constructs the feature vector from the non-label columns.
6. Applies the preprocessor, if configured.
7. Returns the constructed `DataSet`.
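The one-hot encoding in step 4 can be sketched with a plain array (the real implementation builds an ND4J `INDArray`):

```java
public class OneHotSketch {
    // One-hot encode a class index into a vector of length numClasses (step 4 above).
    static double[] oneHot(int label, int numClasses) {
        double[] v = new double[numClasses];
        v[label] = 1.0;
        return v;
    }

    public static void main(String[] args) {
        // Class 2 of 4 becomes [0.0, 0.0, 1.0, 0.0]
        System.out.println(java.util.Arrays.toString(oneHot(2, 4)));
    }
}
```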

### Integration with Spark

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;

// Complete Spark workflow example
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<List<Writable>> inputRDD = // ... load your data

DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<DataSet> datasetRDD = inputRDD.map(transformer);

// Continue with ML pipeline
datasetRDD.collect(); // or other Spark actions
```