# Data Transformation Functions

Data transformation functions convert DataVec `Writable` collections into `DataSet` objects suitable for deep learning training and inference. They handle both classification and regression tasks, with optional data preprocessing and type conversion.

## DataVecDataSetFunction

The primary transformation function for converting `List<Writable>` records to `DataSet` objects in Spark environments.

```java { .api }
public class DataVecDataSetFunction implements Function<List<Writable>, DataSet>, Serializable {
    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                  DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataVecDataSetFunction(int labelIndexFrom, int labelIndexTo, int numPossibleLabels,
                                  boolean regression, DataSetPreProcessor preProcessor,
                                  WritableConverter converter);

    public DataSet call(List<Writable> currList) throws Exception;
}
```

### Parameters

- **labelIndex** / **labelIndexFrom**: Column index where labels begin (0-based). Use -1 to treat the last column as the label.
- **labelIndexTo**: Column index where labels end (inclusive). For a single label column, equal to **labelIndexFrom**.
- **numPossibleLabels**: Number of classes for classification tasks (ignored for regression).
- **regression**: `false` for classification (creates one-hot encoded labels), `true` for regression.
- **preProcessor**: Optional `DataSetPreProcessor` for data normalization/transformation.
- **converter**: Optional `WritableConverter` for custom data type conversions.

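To make the one-hot behavior concrete, here is a minimal plain-Java sketch of the label row produced when `regression` is `false`. It has no DataVec dependency; `oneHot` is an illustrative helper, not part of the API:

```java
import java.util.Arrays;

public class OneHotSketch {
    // Illustrative helper: what a classification label looks like after
    // one-hot encoding -- a row of zeros with 1.0 at the class index.
    static double[] oneHot(int classIdx, int numPossibleLabels) {
        double[] row = new double[numPossibleLabels];
        row[classIdx] = 1.0;
        return row;
    }

    public static void main(String[] args) {
        // Class 4 out of 10 possible classes.
        System.out.println(Arrays.toString(oneHot(4, 10)));
    }
}
```
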
### Usage Examples

#### Basic Classification

```java
// Classification with 10 classes; label in column 4
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);

JavaRDD<List<Writable>> records = // ... your input RDD
JavaRDD<DataSet> datasets = records.map(transformer);
```

#### Regression with Preprocessing

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Regression with min-max normalization
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        -1,         // labelIndex: -1 infers the last column as the label
        -1,         // numPossibleLabels: ignored for regression
        true,       // regression
        normalizer,
        null        // no converter needed
);

JavaRDD<DataSet> datasets = records.map(transformer);
```

#### Multi-Label Regression

```java
// Multi-output regression with labels in columns 5-7
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        5,      // labelIndexFrom: first label column
        7,      // labelIndexTo: last label column (inclusive)
        -1,     // numPossibleLabels: ignored for regression
        true,   // regression
        null,   // no preprocessing
        null    // no converter
);
```

#### Custom Data Conversion

```java
import org.datavec.api.io.converters.SelfWritableConverter;

// With a custom writable converter
WritableConverter converter = new SelfWritableConverter();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        0,          // labelIndex
        2,          // numPossibleLabels
        false,      // classification
        null,       // no preprocessing
        converter
);
```

### Behavior Details

#### Label Inference
When `labelIndex` is -1 and `numPossibleLabels >= 1`, the function automatically uses the last column as the label column.

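A rough plain-Java illustration of this rule (a sketch, not the actual implementation):

```java
public class LabelIndexSketch {
    // Sketch of the -1 inference rule: a negative labelIndex means
    // "use the last column of the record as the label".
    static int resolveLabelIndex(int labelIndex, int recordSize) {
        return labelIndex < 0 ? recordSize - 1 : labelIndex;
    }

    public static void main(String[] args) {
        System.out.println(resolveLabelIndex(-1, 5)); // inferred: last column (4)
        System.out.println(resolveLabelIndex(2, 5));  // explicit index: 2
    }
}
```
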
#### NDArray Support
The function handles `NDArrayWritable` objects directly:
- If the input contains exactly two elements that are the same `NDArrayWritable` reference, it is treated as feature-only data
- If the first element is an `NDArrayWritable` and the second is a scalar, the NDArray is used as the features and the scalar as the label

#### Error Handling
- Throws `IllegalStateException` if `numPossibleLabels < 1` for classification
- Throws `IllegalStateException` if a label value exceeds `numPossibleLabels - 1` for classification
- Skips empty `Writable` values during processing
- Recovers from `UnsupportedOperationException` thrown by `NDArrayWritable` objects during conversion

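The classification checks can be sketched in plain Java (illustrative only; the method name and messages are assumptions, not the real implementation):

```java
public class LabelValidationSketch {
    // Sketch of the two classification label checks described above.
    static void validateLabel(int labelValue, int numPossibleLabels) {
        if (numPossibleLabels < 1) {
            // Assumed message; the real exception text may differ
            throw new IllegalStateException("numPossibleLabels must be >= 1 for classification");
        }
        if (labelValue > numPossibleLabels - 1) {
            throw new IllegalStateException("Label value " + labelValue
                    + " exceeds maximum class index " + (numPossibleLabels - 1));
        }
    }

    public static void main(String[] args) {
        validateLabel(9, 10); // valid: highest class index for 10 classes
        try {
            validateLabel(10, 10); // invalid: only indices 0..9 exist
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```
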
#### Data Processing Flow
1. Determines the actual label index (handles -1 inference)
2. Handles the special cases for `NDArrayWritable` inputs
3. Iterates through the input list, separating features and labels
4. Creates the appropriate label vector (one-hot for classification, scalar or multi-column for regression)
5. Constructs the feature vector from the non-label columns
6. Applies the preprocessor, if configured
7. Returns the constructed `DataSet`

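The steps above can be sketched in plain Java for a single-label classification record (an illustrative re-implementation on plain `double` values, not the actual DataVec code):

```java
import java.util.Arrays;
import java.util.List;

public class FlowSketch {
    // Split a numeric record into a feature array and a one-hot label row.
    static double[][] toFeaturesAndLabel(List<Double> record, int labelIndex, int numClasses) {
        int resolved = labelIndex < 0 ? record.size() - 1 : labelIndex; // step 1: infer index
        double[] features = new double[record.size() - 1];
        int f = 0;
        for (int i = 0; i < record.size(); i++) {                      // step 3: separate
            if (i != resolved) features[f++] = record.get(i);
        }
        double[] label = new double[numClasses];                       // step 4: one-hot
        label[(int) (double) record.get(resolved)] = 1.0;
        return new double[][]{features, label};                        // step 7: result pair
    }

    public static void main(String[] args) {
        // Record with two features and a class label (2) in the last column
        double[][] ds = toFeaturesAndLabel(List.of(0.5, 1.5, 2.0), -1, 3);
        System.out.println(Arrays.toString(ds[0])); // features
        System.out.println(Arrays.toString(ds[1])); // one-hot label
    }
}
```
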
### Integration with Spark

```java
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.writable.Writable;
import org.nd4j.linalg.dataset.DataSet;

// Complete Spark workflow example
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<List<Writable>> inputRDD = // ... load your data

DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<DataSet> datasetRDD = inputRDD.map(transformer);

// Continue with ML pipeline
datasetRDD.collect(); // or other Spark actions
```