# Data Transformation Functions

Data transformation functions convert DataVec `Writable` collections into `DataSet` objects suitable for deep learning training and inference. They handle both classification and regression tasks, with optional data preprocessing and type conversion.

## DataVecDataSetFunction

The primary transformation function for converting `List<Writable>` records to `DataSet` objects in Spark environments.

```java { .api }
public class DataVecDataSetFunction implements Function<List<Writable>, DataSet>, Serializable {
    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                  DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataVecDataSetFunction(int labelIndexFrom, int labelIndexTo, int numPossibleLabels,
                                  boolean regression, DataSetPreProcessor preProcessor,
                                  WritableConverter converter);

    public DataSet call(List<Writable> currList) throws Exception;
}
```

### Parameters

- **labelIndex** / **labelIndexFrom**: Column index where labels begin (0-based). Use -1 to treat the last column as the label.
- **labelIndexTo**: Column index where labels end (inclusive). For a single label column, equal to **labelIndexFrom**.
- **numPossibleLabels**: Number of classes for classification tasks (ignored for regression).
- **regression**: `false` for classification (creates one-hot encoded labels), `true` for regression.
- **preProcessor**: Optional `DataSetPreProcessor` for data normalization/transformation.
- **converter**: Optional `WritableConverter` for custom data type conversions.

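To make the one-hot behavior concrete, here is a minimal plain-Java sketch of the label row produced when `regression` is `false`. It has no DataVec dependency; `oneHot` is an illustrative helper, not part of the API:

```java
import java.util.Arrays;

public class OneHotSketch {
    // Illustrative helper: what a classification label looks like after
    // one-hot encoding -- a row of zeros with 1.0 at the class index.
    static double[] oneHot(int classIdx, int numPossibleLabels) {
        double[] row = new double[numPossibleLabels];
        row[classIdx] = 1.0;
        return row;
    }

    public static void main(String[] args) {
        // Class 4 out of 10 possible classes.
        System.out.println(Arrays.toString(oneHot(4, 10)));
    }
}
```
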
### Usage Examples

#### Basic Classification

```java
// Classification with 10 classes; label in column 4
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);

JavaRDD<List<Writable>> records = // ... your input RDD
JavaRDD<DataSet> datasets = records.map(transformer);
```

#### Regression with Preprocessing

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Regression with min-max normalization
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        -1,         // labelIndex: -1 infers the last column as the label
        -1,         // numPossibleLabels: ignored for regression
        true,       // regression
        normalizer,
        null        // no converter needed
);

JavaRDD<DataSet> datasets = records.map(transformer);
```

#### Multi-Label Regression

```java
// Multi-output regression with labels in columns 5-7
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        5,      // labelIndexFrom: first label column
        7,      // labelIndexTo: last label column (inclusive)
        -1,     // numPossibleLabels: ignored for regression
        true,   // regression
        null,   // no preprocessing
        null    // no converter
);
```

#### Custom Data Conversion

```java
import org.datavec.api.io.converters.SelfWritableConverter;

// With a custom writable converter
WritableConverter converter = new SelfWritableConverter();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        0,          // labelIndex
        2,          // numPossibleLabels
        false,      // classification
        null,       // no preprocessing
        converter
);
```

### Behavior Details

#### Label Inference
When `labelIndex` is -1 and `numPossibleLabels >= 1`, the function automatically uses the last column as the label column.

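A rough plain-Java illustration of this rule (a sketch, not the actual implementation):

```java
public class LabelIndexSketch {
    // Sketch of the -1 inference rule: a negative labelIndex means
    // "use the last column of the record as the label".
    static int resolveLabelIndex(int labelIndex, int recordSize) {
        return labelIndex < 0 ? recordSize - 1 : labelIndex;
    }

    public static void main(String[] args) {
        System.out.println(resolveLabelIndex(-1, 5)); // inferred: last column (4)
        System.out.println(resolveLabelIndex(2, 5));  // explicit index: 2
    }
}
```
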
#### NDArray Support
The function handles `NDArrayWritable` objects directly:
- If the input contains exactly two elements that are the same `NDArrayWritable` reference, it is treated as feature-only data
- If the first element is an `NDArrayWritable` and the second is a scalar, the NDArray is used as the features and the scalar as the label

#### Error Handling
- Throws `IllegalStateException` if `numPossibleLabels < 1` for classification
- Throws `IllegalStateException` if a label value exceeds `numPossibleLabels - 1` for classification
- Skips empty `Writable` values during processing
- Recovers from `UnsupportedOperationException` thrown by `NDArrayWritable` objects during conversion

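The classification checks can be sketched in plain Java (illustrative only; the method name and messages are assumptions, not the real implementation):

```java
public class LabelValidationSketch {
    // Sketch of the two classification label checks described above.
    static void validateLabel(int labelValue, int numPossibleLabels) {
        if (numPossibleLabels < 1) {
            // Assumed message; the real exception text may differ
            throw new IllegalStateException("numPossibleLabels must be >= 1 for classification");
        }
        if (labelValue > numPossibleLabels - 1) {
            throw new IllegalStateException("Label value " + labelValue
                    + " exceeds maximum class index " + (numPossibleLabels - 1));
        }
    }

    public static void main(String[] args) {
        validateLabel(9, 10); // valid: highest class index for 10 classes
        try {
            validateLabel(10, 10); // invalid: only indices 0..9 exist
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```
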
#### Data Processing Flow
1. Determines the actual label index (handles -1 inference)
2. Handles the special cases for `NDArrayWritable` inputs
3. Iterates through the input list, separating features and labels
4. Creates the appropriate label vector (one-hot for classification, scalar or multi-column for regression)
5. Constructs the feature vector from the non-label columns
6. Applies the preprocessor, if configured
7. Returns the constructed `DataSet`

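The steps above can be sketched in plain Java for a single-label classification record (an illustrative re-implementation on plain `double` values, not the actual DataVec code):

```java
import java.util.Arrays;
import java.util.List;

public class FlowSketch {
    // Split a numeric record into a feature array and a one-hot label row.
    static double[][] toFeaturesAndLabel(List<Double> record, int labelIndex, int numClasses) {
        int resolved = labelIndex < 0 ? record.size() - 1 : labelIndex; // step 1: infer index
        double[] features = new double[record.size() - 1];
        int f = 0;
        for (int i = 0; i < record.size(); i++) {                      // step 3: separate
            if (i != resolved) features[f++] = record.get(i);
        }
        double[] label = new double[numClasses];                       // step 4: one-hot
        label[(int) (double) record.get(resolved)] = 1.0;
        return new double[][]{features, label};                        // step 7: result pair
    }

    public static void main(String[] args) {
        // Record with two features and a class label (2) in the last column
        double[][] ds = toFeaturesAndLabel(List.of(0.5, 1.5, 2.0), -1, 3);
        System.out.println(Arrays.toString(ds[0])); // features
        System.out.println(Arrays.toString(ds[1])); // one-hot label
    }
}
```
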
### Integration with Spark

```java
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.writable.Writable;
import org.nd4j.linalg.dataset.DataSet;

// Complete Spark workflow example
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<List<Writable>> inputRDD = // ... load your data

DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<DataSet> datasetRDD = inputRDD.map(transformer);

// Continue with ML pipeline
datasetRDD.collect(); // or other Spark actions
```