Tessl Tile for maven/org.deeplearning4j/dl4j-spark_2.11@0.9.0

# Sequence Processing

Sequence processing functions convert time series and other sequential data into DataSet objects in Spark environments. They support variable-length sequences, alignment modes for paired data, and automatic masking when paired sequences differ in length.

## DataVecSequenceDataSetFunction

Converts sequence data (`List<List<Writable>>`) to DataSet objects suitable for RNN and time series processing.

```java { .api }
public class DataVecSequenceDataSetFunction implements Function<List<List<Writable>>, DataSet>, Serializable {
    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                          DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataSet call(List<List<Writable>> input) throws Exception;
}
```

### Parameters

- **labelIndex**: Zero-based index of the column that holds the label in each time step
- **numPossibleLabels**: Number of classes for classification (ignored for regression)
- **regression**: `false` for classification, `true` for regression
- **preProcessor**: Optional DataSetPreProcessor for normalization
- **converter**: Optional WritableConverter for data type conversion

### Usage Examples

#### Time Series Classification

```java
// Sequence classification with 5 classes, label in column 0
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 5, false);

JavaRDD<List<List<Writable>>> sequences = // ... your sequence RDD
JavaRDD<DataSet> datasets = sequences.map(transformer);
```

#### Time Series Regression

```java
// Sequence regression with the label in the last column of each time step
int numColumns = 4;  // assumption for illustration: 3 feature columns plus 1 label column
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(
    numColumns - 1,  // labelIndex: explicit zero-based index of the last column
    -1,              // numPossibleLabels: ignored for regression
    true             // regression mode
);
```

### Data Format

Input sequences should be structured as `List<List<Writable>>` where:
- The outer list holds the time steps in order
- Each inner list holds the feature values plus the label for one time step
- All time steps should have the same number of values, with the label at the same index

### Output Shape

Creates 3D arrays with shape `[batchSize=1, features, timeSteps]`:
- **Features**: `[1, numFeatures, sequenceLength]`
- **Labels**: `[1, numClasses, sequenceLength]` for classification or `[1, 1, sequenceLength]` for regression

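As a rough illustration of the label shapes listed above, the following plain-Java sketch builds the per-example label matrix with the leading batch dimension of 1 omitted. `LabelShapeSketch` and `labelsOf` are hypothetical names, and the arrays are ordinary `double[][]` rather than ND4J `INDArray` objects.

```java
import java.util.Arrays;

public class LabelShapeSketch {

    /**
     * Builds labels as [numPossibleLabels][timeSteps] (one-hot per time step)
     * for classification, or [1][timeSteps] for regression.
     */
    public static double[][] labelsOf(double[] labelPerStep, int numPossibleLabels, boolean regression) {
        int rows = regression ? 1 : numPossibleLabels;
        double[][] labels = new double[rows][labelPerStep.length];
        for (int t = 0; t < labelPerStep.length; t++) {
            if (regression) {
                labels[0][t] = labelPerStep[t];
            } else {
                int cls = (int) labelPerStep[t];
                if (cls < 0 || cls >= numPossibleLabels) {
                    throw new IllegalArgumentException("class index out of range: " + cls);
                }
                labels[cls][t] = 1.0; // one-hot: a single 1.0 per time step column
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Three time steps with class indices 2, 0, 1 and 3 possible classes
        double[][] oneHot = labelsOf(new double[] {2, 0, 1}, 3, false);
        // prints [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
        System.out.println(Arrays.deepToString(oneHot));
    }
}
```

Each column of the result corresponds to one time step, which matches the `[numClasses, sequenceLength]` part of the label shape.
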
## DataVecSequencePairDataSetFunction

Handles paired sequence data from two separate sources (one sequence for features, one for labels), with alignment modes for pairs whose sequences differ in length.

```java { .api }
public class DataVecSequencePairDataSetFunction
    implements Function<Tuple2<List<List<Writable>>, List<List<Writable>>>, DataSet>, Serializable {

    public enum AlignmentMode {
        EQUAL_LENGTH,  // Default: assume input and labels have same length
        ALIGN_START,   // Align at first time step, pad shorter sequence at end
        ALIGN_END      // Align at last time step, pad shorter sequence at start
    }

    public DataVecSequencePairDataSetFunction();

    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression);

    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
                                              AlignmentMode alignmentMode);

    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
                                              AlignmentMode alignmentMode,
                                              DataSetPreProcessor preProcessor,
                                              WritableConverter converter);

    public DataSet call(Tuple2<List<List<Writable>>, List<List<Writable>>> input) throws Exception;
}
```

### Parameters

- **numPossibleLabels**: Number of classes for classification (-1 for regression without conversion)
- **regression**: `false` for classification (converts to one-hot), `true` for regression
- **alignmentMode**: How to handle different sequence lengths
- **preProcessor**: Optional data preprocessing
- **converter**: Optional writable conversion

### Alignment Modes

#### EQUAL_LENGTH
Assumes input and label sequences have the same length. No padding is applied.

```java
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);
```

#### ALIGN_START
Aligns sequences at the first time step. The shorter sequence is zero-padded at the end.

```java
// E.g. one-to-many: a short input aligned with the first step of a longer label sequence
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    5,     // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START
);
```

#### ALIGN_END
Aligns sequences at the last time step. The shorter sequence is zero-padded at the start.

```java
// E.g. many-to-one: a long input whose single label is aligned with the last step
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    3,     // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);
```

### Usage Examples

#### Sequence-to-Sequence Classification

```java
import scala.Tuple2;

// Input: features sequence, labels sequence
JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> pairedSequences = // ... your data

DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // 10 classes
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);

JavaRDD<DataSet> datasets = pairedSequences.map(transformer);
```

#### Many-to-One with Alignment

```java
// Long input sequence, single label at the end
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    2,     // binary classification
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);

// Input sequences: 100 time steps, label sequences: 1 time step
// Result: labels are zero-padded at the start to 100 time steps,
// and the label mask marks only the final time step as valid
```

#### Regression with Preprocessing

```java
import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Note: the normalizer needs to be fit on training data before the transformer is applied
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    -1,    // ignored for regression
    true,  // regression mode
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START,
    normalizer,
    null   // no converter
);
```

### Masking Support

When the feature and label sequences in a pair differ in length (with `ALIGN_START` or `ALIGN_END`), the function automatically creates mask arrays:

- **Input Mask**: Indicates valid time steps in feature sequences
- **Output Mask**: Indicates valid time steps in label sequences
- Padded time steps contain zeros and have a mask value of 0, so they are ignored during training

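The padding and masking described above can be sketched for a single channel in plain Java. This is an illustration under simplifying assumptions, not the DL4J implementation: `AlignmentSketch`, `Mode`, `pad` and `mask` are hypothetical names, and the real function operates on 3D ND4J arrays.

```java
import java.util.Arrays;

public class AlignmentSketch {

    public enum Mode { ALIGN_START, ALIGN_END }

    /** Offset of the first real value inside the padded array. */
    private static int offset(int validLength, int targetLength, Mode mode) {
        if (validLength > targetLength) {
            throw new IllegalArgumentException("sequence is longer than the target length");
        }
        return mode == Mode.ALIGN_END ? targetLength - validLength : 0;
    }

    /** Zero-pads one channel of a sequence to targetLength. */
    public static double[] pad(double[] values, int targetLength, Mode mode) {
        double[] padded = new double[targetLength]; // new arrays are zero-filled
        System.arraycopy(values, 0, padded, offset(values.length, targetLength, mode), values.length);
        return padded;
    }

    /** Builds a mask with 1.0 at real time steps and 0.0 at padded ones. */
    public static double[] mask(int validLength, int targetLength, Mode mode) {
        double[] m = new double[targetLength];
        int start = offset(validLength, targetLength, mode);
        Arrays.fill(m, start, start + validLength, 1.0);
        return m;
    }

    public static void main(String[] args) {
        // Many-to-one: 4 input steps, a single label aligned with the last step
        double[] label = {7.0};
        // prints [0.0, 0.0, 0.0, 7.0]
        System.out.println(Arrays.toString(pad(label, 4, Mode.ALIGN_END)));
        // prints [0.0, 0.0, 0.0, 1.0]
        System.out.println(Arrays.toString(mask(label.length, 4, Mode.ALIGN_END)));
    }
}
```

With `Mode.ALIGN_START` the same label would land at index 0 and the padding would follow it, which corresponds to padding the shorter sequence at the end.
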
### NDArrayWritable Support

Both sequence functions support `NDArrayWritable` objects for complex feature representations:

```java
// Each time step can contain NDArrayWritable for multi-dimensional features.
// featureArray1/featureArray2 are INDArray feature vectors and label1/label2 are
// double values, all assumed to be defined elsewhere.
List<List<Writable>> sequence = Arrays.asList(
    Arrays.asList(new NDArrayWritable(featureArray1), new DoubleWritable(label1)),
    Arrays.asList(new NDArrayWritable(featureArray2), new DoubleWritable(label2))
);
```

### Error Handling

- When a Writable cannot be read as a scalar and throws `UnsupportedOperationException`, the function checks whether it is an `NDArrayWritable` and uses its array contents instead
- Automatically creates appropriate tensor dimensions based on input data
- Supports empty sequences (creates zero-length tensors)

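### Integration Patterns

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.writable.Writable;
import org.deeplearning4j.spark.datavec.DataVecSequenceDataSetFunction;
import org.nd4j.linalg.dataset.DataSet;

// Complete sequence processing workflow
JavaRDD<List<List<Writable>>> sequences = // ... load time series data

// Label in column 0, 10 classes, classification
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 10, false);
JavaRDD<DataSet> sequenceDatasets = sequences.map(transformer);

// For RNN training
sequenceDatasets.cache(); // Cache for multiple epochs
```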
