# Sequence Processing

Sequence processing functions convert time series and other sequential data into `DataSet` objects in Spark environments. They support variable-length sequences, alignment modes for paired data, and automatic masking when sequence lengths differ.

## DataVecSequenceDataSetFunction

Converts a single sequence (`List<List<Writable>>`) to a `DataSet` object suitable for RNN and time series processing.

```java { .api }
public class DataVecSequenceDataSetFunction implements Function<List<List<Writable>>, DataSet>, Serializable {
    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                          DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataSet call(List<List<Writable>> input) throws Exception;
}
```

### Parameters

- **labelIndex**: Column index containing labels in each time step
- **numPossibleLabels**: Number of classes for classification (ignored for regression)
- **regression**: `false` for classification, `true` for regression
- **preProcessor**: Optional `DataSetPreProcessor` for normalization
- **converter**: Optional `WritableConverter` for data type conversion

### Usage Examples

#### Time Series Classification

```java
// Sequence classification with 5 classes, label in column 0
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 5, false);

JavaRDD<List<List<Writable>>> sequences = // ... your sequence RDD
JavaRDD<DataSet> datasets = sequences.map(transformer);
```

#### Time Series Regression

```java
// Sequence regression with label in last column of each time step
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(
    -1,   // labelIndex: typically last column
    -1,   // numPossibleLabels: ignored for regression
    true  // regression mode
);
```

### Data Format

Input sequences should be structured as `List<List<Writable>>` where:

- The outer list holds one entry per time step
- Each inner list holds the feature values plus the label for that time step
- All time steps should have the same number of features

### Output Shape

Creates 3D arrays with shape `[batchSize=1, features, timeSteps]`:

- **Features**: `[1, numFeatures, sequenceLength]`
- **Labels**: `[1, numClasses, sequenceLength]` for classification or `[1, 1, sequenceLength]` for regression
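
The layout above can be sketched in plain Java, without Spark or ND4J. This is an illustration under stated assumptions, not the library's implementation: `SequenceLayoutSketch`, `features`, and `labels` are hypothetical names, each time step is a non-empty row of `double` values, and class labels are stored as integer-valued doubles.

```java
import java.util.Arrays;

/** Plain-Java illustration of the [1, size, timeSteps] layout (hypothetical helper, not library code). */
final class SequenceLayoutSketch {

    /** Builds features[0][f][t] from every column except labelIndex. */
    static double[][][] features(double[][] steps, int labelIndex) {
        int numFeatures = steps[0].length - 1;
        double[][][] out = new double[1][numFeatures][steps.length];
        for (int t = 0; t < steps.length; t++) {
            int f = 0;
            for (int col = 0; col < steps[t].length; col++) {
                if (col == labelIndex) {
                    continue; // the label column is not a feature
                }
                out[0][f++][t] = steps[t][col];
            }
        }
        return out;
    }

    /** Builds labels[0][c][t]: one-hot over numClasses, or the raw value when regression is true. */
    static double[][][] labels(double[][] steps, int labelIndex, int numClasses, boolean regression) {
        double[][][] out = new double[1][regression ? 1 : numClasses][steps.length];
        for (int t = 0; t < steps.length; t++) {
            double label = steps[t][labelIndex];
            if (regression) {
                out[0][0][t] = label;
            } else {
                out[0][(int) label][t] = 1.0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three time steps; column 0 is the class label, columns 1 and 2 are features.
        double[][] steps = { {2, 0.1, 0.2}, {0, 0.3, 0.4}, {1, 0.5, 0.6} };
        System.out.println(Arrays.deepToString(features(steps, 0)));
        System.out.println(Arrays.deepToString(labels(steps, 0, 3, false)));
    }
}
```

With three time steps of two features each and three classes, `features` has shape `[1, 2, 3]` and the classification `labels` array has shape `[1, 3, 3]`.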
62
63
## DataVecSequencePairDataSetFunction
64
65
Handles paired sequence data from two separate sources, supporting alignment modes for different length sequences.
66
67
```java { .api }
68
public class DataVecSequencePairDataSetFunction
69
implements Function<Tuple2<List<List<Writable>>, List<List<Writable>>>, DataSet>, Serializable {
70
71
public enum AlignmentMode {
72
EQUAL_LENGTH, // Default: assume input and labels have same length
73
ALIGN_START, // Align at first time step, pad shorter sequence at end
74
ALIGN_END // Align at last time step, pad shorter sequence at start
75
}
76
77
public DataVecSequencePairDataSetFunction();
78
79
public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression);
80
81
public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
82
AlignmentMode alignmentMode);
83
84
public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
85
AlignmentMode alignmentMode,
86
DataSetPreProcessor preProcessor,
87
WritableConverter converter);
88
89
public DataSet call(Tuple2<List<List<Writable>>, List<List<Writable>>> input) throws Exception;
90
}
91
```

### Parameters

- **numPossibleLabels**: Number of classes for classification (-1 for regression without conversion)
- **regression**: `false` for classification (converts to one-hot), `true` for regression
- **alignmentMode**: How to handle different sequence lengths
- **preProcessor**: Optional data preprocessing
- **converter**: Optional writable conversion

### Alignment Modes

#### EQUAL_LENGTH

Assumes input and label sequences have the same length. No padding is applied.

```java
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);
```

#### ALIGN_START

Aligns sequences at the first time step. The shorter sequence is zero-padded at the end.

```java
// For many-to-one scenarios (long input, single output at start)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    5,     // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START
);
```

#### ALIGN_END

Aligns sequences at the last time step. The shorter sequence is zero-padded at the start.

```java
// For one-to-many scenarios (single input, long output sequence)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    3,     // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);
```
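
The three modes differ only in where the shorter sequence lands inside the padded array. A minimal plain-Java sketch of that placement follows. `AlignmentSketch` and its methods are hypothetical names used for illustration, and rejecting unequal lengths under `EQUAL_LENGTH` is a choice made for this sketch rather than documented library behavior.

```java
import java.util.Arrays;

/** Plain-Java illustration of alignment and padding (hypothetical helper, not library code). */
final class AlignmentSketch {

    enum Mode { EQUAL_LENGTH, ALIGN_START, ALIGN_END }

    /** Index in the padded array at which a sequence of the given length starts. */
    static int startOffset(Mode mode, int length, int maxLength) {
        if (length > maxLength) {
            throw new IllegalArgumentException("length exceeds maxLength");
        }
        if (mode == Mode.EQUAL_LENGTH && length != maxLength) {
            // Assumption of this sketch: unequal lengths are an error in this mode.
            throw new IllegalArgumentException("EQUAL_LENGTH requires equal lengths");
        }
        return mode == Mode.ALIGN_END ? maxLength - length : 0;
    }

    /** Copies values into a zero-filled array of maxLength at the offset chosen by the mode. */
    static double[] pad(double[] values, int maxLength, Mode mode) {
        double[] out = new double[maxLength];
        int start = startOffset(mode, values.length, maxLength);
        System.arraycopy(values, 0, out, start, values.length);
        return out;
    }

    /** 1.0 where the padded array holds real data, 0.0 where it holds padding. */
    static double[] mask(int length, int maxLength, Mode mode) {
        double[] out = new double[maxLength];
        int start = startOffset(mode, length, maxLength);
        Arrays.fill(out, start, start + length, 1.0);
        return out;
    }

    public static void main(String[] args) {
        double[] shortSequence = {7, 8};
        System.out.println(Arrays.toString(pad(shortSequence, 5, Mode.ALIGN_START)));
        System.out.println(Arrays.toString(pad(shortSequence, 5, Mode.ALIGN_END)));
        System.out.println(Arrays.toString(mask(shortSequence.length, 5, Mode.ALIGN_END)));
    }
}
```

For a two-step sequence padded to five steps, `ALIGN_START` places the data at indices 0 and 1, while `ALIGN_END` places it at indices 3 and 4.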

### Usage Examples

#### Sequence-to-Sequence Classification

```java
import scala.Tuple2;

// Input: features sequence, labels sequence
JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> pairedSequences = // ... your data

DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // 10 classes
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);

JavaRDD<DataSet> datasets = pairedSequences.map(transformer);
```

#### Many-to-One with Alignment

```java
// Long input sequence, single label at the end
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    2,     // binary classification
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);

// Input sequences: 100 time steps, label sequences: 1 time step
// Result: labels are zero-padded at the start to 100 time steps,
// and the label mask marks only the final step as valid
```

#### Regression with Preprocessing

```java
import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Fit the normalizer on training data before passing it to the function
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    -1,   // ignored for regression
    true, // regression mode
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START,
    normalizer,
    null  // no converter
);
```

### Masking Support

When sequences have different lengths, the function automatically creates mask arrays:

- **Input Mask**: Indicates valid time steps in feature sequences
- **Output Mask**: Indicates valid time steps in label sequences
- Masked time steps contain zeros and are ignored during training
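
To show what ignoring masked steps means in practice, here is a hedged plain-Java sketch of a masked mean squared error. `MaskedLossSketch` and `maskedMse` are hypothetical names, and the loss computation in an actual training framework may differ in detail. The point is only that steps with a mask value of 0 contribute nothing to the result.

```java
/** Plain-Java illustration of a masked loss (hypothetical helper, not library code). */
final class MaskedLossSketch {

    /** Mean squared error over time steps whose mask value is 1.0; padded steps are skipped. */
    static double maskedMse(double[] predicted, double[] target, double[] mask) {
        double sum = 0.0;
        double validSteps = 0.0;
        for (int t = 0; t < predicted.length; t++) {
            double diff = predicted[t] - target[t];
            sum += mask[t] * diff * diff; // a mask of 0.0 removes this step from the sum
            validSteps += mask[t];
        }
        return validSteps == 0.0 ? 0.0 : sum / validSteps;
    }

    public static void main(String[] args) {
        double[] predicted = {1, 2, 9, 9};
        double[] target = {1, 4, 0, 0}; // last two steps are padding
        double[] mask = {1, 1, 0, 0};
        System.out.println(maskedMse(predicted, target, mask));
    }
}
```

Only the first two steps are averaged here, so the large differences in the padded positions do not affect the loss.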

### NDArrayWritable Support

Both sequence functions support `NDArrayWritable` objects for complex feature representations:

```java
// Each time step can contain NDArrayWritable for multi-dimensional features
List<List<Writable>> sequence = Arrays.asList(
    Arrays.asList(new NDArrayWritable(featureArray1), new DoubleWritable(label1)),
    Arrays.asList(new NDArrayWritable(featureArray2), new DoubleWritable(label2))
);
```

### Error Handling

- Handles `UnsupportedOperationException` for non-scalar Writables by checking for `NDArrayWritable`
- Automatically creates appropriate tensor dimensions based on input data
- Supports empty sequences (creates zero-length tensors)
211
212
### Integration Patterns
213
214
```java
215
// Complete sequence processing workflow
216
JavaRDD<List<List<Writable>>> sequences = // ... load time series data
217
218
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 10, false);
219
JavaRDD<DataSet> sequenceDatasets = sequences.map(transformer);
220
221
// For RNN training
222
sequenceDatasets.cache(); // Cache for multiple epochs
223
```