# Sequence Processing

Sequence processing functions convert time series and other sequential data into `DataSet` objects in Spark environments. They support variable-length sequences, alignment modes for paired data, and automatic masking when paired sequences differ in length.

## DataVecSequenceDataSetFunction

Converts sequence data (`List<List<Writable>>`) to `DataSet` objects suitable for RNN and time series processing.

```java { .api }
public class DataVecSequenceDataSetFunction implements Function<List<List<Writable>>, DataSet>, Serializable {
    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                          DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataSet call(List<List<Writable>> input) throws Exception;
}
```

### Parameters

- **labelIndex**: Column index containing labels in each time step
- **numPossibleLabels**: Number of classes for classification (ignored for regression)
- **regression**: `false` for classification, `true` for regression
- **preProcessor**: Optional `DataSetPreProcessor` for normalization
- **converter**: Optional `WritableConverter` for data type conversion

### Usage Examples

#### Time Series Classification

```java
// Sequence classification with 5 classes, label in column 0
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 5, false);

JavaRDD<List<List<Writable>>> sequences = // ... your sequence RDD
JavaRDD<DataSet> datasets = sequences.map(transformer);
```

#### Time Series Regression

```java
// Sequence regression with the label in the last column of each time step
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(
    -1,   // labelIndex: intended to mean the last column; if your version does not
          // resolve -1, pass the explicit zero-based index of the label column
    -1,   // numPossibleLabels: ignored for regression
    true  // regression mode
);
```

### Data Format

Input sequences should be structured as `List<List<Writable>>` where:
- The outer list represents time steps
- Each inner list holds the features plus the label for one time step
- All time steps should have the same number of columns

### Output Shape

Creates 3D arrays with shape `[batchSize=1, features, timeSteps]`:
- **Features**: `[1, numFeatures, sequenceLength]`
- **Labels**: `[1, numClasses, sequenceLength]` for classification or `[1, 1, sequenceLength]` for regression
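
To make the layout concrete, the following plain-Java sketch builds arrays with the shapes listed above from a small sequence. It uses nested `double` arrays rather than ND4J, treats every column as a numeric scalar, and assumes an explicit zero-based `labelIndex`. The names `SequenceShapeSketch`, `toFeatures`, and `toOneHotLabels` are hypothetical. It illustrates the indexing convention only and is not the library's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: mirrors the [1, size, timeSteps] layout with plain arrays.
final class SequenceShapeSketch {

    // Features: every column except labelIndex, laid out as [1][numFeatures][timeSteps].
    static double[][][] toFeatures(List<List<Double>> sequence, int labelIndex) {
        int timeSteps = sequence.size();
        int numFeatures = timeSteps == 0 ? 0 : sequence.get(0).size() - 1;
        double[][][] features = new double[1][numFeatures][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            List<Double> step = sequence.get(t);
            int f = 0;
            for (int col = 0; col < step.size(); col++) {
                if (col == labelIndex) {
                    continue; // the label column is not a feature
                }
                features[0][f++][t] = step.get(col);
            }
        }
        return features;
    }

    // Classification labels: one-hot over numClasses, laid out as [1][numClasses][timeSteps].
    static double[][][] toOneHotLabels(List<List<Double>> sequence, int labelIndex, int numClasses) {
        int timeSteps = sequence.size();
        double[][][] labels = new double[1][numClasses][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            int classIdx = sequence.get(t).get(labelIndex).intValue();
            labels[0][classIdx][t] = 1.0;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Three time steps; column 0 is the class label, columns 1 and 2 are features.
        List<List<Double>> sequence = Arrays.asList(
                Arrays.asList(1.0, 0.5, 0.6),
                Arrays.asList(0.0, 0.7, 0.8),
                Arrays.asList(2.0, 0.9, 1.0));
        double[][][] features = toFeatures(sequence, 0);
        double[][][] labels = toOneHotLabels(sequence, 0, 3);
        System.out.println(Arrays.toString(features[0][0])); // first feature across time
        System.out.println(Arrays.toString(labels[0][1]));   // class 1 indicator across time
    }
}
```

Note that the time dimension comes last, so reading one feature across time means fixing the middle index and walking the final one.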

## DataVecSequencePairDataSetFunction

Handles paired sequence data from two separate sources (one sequence for features, one for labels) and supports alignment modes for sequences of different lengths.

```java { .api }
public class DataVecSequencePairDataSetFunction
        implements Function<Tuple2<List<List<Writable>>, List<List<Writable>>>, DataSet>, Serializable {

    public enum AlignmentMode {
        EQUAL_LENGTH, // Default: assume input and labels have same length
        ALIGN_START,  // Align at first time step, pad shorter sequence at end
        ALIGN_END     // Align at last time step, pad shorter sequence at start
    }

    public DataVecSequencePairDataSetFunction();

    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression);

    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
                                              AlignmentMode alignmentMode);

    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
                                              AlignmentMode alignmentMode,
                                              DataSetPreProcessor preProcessor,
                                              WritableConverter converter);

    public DataSet call(Tuple2<List<List<Writable>>, List<List<Writable>>> input) throws Exception;
}
```

### Parameters

- **numPossibleLabels**: Number of classes for classification (-1 for regression without conversion)
- **regression**: `false` for classification (converts to one-hot), `true` for regression
- **alignmentMode**: How to handle different sequence lengths
- **preProcessor**: Optional data preprocessing
- **converter**: Optional writable conversion

### Alignment Modes

#### EQUAL_LENGTH
Assumes input and label sequences have the same length. No padding is applied.

```java
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);
```

#### ALIGN_START
Aligns sequences at the first time step. The shorter sequence is zero-padded at the end.

```java
// For one-to-many scenarios (single input at the first time step, long output sequence)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    5,     // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START
);
```

#### ALIGN_END
Aligns sequences at the last time step. The shorter sequence is zero-padded at the start.

```java
// For many-to-one scenarios (long input, single label at the last time step)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    3,     // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);
```
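
The difference between the two padding directions can be sketched in plain Java, without Spark or ND4J. The class `AlignmentSketch`, its `Mode` enum, and the `pad` method are hypothetical names used only for illustration. The sketch works on a single column of values and assumes the shorter sequence is padded up to the length of the longer one, as described above.

```java
import java.util.Arrays;

// Illustrative only: shows where a shorter sequence lands after zero-padding.
final class AlignmentSketch {

    enum Mode { ALIGN_START, ALIGN_END }

    // Copies `values` into a zero-filled array of length `targetLength`.
    static double[] pad(double[] values, int targetLength, Mode mode) {
        if (values.length > targetLength) {
            throw new IllegalArgumentException("sequence longer than target length");
        }
        double[] padded = new double[targetLength]; // Java zero-initializes arrays
        // ALIGN_START keeps data at the front; ALIGN_END pushes it to the back.
        int offset = (mode == Mode.ALIGN_START) ? 0 : targetLength - values.length;
        System.arraycopy(values, 0, padded, offset, values.length);
        return padded;
    }

    public static void main(String[] args) {
        double[] labels = {7.0, 8.0}; // shorter sequence: 2 time steps
        int inputLength = 5;          // longer sequence: 5 time steps
        System.out.println(Arrays.toString(pad(labels, inputLength, Mode.ALIGN_START)));
        System.out.println(Arrays.toString(pad(labels, inputLength, Mode.ALIGN_END)));
    }
}
```

With `ALIGN_START` the two real values occupy the first two positions; with `ALIGN_END` they occupy the last two.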

### Usage Examples

#### Sequence-to-Sequence Classification

```java
import scala.Tuple2;

// Input: features sequence, labels sequence
JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> pairedSequences = // ... your data

DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // 10 classes
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);

JavaRDD<DataSet> datasets = pairedSequences.map(transformer);
```

#### Many-to-One with Alignment

```java
// Long input sequence, single label at the end
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    2,     // binary classification
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);

// Input sequences: 100 time steps, label sequences: 1 time step
// Result: labels are zero-padded at the start to 100 time steps, with automatic masking
```

#### Regression with Preprocessing

```java
import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Fit the normalizer on representative training data before using it here;
// an unfitted normalizer has no statistics to apply.
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    -1,    // ignored for regression
    true,  // regression mode
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START,
    normalizer,
    null   // no converter
);
```

### Masking Support

When the two sequences in a pair have different lengths, the function automatically creates mask arrays:

- **Input Mask**: Indicates valid time steps in feature sequences
- **Output Mask**: Indicates valid time steps in label sequences
- Masked time steps contain zeros and are ignored during training
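
As a rough illustration of what "ignored during training" means, the plain-Java sketch below builds a 1/0 mask for a padded sequence and averages a per-step loss over valid steps only. `MaskSketch`, `mask`, and `maskedMean` are hypothetical names, and the loss values are made up. How a training framework applies masks internally may differ, so treat this as the general idea rather than the actual mechanism.

```java
import java.util.Arrays;

// Illustrative only: a 1/0 mask per time step and a loss average that respects it.
final class MaskSketch {

    // 1.0 marks a real time step, 0.0 marks padding. `alignEnd` puts the padding first.
    static double[] mask(int validSteps, int totalSteps, boolean alignEnd) {
        double[] mask = new double[totalSteps];
        int from = alignEnd ? totalSteps - validSteps : 0;
        Arrays.fill(mask, from, from + validSteps, 1.0);
        return mask;
    }

    // Averages per-step losses over valid steps only, so padding has no influence.
    static double maskedMean(double[] perStepLoss, double[] mask) {
        double sum = 0.0;
        double count = 0.0;
        for (int t = 0; t < perStepLoss.length; t++) {
            sum += perStepLoss[t] * mask[t];
            count += mask[t];
        }
        return count == 0.0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        double[] outputMask = mask(1, 4, true);       // single label at the last step
        double[] perStepLoss = {9.0, 9.0, 9.0, 0.25}; // values at padded steps are arbitrary
        System.out.println(Arrays.toString(outputMask));
        System.out.println(maskedMean(perStepLoss, outputMask));
    }
}
```

Because the padded steps are multiplied by zero, their loss values never reach the average, whatever they happen to be.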

### NDArrayWritable Support

Both sequence functions support `NDArrayWritable` objects for complex feature representations:

```java
// Each time step can contain NDArrayWritable for multi-dimensional features
List<List<Writable>> sequence = Arrays.asList(
    Arrays.asList(new NDArrayWritable(featureArray1), new DoubleWritable(label1)),
    Arrays.asList(new NDArrayWritable(featureArray2), new DoubleWritable(label2))
);
```

### Error Handling

- Catches the `UnsupportedOperationException` raised when a non-scalar Writable is read as a scalar, then checks whether the value is an `NDArrayWritable`
- Automatically creates appropriate tensor dimensions based on input data
- Supports empty sequences (creates zero-length tensors)
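
The first point describes a try-scalar-then-fall-back pattern, which can be sketched in plain Java. The types below (`Cell`, `ScalarCell`, `ArrayCell`) are hypothetical stand-ins for scalar and array-valued Writables and are not DataVec classes. Rethrowing the exception for any other type is an assumption of this sketch.

```java
import java.util.Arrays;

// Illustrative only: hypothetical stand-ins for scalar and array-valued cells.
final class FallbackSketch {

    interface Cell {
        double toDouble(); // throws UnsupportedOperationException for non-scalars
    }

    static final class ScalarCell implements Cell {
        private final double value;
        ScalarCell(double value) { this.value = value; }
        public double toDouble() { return value; }
    }

    static final class ArrayCell implements Cell {
        private final double[] values;
        ArrayCell(double[] values) { this.values = values; }
        public double toDouble() { throw new UnsupportedOperationException("not a scalar"); }
        double[] get() { return values; }
    }

    // Try the scalar path first; fall back to the array path, otherwise rethrow.
    static double[] toFeatureValues(Cell cell) {
        try {
            return new double[] { cell.toDouble() };
        } catch (UnsupportedOperationException e) {
            if (cell instanceof ArrayCell) {
                return ((ArrayCell) cell).get();
            }
            throw e; // assumption: unknown non-scalar types are not silently skipped
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(toFeatureValues(new ScalarCell(1.5))));
        System.out.println(Arrays.toString(toFeatureValues(new ArrayCell(new double[] {1.0, 2.0}))));
    }
}
```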

### Integration Patterns

```java
// Complete sequence processing workflow
JavaRDD<List<List<Writable>>> sequences = // ... load time series data

DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 10, false);
JavaRDD<DataSet> sequenceDatasets = sequences.map(transformer);

// For RNN training
sequenceDatasets.cache(); // Cache for multiple epochs
```