DeepLearning4j Spark integration component providing DataVec transformations and DataSet operations for distributed deep learning in Apache Spark environments
Sequence processing functions handle time series and sequential data conversion in Spark environments. These functions support variable-length sequences, alignment modes for paired data, and automatic masking for different sequence lengths.
DataVecSequenceDataSetFunction converts sequence data (List<List<Writable>>) to DataSet objects suitable for RNN and time series processing.
public class DataVecSequenceDataSetFunction implements Function<List<List<Writable>>, DataSet>, Serializable {
public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);
public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
DataSetPreProcessor preProcessor, WritableConverter converter);
public DataSet call(List<List<Writable>> input) throws Exception;
}

The regression flag is false for classification and true for regression.

// Sequence classification with 5 classes, label in column 0
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 5, false);
JavaRDD<List<List<Writable>>> sequences = // ... your sequence RDD
JavaRDD<DataSet> datasets = sequences.map(transformer);

// Sequence regression with label in last column of each time step
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(
-1, // labelIndex: typically last column
-1, // numPossibleLabels: ignored for regression
true // regression mode
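);

Input sequences should be structured as List<List<Writable>>, where the outer list holds the time steps in order and each inner list holds the column values (features and label) for one time step.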
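To make the nested structure concrete, here is a minimal plain-Java sketch of how one sequence could be split into a feature array and a one-hot label array. It deliberately uses `Double` values and `double[][][]` arrays instead of `Writable` and `INDArray`, so it runs without DL4J on the classpath. The class `SequenceLayoutSketch` and its methods are hypothetical names for illustration only, not the library's implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only; not the DL4J implementation. */
final class SequenceLayoutSketch {

    /** Features laid out as [1][numFeatures][timeSteps], skipping the label column. */
    static double[][][] features(List<List<Double>> sequence, int labelIndex) {
        int timeSteps = sequence.size();
        int numFeatures = sequence.get(0).size() - 1; // every column except the label
        double[][][] out = new double[1][numFeatures][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            List<Double> step = sequence.get(t);
            int f = 0;
            for (int col = 0; col < step.size(); col++) {
                if (col == labelIndex) {
                    continue; // the label column is not a feature
                }
                out[0][f++][t] = step.get(col);
            }
        }
        return out;
    }

    /** One-hot labels laid out as [1][numClasses][timeSteps]. */
    static double[][][] oneHotLabels(List<List<Double>> sequence, int labelIndex, int numClasses) {
        int timeSteps = sequence.size();
        double[][][] out = new double[1][numClasses][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            int cls = sequence.get(t).get(labelIndex).intValue();
            out[0][cls][t] = 1.0; // set the slot for this step's class
        }
        return out;
    }

    public static void main(String[] args) {
        // Three time steps; column 0 is the class label, columns 1-2 are features.
        List<List<Double>> sequence = Arrays.asList(
                Arrays.asList(2.0, 0.1, 0.2),
                Arrays.asList(0.0, 0.3, 0.4),
                Arrays.asList(1.0, 0.5, 0.6));
        double[][][] x = features(sequence, 0);
        double[][][] y = oneHotLabels(sequence, 0, 3);
        System.out.println(x[0].length + " features x " + x[0][0].length + " steps");
        System.out.println(y[0].length + " classes x " + y[0][0].length + " steps");
    }
}
```

The sketch assumes a valid, non-negative label index and class values in the range 0 to numClasses - 1; it does no validation.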
Creates 3D arrays with shape [batchSize=1, features, timeSteps]:
Features: [1, numFeatures, sequenceLength]
Labels: [1, numClasses, sequenceLength] for classification or [1, 1, sequenceLength] for regression

Handles paired sequence data from two separate sources, supporting alignment modes for different length sequences.
public class DataVecSequencePairDataSetFunction
implements Function<Tuple2<List<List<Writable>>, List<List<Writable>>>, DataSet>, Serializable {
public enum AlignmentMode {
EQUAL_LENGTH, // Default: assume input and labels have same length
ALIGN_START, // Align at first time step, pad shorter sequence at end
ALIGN_END // Align at last time step, pad shorter sequence at start
}
public DataVecSequencePairDataSetFunction();
public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression);
public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
AlignmentMode alignmentMode);
public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression,
AlignmentMode alignmentMode,
DataSetPreProcessor preProcessor,
WritableConverter converter);
public DataSet call(Tuple2<List<List<Writable>>, List<List<Writable>>> input) throws Exception;
}

The regression flag is false for classification (converts to one-hot) and true for regression.

EQUAL_LENGTH: Assumes input and label sequences have the same length. No padding applied.
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
10, // numPossibleLabels
false, // classification
DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);

ALIGN_START: Aligns sequences at the first time step. Shorter sequence is zero-padded at the end.
// For many-to-one scenarios (long input, single output at start)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
5, // numPossibleLabels
false, // classification
DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START
);

ALIGN_END: Aligns sequences at the last time step. Shorter sequence is zero-padded at the start.
// For one-to-many scenarios (single input, long output sequence)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
3, // numPossibleLabels
false, // classification
DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);

import scala.Tuple2;
// Input: features sequence, labels sequence
JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> pairedSequences = // ... your data
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
10, // 10 classes
false, // classification
DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);
JavaRDD<DataSet> datasets = pairedSequences.map(transformer);

// Long input sequence, single label at the end
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
2, // binary classification
false, // classification
DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);
// Input sequences: 100 time steps, Label sequences: 1 time step
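// Result: the label sequence is padded to 100 time steps, with automatic masking

import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Note: a normalizer generally needs its statistics fitted (or restored) before use
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();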
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
-1, // ignored for regression
true, // regression mode
DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START,
normalizer,
null // no converter
);

When sequences have different lengths, the function automatically creates mask arrays.
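The padding and masking behavior described for ALIGN_START and ALIGN_END can be pictured with plain arrays. The sketch below is a simplified, single-channel illustration in plain Java: it pads a shorter sequence with zeros and builds a mask with 1.0 at real time steps and 0.0 at padded ones. The class `AlignmentMaskSketch`, its `Mode` enum, and its methods are hypothetical names, and the code is an assumption-level illustration, not the library's implementation.

```java
import java.util.Arrays;

/** Plain-Java illustration only; not the DL4J implementation. */
final class AlignmentMaskSketch {

    enum Mode { ALIGN_START, ALIGN_END }

    /** Zero-pads values to targetLength; assumes values.length <= targetLength. */
    static double[] pad(double[] values, int targetLength, Mode mode) {
        double[] out = new double[targetLength]; // zero-initialized by Java
        // ALIGN_START keeps data at the front, ALIGN_END pushes it to the back.
        int offset = (mode == Mode.ALIGN_START) ? 0 : targetLength - values.length;
        System.arraycopy(values, 0, out, offset, values.length);
        return out;
    }

    /** Mask with 1.0 where real data sits and 0.0 where padding was added. */
    static double[] mask(int realLength, int targetLength, Mode mode) {
        double[] out = new double[targetLength];
        int offset = (mode == Mode.ALIGN_START) ? 0 : targetLength - realLength;
        Arrays.fill(out, offset, offset + realLength, 1.0);
        return out;
    }

    public static void main(String[] args) {
        double[] labels = {7.0}; // a single label step
        int inputLength = 5;     // hypothetical longer input sequence
        System.out.println(Arrays.toString(pad(labels, inputLength, Mode.ALIGN_END)));
        System.out.println(Arrays.toString(mask(labels.length, inputLength, Mode.ALIGN_END)));
    }
}
```

In this picture, a many-to-one setup with ALIGN_END places the single label at the final time step, and the mask marks only that step as real data.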
Both sequence functions support NDArrayWritable objects for complex feature representations:
// Each time step can contain NDArrayWritable for multi-dimensional features
List<List<Writable>> sequence = Arrays.asList(
Arrays.asList(new NDArrayWritable(featureArray1), new DoubleWritable(label1)),
Arrays.asList(new NDArrayWritable(featureArray2), new DoubleWritable(label2))
);

Both functions avoid an UnsupportedOperationException for non-scalar Writables by checking for NDArrayWritable.

// Complete sequence processing workflow
JavaRDD<List<List<Writable>>> sequences = // ... load time series data
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 10, false);
JavaRDD<DataSet> sequenceDatasets = sequences.map(transformer);
// For RNN training
sequenceDatasets.cache(); // Cache for multiple epochs

Install with Tessl CLI
npx tessl i tessl/maven-org-deeplearning4j--dl4j-spark-2-11