tessl/maven-org-deeplearning4j--dl4j-spark-2-11

DeepLearning4j Spark integration component providing DataVec transformations and DataSet operations for distributed deep learning in Apache Spark environments

Sequence Processing

Author: tessl

Sequence processing functions handle time series and sequential data conversion in Spark environments. These functions support variable-length sequences, alignment modes for paired data, and automatic masking for different sequence lengths.

DataVecSequenceDataSetFunction

Converts a single sequence (List<List<Writable>>) into a DataSet suitable for RNN and time series processing.

public class DataVecSequenceDataSetFunction implements Function<List<List<Writable>>, DataSet>, Serializable {
    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);
    
    public DataVecSequenceDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression, 
                                          DataSetPreProcessor preProcessor, WritableConverter converter);
                                          
    public DataSet call(List<List<Writable>> input) throws Exception;
}

Parameters

labelIndex: Column index containing labels in each time step
numPossibleLabels: Number of classes for classification (ignored for regression)
regression: false for classification, true for regression
preProcessor: Optional DataSetPreProcessor for normalization
converter: Optional WritableConverter for data type conversion

Usage Examples

Time Series Classification

// Sequence classification with 5 classes, label in column 0
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 5, false);

JavaRDD<List<List<Writable>>> sequences = // ... your sequence RDD
JavaRDD<DataSet> datasets = sequences.map(transformer);

Time Series Regression

// Sequence regression; the intent is to read the label from the last column of each time step
DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(
    -1,   // labelIndex: -1 is meant as "last column"; if your version does not resolve -1, pass the explicit column index
    -1,   // numPossibleLabels: ignored for regression
    true  // regression mode
);

Data Format

Input sequences should be structured as List<List<Writable>> where:

Outer list represents time steps
Inner list represents features + label for each time step
All time steps should have the same number of features
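
The format rules above can be checked before mapping a sequence. The following is a minimal plain-Java sketch, not part of the library: `Double` stands in for `Writable` so that it runs without DL4J on the classpath, and the class and method names (`SequenceFormatCheck`, `featureCount`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Illustration only: Double replaces Writable so the sketch has no DL4J dependency. */
final class SequenceFormatCheck {

    /**
     * Returns the number of feature columns per time step (step width minus the
     * single label column), or throws if any time step differs in width from the first.
     */
    static int featureCount(List<List<Double>> sequence) {
        if (sequence.isEmpty()) {
            throw new IllegalArgumentException("sequence has no time steps");
        }
        int width = sequence.get(0).size();
        for (int t = 0; t < sequence.size(); t++) {
            int w = sequence.get(t).size();
            if (w != width) {
                throw new IllegalArgumentException(
                    "time step " + t + " has " + w + " columns, expected " + width);
            }
        }
        return width - 1; // one column per time step holds the label
    }

    public static void main(String[] args) {
        List<List<Double>> sequence = Arrays.asList(
            Arrays.asList(0.0, 1.5, 2.5),   // t = 0: label, feature, feature
            Arrays.asList(1.0, 1.7, 2.1));  // t = 1
        System.out.println(featureCount(sequence)); // prints 2
    }
}
```

Running a check like this on the driver over a small sample can surface ragged rows before they fail inside a Spark task.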

Output Shape

Creates 3D arrays with shape [batchSize=1, features, timeSteps]:

Features: [1, numFeatures, sequenceLength]
Labels: [1, numClasses, sequenceLength] for classification or [1, 1, sequenceLength] for regression
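
To make the layout concrete, here is a plain-Java sketch that fills arrays of the shapes listed above from one sequence. It illustrates the indexing only and is not the library's implementation: `double[][][]` stands in for `INDArray`, `Double` for `Writable`, and the names `SequenceToArrays`, `features` and `oneHotLabels` are hypothetical. It assumes a non-empty sequence and a valid, non-negative `labelIndex`.

```java
import java.util.Arrays;
import java.util.List;

final class SequenceToArrays {

    /** Features as [1][numFeatures][timeSteps]; every column except labelIndex is a feature. */
    static double[][][] features(List<List<Double>> seq, int labelIndex) {
        int timeSteps = seq.size();
        int numFeatures = seq.get(0).size() - 1;
        double[][][] out = new double[1][numFeatures][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            List<Double> step = seq.get(t);
            int f = 0; // next feature row to fill for this time step
            for (int col = 0; col < step.size(); col++) {
                if (col != labelIndex) {
                    out[0][f++][t] = step.get(col);
                }
            }
        }
        return out;
    }

    /** Labels as [1][numClasses][timeSteps], one-hot encoded at each time step. */
    static double[][][] oneHotLabels(List<List<Double>> seq, int labelIndex, int numClasses) {
        int timeSteps = seq.size();
        double[][][] out = new double[1][numClasses][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            int cls = seq.get(t).get(labelIndex).intValue();
            out[0][cls][t] = 1.0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Double>> seq = Arrays.asList(
            Arrays.asList(2.0, 0.1, 0.2),   // t = 0: class 2
            Arrays.asList(0.0, 0.3, 0.4));  // t = 1: class 0
        // prints [[[0.1, 0.3], [0.2, 0.4]]]
        System.out.println(Arrays.deepToString(features(seq, 0)));
        // prints [[[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]]]
        System.out.println(Arrays.deepToString(oneHotLabels(seq, 0, 3)));
    }
}
```

Note that time is the last dimension: each row of the feature array is one feature traced across all time steps.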

DataVecSequencePairDataSetFunction

Handles paired sequence data from two separate sources (one sequence of features, one of labels), with alignment modes for pairs whose sequences differ in length.

public class DataVecSequencePairDataSetFunction 
    implements Function<Tuple2<List<List<Writable>>, List<List<Writable>>>, DataSet>, Serializable {
    
    public enum AlignmentMode {
        EQUAL_LENGTH,  // Default: assume input and labels have same length
        ALIGN_START,   // Align at first time step, pad shorter sequence at end
        ALIGN_END      // Align at last time step, pad shorter sequence at start
    }
    
    public DataVecSequencePairDataSetFunction();
    
    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression);
    
    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression, 
                                              AlignmentMode alignmentMode);
                                              
    public DataVecSequencePairDataSetFunction(int numPossibleLabels, boolean regression, 
                                              AlignmentMode alignmentMode, 
                                              DataSetPreProcessor preProcessor, 
                                              WritableConverter converter);
                                              
    public DataSet call(Tuple2<List<List<Writable>>, List<List<Writable>>> input) throws Exception;
}

Parameters

numPossibleLabels: Number of classes for classification (-1 for regression without conversion)
regression: false for classification (converts to one-hot), true for regression
alignmentMode: How to handle different sequence lengths
preProcessor: Optional data preprocessing
converter: Optional writable conversion

Alignment Modes

EQUAL_LENGTH

Assumes input and label sequences have the same length. No padding applied.

DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);

ALIGN_START

Aligns sequences at the first time step. Shorter sequence is zero-padded at the end.

// Typical for one-to-many scenarios (single input at the first time step, longer output sequence)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    5,     // numPossibleLabels
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START
);

ALIGN_END

Aligns sequences at the last time step. Shorter sequence is zero-padded at the start.

// Typical for many-to-one scenarios (long input sequence, single label at the last time step)
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    3,     // numPossibleLabels
    false, // classification  
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);
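
The three modes differ only in where the shorter sequence is placed inside the padded length. The sketch below shows that placement for a one-value-per-step sequence in plain Java. It is an illustration under assumptions, not the library's code: the enum is redeclared locally, `startOffset` and `pad` are hypothetical names, and rejecting unequal lengths under EQUAL_LENGTH is this sketch's choice.

```java
import java.util.Arrays;

final class AlignmentSketch {

    enum AlignmentMode { EQUAL_LENGTH, ALIGN_START, ALIGN_END }

    /** Index in the padded array at which a sequence of length seqLen begins. */
    static int startOffset(AlignmentMode mode, int seqLen, int paddedLen) {
        if (seqLen > paddedLen) {
            throw new IllegalArgumentException("sequence is longer than the padded length");
        }
        switch (mode) {
            case EQUAL_LENGTH:
                if (seqLen != paddedLen) {
                    // Assumption of this sketch: unequal lengths are treated as an error.
                    throw new IllegalArgumentException("EQUAL_LENGTH requires equal lengths");
                }
                return 0;
            case ALIGN_START:
                return 0;                   // padding goes at the end
            case ALIGN_END:
                return paddedLen - seqLen;  // padding goes at the start
            default:
                throw new AssertionError(mode);
        }
    }

    /** Copies values into a zero-filled array of length paddedLen according to mode. */
    static double[] pad(double[] values, int paddedLen, AlignmentMode mode) {
        double[] out = new double[paddedLen];
        int offset = startOffset(mode, values.length, paddedLen);
        System.arraycopy(values, 0, out, offset, values.length);
        return out;
    }

    public static void main(String[] args) {
        double[] labels = {7, 8};
        // prints [7.0, 8.0, 0.0, 0.0, 0.0]
        System.out.println(Arrays.toString(pad(labels, 5, AlignmentMode.ALIGN_START)));
        // prints [0.0, 0.0, 0.0, 7.0, 8.0]
        System.out.println(Arrays.toString(pad(labels, 5, AlignmentMode.ALIGN_END)));
    }
}
```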

Usage Examples

Sequence-to-Sequence Classification

import scala.Tuple2;

// Input: features sequence, labels sequence
JavaRDD<Tuple2<List<List<Writable>>, List<List<Writable>>>> pairedSequences = // ... your data

DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    10,    // 10 classes
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.EQUAL_LENGTH
);

JavaRDD<DataSet> datasets = pairedSequences.map(transformer);

Many-to-One with Alignment

// Long input sequence, single label at the end
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    2,     // binary classification
    false, // classification
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_END
);

// Input sequences: 100 time steps; label sequences: 1 time step
// Result: the label sequence is zero-padded at the start to 100 time steps,
// and the label mask marks only the last time step as valid

Regression with Preprocessing

import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// The scaler must be fitted on representative data (fit(...)) before it is applied
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecSequencePairDataSetFunction transformer = new DataVecSequencePairDataSetFunction(
    -1,    // ignored for regression
    true,  // regression mode
    DataVecSequencePairDataSetFunction.AlignmentMode.ALIGN_START,
    normalizer,
    null   // no converter
);

Masking Support

When the feature and label sequences in a pair have different lengths, DataVecSequencePairDataSetFunction pads the shorter one and automatically creates mask arrays:

Input Mask: Indicates valid time steps in feature sequences
Output Mask: Indicates valid time steps in label sequences
Padded time steps contain zeros; the masks mark them so they are ignored during training
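
As an illustration of what a mask encodes and how it can keep padded steps out of a training signal, here is a plain-Java sketch. It is not the library's implementation: `mask` and `maskedMean` are hypothetical names, and averaging per-step losses over valid steps is one common way to apply a mask, assumed here for the example.

```java
import java.util.Arrays;

final class MaskSketch {

    /**
     * Mask of length paddedLen holding 1.0 at real time steps and 0.0 at padding.
     * With alignEnd the real steps sit at the end, otherwise at the start.
     */
    static double[] mask(int seqLen, int paddedLen, boolean alignEnd) {
        double[] m = new double[paddedLen];
        int start = alignEnd ? paddedLen - seqLen : 0;
        Arrays.fill(m, start, start + seqLen, 1.0);
        return m;
    }

    /** Mean of the per-step losses over unmasked steps only; 0 if every step is masked. */
    static double maskedMean(double[] perStepLoss, double[] mask) {
        double sum = 0.0;
        double count = 0.0;
        for (int t = 0; t < perStepLoss.length; t++) {
            sum += perStepLoss[t] * mask[t];
            count += mask[t];
        }
        return count == 0.0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        // One label aligned at the end of a 4-step input
        double[] labelMask = mask(1, 4, true);
        System.out.println(Arrays.toString(labelMask)); // prints [0.0, 0.0, 0.0, 1.0]

        // Losses at padded steps do not affect the result
        double[] perStepLoss = {9.0, 9.0, 9.0, 0.5};
        System.out.println(maskedMean(perStepLoss, labelMask)); // prints 0.5
    }
}
```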

NDArrayWritable Support

Both sequence functions support NDArrayWritable objects for complex feature representations:

// Each time step can contain NDArrayWritable for multi-dimensional features
List<List<Writable>> sequence = Arrays.asList(
    Arrays.asList(new NDArrayWritable(featureArray1), new DoubleWritable(label1)),
    Arrays.asList(new NDArrayWritable(featureArray2), new DoubleWritable(label2))
);

Error Handling

If a Writable cannot be read as a scalar (an UnsupportedOperationException is thrown), the function checks whether it is an NDArrayWritable and uses its array contents instead
Automatically creates appropriate tensor dimensions based on input data
Supports empty sequences (creates zero-length tensors)
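
The scalar-first, array-second fallback described in the first point can be sketched in plain Java as follows. Everything here is a hypothetical stand-in so the example runs without DL4J: `Value` plays the role of `Writable`, `ArrayValue` the role of `NDArrayWritable`, and rethrowing for any other non-scalar type is an assumption of this sketch.

```java
final class ScalarOrArrayFallback {

    /** Hypothetical stand-in for Writable. */
    interface Value {
        double toDouble();
    }

    /** Scalar value: toDouble() succeeds. */
    static final class ScalarValue implements Value {
        final double v;
        ScalarValue(double v) { this.v = v; }
        public double toDouble() { return v; }
    }

    /** Hypothetical stand-in for NDArrayWritable: not convertible to a single double. */
    static final class ArrayValue implements Value {
        final double[] data;
        ArrayValue(double[] data) { this.data = data; }
        public double toDouble() { throw new UnsupportedOperationException("not a scalar"); }
    }

    /**
     * Writes value into column starting at pos and returns how many entries were written.
     * Tries the scalar path first; on failure, falls back to copying array contents.
     */
    static int write(Value value, double[] column, int pos) {
        try {
            column[pos] = value.toDouble();
            return 1;
        } catch (UnsupportedOperationException e) {
            if (value instanceof ArrayValue) {
                double[] data = ((ArrayValue) value).data;
                System.arraycopy(data, 0, column, pos, data.length);
                return data.length;
            }
            throw e; // neither scalar nor array: surface the original error
        }
    }

    public static void main(String[] args) {
        double[] column = new double[4];
        int next = write(new ScalarValue(1.5), column, 0);
        next += write(new ArrayValue(new double[] {2.0, 3.0}), column, next);
        System.out.println(next); // prints 3
    }
}
```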

Integration Patterns

// Complete sequence processing workflow
JavaRDD<List<List<Writable>>> sequences = // ... load time series data

DataVecSequenceDataSetFunction transformer = new DataVecSequenceDataSetFunction(0, 10, false);
JavaRDD<DataSet> sequenceDatasets = sequences.map(transformer);

// For RNN training
sequenceDatasets.cache(); // Cache for multiple epochs

Install with Tessl CLI

npx tessl i tessl/maven-org-deeplearning4j--dl4j-spark-2-11
