DeepLearning4j Spark integration component providing DataVec transformations and DataSet operations for distributed deep learning in Apache Spark environments
—
Data transformation functions convert DataVec Writable collections into DataSet objects suitable for deep learning training and inference. These functions handle both classification and regression tasks with support for data preprocessing and conversion.
`DataVecDataSetFunction` is the primary transformation function for converting `List<Writable>` records into `DataSet` objects in Spark environments.
```java
public class DataVecDataSetFunction implements Function<List<Writable>, DataSet>, Serializable {

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                  DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataVecDataSetFunction(int labelIndexFrom, int labelIndexTo, int numPossibleLabels,
                                  boolean regression, DataSetPreProcessor preProcessor,
                                  WritableConverter converter);

    public DataSet call(List<Writable> currList) throws Exception;
}
```

The `regression` flag is `false` for classification (labels are one-hot encoded) and `true` for regression.
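As a minimal sketch of the one-hot encoding described above, in plain Java with a hypothetical helper (not the library's implementation), including the `numPossibleLabels` validation documented below:

```java
import java.util.Arrays;

public class OneHotSketch {
    // Hypothetical helper mirroring the described behavior for
    // classification (regression = false): label k with n classes
    // becomes a length-n vector of zeros with a 1.0 at index k.
    public static double[] oneHot(int label, int numPossibleLabels) {
        if (numPossibleLabels < 1)
            throw new IllegalStateException("numPossibleLabels must be >= 1 for classification");
        if (label > numPossibleLabels - 1)
            throw new IllegalStateException("label " + label + " exceeds numPossibleLabels - 1");
        double[] out = new double[numPossibleLabels];
        out[label] = 1.0;
        return out;
    }

    public static void main(String[] args) {
        // Label 4 with 10 classes, matching the classification example below
        System.out.println(Arrays.toString(oneHot(4, 10)));
        // prints [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    }
}
```
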
```java
// For classification with 10 classes, label in column 4
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<List<Writable>> records = ...; // your input RDD
JavaRDD<DataSet> datasets = records.map(transformer);
```

For regression with data normalization:

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        -1,         // labelIndex: -1 infers last column as label
        -1,         // numPossibleLabels: ignored for regression
        true,       // regression: true
        normalizer,
        null        // no converter needed
);
JavaRDD<DataSet> datasets = records.map(transformer);
```

For multi-output regression with labels in columns 5 through 7:
```java
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        5,     // labelIndexFrom: start of label columns
        7,     // labelIndexTo: end of label columns (inclusive)
        -1,    // numPossibleLabels: ignored for regression
        true,  // regression: true
        null,  // no preprocessing
        null   // no converter
);
```

With a custom writable converter:

```java
import org.datavec.api.io.converters.SelfWritableConverter;

WritableConverter converter = new SelfWritableConverter();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        0,        // labelIndex
        2,        // numPossibleLabels
        false,    // classification
        null,     // no preprocessing
        converter
);
```

When `labelIndex` is -1 and `numPossibleLabels >= 1`, the function automatically uses the last column as the label column.
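That last-column rule can be sketched with a hypothetical helper (plain Java, not the library's code):

```java
public class LabelIndexSketch {
    // Hypothetical: mirrors the documented rule that labelIndex == -1
    // (with numPossibleLabels >= 1) selects the last column as the label.
    public static int resolveLabelIndex(int labelIndex, int numColumns) {
        return labelIndex == -1 ? numColumns - 1 : labelIndex;
    }

    public static void main(String[] args) {
        System.out.println(resolveLabelIndex(-1, 5)); // prints 4: last of 5 columns
        System.out.println(resolveLabelIndex(2, 5));  // prints 2: explicit index kept
    }
}
```
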
The function can handle NDArrayWritable objects directly:
- A single `NDArrayWritable` is passed through by reference and treated as feature-only data.
- When the first value is an `NDArrayWritable` and the second is a scalar, the NDArray is used as the features and the scalar as the label.
- Throws `IllegalStateException` if `numPossibleLabels < 1` for classification.
- Throws `IllegalStateException` if a label value exceeds `numPossibleLabels - 1` for classification.
- Recovers from `UnsupportedOperationException` when handling `NDArrayWritable` objects.

A complete Spark workflow:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<List<Writable>> inputRDD = ...; // load your data
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<DataSet> datasetRDD = inputRDD.map(transformer);
// Continue with the ML pipeline
datasetRDD.collect(); // or other Spark actions
```

Install with Tessl CLI
```shell
npx tessl i tessl/maven-org-deeplearning4j--dl4j-spark-2-11
```