tessl/maven-org-deeplearning4j--dl4j-spark-2-11

DeepLearning4j Spark integration component providing DataVec transformations and DataSet operations for distributed deep learning in Apache Spark environments


Data Transformation Functions

Data transformation functions convert DataVec Writable collections into DataSet objects suitable for deep learning training and inference. These functions handle both classification and regression tasks with support for data preprocessing and conversion.

DataVecDataSetFunction

The primary transformation function for converting List<Writable> collections to DataSet objects in Spark environments.

```java
public class DataVecDataSetFunction implements Function<List<Writable>, DataSet>, Serializable {
    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                  DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataVecDataSetFunction(int labelIndexFrom, int labelIndexTo, int numPossibleLabels,
                                  boolean regression, DataSetPreProcessor preProcessor,
                                  WritableConverter converter);

    public DataSet call(List<Writable> currList) throws Exception;
}
```

Parameters

  • labelIndex / labelIndexFrom: Column index where labels begin (0-based). Use -1 to infer the label column from the row size (see Label Inference below).
  • labelIndexTo: Column index where labels end (inclusive). For single labels, same as labelIndexFrom.
  • numPossibleLabels: Number of classes for classification tasks (ignored for regression).
  • regression: false for classification (creates one-hot encoded labels), true for regression.
  • preProcessor: Optional DataSetPreProcessor for data normalization/transformation.
  • converter: Optional WritableConverter for custom data type conversions.

Usage Examples

Basic Classification

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.writable.Writable;
import org.deeplearning4j.spark.datavec.DataVecDataSetFunction;
import org.nd4j.linalg.dataset.DataSet;

// For classification with 10 classes, label in column 4
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);

JavaRDD<List<Writable>> records = ...; // your input RDD
JavaRDD<DataSet> datasets = records.map(transformer);
```

Regression with Preprocessing

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// For regression with data normalization
DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
    -1,    // labelIndex: -1 infers last column as label
    -1,    // numPossibleLabels: ignored for regression
    true,  // regression: true
    normalizer,
    null   // no converter needed
);

JavaRDD<DataSet> datasets = records.map(transformer);
```

Multi-Label Regression

```java
// For multi-output regression with labels in columns 5-7
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
    5,     // labelIndexFrom: start of label columns
    7,     // labelIndexTo: end of label columns (inclusive)
    -1,    // numPossibleLabels: ignored for regression
    true,  // regression: true
    null,  // no preprocessing
    null   // no converter
);
```

Custom Data Conversion

```java
import org.datavec.api.io.converters.SelfWritableConverter;

// With custom writable converter
WritableConverter converter = new SelfWritableConverter();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
    0,     // labelIndex
    2,     // numPossibleLabels
    false, // classification
    null,  // no preprocessing
    converter
);
```

Behavior Details

Label Inference

When labelIndex is -1 and numPossibleLabels >= 1, the function automatically uses the last column as the label column.
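This rule can be sketched in plain Java (the method and variable names here are illustrative, not part of the DL4J API):

```java
import java.util.Arrays;
import java.util.List;

public class LabelInferenceSketch {
    // Hypothetical helper mirroring the documented rule: with labelIndex == -1
    // and numPossibleLabels >= 1, the last column is treated as the label.
    static int resolveLabelIndex(int labelIndex, int numPossibleLabels, int rowSize) {
        if (labelIndex == -1 && numPossibleLabels >= 1) {
            return rowSize - 1; // infer: last column holds the label
        }
        return labelIndex;
    }

    public static void main(String[] args) {
        List<Double> row = Arrays.asList(0.1, 0.2, 0.3, 2.0); // 3 features + 1 label
        System.out.println(resolveLabelIndex(-1, 10, row.size())); // inferred
        System.out.println(resolveLabelIndex(1, 10, row.size()));  // explicit
    }
}
```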

NDArray Support

The function can handle NDArrayWritable objects directly:

  • If input contains 2 elements and both are NDArrayWritable with the same reference, treats as feature-only data
  • If first element is NDArrayWritable and second is a scalar, uses NDArray as features and scalar as label

Error Handling

  • Throws IllegalStateException if numPossibleLabels < 1 for classification
  • Throws IllegalStateException if label value exceeds numPossibleLabels - 1 for classification
  • Skips empty Writable values during processing
  • Recovers when NDArrayWritable objects throw UnsupportedOperationException during scalar conversion
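The classification checks above can be sketched as follows; this is an illustrative stand-in, not the actual DL4J implementation:

```java
public class LabelValidationSketch {
    // Illustrative check mirroring the documented error conditions
    // for classification (names and messages are hypothetical).
    static void validateLabel(int labelValue, int numPossibleLabels) {
        if (numPossibleLabels < 1) {
            throw new IllegalStateException(
                "numPossibleLabels must be >= 1 for classification, got " + numPossibleLabels);
        }
        if (labelValue < 0 || labelValue > numPossibleLabels - 1) {
            throw new IllegalStateException(
                "Label value " + labelValue + " out of range [0, " + (numPossibleLabels - 1) + "]");
        }
    }

    public static void main(String[] args) {
        validateLabel(4, 10); // valid: passes silently
        try {
            validateLabel(10, 10); // exceeds numPossibleLabels - 1
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```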

Data Processing Flow

  1. Determines actual label index (handles -1 inference)
  2. Processes special cases for NDArrayWritable inputs
  3. Iterates through input list, separating features and labels
  4. Creates appropriate label vectors (one-hot for classification, scalar/multi-dimensional for regression)
  5. Constructs feature vectors from non-label columns
  6. Applies preprocessing if configured
  7. Returns constructed DataSet
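For the classification case, steps 3-5 of this flow can be sketched in plain Java; the real implementation builds INDArray objects via ND4J, so the helper below is illustrative only:

```java
import java.util.Arrays;

public class FlowSketch {
    // Illustrative single-row conversion for classification: split a row into
    // a feature vector and a one-hot label vector. Names are hypothetical.
    static double[][] rowToDataSet(double[] row, int labelIndex, int numPossibleLabels) {
        // Separate features from the label column
        double[] features = new double[row.length - 1];
        for (int i = 0, j = 0; i < row.length; i++) {
            if (i != labelIndex) features[j++] = row[i];
        }
        // One-hot encode the label for classification
        double[] label = new double[numPossibleLabels];
        label[(int) row[labelIndex]] = 1.0;
        return new double[][]{features, label};
    }

    public static void main(String[] args) {
        double[] row = {0.5, 1.5, 2.0, 2.5, 3.0}; // label (class 2) in column 2
        double[][] ds = rowToDataSet(row, 2, 4);
        System.out.println(Arrays.toString(ds[0])); // features
        System.out.println(Arrays.toString(ds[1])); // one-hot label
    }
}
```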

Integration with Spark

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;

// Complete Spark workflow example
JavaSparkContext sc = new JavaSparkContext();
JavaRDD<List<Writable>> inputRDD = ...; // load your data

DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<DataSet> datasetRDD = inputRDD.map(transformer);

// Continue with ML pipeline
datasetRDD.collect(); // materializes on the driver; use with care on large RDDs
```
