DeepLearning4j Spark integration component providing DataVec transformations and DataSet operations for distributed deep learning in Apache Spark environments
—
Data transformation functions convert DataVec Writable collections into DataSet objects suitable for deep learning training and inference. These functions handle both classification and regression tasks with support for data preprocessing and conversion.
`DataVecDataSetFunction` is the primary transformation function for converting `List<Writable>` records into `DataSet` objects in Spark environments.
```java
public class DataVecDataSetFunction implements Function<List<Writable>, DataSet>, Serializable {

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);

    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression,
                                  DataSetPreProcessor preProcessor, WritableConverter converter);

    public DataVecDataSetFunction(int labelIndexFrom, int labelIndexTo, int numPossibleLabels,
                                  boolean regression, DataSetPreProcessor preProcessor,
                                  WritableConverter converter);

    public DataSet call(List<Writable> currList) throws Exception;
}
```

The `regression` flag is `false` for classification (labels are one-hot encoded) and `true` for regression.
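As a minimal sketch of the one-hot encoding described above, in plain Java with a hypothetical helper (not the library's implementation), including the `numPossibleLabels` validation documented below:

```java
import java.util.Arrays;

public class OneHotSketch {
    // Hypothetical helper mirroring the described behavior for
    // classification (regression = false): label k with n classes
    // becomes a length-n vector of zeros with a 1.0 at index k.
    public static double[] oneHot(int label, int numPossibleLabels) {
        if (numPossibleLabels < 1)
            throw new IllegalStateException("numPossibleLabels must be >= 1 for classification");
        if (label > numPossibleLabels - 1)
            throw new IllegalStateException("label " + label + " exceeds numPossibleLabels - 1");
        double[] out = new double[numPossibleLabels];
        out[label] = 1.0;
        return out;
    }

    public static void main(String[] args) {
        // Label 4 with 10 classes, matching the classification example below
        System.out.println(Arrays.toString(oneHot(4, 10)));
        // prints [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    }
}
```
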
```java
// For classification with 10 classes, label in column 4
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<List<Writable>> records = ...; // your input RDD
JavaRDD<DataSet> datasets = records.map(transformer);
```

For regression with data normalization:

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

DataSetPreProcessor normalizer = new NormalizerMinMaxScaler();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        -1,         // labelIndex: -1 infers last column as label
        -1,         // numPossibleLabels: ignored for regression
        true,       // regression: true
        normalizer,
        null        // no converter needed
);
JavaRDD<DataSet> datasets = records.map(transformer);
```

For multi-output regression with labels in columns 5 through 7:
```java
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        5,     // labelIndexFrom: start of label columns
        7,     // labelIndexTo: end of label columns (inclusive)
        -1,    // numPossibleLabels: ignored for regression
        true,  // regression: true
        null,  // no preprocessing
        null   // no converter
);
```

With a custom writable converter:

```java
import org.datavec.api.io.converters.SelfWritableConverter;

WritableConverter converter = new SelfWritableConverter();
DataVecDataSetFunction transformer = new DataVecDataSetFunction(
        0,        // labelIndex
        2,        // numPossibleLabels
        false,    // classification
        null,     // no preprocessing
        converter
);
```

When `labelIndex` is -1 and `numPossibleLabels >= 1`, the function automatically uses the last column as the label column.
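That last-column rule can be sketched with a hypothetical helper (plain Java, not the library's code):

```java
public class LabelIndexSketch {
    // Hypothetical: mirrors the documented rule that labelIndex == -1
    // (with numPossibleLabels >= 1) selects the last column as the label.
    public static int resolveLabelIndex(int labelIndex, int numColumns) {
        return labelIndex == -1 ? numColumns - 1 : labelIndex;
    }

    public static void main(String[] args) {
        System.out.println(resolveLabelIndex(-1, 5)); // prints 4: last of 5 columns
        System.out.println(resolveLabelIndex(2, 5));  // prints 2: explicit index kept
    }
}
```
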
The function can handle NDArrayWritable objects directly:
- A single `NDArrayWritable` is passed through by reference and treated as feature-only data.
- When the first value is an `NDArrayWritable` and the second is a scalar, the NDArray is used as the features and the scalar as the label.
- Throws `IllegalStateException` if `numPossibleLabels < 1` for classification.
- Throws `IllegalStateException` if a label value exceeds `numPossibleLabels - 1` for classification.
- Recovers from `UnsupportedOperationException` when handling `NDArrayWritable` objects.

A complete Spark workflow:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;

JavaSparkContext sc = new JavaSparkContext();
JavaRDD<List<Writable>> inputRDD = ...; // load your data
DataVecDataSetFunction transformer = new DataVecDataSetFunction(4, 10, false);
JavaRDD<DataSet> datasetRDD = inputRDD.map(transformer);
// Continue with the ML pipeline
datasetRDD.collect(); // or other Spark actions
```

Install with Tessl CLI
```shell
npx tessl i tessl/maven-org-deeplearning4j--dl4j-spark-2-11
```