tessl/maven-org-datavec--datavec-local

DataVec integration library providing data loading, transformation, and Spark processing capabilities for DeepLearning4j

DataSet Iteration

Core functionality for converting RecordReader data into DataSet objects suitable for DeepLearning4j training. The RecordReaderDataSetIterator provides a bridge between DataVec's data reading capabilities and DeepLearning4j's training requirements.

Capabilities

RecordReaderDataSetIterator

Main class for converting RecordReader data into DataSet objects for neural network training.

```java
public class RecordReaderDataSetIterator implements DataSetIterator, Serializable {
    // Main constructors
    public RecordReaderDataSetIterator(RecordReader recordReader, int batchSize);
    public RecordReaderDataSetIterator(RecordReader recordReader, int batchSize,
                                       int labelIndex, int numPossibleLabels);
    public RecordReaderDataSetIterator(RecordReader recordReader, int batchSize,
                                       int labelIndex, int numPossibleLabels,
                                       boolean regression);
    public RecordReaderDataSetIterator(RecordReader recordReader,
                                       WritableConverter converter, int batchSize,
                                       int labelIndex, int numPossibleLabels,
                                       boolean regression);
    public RecordReaderDataSetIterator(RecordReader recordReader, int batchSize,
                                       int labelIndexFrom, int labelIndexTo,
                                       boolean regression);
    public RecordReaderDataSetIterator(RecordReader recordReader, int batchSize,
                                       int labelIndex, int numPossibleLabels,
                                       int maxNumBatches);

    // Iterator methods
    public boolean hasNext();
    public DataSet next();
    public DataSet next(int num);
    public void remove();

    // Configuration methods
    public void setPreProcessor(DataSetPreProcessor preProcessor);
    public DataSetPreProcessor getPreProcessor();
    public void setCollectMetaData(boolean collectMetaData);
    public boolean getCollectMetaData();

    // Information methods
    public int totalExamples();
    public int inputColumns();
    public int totalOutcomes();
    public int batch();
    public int cursor();
    public int numExamples();
    public List<String> getLabels();

    // Reset and async support
    public boolean resetSupported();
    public boolean asyncSupported();
    public void reset();

    // Metadata support
    public DataSet loadFromMetaData(RecordMetaData recordMetaData) throws IOException;
    public DataSet loadFromMetaData(List<RecordMetaData> recordMetaDatas) throws IOException;
}
```

Constructor Parameters

Basic Constructor

  • recordReader: The RecordReader to read data from
  • batchSize: Number of examples per batch

Classification Constructor

  • recordReader: The RecordReader to read data from
  • batchSize: Number of examples per batch
  • labelIndex: Column index containing the label (0-based)
  • numPossibleLabels: Number of possible label classes

Advanced Constructor

  • recordReader: The RecordReader to read data from
  • converter: WritableConverter for data type conversion (null for default)
  • batchSize: Number of examples per batch
  • labelIndex: Column index containing the label (0-based)
  • numPossibleLabels: Number of possible label classes
  • regression: true for regression, false for classification
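The converter-based constructor is rarely needed for plain CSV data, but a sketch of its use looks like the following. It assumes a `csvReader` set up as in the usage examples below; `SelfWritableConverter` is DataVec's pass-through converter (passing `null` selects equivalent default behavior).

```java
import org.datavec.api.io.converters.SelfWritableConverter;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

// Explicit converter; null would select the same pass-through default
DataSetIterator converterIterator = new RecordReaderDataSetIterator(
    csvReader,                   // recordReader
    new SelfWritableConverter(), // converter (pass-through)
    32,                          // batchSize
    4,                           // labelIndex
    3,                           // numPossibleLabels
    false                        // regression
);
```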

Multi-Label Constructor

  • recordReader: The RecordReader to read data from
  • batchSize: Number of examples per batch
  • labelIndexFrom: Starting column index for labels (inclusive)
  • labelIndexTo: Ending column index for labels (inclusive)
  • regression: true for regression, false for classification

Batch-Limited Constructor

  • recordReader: The RecordReader to read data from
  • batchSize: Number of examples per batch
  • labelIndex: Column index containing the label (0-based)
  • numPossibleLabels: Number of possible label classes
  • maxNumBatches: Maximum number of batches to iterate over
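The batch-limited constructor is convenient for smoke-testing a pipeline against a large file without reading it all. A sketch, reusing the hypothetical `csvReader` from the usage examples below:

```java
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

// Stop after at most 10 batches, regardless of file size
DataSetIterator limitedIterator = new RecordReaderDataSetIterator(
    csvReader,      // recordReader
    32,             // batchSize
    4,              // labelIndex
    3,              // numPossibleLabels
    10              // maxNumBatches: at most 10 * 32 = 320 examples are served
);

int batches = 0;
while (limitedIterator.hasNext()) {
    limitedIterator.next();
    batches++;
}
// batches will not exceed 10
```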

Usage Examples

Basic CSV Classification

```java
import java.io.File;
import java.util.Arrays;

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

// Setup CSV reader
RecordReader csvReader = new CSVRecordReader();
csvReader.initialize(new FileSplit(new File("iris.csv")));

// Create iterator for classification
DataSetIterator iterator = new RecordReaderDataSetIterator(
    csvReader,      // recordReader
    32,             // batchSize
    4,              // labelIndex (column 4 contains labels)
    3               // numPossibleLabels (3 classes)
);

// Use iterator
while (iterator.hasNext()) {
    DataSet dataSet = iterator.next();
    System.out.println("Features shape: " + Arrays.toString(dataSet.getFeatures().shape()));
    System.out.println("Labels shape: " + Arrays.toString(dataSet.getLabels().shape()));
}
```

Regression Example

```java
// Setup for regression task
DataSetIterator regressionIterator = new RecordReaderDataSetIterator(
    csvReader,      // recordReader
    64,             // batchSize
    5,              // labelIndex (column 5 contains continuous target)
    1,              // numPossibleLabels (ignored when regression = true)
    true            // regression = true
);
```

Multi-Label Classification

```java
// Labels in columns 3, 4, and 5
DataSetIterator multiLabelIterator = new RecordReaderDataSetIterator(
    csvReader,      // recordReader
    32,             // batchSize
    3,              // labelIndexFrom (start of label columns, inclusive)
    5,              // labelIndexTo (end of label columns, inclusive)
    false           // regression = false (classification)
);
```

With Data Preprocessing

```java
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;

// Create iterator
DataSetIterator iterator = new RecordReaderDataSetIterator(csvReader, 32, 4, 3);

// First pass to calculate min/max statistics
NormalizerMinMaxScaler scaler = new NormalizerMinMaxScaler();
scaler.fit(iterator);
iterator.reset();

// Attach the fitted scaler so each batch is normalized on the fly.
// Note: fit before attaching, otherwise fit() would see already-transformed data.
iterator.setPreProcessor(scaler);

// Now use normalized data
while (iterator.hasNext()) {
    DataSet normalizedData = iterator.next();
    // Train with normalized data
}
```

Metadata Collection

```java
// Enable metadata collection
RecordReaderDataSetIterator iterator = new RecordReaderDataSetIterator(
    csvReader, 32, 4, 3);
iterator.setCollectMetaData(true);

// Process data
DataSet batch = iterator.next();
List<RecordMetaData> metaData = batch.getExampleMetaData();

// Later, load specific examples by metadata (throws IOException)
DataSet specificExample = iterator.loadFromMetaData(metaData.get(0));
```

Error Handling

The iterator signals error conditions with the following exceptions:

  • IOException: Thrown when the RecordReader encounters file reading errors
  • IllegalArgumentException: Thrown for invalid constructor parameters
  • NoSuchElementException: Thrown when calling next() with no more data

Common validation performed:

  • batchSize must be positive
  • labelIndex must be a valid column index
  • numPossibleLabels must be positive for classification
  • labelIndexFrom must be <= labelIndexTo for multi-label scenarios
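A defensive iteration loop might look like the following sketch. Note that the exact exception type surfaced depends on the underlying RecordReader; many implementations rethrow low-level IOExceptions as unchecked exceptions.

```java
import java.util.NoSuchElementException;

try {
    while (iterator.hasNext()) {
        DataSet batch = iterator.next();
        // train on batch
    }
} catch (NoSuchElementException e) {
    // next() was called after the underlying reader was exhausted
} catch (RuntimeException e) {
    // many RecordReader implementations rethrow IOExceptions unchecked
}
```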

Install with Tessl CLI

```shell
npx tessl i tessl/maven-org-datavec--datavec-local
```
