DataVec integration library providing data loading, transformation, and Spark processing capabilities for DeepLearning4j
npx @tessl/cli install tessl/maven-org-datavec--datavec-local@0.9.00
# DataVec Local Integration

DataVec Local Integration connects DataVec's data loading and transformation pipeline to DeepLearning4j training. It converts records from sources such as CSV files, images, and sequence data into DataSet and MultiDataSet objects, either locally through iterators or on Apache Spark through mapping functions.

## Package Information

- **Package Name**: org.datavec:datavec-local
- **Package Type**: maven
- **Language**: Java
- **Installation**: Add to Maven dependencies:

```xml
<dependency>
    <groupId>org.datavec</groupId>
    <artifactId>datavec-local</artifactId>
    <version>0.9.1</version>
</dependency>
```

## Core Imports

```java
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator;
import org.deeplearning4j.datasets.datavec.SequenceRecordReaderDataSetIterator;
```

For Spark integration:

```java
import org.deeplearning4j.spark.datavec.DataVecDataSetFunction;
import org.deeplearning4j.spark.datavec.DataVecSequenceDataSetFunction;
```

## Basic Usage

```java
import java.io.File;

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

// Create a CSV record reader (initialize can throw IOException and InterruptedException)
RecordReader recordReader = new CSVRecordReader();
recordReader.initialize(new FileSplit(new File("data.csv")));

// Create dataset iterator
int batchSize = 32;
int labelIndex = 4;        // Index of label column
int numPossibleLabels = 3; // Number of classes

DataSetIterator iterator = new RecordReaderDataSetIterator(
        recordReader, batchSize, labelIndex, numPossibleLabels);

// Use with DeepLearning4j training
while (iterator.hasNext()) {
    DataSet dataSet = iterator.next();
    // Train your model with dataSet
}
```

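To make the roles of `labelIndex` and `numPossibleLabels` concrete, here is a minimal plain-Java sketch of how one parsed CSV row could be split into a feature vector and a one-hot label. It is a conceptual illustration only, not the library's implementation; the class `LabelSplitSketch` and its methods are hypothetical names introduced for this example, and the one-hot encoding of class labels is an assumption.

```java
// Conceptual sketch only: not the RecordReaderDataSetIterator implementation.
final class LabelSplitSketch {

    // Returns every column of the row except the one at labelIndex.
    static double[] features(double[] row, int labelIndex) {
        double[] out = new double[row.length - 1];
        int j = 0;
        for (int i = 0; i < row.length; i++) {
            if (i != labelIndex) {
                out[j++] = row[i];
            }
        }
        return out;
    }

    // Encodes the class index stored at labelIndex as a one-hot vector.
    static double[] oneHotLabel(double[] row, int labelIndex, int numPossibleLabels) {
        int classIndex = (int) row[labelIndex];
        if (classIndex < 0 || classIndex >= numPossibleLabels) {
            throw new IllegalArgumentException(
                    "Label " + classIndex + " is outside [0, " + numPossibleLabels + ")");
        }
        double[] out = new double[numPossibleLabels];
        out[classIndex] = 1.0;
        return out;
    }
}
```

With the settings from the example above (`labelIndex = 4`, `numPossibleLabels = 3`), a row with five columns yields four feature values and a label vector of length three.
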
## Architecture

DataVec integration is built around several key components:

- **DataSet Iterators**: Convert RecordReader data into DataSet objects for single-input neural networks
- **MultiDataSet Support**: Handle complex multi-input/multi-output scenarios through RecordReaderMultiDataSetIterator
- **Sequence Processing**: Time series and sequential data handling with alignment modes
- **Spark Integration**: Distributed data processing functions for large-scale training
- **Metadata Support**: Load specific records by metadata for debugging and reproducibility

## Capabilities

### DataSet Iteration

Converts RecordReader output into DataSet mini-batches for DeepLearning4j training. Works with any RecordReader implementation, including CSV, image, and custom readers.

```java { .api }
public class RecordReaderDataSetIterator implements DataSetIterator {
    public RecordReaderDataSetIterator(RecordReader recordReader, int batchSize,
                                       int labelIndex, int numPossibleLabels);
    public DataSet next();
    public boolean hasNext();
    public void reset();
}
```

[DataSet Iteration](./dataset-iteration.md)

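The effect of `batchSize` on iteration can be sketched in plain Java. The example below computes the sizes of the mini-batches that result from reading a fixed number of records, under the assumption that the final batch simply holds whatever records remain. `BatchingSketch` is a hypothetical helper for illustration and is not part of the library.

```java
// Conceptual sketch only: illustrates mini-batch sizes, not the library's iterator.
final class BatchingSketch {

    // Sizes of the mini-batches produced when totalExamples records
    // are consumed batchSize at a time; the last batch may be smaller.
    static int[] batchSizes(int totalExamples, int batchSize) {
        if (batchSize <= 0 || totalExamples < 0) {
            throw new IllegalArgumentException("batchSize must be > 0 and totalExamples >= 0");
        }
        int numBatches = (totalExamples + batchSize - 1) / batchSize; // ceiling division
        int[] sizes = new int[numBatches];
        for (int i = 0; i < numBatches; i++) {
            sizes[i] = Math.min(batchSize, totalExamples - i * batchSize);
        }
        return sizes;
    }
}
```

For 150 records and a batch size of 32, this gives four full batches followed by one batch of 22, which is worth keeping in mind when training code assumes a fixed batch dimension.
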
### Sequence Processing

Builds DataSet batches from time series and other sequential data. Variable-length sequences are supported, and an AlignmentMode controls how sequences are aligned and padded within a batch.

```java { .api }
public class SequenceRecordReaderDataSetIterator implements DataSetIterator {
    public SequenceRecordReaderDataSetIterator(SequenceRecordReader featuresReader,
                                               SequenceRecordReader labelsReader,
                                               int miniBatchSize, int numPossibleLabels);
    public enum AlignmentMode { EQUAL_LENGTH, ALIGN_START, ALIGN_END }
}
```

[Sequence Processing](./sequence-processing.md)

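The difference between aligning at the start and aligning at the end can be sketched as a padding mask over time steps. The plain-Java example below is a conceptual illustration, not the library's code: `AlignmentSketch` is a hypothetical class, and the boolean `alignEnd` stands in for choosing between ALIGN_START and ALIGN_END.

```java
// Conceptual sketch only: shows where padding falls for each alignment choice.
final class AlignmentSketch {

    // Returns a mask of length maxLength: 1.0 marks a real time step, 0.0 marks padding.
    // alignEnd == false pads the end; alignEnd == true pads the start.
    static double[] mask(int sequenceLength, int maxLength, boolean alignEnd) {
        if (sequenceLength <= 0) {
            throw new IllegalArgumentException("Sequence must contain at least one time step");
        }
        if (sequenceLength > maxLength) {
            throw new IllegalArgumentException("Sequence is longer than maxLength");
        }
        double[] mask = new double[maxLength];
        int offset = alignEnd ? maxLength - sequenceLength : 0;
        for (int t = 0; t < sequenceLength; t++) {
            mask[offset + t] = 1.0;
        }
        return mask;
    }
}
```

A sequence of three steps in a batch whose longest sequence has five steps gets the mask `[1, 1, 1, 0, 0]` when aligned at the start and `[0, 0, 1, 1, 1]` when aligned at the end.
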
### Multi-Input/Output Support

Builds MultiDataSet objects for networks with multiple inputs or outputs. A Builder registers named readers and maps their columns to network inputs and outputs.

```java { .api }
public class RecordReaderMultiDataSetIterator implements MultiDataSetIterator {
    public static class Builder {
        public Builder addReader(String readerName, RecordReader recordReader);
        public Builder addInput(String readerName, int columnFirst, int columnLast);
        public Builder addOutput(String readerName, int column, int numClasses);
        public RecordReaderMultiDataSetIterator build();
    }
}
```

[Multi-Input/Output](./multi-input-output.md)

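One way to picture `addInput(readerName, columnFirst, columnLast)` is as a column slice taken from each record, so that a single reader can feed several network inputs. The plain-Java sketch below assumes both bounds are inclusive; that reading, along with the hypothetical `ColumnRangeSketch` class, is an assumption made for illustration rather than a statement of the library's behavior.

```java
// Conceptual sketch only: column slicing with inclusive bounds (an assumption).
final class ColumnRangeSketch {

    // Copies columns columnFirst..columnLast (both inclusive) from a row.
    static double[] select(double[] row, int columnFirst, int columnLast) {
        if (columnFirst < 0 || columnLast >= row.length || columnFirst > columnLast) {
            throw new IllegalArgumentException(
                    "Invalid column range [" + columnFirst + ", " + columnLast + "]");
        }
        double[] out = new double[columnLast - columnFirst + 1];
        System.arraycopy(row, columnFirst, out, 0, out.length);
        return out;
    }
}
```

For a five-column row, `select(row, 0, 2)` and `select(row, 3, 4)` would produce two disjoint input arrays of three and two values from the same record.
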
### Spark Integration

Spark functions that convert DataVec records into DataSet objects, so data preparation for training can run across a cluster.

```java { .api }
public class DataVecDataSetFunction implements Function<List<Writable>, DataSet> {
    public DataVecDataSetFunction(int labelIndex, int numPossibleLabels, boolean regression);
    public DataSet call(List<Writable> currList);
}
```

[Spark Integration](./spark-integration.md)

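Because the function converts one record at a time, each record can be processed independently, which is what makes it suitable for a Spark `map`. The plain-Java sketch below illustrates the per-record label conversion and the role of the `regression` flag. It assumes classification labels become one-hot vectors and regression labels are kept as a single value; `RecordToLabelSketch` is a hypothetical class and does not use Spark or the library itself.

```java
// Conceptual sketch only: per-record label conversion, with no Spark dependency.
final class RecordToLabelSketch {

    // Converts the value at labelIndex into a label vector.
    // regression == true keeps the raw value; otherwise the value is
    // treated as a class index and one-hot encoded.
    static double[] label(double[] record, int labelIndex,
                          int numPossibleLabels, boolean regression) {
        double raw = record[labelIndex];
        if (regression) {
            return new double[] {raw};
        }
        int classIndex = (int) raw;
        if (classIndex < 0 || classIndex >= numPossibleLabels) {
            throw new IllegalArgumentException(
                    "Label " + classIndex + " is outside [0, " + numPossibleLabels + ")");
        }
        double[] oneHot = new double[numPossibleLabels];
        oneHot[classIndex] = 1.0;
        return oneHot;
    }
}
```

The method holds no state between calls, mirroring the property that lets a record-to-DataSet function be applied in parallel across partitions.
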
## Types

### Core Interfaces

```java { .api }
public interface DataSetIterator extends Iterator<DataSet> {
    DataSet next(int num);
    int totalExamples();
    int inputColumns();
    int totalOutcomes();
    boolean resetSupported();
    void reset();
    boolean asyncSupported();
    int batch();
    int cursor();
    void setPreProcessor(DataSetPreProcessor preProcessor);
    DataSetPreProcessor getPreProcessor();
    List<String> getLabels();
    DataSet loadFromMetaData(RecordMetaData recordMetaData);
    DataSet loadFromMetaData(List<RecordMetaData> list);
}

public interface MultiDataSetIterator extends Iterator<MultiDataSet> {
    MultiDataSet next(int num);
    boolean resetSupported();
    boolean asyncSupported();
    void reset();
    void setPreProcessor(MultiDataSetPreProcessor preProcessor);
    MultiDataSetPreProcessor getPreProcessor();
    MultiDataSet loadFromMetaData(RecordMetaData recordMetaData);
    MultiDataSet loadFromMetaData(List<RecordMetaData> list);
}
```

### Alignment Modes

```java { .api }
public enum AlignmentMode {
    EQUAL_LENGTH, // Sequences must be same length
    ALIGN_START,  // Align sequences at start, pad end
    ALIGN_END     // Align sequences at end, pad start
}
```

### Exception Types

```java { .api }
public class ZeroLengthSequenceException extends RuntimeException {
    public ZeroLengthSequenceException();
    public ZeroLengthSequenceException(String type);
}
```