# DataVec API

DataVec is a comprehensive ETL (Extract, Transform, Load) library for machine learning data preprocessing. It handles a wide variety of formats and sources, including CSV, images, video, audio, and Excel files, as well as distributed storage and processing via HDFS and Spark. As part of the DL4J (Deeplearning4j) ecosystem, DataVec provides standardized interfaces for data readers, writers, and transformers that enable seamless data ingestion and preprocessing for machine learning workflows.

## Package Information

- **Package Name**: org.datavec:datavec-api
- **Package Type**: Maven
- **Language**: Java
- **Version**: 0.9.1
- **Installation**: Add to Maven dependencies:

```xml
<dependency>
    <groupId>org.datavec</groupId>
    <artifactId>datavec-api</artifactId>
    <version>0.9.1</version>
</dependency>
```

## Core Imports

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
```

For image processing:

```java
import org.datavec.image.recordreader.ImageRecordReader;
import org.datavec.image.loader.NativeImageLoader;
```

## Basic Usage

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
import java.io.File;
import java.util.List;

// Create and initialize a CSV record reader
RecordReader recordReader = new CSVRecordReader();
recordReader.initialize(new FileSplit(new File("data.csv")));

// Read records; each record is a list of Writable objects
while (recordReader.hasNext()) {
    List<Writable> record = recordReader.next();
    for (Writable writable : record) {
        System.out.println(writable.toString());
    }
}

// Reset the reader for reuse
recordReader.reset();
```

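When the CSV file has a header row or uses a non-default delimiter, the two-argument `CSVRecordReader` constructor covers both. A minimal sketch, assuming a hypothetical semicolon-delimited file `labeled.csv` with one header line:

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
import java.io.File;
import java.util.List;

// Skip 1 header line; split columns on ";"
RecordReader reader = new CSVRecordReader(1, ";");
reader.initialize(new FileSplit(new File("labeled.csv")));

while (reader.hasNext()) {
    List<Writable> record = reader.next();
    // Parse the first column as a numeric feature
    double feature = Double.parseDouble(record.get(0).toString());
    System.out.println(feature);
}
reader.reset();
```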
## Architecture

DataVec is built around several key design patterns and components:

- **RecordReader Interface**: Universal abstraction for reading data from various sources with a consistent hasNext()/next() iteration pattern
- **Writable Type System**: Type-safe data containers that wrap primitive types (IntWritable, DoubleWritable) and complex objects (NDArrayWritable)
- **InputSplit Hierarchy**: Flexible data-source specification supporting files, directories, streams, and distributed sources
- **Converter Pattern**: The WritableConverter interface enables custom data type transformations during reading
- **Metadata Tracking**: Comprehensive data lineage support through RecordMetaData for debugging and provenance
- **Iterator Integration**: Seamless integration with the DL4J DataSetIterator for machine learning pipelines

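The iterator integration is typically wired through DL4J's `RecordReaderDataSetIterator`, which lives in the separate deeplearning4j artifacts rather than datavec-api. A hedged sketch, assuming a hypothetical `iris.csv` whose label is in column 4 with 3 classes:

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import java.io.File;

RecordReader reader = new CSVRecordReader();
reader.initialize(new FileSplit(new File("iris.csv")));

// batch size 10, label index 4, 3 output classes (all values are assumptions)
DataSetIterator iterator = new RecordReaderDataSetIterator(reader, 10, 4, 3);
while (iterator.hasNext()) {
    DataSet batch = iterator.next();
    System.out.println(batch.numExamples());
}
```

This keeps the reading/parsing concern in DataVec while DL4J handles batching and feature/label separation.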
## Capabilities

### Record Readers

Core interfaces and implementations for reading structured data from various sources, including CSV files, image directories, and in-memory collections. Provides consistent iteration patterns and metadata tracking.

```java { .api }
public interface RecordReader {
    void initialize(InputSplit split) throws IOException;
    List<Writable> next();
    boolean hasNext();
    void reset();
    List<String> getLabels();
    Record nextRecord();
}

public class CSVRecordReader implements RecordReader {
    public CSVRecordReader();
    public CSVRecordReader(int skipLines, String delimiter);
}
```

[Record Readers](./record-readers.md)

### Data Types and Writables

Type-safe data containers that wrap Java primitives and objects for DataVec compatibility. Includes specialized types for machine learning data, such as NDArrayWritable for tensor operations.

```java { .api }
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
    String toString();
}

public class IntWritable implements Writable {
    public IntWritable(int value);
    public int get();
}

public class DoubleWritable implements Writable {
    public DoubleWritable(double value);
    public double get();
}

public class NDArrayWritable implements Writable {
    public NDArrayWritable(INDArray array);
    public INDArray get();
}
```

[Data Types](./data-types.md)

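To make the container model concrete, a small sketch constructing and unwrapping writables (a record is simply an ordered list of `Writable` values; the numbers are arbitrary):

```java
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Writable;
import java.util.Arrays;
import java.util.List;

// One record with an integer column and a double column
List<Writable> record = Arrays.asList(new IntWritable(42), new DoubleWritable(3.14));

// Downcast to the concrete writable to recover the primitive value
int i = ((IntWritable) record.get(0)).get();
double d = ((DoubleWritable) record.get(1)).get();
System.out.println(i + " " + d);
```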
### Input Sources and Splits

Flexible abstractions for specifying data sources, including single files, file patterns, numbered sequences, and streaming data. Supports distributed processing and custom input-source implementations.

```java { .api }
public interface InputSplit {
    URI[] locations();
    long length();
    double getWeight();
}

public class FileSplit implements InputSplit {
    public FileSplit(File file);
    public FileSplit(File[] files);
}

public class NumberedFileInputSplit implements InputSplit {
    public NumberedFileInputSplit(String basePattern, int minIndex, int maxIndex);
}
```

[Input Sources](./input-sources.md)

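`NumberedFileInputSplit` expands a printf-style pattern into a sequence of file URIs, which is convenient for numbered sequence files. A sketch, with a hypothetical path:

```java
import org.datavec.api.split.NumberedFileInputSplit;

// The %d placeholder is replaced by each index from minIndex to maxIndex,
// so this covers /data/myfile_0.csv through /data/myfile_9.csv
NumberedFileInputSplit split = new NumberedFileInputSplit("/data/myfile_%d.csv", 0, 9);

// One URI per index
System.out.println(split.locations().length);
```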
### Image Processing

Specialized record readers and utilities for processing image data, including native image loading, format conversion, and integration with computer vision workflows.

```java { .api }
public class NativeImageLoader {
    public NativeImageLoader(long height, long width);
    public NativeImageLoader(long height, long width, long channels);
    public INDArray asMatrix(File file) throws IOException;
}

public class ImageRecordReader implements RecordReader {
    public ImageRecordReader(long height, long width, long channels, PathLabelGenerator labelGenerator);
}
```

[Image Processing](./image-processing.md)

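A typical image pipeline pairs `ImageRecordReader` with `ParentPathLabelGenerator`, which derives each label from the image's parent directory name. A hedged sketch, assuming a hypothetical `train/<label>/image.png` directory layout:

```java
import org.datavec.api.io.labels.ParentPathLabelGenerator;
import org.datavec.api.split.FileSplit;
import org.datavec.image.recordreader.ImageRecordReader;
import java.io.File;

// Label each image by its parent folder name
ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();

// Scale every image to 64x64 with 3 channels (RGB)
ImageRecordReader reader = new ImageRecordReader(64, 64, 3, labelMaker);
reader.initialize(new FileSplit(new File("train/")));

// One label per class subdirectory discovered under train/
System.out.println(reader.getLabels());
```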
### Data Transforms and Processing

Comprehensive transformation system for data preprocessing, cleaning, and feature engineering, with column-level operations, mathematical transformations, and conditional logic.

```java { .api }
public class TransformProcess {
    public static Builder builder(Schema initialSchema);
    public List<Writable> execute(List<Writable> input);
    public List<List<Writable>> execute(List<List<Writable>> input);
}

public interface Transform {
    List<Writable> map(List<Writable> writables);
    String[] outputColumnNames();
    ColumnType[] outputColumnTypes();
}

public enum MathOp {
    Add, Subtract, Multiply, Divide, Square, Sqrt, Log, Exp, Sin, Cos, Abs
}

public enum ReduceOp {
    Min, Max, Sum, Mean, Stdev, Count, CountUnique
}
```

[Data Transforms](./transforms.md)

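Putting the transform system together: define a `Schema`, describe operations against it, then apply the resulting `TransformProcess` to records. A hedged sketch using the commonly seen `new TransformProcess.Builder(schema)` form and assumed builder methods (`removeColumns`, `doubleMathOp`); the column names are hypothetical:

```java
import org.datavec.api.transform.MathOp;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

// Declare the shape of the incoming records
Schema schema = new Schema.Builder()
        .addColumnString("name")
        .addColumnDouble("price")
        .addColumnInteger("quantity")
        .build();

// Drop the string column and scale prices by 10%
TransformProcess tp = new TransformProcess.Builder(schema)
        .removeColumns("name")
        .doubleMathOp("price", MathOp.Multiply, 1.1)
        .build();

// The process tracks the schema it produces
System.out.println(tp.getFinalSchema());
```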
### Utilities and Helpers

Common utility classes for resource access, data conversion, and random operations. These support various DataVec operations, including classpath resource loading and NDArray conversion.

```java { .api }
public class ClassPathResource {
    public ClassPathResource(String path);
    public File getTempFileFromArchive() throws IOException;
    public InputStream getInputStream() throws IOException;
}

public class RecordConverter {
    public static INDArray toArray(List<Writable> record);
    public static List<Writable> toRecord(INDArray array);
}

public class RandomUtils {
    public static void shuffle(List<?> list);
    public static void shuffle(List<?> list, Random random);
}
```

These utilities enable efficient resource management, data format conversion, and randomization operations essential for machine learning data preprocessing workflows.

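For instance, `RecordConverter` round-trips between record lists and ND4J arrays. A hedged sketch; the import path for `RecordConverter` varies across DataVec releases, so it is an assumption here:

```java
import org.datavec.api.util.ndarray.RecordConverter; // package may differ by version
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Writable;
import org.nd4j.linalg.api.ndarray.INDArray;
import java.util.Arrays;
import java.util.List;

List<Writable> record = Arrays.asList(new DoubleWritable(1.0), new DoubleWritable(2.0));

// Record -> row vector, then back to a record of writables
INDArray row = RecordConverter.toArray(record);
List<Writable> back = RecordConverter.toRecord(row);
System.out.println(back.size());
```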
## Types

### Core Interfaces

```java { .api }
public interface RecordReader {
    void initialize(InputSplit split) throws IOException;
    List<Writable> next();
    boolean hasNext();
    void reset();
    List<String> getLabels();
    Record nextRecord();
    boolean batchesSupported();
    List<Writable> next(int numRecords);
}

public interface SequenceRecordReader extends RecordReader {
    List<List<Writable>> sequenceRecord();
    List<List<Writable>> sequenceRecord(URI uri, DataInputStream dataInputStream) throws IOException;
    SequenceRecord nextSequence();
}

public interface InputSplit {
    URI[] locations();
    long length();
    double getWeight();
}

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
    String toString();
}
```

### Data Containers

```java { .api }
public interface Record {
    List<Writable> getRecord();
    RecordMetaData getMetaData();
}

public interface SequenceRecord {
    List<List<Writable>> getSequenceRecord();
    RecordMetaData getMetaData();
}
```

### Transform System

```java { .api }
public class Schema {
    public static class Builder {
        public Builder addColumnString(String name);
        public Builder addColumnInteger(String name);
        public Builder addColumnDouble(String name);
        public Builder addColumnCategorical(String name, List<String> categories);
        public Schema build();
    }
}

public class TransformProcess {
    public static Builder builder(Schema initialSchema);
}

public interface Transform {
    List<Writable> map(List<Writable> writables);
}
```

### Label Generation

```java { .api }
public interface PathLabelGenerator {
    Writable getLabelForPath(String path);
    Writable getLabelForPath(URI uri);
}

public class ParentPathLabelGenerator implements PathLabelGenerator {
    public ParentPathLabelGenerator();
}
```