tessl/maven-org-datavec--datavec-api

ETL library for machine learning data preprocessing across diverse formats including HDFS, Spark, Images, Video, Audio, CSV, and Excel

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: mavenpkg:maven/org.datavec/datavec-api@0.9.x

To install, run

npx @tessl/cli install tessl/maven-org-datavec--datavec-api@0.9.0

# DataVec API

DataVec is an ETL (Extract, Transform, Load) library for machine learning data preprocessing across a wide variety of formats and sources, including HDFS, Spark, images, video, audio, CSV, and Excel. As part of the DL4J (DeepLearning4J) ecosystem, DataVec provides standardized interfaces for data readers, writers, and transformers used to ingest and preprocess data in machine learning workflows.

## Package Information

- **Package Name**: org.datavec:datavec-api
- **Package Type**: Maven
- **Language**: Java
- **Version**: 0.9.1
- **Installation**: Add to Maven dependencies:

```xml
<dependency>
    <groupId>org.datavec</groupId>
    <artifactId>datavec-api</artifactId>
    <version>0.9.1</version>
</dependency>
```

## Core Imports

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
```

For image processing (these classes ship in the separate `datavec-data-image` artifact, not in `datavec-api`):

```java
import org.datavec.image.recordreader.ImageRecordReader;
import org.datavec.image.loader.NativeImageLoader;
```

## Basic Usage

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
import java.io.File;
import java.util.List;

// BasicUsage is only an illustrative wrapper class so the snippet compiles
public class BasicUsage {
    // initialize() declares checked exceptions, so main propagates them
    public static void main(String[] args) throws Exception {
        // Create and initialize a CSV record reader
        RecordReader recordReader = new CSVRecordReader();
        recordReader.initialize(new FileSplit(new File("data.csv")));

        // Read records
        while (recordReader.hasNext()) {
            List<Writable> record = recordReader.next();
            // Process each record - contains data as Writable objects
            for (Writable writable : record) {
                System.out.println(writable.toString());
            }
        }

        // Reset for reuse
        recordReader.reset();
    }
}
```

## Architecture

DataVec is built around several key design patterns and components:

- **RecordReader Interface**: Universal abstraction for reading data from various sources with consistent hasNext()/next() iteration pattern
- **Writable Type System**: Type-safe data containers that wrap primitive types (IntWritable, DoubleWritable) and complex objects (NDArrayWritable)
- **InputSplit Hierarchy**: Flexible data source specification supporting files, directories, streams, and distributed sources
- **Converter Pattern**: WritableConverter interface enables custom data type transformations during reading
- **Metadata Tracking**: Comprehensive data lineage support through RecordMetaData for debugging and provenance
- **Iterator Integration**: Seamless integration with DL4J DataSetIterator for machine learning pipelines
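
The `hasNext()`/`next()`/`reset()` contract in the first bullet can be illustrated without DataVec on the classpath. The following is a minimal sketch, not DataVec code: `SimpleReader`, `InMemoryReader`, and `countRecords` are hypothetical stand-ins that mirror only the iteration methods of `RecordReader`, and records are modeled as `List<String>` rather than `List<Writable>`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

class IterationSketch {
    // Drains the reader and returns how many records were read.
    static int countRecords(SimpleReader reader) {
        int count = 0;
        while (reader.hasNext()) {
            reader.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        SimpleReader reader = new InMemoryReader(Arrays.asList(
                Arrays.asList("5.1", "3.5", "setosa"),
                Arrays.asList("6.2", "2.9", "versicolor")));
        System.out.println(countRecords(reader)); // first pass over the data
        reader.reset();                           // rewind to the first record
        System.out.println(countRecords(reader)); // second pass sees the same records
    }
}

// Hypothetical stand-in for the iteration part of RecordReader (not the DataVec interface).
interface SimpleReader {
    boolean hasNext();
    List<String> next();
    void reset();
}

// Hypothetical in-memory reader that serves pre-loaded records in order.
class InMemoryReader implements SimpleReader {
    private final List<List<String>> records;
    private int position = 0;

    InMemoryReader(List<List<String>> records) {
        this.records = records;
    }

    @Override
    public boolean hasNext() {
        return position < records.size();
    }

    @Override
    public List<String> next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more records");
        }
        return records.get(position++);
    }

    @Override
    public void reset() {
        position = 0;
    }
}
```

Because callers depend only on the three iteration methods, the same loop works for any source that implements them, which is the point of the abstraction.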

## Capabilities

### Record Readers

Core interfaces and implementations for reading structured data from various sources including CSV files, image directories, and in-memory collections. Provides consistent iteration patterns and metadata tracking.

```java { .api }
public interface RecordReader {
    void initialize(InputSplit split) throws IOException, InterruptedException;
    List<Writable> next();
    boolean hasNext();
    void reset();
    List<String> getLabels();
    Record nextRecord();
}

public class CSVRecordReader implements RecordReader {
    public CSVRecordReader();
    public CSVRecordReader(int skipLines, String delimiter);
}
```

[Record Readers](./record-readers.md)

### Data Types and Writables

Type-safe data containers that wrap Java primitives and objects for DataVec compatibility. Includes specialized types for machine learning data like NDArrayWritable for tensor operations.

```java { .api }
public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
    String toString();
}

public class IntWritable implements Writable {
    public IntWritable(int value);
    public int get();
}

public class DoubleWritable implements Writable {
    public DoubleWritable(double value);
    public double get();
}

public class NDArrayWritable implements Writable {
    public NDArrayWritable(INDArray array);
    public INDArray get();
}
```

[Data Types](./data-types.md)
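
The `write`/`readFields` pair gives each `Writable` a binary serialization contract. Here is a minimal sketch of that round trip using only the JDK. `SimpleWritable`, `SimpleIntWritable`, and `roundTrip` are hypothetical stand-ins, not the DataVec classes, and the four-byte `writeInt` encoding is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTrip {
    // Serializes a value to bytes, then restores it into a fresh instance.
    static int roundTrip(int value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                new SimpleIntWritable(value).write(out);
            }
            SimpleIntWritable restored = new SimpleIntWritable(0);
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                restored.readFields(in);
            }
            return restored.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(42));
    }
}

// Hypothetical stand-in mirroring the two serialization methods of Writable.
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical int wrapper; readFields overwrites the current value in place.
class SimpleIntWritable implements SimpleWritable {
    private int value;

    SimpleIntWritable(int value) {
        this.value = value;
    }

    int get() {
        return value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }
}
```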

### Input Sources and Splits

Flexible abstractions for specifying data sources including single files, file patterns, numbered sequences, and streaming data. Supports distributed processing and custom input source implementations.

```java { .api }
public interface InputSplit {
    URI[] locations();
    long length();
    double getWeight();
}

public class FileSplit implements InputSplit {
    public FileSplit(File file);
    public FileSplit(File[] files);
}

public class NumberedFileInputSplit implements InputSplit {
    public NumberedFileInputSplit(String basePattern, int minIndex, int maxIndex);
}
```

[Input Sources](./input-sources.md)
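
As an illustration of the numbered-sequence case, the sketch below expands a base pattern into concrete paths. It assumes the pattern contains a single `%d` placeholder and that both `minIndex` and `maxIndex` are inclusive; verify both assumptions against the DataVec Javadoc before relying on them. `expand` is a hypothetical helper, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class NumberedPatternSketch {
    // Hypothetical helper: expands e.g. "seq_%d.csv" for minIndex..maxIndex (both inclusive).
    static List<String> expand(String basePattern, int minIndex, int maxIndex) {
        if (!basePattern.contains("%d")) {
            throw new IllegalArgumentException("pattern must contain %d: " + basePattern);
        }
        if (maxIndex < minIndex) {
            throw new IllegalArgumentException("maxIndex must be >= minIndex");
        }
        List<String> paths = new ArrayList<>();
        for (int i = minIndex; i <= maxIndex; i++) {
            // Locale.ROOT keeps the digits ASCII regardless of the default locale
            paths.add(String.format(Locale.ROOT, basePattern, i));
        }
        return paths;
    }

    public static void main(String[] args) {
        for (String path : expand("data/seq_%d.csv", 0, 2)) {
            System.out.println(path);
        }
    }
}
```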

### Image Processing

Specialized record readers and utilities for processing image data including native image loading, format conversion, and integration with computer vision workflows.

```java { .api }
public class NativeImageLoader {
    public NativeImageLoader(long height, long width);
    public NativeImageLoader(long height, long width, long channels);
    public INDArray asMatrix(File file) throws IOException;
}

public class ImageRecordReader implements RecordReader {
    public ImageRecordReader(long height, long width, long channels, PathLabelGenerator labelGenerator);
}
```

[Image Processing](./image-processing.md)

### Data Transforms and Processing

Comprehensive transformation system for data preprocessing, cleaning, and feature engineering with column-level operations, mathematical transformations, and conditional logic.

```java { .api }
public class TransformProcess {
    public static Builder builder(Schema initialSchema);
    public List<Writable> execute(List<Writable> input);
    public List<List<Writable>> execute(List<List<Writable>> input);
}

public interface Transform {
    List<Writable> map(List<Writable> writables);
    String[] outputColumnNames();
    ColumnType[] outputColumnTypes();
}

public enum MathOp {
    Add, Subtract, Multiply, Divide, Square, Sqrt, Log, Exp, Sin, Cos, Abs
}

public enum ReduceOp {
    Min, Max, Sum, Mean, Stdev, Count, CountUnique
}
```

[Data Transforms](./transforms.md)
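
To make the idea of a column-level mathematical transformation concrete, here is a minimal sketch using only the JDK. `SimpleMathOp` and `applyToColumn` are hypothetical: they model four of the `MathOp` constants listed above, applied with a scalar to one column of a numeric record, and do not reproduce DataVec's transform classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleBinaryOperator;

class ColumnMathSketch {
    // Returns a copy of the record with op(value, scalar) applied to a single column.
    static List<Double> applyToColumn(List<Double> record, int column,
                                      SimpleMathOp op, double scalar) {
        List<Double> result = new ArrayList<>(record);
        result.set(column, op.apply(record.get(column), scalar));
        return result;
    }

    public static void main(String[] args) {
        List<Double> record = Arrays.asList(2.0, 10.0, 7.5);
        System.out.println(applyToColumn(record, 1, SimpleMathOp.Multiply, 3.0));
    }
}

// Hypothetical subset of binary math operations, each backed by a lambda.
enum SimpleMathOp {
    Add((a, b) -> a + b),
    Subtract((a, b) -> a - b),
    Multiply((a, b) -> a * b),
    Divide((a, b) -> a / b);

    private final DoubleBinaryOperator fn;

    SimpleMathOp(DoubleBinaryOperator fn) {
        this.fn = fn;
    }

    double apply(double value, double scalar) {
        return fn.applyAsDouble(value, scalar);
    }
}
```

The sketch copies the record instead of mutating it, so the input can still be inspected or reused after the transform runs.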

### Utilities and Helpers

Common utility classes for resource access, data conversion, and random operations. These support various DataVec operations including classpath resource loading and NDArray conversion.

```java { .api }
public class ClassPathResource {
    public ClassPathResource(String path);
    public File getTempFileFromArchive() throws IOException;
    public InputStream getInputStream() throws IOException;
}

public class RecordConverter {
    public static INDArray toArray(List<Writable> record);
    public static List<Writable> toRecord(INDArray array);
}

public class RandomUtils {
    public static void shuffle(List<?> list);
    public static void shuffle(List<?> list, Random random);
}
```

These utilities enable efficient resource management, data format conversion, and randomization operations essential for machine learning data preprocessing workflows.
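
Shuffling with an explicit `Random` is what makes a data ordering reproducible across runs. The sketch below shows the idea with the JDK's `Collections.shuffle(List, Random)`. It is analogous to, but not the same code as, the `RandomUtils.shuffle(list, random)` overload listed above, and `shuffledCopy` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SeededShuffleSketch {
    // Returns a shuffled copy; the same seed always yields the same order.
    static <T> List<T> shuffledCopy(List<T> items, long seed) {
        List<T> copy = new ArrayList<>(items);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.csv", "b.csv", "c.csv", "d.csv");
        List<String> first = shuffledCopy(files, 12345L);
        List<String> second = shuffledCopy(files, 12345L);
        // Both runs used the same seed, so the orderings match
        System.out.println(first.equals(second));
    }
}
```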

## Types

### Core Interfaces

```java { .api }
public interface RecordReader {
    void initialize(InputSplit split) throws IOException, InterruptedException;
    List<Writable> next();
    boolean hasNext();
    void reset();
    List<String> getLabels();
    Record nextRecord();
    boolean batchesSupported();
    List<Writable> next(int numRecords);
}

public interface SequenceRecordReader extends RecordReader {
    List<List<Writable>> sequenceRecord();
    List<List<Writable>> sequenceRecord(URI uri, DataInputStream dataInputStream) throws IOException;
    SequenceRecord nextSequence();
}

public interface InputSplit {
    URI[] locations();
    long length();
    double getWeight();
}

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
    String toString();
}
```

### Data Containers

```java { .api }
public interface Record {
    List<Writable> getRecord();
    RecordMetaData getMetaData();
}

public interface SequenceRecord {
    List<List<Writable>> getSequenceRecord();
    RecordMetaData getMetaData();
}
```
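
Pairing each record with metadata is what allows a bad value to be traced back to its origin. The following is a minimal sketch of that idea under stated assumptions: `SimpleRecord`, `readAll`, and the `source:lineIndex` location format are hypothetical and only loosely modeled on the `Record` and `RecordMetaData` pairing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecordMetadataSketch {
    // Parses comma-separated lines and remembers where each record came from.
    static List<SimpleRecord> readAll(String source, List<String> lines) {
        List<SimpleRecord> records = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            // limit -1 keeps trailing empty fields instead of dropping them
            List<String> values = Arrays.asList(lines.get(i).split(",", -1));
            records.add(new SimpleRecord(values, source, i));
        }
        return records;
    }

    public static void main(String[] args) {
        List<SimpleRecord> records = readAll("data.csv", Arrays.asList("1,2", "3,oops"));
        for (SimpleRecord record : records) {
            System.out.println(record.location() + " -> " + record.values);
        }
    }
}

// Hypothetical container holding a record's values plus its provenance.
final class SimpleRecord {
    final List<String> values;
    final String source;
    final int lineIndex; // zero-based index of the line within the source

    SimpleRecord(List<String> values, String source, int lineIndex) {
        this.values = values;
        this.source = source;
        this.lineIndex = lineIndex;
    }

    String location() {
        return source + ":" + lineIndex;
    }
}
```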

### Transform System

```java { .api }
public class Schema {
    public static class Builder {
        public Builder addColumnString(String name);
        public Builder addColumnInteger(String name);
        public Builder addColumnDouble(String name);
        public Builder addColumnCategorical(String name, List<String> categories);
        public Schema build();
    }
}

public class TransformProcess {
    public static Builder builder(Schema initialSchema);
}

public interface Transform {
    List<Writable> map(List<Writable> writables);
}
```

### Label Generation

```java { .api }
public interface PathLabelGenerator {
    Writable getLabelForPath(String path);
    Writable getLabelForPath(URI uri);
}

public class ParentPathLabelGenerator implements PathLabelGenerator {
    public ParentPathLabelGenerator();
}
```
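
A label generator of the parent-directory kind can be sketched with `java.nio.file`. This is an illustration under stated assumptions: `labelForPath` is hypothetical, returns a plain `String` instead of a `Writable`, and simply takes the name of the directory that directly contains the file, so `train/cat/001.jpg` is labeled `cat`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ParentLabelSketch {
    // Hypothetical helper: uses the immediate parent directory name as the label.
    static String labelForPath(String path) {
        Path parent = Paths.get(path).getParent();
        if (parent == null || parent.getFileName() == null) {
            // No named parent directory (bare file name or file directly under the root)
            throw new IllegalArgumentException("no parent directory in path: " + path);
        }
        return parent.getFileName().toString();
    }

    public static void main(String[] args) {
        System.out.println(labelForPath("train/cat/001.jpg"));
        System.out.println(labelForPath("train/dog/017.jpg"));
    }
}
```

This layout convention, one directory per class, is why an image reader can be constructed with a label generator and no separate label file.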