DataVec API

DataVec is a comprehensive ETL (Extract, Transform, Load) library for machine learning data preprocessing across a wide variety of formats and sources, including HDFS, Spark, images, video, audio, CSV, Excel, and more. As part of the DL4J (DeepLearning4J) ecosystem, DataVec provides standardized interfaces for data readers, writers, and transformers that enable seamless data ingestion and preprocessing for machine learning workflows.

Package Information

  • Package Name: org.datavec:datavec-api
  • Package Type: Maven
  • Language: Java
  • Version: 0.9.1
  • Installation: Add to Maven dependencies:
<dependency>
    <groupId>org.datavec</groupId>
    <artifactId>datavec-api</artifactId>
    <version>0.9.1</version>
</dependency>

Core Imports

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;

For image processing:

import org.datavec.image.recordreader.ImageRecordReader;
import org.datavec.image.loader.NativeImageLoader;

Basic Usage

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
import java.io.File;
import java.util.List;

// Create and initialize a CSV record reader
RecordReader recordReader = new CSVRecordReader();
recordReader.initialize(new FileSplit(new File("data.csv")));

// Read records
while (recordReader.hasNext()) {
    List<Writable> record = recordReader.next();
    // Process each record - contains data as Writable objects
    for (Writable writable : record) {
        System.out.println(writable.toString());
    }
}

// Reset for reuse
recordReader.reset();

Architecture

DataVec is built around several key design patterns and components:

  • RecordReader Interface: Universal abstraction for reading data from various sources with consistent hasNext()/next() iteration pattern
  • Writable Type System: Type-safe data containers that wrap primitive types (IntWritable, DoubleWritable) and complex objects (NDArrayWritable)
  • InputSplit Hierarchy: Flexible data source specification supporting files, directories, streams, and distributed sources
  • Converter Pattern: WritableConverter interface enables custom data type transformations during reading
  • Metadata Tracking: Comprehensive data lineage support through RecordMetaData for debugging and provenance (see the sketch after this list)
  • Iterator Integration: Seamless integration with DL4J DataSetIterator for machine learning pipelines
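
As a sketch of the metadata tracking pattern (reusing the hypothetical data.csv from Basic Usage), nextRecord() pairs the parsed values with their provenance:

import org.datavec.api.records.Record;
import org.datavec.api.records.metadata.RecordMetaData;
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import java.io.File;

RecordReader reader = new CSVRecordReader();
reader.initialize(new FileSplit(new File("data.csv")));

// nextRecord() returns the parsed values plus their provenance
Record record = reader.nextRecord();
RecordMetaData meta = record.getMetaData();
System.out.println("Read from: " + meta.getURI());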

Capabilities

Record Readers

Core interfaces and implementations for reading structured data from various sources including CSV files, image directories, and in-memory collections. Provides consistent iteration patterns and metadata tracking.

public interface RecordReader {
    void initialize(InputSplit split) throws IOException;
    List<Writable> next();
    boolean hasNext();
    void reset();
    List<String> getLabels();
    Record nextRecord();
}

public class CSVRecordReader implements RecordReader {
    public CSVRecordReader();
    public CSVRecordReader(int skipLines, String delimiter);
}
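
For example, a reader that skips a header row and splits on commas can be set up as follows (a minimal sketch; data.csv is a placeholder path):

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
import java.io.File;
import java.util.List;

// Skip one header line; the delimiter is a String in the 0.9.x API
RecordReader reader = new CSVRecordReader(1, ",");
reader.initialize(new FileSplit(new File("data.csv")));
while (reader.hasNext()) {
    List<Writable> row = reader.next(); // one parsed CSV line per call
}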

See docs/record-readers.md for the full reference.

Data Types and Writables

Type-safe data containers that wrap Java primitives and objects for DataVec compatibility. Includes specialized types for machine learning data like NDArrayWritable for tensor operations.

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
    String toString();
}

public class IntWritable implements Writable {
    public IntWritable(int value);
    public int get();
}

public class DoubleWritable implements Writable {
    public DoubleWritable(double value);
    public double get();
}

public class NDArrayWritable implements Writable {
    public NDArrayWritable(INDArray array);
    public INDArray get();
}
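
A short sketch of constructing and unwrapping these containers (NDArrayWritable works the same way around an ND4J INDArray, assuming ND4J is on the classpath):

import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Writable;

Writable label = new IntWritable(3);
Writable feature = new DoubleWritable(0.75);

// Downcast to the concrete type to unwrap the primitive value
int labelValue = ((IntWritable) label).get();            // 3
double featureValue = ((DoubleWritable) feature).get();  // 0.75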

See docs/data-types.md for the full reference.

Input Sources and Splits

Flexible abstractions for specifying data sources including single files, file patterns, numbered sequences, and streaming data. Supports distributed processing and custom input source implementations.

public interface InputSplit {
    URI[] locations();
    long length();
    double getWeight();
}

public class FileSplit implements InputSplit {
    public FileSplit(File file);
    public FileSplit(File[] files);
}

public class NumberedFileInputSplit implements InputSplit {
    public NumberedFileInputSplit(String basePattern, int minIndex, int maxIndex);
}
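
For instance (all paths hypothetical), a split over a directory and a numbered file sequence look like this; NumberedFileInputSplit substitutes each index in [minIndex, maxIndex] for the %d in the pattern:

import org.datavec.api.split.FileSplit;
import org.datavec.api.split.NumberedFileInputSplit;
import java.io.File;

// Every file under a directory
FileSplit dirSplit = new FileSplit(new File("/path/to/data/"));

// myfile_0.csv through myfile_9.csv
NumberedFileInputSplit numbered =
        new NumberedFileInputSplit("/path/to/myfile_%d.csv", 0, 9);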

See docs/input-sources.md for the full reference.

Image Processing

Specialized record readers and utilities for processing image data including native image loading, format conversion, and integration with computer vision workflows.

public class NativeImageLoader {
    public NativeImageLoader(long height, long width);
    public NativeImageLoader(long height, long width, long channels);
    public INDArray asMatrix(File file) throws IOException;
}

public class ImageRecordReader implements RecordReader {
    public ImageRecordReader(long height, long width, long channels, PathLabelGenerator labelGenerator);
}
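
A typical image classification setup labels each image by its parent directory, as sketched below (the images/ layout is hypothetical; note that these classes ship in the separate datavec-data-image artifact):

import org.datavec.api.io.labels.ParentPathLabelGenerator;
import org.datavec.api.split.FileSplit;
import org.datavec.image.recordreader.ImageRecordReader;
import java.io.File;

// e.g. images/cats/1.png is labeled "cats"
ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
ImageRecordReader reader = new ImageRecordReader(28, 28, 1, labelMaker);
reader.initialize(new FileSplit(new File("images/")));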

See docs/image-processing.md for the full reference.

Data Transforms and Processing

Comprehensive transformation system for data preprocessing, cleaning, and feature engineering with column-level operations, mathematical transformations, and conditional logic.

public class TransformProcess {
    public static Builder builder(Schema initialSchema);
    public List<Writable> execute(List<Writable> input);
    public List<List<Writable>> execute(List<List<Writable>> input);
}

public interface Transform {
    List<Writable> map(List<Writable> writables);
    String[] outputColumnNames();
    ColumnType[] outputColumnTypes();
}

public enum MathOp {
    Add, Subtract, Multiply, Divide, Square, Sqrt, Log, Exp, Sin, Cos, Abs
}

public enum ReduceOp {
    Min, Max, Sum, Mean, Stdev, Count, CountUnique
}
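
A minimal end-to-end sketch (using the new TransformProcess.Builder(schema) constructor form, with hypothetical column names):

import org.datavec.api.transform.MathOp;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

// Declare the shape of the incoming records
Schema schema = new Schema.Builder()
        .addColumnString("name")
        .addColumnDouble("value")
        .build();

// Drop the string column, then double the numeric one
TransformProcess tp = new TransformProcess.Builder(schema)
        .removeColumns("name")
        .doubleMathOp("value", MathOp.Multiply, 2.0)
        .build();

Each record run through tp.execute(...) then comes out with the name column removed and value doubled.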

See docs/transforms.md for the full reference.

Utilities and Helpers

Common utility classes for resource access, data conversion, and random operations. These support various DataVec operations including classpath resource loading and NDArray conversion.

public class ClassPathResource {
    public ClassPathResource(String path);
    public File getTempFileFromArchive() throws IOException;
    public InputStream getInputStream() throws IOException;
}

public class RecordConverter {
    public static INDArray toArray(List<Writable> record);
    public static List<Writable> toRecord(INDArray array);
}

public class RandomUtils {
    public static void shuffle(List<?> list);
    public static void shuffle(List<?> list, Random random);
}

These utilities enable efficient resource management, data format conversion, and randomization operations essential for machine learning data preprocessing workflows.
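
As one combined example (a sketch: iris.txt is a hypothetical classpath resource, and the RecordConverter step additionally requires ND4J):

import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.util.ClassPathResource;
import org.datavec.api.writable.Writable;
import java.io.File;
import java.util.List;

// Resolve a bundled resource to a real file (extracted if inside a jar)
File dataFile = new ClassPathResource("iris.txt").getTempFileFromArchive();

RecordReader reader = new CSVRecordReader();
reader.initialize(new FileSplit(dataFile));
List<Writable> first = reader.next();
// RecordConverter.toArray(first) would turn this row into a 1-row INDArray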

Types

Core Interfaces

public interface RecordReader {
    void initialize(InputSplit split) throws IOException;
    List<Writable> next();
    boolean hasNext();
    void reset();
    List<String> getLabels();
    Record nextRecord();
    boolean batchesSupported();
    List<Writable> next(int numRecords);
}

public interface SequenceRecordReader extends RecordReader {
    List<List<Writable>> sequenceRecord();
    List<List<Writable>> sequenceRecord(URI uri, DataInputStream dataInputStream) throws IOException;
    SequenceRecord nextSequence();
}
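
A common pairing for time series is CSVSequenceRecordReader with NumberedFileInputSplit, one sequence per numbered file (a sketch; the file pattern is a placeholder):

import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.csv.CSVSequenceRecordReader;
import org.datavec.api.split.NumberedFileInputSplit;
import org.datavec.api.writable.Writable;
import java.util.List;

// Skip 0 header lines, comma-delimited; reads features_0.csv .. features_9.csv
SequenceRecordReader seqReader = new CSVSequenceRecordReader(0, ",");
seqReader.initialize(new NumberedFileInputSplit("features_%d.csv", 0, 9));

// One sequence: outer list = time steps, inner list = values at one step
List<List<Writable>> sequence = seqReader.sequenceRecord();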

public interface InputSplit {
    URI[] locations();
    long length();
    double getWeight();
}

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
    String toString();
}

Data Containers

public interface Record {
    List<Writable> getRecord();
    RecordMetaData getMetaData();
}

public interface SequenceRecord {
    List<List<Writable>> getSequenceRecord();
    RecordMetaData getMetaData();
}

Transform System

public class Schema {
    public static class Builder {
        public Builder addColumnString(String name);
        public Builder addColumnInteger(String name);
        public Builder addColumnDouble(String name);
        public Builder addColumnCategorical(String name, List<String> categories);
        public Schema build();
    }
}

public class TransformProcess {
    public static Builder builder(Schema initialSchema);
}

public interface Transform {
    List<Writable> map(List<Writable> writables);
}

Label Generation

public interface PathLabelGenerator {
    Writable getLabelForPath(String path);
    Writable getLabelForPath(URI uri);
}

public class ParentPathLabelGenerator implements PathLabelGenerator {
    public ParentPathLabelGenerator();
}
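
For example, ParentPathLabelGenerator derives the label from the immediate parent directory name (a sketch with a hypothetical path):

import org.datavec.api.io.labels.ParentPathLabelGenerator;
import org.datavec.api.writable.Writable;

ParentPathLabelGenerator labelMaker = new ParentPathLabelGenerator();
// Label = name of the file's parent directory, here "cats"
Writable label = labelMaker.getLabelForPath("/data/train/cats/img001.png");
System.out.println(label); // cats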