tessl/maven-org-deeplearning4j--deeplearning4j-nlp

DeepLearning4J Natural Language Processing module providing word embeddings, document classification, and text processing capabilities for neural network applications.

docs/dataset-loading.md

Dataset Loading and Iteration

Pre-built dataset loaders and iterators for common NLP datasets and data formats, designed for integration with neural network training pipelines. Provides standardized access to benchmark datasets, along with utilities for preparing custom data.

Capabilities

CNN Sentence Dataset Iterator

Dataset iterator specifically designed for Convolutional Neural Network sentence classification tasks with configurable preprocessing and batching.

/**
 * CNN sentence dataset iterator for sentence classification tasks
 * Provides standardized data preparation for CNN-based text classification
 */
public class CnnSentenceDataSetIterator {
    // CNN-specific dataset iteration with sentence-level batching and preprocessing;
    // instances are configured via the nested Builder class (see usage examples below)
}

Reuters News Groups Dataset

Standardized access to the Reuters News Groups dataset for document classification and text analysis benchmarking.

/**
 * Reuters News Groups dataset iterator
 * Provides access to Reuters news articles with category labels
 */
public class ReutersNewsGroupsDataSetIterator {
    // Iterator for Reuters dataset with automatic downloading and preprocessing
}

/**
 * Reuters News Groups dataset loader
 * Handles downloading, extraction, and preparation of Reuters dataset
 */
public class ReutersNewsGroupsLoader {
    // Dataset loading utilities for Reuters news groups data
}

Labeled Sentence Providers

Interface and implementations for providing labeled sentences to dataset iterators with various data source options.

/**
 * Interface for providing labeled sentences to dataset iterators
 */
public interface LabeledSentenceProvider {
    /**
     * Get total number of labeled sentences
     * @return Total count of available labeled sentences
     */
    int totalNumSentences();
    
    /**
     * Get all available sentence labels
     * @return List of all unique labels in the dataset
     */
    List<String> allLabels();
    
    /**
     * Get sentence at specific index
     * @param index Index of sentence to retrieve
     * @return Sentence string at specified index
     */
    String sentenceAt(int index);
    
    /**
     * Get label for sentence at specific index
     * @param index Index of sentence label to retrieve
     * @return Label string for sentence at specified index
     */
    String labelAt(int index);
}

/**
 * Collection-based labeled sentence provider
 * Provides labeled sentences from in-memory collections
 */
public class CollectionLabeledSentenceProvider implements LabeledSentenceProvider {
    
    /**
     * Create provider from sentence and label collections
     * @param sentences Collection of sentence strings
     * @param labels Collection of corresponding label strings
     */
    public CollectionLabeledSentenceProvider(Collection<String> sentences, Collection<String> labels);
    
    /**
     * Create provider from labeled document collection
     * @param documents Collection of LabelledDocument instances
     */
    public CollectionLabeledSentenceProvider(Collection<LabelledDocument> documents);
}

/**
 * File-based labeled sentence provider
 * Reads labeled sentences from file system with various formats
 */
public class FileLabeledSentenceProvider implements LabeledSentenceProvider {
    
    /**
     * Create provider from file with specified format
     * @param file File containing labeled sentences
     * @param format Format specification for parsing labeled data
     */
    public FileLabeledSentenceProvider(File file, LabeledSentenceFormat format);
    
    /**
     * Create provider from directory with label-based organization
     * @param directory Directory containing subdirectories for each label
     */
    public FileLabeledSentenceProvider(File directory);
}

Label-Aware Data Conversion

Utilities for converting between different labeled data formats and iterator types.

/**
 * Converter for label-aware data formats
 * Handles conversion between different labeled data representations
 */
public class LabelAwareConverter {
    
    /**
     * Convert label-aware iterator to standard format
     * @param iterator LabelAwareIterator to convert
     * @return Converted data in standard format
     */
    public static ConvertedData convert(LabelAwareIterator iterator);
    
    /**
     * Convert labeled document collection to provider format
     * @param documents Collection of LabelledDocument instances
     * @return LabeledSentenceProvider for the documents
     */
    public static LabeledSentenceProvider convert(Collection<LabelledDocument> documents);
}

Usage Examples:

import org.deeplearning4j.datasets.iterator.ReutersNewsGroupsDataSetIterator;
import org.deeplearning4j.datasets.loader.ReutersNewsGroupsLoader;
import org.deeplearning4j.iterator.*;
import org.deeplearning4j.iterator.provider.*;

// Reuters News Groups dataset usage
ReutersNewsGroupsDataSetIterator reutersIterator = new ReutersNewsGroupsDataSetIterator(
    32,      // batch size
    100,     // truncate length
    true,    // train set
    new DefaultTokenizerFactory()
);

while (reutersIterator.hasNext()) {
    DataSet batch = reutersIterator.next();
    // Process batch for training
    System.out.println("Batch size: " + batch.numExamples());
}

// Custom labeled sentence provider from collections
Collection<String> sentences = Arrays.asList(
    "This is a positive example",
    "This is a negative example",
    "Another positive case"
);

Collection<String> labels = Arrays.asList(
    "positive",
    "negative", 
    "positive"
);

LabeledSentenceProvider provider = new CollectionLabeledSentenceProvider(sentences, labels);
System.out.println("Total sentences: " + provider.totalNumSentences());
System.out.println("Available labels: " + provider.allLabels());

// Access specific sentences and labels
for (int i = 0; i < provider.totalNumSentences(); i++) {
    String sentence = provider.sentenceAt(i);
    String label = provider.labelAt(i);
    System.out.println("Sentence: " + sentence + " -> Label: " + label);
}

// File-based labeled sentence provider
File labeledDataFile = new File("labeled_data.txt");
FileLabeledSentenceProvider fileProvider = new FileLabeledSentenceProvider(
    labeledDataFile, 
    LabeledSentenceFormat.TAB_SEPARATED // or other format
);

// Directory-based provider (subdirectories as labels)
File dataDirectory = new File("data/");
// Expected structure:
// data/
//   positive/
//     file1.txt
//     file2.txt
//   negative/
//     file3.txt
//     file4.txt

FileLabeledSentenceProvider dirProvider = new FileLabeledSentenceProvider(dataDirectory);

// CNN sentence dataset iterator configuration
LabeledSentenceProvider sentenceProvider = new CollectionLabeledSentenceProvider(
    sentences, labels
);

CnnSentenceDataSetIterator cnnIterator = new CnnSentenceDataSetIterator.Builder()
    .sentenceProvider(sentenceProvider)
    .tokenizerFactory(new DefaultTokenizerFactory())
    .maxSentenceLength(100)
    .minibatchSize(32)
    .build();

// Use with neural network training
while (cnnIterator.hasNext()) {
    DataSet batch = cnnIterator.next();
    // Train CNN model with batch
}

// Reuters dataset downloading and preparation
ReutersNewsGroupsLoader loader = new ReutersNewsGroupsLoader();
// loader.downloadAndExtract(); // Downloads Reuters data if not present

// Advanced iterator configuration with custom preprocessing
TokenizerFactory customTokenizer = new DefaultTokenizerFactory();
customTokenizer.setTokenPreProcessor(new CommonPreprocessor());

CnnSentenceDataSetIterator advancedIterator = new CnnSentenceDataSetIterator.Builder()
    .sentenceProvider(provider)
    .tokenizerFactory(customTokenizer)
    .maxSentenceLength(150)
    .minibatchSize(64)
    .useNormalizedWordVectors(true)
    .build();

Dataset Integration Patterns

The dataset loading components support several common patterns:

Benchmark Dataset Access

  • Automatic downloading: Datasets are downloaded automatically when first accessed
  • Standardized preprocessing: Consistent text cleaning and tokenization across datasets
  • Train/test splits: Pre-defined data splits for reproducible experiments
  • Label encoding: Automatic conversion of text labels to numerical representations
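The label-encoding step above can be sketched in plain Java. This is an illustrative standalone example, not DL4J API: the class and method names (`LabelEncoderSketch`, `buildIndex`, `encode`) are hypothetical, but the approach — assigning each distinct text label a stable integer index in first-seen order — is the standard pattern a benchmark loader applies before one-hot encoding.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative label encoder (not DL4J code): maps text labels to stable
// integer indices, as a dataset loader might before one-hot encoding.
public class LabelEncoderSketch {

    // Assign each distinct label the next free index, in first-seen order
    public static Map<String, Integer> buildIndex(List<String> labels) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String label : labels) {
            index.putIfAbsent(label, index.size());
        }
        return index;
    }

    // Replace each text label with its integer index
    public static List<Integer> encode(List<String> labels, Map<String, Integer> index) {
        List<Integer> encoded = new ArrayList<>();
        for (String label : labels) {
            encoded.add(index.get(label));
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String> labels = List.of("positive", "negative", "positive");
        Map<String, Integer> index = buildIndex(labels);
        System.out.println(index);                 // {positive=0, negative=1}
        System.out.println(encode(labels, index)); // [0, 1, 0]
    }
}
```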

Custom Dataset Integration

  • Flexible input formats: Support for various file formats and directory structures
  • Label discovery: Automatic label extraction from filenames, directories, or file content
  • Memory efficiency: Streaming access to large datasets without loading everything into memory
  • Preprocessing pipelines: Integration with tokenization and text processing components
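Label discovery from a directory layout can be sketched as follows. This is a plain-Java illustration, not DL4J internals: the class name `DirectoryLabelScan` is hypothetical, but it mirrors the directory structure shown in the usage examples, where each subdirectory name is a label and its files hold that label's text.

```java
import java.io.File;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative label discovery (not DL4J code): treat each subdirectory of
// the root as a label and collect the files belonging to that label.
public class DirectoryLabelScan {

    public static Map<String, File[]> filesByLabel(File root) {
        Map<String, File[]> result = new LinkedHashMap<>();
        File[] labelDirs = root.listFiles(File::isDirectory);
        if (labelDirs == null) {
            return result; // root is not a readable directory
        }
        Arrays.sort(labelDirs); // deterministic label ordering
        for (File dir : labelDirs) {
            File[] files = dir.listFiles(File::isFile);
            result.put(dir.getName(), files == null ? new File[0] : files);
        }
        return result;
    }
}
```

Scanning the `data/` directory from the usage examples would yield a map with keys `negative` and `positive`, each pointing to that label's files — the same association a directory-based provider hands to the iterator.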

Neural Network Integration

  • Batch preparation: Automatic batching with configurable sizes for efficient training
  • Sequence padding: Handling variable-length text sequences with padding strategies
  • Label encoding: One-hot encoding and other label representations for classification
  • Memory management: Efficient data loading that scales to large datasets
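Two of the preparation steps above — sequence padding and one-hot label encoding — can be sketched in plain Java. The class and method names (`BatchPrepSketch`, `pad`, `oneHot`) are illustrative, not DL4J internals; a real iterator performs the equivalent work on NDArrays when assembling each minibatch.

```java
import java.util.Arrays;

// Illustrative batch-preparation steps (not DL4J code): right-pad
// variable-length token-id sequences and one-hot encode class labels.
public class BatchPrepSketch {

    // Pad (or truncate) a token-id sequence to maxLen, filling with padValue
    public static int[] pad(int[] tokens, int maxLen, int padValue) {
        int[] out = new int[maxLen];
        Arrays.fill(out, padValue);
        System.arraycopy(tokens, 0, out, 0, Math.min(tokens.length, maxLen));
        return out;
    }

    // One-hot vector for a label index over numClasses classes
    public static double[] oneHot(int labelIndex, int numClasses) {
        double[] out = new double[numClasses];
        out[labelIndex] = 1.0;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(pad(new int[]{5, 7}, 4, 0))); // [5, 7, 0, 0]
        System.out.println(Arrays.toString(oneHot(1, 3)));               // [0.0, 1.0, 0.0]
    }
}
```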

These dataset utilities provide the foundation for training and evaluating NLP models on both standard benchmarks and custom datasets, ensuring consistent data preparation across different model types and training scenarios.

Install with Tessl CLI

npx tessl i tessl/maven-org-deeplearning4j--deeplearning4j-nlp
