
tessl/maven-org-deeplearning4j--deeplearning4j-nlp

DeepLearning4J Natural Language Processing module providing word embeddings, document classification, and text processing capabilities for neural network applications.


docs/bag-of-words.md

Bag of Words Vectorization

Traditional text vectorization methods including TF-IDF and bag-of-words representations for document classification, information retrieval, and feature extraction tasks. Provides sparse vector representations that complement dense neural embeddings.

Capabilities

Text Vectorization Interface

Base interface for text vectorization implementations, providing a consistent API across different vectorization strategies.

/**
 * Text vectorization interface for converting text to numerical representations
 */
public interface TextVectorizer {
    // Core vectorization interface - implementations provide specific vectorization methods
}

Bag of Words Vectorizer

Classic bag-of-words vectorization creating sparse representations based on word frequency counts.

/**
 * Bag of Words vectorization implementation
 * Creates sparse vector representations based on word frequency counts
 */
public class BagOfWordsVectorizer implements TextVectorizer {
    // Bag of words implementation with configurable vocabulary and normalization
}
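To make the idea concrete, here is a standalone sketch of bag-of-words counting in plain Java. The class and method names (`BowSketch`, `buildVocabulary`, `vectorize`) are illustrative only and not part of the DL4J API, where vocabulary construction is driven by the vectorizer's iterator and builder configuration.

```java
import java.util.*;

/**
 * Minimal bag-of-words sketch: maps documents to term-count vectors
 * over a shared vocabulary. Illustrative only; not the DL4J API.
 */
public class BowSketch {

    /** Builds a sorted vocabulary from lowercased, whitespace-tokenized documents. */
    public static List<String> buildVocabulary(List<String> documents) {
        SortedSet<String> vocab = new TreeSet<>();
        for (String doc : documents) {
            vocab.addAll(Arrays.asList(doc.toLowerCase().split("\\s+")));
        }
        return new ArrayList<>(vocab);
    }

    /** Returns the raw term-frequency vector of one document over the vocabulary. */
    public static int[] vectorize(String document, List<String> vocab) {
        int[] counts = new int[vocab.size()];
        for (String token : document.toLowerCase().split("\\s+")) {
            int idx = vocab.indexOf(token);
            if (idx >= 0) counts[idx]++; // out-of-vocabulary tokens are ignored
        }
        return counts;
    }
}
```

Most entries in such a vector are zero for any realistic vocabulary, which is why these representations are stored and processed as sparse vectors.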

TF-IDF Vectorizer

Term Frequency-Inverse Document Frequency vectorization for weighted sparse representations emphasizing discriminative terms.

/**
 * TF-IDF vectorization implementation
 * Creates weighted sparse vectors emphasizing discriminative terms
 */
public class TfidfVectorizer implements TextVectorizer {
    // TF-IDF implementation with configurable term weighting schemes
}
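The weighting itself is simple to state: a term's weight in a document is its term frequency scaled by the inverse document frequency across the corpus. The following self-contained sketch uses the common idf(t) = ln(N / df(t)) formulation; DL4J's implementation may apply different smoothing, and the names here are illustrative rather than part of its API.

```java
import java.util.*;

/**
 * Minimal TF-IDF sketch: weights raw term counts by inverse document
 * frequency, so terms common to every document score near zero while
 * discriminative terms score higher. Illustrative only.
 */
public class TfidfSketch {

    /** Number of documents in the corpus that contain the term. */
    public static int documentFrequency(String term, List<String> documents) {
        int df = 0;
        for (String doc : documents) {
            if (Arrays.asList(doc.toLowerCase().split("\\s+")).contains(term)) df++;
        }
        return df;
    }

    /** TF-IDF weight of a term in one document, given the whole corpus. */
    public static double tfidf(String term, String document, List<String> documents) {
        long tf = Arrays.stream(document.toLowerCase().split("\\s+"))
                        .filter(term::equals)
                        .count();
        int df = documentFrequency(term, documents);
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) documents.size() / df);
    }
}
```

Note the limiting behavior: a term appearing in every document has idf = ln(N/N) = 0, so its weight vanishes regardless of how often it occurs, which is exactly the "emphasize discriminative terms" property described above.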

Base Text Vectorizer

Abstract base class providing common functionality shared by the concrete vectorizers.

/**
 * Abstract base class for text vectorizers
 * Provides common functionality and configuration patterns
 */
public abstract class BaseTextVectorizer implements TextVectorizer {
    // Common vectorization infrastructure and utilities
}

Vectorizer Builder

Builder pattern for configuring text vectorizers with various parameters and data sources.

/**
 * Builder for text vectorizer configuration
 * Supports various vectorization strategies and parameters
 */
public class Builder {
    // Configurable builder for text vectorization components
}

Input Stream Creator

Utility for creating input streams from various text sources for vectorization processing.

/**
 * Default input stream creator for text sources
 * Handles various input formats and encodings
 */
public class DefaultInputStreamCreator {
    // Input stream creation utilities for text processing
}

Usage Examples:

import org.deeplearning4j.bagofwords.vectorizer.*;

// Basic bag of words vectorization
Collection<String> documents = Arrays.asList(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown animal leaps over the sleeping canine",
    "Natural language processing with machine learning"
);

// Configure bag of words vectorizer
BagOfWordsVectorizer bowVectorizer = new BagOfWordsVectorizer();
// Additional configuration would be done here based on actual API

// TF-IDF vectorization for document similarity
TfidfVectorizer tfidfVectorizer = new TfidfVectorizer();
// Configure TF-IDF parameters based on actual implementation

// Example usage pattern (actual API may vary):
// INDArray vectors = bowVectorizer.vectorize(documents);
// INDArray tfidfVectors = tfidfVectorizer.vectorize(documents);

// Builder pattern usage example
Builder vectorizerBuilder = new Builder();
// Configure vectorization parameters, e.g.:
// .vocabulary(customVocabulary)
// .minWordFrequency(2)
// .maxFeatures(10000)

// TextVectorizer vectorizer = vectorizerBuilder.build();

Integration with Neural Models

Bag of words and TF-IDF vectorizers can be used as:

  • Feature extraction: Converting text to numerical features for traditional ML algorithms
  • Preprocessing: Initial text representation before neural processing
  • Baseline comparison: Comparing neural embeddings against classical methods
  • Hybrid approaches: Combining sparse and dense representations for improved performance
  • Cold start solutions: Handling new documents without neural model inference overhead

The sparse representations from these vectorizers complement the dense embeddings from Word2Vec, GloVe, and ParagraphVectors, providing multiple perspectives on text data for different use cases and performance requirements.
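One common hybrid approach is simply concatenating the sparse count vector with a dense embedding into a single feature vector before handing it to a downstream classifier. A minimal sketch with plain Java arrays rather than `INDArray`; the class and method names are illustrative:

```java
import java.util.*;

/**
 * Sketch of a hybrid representation: concatenating a sparse count vector
 * (e.g. bag-of-words) with a dense embedding (e.g. from Word2Vec) into
 * one feature vector. Illustrative only.
 */
public class HybridFeatures {

    /** Concatenates sparse integer counts and dense weights into one vector. */
    public static double[] concat(int[] sparseCounts, double[] denseEmbedding) {
        double[] out = new double[sparseCounts.length + denseEmbedding.length];
        for (int i = 0; i < sparseCounts.length; i++) {
            out[i] = sparseCounts[i]; // sparse features occupy the leading slots
        }
        System.arraycopy(denseEmbedding, 0, out, sparseCounts.length,
                         denseEmbedding.length);
        return out;
    }
}
```

In practice the two halves are often scaled or normalized separately before concatenation, since raw counts and embedding coordinates live on very different numeric ranges.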

Install with Tessl CLI

npx tessl i tessl/maven-org-deeplearning4j--deeplearning4j-nlp
