DeepLearning4J Natural Language Processing module providing word embeddings, document classification, and text processing capabilities for neural network applications.
Traditional text vectorization methods including TF-IDF and bag-of-words representations for document classification, information retrieval, and feature extraction tasks. Provides sparse vector representations that complement dense neural embeddings.
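To make the bag-of-words idea concrete, here is a minimal plain-Java sketch of count-based vectorization. It does not use the DL4J classes documented here; the class name `BagOfWordsSketch`, the regex tokenizer, and the alphabetical column ordering are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative only: a hand-rolled bag of words, not the DL4J implementation.
public class BagOfWordsSketch {

    // Lowercase the text and split on anything that is not a letter (simplifying assumption).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Give every distinct word a column index, in alphabetical order.
    public static Map<String, Integer> buildVocabulary(List<String> documents) {
        TreeSet<String> words = new TreeSet<>();
        for (String doc : documents) {
            words.addAll(tokenize(doc));
        }
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String word : words) {
            vocab.put(word, vocab.size());
        }
        return vocab;
    }

    // Count how often each vocabulary word occurs in one document; unknown words are skipped.
    public static int[] vectorize(String document, Map<String, Integer> vocab) {
        int[] counts = new int[vocab.size()];
        for (String token : tokenize(document)) {
            Integer column = vocab.get(token);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The quick brown fox jumps over the lazy dog",
            "A fast brown animal leaps over the sleeping canine");
        Map<String, Integer> vocab = buildVocabulary(docs);
        for (String doc : docs) {
            System.out.println(Arrays.toString(vectorize(doc, vocab)));
        }
    }
}
```

Most entries in each vector are zero because a document uses only a small part of the vocabulary, which is why these representations are described as sparse.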
Base interface for text vectorization implementations providing consistent API across different vectorization strategies.
/**
* Text vectorization interface for converting text to numerical representations
*/
public interface TextVectorizer {
// Core vectorization interface - implementations provide specific vectorization methods
}
Classic bag-of-words vectorization creating sparse representations based on word frequency counts.
/**
* Bag of Words vectorization implementation
* Creates sparse vector representations based on word frequency counts
*/
public class BagOfWordsVectorizer implements TextVectorizer {
// Bag of words implementation with configurable vocabulary and normalization
}
Term Frequency-Inverse Document Frequency vectorization for weighted sparse representations emphasizing discriminative terms.
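As a rough illustration of how TF-IDF weighting emphasizes discriminative terms, the following plain-Java sketch applies the textbook formula tf × ln(N / df). It is not the DL4J implementation, which may use a different weighting variant; the class name `TfidfSketch` and the whitespace tokenizer are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative only: textbook TF-IDF, not the DL4J implementation.
public class TfidfSketch {

    // Lowercase whitespace tokenization (simplifying assumption).
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Document frequency: the number of documents that contain each term.
    public static Map<String, Integer> documentFrequencies(List<String> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String term : new HashSet<>(tokenize(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Weight each term of one document as (relative term frequency) * ln(N / df).
    public static Map<String, Double> tfidf(String doc, List<String> corpus) {
        Map<String, Integer> df = documentFrequencies(corpus);
        List<String> tokens = tokenize(doc);
        Map<String, Double> weights = new HashMap<>();
        for (String term : tokens) {
            weights.merge(term, 1.0 / tokens.size(), Double::sum);
        }
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            // Terms never seen in the corpus are treated as if they occurred in one document.
            int docsWithTerm = df.getOrDefault(e.getKey(), 1);
            e.setValue(e.getValue() * Math.log((double) corpus.size() / docsWithTerm));
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
            "The quick brown fox jumps over the lazy dog",
            "A fast brown animal leaps over the sleeping canine",
            "Natural language processing with machine learning");
        System.out.println(tfidf(corpus.get(0), corpus));
    }
}
```

In this sketch a term that appears in every document receives weight zero, while a term confined to a single document receives the largest inverse document frequency factor.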
/**
* TF-IDF vectorization implementation
* Creates weighted sparse vectors emphasizing discriminative terms
*/
public class TfidfVectorizer implements TextVectorizer {
// TF-IDF implementation with configurable term weighting schemes
}
Abstract base implementation providing common functionality for text vectorization implementations.
/**
* Abstract base class for text vectorizers
* Provides common functionality and configuration patterns
*/
public abstract class BaseTextVectorizer implements TextVectorizer {
// Common vectorization infrastructure and utilities
}
Builder pattern for configuring text vectorizers with various parameters and data sources.
/**
* Builder for text vectorizer configuration
* Supports various vectorization strategies and parameters
*/
public class Builder {
// Configurable builder for text vectorization components
}
Utility for creating input streams from various text sources for vectorization processing.
/**
* Default input stream creator for text sources
* Handles various input formats and encodings
*/
public class DefaultInputStreamCreator {
// Input stream creation utilities for text processing
}
Usage Examples:
import org.deeplearning4j.bagofwords.vectorizer.*;
import java.util.Arrays;
import java.util.Collection;
// Basic bag of words vectorization
Collection<String> documents = Arrays.asList(
"The quick brown fox jumps over the lazy dog",
"A fast brown animal leaps over the sleeping canine",
"Natural language processing with machine learning"
);
// Configure bag of words vectorizer
BagOfWordsVectorizer bowVectorizer = new BagOfWordsVectorizer();
// Additional configuration would be done here based on actual API
// TF-IDF vectorization for document similarity
TfidfVectorizer tfidfVectorizer = new TfidfVectorizer();
// Configure TF-IDF parameters based on actual implementation
// Example usage pattern (actual API may vary):
// INDArray vectors = bowVectorizer.vectorize(documents);
// INDArray tfidfVectors = tfidfVectorizer.vectorize(documents);
// Builder pattern usage example
Builder vectorizerBuilder = new Builder()
// Configure vectorization parameters
// .vocabulary(customVocabulary)
// .minWordFrequency(2)
// .maxFeatures(10000)
;
// TextVectorizer vectorizer = vectorizerBuilder.build();
Bag of words and TF-IDF vectorizers can be used as:
The sparse representations from these vectorizers complement the dense embeddings from Word2Vec, GloVe, and ParagraphVectors, providing multiple perspectives on text data for different use cases and performance requirements.
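For the document-similarity use case mentioned in the usage example, sparse vectors are commonly compared with cosine similarity. The sketch below is a plain-Java illustration under assumed names (`SparseCosineSketch`, with vectors stored as term-to-weight maps); it is not part of the DL4J API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: cosine similarity over sparse term-weight maps.
public class SparseCosineSketch {

    // Euclidean length of a sparse vector.
    public static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two sparse vectors; returns 0 if either vector is all zeros.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Double> first = new HashMap<>();
        first.put("brown", 1.0);
        first.put("fox", 1.0);
        Map<String, Double> second = new HashMap<>();
        second.put("brown", 1.0);
        second.put("animal", 1.0);
        System.out.println(cosine(first, second));
    }
}
```

Storing only the non-zero entries keeps the comparison proportional to the number of terms a document actually contains, not to the full vocabulary size.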
Install with Tessl CLI
npx tessl i tessl/maven-org-deeplearning4j--deeplearning4j-nlp