tessl/maven-org-deeplearning4j--deeplearning4j-nlp

DeepLearning4J Natural Language Processing module providing word embeddings, document classification, and text processing capabilities for neural network applications.

Document Embeddings (ParagraphVectors)

Name: tessl/maven-org-deeplearning4j--deeplearning4j-nlp
Author: tessl

Document-level embeddings implementation (Doc2Vec) that creates vector representations for entire documents, sentences, or paragraphs. Enables document similarity comparison, classification, clustering, and information retrieval tasks with neural embeddings.

Capabilities

ParagraphVectors Model

Main ParagraphVectors implementation extending Word2Vec with document-level representation learning and inference capabilities.

/**
 * ParagraphVectors (Doc2Vec) implementation extending Word2Vec
 * Provides document-level embeddings and classification capabilities
 */
public class ParagraphVectors extends Word2Vec {
    
    /**
     * Predict label for raw text (deprecated - use predict with document types)
     * @param rawText Raw text string to classify
     * @return Most probable label string
     */
    @Deprecated
    public String predict(String rawText);
    
    /**
     * Predict label for labeled document
     * @param document LabelledDocument instance to classify
     * @return Most probable label string
     */
    public String predict(LabelledDocument document);
    
    /**
     * Predict label for list of vocabulary words
     * @param document List of VocabWord instances
     * @return Most probable label string
     */
    public String predict(List<VocabWord> document);
    
    /**
     * Predict multiple labels for labeled document
     * @param document LabelledDocument to classify
     * @param limit Maximum number of labels to return
     * @return Collection of probable labels in descending order
     */
    public Collection<String> predictSeveral(LabelledDocument document, int limit);
    
    /**
     * Predict multiple labels for raw text
     * @param rawText Raw text string to classify
     * @param limit Maximum number of labels to return
     * @return Collection of probable labels in descending order
     */
    public Collection<String> predictSeveral(String rawText, int limit);
    
    /**
     * Predict multiple labels for word list
     * @param document List of VocabWord instances
     * @param limit Maximum number of labels to return
     * @return Collection of probable labels in descending order
     */
    public Collection<String> predictSeveral(List<VocabWord> document, int limit);
    
    /**
     * Calculate inferred vector for text with custom training parameters
     * @param text Raw text string to vectorize
     * @param learningRate Learning rate for inference training
     * @param minLearningRate Minimum learning rate threshold
     * @param iterations Number of inference iterations
     * @return INDArray vector representation of the text
     */
    public INDArray inferVector(String text, double learningRate, double minLearningRate, int iterations);
    
    /**
     * Calculate inferred vector for document with custom parameters
     * @param document LabelledDocument to vectorize
     * @param learningRate Learning rate for inference training
     * @param minLearningRate Minimum learning rate threshold
     * @param iterations Number of inference iterations
     * @return INDArray vector representation of the document
     */
    public INDArray inferVector(LabelledDocument document, double learningRate, double minLearningRate, int iterations);
    
    /**
     * Calculate inferred vector for word list with custom parameters
     * @param document List of VocabWord instances to vectorize
     * @param learningRate Learning rate for inference training
     * @param minLearningRate Minimum learning rate threshold
     * @param iterations Number of inference iterations
     * @return INDArray vector representation of the word list
     */
    public INDArray inferVector(List<VocabWord> document, double learningRate, double minLearningRate, int iterations);
    
    /**
     * Calculate inferred vector for text with default parameters
     * @param text Raw text string to vectorize
     * @return INDArray vector representation using default parameters
     */
    public INDArray inferVector(String text);
    
    /**
     * Calculate inferred vector for document with default parameters
     * @param document LabelledDocument to vectorize
     * @return INDArray vector representation using default parameters
     */
    public INDArray inferVector(LabelledDocument document);
    
    /**
     * Calculate inferred vector for word list with default parameters
     * @param document List of VocabWord instances to vectorize
     * @return INDArray vector representation using default parameters
     */
    public INDArray inferVector(List<VocabWord> document);
    
    /**
     * Batched inference for labeled document returning Future with ID and vector
     * @param document LabelledDocument with ID field defined
     * @return Future containing Pair of document ID and inferred vector
     */
    public Future<Pair<String, INDArray>> inferVectorBatched(LabelledDocument document);
    
    /**
     * Batched inference for text string returning Future with vector
     * @param document Raw text string to vectorize
     * @return Future containing inferred vector
     */
    public Future<INDArray> inferVectorBatched(String document);
    
    /**
     * Batched inference for multiple text strings
     * @param documents List of text strings to vectorize
     * @return List of INDArray vectors in same order as input
     */
    public List<INDArray> inferVectorBatched(List<String> documents);
    
    /**
     * Find top N labels nearest to labeled document
     * @param document LabelledDocument to compare
     * @param topN Number of nearest labels to return
     * @return Collection of nearest label strings
     */
    public Collection<String> nearestLabels(LabelledDocument document, int topN);
    
    /**
     * Find top N labels nearest to raw text
     * @param rawText Raw text string to compare
     * @param topN Number of nearest labels to return
     * @return Collection of nearest label strings
     */
    public Collection<String> nearestLabels(String rawText, int topN);
    
    /**
     * Find top N labels nearest to vocabulary word collection
     * @param document Collection of VocabWord instances
     * @param topN Number of nearest labels to return
     * @return Collection of nearest label strings
     */
    public Collection<String> nearestLabels(Collection<VocabWord> document, int topN);
    
    /**
     * Find top N labels nearest to feature vector
     * @param labelVector INDArray feature vector
     * @param topN Number of nearest labels to return
     * @return Collection of nearest label strings
     */
    public Collection<String> nearestLabels(INDArray labelVector, int topN);
    
    /**
     * Calculate similarity between document and specific label
     * @param document LabelledDocument to compare
     * @param label Target label string
     * @return Similarity score between document and label
     */
    public double similarityToLabel(LabelledDocument document, String label);
    
    /**
     * Calculate similarity between word list and specific label
     * @param document List of VocabWord instances
     * @param label Target label string
     * @return Similarity score between document and label
     */
    public double similarityToLabel(List<VocabWord> document, String label);
    
    /**
     * Calculate similarity between raw text and specific label (deprecated)
     * @param rawText Raw text string
     * @param label Target label string  
     * @return Similarity score between text and label
     */
    @Deprecated
    public double similarityToLabel(String rawText, String label);
    
    /**
     * Extract label vectors from vocabulary for nearest neighbor operations
     * Populates internal labels matrix for efficient similarity calculations
     */
    public void extractLabels();
    
    /**
     * Set sequence iterator for pre-tokenized training data
     * @param iterator SequenceIterator providing tokenized sequences
     */
    public void setSequenceIterator(SequenceIterator<VocabWord> iterator);
}
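
The label-oriented methods above (`predict`, `predictSeveral`, `nearestLabels`, `similarityToLabel`) all compare a document vector against label vectors. The following is a minimal conceptual sketch of that idea, assuming cosine similarity as the scoring function and plain `double[]` arrays standing in for `INDArray`. The class `LabelRankingSketch` and its methods are hypothetical illustrations and are not part of the DL4J API; the library's internals may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual stand-in: plain double[] vectors instead of INDArray.
// None of these names exist in DL4J; they only illustrate the ranking idea.
class LabelRankingSketch {

    // Cosine similarity between two equal-length vectors; 0.0 if either has zero norm.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vector lengths differ");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to `limit` labels ordered by descending similarity to docVector.
    static List<String> rankLabels(double[] docVector, Map<String, double[]> labelVectors, int limit) {
        List<Map.Entry<String, double[]>> entries = new ArrayList<>(labelVectors.entrySet());
        entries.sort((x, y) -> Double.compare(
                cosine(docVector, y.getValue()),
                cosine(docVector, x.getValue())));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> labels = new LinkedHashMap<>();
        labels.put("positive", new double[] {1.0, 0.0});
        labels.put("negative", new double[] {-1.0, 0.0});
        labels.put("neutral", new double[] {0.0, 1.0});

        double[] doc = {0.9, 0.1}; // pretend this came from inferVector(...)
        System.out.println(rankLabels(doc, labels, 2)); // prints [positive, neutral]
    }
}
```

In this picture, a single-label prediction corresponds to taking the first element of the ranked list, and a similarity-to-label query corresponds to one `cosine` call against the chosen label's vector.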

ParagraphVectors Builder

Extended builder for ParagraphVectors with document-specific configuration options and label handling.

/**
 * Builder for ParagraphVectors configuration extending Word2Vec.Builder
 * (nested class, referenced as ParagraphVectors.Builder)
 */
public static class Builder extends Word2Vec.Builder {
    
    /**
     * Build configured ParagraphVectors instance
     * @return Configured ParagraphVectors model ready for training
     */
    public ParagraphVectors build();
    
    /**
     * Use pre-built WordVectors model for ParagraphVectors initialization
     * @param vec Existing WordVectors model (Word2Vec or GloVe)
     * @return Builder instance for method chaining
     */
    public Builder useExistingWordVectors(WordVectors vec);
    
    /**
     * Define whether word representations should be trained with documents
     * @param trainElements Whether to train word vectors alongside document vectors
     * @return Builder instance for method chaining
     */
    public Builder trainWordVectors(boolean trainElements);
    
    /**
     * Attach pre-defined labels source to ParagraphVectors
     * @param source LabelsSource instance containing available labels
     * @return Builder instance for method chaining  
     */
    public Builder labelsSource(LabelsSource source);
    
    /**
     * Build LabelsSource from labels list (deprecated due to order synchronization issues)
     * @param labels List of label strings
     * @return Builder instance for method chaining
     */
    @Deprecated
    public Builder labels(List<String> labels);
    
    /**
     * Set label-aware document iterator for training
     * @param iterator LabelAwareDocumentIterator with labeled documents
     * @return Builder instance for method chaining
     */
    public Builder iterate(LabelAwareDocumentIterator iterator);
    
    /**
     * Set label-aware sentence iterator for training  
     * @param iterator LabelAwareSentenceIterator with labeled sentences
     * @return Builder instance for method chaining
     */
    public Builder iterate(LabelAwareSentenceIterator iterator);
    
    /**
     * Set general label-aware iterator for training
     * @param iterator LabelAwareIterator providing labeled training data
     * @return Builder instance for method chaining
     */
    public Builder iterate(LabelAwareIterator iterator);
    
    /**
     * Set document iterator for training (unlabeled documents)
     * @param iterator DocumentIterator providing training documents
     * @return Builder instance for method chaining
     */
    public Builder iterate(DocumentIterator iterator);
    
    /**
     * Set sentence iterator for training (unlabeled sentences)
     * @param iterator SentenceIterator providing training sentences
     * @return Builder instance for method chaining
     */
    public Builder iterate(SentenceIterator iterator);
    
    // Inherits all Word2Vec.Builder methods with appropriate return types
}
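
The note that inherited builder methods come back "with appropriate return types" refers to covariant return types: the subclass builder overrides inherited setters so a chain keeps returning the subclass builder. Here is a minimal sketch of that pattern using hypothetical stand-in classes (`BaseBuilder`, `DocBuilder`, `BaseModel`, `DocModel`). These are illustrations only and are not the DL4J classes.

```java
// Hypothetical stand-ins showing why a subclass builder re-declares inherited setters.
class BuilderChainingSketch {

    static class BaseModel {
        final int layerSize;
        BaseModel(int layerSize) { this.layerSize = layerSize; }
    }

    static class DocModel extends BaseModel {
        final boolean trainWords;
        DocModel(int layerSize, boolean trainWords) {
            super(layerSize);
            this.trainWords = trainWords;
        }
    }

    static class BaseBuilder {
        protected int layerSize = 100;

        BaseBuilder layerSize(int size) {
            this.layerSize = size;
            return this;
        }

        BaseModel build() {
            return new BaseModel(layerSize);
        }
    }

    static class DocBuilder extends BaseBuilder {
        private boolean trainWords = true;

        // Covariant override: same behavior, but the chain stays typed as DocBuilder.
        @Override
        DocBuilder layerSize(int size) {
            super.layerSize(size);
            return this;
        }

        DocBuilder trainWordVectors(boolean train) {
            this.trainWords = train;
            return this;
        }

        // Covariant override: build() yields the document-level model type.
        @Override
        DocModel build() {
            return new DocModel(layerSize, trainWords);
        }
    }

    public static void main(String[] args) {
        // Without the override, layerSize(...) would return BaseBuilder and
        // trainWordVectors(...) would not be reachable in the same chain.
        DocModel model = new DocBuilder().layerSize(300).trainWordVectors(false).build();
        System.out.println(model.layerSize + " " + model.trainWords); // prints 300 false
    }
}
```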

Usage Examples:

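import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.documentiterator.*;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;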

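// Basic document classification training
// LabelledDocument is populated through setters here; setContent/addLabel and
// SimpleLabelAwareIterator are assumed to be available in your DL4J version.
LabelledDocument positive1 = new LabelledDocument();
positive1.setContent("This is a positive review");
positive1.addLabel("positive");

LabelledDocument negative1 = new LabelledDocument();
negative1.setContent("This is a negative review");
negative1.addLabel("negative");

LabelledDocument positive2 = new LabelledDocument();
positive2.setContent("Great product, highly recommend");
positive2.addLabel("positive");

List<LabelledDocument> labeledDocs = Arrays.asList(positive1, negative1, positive2);

LabelAwareIterator iterator = new SimpleLabelAwareIterator(labeledDocs);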

ParagraphVectors paragraphVectors = new ParagraphVectors.Builder()
    .minWordFrequency(1)
    .iterations(5)
    .epochs(10)
    .layerSize(100)
    .learningRate(0.025)
    .windowSize(5)
    .iterate(iterator)
    .tokenizerFactory(new DefaultTokenizerFactory())
    .trainWordVectors(true)
    .build();

paragraphVectors.fit();

// Document inference and classification
String newDocument = "This product is amazing";
INDArray docVector = paragraphVectors.inferVector(newDocument);
String predictedLabel = paragraphVectors.predict(newDocument);
Collection<String> topLabels = paragraphVectors.predictSeveral(newDocument, 3);

System.out.println("Predicted label: " + predictedLabel);
System.out.println("Top labels: " + topLabels);

// Document similarity using inferred vectors
String doc1 = "Great product quality";
String doc2 = "Excellent item, very satisfied";

INDArray vec1 = paragraphVectors.inferVector(doc1);
INDArray vec2 = paragraphVectors.inferVector(doc2);

// Calculate cosine similarity
double similarity = Transforms.cosineSim(vec1, vec2);
System.out.println("Document similarity: " + similarity);

// Batch inference for multiple documents
List<String> documents = Arrays.asList(
    "First document text",
    "Second document text", 
    "Third document text"
);

List<INDArray> vectors = paragraphVectors.inferVectorBatched(documents);
System.out.println("Processed " + vectors.size() + " documents");

// Find nearest labels to a document
Collection<String> nearestLabels = paragraphVectors.nearestLabels(newDocument, 5);
System.out.println("Nearest labels: " + nearestLabels);

// Advanced configuration with existing word vectors
Word2Vec existingWord2Vec = new Word2Vec.Builder()
    .layerSize(300)
    .windowSize(10)
    // ... other configuration
    .build();
existingWord2Vec.fit(); // Train on large corpus

ParagraphVectors advancedPV = new ParagraphVectors.Builder()
    .useExistingWordVectors(existingWord2Vec)
    .trainWordVectors(false) // Don't retrain word vectors
    .layerSize(300)
    .iterate(labeledDocumentIterator) // placeholder: any LabelAwareIterator over your labeled corpus
    .tokenizerFactory(new DefaultTokenizerFactory())
    .build();

advancedPV.fit();
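
The examples above use only the list-based `inferVectorBatched` overload. The single-document overloads return a `Future`, so the caller decides when to block. The sketch below shows one way to consume such futures: submit everything first, then collect the results in input order. It runs without DL4J, so `fakeInfer` and the thread pool are hypothetical stand-ins for the model's own batching. With the real API, the futures would come from `inferVectorBatched(...)` rather than from an executor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in sketch of the "submit first, collect later" pattern for Future-based inference.
class BatchedFutureSketch {

    // Hypothetical inference: maps text to a tiny vector {character count, token count}.
    static double[] fakeInfer(String text) {
        return new double[] {text.length(), text.split("\\s+").length};
    }

    // Submits all documents, then gathers vectors in the same order as the input list.
    static List<double[]> inferAll(List<String> docs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<double[]>> pending = new ArrayList<>();
            for (String doc : docs) {
                pending.add(pool.submit(() -> fakeInfer(doc)));
            }
            List<double[]> results = new ArrayList<>();
            for (Future<double[]> future : pending) {
                try {
                    results.add(future.get()); // blocks until this document's vector is ready
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    throw new IllegalStateException("Interrupted while waiting for inference", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("Inference failed", e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<double[]> vectors = inferAll(Arrays.asList(
                "First document text",
                "Second document text",
                "Third document text"));
        System.out.println("Processed " + vectors.size() + " documents");
    }
}
```

Submitting every document before calling `get()` lets the work overlap, whereas calling `get()` immediately after each submission would serialize it.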

Document Types

Supporting classes for labeled document handling and training data preparation.

/**
 * Document with label information for supervised training
 */
public class LabelledDocument {
    
    /**
     * Get document content as string
     * @return Document text content
     */
    public String getContent();
    
    /**
     * Get document identifier
     * @return String identifier for the document
     */
    public String getId();
    
    /**
     * Get document labels
     * @return List of label strings associated with document
     */
    public List<String> getLabels();
    
    /**
     * Get referenced content as vocabulary words
     * @return List of VocabWord instances from document
     */
    public List<VocabWord> getReferencedContent();
}

/**
 * Source of labels for document classification
 */
public class LabelsSource {
    
    /**
     * Create empty labels source
     */
    public LabelsSource();
    
    /**
     * Create labels source with predefined labels
     * @param labels List of available label strings
     */
    public LabelsSource(List<String> labels);
    
    /**
     * Get available labels
     * @return List of label strings
     */
    public List<String> getLabels();
}

Install with Tessl CLI

npx tessl i tessl/maven-org-deeplearning4j--deeplearning4j-nlp
