LangChain4j All-MiniLM-L6-v2 Quantized Embedding Model

A quantized version of the SentenceTransformers all-MiniLM-L6-v2 embedding model that runs directly within Java applications without requiring external services. This package generates 384-dimensional embeddings for text using ONNX Runtime, with the quantized model providing efficient in-process execution suitable for semantic search, similarity matching, RAG (Retrieval-Augmented Generation) applications, and other NLP tasks.

Package Information

  • Package Name: langchain4j-embeddings-all-minilm-l6-v2-q
  • Group ID: dev.langchain4j
  • Artifact ID: langchain4j-embeddings-all-minilm-l6-v2-q
  • Package Type: Maven
  • Language: Java
  • Minimum Java Version: Java 8
  • Installation: Add the following dependency to your pom.xml:
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-embeddings-all-minilm-l6-v2-q</artifactId>
    <version>1.11.0</version>
</dependency>

Or for Gradle:

implementation 'dev.langchain4j:langchain4j-embeddings-all-minilm-l6-v2-q:1.11.0'

Core Imports

import dev.langchain4j.model.embedding.onnx.allminilml6v2q.AllMiniLmL6V2QuantizedEmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.model.output.TokenUsage;
import dev.langchain4j.model.output.FinishReason;

Basic Usage

import dev.langchain4j.model.embedding.onnx.allminilml6v2q.AllMiniLmL6V2QuantizedEmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.output.Response;

// Create the embedding model with default settings
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();

// Embed a single text string
Response<Embedding> response = model.embed("Hello, world!");
Embedding embedding = response.content();

// Access the vector
float[] vector = embedding.vector();
int dimension = embedding.dimension(); // Returns 384

// Get embedding dimension without generating embeddings
int dim = model.dimension(); // Returns 384

Model Characteristics

  • Embedding Dimensions: 384
  • Maximum Recommended Token Length: 256 tokens (longer inputs are accepted, but embedding quality degrades beyond this length)
  • Pooling Mode: MEAN pooling
  • Model Type: Quantized ONNX model (smaller size, slightly reduced accuracy vs. non-quantized)
  • Thread Safety: Thread-safe and supports concurrent embedding operations
  • Normalization: All embeddings are normalized (magnitude ≈ 1.0)
  • Model Source: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
  • Special Tokens: Uses [CLS] and [SEP] tokens internally (excluded from token counts)
  • Model Files:
    • Model: all-minilm-l6-v2-q.onnx (loaded from classpath)
    • Tokenizer: all-minilm-l6-v2-q-tokenizer.json (loaded from classpath)
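Because this model returns unit-length vectors, cosine similarity between two embeddings reduces to a plain dot product. A minimal stdlib-only sketch of that comparison (the 3-dimensional sample vectors here are hypothetical stand-ins for real 384-dimensional model output):

```java
public class CosineSimilarity {

    // For unit-length vectors, cosine similarity is simply the dot product.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical unit vectors standing in for embedding.vector() results
        float[] v1 = {1f, 0f, 0f};
        float[] v2 = {0.6f, 0.8f, 0f};

        System.out.println(dot(v1, v1)); // identical vectors -> 1.0
        System.out.println(dot(v1, v2)); // partial overlap -> 0.6
    }
}
```

With real embeddings, replace the sample arrays with the `float[]` values returned by `embedding.vector()`; no normalization step is needed first, since the model's output is already normalized.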

Dependencies

This package automatically includes the following key dependencies:

  • ONNX Runtime Java: For model inference
  • Tokenizers: For text tokenization
  • LangChain4j Core: For core types and interfaces

Note: This package bundles the ONNX model files within the JAR. No additional model downloads are required at runtime.

Potential Conflicts:

  • If using multiple ONNX-based embedding models, ensure they use compatible ONNX Runtime versions
  • Memory usage scales with concurrent model instances (each model loads its own copy)

Capabilities

Model Instantiation

Create embedding model instances with default or custom executor settings.

// Default constructor - uses cached thread pool with threads = available processors
public AllMiniLmL6V2QuantizedEmbeddingModel()

// Constructor with custom executor for parallel processing control
public AllMiniLmL6V2QuantizedEmbeddingModel(java.util.concurrent.Executor executor)

Parameters:

  • executor (Executor): Custom executor for parallelizing the embedding process. Must not be null.

Throws:

  • NullPointerException: If executor is null (when using second constructor)

Default Executor Behavior:

  • Thread pool size: Number of available processors
  • Thread caching: Threads are cached for 1 second
  • Thread pool type: Cached thread pool with core thread timeout enabled

Usage Examples:

Default instantiation:

EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();

Custom executor for controlled parallelization:

import java.util.concurrent.Executors;
import java.util.concurrent.Executor;

Executor customExecutor = Executors.newFixedThreadPool(4);
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(customExecutor);

Null Handling:

  • Passing null executor throws NullPointerException
  • Always validate executor before passing to constructor

Resource Management:

  • The model instance shares a static ONNX model and tokenizer loaded once at class initialization
  • Creating multiple instances is safe but each will use the configured executor
  • Custom executors should be managed and shutdown by the caller
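Since the model never shuts down a custom executor, the caller owns its lifecycle. A stdlib-only sketch of that shutdown pattern (the model construction appears in a comment so the snippet stands alone without the langchain4j dependency):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorLifecycle {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            // EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(executor);
            // ... perform embedding work here ...
        } finally {
            // The model does not shut down custom executors; the caller must.
            executor.shutdown();
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                executor.shutdownNow(); // force termination if tasks hang
            }
        }
        System.out.println(executor.isShutdown()); // prints "true"
    }
}
```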

Single Text Embedding

Embed a single text string or TextSegment to generate a 384-dimensional vector representation.

// Embed a plain string
Response<Embedding> embed(String text)

// Embed a TextSegment (text with metadata)
Response<Embedding> embed(TextSegment textSegment)

Parameters:

  • text (String): The text to embed. Can be null, empty, or any length.
  • textSegment (TextSegment): A text segment containing text and optional metadata. Can contain null text.

Returns: Response<Embedding> containing:

  • content(): The generated Embedding (never null)
  • tokenUsage(): Token usage statistics (input tokens only, excludes special tokens [CLS] and [SEP])
  • finishReason(): Always null for embedding models
  • metadata(): Empty map for this model

Null Handling:

  • null text: Treated as empty string, produces valid embedding
  • null TextSegment: May throw NullPointerException
  • Empty string: Produces valid embedding for empty input

Edge Cases:

  • Empty text: Returns valid 384-dimensional embedding
  • Very long text (>510 tokens): Automatically split into chunks that are embedded separately, then averaged into a single embedding
  • Special characters: Handled by tokenizer, no preprocessing needed
  • Unicode: Fully supported including emoji and non-Latin scripts
  • Whitespace-only text: Produces valid embedding

Performance:

  • Typical single text embedding: 10-50ms depending on text length and hardware
  • Memory per embedding: ~1.5KB (float array)

Usage Examples:

Embedding a string:

Response<Embedding> response = model.embed("The quick brown fox jumps over the lazy dog");
Embedding embedding = response.content();
float[] vector = embedding.vector(); // 384-dimensional float array

Embedding a TextSegment:

import dev.langchain4j.data.segment.TextSegment;

TextSegment segment = TextSegment.from("Machine learning is fascinating");
Response<Embedding> response = model.embed(segment);
Embedding embedding = response.content();

Handling null or empty text:

// Empty text
Response<Embedding> response1 = model.embed("");
Embedding emb1 = response1.content(); // Valid embedding

// Null text treated as empty
Response<Embedding> response2 = model.embed((String) null);
Embedding emb2 = response2.content(); // Valid embedding

Batch Text Embedding

Embed multiple text segments in a single call with automatic parallel processing for efficiency.

// Embed multiple text segments
Response<java.util.List<Embedding>> embedAll(java.util.List<TextSegment> textSegments)

Parameters:

  • textSegments (List<TextSegment>): List of text segments to embed. Must not be null or empty.

Returns: Response<List<Embedding>> containing:

  • content(): List of Embedding objects, one per input segment in the same order (never null)
  • tokenUsage(): Aggregated token usage across all segments (input tokens only, excludes special tokens)
  • finishReason(): Always null for embedding models
  • metadata(): Empty map for this model

Throws:

  • IllegalArgumentException: If textSegments is null or empty
  • NullPointerException: If textSegments list contains null elements

Behavior:

  • Single segment: Processed in the same thread (no parallelization overhead)
  • Multiple segments: Processed in parallel using the configured Executor
  • Token count excludes special tokens [CLS] and [SEP]
  • Order preservation: Output embeddings match input order exactly

Null Handling:

  • null list: Throws IllegalArgumentException
  • Empty list: Throws IllegalArgumentException
  • null elements in list: Throws NullPointerException
  • TextSegment with null text: Treated as empty string

Edge Cases:

  • Single segment: No parallelization, direct processing
  • Very large batch (1000+ segments): May cause memory pressure; consider batching
  • Mixed text lengths: Short and long texts can be mixed; parallelization handles this efficiently
  • Duplicate texts: Each is embedded independently (no deduplication)

Performance:

  • Parallelization benefit increases with batch size and text length
  • Optimal batch size: 10-100 segments depending on text length
  • Memory usage: ~1.5KB per embedding + temporary processing buffers

Usage Examples:

import dev.langchain4j.data.segment.TextSegment;
import java.util.List;
import java.util.Arrays;

List<TextSegment> segments = Arrays.asList(
    TextSegment.from("First document about artificial intelligence"),
    TextSegment.from("Second document about machine learning"),
    TextSegment.from("Third document about deep learning")
);

Response<List<Embedding>> response = model.embedAll(segments);
List<Embedding> embeddings = response.content(); // 3 embeddings

// Access individual embeddings
Embedding firstEmbedding = embeddings.get(0);
Embedding secondEmbedding = embeddings.get(1);

// Check token usage
Integer inputTokens = response.tokenUsage().inputTokenCount();

Handling errors:

try {
    List<TextSegment> segments = Arrays.asList(/* ... */);
    Response<List<Embedding>> response = model.embedAll(segments);
    // Process embeddings
} catch (IllegalArgumentException e) {
    // Handle null or empty list
    System.err.println("Invalid input: " + e.getMessage());
}

Large batch processing with memory management:

import java.util.List;
import java.util.ArrayList;

List<TextSegment> allSegments = /* large list */;
int batchSize = 50;
List<Embedding> allEmbeddings = new ArrayList<>();

// Process in batches to manage memory
for (int i = 0; i < allSegments.size(); i += batchSize) {
    int end = Math.min(i + batchSize, allSegments.size());
    List<TextSegment> batch = allSegments.subList(i, end);

    Response<List<Embedding>> response = model.embedAll(batch);
    allEmbeddings.addAll(response.content());
}

Embedding Dimension Query

Get the dimension of embeddings produced by this model without generating embeddings.

// Returns the embedding dimension
int dimension()

Returns: int - The embedding dimension (always 384 for this model)

Usage Example:

int dim = model.dimension(); // Returns 384

Use Cases:

  • Pre-allocate storage for embeddings
  • Validate compatibility with vector databases
  • Configure neural network input layers
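The dimension value works well as a fail-fast guard before writing into a fixed-width vector store. A stdlib-only sketch with a hypothetical `checkDimension` helper (the 384 is what `model.dimension()` returns for this model; the store width is an assumption for illustration):

```java
public class DimensionCheck {

    // Hypothetical guard: fail fast if the model's output width does not
    // match the vector store's configured column width.
    static void checkDimension(int modelDimension, int storeDimension) {
        if (modelDimension != storeDimension) {
            throw new IllegalStateException("Model produces " + modelDimension
                    + "-dim vectors but store expects " + storeDimension);
        }
    }

    public static void main(String[] args) {
        int modelDim = 384; // what model.dimension() returns for this model
        checkDimension(modelDim, 384); // passes

        // Pre-allocate storage for a batch of 10 embeddings
        float[][] batch = new float[10][modelDim];
        System.out.println(batch.length + "x" + batch[0].length); // prints "10x384"
    }
}
```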

Model Name

Get the name identifier of the underlying embedding model.

// Returns the model name
String modelName()

Returns: String - The model name or "unknown" if not specified by the implementation

Usage Example:

String name = model.modelName();

Note: The returned name is implementation-specific and may be "unknown" for this model.

Listener Support

Wrap the embedding model with listeners to observe and monitor embedding operations.

// Add a single listener
EmbeddingModel addListener(dev.langchain4j.model.embedding.listener.EmbeddingModelListener listener)

// Add multiple listeners
EmbeddingModel addListeners(java.util.List<dev.langchain4j.model.embedding.listener.EmbeddingModelListener> listeners)

Parameters:

  • listener (EmbeddingModelListener): A listener to observe embedding operations. If null, returns the model unchanged.
  • listeners (List<EmbeddingModelListener>): List of listeners to observe embedding operations. Called in iteration order. If null or empty, returns the model unchanged.

Returns: EmbeddingModel - An observing embedding model that dispatches events to the provided listener(s)

Null Handling:

  • null listener: Returns original model unchanged (no-op)
  • null or empty listeners list: Returns original model unchanged (no-op)
  • null elements in listeners list: Skipped during event dispatch

Annotation: @Experimental (since v1.11.0)

Behavior:

  • Listeners are called synchronously during embedding operations
  • Exceptions in listeners do not propagate to the embedding operation
  • Multiple listeners are called in the order they were added
  • Listeners can access and modify the attributes map for passing data between callbacks

Usage Example:

import dev.langchain4j.model.embedding.listener.EmbeddingModelListener;
import dev.langchain4j.model.embedding.listener.EmbeddingModelRequestContext;
import dev.langchain4j.model.embedding.listener.EmbeddingModelResponseContext;
import dev.langchain4j.model.embedding.listener.EmbeddingModelErrorContext;

// Add a listener to monitor embedding operations
EmbeddingModel observedModel = model.addListener(new EmbeddingModelListener() {
    @Override
    public void onRequest(EmbeddingModelRequestContext ctx) {
        System.out.println("Embedding " + ctx.textSegments().size() + " segments");
        // Store start time in attributes for performance tracking
        ctx.attributes().put("startTime", System.currentTimeMillis());
    }

    @Override
    public void onResponse(EmbeddingModelResponseContext ctx) {
        long startTime = (Long) ctx.attributes().get("startTime");
        long duration = System.currentTimeMillis() - startTime;
        System.out.println("Completed in " + duration + "ms");
    }

    @Override
    public void onError(EmbeddingModelErrorContext ctx) {
        System.err.println("Error: " + ctx.error().getMessage());
    }
});

// Use the observed model
Response<Embedding> response = observedModel.embed("test");

Factory Class

AllMiniLmL6V2QuantizedEmbeddingModelFactory

Factory class for creating model instances via the SPI (Service Provider Interface) mechanism.

public class AllMiniLmL6V2QuantizedEmbeddingModelFactory implements dev.langchain4j.spi.model.embedding.EmbeddingModelFactory

// Create a new model instance with default settings
public EmbeddingModel create()

Package: dev.langchain4j.model.embedding.onnx.allminilml6v2q

Returns: EmbeddingModel - A new AllMiniLmL6V2QuantizedEmbeddingModel instance with default settings (default executor)

Usage: Typically used by frameworks and service loaders rather than direct instantiation.

Example:

import dev.langchain4j.spi.model.embedding.EmbeddingModelFactory;
import java.util.ServiceLoader;

// Load via SPI
ServiceLoader<EmbeddingModelFactory> loader = ServiceLoader.load(EmbeddingModelFactory.class);
for (EmbeddingModelFactory factory : loader) {
    if (factory instanceof AllMiniLmL6V2QuantizedEmbeddingModelFactory) {
        EmbeddingModel model = factory.create();
        break;
    }
}

Core Types

FinishReason

Represents the reason why a model call finished.

public enum FinishReason {
    // The model call finished because the model decided the request was done
    STOP,

    // The call finished because the token length was reached
    LENGTH,

    // The call finished signalling a need for tool execution
    TOOL_EXECUTION,

    // The call finished signalling a need for content filtering
    CONTENT_FILTER,

    // The call finished for some other reason
    OTHER
}

Package: dev.langchain4j.model.output

Note: For embedding models, the finish reason is always null.

Embedding

Represents a dense vector embedding of text.

public class Embedding {
    // Constructor
    public Embedding(float[] vector)

    // Get the vector array
    public float[] vector()

    // Get vector as a list
    public java.util.List<Float> vectorAsList()

    // Get embedding dimension
    public int dimension()

    // Normalize the vector in-place
    public void normalize()

    // Factory methods
    public static Embedding from(float[] vector)
    public static Embedding from(java.util.List<Float> vector)
}

Package: dev.langchain4j.data.embedding

Key Methods:

  • vector(): Returns the raw float array representing the embedding. The returned array is the internal array (not a copy), so modifications will affect the embedding.
  • vectorAsList(): Returns a copy of the vector as a List<Float>. This is a defensive copy, so modifications won't affect the embedding.
  • dimension(): Returns the length of the vector (384 for this model)
  • normalize(): Normalizes the vector to unit length (magnitude = 1.0) in-place. This model already produces normalized vectors, so calling this is typically unnecessary.
  • from(float[] vector): Static factory method to create an Embedding from a float array. The array is stored directly (not copied).
  • from(List<Float> vector): Static factory method to create an Embedding from a list. The list is converted to a float array.

Null Handling:

  • Constructor with null vector: Throws NullPointerException
  • from() methods with null: Throw NullPointerException

Important Notes:

  • The vector array returned by vector() is mutable; avoid modifying it unless you intend to change the embedding
  • This model produces pre-normalized embeddings; calling normalize() is unnecessary
  • For immutable access, use vectorAsList() which returns a defensive copy
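When code outside your control might mutate a vector, copy it before sharing. A stdlib-only sketch of the copy-before-modify pattern (here a plain float[] stands in for the internal array returned by vector()):

```java
import java.util.Arrays;

public class DefensiveCopy {
    public static void main(String[] args) {
        // Stand-in for the internal array returned by embedding.vector()
        float[] internal = {0.1f, 0.2f, 0.3f};

        // Copy before modifying so the original embedding stays intact
        float[] copy = Arrays.copyOf(internal, internal.length);
        copy[0] = 99f;

        System.out.println(internal[0]); // prints "0.1" -- original untouched
        System.out.println(copy[0]);     // prints "99.0"
    }
}
```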

Metadata

Represents metadata associated with a Document or TextSegment as key-value pairs.

public class Metadata {
    // Constructors
    public Metadata()
    public Metadata(java.util.Map<String, ?> metadata)

    // Getter methods for typed access
    public String getString(String key)
    public java.util.UUID getUUID(String key)
    public Integer getInteger(String key)
    public Long getLong(String key)
    public Float getFloat(String key)
    public Double getDouble(String key)

    // Check for key existence
    public boolean containsKey(String key)

    // Add key-value pairs (fluent API)
    public Metadata put(String key, String value)
    public Metadata put(String key, java.util.UUID value)
    public Metadata put(String key, int value)
    public Metadata put(String key, long value)
    public Metadata put(String key, float value)
    public Metadata put(String key, double value)
    public Metadata putAll(java.util.Map<String, Object> metadata)

    // Remove a key
    public Metadata remove(String key)

    // Copy and convert
    public Metadata copy()
    public java.util.Map<String, Object> toMap()

    // Merge with another Metadata object
    public Metadata merge(Metadata another)

    // Factory methods
    public static Metadata from(String key, String value)
    public static Metadata from(java.util.Map<String, ?> metadata)
    public static Metadata metadata(String key, String value)
}

Package: dev.langchain4j.data.document

Supported Value Types: String, UUID, Integer, Long, Float, Double

Key Methods:

  • getString(String key), getInteger(String key), etc.: Returns typed values. Returns null if key not present or value cannot be cast to the requested type.
  • put(String key, T value): Adds key-value pair, returns this for chaining (fluent API)
  • containsKey(String key): Checks if key exists
  • toMap(): Returns copy as Map<String, Object>
  • merge(Metadata another): Merges two Metadata objects. Throws exception if keys overlap.

Null Handling:

  • null key in put/get methods: May throw NullPointerException (depends on internal map implementation)
  • null value in put methods: Stores null value
  • null Metadata in merge: Returns this unchanged
  • Getter methods return null for missing keys

Edge Cases:

  • Type mismatch in getters: Returns null if stored type doesn't match requested type
  • Duplicate keys in merge: Throws IllegalArgumentException
  • putAll with null map: May throw NullPointerException

Usage Examples:

// Create metadata with fluent API
Metadata meta = new Metadata()
    .put("source", "document.pdf")
    .put("page", 5)
    .put("score", 0.95);

// Type-safe retrieval
String source = meta.getString("source");
Integer page = meta.getInteger("page");
Float score = meta.getFloat("score");

// Null handling
Integer missing = meta.getInteger("nonexistent"); // Returns null

TextSegment

Represents a semantically meaningful segment of text with optional metadata.

public class TextSegment {
    // Constructor
    public TextSegment(String text, Metadata metadata)

    // Get the text content
    public String text()

    // Get the metadata
    public Metadata metadata()

    // Factory methods
    public static TextSegment from(String text)
    public static TextSegment from(String text, Metadata metadata)
    public static TextSegment textSegment(String text)
    public static TextSegment textSegment(String text, Metadata metadata)
}

Package: dev.langchain4j.data.segment

Key Methods:

  • text(): Returns the text content. May return null if TextSegment was created with null text.
  • metadata(): Returns the associated metadata. Never null; returns empty Metadata if none provided.
  • from(String text): Creates a TextSegment with empty metadata
  • from(String text, Metadata metadata): Creates a TextSegment with specified metadata
  • textSegment(String text): Alternative factory method (same as from(String text))
  • textSegment(String text, Metadata metadata): Alternative factory method (same as from(String text, Metadata metadata))

Null Handling:

  • null text: Accepted; stored as-is (embedding models typically treat as empty)
  • null metadata: Replaced with empty Metadata instance

Usage Examples:

// Simple text segment
TextSegment segment1 = TextSegment.from("This is a document");

// Text segment with metadata
Metadata meta = new Metadata().put("source", "doc1.txt");
TextSegment segment2 = TextSegment.from("Content here", meta);

// Accessing content
String text = segment2.text();
String source = segment2.metadata().getString("source");

Response<T>

Generic wrapper for model responses containing the generated content and metadata.

public class Response<T> {
    // Constructors
    public Response(T content)
    public Response(T content, TokenUsage tokenUsage, FinishReason finishReason)
    public Response(T content, TokenUsage tokenUsage, FinishReason finishReason, java.util.Map<String, Object> metadata)

    // Get the content
    public T content()

    // Get token usage statistics
    public TokenUsage tokenUsage()

    // Get finish reason
    public FinishReason finishReason()

    // Get response metadata
    public java.util.Map<String, Object> metadata()

    // Factory methods
    public static <T> Response<T> from(T content)
    public static <T> Response<T> from(T content, TokenUsage tokenUsage)
    public static <T> Response<T> from(T content, TokenUsage tokenUsage, FinishReason finishReason)
    public static <T> Response<T> from(T content, TokenUsage tokenUsage, FinishReason finishReason, java.util.Map<String, Object> metadata)
}

Package: dev.langchain4j.model.output

Type Parameter:

  • T: The type of content (Embedding or List<Embedding> for this model)

Key Methods:

  • content(): Returns the generated content (Embedding or List<Embedding>). Never null.
  • tokenUsage(): Returns token usage statistics. May be null if not provided.
  • finishReason(): Returns the finish reason. Always null for embedding models.
  • metadata(): Returns response metadata. Returns empty map if not provided (never null).

Null Handling:

  • null content: Stored as-is (may cause issues downstream)
  • null tokenUsage: Accepted and stored
  • null finishReason: Accepted and stored
  • null metadata map: Replaced with empty map

Usage Example:

Response<Embedding> response = model.embed("test");

Embedding emb = response.content(); // Never null
TokenUsage usage = response.tokenUsage(); // May be null
FinishReason reason = response.finishReason(); // Always null for embeddings
Map<String, Object> meta = response.metadata(); // Empty map for this model

TokenUsage

Represents token usage statistics for a model response.

public class TokenUsage {
    // Constructors
    public TokenUsage()
    public TokenUsage(Integer inputTokenCount)
    public TokenUsage(Integer inputTokenCount, Integer outputTokenCount)
    public TokenUsage(Integer inputTokenCount, Integer outputTokenCount, Integer totalTokenCount)

    // Get input token count
    public Integer inputTokenCount()

    // Get output token count (always null for embedding models)
    public Integer outputTokenCount()

    // Get total token count
    public Integer totalTokenCount()

    // Add two TokenUsage instances
    public TokenUsage add(TokenUsage that)

    // Static method to sum two TokenUsage instances
    public static TokenUsage sum(TokenUsage first, TokenUsage second)
}

Package: dev.langchain4j.model.output

Key Methods:

  • inputTokenCount(): Returns the number of input tokens. May be null. For this model, excludes special tokens [CLS] and [SEP].
  • outputTokenCount(): Returns the number of output tokens. Always null for embedding models.
  • totalTokenCount(): Returns the total token count. May be null. For this model, equals inputTokenCount when populated.
  • add(TokenUsage that): Adds the token usage of another TokenUsage instance to this one, returning a new TokenUsage with summed values. Returns this instance unchanged if that is null.
  • sum(TokenUsage first, TokenUsage second): Static method to add two TokenUsage instances. Returns the non-null instance if one is null, or a new TokenUsage with summed values if both are non-null. Returns null if both are null.

Null Handling:

  • Constructor with null values: Accepted and stored
  • add(null): Returns this unchanged
  • sum(null, null): Returns null
  • All getter methods may return null

Note: For embedding models, only inputTokenCount is populated, representing the number of tokens in the input text (excluding special tokens).

Usage Example:

Response<Embedding> response = model.embed("test text");
TokenUsage usage = response.tokenUsage();

if (usage != null) {
    Integer inputTokens = usage.inputTokenCount(); // May be null
    Integer totalTokens = usage.totalTokenCount(); // Equals inputTokens
}

// Summing token usage from multiple responses
TokenUsage total = usage1.add(usage2).add(usage3);

EmbeddingModelRequestContext

Context object containing the input text segments and attributes for embedding model requests.

public class EmbeddingModelRequestContext {
    // Get the input text segments to be embedded
    public java.util.List<TextSegment> textSegments()

    // Get the embedding model instance
    public EmbeddingModel embeddingModel()

    // Get the attributes map for passing data between listeners
    public java.util.Map<Object, Object> attributes()

    // Builder pattern for constructing instances
    public static Builder builder()

    // Inner Builder class
    public static class Builder {
        public Builder textSegments(java.util.List<TextSegment> textSegments)
        public Builder embeddingModel(EmbeddingModel embeddingModel)
        public Builder attributes(java.util.Map<Object, Object> attributes)
        public EmbeddingModelRequestContext build()
    }
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • textSegments(): Returns the list of input text segments to be embedded. Never null.
  • embeddingModel(): Returns the embedding model that will process the request. Never null.
  • attributes(): Returns a mutable map for passing data between listener methods (e.g., for logging context, timing information). Never null; modifications are visible to subsequent callbacks.
  • builder(): Static factory method to create a new Builder instance for constructing the context

Usage: This context is passed to EmbeddingModelListener.onRequest() before the embedding operation begins. Listeners can use the attributes map to store request-specific data that will be available in subsequent response or error callbacks.

Example:

@Override
public void onRequest(EmbeddingModelRequestContext ctx) {
    // Store timing information
    ctx.attributes().put("startTime", System.currentTimeMillis());

    // Log request details
    int segmentCount = ctx.textSegments().size();
    String modelName = ctx.embeddingModel().modelName();
    System.out.println("Embedding " + segmentCount + " segments with " + modelName);
}

EmbeddingModelResponseContext

Context object containing the embedding response, input text segments, and attributes for successful embedding operations.

public class EmbeddingModelResponseContext {
    // Get the embedding response containing the list of embeddings
    public Response<java.util.List<Embedding>> response()

    // Get the input text segments that were embedded
    public java.util.List<TextSegment> textSegments()

    // Get the embedding model instance
    public EmbeddingModel embeddingModel()

    // Get the attributes map for passing data between listeners
    public java.util.Map<Object, Object> attributes()

    // Builder pattern for constructing instances
    public static Builder builder()

    // Inner Builder class
    public static class Builder {
        public Builder response(Response<java.util.List<Embedding>> response)
        public Builder textSegments(java.util.List<TextSegment> textSegments)
        public Builder embeddingModel(EmbeddingModel embeddingModel)
        public Builder attributes(java.util.Map<Object, Object> attributes)
        public EmbeddingModelResponseContext build()
    }
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • response(): Returns the Response object containing the list of generated embeddings and metadata (token usage, etc.). Never null.
  • textSegments(): Returns the input text segments that were successfully embedded. Never null.
  • embeddingModel(): Returns the embedding model that processed the request. Never null.
  • attributes(): Returns the attributes map that was passed through from the request context. Never null; contains any data stored during onRequest().
  • builder(): Static factory method to create a new Builder instance for constructing the context

Usage: This context is passed to EmbeddingModelListener.onResponse() after a successful embedding operation. It provides access to both the request data and the resulting embeddings.

Example:

@Override
public void onResponse(EmbeddingModelResponseContext ctx) {
    // Retrieve timing information from request
    Long startTime = (Long) ctx.attributes().get("startTime");
    long duration = System.currentTimeMillis() - startTime;

    // Access response data
    List<Embedding> embeddings = ctx.response().content();
    TokenUsage usage = ctx.response().tokenUsage();

    System.out.println("Generated " + embeddings.size() + " embeddings in " + duration + "ms");
    System.out.println("Token usage: " + usage.inputTokenCount() + " tokens");
}

EmbeddingModelErrorContext

Context object containing the error, input text segments, and attributes when an embedding operation fails.

public class EmbeddingModelErrorContext {
    // Get the error that occurred during the embedding operation
    public Throwable error()

    // Get the input text segments that caused the error
    public java.util.List<TextSegment> textSegments()

    // Get the embedding model instance
    public EmbeddingModel embeddingModel()

    // Get the attributes map for passing data between listeners
    public java.util.Map<Object, Object> attributes()

    // Builder pattern for constructing instances
    public static Builder builder()

    // Inner Builder class
    public static class Builder {
        public Builder error(Throwable error)
        public Builder textSegments(java.util.List<TextSegment> textSegments)
        public Builder embeddingModel(EmbeddingModel embeddingModel)
        public Builder attributes(java.util.Map<Object, Object> attributes)
        public EmbeddingModelErrorContext build()
    }
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • error(): Returns the Throwable (exception or error) that occurred during the embedding operation. Never null.
  • textSegments(): Returns the input text segments that caused the error. Never null.
  • embeddingModel(): Returns the embedding model that encountered the error. Never null.
  • attributes(): Returns the attributes map that was passed through from the request context. Never null; contains any data stored during onRequest().
  • builder(): Static factory method to create a new Builder instance for constructing the context.

Usage: This context is passed to EmbeddingModelListener.onError() when an embedding operation fails. It provides access to both the request data and the error details for logging, monitoring, or recovery purposes.

Example:

@Override
public void onError(EmbeddingModelErrorContext ctx) {
    // Log error details
    Throwable error = ctx.error();
    int segmentCount = ctx.textSegments().size();

    System.err.println("Error embedding " + segmentCount + " segments: " + error.getMessage());
    error.printStackTrace();

    // Could implement retry logic, fallback, or alerting here
}

EmbeddingModelListener

Interface for listening to embedding model requests, responses, and errors.

public interface EmbeddingModelListener {
    // Called before the request is executed against the embedding model
    default void onRequest(EmbeddingModelRequestContext requestContext) {}

    // Called after a successful embedding operation completes
    default void onResponse(EmbeddingModelResponseContext responseContext) {}

    // Called when an error occurs during interaction with the embedding model
    default void onError(EmbeddingModelErrorContext errorContext) {}
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • onRequest(EmbeddingModelRequestContext requestContext): Called before embedding execution. The request context contains input and attributes for passing data between listeners.
  • onResponse(EmbeddingModelResponseContext responseContext): Called after successful embedding. The response context contains the response, corresponding request, and attributes.
  • onError(EmbeddingModelErrorContext errorContext): Called when an error occurs. The error context contains the error, corresponding request, and attributes.

Important Characteristics:

  • All methods have default implementations (no-op), so you only need to override the methods you need
  • Listeners are called synchronously during the embedding operation
  • Exceptions thrown in listener methods are caught and logged but do not propagate to the caller
  • The attributes map can be used to pass data between onRequest, onResponse, and onError
  • Multiple listeners are called in the order they were added

Thread Safety: Listener methods may be called from multiple threads concurrently if the model is used concurrently. Implementations should be thread-safe or synchronized as needed.

Usage Example:

public class LoggingListener implements EmbeddingModelListener {
    @Override
    public void onRequest(EmbeddingModelRequestContext ctx) {
        ctx.attributes().put("startTime", System.nanoTime());
        System.out.println("[REQUEST] Embedding " + ctx.textSegments().size() + " segments");
    }

    @Override
    public void onResponse(EmbeddingModelResponseContext ctx) {
        long duration = System.nanoTime() - (Long) ctx.attributes().get("startTime");
        System.out.println("[RESPONSE] Completed in " + (duration / 1_000_000) + "ms");
    }

    @Override
    public void onError(EmbeddingModelErrorContext ctx) {
        System.err.println("[ERROR] Failed: " + ctx.error().getMessage());
    }
}

// Usage
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();
EmbeddingModel observedModel = model.addListener(new LoggingListener());

Error Handling

Common Exceptions

This section documents exceptions that may be thrown during embedding operations.

NullPointerException

When Thrown:

  • Passing null executor to constructor
  • Passing null to embedAll() (null list)
  • Creating Embedding with null vector
  • Various internal operations with null values

Example:

try {
    EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(null);
} catch (NullPointerException e) {
    System.err.println("Executor cannot be null");
}

IllegalArgumentException

When Thrown:

  • Empty list passed to embedAll()
  • Merging Metadata with duplicate keys
  • Invalid configuration values

Example:

try {
    Response<List<Embedding>> response = model.embedAll(Collections.emptyList());
} catch (IllegalArgumentException e) {
    System.err.println("Cannot embed empty list: " + e.getMessage());
}

OutOfMemoryError

When Thrown:

  • Embedding very large batches (thousands of segments)
  • Very long text inputs that create large token sequences
  • Concurrent embedding operations exhausting heap

Prevention:

// Batch processing to prevent OOM
int batchSize = 100;
for (int i = 0; i < allSegments.size(); i += batchSize) {
    int end = Math.min(i + batchSize, allSegments.size());
    List<TextSegment> batch = allSegments.subList(i, end);

    try {
        Response<List<Embedding>> response = model.embedAll(batch);
        // Process batch
    } catch (OutOfMemoryError e) {
        // Reduce batch size and retry
        System.err.println("OOM error, reducing batch size");
        break;
    }
}

Model Loading Exceptions

When Thrown:

  • Model files missing from classpath (corrupted JAR)
  • ONNX Runtime initialization failures
  • Incompatible ONNX Runtime version

Note: These exceptions typically occur during class initialization and cannot be caught in normal operation.

Example:

try {
    EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();
} catch (ExceptionInInitializerError e) {
    System.err.println("Failed to initialize model: " + e.getCause().getMessage());
    // This indicates a serious environment problem (missing dependencies, etc.)
}

Error Handling Patterns

Graceful Degradation

public List<Embedding> embedWithFallback(List<TextSegment> segments) {
    try {
        Response<List<Embedding>> response = model.embedAll(segments);
        return response.content();
    } catch (OutOfMemoryError e) {
        // Fall back to sequential processing
        List<Embedding> embeddings = new ArrayList<>();
        for (TextSegment segment : segments) {
            Response<Embedding> response = model.embed(segment);
            embeddings.add(response.content());
        }
        return embeddings;
    } catch (Exception e) {
        System.err.println("Embedding failed: " + e.getMessage());
        // Return empty list or throw custom exception
        return Collections.emptyList();
    }
}

Retry Logic

public Response<Embedding> embedWithRetry(String text, int maxRetries) {
    int attempts = 0;
    Exception lastException = null;

    while (attempts < maxRetries) {
        try {
            return model.embed(text);
        } catch (Exception e) {
            lastException = e;
            attempts++;
            if (attempts < maxRetries) {
                try {
                    Thread.sleep(1000L * attempts); // Linear backoff (1s, 2s, 3s, ...)
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Interrupted during retry", ie);
                }
            }
        }
    }

    throw new RuntimeException("Failed after " + maxRetries + " attempts", lastException);
}
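The `Thread.sleep(1000 * attempts)` delay in embedWithRetry grows linearly (1s, 2s, 3s, ...). If genuinely exponential backoff is wanted, the delay can be computed separately. This is a sketch; the cap value and method name are illustrative, not part of the package:

```java
public class Backoff {
    // Exponential backoff with a cap: attempt 1 -> 1s, 2 -> 2s, 3 -> 4s, ...
    // The shift amount is clamped so the computation cannot overflow.
    public static long delayMs(int attempt, long maxDelayMs) {
        long delay = 1000L << Math.min(attempt - 1, 20); // attempt is 1-based
        return Math.min(delay, maxDelayMs);
    }
}
```

A cap matters in practice: without it, a handful of retries can push the delay into minutes.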

Listener-Based Error Handling

public class RetryListener implements EmbeddingModelListener {
    private final int maxRetries;
    private final Map<Object, Integer> attemptCounts = new ConcurrentHashMap<>();

    public RetryListener(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    @Override
    public void onError(EmbeddingModelErrorContext ctx) {
        Object requestId = ctx.attributes().get("requestId");
        int attempts = attemptCounts.getOrDefault(requestId, 0) + 1;

        if (attempts < maxRetries) {
            attemptCounts.put(requestId, attempts);
            System.out.println("Retrying (attempt " + attempts + ")");
            // Trigger retry (would need custom retry logic)
        } else {
            System.err.println("Failed after " + maxRetries + " attempts");
            attemptCounts.remove(requestId);
        }
    }
}

Troubleshooting

Common Issues and Solutions

Issue: Model initialization fails with ClassNotFoundException

Cause: ONNX Runtime or other dependencies are missing from classpath

Solution:

<!-- Ensure all transitive dependencies are resolved -->
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-embeddings-all-minilm-l6-v2-q</artifactId>
    <version>1.11.0</version>
</dependency>
<!-- No exclusions should be applied to ONNX Runtime -->

Issue: OutOfMemoryError when embedding large batches

Cause: Insufficient heap memory for large batch processing

Solution:

# Increase JVM heap size
java -Xmx4g -jar your-application.jar

# Or use batch processing in code
int batchSize = 50; // Adjust based on available memory

Issue: Embeddings are not consistent across runs

Cause: This model is deterministic; inconsistency suggests concurrent modification or model reloading

Solution:

// Ensure model instance is reused (thread-safe)
private static final EmbeddingModel MODEL = new AllMiniLmL6V2QuantizedEmbeddingModel();

// Do not modify embedding vectors after generation
Embedding emb = model.embed("text").content();
float[] vector = emb.vector();
// Do not modify 'vector' array

Issue: Poor embedding quality for long documents

Cause: Text exceeds recommended 256 token limit

Solution:

// Split long documents into chunks
public List<Embedding> embedLongDocument(String longText) {
    // Split into ~200 token chunks (roughly 150 words)
    String[] chunks = splitIntoChunks(longText, 150);

    List<TextSegment> segments = Arrays.stream(chunks)
        .map(TextSegment::from)
        .collect(Collectors.toList());

    Response<List<Embedding>> response = model.embedAll(segments);
    return response.content();
}
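The `splitIntoChunks` helper referenced above is not part of this package. A naive word-count-based sketch could look like the following; real applications may prefer sentence- or paragraph-aware splitting:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSplitter {
    // Groups whitespace-separated words into fixed-size chunks.
    public static String[] splitIntoChunks(String text, int wordsPerChunk) {
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < words.length; i += wordsPerChunk) {
            int end = Math.min(i + wordsPerChunk, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks.toArray(new String[0]);
    }
}
```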

Issue: Slow embedding performance

Cause: Sequential processing or suboptimal executor configuration

Solution:

// Use custom executor with appropriate thread pool size
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(threads);
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(executor);

// Use batch embedding for multiple segments
Response<List<Embedding>> response = model.embedAll(segments); // Parallelized

Issue: Null pointer exceptions when accessing response fields

Cause: Optional fields (tokenUsage, finishReason, metadata) may be null

Solution:

Response<Embedding> response = model.embed("text");

// Always check for null
TokenUsage usage = response.tokenUsage();
if (usage != null && usage.inputTokenCount() != null) {
    int tokens = usage.inputTokenCount();
    System.out.println("Used " + tokens + " tokens");
}

Debugging Tips

  1. Enable Listener Logging:
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel()
    .addListener(new EmbeddingModelListener() {
        @Override
        public void onRequest(EmbeddingModelRequestContext ctx) {
            System.out.println("Request: " + ctx.textSegments().size() + " segments");
        }

        @Override
        public void onResponse(EmbeddingModelResponseContext ctx) {
            System.out.println("Response: " + ctx.response().content().size() + " embeddings");
        }

        @Override
        public void onError(EmbeddingModelErrorContext ctx) {
            ctx.error().printStackTrace();
        }
    });
  2. Check Model Dimension:
int dim = model.dimension(); // Should always be 384
assert dim == 384 : "Unexpected dimension: " + dim;
  3. Validate Embeddings:
Embedding emb = model.embed("test").content();
float[] vector = emb.vector();

// Check dimension
assert vector.length == 384;

// Check normalization (magnitude ≈ 1.0)
double magnitude = 0.0;
for (float v : vector) {
    magnitude += v * v;
}
magnitude = Math.sqrt(magnitude);
System.out.println("Magnitude: " + magnitude); // Should be ≈ 1.0
  4. Monitor Memory Usage:
Runtime runtime = Runtime.getRuntime();
long usedMemory = runtime.totalMemory() - runtime.freeMemory();
System.out.println("Memory used: " + (usedMemory / 1024 / 1024) + " MB");

Advanced Usage

Handling Long Text

The model accepts input of any length, but embedding quality degrades beyond roughly 256 tokens. For inputs longer than 510 tokens, the model automatically splits the text into chunks, embeds each chunk, and averages the resulting embeddings.

// Long text is automatically handled
String longText = "...text with more than 510 tokens...";
Response<Embedding> response = model.embed(longText);
Embedding embedding = response.content(); // Still 384-dimensional, averaged if needed

Best Practice for Long Documents:

public List<Embedding> embedLongDocumentWithChunking(String document) {
    // Split document into semantic chunks (e.g., paragraphs or sentences)
    List<String> chunks = splitIntoSemanticChunks(document, 200); // ~200 words per chunk

    List<TextSegment> segments = chunks.stream()
        .map(TextSegment::from)
        .collect(Collectors.toList());

    Response<List<Embedding>> response = model.embedAll(segments);
    return response.content();
}

// For document-level embedding, average the chunk embeddings
public Embedding getDocumentEmbedding(List<Embedding> chunkEmbeddings) {
    int dim = chunkEmbeddings.get(0).dimension();
    float[] avgVector = new float[dim];

    for (Embedding emb : chunkEmbeddings) {
        float[] vector = emb.vector();
        for (int i = 0; i < dim; i++) {
            avgVector[i] += vector[i];
        }
    }

    for (int i = 0; i < dim; i++) {
        avgVector[i] /= chunkEmbeddings.size();
    }

    Embedding docEmbedding = Embedding.from(avgVector);
    docEmbedding.normalize(); // Normalize after averaging
    return docEmbedding;
}

Computing Similarity

Use cosine similarity to compare embeddings (since they're normalized, dot product equals cosine similarity).

import dev.langchain4j.store.embedding.CosineSimilarity;
import dev.langchain4j.store.embedding.RelevanceScore;

Embedding emb1 = model.embed("Hello world").content();
Embedding emb2 = model.embed("Hi there").content();

// Compute cosine similarity
double cosineSim = CosineSimilarity.between(emb1, emb2);

// Convert to relevance score (0 to 1 scale)
double relevance = RelevanceScore.fromCosineSimilarity(cosineSim);

// Manual cosine similarity (since vectors are normalized, just dot product)
float[] v1 = emb1.vector();
float[] v2 = emb2.vector();
double dotProduct = 0.0;
for (int i = 0; i < v1.length; i++) {
    dotProduct += v1[i] * v2[i];
}
// dotProduct is the cosine similarity (vectors are unit length)
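The `RelevanceScore.fromCosineSimilarity` conversion used above is, in effect, a linear rescaling from [-1, 1] to [0, 1]. The arithmetic is shown here for illustration; the library call remains the authoritative implementation:

```java
public class Relevance {
    // Linear mapping from cosine similarity in [-1, 1] to a score in [0, 1].
    public static double fromCosine(double cosine) {
        return (cosine + 1.0) / 2.0;
    }
}
```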

Similarity Thresholds:

  • High similarity: cosine > 0.7 (similar meaning)
  • Medium similarity: 0.4 < cosine < 0.7 (related topics)
  • Low similarity: cosine < 0.4 (different topics)
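These cut-offs are heuristics rather than properties of the model. As a sketch, they can be wrapped in a small helper for readability:

```java
public class SimilarityBuckets {
    // Maps a cosine similarity to the heuristic buckets listed above.
    // The 0.7 and 0.4 thresholds are rules of thumb; tune them per use case.
    public static String classify(double cosine) {
        if (cosine > 0.7) return "similar meaning";
        if (cosine > 0.4) return "related topics";
        return "different topics";
    }
}
```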

Thread Safety and Concurrency

The model is thread-safe and supports concurrent embedding operations:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.List;
import java.util.ArrayList;

EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();
ExecutorService executor = Executors.newFixedThreadPool(10);
List<Future<Embedding>> futures = new ArrayList<>();

// Submit multiple embedding tasks concurrently
for (String text : texts) {
    futures.add(executor.submit(() -> model.embed(text).content()));
}

// Collect results
for (Future<Embedding> future : futures) {
    Embedding embedding = future.get();
    // Process embedding
}

executor.shutdown();

Thread Safety Notes:

  • The model instance is fully thread-safe
  • Multiple threads can call embed() or embedAll() concurrently
  • The underlying ONNX model is loaded once and shared (static)
  • Each embedding operation is independent

Custom Parallelization

Control the parallel processing behavior by providing a custom executor:

import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;

// Create custom executor with specific thread pool size
ExecutorService customExecutor = Executors.newFixedThreadPool(8);

// Pass to model constructor
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(customExecutor);

// When embedding multiple segments, uses custom executor
Response<List<Embedding>> response = model.embedAll(segments);

// Don't forget to shutdown when done (or use try-with-resources pattern)
customExecutor.shutdown();

Executor Selection Guidelines:

  • Fixed thread pool: Best for consistent workload, predictable resource usage
  • Cached thread pool (default): Good for variable workload, may create many threads
  • Single thread executor: For sequential processing, no parallelism
  • ForkJoinPool: Good for recursive divide-and-conquer tasks

Performance Tuning:

// For CPU-bound tasks, use core count
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(threads);

// For mixed workloads, use slightly more threads (shown with distinct
// variable names so both alternatives can coexist in one scope)
int mixedThreads = Runtime.getRuntime().availableProcessors() + 2;
ExecutorService mixedExecutor = Executors.newFixedThreadPool(mixedThreads);

Integration with Vector Databases

Storing Embeddings

import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

// Create embedding store
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

// Embed and store documents
List<TextSegment> documents = Arrays.asList(
    TextSegment.from("First document", new Metadata().put("id", "doc1")),
    TextSegment.from("Second document", new Metadata().put("id", "doc2"))
);

Response<List<Embedding>> response = model.embedAll(documents);
List<Embedding> embeddings = response.content();

// Store embeddings with their documents
for (int i = 0; i < documents.size(); i++) {
    embeddingStore.add(embeddings.get(i), documents.get(i));
}

Similarity Search

// Query embedding
Embedding queryEmbedding = model.embed("search query").content();

// Find similar documents
int maxResults = 5;
double minScore = 0.7;
List<EmbeddingMatch<TextSegment>> matches = embeddingStore.findRelevant(
    queryEmbedding,
    maxResults,
    minScore
);

// Process results
for (EmbeddingMatch<TextSegment> match : matches) {
    TextSegment segment = match.embedded();
    double score = match.score();
    System.out.println("Score: " + score + ", Text: " + segment.text());
}

Caching Embeddings

To avoid recomputing embeddings for the same text:

import java.util.concurrent.ConcurrentHashMap;
import java.util.Map;

public class CachedEmbeddingModel {
    private final EmbeddingModel model;
    private final Map<String, Embedding> cache;

    public CachedEmbeddingModel(EmbeddingModel model) {
        this.model = model;
        this.cache = new ConcurrentHashMap<>();
    }

    public Embedding embed(String text) {
        return cache.computeIfAbsent(text, t ->
            model.embed(t).content()
        );
    }

    public void clearCache() {
        cache.clear();
    }

    public int getCacheSize() {
        return cache.size();
    }
}

// Usage
EmbeddingModel baseModel = new AllMiniLmL6V2QuantizedEmbeddingModel();
CachedEmbeddingModel cachedModel = new CachedEmbeddingModel(baseModel);

Embedding emb1 = cachedModel.embed("test"); // Computed
Embedding emb2 = cachedModel.embed("test"); // Retrieved from cache
assert emb1 == emb2; // Same instance

Cache Considerations:

  • Memory usage: Each embedding is ~1.5KB (384 floats × 4 bytes)
  • Cache eviction: Implement LRU or size-based eviction for large caches
  • Thread safety: Use ConcurrentHashMap for concurrent access
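One minimal eviction sketch uses `LinkedHashMap` in access order. The class name and the capacity are illustrative; for concurrent use, wrap it with `Collections.synchronizedMap` or guard it externally, since `LinkedHashMap` itself is not thread-safe:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded LRU cache: evicts the least-recently-accessed entry
// once the configured capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

At ~1.5KB per embedding, even a 10,000-entry cache stays around 15MB of vector data.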

Batch Processing Strategies

Fixed-Size Batching

public List<Embedding> embedAllInBatches(List<String> texts, int batchSize) {
    List<Embedding> allEmbeddings = new ArrayList<>();

    for (int i = 0; i < texts.size(); i += batchSize) {
        int end = Math.min(i + batchSize, texts.size());
        List<TextSegment> batch = texts.subList(i, end).stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        Response<List<Embedding>> response = model.embedAll(batch);
        allEmbeddings.addAll(response.content());
    }

    return allEmbeddings;
}

Adaptive Batching

public List<Embedding> embedAllAdaptive(List<String> texts) {
    int batchSize = 100;
    List<Embedding> allEmbeddings = new ArrayList<>();

    for (int i = 0; i < texts.size(); i += batchSize) {
        int end = Math.min(i + batchSize, texts.size());
        List<TextSegment> batch = texts.subList(i, end).stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        try {
            Response<List<Embedding>> response = model.embedAll(batch);
            allEmbeddings.addAll(response.content());
        } catch (OutOfMemoryError e) {
            // Halve the batch size and retry the current batch
            batchSize = batchSize / 2;
            if (batchSize == 0) {
                throw new IllegalStateException("Cannot embed even a single segment", e);
            }
            i -= batchSize; // Loop increment restores i, so the failed batch is retried
            System.err.println("OOM: reducing batch size to " + batchSize);
        }
    }

    return allEmbeddings;
}

Performance Monitoring

public class PerformanceMonitoringListener implements EmbeddingModelListener {
    private final AtomicLong totalRequests = new AtomicLong(0);
    private final AtomicLong totalTime = new AtomicLong(0);
    private final AtomicLong totalTokens = new AtomicLong(0);

    @Override
    public void onRequest(EmbeddingModelRequestContext ctx) {
        totalRequests.incrementAndGet();
        ctx.attributes().put("startTime", System.nanoTime());
    }

    @Override
    public void onResponse(EmbeddingModelResponseContext ctx) {
        long startTime = (Long) ctx.attributes().get("startTime");
        long duration = System.nanoTime() - startTime;
        totalTime.addAndGet(duration);

        TokenUsage usage = ctx.response().tokenUsage();
        if (usage != null && usage.inputTokenCount() != null) {
            totalTokens.addAndGet(usage.inputTokenCount());
        }
    }

    public void printStats() {
        long requests = totalRequests.get();
        if (requests == 0) {
            System.out.println("No requests recorded yet");
            return;
        }
        long avgTimeMs = totalTime.get() / requests / 1_000_000;
        double avgTokens = (double) totalTokens.get() / requests;

        System.out.println("Total requests: " + requests);
        System.out.println("Average time: " + avgTimeMs + "ms");
        System.out.println("Average tokens: " + avgTokens);
    }
}

// Usage
PerformanceMonitoringListener monitor = new PerformanceMonitoringListener();
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel()
    .addListener(monitor);

// ... use model ...

monitor.printStats();

Implementation Notes

  • Model Loading: The ONNX model file (all-minilm-l6-v2-q.onnx) and tokenizer (all-minilm-l6-v2-q-tokenizer.json) are loaded from the JAR's classpath during class initialization
  • Model Download: The model is automatically downloaded from HuggingFace during the Maven build process and bundled into the JAR
  • Static Model Instance: The model and tokenizer are loaded once as static instances and shared across all instances of AllMiniLmL6V2QuantizedEmbeddingModel. This means:
    • First instantiation triggers model loading (one-time cost)
    • Subsequent instantiations are fast (no reload)
    • Multiple instances share the same underlying model (memory efficient)
  • Token Counting: Token counts exclude the special tokens [CLS] and [SEP] that BERT models use internally
  • Vector Normalization: All embeddings produced are already normalized to unit length (magnitude ≈ 1.0). Calling normalize() on embeddings from this model is unnecessary.
  • Quantization: This is a quantized version of the model, providing smaller size and faster inference with a slight reduction in accuracy compared to the non-quantized version
  • Memory Footprint:
    • Model size: ~90MB (loaded once, shared across instances)
    • Per-embedding memory: ~1.5KB (384 floats × 4 bytes)
    • Temporary processing buffers: Varies with batch size
  • ONNX Runtime: Uses ONNX Runtime Java bindings for model inference. The runtime is automatically included as a transitive dependency.
  • Tokenizer: Uses a fast WordPiece tokenizer compatible with BERT-based models
  • No External Services: Runs entirely in-process; no network calls or external services required
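The ~1.5KB per-embedding figure above can be sanity-checked with simple arithmetic; this back-of-envelope estimate covers only raw vector data and ignores object headers and store overhead (the one-million count is hypothetical):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long count = 1_000_000L;                     // hypothetical corpus size
        long bytesPerEmbedding = 384L * Float.BYTES; // 384 floats × 4 bytes = 1536 B ≈ 1.5 KB
        long totalMb = count * bytesPerEmbedding / (1024 * 1024);
        System.out.println("Raw vector data for 1M embeddings: ~" + totalMb + " MB");
    }
}
```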

Performance Characteristics

Embedding Speed

  • Single text (short, <50 tokens): 10-20ms
  • Single text (medium, 50-200 tokens): 20-40ms
  • Single text (long, 200-500 tokens): 40-100ms
  • Batch (10 segments, medium length): 50-150ms (parallelized)
  • Batch (100 segments, medium length): 300-800ms (parallelized)

Note: Times vary significantly with hardware (CPU speed, cores) and JVM configuration.

Memory Usage

  • Model loading: ~90MB (one-time, shared)
  • Per embedding: ~1.5KB (384 floats)
  • Batch processing overhead: ~10-50MB temporary buffers (depends on batch size)
  • Recommended heap: Minimum 512MB for basic usage, 2-4GB for large-scale processing

Scaling Considerations

  • Horizontal scaling: Create multiple model instances (each shares the static model but has its own executor)
  • Vertical scaling: Increase heap size and thread pool size for larger batches
  • Optimal batch size: 10-100 segments for best throughput/latency tradeoff
  • Maximum practical batch size: ~1000 segments (limited by memory)

Optimization Tips

  1. Reuse model instances: Model instantiation is lightweight, but reusing instances avoids executor overhead
  2. Batch when possible: embedAll() is more efficient than multiple embed() calls
  3. Tune thread pool: Match thread pool size to workload and hardware
  4. Cache embeddings: Cache frequently-used embeddings to avoid recomputation
  5. Warm up the model: First embedding is slower due to JIT compilation; run a warmup embedding at startup

Version History

  • 1.11.0: Current version
    • Added listener support (experimental)
    • Added context classes for request/response/error tracking
    • Core embedding functionality stable

Related Packages

  • langchain4j-embeddings-all-minilm-l6-v2: Non-quantized version (higher accuracy, larger size, slower)
  • langchain4j-embeddings: Core embedding interfaces and utilities
  • langchain4j-core: Core LangChain4j types and abstractions
  • langchain4j-store-embedding: Embedding store implementations for vector databases

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j-embeddings-all-minilm-l6-v2-q@1.11.0