LangChain4j All-MiniLM-L6-v2 Quantized Embedding Model

A quantized version of the SentenceTransformers all-MiniLM-L6-v2 embedding model that runs directly within Java applications without requiring external services. This package generates 384-dimensional embeddings for text using ONNX Runtime, with the quantized model providing efficient in-process execution suitable for semantic search, similarity matching, RAG (Retrieval-Augmented Generation) applications, and other NLP tasks.

Package Information

  • Package Name: langchain4j-embeddings-all-minilm-l6-v2-q
  • Group ID: dev.langchain4j
  • Artifact ID: langchain4j-embeddings-all-minilm-l6-v2-q
  • Package Type: Maven
  • Language: Java
  • Minimum Java Version: Java 8
  • Installation: Add the following dependency to your pom.xml:
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-embeddings-all-minilm-l6-v2-q</artifactId>
    <version>1.11.0</version>
</dependency>

Or for Gradle:

implementation 'dev.langchain4j:langchain4j-embeddings-all-minilm-l6-v2-q:1.11.0'

Core Imports

import dev.langchain4j.model.embedding.onnx.allminilml6v2q.AllMiniLmL6V2QuantizedEmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.model.output.TokenUsage;
import dev.langchain4j.model.output.FinishReason;

Basic Usage

import dev.langchain4j.model.embedding.onnx.allminilml6v2q.AllMiniLmL6V2QuantizedEmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.output.Response;

// Create the embedding model with default settings
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();

// Embed a single text string
Response<Embedding> response = model.embed("Hello, world!");
Embedding embedding = response.content();

// Access the vector
float[] vector = embedding.vector();
int dimension = embedding.dimension(); // Returns 384

// Get embedding dimension without generating embeddings
int dim = model.dimension(); // Returns 384

Model Characteristics

  • Embedding Dimensions: 384
  • Maximum Recommended Token Length: 256 tokens (longer inputs are accepted, but embedding quality degrades beyond this length)
  • Pooling Mode: MEAN pooling
  • Model Type: Quantized ONNX model (smaller size, slightly reduced accuracy vs. non-quantized)
  • Thread Safety: Thread-safe and supports concurrent embedding operations
  • Normalization: All embeddings are normalized (magnitude ≈ 1.0)
  • Model Source: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
  • Special Tokens: Uses [CLS] and [SEP] tokens internally (excluded from token counts)
  • Model Files:
    • Model: all-minilm-l6-v2-q.onnx (loaded from classpath)
    • Tokenizer: all-minilm-l6-v2-q-tokenizer.json (loaded from classpath)
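Because this model returns unit-length vectors, cosine similarity between two embeddings reduces to a plain dot product. A minimal stdlib-only sketch of that comparison (the 3-dimensional sample vectors here are hypothetical stand-ins for real 384-dimensional model output):

```java
public class CosineSimilarity {

    // For unit-length vectors, cosine similarity is simply the dot product.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical unit vectors standing in for embedding.vector() results
        float[] v1 = {1f, 0f, 0f};
        float[] v2 = {0.6f, 0.8f, 0f};

        System.out.println(dot(v1, v1)); // identical vectors -> 1.0
        System.out.println(dot(v1, v2)); // partial overlap -> 0.6
    }
}
```

With real embeddings, replace the sample arrays with the `float[]` values returned by `embedding.vector()`; no normalization step is needed first, since the model's output is already normalized.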

Dependencies

This package automatically includes the following key dependencies:

  • ONNX Runtime Java: For model inference
  • Tokenizers: For text tokenization
  • LangChain4j Core: For core types and interfaces

Note: This package bundles the ONNX model files within the JAR. No additional model downloads are required at runtime.

Potential Conflicts:

  • If using multiple ONNX-based embedding models, ensure they use compatible ONNX Runtime versions
  • Memory usage scales with concurrent model instances (each model loads its own copy)

Capabilities

Model Instantiation

Create embedding model instances with default or custom executor settings.

// Default constructor - uses cached thread pool with threads = available processors
public AllMiniLmL6V2QuantizedEmbeddingModel()

// Constructor with custom executor for parallel processing control
public AllMiniLmL6V2QuantizedEmbeddingModel(java.util.concurrent.Executor executor)

Parameters:

  • executor (Executor): Custom executor for parallelizing the embedding process. Must not be null.

Throws:

  • NullPointerException: If executor is null (when using second constructor)

Default Executor Behavior:

  • Thread pool size: Number of available processors
  • Thread caching: Threads are cached for 1 second
  • Thread pool type: Cached thread pool with core thread timeout enabled

Usage Examples:

Default instantiation:

EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();

Custom executor for controlled parallelization:

import java.util.concurrent.Executors;
import java.util.concurrent.Executor;

Executor customExecutor = Executors.newFixedThreadPool(4);
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(customExecutor);

Null Handling:

  • Passing null executor throws NullPointerException
  • Always validate executor before passing to constructor

Resource Management:

  • The model instance shares a static ONNX model and tokenizer loaded once at class initialization
  • Creating multiple instances is safe but each will use the configured executor
  • Custom executors should be managed and shutdown by the caller
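Since the model never shuts down a custom executor, the caller owns its lifecycle. A stdlib-only sketch of that shutdown pattern (the model construction appears in a comment so the snippet stands alone without the langchain4j dependency):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorLifecycle {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            // EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(executor);
            // ... perform embedding work here ...
        } finally {
            // The model does not shut down custom executors; the caller must.
            executor.shutdown();
            if (!executor.awaitTermination(10, TimeUnit.SECONDS)) {
                executor.shutdownNow(); // force termination if tasks hang
            }
        }
        System.out.println(executor.isShutdown()); // prints "true"
    }
}
```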

Single Text Embedding

Embed a single text string or TextSegment to generate a 384-dimensional vector representation.

// Embed a plain string
Response<Embedding> embed(String text)

// Embed a TextSegment (text with metadata)
Response<Embedding> embed(TextSegment textSegment)

Parameters:

  • text (String): The text to embed. Can be null, empty, or any length.
  • textSegment (TextSegment): A text segment containing text and optional metadata. Can contain null text.

Returns: Response<Embedding> containing:

  • content(): The generated Embedding (never null)
  • tokenUsage(): Token usage statistics (input tokens only, excludes special tokens [CLS] and [SEP])
  • finishReason(): Always null for embedding models
  • metadata(): Empty map for this model

Null Handling:

  • null text: Treated as empty string, produces valid embedding
  • null TextSegment: May throw NullPointerException
  • Empty string: Produces valid embedding for empty input

Edge Cases:

  • Empty text: Returns valid 384-dimensional embedding
  • Very long text (>510 tokens): Automatically split into chunks that are embedded separately, then averaged into a single embedding
  • Special characters: Handled by tokenizer, no preprocessing needed
  • Unicode: Fully supported including emoji and non-Latin scripts
  • Whitespace-only text: Produces valid embedding

Performance:

  • Typical single text embedding: 10-50ms depending on text length and hardware
  • Memory per embedding: ~1.5KB (float array)

Usage Examples:

Embedding a string:

Response<Embedding> response = model.embed("The quick brown fox jumps over the lazy dog");
Embedding embedding = response.content();
float[] vector = embedding.vector(); // 384-dimensional float array

Embedding a TextSegment:

import dev.langchain4j.data.segment.TextSegment;

TextSegment segment = TextSegment.from("Machine learning is fascinating");
Response<Embedding> response = model.embed(segment);
Embedding embedding = response.content();

Handling null or empty text:

// Empty text
Response<Embedding> response1 = model.embed("");
Embedding emb1 = response1.content(); // Valid embedding

// Null text treated as empty
Response<Embedding> response2 = model.embed((String) null);
Embedding emb2 = response2.content(); // Valid embedding

Batch Text Embedding

Embed multiple text segments in a single call with automatic parallel processing for efficiency.

// Embed multiple text segments
Response<java.util.List<Embedding>> embedAll(java.util.List<TextSegment> textSegments)

Parameters:

  • textSegments (List<TextSegment>): List of text segments to embed. Must not be null or empty.

Returns: Response<List<Embedding>> containing:

  • content(): List of Embedding objects, one per input segment in the same order (never null)
  • tokenUsage(): Aggregated token usage across all segments (input tokens only, excludes special tokens)
  • finishReason(): Always null for embedding models
  • metadata(): Empty map for this model

Throws:

  • IllegalArgumentException: If textSegments is null or empty
  • NullPointerException: If textSegments list contains null elements

Behavior:

  • Single segment: Processed in the same thread (no parallelization overhead)
  • Multiple segments: Processed in parallel using the configured Executor
  • Token count excludes special tokens [CLS] and [SEP]
  • Order preservation: Output embeddings match input order exactly

Null Handling:

  • null list: Throws IllegalArgumentException
  • Empty list: Throws IllegalArgumentException
  • null elements in list: Throws NullPointerException
  • TextSegment with null text: Treated as empty string

Edge Cases:

  • Single segment: No parallelization, direct processing
  • Very large batch (1000+ segments): May cause memory pressure; consider batching
  • Mixed text lengths: Short and long texts can be mixed; parallelization handles this efficiently
  • Duplicate texts: Each is embedded independently (no deduplication)

Performance:

  • Parallelization benefit increases with batch size and text length
  • Optimal batch size: 10-100 segments depending on text length
  • Memory usage: ~1.5KB per embedding + temporary processing buffers

Usage Examples:

import dev.langchain4j.data.segment.TextSegment;
import java.util.List;
import java.util.Arrays;

List<TextSegment> segments = Arrays.asList(
    TextSegment.from("First document about artificial intelligence"),
    TextSegment.from("Second document about machine learning"),
    TextSegment.from("Third document about deep learning")
);

Response<List<Embedding>> response = model.embedAll(segments);
List<Embedding> embeddings = response.content(); // 3 embeddings

// Access individual embeddings
Embedding firstEmbedding = embeddings.get(0);
Embedding secondEmbedding = embeddings.get(1);

// Check token usage
Integer inputTokens = response.tokenUsage().inputTokenCount();

Handling errors:

try {
    List<TextSegment> segments = Arrays.asList(/* ... */);
    Response<List<Embedding>> response = model.embedAll(segments);
    // Process embeddings
} catch (IllegalArgumentException e) {
    // Handle null or empty list
    System.err.println("Invalid input: " + e.getMessage());
}

Large batch processing with memory management:

import java.util.List;
import java.util.ArrayList;

List<TextSegment> allSegments = /* large list */;
int batchSize = 50;
List<Embedding> allEmbeddings = new ArrayList<>();

// Process in batches to manage memory
for (int i = 0; i < allSegments.size(); i += batchSize) {
    int end = Math.min(i + batchSize, allSegments.size());
    List<TextSegment> batch = allSegments.subList(i, end);

    Response<List<Embedding>> response = model.embedAll(batch);
    allEmbeddings.addAll(response.content());
}

Embedding Dimension Query

Get the dimension of embeddings produced by this model without generating embeddings.

// Returns the embedding dimension
int dimension()

Returns: int - The embedding dimension (always 384 for this model)

Usage Example:

int dim = model.dimension(); // Returns 384

Use Cases:

  • Pre-allocate storage for embeddings
  • Validate compatibility with vector databases
  • Configure neural network input layers
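The dimension value works well as a fail-fast guard before writing into a fixed-width vector store. A stdlib-only sketch with a hypothetical `checkDimension` helper (the 384 is what `model.dimension()` returns for this model; the store width is an assumption for illustration):

```java
public class DimensionCheck {

    // Hypothetical guard: fail fast if the model's output width does not
    // match the vector store's configured column width.
    static void checkDimension(int modelDimension, int storeDimension) {
        if (modelDimension != storeDimension) {
            throw new IllegalStateException("Model produces " + modelDimension
                    + "-dim vectors but store expects " + storeDimension);
        }
    }

    public static void main(String[] args) {
        int modelDim = 384; // what model.dimension() returns for this model
        checkDimension(modelDim, 384); // passes

        // Pre-allocate storage for a batch of 10 embeddings
        float[][] batch = new float[10][modelDim];
        System.out.println(batch.length + "x" + batch[0].length); // prints "10x384"
    }
}
```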

Model Name

Get the name identifier of the underlying embedding model.

// Returns the model name
String modelName()

Returns: String - The model name or "unknown" if not specified by the implementation

Usage Example:

String name = model.modelName();

Note: The returned name is implementation-specific and may be "unknown" for this model.

Listener Support

Wrap the embedding model with listeners to observe and monitor embedding operations.

// Add a single listener
EmbeddingModel addListener(dev.langchain4j.model.embedding.listener.EmbeddingModelListener listener)

// Add multiple listeners
EmbeddingModel addListeners(java.util.List<dev.langchain4j.model.embedding.listener.EmbeddingModelListener> listeners)

Parameters:

  • listener (EmbeddingModelListener): A listener to observe embedding operations. If null, returns the model unchanged.
  • listeners (List<EmbeddingModelListener>): List of listeners to observe embedding operations. Called in iteration order. If null or empty, returns the model unchanged.

Returns: EmbeddingModel - An observing embedding model that dispatches events to the provided listener(s)

Null Handling:

  • null listener: Returns original model unchanged (no-op)
  • null or empty listeners list: Returns original model unchanged (no-op)
  • null elements in listeners list: Skipped during event dispatch

Annotation: @Experimental (since v1.11.0)

Behavior:

  • Listeners are called synchronously during embedding operations
  • Exceptions in listeners do not propagate to the embedding operation
  • Multiple listeners are called in the order they were added
  • Listeners can access and modify the attributes map for passing data between callbacks

Usage Example:

import dev.langchain4j.model.embedding.listener.EmbeddingModelListener;
import dev.langchain4j.model.embedding.listener.EmbeddingModelRequestContext;
import dev.langchain4j.model.embedding.listener.EmbeddingModelResponseContext;
import dev.langchain4j.model.embedding.listener.EmbeddingModelErrorContext;

// Add a listener to monitor embedding operations
EmbeddingModel observedModel = model.addListener(new EmbeddingModelListener() {
    @Override
    public void onRequest(EmbeddingModelRequestContext ctx) {
        System.out.println("Embedding " + ctx.textSegments().size() + " segments");
        // Store start time in attributes for performance tracking
        ctx.attributes().put("startTime", System.currentTimeMillis());
    }

    @Override
    public void onResponse(EmbeddingModelResponseContext ctx) {
        long startTime = (Long) ctx.attributes().get("startTime");
        long duration = System.currentTimeMillis() - startTime;
        System.out.println("Completed in " + duration + "ms");
    }

    @Override
    public void onError(EmbeddingModelErrorContext ctx) {
        System.err.println("Error: " + ctx.error().getMessage());
    }
});

// Use the observed model
Response<Embedding> response = observedModel.embed("test");

Factory Class

AllMiniLmL6V2QuantizedEmbeddingModelFactory

Factory class for creating model instances via the SPI (Service Provider Interface) mechanism.

public class AllMiniLmL6V2QuantizedEmbeddingModelFactory implements dev.langchain4j.spi.model.embedding.EmbeddingModelFactory

// Create a new model instance with default settings
public EmbeddingModel create()

Package: dev.langchain4j.model.embedding.onnx.allminilml6v2q

Returns: EmbeddingModel - A new AllMiniLmL6V2QuantizedEmbeddingModel instance with default settings (default executor)

Usage: Typically used by frameworks and service loaders rather than direct instantiation.

Example:

import dev.langchain4j.spi.model.embedding.EmbeddingModelFactory;
import java.util.ServiceLoader;

// Load via SPI
ServiceLoader<EmbeddingModelFactory> loader = ServiceLoader.load(EmbeddingModelFactory.class);
for (EmbeddingModelFactory factory : loader) {
    if (factory instanceof AllMiniLmL6V2QuantizedEmbeddingModelFactory) {
        EmbeddingModel model = factory.create();
        break;
    }
}

Core Types

FinishReason

Represents the reason why a model call finished.

public enum FinishReason {
    // The model call finished because the model decided the request was done
    STOP,

    // The call finished because the token length was reached
    LENGTH,

    // The call finished signalling a need for tool execution
    TOOL_EXECUTION,

    // The call finished signalling a need for content filtering
    CONTENT_FILTER,

    // The call finished for some other reason
    OTHER
}

Package: dev.langchain4j.model.output

Note: For embedding models, the finish reason is always null.

Embedding

Represents a dense vector embedding of text.

public class Embedding {
    // Constructor
    public Embedding(float[] vector)

    // Get the vector array
    public float[] vector()

    // Get vector as a list
    public java.util.List<Float> vectorAsList()

    // Get embedding dimension
    public int dimension()

    // Normalize the vector in-place
    public void normalize()

    // Factory methods
    public static Embedding from(float[] vector)
    public static Embedding from(java.util.List<Float> vector)
}

Package: dev.langchain4j.data.embedding

Key Methods:

  • vector(): Returns the raw float array representing the embedding. The returned array is the internal array (not a copy), so modifications will affect the embedding.
  • vectorAsList(): Returns a copy of the vector as a List<Float>. This is a defensive copy, so modifications won't affect the embedding.
  • dimension(): Returns the length of the vector (384 for this model)
  • normalize(): Normalizes the vector to unit length (magnitude = 1.0) in-place. This model already produces normalized vectors, so calling this is typically unnecessary.
  • from(float[] vector): Static factory method to create an Embedding from a float array. The array is stored directly (not copied).
  • from(List<Float> vector): Static factory method to create an Embedding from a list. The list is converted to a float array.

Null Handling:

  • Constructor with null vector: Throws NullPointerException
  • from() methods with null: Throw NullPointerException

Important Notes:

  • The vector array returned by vector() is mutable; avoid modifying it unless you intend to change the embedding
  • This model produces pre-normalized embeddings; calling normalize() is unnecessary
  • For immutable access, use vectorAsList() which returns a defensive copy
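When code outside your control might mutate a vector, copy it before sharing. A stdlib-only sketch of the copy-before-modify pattern (here a plain float[] stands in for the internal array returned by vector()):

```java
import java.util.Arrays;

public class DefensiveCopy {
    public static void main(String[] args) {
        // Stand-in for the internal array returned by embedding.vector()
        float[] internal = {0.1f, 0.2f, 0.3f};

        // Copy before modifying so the original embedding stays intact
        float[] copy = Arrays.copyOf(internal, internal.length);
        copy[0] = 99f;

        System.out.println(internal[0]); // prints "0.1" -- original untouched
        System.out.println(copy[0]);     // prints "99.0"
    }
}
```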

Metadata

Represents metadata associated with a Document or TextSegment as key-value pairs.

public class Metadata {
    // Constructors
    public Metadata()
    public Metadata(java.util.Map<String, ?> metadata)

    // Getter methods for typed access
    public String getString(String key)
    public java.util.UUID getUUID(String key)
    public Integer getInteger(String key)
    public Long getLong(String key)
    public Float getFloat(String key)
    public Double getDouble(String key)

    // Check for key existence
    public boolean containsKey(String key)

    // Add key-value pairs (fluent API)
    public Metadata put(String key, String value)
    public Metadata put(String key, java.util.UUID value)
    public Metadata put(String key, int value)
    public Metadata put(String key, long value)
    public Metadata put(String key, float value)
    public Metadata put(String key, double value)
    public Metadata putAll(java.util.Map<String, Object> metadata)

    // Remove a key
    public Metadata remove(String key)

    // Copy and convert
    public Metadata copy()
    public java.util.Map<String, Object> toMap()

    // Merge with another Metadata object
    public Metadata merge(Metadata another)

    // Factory methods
    public static Metadata from(String key, String value)
    public static Metadata from(java.util.Map<String, ?> metadata)
    public static Metadata metadata(String key, String value)
}

Package: dev.langchain4j.data.document

Supported Value Types: String, UUID, Integer, Long, Float, Double

Key Methods:

  • getString(String key), getInteger(String key), etc.: Returns typed values. Returns null if key not present or value cannot be cast to the requested type.
  • put(String key, T value): Adds key-value pair, returns this for chaining (fluent API)
  • containsKey(String key): Checks if key exists
  • toMap(): Returns copy as Map<String, Object>
  • merge(Metadata another): Merges two Metadata objects. Throws exception if keys overlap.

Null Handling:

  • null key in put/get methods: May throw NullPointerException (depends on internal map implementation)
  • null value in put methods: Stores null value
  • null Metadata in merge: Returns this unchanged
  • Getter methods return null for missing keys

Edge Cases:

  • Type mismatch in getters: Returns null if stored type doesn't match requested type
  • Duplicate keys in merge: Throws IllegalArgumentException
  • putAll with null map: May throw NullPointerException

Usage Examples:

// Create metadata with fluent API
Metadata meta = new Metadata()
    .put("source", "document.pdf")
    .put("page", 5)
    .put("score", 0.95);

// Type-safe retrieval
String source = meta.getString("source");
Integer page = meta.getInteger("page");
Float score = meta.getFloat("score");

// Null handling
Integer missing = meta.getInteger("nonexistent"); // Returns null

TextSegment

Represents a semantically meaningful segment of text with optional metadata.

public class TextSegment {
    // Constructor
    public TextSegment(String text, Metadata metadata)

    // Get the text content
    public String text()

    // Get the metadata
    public Metadata metadata()

    // Factory methods
    public static TextSegment from(String text)
    public static TextSegment from(String text, Metadata metadata)
    public static TextSegment textSegment(String text)
    public static TextSegment textSegment(String text, Metadata metadata)
}

Package: dev.langchain4j.data.segment

Key Methods:

  • text(): Returns the text content. May return null if TextSegment was created with null text.
  • metadata(): Returns the associated metadata. Never null; returns empty Metadata if none provided.
  • from(String text): Creates a TextSegment with empty metadata
  • from(String text, Metadata metadata): Creates a TextSegment with specified metadata
  • textSegment(String text): Alternative factory method (same as from(String text))
  • textSegment(String text, Metadata metadata): Alternative factory method (same as from(String text, Metadata metadata))

Null Handling:

  • null text: Accepted; stored as-is (embedding models typically treat as empty)
  • null metadata: Replaced with empty Metadata instance

Usage Examples:

// Simple text segment
TextSegment segment1 = TextSegment.from("This is a document");

// Text segment with metadata
Metadata meta = new Metadata().put("source", "doc1.txt");
TextSegment segment2 = TextSegment.from("Content here", meta);

// Accessing content
String text = segment2.text();
String source = segment2.metadata().getString("source");

Response<T>

Generic wrapper for model responses containing the generated content and metadata.

public class Response<T> {
    // Constructors
    public Response(T content)
    public Response(T content, TokenUsage tokenUsage, FinishReason finishReason)
    public Response(T content, TokenUsage tokenUsage, FinishReason finishReason, java.util.Map<String, Object> metadata)

    // Get the content
    public T content()

    // Get token usage statistics
    public TokenUsage tokenUsage()

    // Get finish reason
    public FinishReason finishReason()

    // Get response metadata
    public java.util.Map<String, Object> metadata()

    // Factory methods
    public static <T> Response<T> from(T content)
    public static <T> Response<T> from(T content, TokenUsage tokenUsage)
    public static <T> Response<T> from(T content, TokenUsage tokenUsage, FinishReason finishReason)
    public static <T> Response<T> from(T content, TokenUsage tokenUsage, FinishReason finishReason, java.util.Map<String, Object> metadata)
}

Package: dev.langchain4j.model.output

Type Parameter:

  • T: The type of content (Embedding or List<Embedding> for this model)

Key Methods:

  • content(): Returns the generated content (Embedding or List<Embedding>). Never null.
  • tokenUsage(): Returns token usage statistics. May be null if not provided.
  • finishReason(): Returns the finish reason. Always null for embedding models.
  • metadata(): Returns response metadata. Returns empty map if not provided (never null).

Null Handling:

  • null content: Stored as-is (may cause issues downstream)
  • null tokenUsage: Accepted and stored
  • null finishReason: Accepted and stored
  • null metadata map: Replaced with empty map

Usage Example:

Response<Embedding> response = model.embed("test");

Embedding emb = response.content(); // Never null
TokenUsage usage = response.tokenUsage(); // May be null
FinishReason reason = response.finishReason(); // Always null for embeddings
Map<String, Object> meta = response.metadata(); // Empty map for this model

TokenUsage

Represents token usage statistics for a model response.

public class TokenUsage {
    // Constructors
    public TokenUsage()
    public TokenUsage(Integer inputTokenCount)
    public TokenUsage(Integer inputTokenCount, Integer outputTokenCount)
    public TokenUsage(Integer inputTokenCount, Integer outputTokenCount, Integer totalTokenCount)

    // Get input token count
    public Integer inputTokenCount()

    // Get output token count (always null for embedding models)
    public Integer outputTokenCount()

    // Get total token count
    public Integer totalTokenCount()

    // Add two TokenUsage instances
    public TokenUsage add(TokenUsage that)

    // Static method to sum two TokenUsage instances
    public static TokenUsage sum(TokenUsage first, TokenUsage second)
}

Package: dev.langchain4j.model.output

Key Methods:

  • inputTokenCount(): Returns the number of input tokens. May be null. For this model, excludes special tokens [CLS] and [SEP].
  • outputTokenCount(): Returns the number of output tokens. Always null for embedding models.
  • totalTokenCount(): Returns the total token count. May be null. For this model, equals inputTokenCount when populated.
  • add(TokenUsage that): Adds the token usage of another TokenUsage instance to this one, returning a new TokenUsage with summed values. Returns this instance unchanged if that is null.
  • sum(TokenUsage first, TokenUsage second): Static method to add two TokenUsage instances. Returns the non-null instance if one is null, or a new TokenUsage with summed values if both are non-null. Returns null if both are null.

Null Handling:

  • Constructor with null values: Accepted and stored
  • add(null): Returns this unchanged
  • sum(null, null): Returns null
  • All getter methods may return null

Note: For embedding models, only inputTokenCount is populated, representing the number of tokens in the input text (excluding special tokens).

Usage Example:

Response<Embedding> response = model.embed("test text");
TokenUsage usage = response.tokenUsage();

if (usage != null) {
    Integer inputTokens = usage.inputTokenCount(); // May be null
    Integer totalTokens = usage.totalTokenCount(); // Equals inputTokens
}

// Summing token usage from multiple responses
TokenUsage total = usage1.add(usage2).add(usage3);

EmbeddingModelRequestContext

Context object containing the input text segments and attributes for embedding model requests.

public class EmbeddingModelRequestContext {
    // Get the input text segments to be embedded
    public java.util.List<TextSegment> textSegments()

    // Get the embedding model instance
    public EmbeddingModel embeddingModel()

    // Get the attributes map for passing data between listeners
    public java.util.Map<Object, Object> attributes()

    // Builder pattern for constructing instances
    public static Builder builder()

    // Inner Builder class
    public static class Builder {
        public Builder textSegments(java.util.List<TextSegment> textSegments)
        public Builder embeddingModel(EmbeddingModel embeddingModel)
        public Builder attributes(java.util.Map<Object, Object> attributes)
        public EmbeddingModelRequestContext build()
    }
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • textSegments(): Returns the list of input text segments to be embedded. Never null.
  • embeddingModel(): Returns the embedding model that will process the request. Never null.
  • attributes(): Returns a mutable map for passing data between listener methods (e.g., for logging context, timing information). Never null; modifications are visible to subsequent callbacks.
  • builder(): Static factory method to create a new Builder instance for constructing the context

Usage: This context is passed to EmbeddingModelListener.onRequest() before the embedding operation begins. Listeners can use the attributes map to store request-specific data that will be available in subsequent response or error callbacks.

Example:

@Override
public void onRequest(EmbeddingModelRequestContext ctx) {
    // Store timing information
    ctx.attributes().put("startTime", System.currentTimeMillis());

    // Log request details
    int segmentCount = ctx.textSegments().size();
    String modelName = ctx.embeddingModel().modelName();
    System.out.println("Embedding " + segmentCount + " segments with " + modelName);
}

EmbeddingModelResponseContext

Context object containing the embedding response, input text segments, and attributes for successful embedding operations.

public class EmbeddingModelResponseContext {
    // Get the embedding response containing the list of embeddings
    public Response<java.util.List<Embedding>> response()

    // Get the input text segments that were embedded
    public java.util.List<TextSegment> textSegments()

    // Get the embedding model instance
    public EmbeddingModel embeddingModel()

    // Get the attributes map for passing data between listeners
    public java.util.Map<Object, Object> attributes()

    // Builder pattern for constructing instances
    public static Builder builder()

    // Inner Builder class
    public static class Builder {
        public Builder response(Response<java.util.List<Embedding>> response)
        public Builder textSegments(java.util.List<TextSegment> textSegments)
        public Builder embeddingModel(EmbeddingModel embeddingModel)
        public Builder attributes(java.util.Map<Object, Object> attributes)
        public EmbeddingModelResponseContext build()
    }
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • response(): Returns the Response object containing the list of generated embeddings and metadata (token usage, etc.). Never null.
  • textSegments(): Returns the input text segments that were successfully embedded. Never null.
  • embeddingModel(): Returns the embedding model that processed the request. Never null.
  • attributes(): Returns the attributes map that was passed through from the request context. Never null; contains any data stored during onRequest().
  • builder(): Static factory method to create a new Builder instance for constructing the context

Usage: This context is passed to EmbeddingModelListener.onResponse() after a successful embedding operation. It provides access to both the request data and the resulting embeddings.

Example:

@Override
public void onResponse(EmbeddingModelResponseContext ctx) {
    // Retrieve timing information from request
    Long startTime = (Long) ctx.attributes().get("startTime");
    long duration = System.currentTimeMillis() - startTime;

    // Access response data
    List<Embedding> embeddings = ctx.response().content();
    TokenUsage usage = ctx.response().tokenUsage();

    System.out.println("Generated " + embeddings.size() + " embeddings in " + duration + "ms");
    System.out.println("Token usage: " + usage.inputTokenCount() + " tokens");
}

EmbeddingModelErrorContext

Context object containing the error, input text segments, and attributes when an embedding operation fails.

public class EmbeddingModelErrorContext {
    // Get the error that occurred during the embedding operation
    public Throwable error()

    // Get the input text segments that caused the error
    public java.util.List<TextSegment> textSegments()

    // Get the embedding model instance
    public EmbeddingModel embeddingModel()

    // Get the attributes map for passing data between listeners
    public java.util.Map<Object, Object> attributes()

    // Builder pattern for constructing instances
    public static Builder builder()

    // Inner Builder class
    public static class Builder {
        public Builder error(Throwable error)
        public Builder textSegments(java.util.List<TextSegment> textSegments)
        public Builder embeddingModel(EmbeddingModel embeddingModel)
        public Builder attributes(java.util.Map<Object, Object> attributes)
        public EmbeddingModelErrorContext build()
    }
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • error(): Returns the Throwable (exception or error) that occurred during the embedding operation. Never null.
  • textSegments(): Returns the input text segments that caused the error. Never null.
  • embeddingModel(): Returns the embedding model that encountered the error. Never null.
  • attributes(): Returns the attributes map that was passed through from the request context. Never null; contains any data stored during onRequest().
  • builder(): Static factory method to create a new Builder instance for constructing the context.

Usage: This context is passed to EmbeddingModelListener.onError() when an embedding operation fails. It provides access to both the request data and the error details for logging, monitoring, or recovery purposes.

Example:

@Override
public void onError(EmbeddingModelErrorContext ctx) {
    // Log error details
    Throwable error = ctx.error();
    int segmentCount = ctx.textSegments().size();

    System.err.println("Error embedding " + segmentCount + " segments: " + error.getMessage());
    error.printStackTrace();

    // Could implement retry logic, fallback, or alerting here
}

EmbeddingModelListener

Interface for listening to embedding model requests, responses, and errors.

public interface EmbeddingModelListener {
    // Called before the request is executed against the embedding model
    default void onRequest(EmbeddingModelRequestContext requestContext) {}

    // Called after a successful embedding operation completes
    default void onResponse(EmbeddingModelResponseContext responseContext) {}

    // Called when an error occurs during interaction with the embedding model
    default void onError(EmbeddingModelErrorContext errorContext) {}
}

Package: dev.langchain4j.model.embedding.listener

Annotation: @Experimental (since v1.11.0)

Key Methods:

  • onRequest(EmbeddingModelRequestContext requestContext): Called before embedding execution. The request context contains input and attributes for passing data between listeners.
  • onResponse(EmbeddingModelResponseContext responseContext): Called after successful embedding. The response context contains the response, corresponding request, and attributes.
  • onError(EmbeddingModelErrorContext errorContext): Called when an error occurs. The error context contains the error, corresponding request, and attributes.

Important Characteristics:

  • All methods have default implementations (no-op), so you only need to override the methods you need
  • Listeners are called synchronously during the embedding operation
  • Exceptions thrown in listener methods are caught and logged but do not propagate to the caller
  • The attributes map can be used to pass data between onRequest, onResponse, and onError
  • Multiple listeners are called in the order they were added

Thread Safety: Listener methods may be called from multiple threads concurrently if the model is used concurrently. Implementations should be thread-safe or synchronized as needed.

Usage Example:

public class LoggingListener implements EmbeddingModelListener {
    @Override
    public void onRequest(EmbeddingModelRequestContext ctx) {
        ctx.attributes().put("startTime", System.nanoTime());
        System.out.println("[REQUEST] Embedding " + ctx.textSegments().size() + " segments");
    }

    @Override
    public void onResponse(EmbeddingModelResponseContext ctx) {
        long duration = System.nanoTime() - (Long) ctx.attributes().get("startTime");
        System.out.println("[RESPONSE] Completed in " + (duration / 1_000_000) + "ms");
    }

    @Override
    public void onError(EmbeddingModelErrorContext ctx) {
        System.err.println("[ERROR] Failed: " + ctx.error().getMessage());
    }
}

// Usage
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();
EmbeddingModel observedModel = model.addListener(new LoggingListener());

Error Handling

Common Exceptions

This section documents exceptions that may be thrown during embedding operations.

NullPointerException

When Thrown:

  • Passing null executor to constructor
  • Passing null to embedAll() (null list)
  • Creating Embedding with null vector
  • Various internal operations with null values

Example:

try {
    EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(null);
} catch (NullPointerException e) {
    System.err.println("Executor cannot be null");
}

IllegalArgumentException

When Thrown:

  • Empty list passed to embedAll()
  • Merging Metadata with duplicate keys
  • Invalid configuration values

Example:

try {
    Response<List<Embedding>> response = model.embedAll(Collections.emptyList());
} catch (IllegalArgumentException e) {
    System.err.println("Cannot embed empty list: " + e.getMessage());
}

OutOfMemoryError

When Thrown:

  • Embedding very large batches (thousands of segments)
  • Very long text inputs that create large token sequences
  • Concurrent embedding operations exhausting heap

Prevention:

// Batch processing to prevent OOM
int batchSize = 100;
for (int i = 0; i < allSegments.size(); i += batchSize) {
    int end = Math.min(i + batchSize, allSegments.size());
    List<TextSegment> batch = allSegments.subList(i, end);

    try {
        Response<List<Embedding>> response = model.embedAll(batch);
        // Process batch
    } catch (OutOfMemoryError e) {
        // Reduce batch size and retry
        System.err.println("OOM error, reducing batch size");
        break;
    }
}

Model Loading Exceptions

When Thrown:

  • Model files missing from classpath (corrupted JAR)
  • ONNX Runtime initialization failures
  • Incompatible ONNX Runtime version

Note: These exceptions typically occur during class initialization and cannot be caught in normal operation.

Example:

try {
    EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();
} catch (ExceptionInInitializerError e) {
    System.err.println("Failed to initialize model: " + e.getCause().getMessage());
    // This indicates a serious environment problem (missing dependencies, etc.)
}

Error Handling Patterns

Graceful Degradation

public List<Embedding> embedWithFallback(List<TextSegment> segments) {
    try {
        Response<List<Embedding>> response = model.embedAll(segments);
        return response.content();
    } catch (OutOfMemoryError e) {
        // Fall back to sequential processing
        List<Embedding> embeddings = new ArrayList<>();
        for (TextSegment segment : segments) {
            Response<Embedding> response = model.embed(segment);
            embeddings.add(response.content());
        }
        return embeddings;
    } catch (Exception e) {
        System.err.println("Embedding failed: " + e.getMessage());
        // Return empty list or throw custom exception
        return Collections.emptyList();
    }
}

Retry Logic

public Response<Embedding> embedWithRetry(String text, int maxRetries) {
    int attempts = 0;
    Exception lastException = null;

    while (attempts < maxRetries) {
        try {
            return model.embed(text);
        } catch (Exception e) {
            lastException = e;
            attempts++;
            if (attempts < maxRetries) {
                try {
                    Thread.sleep(1000L * attempts); // Linear backoff (1s, 2s, 3s, ...)
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Interrupted during retry", ie);
                }
            }
        }
    }

    throw new RuntimeException("Failed after " + maxRetries + " attempts", lastException);
}
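The `Thread.sleep(1000 * attempts)` delay in embedWithRetry grows linearly (1s, 2s, 3s, ...). If genuinely exponential backoff is wanted, the delay can be computed separately. This is a sketch; the cap value and method name are illustrative, not part of the package:

```java
public class Backoff {
    // Exponential backoff with a cap: attempt 1 -> 1s, 2 -> 2s, 3 -> 4s, ...
    // The shift amount is clamped so the computation cannot overflow.
    public static long delayMs(int attempt, long maxDelayMs) {
        long delay = 1000L << Math.min(attempt - 1, 20); // attempt is 1-based
        return Math.min(delay, maxDelayMs);
    }
}
```

A cap matters in practice: without it, a handful of retries can push the delay into minutes.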

Listener-Based Error Handling

public class RetryListener implements EmbeddingModelListener {
    private final int maxRetries;
    private final Map<Object, Integer> attemptCounts = new ConcurrentHashMap<>();

    public RetryListener(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    @Override
    public void onError(EmbeddingModelErrorContext ctx) {
        Object requestId = ctx.attributes().get("requestId");
        int attempts = attemptCounts.getOrDefault(requestId, 0) + 1;

        if (attempts < maxRetries) {
            attemptCounts.put(requestId, attempts);
            System.out.println("Retrying (attempt " + attempts + ")");
            // Trigger retry (would need custom retry logic)
        } else {
            System.err.println("Failed after " + maxRetries + " attempts");
            attemptCounts.remove(requestId);
        }
    }
}

Troubleshooting

Common Issues and Solutions

Issue: Model initialization fails with ClassNotFoundException

Cause: ONNX Runtime or other dependencies are missing from classpath

Solution:

<!-- Ensure all transitive dependencies are resolved -->
<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-embeddings-all-minilm-l6-v2-q</artifactId>
    <version>1.11.0</version>
</dependency>
<!-- No exclusions should be applied to ONNX Runtime -->

Issue: OutOfMemoryError when embedding large batches

Cause: Insufficient heap memory for large batch processing

Solution:

# Increase JVM heap size
java -Xmx4g -jar your-application.jar

# Or use batch processing in code
int batchSize = 50; // Adjust based on available memory

Issue: Embeddings are not consistent across runs

Cause: This model is deterministic; inconsistency suggests concurrent modification or model reloading

Solution:

// Ensure model instance is reused (thread-safe)
private static final EmbeddingModel MODEL = new AllMiniLmL6V2QuantizedEmbeddingModel();

// Do not modify embedding vectors after generation
Embedding emb = model.embed("text").content();
float[] vector = emb.vector();
// Do not modify 'vector' array

Issue: Poor embedding quality for long documents

Cause: Text exceeds recommended 256 token limit

Solution:

// Split long documents into chunks
public List<Embedding> embedLongDocument(String longText) {
    // Split into ~200 token chunks (roughly 150 words)
    String[] chunks = splitIntoChunks(longText, 150);

    List<TextSegment> segments = Arrays.stream(chunks)
        .map(TextSegment::from)
        .collect(Collectors.toList());

    Response<List<Embedding>> response = model.embedAll(segments);
    return response.content();
}
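The `splitIntoChunks` helper referenced above is not part of this package. A naive word-count-based sketch could look like the following; real applications may prefer sentence- or paragraph-aware splitting:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSplitter {
    // Groups whitespace-separated words into fixed-size chunks.
    public static String[] splitIntoChunks(String text, int wordsPerChunk) {
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < words.length; i += wordsPerChunk) {
            int end = Math.min(i + wordsPerChunk, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks.toArray(new String[0]);
    }
}
```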

Issue: Slow embedding performance

Cause: Sequential processing or suboptimal executor configuration

Solution:

// Use custom executor with appropriate thread pool size
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(threads);
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(executor);

// Use batch embedding for multiple segments
Response<List<Embedding>> response = model.embedAll(segments); // Parallelized

Issue: Null pointer exceptions when accessing response fields

Cause: Optional fields (tokenUsage, finishReason, metadata) may be null

Solution:

Response<Embedding> response = model.embed("text");

// Always check for null
TokenUsage usage = response.tokenUsage();
if (usage != null && usage.inputTokenCount() != null) {
    int tokens = usage.inputTokenCount();
    System.out.println("Used " + tokens + " tokens");
}

Debugging Tips

  1. Enable Listener Logging:
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel()
    .addListener(new EmbeddingModelListener() {
        @Override
        public void onRequest(EmbeddingModelRequestContext ctx) {
            System.out.println("Request: " + ctx.textSegments().size() + " segments");
        }

        @Override
        public void onResponse(EmbeddingModelResponseContext ctx) {
            System.out.println("Response: " + ctx.response().content().size() + " embeddings");
        }

        @Override
        public void onError(EmbeddingModelErrorContext ctx) {
            ctx.error().printStackTrace();
        }
    });
  2. Check Model Dimension:
int dim = model.dimension(); // Should always be 384
assert dim == 384 : "Unexpected dimension: " + dim;
  3. Validate Embeddings:
Embedding emb = model.embed("test").content();
float[] vector = emb.vector();

// Check dimension
assert vector.length == 384;

// Check normalization (magnitude ≈ 1.0)
double magnitude = 0.0;
for (float v : vector) {
    magnitude += v * v;
}
magnitude = Math.sqrt(magnitude);
System.out.println("Magnitude: " + magnitude); // Should be ≈ 1.0
  4. Monitor Memory Usage:
Runtime runtime = Runtime.getRuntime();
long usedMemory = runtime.totalMemory() - runtime.freeMemory();
System.out.println("Memory used: " + (usedMemory / 1024 / 1024) + " MB");

Advanced Usage

Handling Long Text

The model accepts input of any length, but embedding quality degrades beyond roughly 256 tokens. For inputs longer than 510 tokens, the model automatically splits the text into chunks, embeds each chunk, and averages the resulting embeddings.

// Long text is automatically handled
String longText = "...text with more than 510 tokens...";
Response<Embedding> response = model.embed(longText);
Embedding embedding = response.content(); // Still 384-dimensional, averaged if needed

Best Practice for Long Documents:

public List<Embedding> embedLongDocumentWithChunking(String document) {
    // Split document into semantic chunks (e.g., paragraphs or sentences)
    List<String> chunks = splitIntoSemanticChunks(document, 200); // ~200 words per chunk

    List<TextSegment> segments = chunks.stream()
        .map(TextSegment::from)
        .collect(Collectors.toList());

    Response<List<Embedding>> response = model.embedAll(segments);
    return response.content();
}

// For document-level embedding, average the chunk embeddings
public Embedding getDocumentEmbedding(List<Embedding> chunkEmbeddings) {
    int dim = chunkEmbeddings.get(0).dimension();
    float[] avgVector = new float[dim];

    for (Embedding emb : chunkEmbeddings) {
        float[] vector = emb.vector();
        for (int i = 0; i < dim; i++) {
            avgVector[i] += vector[i];
        }
    }

    for (int i = 0; i < dim; i++) {
        avgVector[i] /= chunkEmbeddings.size();
    }

    Embedding docEmbedding = Embedding.from(avgVector);
    docEmbedding.normalize(); // Normalize after averaging
    return docEmbedding;
}

Computing Similarity

Use cosine similarity to compare embeddings (since they're normalized, dot product equals cosine similarity).

import dev.langchain4j.store.embedding.CosineSimilarity;
import dev.langchain4j.store.embedding.RelevanceScore;

Embedding emb1 = model.embed("Hello world").content();
Embedding emb2 = model.embed("Hi there").content();

// Compute cosine similarity
double cosineSim = CosineSimilarity.between(emb1, emb2);

// Convert to relevance score (0 to 1 scale)
double relevance = RelevanceScore.fromCosineSimilarity(cosineSim);

// Manual cosine similarity (since vectors are normalized, just dot product)
float[] v1 = emb1.vector();
float[] v2 = emb2.vector();
double dotProduct = 0.0;
for (int i = 0; i < v1.length; i++) {
    dotProduct += v1[i] * v2[i];
}
// dotProduct is the cosine similarity (vectors are unit length)
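The `RelevanceScore.fromCosineSimilarity` conversion used above is, in effect, a linear rescaling from [-1, 1] to [0, 1]. The arithmetic is shown here for illustration; the library call remains the authoritative implementation:

```java
public class Relevance {
    // Linear mapping from cosine similarity in [-1, 1] to a score in [0, 1].
    public static double fromCosine(double cosine) {
        return (cosine + 1.0) / 2.0;
    }
}
```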

Similarity Thresholds:

  • High similarity: cosine > 0.7 (similar meaning)
  • Medium similarity: 0.4 < cosine < 0.7 (related topics)
  • Low similarity: cosine < 0.4 (different topics)
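These cut-offs are heuristics rather than properties of the model. As a sketch, they can be wrapped in a small helper for readability:

```java
public class SimilarityBuckets {
    // Maps a cosine similarity to the heuristic buckets listed above.
    // The 0.7 and 0.4 thresholds are rules of thumb; tune them per use case.
    public static String classify(double cosine) {
        if (cosine > 0.7) return "similar meaning";
        if (cosine > 0.4) return "related topics";
        return "different topics";
    }
}
```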

Thread Safety and Concurrency

The model is thread-safe and supports concurrent embedding operations:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.List;
import java.util.ArrayList;

EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel();
ExecutorService executor = Executors.newFixedThreadPool(10);
List<Future<Embedding>> futures = new ArrayList<>();

// Submit multiple embedding tasks concurrently
for (String text : texts) {
    futures.add(executor.submit(() -> model.embed(text).content()));
}

// Collect results
for (Future<Embedding> future : futures) {
    Embedding embedding = future.get();
    // Process embedding
}

executor.shutdown();

Thread Safety Notes:

  • The model instance is fully thread-safe
  • Multiple threads can call embed() or embedAll() concurrently
  • The underlying ONNX model is loaded once and shared (static)
  • Each embedding operation is independent

Custom Parallelization

Control the parallel processing behavior by providing a custom executor:

import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;

// Create custom executor with specific thread pool size
ExecutorService customExecutor = Executors.newFixedThreadPool(8);

// Pass to model constructor
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel(customExecutor);

// When embedding multiple segments, uses custom executor
Response<List<Embedding>> response = model.embedAll(segments);

// Don't forget to shutdown when done (or use try-with-resources pattern)
customExecutor.shutdown();

Executor Selection Guidelines:

  • Fixed thread pool: Best for consistent workload, predictable resource usage
  • Cached thread pool (default): Good for variable workload, may create many threads
  • Single thread executor: For sequential processing, no parallelism
  • ForkJoinPool: Good for recursive divide-and-conquer tasks

Performance Tuning:

// For CPU-bound tasks, use core count
int threads = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(threads);

// For mixed workloads, use slightly more threads (shown with distinct
// variable names so both alternatives can coexist in one scope)
int mixedThreads = Runtime.getRuntime().availableProcessors() + 2;
ExecutorService mixedExecutor = Executors.newFixedThreadPool(mixedThreads);

Integration with Vector Databases

Storing Embeddings

import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

// Create embedding store
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();

// Embed and store documents
List<TextSegment> documents = Arrays.asList(
    TextSegment.from("First document", new Metadata().put("id", "doc1")),
    TextSegment.from("Second document", new Metadata().put("id", "doc2"))
);

Response<List<Embedding>> response = model.embedAll(documents);
List<Embedding> embeddings = response.content();

// Store embeddings with their documents
for (int i = 0; i < documents.size(); i++) {
    embeddingStore.add(embeddings.get(i), documents.get(i));
}

Similarity Search

// Query embedding
Embedding queryEmbedding = model.embed("search query").content();

// Find similar documents
int maxResults = 5;
double minScore = 0.7;
List<EmbeddingMatch<TextSegment>> matches = embeddingStore.findRelevant(
    queryEmbedding,
    maxResults,
    minScore
);

// Process results
for (EmbeddingMatch<TextSegment> match : matches) {
    TextSegment segment = match.embedded();
    double score = match.score();
    System.out.println("Score: " + score + ", Text: " + segment.text());
}

Caching Embeddings

To avoid recomputing embeddings for the same text:

import java.util.concurrent.ConcurrentHashMap;
import java.util.Map;

public class CachedEmbeddingModel {
    private final EmbeddingModel model;
    private final Map<String, Embedding> cache;

    public CachedEmbeddingModel(EmbeddingModel model) {
        this.model = model;
        this.cache = new ConcurrentHashMap<>();
    }

    public Embedding embed(String text) {
        return cache.computeIfAbsent(text, t ->
            model.embed(t).content()
        );
    }

    public void clearCache() {
        cache.clear();
    }

    public int getCacheSize() {
        return cache.size();
    }
}

// Usage
EmbeddingModel baseModel = new AllMiniLmL6V2QuantizedEmbeddingModel();
CachedEmbeddingModel cachedModel = new CachedEmbeddingModel(baseModel);

Embedding emb1 = cachedModel.embed("test"); // Computed
Embedding emb2 = cachedModel.embed("test"); // Retrieved from cache
assert emb1 == emb2; // Same instance

Cache Considerations:

  • Memory usage: Each embedding is ~1.5KB (384 floats × 4 bytes)
  • Cache eviction: Implement LRU or size-based eviction for large caches
  • Thread safety: Use ConcurrentHashMap for concurrent access
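One minimal eviction sketch uses `LinkedHashMap` in access order. The class name and the capacity are illustrative; for concurrent use, wrap it with `Collections.synchronizedMap` or guard it externally, since `LinkedHashMap` itself is not thread-safe:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded LRU cache: evicts the least-recently-accessed entry
// once the configured capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

At ~1.5KB per embedding, even a 10,000-entry cache stays around 15MB of vector data.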

Batch Processing Strategies

Fixed-Size Batching

public List<Embedding> embedAllInBatches(List<String> texts, int batchSize) {
    List<Embedding> allEmbeddings = new ArrayList<>();

    for (int i = 0; i < texts.size(); i += batchSize) {
        int end = Math.min(i + batchSize, texts.size());
        List<TextSegment> batch = texts.subList(i, end).stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        Response<List<Embedding>> response = model.embedAll(batch);
        allEmbeddings.addAll(response.content());
    }

    return allEmbeddings;
}

Adaptive Batching

public List<Embedding> embedAllAdaptive(List<String> texts) {
    int batchSize = 100;
    List<Embedding> allEmbeddings = new ArrayList<>();

    for (int i = 0; i < texts.size(); i += batchSize) {
        int end = Math.min(i + batchSize, texts.size());
        List<TextSegment> batch = texts.subList(i, end).stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        try {
            Response<List<Embedding>> response = model.embedAll(batch);
            allEmbeddings.addAll(response.content());
        } catch (OutOfMemoryError e) {
            // Halve the batch size and retry the current batch
            batchSize = batchSize / 2;
            if (batchSize == 0) {
                throw new IllegalStateException("Cannot embed even a single segment", e);
            }
            i -= batchSize; // Loop increment restores i, so the failed batch is retried
            System.err.println("OOM: reducing batch size to " + batchSize);
        }
    }

    return allEmbeddings;
}

Performance Monitoring

public class PerformanceMonitoringListener implements EmbeddingModelListener {
    private final AtomicLong totalRequests = new AtomicLong(0);
    private final AtomicLong totalTime = new AtomicLong(0);
    private final AtomicLong totalTokens = new AtomicLong(0);

    @Override
    public void onRequest(EmbeddingModelRequestContext ctx) {
        totalRequests.incrementAndGet();
        ctx.attributes().put("startTime", System.nanoTime());
    }

    @Override
    public void onResponse(EmbeddingModelResponseContext ctx) {
        long startTime = (Long) ctx.attributes().get("startTime");
        long duration = System.nanoTime() - startTime;
        totalTime.addAndGet(duration);

        TokenUsage usage = ctx.response().tokenUsage();
        if (usage != null && usage.inputTokenCount() != null) {
            totalTokens.addAndGet(usage.inputTokenCount());
        }
    }

    public void printStats() {
        long requests = totalRequests.get();
        if (requests == 0) {
            System.out.println("No requests recorded yet");
            return;
        }
        long avgTimeMs = totalTime.get() / requests / 1_000_000;
        double avgTokens = (double) totalTokens.get() / requests;

        System.out.println("Total requests: " + requests);
        System.out.println("Average time: " + avgTimeMs + "ms");
        System.out.println("Average tokens: " + avgTokens);
    }
}

// Usage
PerformanceMonitoringListener monitor = new PerformanceMonitoringListener();
EmbeddingModel model = new AllMiniLmL6V2QuantizedEmbeddingModel()
    .addListener(monitor);

// ... use model ...

monitor.printStats();

Implementation Notes

  • Model Loading: The ONNX model file (all-minilm-l6-v2-q.onnx) and tokenizer (all-minilm-l6-v2-q-tokenizer.json) are loaded from the JAR's classpath during class initialization
  • Model Download: The model is automatically downloaded from HuggingFace during the Maven build process and bundled into the JAR
  • Static Model Instance: The model and tokenizer are loaded once as static instances and shared across all instances of AllMiniLmL6V2QuantizedEmbeddingModel. This means:
    • First instantiation triggers model loading (one-time cost)
    • Subsequent instantiations are fast (no reload)
    • Multiple instances share the same underlying model (memory efficient)
  • Token Counting: Token counts exclude the special tokens [CLS] and [SEP] that BERT models use internally
  • Vector Normalization: All embeddings produced are already normalized to unit length (magnitude ≈ 1.0). Calling normalize() on embeddings from this model is unnecessary.
  • Quantization: This is a quantized version of the model, providing smaller size and faster inference with a slight reduction in accuracy compared to the non-quantized version
  • Memory Footprint:
    • Model size: ~90MB (loaded once, shared across instances)
    • Per-embedding memory: ~1.5KB (384 floats × 4 bytes)
    • Temporary processing buffers: Varies with batch size
  • ONNX Runtime: Uses ONNX Runtime Java bindings for model inference. The runtime is automatically included as a transitive dependency.
  • Tokenizer: Uses a fast WordPiece tokenizer compatible with BERT-based models
  • No External Services: Runs entirely in-process; no network calls or external services required
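The ~1.5KB per-embedding figure above can be sanity-checked with simple arithmetic; this back-of-envelope estimate covers only raw vector data and ignores object headers and store overhead (the one-million count is hypothetical):

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long count = 1_000_000L;                     // hypothetical corpus size
        long bytesPerEmbedding = 384L * Float.BYTES; // 384 floats × 4 bytes = 1536 B ≈ 1.5 KB
        long totalMb = count * bytesPerEmbedding / (1024 * 1024);
        System.out.println("Raw vector data for 1M embeddings: ~" + totalMb + " MB");
    }
}
```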

Performance Characteristics

Embedding Speed

  • Single text (short, <50 tokens): 10-20ms
  • Single text (medium, 50-200 tokens): 20-40ms
  • Single text (long, 200-500 tokens): 40-100ms
  • Batch (10 segments, medium length): 50-150ms (parallelized)
  • Batch (100 segments, medium length): 300-800ms (parallelized)

Note: Times vary significantly with hardware (CPU speed, cores) and JVM configuration.

Memory Usage

  • Model loading: ~90MB (one-time, shared)
  • Per embedding: ~1.5KB (384 floats)
  • Batch processing overhead: ~10-50MB temporary buffers (depends on batch size)
  • Recommended heap: Minimum 512MB for basic usage, 2-4GB for large-scale processing

Scaling Considerations

  • Horizontal scaling: Create multiple model instances (each shares the static model but has its own executor)
  • Vertical scaling: Increase heap size and thread pool size for larger batches
  • Optimal batch size: 10-100 segments for best throughput/latency tradeoff
  • Maximum practical batch size: ~1000 segments (limited by memory)

Optimization Tips

  1. Reuse model instances: Model instantiation is lightweight, but reusing instances avoids executor overhead
  2. Batch when possible: embedAll() is more efficient than multiple embed() calls
  3. Tune thread pool: Match thread pool size to workload and hardware
  4. Cache embeddings: Cache frequently-used embeddings to avoid recomputation
  5. Warm up the model: First embedding is slower due to JIT compilation; run a warmup embedding at startup

Version History

  • 1.11.0: Current version
    • Added listener support (experimental)
    • Added context classes for request/response/error tracking
    • Core embedding functionality stable

Related Packages

  • langchain4j-embeddings-all-minilm-l6-v2: Non-quantized version (higher accuracy, larger size, slower)
  • langchain4j-embeddings: Core embedding interfaces and utilities
  • langchain4j-core: Core LangChain4j types and abstractions
  • langchain4j-store-embedding: Embedding store implementations for vector databases

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j-embeddings-all-minilm-l6-v2-q@1.11.0