tessl/maven-dev-langchain4j--langchain4j

Build LLM-powered applications in Java with support for chatbots, agents, RAG, tools, and much more

Chat and Language Models

Core model interfaces for interacting with language models. These interfaces provide the foundation for all LLM interactions in LangChain4j, supporting both synchronous and streaming chat, embeddings, and simple text generation.

Capabilities

ChatModel

Main interface for synchronous chat interactions with language models. Supports single-turn and multi-turn conversations with full message history.

package dev.langchain4j.model.chat;

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.model.chat.listener.ChatModelListener;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.request.ChatRequestParameters;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.ModelProvider;
import java.util.List;
import java.util.Set;

/**
 * Represents a language model that has a chat API
 */
public interface ChatModel {
    /**
     * Main API to interact with the chat model
     * @param chatRequest Request containing all inputs to the LLM
     * @return Response containing all outputs from the LLM
     */
    ChatResponse chat(ChatRequest chatRequest);

    /**
     * Convenience method for simple text-based chat
     * @param userMessage User message text
     * @return Response text from the LLM
     */
    String chat(String userMessage);

    /**
     * Chat with variable number of messages
     * @param messages Messages to send
     * @return Response from the LLM
     */
    ChatResponse chat(ChatMessage... messages);

    /**
     * Chat with list of messages
     * @param messages List of messages to send
     * @return Response from the LLM
     */
    ChatResponse chat(List<ChatMessage> messages);

    /**
     * Get supported capabilities
     * @return Set of capabilities this model supports
     */
    Set<Capability> supportedCapabilities();

    /**
     * Get default request parameters
     * @return Default parameters for requests
     */
    ChatRequestParameters defaultRequestParameters();

    /**
     * Get registered listeners
     * @return List of chat model listeners
     */
    List<ChatModelListener> listeners();

    /**
     * Get model provider
     * @return Provider of this model
     */
    ModelProvider provider();
}

Thread Safety:

  • Most ChatModel implementations are thread-safe and can be shared across multiple threads
  • However, individual ChatRequest and ChatResponse objects are NOT thread-safe
  • Best practice: Create one ChatModel instance per application and reuse it
  • Each thread should create its own ChatRequest and process its own ChatResponse
  • Listeners may be invoked concurrently - ensure listener implementations are thread-safe

Common Pitfalls:

  • DON'T create a new ChatModel instance for every request (expensive - involves HTTP client setup)
  • DON'T share ChatRequest/ChatResponse objects between threads
  • DON'T ignore the ChatResponse metadata - it contains token usage and finish reasons
  • DON'T use the simple chat(String) method if you need structured responses or metadata
  • DON'T forget to check supportedCapabilities() before using advanced features

Edge Cases:

  • Empty message list returns model-specific default response or error
  • Very long message histories may exceed context window - use TokenCountEstimator
  • Some models may return partial responses if interrupted or rate limited
  • Null or empty user message may throw IllegalArgumentException (implementation-specific)
  • Multiple consecutive system messages may be merged or rejected by some providers

Performance Notes:

  • Connection pooling is handled internally - reuse ChatModel instances
  • Batch multiple independent requests in separate threads rather than sequentially
  • Consider using StreamingChatModel for user-facing applications (better UX)
  • Pre-validate inputs to avoid network round-trips for invalid requests
  • Use ChatModelListener for monitoring without code changes
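
Combining the notes above (build one instance, parallelize independent requests), a sketch assuming `chatModel` is an already-built, thread-safe implementation:

```java
import dev.langchain4j.model.chat.ChatModel;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: fan out independent requests over one shared ChatModel.
// Assumes chatModel was built once at startup and is thread-safe.
ExecutorService executor = Executors.newFixedThreadPool(4);
List<String> prompts = List.of("Summarize document A", "Summarize document B");

List<Future<String>> futures = prompts.stream()
    .map(p -> executor.submit(() -> chatModel.chat(p))) // each task is independent
    .toList();

for (Future<String> f : futures) {
    System.out.println(f.get()); // throws ExecutionException if a request failed
}
executor.shutdown();
```

Each submitted task sends its own request; only the model instance is shared.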

Cost Considerations:

  • Every chat() call consumes tokens (input + output)
  • Get actual token counts from ChatResponse.tokenUsage()
  • Use estimateTokenCount() (if model implements TokenCountEstimator) before expensive calls
  • Shorter prompts = lower cost, but may reduce quality
  • System messages count toward token usage
  • Consider caching responses for repeated queries
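
When a model does not implement TokenCountEstimator, a character-based heuristic can serve as a rough pre-flight estimate. The ~4-characters-per-token ratio is a rule of thumb for English text, and the price constant below is a hypothetical placeholder, not a real rate:

```java
// Rough pre-flight cost estimate when no TokenCountEstimator is available.
// The 4-characters-per-token ratio is an English-text rule of thumb;
// PRICE_PER_1K_INPUT_TOKENS is a hypothetical placeholder rate.
public class CostEstimator {
    static final double PRICE_PER_1K_INPUT_TOKENS = 0.001; // hypothetical

    public static int roughTokenCount(String text) {
        return (int) Math.ceil(text.length() / 4.0);
    }

    public static double roughInputCost(String prompt) {
        return roughTokenCount(prompt) / 1000.0 * PRICE_PER_1K_INPUT_TOKENS;
    }
}
```

Always reconcile estimates against the actual counts in ChatResponse.tokenUsage().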

Exception Handling:

  • RuntimeException or subclasses for network errors, API errors, rate limiting
  • Common exceptions:
    • IOException wrapped in runtime exception for network failures
    • IllegalArgumentException for invalid inputs (null messages, empty text)
    • Provider-specific exceptions for API errors (e.g., OpenAI rate limit errors)
    • Timeout exceptions for slow responses
  • Always wrap calls in try-catch for production code
  • Check ChatResponse.finishReason() - may indicate truncation or content filtering

Usage Example (Basic):

import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.data.message.SystemMessage;

// Simple text chat
String response = chatModel.chat("Hello, how are you?");

// Structured chat with request
ChatRequest request = ChatRequest.builder()
    .messages(
        SystemMessage.from("You are a helpful assistant."),
        UserMessage.from("What is the capital of France?")
    )
    .build();

ChatResponse chatResponse = chatModel.chat(request);
String answer = chatResponse.aiMessage().text();

Usage Example (Production-Ready with Error Handling):

import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.data.message.SystemMessage;
import dev.langchain4j.model.output.FinishReason;
import java.util.concurrent.TimeUnit;

public class SafeChatExample {
    private final ChatModel chatModel;

    public String chatWithRetry(String userMessage, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                ChatRequest request = ChatRequest.builder()
                    .messages(
                        SystemMessage.from("You are a helpful assistant."),
                        UserMessage.from(userMessage)
                    )
                    .build();

                ChatResponse response = chatModel.chat(request);

                // Check finish reason
                if (response.finishReason() == FinishReason.STOP) {
                    // Log token usage for cost tracking
                    System.out.println("Tokens used: " + response.tokenUsage());
                    return response.aiMessage().text();
                } else if (response.finishReason() == FinishReason.LENGTH) {
                    // Non-retryable: resending the same request would truncate again
                    throw new IllegalStateException("Response truncated due to length");
                } else if (response.finishReason() == FinishReason.CONTENT_FILTER) {
                    // Non-retryable: the content itself was rejected
                    throw new IllegalStateException("Content filtered by moderation");
                } else {
                    throw new IllegalStateException(
                        "Unexpected finish reason: " + response.finishReason());
                }

            } catch (IllegalArgumentException | IllegalStateException e) {
                // Invalid input or non-retryable outcome - don't retry
                throw e;
            } catch (RuntimeException e) {
                if (attempt == maxRetries - 1) {
                    throw new RuntimeException("Failed after " + maxRetries + " attempts", e);
                }
                // Exponential backoff for rate limiting
                try {
                    TimeUnit.SECONDS.sleep((long) Math.pow(2, attempt));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Interrupted during retry", ie);
                }
            }
        }
        throw new IllegalArgumentException("maxRetries must be positive");
    }
}

Capability Checking Pattern:

import dev.langchain4j.model.chat.Capability;
import dev.langchain4j.model.chat.request.ChatRequestParameters;
import dev.langchain4j.model.chat.request.ResponseFormat;

// Check if model supports JSON schema responses before using
if (chatModel.supportedCapabilities().contains(Capability.RESPONSE_FORMAT_JSON_SCHEMA)) {
    // Use structured output (for strict schema enforcement, attach a JsonSchema
    // to the ResponseFormat rather than using plain JSON mode)
    ChatRequestParameters params = ChatRequestParameters.builder()
        .responseFormat(ResponseFormat.JSON)
        .build();
} else {
    // Fallback to parsing text responses
    System.out.println("Model doesn't support JSON schema, using text parsing");
}

Related APIs:

  • StreamingChatModel - For streaming responses
  • ChatMessage - Message types (System, User, AI, Tool)
  • ChatRequest - Request builder with parameters
  • ChatResponse - Response with metadata
  • ChatModelListener - For monitoring and logging
  • TokenCountEstimator - For estimating costs before calls

StreamingChatModel

Interface for streaming chat interactions where responses are delivered token-by-token in real-time.

package dev.langchain4j.model.chat;

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.model.chat.listener.ChatModelListener;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.request.ChatRequestParameters;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.ModelProvider;
import java.util.List;
import java.util.Set;

/**
 * Represents a language model that can stream responses token-by-token
 */
public interface StreamingChatModel {
    /**
     * Main API to interact with the streaming chat model
     * @param chatRequest Request containing all inputs to the LLM
     * @param handler Handler for streaming response
     */
    void chat(ChatRequest chatRequest, StreamingChatResponseHandler handler);

    /**
     * Convenience method for simple text-based streaming chat
     * @param userMessage User message text
     * @param handler Handler for streaming response
     */
    void chat(String userMessage, StreamingChatResponseHandler handler);

    /**
     * Stream chat with list of messages
     * @param messages List of messages to send
     * @param handler Handler for streaming response
     */
    void chat(List<ChatMessage> messages, StreamingChatResponseHandler handler);

    /**
     * Get supported capabilities
     * @return Set of capabilities this model supports
     */
    Set<Capability> supportedCapabilities();

    /**
     * Get default request parameters
     * @return Default parameters for requests
     */
    ChatRequestParameters defaultRequestParameters();

    /**
     * Get registered listeners
     * @return List of chat model listeners
     */
    List<ChatModelListener> listeners();

    /**
     * Get model provider
     * @return Provider of this model
     */
    ModelProvider provider();
}

Thread Safety:

  • StreamingChatModel implementations are typically thread-safe
  • Handler callbacks may be invoked on different threads - ensure handlers are thread-safe
  • DON'T share handler instances between concurrent calls
  • Each streaming request should have its own handler instance
  • Buffering in handlers must be synchronized if accessed from multiple threads

Common Pitfalls:

  • DON'T block in onPartialResponse() - it delays subsequent tokens
  • DON'T forget to implement onError() - silent failures are hard to debug
  • DON'T assume tokens arrive as complete words - may receive partial words
  • DON'T modify shared state in handlers without synchronization
  • DON'T forget that onCompleteResponse() is called AFTER all partial responses
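
The "tokens are not complete words" pitfall above means partial responses must be appended verbatim, never joined with spaces. A minimal accumulator sketch (the token split shown in the test is illustrative, not what any particular model emits):

```java
// Accumulates streamed chunks verbatim. Only the concatenation of all
// partial responses is meaningful text - individual chunks may split
// words at arbitrary points.
public class TokenAccumulator {
    private final StringBuilder buffer = new StringBuilder();

    public void onPartialResponse(String token) {
        buffer.append(token); // append as-is; never insert separators
    }

    public String text() {
        return buffer.toString();
    }
}
```

For example, "internationalization" might arrive as "inter", "national", "ization".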

Edge Cases:

  • Handler methods may never be called if request fails immediately
  • onPartialResponse() may be called only once if response is very short
  • Empty responses will call onCompleteResponse() without any onPartialResponse() calls
  • Network interruptions may cause onError() mid-stream
  • Some models may send whitespace-only tokens at the beginning

Performance Notes:

  • Streaming provides better perceived performance (lower time-to-first-token)
  • Actual total time may be slightly higher than non-streaming due to overhead
  • Don't buffer all tokens if you're displaying them - defeats streaming purpose
  • For logging/storage, wait for onCompleteResponse() to get complete text
  • Consider CompletableFuture for bridging streaming to synchronous code

Cost Considerations:

  • Token costs are identical to non-streaming ChatModel
  • Get final token usage from ChatResponse in onCompleteResponse()
  • Can't pre-validate response length - may need to implement token limits in handler
  • Consider implementing early termination in handler to save costs on long responses

Exception Handling:

  • onError() is called for all errors during streaming
  • Common error scenarios:
    • Network timeout mid-stream
    • Connection reset by provider
    • Rate limiting after stream starts
    • Invalid content causing content filter to trigger
  • Always implement onError() - never leave it empty
  • Log errors with request context for debugging

Usage Example (Basic):

import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.chat.response.ChatResponse;

StreamingChatResponseHandler handler = new StreamingChatResponseHandler() {
    @Override
    public void onPartialResponse(String partialResponse) {
        System.out.print(partialResponse); // Print each token as it arrives
    }

    @Override
    public void onCompleteResponse(ChatResponse completeResponse) {
        System.out.println("\nComplete!");
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace();
    }
};

streamingChatModel.chat("Tell me a story", handler);

Usage Example (Production-Ready with CompletableFuture):

import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.output.FinishReason;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class StreamingChatHelper {

    public CompletableFuture<String> chatAsync(
            StreamingChatModel model,
            String message,
            int timeoutSeconds) {

        CompletableFuture<String> future = new CompletableFuture<>();

        StreamingChatResponseHandler handler = new StreamingChatResponseHandler() {
            private final StringBuilder accumulated = new StringBuilder();

            @Override
            public void onPartialResponse(String token) {
                synchronized (accumulated) {
                    accumulated.append(token);
                    // Optional: Publish progress
                    System.out.print(token);
                }
            }

            @Override
            public void onCompleteResponse(ChatResponse response) {
                // Validate finish reason
                if (response.finishReason() == FinishReason.STOP) {
                    future.complete(accumulated.toString());
                } else {
                    future.completeExceptionally(
                        new RuntimeException("Unexpected finish reason: " +
                            response.finishReason())
                    );
                }
            }

            @Override
            public void onError(Throwable error) {
                future.completeExceptionally(error);
            }
        };

        try {
            model.chat(message, handler);
        } catch (Exception e) {
            future.completeExceptionally(e);
        }

        // Apply timeout
        return future.orTimeout(timeoutSeconds, TimeUnit.SECONDS);
    }
}

Usage Example (Token-Limiting Handler):

import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import dev.langchain4j.model.chat.response.ChatResponse;
import java.util.concurrent.atomic.AtomicInteger;

public class TokenLimitingHandler implements StreamingChatResponseHandler {
    private final int maxTokens;
    private final AtomicInteger tokenCount = new AtomicInteger(0);
    private final StringBuilder result = new StringBuilder();
    private volatile boolean stopped = false;

    public TokenLimitingHandler(int maxTokens) {
        this.maxTokens = maxTokens;
    }

    @Override
    public void onPartialResponse(String token) {
        if (stopped) return;

        if (tokenCount.incrementAndGet() > maxTokens) {
            stopped = true;
            // Note: the underlying stream is not cancelled here; we simply
            // stop accumulating and ignore any further tokens
            System.err.println("Token limit exceeded, ignoring further tokens");
            return;
        }

        result.append(token);
        System.out.print(token);
    }

    @Override
    public void onCompleteResponse(ChatResponse response) {
        if (!stopped) {
            System.out.println("\nCompleted: " + response.tokenUsage());
        }
    }

    @Override
    public void onError(Throwable error) {
        System.err.println("Error during streaming: " + error.getMessage());
    }

    public String getResult() {
        return result.toString();
    }
}

Related APIs:

  • ChatModel - Non-streaming equivalent
  • StreamingChatResponseHandler - Handler interface
  • ChatRequest - Request configuration
  • ChatResponse - Final response with metadata
  • CompletableFuture - For async patterns

EmbeddingModel

Interface for converting text into vector embeddings for semantic search and similarity comparisons.

package dev.langchain4j.model.embedding;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.listener.EmbeddingModelListener;
import dev.langchain4j.model.output.Response;
import java.util.List;

/**
 * Represents a model that can convert text into embeddings (vector representations)
 */
public interface EmbeddingModel {
    /**
     * Embed a single text string
     * @param text Text to embed
     * @return Response containing the embedding
     */
    Response<Embedding> embed(String text);

    /**
     * Embed the text content of a TextSegment
     * @param textSegment Text segment to embed
     * @return Response containing the embedding
     */
    Response<Embedding> embed(TextSegment textSegment);

    /**
     * Embed multiple text segments in a batch
     * @param textSegments Text segments to embed
     * @return Response containing list of embeddings
     */
    Response<List<Embedding>> embedAll(List<TextSegment> textSegments);

    /**
     * Get the dimension of embeddings produced by this model
     * @return Embedding dimension
     */
    int dimension();

    /**
     * Get the name of the underlying embedding model
     * @return Model name or "unknown" if not provided
     */
    String modelName();

    /**
     * Add a listener for embedding operations
     * @param listener Listener to add
     * @return EmbeddingModel with listener attached
     */
    EmbeddingModel addListener(EmbeddingModelListener listener);

    /**
     * Add multiple listeners for embedding operations
     * @param listeners Listeners to add
     * @return EmbeddingModel with listeners attached
     */
    EmbeddingModel addListeners(List<EmbeddingModelListener> listeners);
}

Thread Safety:

  • EmbeddingModel implementations are generally thread-safe
  • Embedding and Response objects are immutable and safe to share
  • Use embedAll() for batch processing - more efficient than parallel embed() calls
  • Listeners may be invoked concurrently
  • Vector stores may not be thread-safe - check documentation

Common Pitfalls:

  • DON'T embed texts one-by-one when you have a batch - use embedAll() (much faster)
  • DON'T assume all embedding models have the same dimension
  • DON'T compare embeddings from different models - dimensions and semantics differ
  • DON'T forget to normalize vectors before cosine similarity (if model doesn't pre-normalize)
  • DON'T embed very short texts (< 3 words) - quality degrades significantly
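
For the normalization caveat above, a small helper can L2-normalize vectors so that a plain dot product equals cosine similarity. This is model-independent vector math:

```java
// L2-normalizes a vector so that dot product equals cosine similarity.
// Plain vector math, independent of any particular embedding model.
public class VectorNorm {
    public static float[] normalize(float[] v) {
        double norm = 0.0;
        for (float x : v) norm += x * x;
        norm = Math.sqrt(norm);
        if (norm == 0.0) return v.clone(); // zero vector: nothing to normalize
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) out[i] = (float) (v[i] / norm);
        return out;
    }

    public static double dot(float[] a, float[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) d += a[i] * b[i];
        return d;
    }
}
```

Skip this step if the provider documents that its embeddings are already unit-length.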

Edge Cases:

  • Empty string returns zero vector or error (implementation-specific)
  • Text exceeding model's token limit is truncated (usually first N tokens)
  • Special characters and emojis may be handled inconsistently
  • Texts in unsupported languages may return poor-quality embeddings
  • Null input throws IllegalArgumentException

Performance Notes:

  • embedAll() is 5-10x faster than multiple embed() calls for batches
  • Optimal batch size varies by provider (typically 50-200 items)
  • Consider caching embeddings for static documents
  • Embedding is much cheaper than LLM generation but not free
  • Local embedding models (e.g., SBERT) can be faster for high volume
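
The batching advice above reduces to a simple partition helper; the right chunk size is provider-specific (the 50-200 range above is a guideline), and each chunk is then passed to embedAll():

```java
import java.util.ArrayList;
import java.util.List;

// Splits a large input into provider-friendly chunks. Call
// embeddingModel.embedAll(...) once per chunk and concatenate the results.
public class Batcher {
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return batches;
    }
}
```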

Cost Considerations:

  • Charged per token, not per API call
  • Batch operations (embedAll()) cost the same as individual but are faster
  • Caching embeddings is critical for cost control
  • Check Response.tokenUsage() for actual token counts
  • Some providers charge by input characters, not tokens

Exception Handling:

  • Common exceptions:
    • IllegalArgumentException for null or invalid input
    • Network exceptions wrapped in RuntimeException
    • Rate limiting errors
    • Token limit exceeded errors
  • Check Response for errors even if no exception thrown
  • Some implementations return empty embeddings on error

Usage Example (Basic):

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import java.util.List;

// Embed a single text
Response<Embedding> response = embeddingModel.embed("Hello world");
Embedding embedding = response.content();
float[] vector = embedding.vector();
int dimension = embedding.dimension();

// Embed multiple texts
List<TextSegment> segments = List.of(
    TextSegment.from("First text"),
    TextSegment.from("Second text")
);
Response<List<Embedding>> multiResponse = embeddingModel.embedAll(segments);
List<Embedding> embeddings = multiResponse.content();

Usage Example (Production-Ready with Caching):

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public class CachedEmbeddingService {
    private final EmbeddingModel model;
    private final Map<String, Embedding> cache = new ConcurrentHashMap<>();
    private final int maxCacheSize;

    public CachedEmbeddingService(EmbeddingModel model, int maxCacheSize) {
        this.model = model;
        this.maxCacheSize = maxCacheSize;
    }

    public List<Embedding> embedWithCache(List<String> texts) {
        // Separate cached and uncached
        List<String> uncached = new ArrayList<>();
        List<Embedding> results = new ArrayList<>(texts.size());

        for (String text : texts) {
            Embedding cached = cache.get(text);
            if (cached != null) {
                results.add(cached);
            } else {
                uncached.add(text);
                results.add(null); // Placeholder
            }
        }

        // Batch embed uncached texts
        if (!uncached.isEmpty()) {
            try {
                List<TextSegment> segments = uncached.stream()
                    .map(TextSegment::from)
                    .collect(Collectors.toList());

                Response<List<Embedding>> response = model.embedAll(segments);
                List<Embedding> newEmbeddings = response.content();

                // Cache new embeddings
                for (int i = 0; i < uncached.size(); i++) {
                    Embedding embedding = newEmbeddings.get(i);
                    cache.put(uncached.get(i), embedding);

                    // Evict if cache too large (simple LRU would be better)
                    if (cache.size() > maxCacheSize) {
                        String firstKey = cache.keySet().iterator().next();
                        cache.remove(firstKey);
                    }
                }

                // Fill in results
                int uncachedIdx = 0;
                for (int i = 0; i < results.size(); i++) {
                    if (results.get(i) == null) {
                        results.set(i, newEmbeddings.get(uncachedIdx++));
                    }
                }
            } catch (RuntimeException e) {
                throw new RuntimeException("Failed to embed texts: " + e.getMessage(), e);
            }
        }

        return results;
    }

    public void clearCache() {
        cache.clear();
    }

    public int getCacheSize() {
        return cache.size();
    }
}

Usage Example (Similarity Search):

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SimilarityHelper {

    // Cosine similarity; normalizes internally, so inputs need not be pre-normalized
    public static double cosineSimilarity(Embedding e1, Embedding e2) {
        float[] v1 = e1.vector();
        float[] v2 = e2.vector();

        if (v1.length != v2.length) {
            throw new IllegalArgumentException("Embedding dimensions don't match");
        }

        double dotProduct = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;

        for (int i = 0; i < v1.length; i++) {
            dotProduct += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }

        return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static List<String> findMostSimilar(
            String query,
            List<String> documents,
            EmbeddingModel model,
            int topK) {

        // Embed all at once for efficiency
        List<String> allTexts = new ArrayList<>();
        allTexts.add(query);
        allTexts.addAll(documents);

        List<TextSegment> segments = allTexts.stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        List<Embedding> embeddings = model.embedAll(segments).content();
        Embedding queryEmbedding = embeddings.get(0);

        // Calculate similarities
        List<Map.Entry<String, Double>> similarities = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            double similarity = cosineSimilarity(queryEmbedding, embeddings.get(i + 1));
            similarities.add(Map.entry(documents.get(i), similarity));
        }

        // Sort and return top K
        return similarities.stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topK)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}

Related APIs:

  • Embedding - Vector representation
  • TextSegment - Structured text with metadata
  • EmbeddingStore - For persisting embeddings
  • EmbeddingModelListener - For monitoring
  • Vector databases (Pinecone, Weaviate, etc.) for large-scale storage

LanguageModel

Simple text generation interface without chat message structure. Prefer ChatModel, which offers more features.

package dev.langchain4j.model.language;

import dev.langchain4j.model.input.Prompt;
import dev.langchain4j.model.output.Response;

/**
 * Represents a language model with a simple text interface
 * ChatModel is recommended for most use cases
 */
public interface LanguageModel {
    /**
     * Generate a response to the given prompt
     * @param prompt Prompt text
     * @return Response containing generated text
     */
    Response<String> generate(String prompt);

    /**
     * Generate a response to the given prompt
     * @param prompt Prompt object
     * @return Response containing generated text
     */
    Response<String> generate(Prompt prompt);
}

Thread Safety:

  • LanguageModel implementations are typically thread-safe
  • Response objects are immutable
  • Prompt objects should not be shared between threads during modification

Common Pitfalls:

  • DON'T use LanguageModel for chat applications - use ChatModel instead
  • DON'T expect message history management - LanguageModel is stateless
  • DON'T use for function calling or structured outputs - ChatModel is required
  • Limited metadata compared to ChatModel

Edge Cases:

  • Very long prompts may be truncated silently
  • Empty prompts may return empty responses or errors
  • No way to specify system messages or roles

Performance Notes:

  • Slightly simpler than ChatModel but offers fewer features
  • No performance advantage over ChatModel
  • Consider ChatModel for all new code

Cost Considerations:

  • Token costs identical to ChatModel
  • Less control over token usage due to simpler interface

Exception Handling:

  • Same exceptions as ChatModel
  • Limited error information in responses
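
Usage Example (Basic):

A minimal sketch; `languageModel` is assumed to be an already-built implementation.

```java
import dev.langchain4j.model.language.LanguageModel;
import dev.langchain4j.model.output.Response;

// Plain text in, plain text out - no roles, no message history
Response<String> response = languageModel.generate("Write a haiku about Java.");
String text = response.content();

// Token usage is still available on the Response
System.out.println("Tokens used: " + response.tokenUsage());
```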

Related APIs:

  • ChatModel - Recommended alternative
  • StreamingLanguageModel - Streaming version
  • Prompt - Prompt template support

Model Capabilities

Enum representing capabilities that models can support.

package dev.langchain4j.model.chat;

/**
 * Represents a capability of a ChatModel or StreamingChatModel
 * Used by low-level APIs to communicate supported features to high-level APIs
 */
public enum Capability {
    /**
     * Indicates model supports responding in JSON format according to a specified JSON schema
     */
    RESPONSE_FORMAT_JSON_SCHEMA
}

Thread Safety:

  • Enum is inherently thread-safe

Common Pitfalls:

  • DON'T assume all models support all capabilities
  • Always check supportedCapabilities() before using advanced features

Usage Example:

import dev.langchain4j.model.chat.Capability;

public class CapabilityChecker {

    public static boolean supportsJsonSchema(ChatModel model) {
        return model.supportedCapabilities()
            .contains(Capability.RESPONSE_FORMAT_JSON_SCHEMA);
    }

    public static void useJsonSchemaIfSupported(ChatModel model) {
        if (supportsJsonSchema(model)) {
            // Use structured output
            System.out.println("Using JSON schema mode");
        } else {
            // Fallback to text parsing
            System.out.println("Falling back to text parsing");
        }
    }
}

Related APIs:

  • ChatModel.supportedCapabilities() - Check model capabilities
  • ChatRequestParameters - Configure request based on capabilities

StreamingLanguageModel

Simple streaming text generation interface. Prefer StreamingChatModel, which offers more features.

package dev.langchain4j.model.language;

import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.input.Prompt;

/**
 * Represents a language model with streaming text generation
 * StreamingChatModel is recommended for most use cases
 */
public interface StreamingLanguageModel {
    /**
     * Stream a response to the given prompt
     * @param prompt Prompt text
     * @param handler Handler for streaming response
     */
    void generate(String prompt, StreamingResponseHandler<String> handler);

    /**
     * Stream a response to the given prompt
     * @param prompt Prompt object
     * @param handler Handler for streaming response
     */
    void generate(Prompt prompt, StreamingResponseHandler<String> handler);
}

Thread Safety:

  • Same considerations as StreamingChatModel
  • Handler callbacks must be thread-safe

Common Pitfalls:

  • DON'T use for chat - use StreamingChatModel
  • Limited features compared to StreamingChatModel
  • No message history support

Performance Notes:

  • No advantage over StreamingChatModel
  • Use StreamingChatModel for new code
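
Usage Example (Basic):

A minimal sketch; `streamingLanguageModel` is assumed to be an already-built implementation, and the handler follows the StreamingResponseHandler callback interface.

```java
import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.output.Response;

streamingLanguageModel.generate("Tell me a joke", new StreamingResponseHandler<String>() {
    @Override
    public void onNext(String token) {
        System.out.print(token); // print tokens as they arrive
    }

    @Override
    public void onComplete(Response<String> response) {
        System.out.println("\nDone: " + response.tokenUsage());
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace(); // never leave this empty
    }
});
```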

Related APIs:

  • StreamingChatModel - Recommended alternative
  • StreamingResponseHandler - Handler interface

ModerationModel

Interface for content moderation to detect harmful, unsafe, or policy-violating content.

package dev.langchain4j.model.moderation;

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.input.Prompt;
import dev.langchain4j.model.output.Response;
import java.util.List;

/**
 * Represents a model that can moderate text content
 * Used for detecting harmful, unsafe, or policy-violating content
 */
public interface ModerationModel {
    /**
     * Moderate text
     * @param text Text to moderate
     * @return Moderation response
     */
    Response<Moderation> moderate(String text);

    /**
     * Moderate prompt
     * @param prompt Prompt to moderate
     * @return Moderation response
     */
    Response<Moderation> moderate(Prompt prompt);

    /**
     * Moderate single chat message
     * @param message Chat message to moderate
     * @return Moderation response
     */
    Response<Moderation> moderate(ChatMessage message);

    /**
     * Moderate list of chat messages
     * @param messages Chat messages to moderate
     * @return Moderation response
     */
    Response<Moderation> moderate(List<ChatMessage> messages);

    /**
     * Moderate text segment
     * @param textSegment Text segment to moderate
     * @return Moderation response
     */
    Response<Moderation> moderate(TextSegment textSegment);
}

Thread Safety:

  • ModerationModel implementations are thread-safe
  • Moderation and Response objects are immutable
  • Can be shared across threads

Common Pitfalls:

  • DON'T skip moderation for user-generated content in production
  • DON'T assume moderation is instant - it adds latency
  • DON'T forget to log moderation decisions for compliance
  • DON'T rely solely on moderation - implement application-level controls too
  • Be aware of false positives - have a human review process

Edge Cases:

  • Empty content may return no flags or error
  • Mixed safe/unsafe content in list may flag entire batch
  • Code snippets may trigger false positives
  • Different languages may have different sensitivity
  • Context matters - moderation may lack context awareness

Performance Notes:

  • Adds latency to request pipeline (typically 100-300ms)
  • Consider async moderation for non-blocking flows
  • Cache moderation results for repeated content
  • Batch moderation when possible
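
The result-caching advice above can be sketched as a content-hash memo. To keep the sketch self-contained, the moderation call is abstracted as a Function<String, Boolean> (true = flagged); a real implementation would delegate to ModerationModel.moderate(...) instead:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CachedModerator {
    private final Function<String, Boolean> moderate; // true = flagged
    private final ConcurrentHashMap<String, Boolean> cache = new ConcurrentHashMap<>();

    public CachedModerator(Function<String, Boolean> moderate) {
        this.moderate = moderate;
    }

    public boolean isFlagged(String content) {
        // Key by content hash so large inputs don't bloat the cache keys;
        // computeIfAbsent calls the underlying moderator at most once per content
        return cache.computeIfAbsent(sha256(content), k -> moderate.apply(content));
    }

    private static String sha256(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

Note that caching a verdict is only appropriate when moderation policy is stable; invalidate the cache when the policy or model changes.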

Cost Considerations:

  • Usually charged per API call, not tokens
  • Significantly cheaper than LLM generation
  • Consider cost vs. risk tradeoff for your use case

Exception Handling:

  • Network errors wrapped in RuntimeException
  • Invalid input may throw IllegalArgumentException
  • Rate limiting applies
  • Fail open (allow) vs fail closed (block) policy decision needed

Usage Example (Basic):

import dev.langchain4j.model.moderation.ModerationModel;
import dev.langchain4j.model.moderation.Moderation;
import dev.langchain4j.model.output.Response;

Response<Moderation> response = moderationModel.moderate("Some text to check");
Moderation moderation = response.content();

if (moderation.flagged()) {
    System.out.println("Content flagged: " + moderation.flaggedText());
}

Usage Example (Production-Ready with Detailed Checks):

import dev.langchain4j.model.moderation.ModerationModel;
import dev.langchain4j.model.moderation.Moderation;
import dev.langchain4j.model.output.Response;
import java.util.logging.Logger;

public class ContentModerator {
    private static final Logger log = Logger.getLogger(ContentModerator.class.getName());
    private final ModerationModel moderationModel;

    public ContentModerator(ModerationModel moderationModel) {
        this.moderationModel = moderationModel;
    }

    public ModerationResult moderateUserInput(String userInput) {
        try {
            Response<Moderation> response = moderationModel.moderate(userInput);
            Moderation moderation = response.content();

            if (moderation.flagged()) {
                // Log for compliance and review
                log.warning("Content flagged: categories=" +
                    moderation.flaggedCategories() +
                    ", user input length=" + userInput.length());

                return ModerationResult.blocked(
                    "Your message contains content that violates our policies: " +
                    moderation.flaggedCategories()
                );
            }

            return ModerationResult.allowed();

        } catch (RuntimeException e) {
            // Fail open: allow content but log error
            log.severe("Moderation service error: " + e.getMessage());
            return ModerationResult.allowed(); // Or fail closed with .blocked()
        }
    }

    public static class ModerationResult {
        private final boolean allowed;
        private final String reason;

        private ModerationResult(boolean allowed, String reason) {
            this.allowed = allowed;
            this.reason = reason;
        }

        public static ModerationResult allowed() {
            return new ModerationResult(true, null);
        }

        public static ModerationResult blocked(String reason) {
            return new ModerationResult(false, reason);
        }

        public boolean isAllowed() { return allowed; }
        public String getReason() { return reason; }
    }
}

Usage Example (Async Moderation for Non-Blocking):

import dev.langchain4j.model.moderation.Moderation;
import dev.langchain4j.model.moderation.ModerationModel;
import dev.langchain4j.model.output.Response;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncModerator {
    private final ModerationModel moderationModel;
    private final ExecutorService executor = Executors.newFixedThreadPool(5);

    public AsyncModerator(ModerationModel moderationModel) {
        this.moderationModel = moderationModel;
    }

    public CompletableFuture<Boolean> moderateAsync(String content) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Response<Moderation> response = moderationModel.moderate(content);
                return !response.content().flagged();
            } catch (Exception e) {
                // Log and fail open
                return true;
            }
        }, executor);
    }

    public void shutdown() {
        executor.shutdown();
    }
}

Related APIs:

  • Moderation - Moderation result with categories
  • ChatMessage - For moderating messages
  • Response - Wrapper with metadata

ScoringModel

Interface for scoring/reranking text segments against a query. Useful for re-ranking retrieved documents.

package dev.langchain4j.model.scoring;

import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import java.util.List;

/**
 * Represents a model capable of scoring text against a query
 * Useful for re-ranking retrieved documents by relevance
 */
public interface ScoringModel {
    /**
     * Score a single text against query
     * @param text Text to score
     * @param query Query to score against
     * @return Response containing relevance score
     */
    Response<Double> score(String text, String query);

    /**
     * Score a single segment against query
     * @param segment Text segment to score
     * @param query Query to score against
     * @return Response containing relevance score
     */
    Response<Double> score(TextSegment segment, String query);

    /**
     * Score multiple segments against query
     * @param segments Text segments to score
     * @param query Query to score against
     * @return Response containing list of scores (same order as input)
     */
    Response<List<Double>> scoreAll(List<TextSegment> segments, String query);
}

Thread Safety:

  • ScoringModel implementations are thread-safe
  • Response and score objects are immutable
  • Safe to use from multiple threads

Common Pitfalls:

  • DON'T assume scores are normalized (range varies by model)
  • DON'T compare raw scores across different models
  • DON'T use scoring as first-stage retrieval - use embeddings first
  • DON'T forget that scoring is relatively expensive
  • Optimal for reranking top 50-100 results, not thousands

Edge Cases:

  • Empty query may return zero scores or error
  • Empty documents return low scores
  • Very long documents may be truncated
  • Score ranges vary: some models use [0,1], others [-inf, +inf]
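
Because raw score ranges differ between models, it can help to min-max normalize scores within a batch before applying a relevance threshold. A minimal sketch; the rescaling into [0, 1] is an application-level convention, not part of the ScoringModel API:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ScoreNormalizer {
    /**
     * Rescales a batch of raw scores into [0, 1].
     * Returns 0.5 for every element when all scores are equal (flat batch).
     */
    public static List<Double> minMaxNormalize(List<Double> scores) {
        double min = scores.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
        double max = scores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
        double range = max - min;
        return scores.stream()
            .map(s -> range == 0.0 ? 0.5 : (s - min) / range)
            .collect(Collectors.toList());
    }
}
```

Normalized scores are only comparable within the same batch; they still must not be compared across models or queries.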

Performance Notes:

  • Much slower than embedding similarity (10-100x)
  • Use for reranking top K results only (K < 100)
  • Batch scoring with scoreAll() is more efficient
  • Consider caching scores for static query-document pairs

Cost Considerations:

  • Typically charged per scoring operation
  • More expensive than embeddings, less than LLM generation
  • Batch operations may have volume discounts
  • Reranking 100 documents can cost as much as 1 LLM call

Exception Handling:

  • Network errors wrapped in RuntimeException
  • Invalid inputs throw IllegalArgumentException
  • Rate limiting applies
  • Timeouts for large batches

Usage Example (Basic):

import dev.langchain4j.model.scoring.ScoringModel;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import java.util.List;

// Score multiple documents for re-ranking
List<TextSegment> documents = List.of(
    TextSegment.from("Document about Java programming"),
    TextSegment.from("Document about Python"),
    TextSegment.from("Document about web development")
);

String query = "How to program in Java?";
Response<List<Double>> scores = scoringModel.scoreAll(documents, query);

// Use scores to re-rank documents

Usage Example (Production-Ready Reranking Pipeline):

import dev.langchain4j.model.scoring.ScoringModel;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import java.util.*;
import java.util.stream.Collectors;

public class RerankerService {
    private final ScoringModel scoringModel;
    private final int maxRerank;

    public RerankerService(ScoringModel scoringModel, int maxRerank) {
        this.scoringModel = scoringModel;
        this.maxRerank = maxRerank;
    }

    public List<ScoredDocument> rerank(String query, List<String> candidates) {
        // Limit candidates to maxRerank for performance
        List<String> toRerank = candidates.stream()
            .limit(maxRerank)
            .collect(Collectors.toList());

        try {
            List<TextSegment> segments = toRerank.stream()
                .map(TextSegment::from)
                .collect(Collectors.toList());

            Response<List<Double>> response = scoringModel.scoreAll(segments, query);
            List<Double> scores = response.content();

            // Combine documents with scores
            List<ScoredDocument> scored = new ArrayList<>();
            for (int i = 0; i < toRerank.size(); i++) {
                scored.add(new ScoredDocument(toRerank.get(i), scores.get(i)));
            }

            // Sort by score descending
            scored.sort(Comparator.comparingDouble(ScoredDocument::getScore).reversed());

            return scored;

        } catch (RuntimeException e) {
            // Fallback: return original order
            return toRerank.stream()
                .map(doc -> new ScoredDocument(doc, 0.0))
                .collect(Collectors.toList());
        }
    }

    public static class ScoredDocument {
        private final String document;
        private final double score;

        public ScoredDocument(String document, double score) {
            this.document = document;
            this.score = score;
        }

        public String getDocument() { return document; }
        public double getScore() { return score; }
    }
}

Usage Example (Two-Stage Retrieval with Reranking):

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.scoring.ScoringModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import java.util.*;
import java.util.stream.Collectors;

public class TwoStageRetriever {
    private final EmbeddingModel embeddingModel;
    private final ScoringModel scoringModel;
    private final List<String> documentCorpus;
    private final List<Embedding> documentEmbeddings;

    public TwoStageRetriever(
            EmbeddingModel embeddingModel,
            ScoringModel scoringModel,
            List<String> documentCorpus) {
        this.embeddingModel = embeddingModel;
        this.scoringModel = scoringModel;
        this.documentCorpus = documentCorpus;

        // Pre-compute embeddings
        List<TextSegment> segments = documentCorpus.stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());
        this.documentEmbeddings = embeddingModel.embedAll(segments).content();
    }

    public List<String> retrieve(String query, int topK) {
        // Stage 1: Fast embedding-based retrieval
        Embedding queryEmbedding = embeddingModel.embed(query).content();
        List<ScoredDoc> candidates = new ArrayList<>();

        for (int i = 0; i < documentCorpus.size(); i++) {
            double similarity = cosineSimilarity(
                queryEmbedding,
                documentEmbeddings.get(i)
            );
            candidates.add(new ScoredDoc(i, documentCorpus.get(i), similarity));
        }

        // Get top 50 by embedding similarity
        List<String> topCandidates = candidates.stream()
            .sorted(Comparator.comparingDouble(ScoredDoc::getScore).reversed())
            .limit(50)
            .map(ScoredDoc::getDoc)
            .collect(Collectors.toList());

        // Stage 2: Accurate reranking with scoring model
        List<TextSegment> toRerank = topCandidates.stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        List<Double> scores = scoringModel.scoreAll(toRerank, query).content();

        List<ScoredDoc> reranked = new ArrayList<>();
        for (int i = 0; i < topCandidates.size(); i++) {
            reranked.add(new ScoredDoc(i, topCandidates.get(i), scores.get(i)));
        }

        return reranked.stream()
            .sorted(Comparator.comparingDouble(ScoredDoc::getScore).reversed())
            .limit(topK)
            .map(ScoredDoc::getDoc)
            .collect(Collectors.toList());
    }

    private static class ScoredDoc {
        int index;
        String doc;
        double score;
        ScoredDoc(int index, String doc, double score) {
            this.index = index;
            this.doc = doc;
            this.score = score;
        }
        String getDoc() { return doc; }
        double getScore() { return score; }
    }

    private double cosineSimilarity(Embedding e1, Embedding e2) {
        float[] v1 = e1.vector();
        float[] v2 = e2.vector();
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
}

Related APIs:

  • EmbeddingModel - For first-stage retrieval
  • TextSegment - Document representation
  • Response - Score wrapper

ImageModel

Interface for generating images from text prompts.

package dev.langchain4j.model.image;

import dev.langchain4j.data.image.Image;
import dev.langchain4j.model.output.Response;

/**
 * Represents a model that can generate images from text prompts
 */
public interface ImageModel {
    /**
     * Generate image from prompt
     * @param prompt Text prompt describing desired image
     * @return Response containing generated image
     */
    Response<Image> generate(String prompt);
}

Thread Safety:

  • ImageModel implementations are thread-safe
  • Image and Response objects are immutable

Common Pitfalls:

  • DON'T expect instant results - image generation is slow (5-30 seconds)
  • DON'T forget to handle timeouts appropriately
  • DON'T skip content safety checks on generated images
  • Prompt engineering for images requires different skills than text
  • Generated images may not match expectations

Edge Cases:

  • Empty prompts may return error or random image
  • Very long prompts may be truncated
  • Some requests may be filtered for policy violations
  • Network failures mid-generation lose all progress

Performance Notes:

  • Very slow compared to text generation (10-30 seconds typical)
  • Consider async/background processing
  • No batching support in most implementations
  • High memory usage for high-resolution images

Cost Considerations:

  • Very expensive compared to text generation
  • Typically charged per image, not tokens
  • Higher resolution = higher cost
  • Failed generations may still incur charges

Exception Handling:

  • Long timeouts needed (30-60 seconds)
  • Content policy violations throw exceptions
  • Network errors common due to long generation time
  • Rate limiting strictly enforced

Usage Example (Basic):

import dev.langchain4j.model.image.ImageModel;
import dev.langchain4j.data.image.Image;
import dev.langchain4j.model.output.Response;

Response<Image> response = imageModel.generate(
    "A serene landscape with mountains and a lake"
);
Image image = response.content();
String url = image.url().toString(); // url() returns a java.net.URI

Usage Example (Production-Ready with Async):

import dev.langchain4j.model.image.ImageModel;
import dev.langchain4j.data.image.Image;
import dev.langchain4j.model.output.Response;
import java.util.concurrent.*;

public class AsyncImageGenerator {
    private final ImageModel imageModel;
    private final ExecutorService executor = Executors.newFixedThreadPool(3);

    public AsyncImageGenerator(ImageModel imageModel) {
        this.imageModel = imageModel;
    }

    public CompletableFuture<Image> generateAsync(String prompt) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Response<Image> response = imageModel.generate(prompt);
                return response.content();
            } catch (RuntimeException e) {
                throw new CompletionException(
                    "Image generation failed: " + e.getMessage(), e
                );
            }
        }, executor).orTimeout(60, TimeUnit.SECONDS);
    }

    public void shutdown() {
        executor.shutdown();
    }
}

Related APIs:

  • Image - Generated image data
  • Response - Wrapper with metadata

TokenCountEstimator

Interface for estimating token counts without making API calls to the model.

package dev.langchain4j.model;

import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.agent.tool.ToolSpecification;
import java.util.List;

/**
 * Interface for estimating token counts
 * Useful for staying within model token limits
 */
public interface TokenCountEstimator {
    /**
     * Estimate tokens in text
     * @param text Text to estimate
     * @return Estimated token count
     */
    int estimateTokenCount(String text);

    /**
     * Estimate tokens in messages
     * @param messages Chat messages to estimate
     * @return Estimated token count
     */
    int estimateTokenCount(List<ChatMessage> messages);

    /**
     * Estimate tokens in messages and tools
     * @param messages Chat messages to estimate
     * @param toolSpecifications Tool specifications
     * @return Estimated token count
     */
    int estimateTokenCount(List<ChatMessage> messages,
                           List<ToolSpecification> toolSpecifications);
}

Thread Safety:

  • TokenCountEstimator implementations are thread-safe
  • Estimation is typically a pure function (no shared state)

Common Pitfalls:

  • DON'T assume estimates are 100% accurate (typically ±5-10%)
  • DON'T use estimates from one model for a different model
  • DON'T forget that output tokens also count (can't be estimated beforehand)
  • Response format overhead (JSON, function calls) adds tokens

Edge Cases:

  • Special characters and emojis count differently
  • Different tokenizers (GPT-3.5 vs GPT-4 vs Claude) produce different counts
  • Tool specifications add significant token overhead
  • Empty strings return 0

Performance Notes:

  • Very fast (microseconds) - no network calls
  • Safe to call frequently
  • Consider caching if estimating the same text repeatedly
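
The caching suggestion above can be sketched as a memoizing wrapper. The underlying estimator is abstracted here as a ToIntFunction<String> so the sketch stays self-contained; a real implementation would delegate to TokenCountEstimator.estimateTokenCount(String):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

public class CachingTokenEstimator {
    private final ToIntFunction<String> delegate;
    private final ConcurrentHashMap<String, Integer> cache = new ConcurrentHashMap<>();

    public CachingTokenEstimator(ToIntFunction<String> delegate) {
        this.delegate = delegate;
    }

    public int estimate(String text) {
        // computeIfAbsent ensures the delegate runs at most once per distinct text
        return cache.computeIfAbsent(text, delegate::applyAsInt);
    }
}
```

Since estimation is already fast, this only pays off when the same large texts (system prompts, templates) are estimated repeatedly.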

Cost Considerations:

  • Free operation - no API calls
  • Critical for cost control in production
  • Use to implement token budgets
  • Estimate before expensive operations

Exception Handling:

  • Rarely throws exceptions
  • Null inputs may throw IllegalArgumentException
  • Very large inputs may cause memory issues (rare)

Usage Example (Basic):

import dev.langchain4j.model.TokenCountEstimator;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import java.util.List;

// Many provider implementations also implement TokenCountEstimator;
// verify with instanceof before casting
TokenCountEstimator estimator = (TokenCountEstimator) chatModel;

int textTokens = estimator.estimateTokenCount("Hello, world!");

List<ChatMessage> messages = List.of(
    UserMessage.from("What is AI?")
);
int messageTokens = estimator.estimateTokenCount(messages);

Usage Example (Token Budget Management):

import dev.langchain4j.model.TokenCountEstimator;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.data.message.SystemMessage;
import java.util.ArrayList;
import java.util.List;

public class TokenBudgetManager {
    private final TokenCountEstimator estimator;
    private final int maxContextTokens;
    private final int reservedForResponse;

    public TokenBudgetManager(
            TokenCountEstimator estimator,
            int maxContextTokens,
            int reservedForResponse) {
        this.estimator = estimator;
        this.maxContextTokens = maxContextTokens;
        this.reservedForResponse = reservedForResponse;
    }

    public List<ChatMessage> fitToContext(
            SystemMessage systemMessage,
            List<ChatMessage> history,
            UserMessage currentMessage) {

        List<ChatMessage> result = new ArrayList<>();
        result.add(systemMessage);

        int availableTokens = maxContextTokens - reservedForResponse;
        int systemTokens = estimator.estimateTokenCount(
            List.of(systemMessage)
        );
        int currentTokens = estimator.estimateTokenCount(
            List.of(currentMessage)
        );

        availableTokens -= (systemTokens + currentTokens);

        if (availableTokens < 0) {
            throw new IllegalArgumentException(
                "System message and current message exceed token budget"
            );
        }

        // Add history from most recent, working backwards
        for (int i = history.size() - 1; i >= 0; i--) {
            ChatMessage message = history.get(i);
            int messageTokens = estimator.estimateTokenCount(List.of(message));

            if (messageTokens <= availableTokens) {
                result.add(1, message); // Insert after system message
                availableTokens -= messageTokens;
            } else {
                break; // No more room
            }
        }

        result.add(currentMessage);
        return result;
    }

    public int estimateRemainingBudget(List<ChatMessage> messages) {
        int used = estimator.estimateTokenCount(messages);
        return maxContextTokens - used - reservedForResponse;
    }
}

Usage Example (Pre-validation):

import dev.langchain4j.model.TokenCountEstimator;
import dev.langchain4j.data.message.ChatMessage;
import java.util.List;

public class TokenValidator {
    private final TokenCountEstimator estimator;
    private final int maxInputTokens;

    public TokenValidator(TokenCountEstimator estimator, int maxInputTokens) {
        this.estimator = estimator;
        this.maxInputTokens = maxInputTokens;
    }

    public void validateBeforeCall(List<ChatMessage> messages) {
        int estimatedTokens = estimator.estimateTokenCount(messages);

        if (estimatedTokens > maxInputTokens) {
            throw new IllegalArgumentException(
                String.format(
                    "Message too long: %d tokens (max %d)",
                    estimatedTokens,
                    maxInputTokens
                )
            );
        }
    }
}

Related APIs:

  • ChatModel - Often implements TokenCountEstimator
  • ChatMessage - Messages to estimate
  • ToolSpecification - Tools add token overhead

ModelProvider

Enum identifying the provider of a model.

package dev.langchain4j.model;

/**
 * Identifies the provider of a model
 */
public enum ModelProvider {
    OPENAI,
    ANTHROPIC,
    GOOGLE,
    AZURE,
    OLLAMA,
    // ... and other providers
    OTHER
}

Thread Safety:

  • Enum is inherently thread-safe

Usage Example:

import dev.langchain4j.model.ModelProvider;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.StreamingChatModel;

public class ProviderSpecificLogic {

    public int getDefaultTimeout(ChatModel model) {
        ModelProvider provider = model.provider();

        switch (provider) {
            case OPENAI:
                return 30;
            case ANTHROPIC:
                return 60;
            case OLLAMA:
                return 120; // Local, may be slower
            default:
                return 45;
        }
    }

    public boolean supportsStreaming(ChatModel model) {
        // Check if model is also a StreamingChatModel
        return model instanceof StreamingChatModel;
    }
}

Related APIs:

  • ChatModel.provider() - Get model provider
  • Provider-specific model implementations

Testing Patterns

Mocking ChatModel

import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.model.output.FinishReason;
import dev.langchain4j.model.output.TokenUsage;
import org.mockito.Mockito;

public class ChatModelTestHelper {

    public static ChatModel createMockChatModel(String response) {
        ChatModel mock = Mockito.mock(ChatModel.class);

        ChatResponse chatResponse = ChatResponse.builder()
            .aiMessage(AiMessage.from(response))
            .finishReason(FinishReason.STOP)
            .tokenUsage(new TokenUsage(10, 20))
            .build();

        Mockito.when(mock.chat(Mockito.anyString()))
            .thenReturn(response);

        Mockito.when(mock.chat(Mockito.any(ChatRequest.class)))
            .thenReturn(chatResponse);

        return mock;
    }

    public static ChatModel createErrorMock() {
        ChatModel mock = Mockito.mock(ChatModel.class);

        Mockito.when(mock.chat(Mockito.anyString()))
            .thenThrow(new RuntimeException("API Error"));

        return mock;
    }
}

Testing with In-Memory Models

import dev.langchain4j.model.chat.ChatModel;

public class TestableService {
    private final ChatModel chatModel;

    public TestableService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    // Business logic here

    // In tests:
    public static void testWithMock() {
        ChatModel testModel = ChatModelTestHelper.createMockChatModel(
            "Test response"
        );
        TestableService service = new TestableService(testModel);
        // Test service...
    }
}

Testing Streaming Models

import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.chat.response.StreamingChatResponseHandler;
import java.util.concurrent.CompletableFuture;

public class StreamingTestHelper {

    public static CompletableFuture<String> captureStreamingResponse(
            StreamingChatModel model,
            String message) {

        CompletableFuture<String> future = new CompletableFuture<>();
        StringBuilder captured = new StringBuilder();

        StreamingChatResponseHandler handler = new StreamingChatResponseHandler() {
            @Override
            public void onPartialResponse(String token) {
                captured.append(token);
            }

            @Override
            public void onCompleteResponse(ChatResponse response) {
                future.complete(captured.toString());
            }

            @Override
            public void onError(Throwable error) {
                future.completeExceptionally(error);
            }
        };

        model.chat(message, handler);
        return future;
    }
}

Testing EmbeddingModel

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.output.Response;
import org.mockito.Mockito;

public class EmbeddingModelTestHelper {

    public static EmbeddingModel createMockEmbedding(int dimension) {
        EmbeddingModel mock = Mockito.mock(EmbeddingModel.class);

        Mockito.when(mock.dimension()).thenReturn(dimension);

        Mockito.when(mock.embed(Mockito.anyString()))
            .thenAnswer(invocation -> {
                float[] vector = new float[dimension];
                for (int i = 0; i < dimension; i++) {
                    vector[i] = (float) Math.random();
                }
                return Response.from(new Embedding(vector));
            });

        return mock;
    }
}

Error Recovery Patterns

Retry with Exponential Backoff

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class RetryHelper {

    public static <T> T retryWithBackoff(
            Supplier<T> operation,
            int maxRetries,
            long initialDelayMs) {

        Exception lastException = null;

        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                return operation.get();
            } catch (Exception e) {
                lastException = e;

                if (attempt < maxRetries - 1) {
                    long delay = initialDelayMs * (long) Math.pow(2, attempt);
                    try {
                        TimeUnit.MILLISECONDS.sleep(delay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new RuntimeException("Interrupted during retry", ie);
                    }
                }
            }
        }

        throw new RuntimeException(
            "Operation failed after " + maxRetries + " attempts",
            lastException
        );
    }
}

Circuit Breaker Pattern

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class CircuitBreaker {
    private final int failureThreshold;
    private final long timeoutMs;
    private final AtomicInteger failureCount = new AtomicInteger(0);
    private final AtomicLong lastFailureTime = new AtomicLong(0);
    private volatile boolean open = false;

    public CircuitBreaker(int failureThreshold, long timeoutMs) {
        this.failureThreshold = failureThreshold;
        this.timeoutMs = timeoutMs;
    }

    public <T> T execute(Supplier<T> operation) {
        if (open) {
            long elapsed = System.currentTimeMillis() - lastFailureTime.get();
            if (elapsed < timeoutMs) {
                throw new RuntimeException("Circuit breaker is OPEN");
            } else {
                // Try half-open state
                open = false;
                failureCount.set(0);
            }
        }

        try {
            T result = operation.get();
            failureCount.set(0); // Reset on success
            return result;
        } catch (Exception e) {
            failureCount.incrementAndGet();
            lastFailureTime.set(System.currentTimeMillis());

            if (failureCount.get() >= failureThreshold) {
                open = true;
            }

            throw e;
        }
    }

    public boolean isOpen() {
        return open;
    }
}

Fallback Pattern

import dev.langchain4j.model.chat.ChatModel;

public class FallbackChatService {
    private final ChatModel primaryModel;
    private final ChatModel fallbackModel;

    public FallbackChatService(ChatModel primaryModel, ChatModel fallbackModel) {
        this.primaryModel = primaryModel;
        this.fallbackModel = fallbackModel;
    }

    public String chatWithFallback(String message) {
        try {
            return primaryModel.chat(message);
        } catch (RuntimeException e) {
            System.err.println("Primary model failed, using fallback: " +
                e.getMessage());
            try {
                return fallbackModel.chat(message);
            } catch (RuntimeException fallbackError) {
                throw new RuntimeException(
                    "Both primary and fallback models failed",
                    fallbackError
                );
            }
        }
    }
}

Rate Limiting Pattern

import dev.langchain4j.model.chat.ChatModel;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class RateLimitedChatModel {
    private final ChatModel delegate;
    private final Semaphore rateLimiter;

    public RateLimitedChatModel(ChatModel delegate, int maxConcurrent) {
        this.delegate = delegate;
        this.rateLimiter = new Semaphore(maxConcurrent);
    }

    public String chat(String message, long timeoutSeconds) throws InterruptedException {
        if (!rateLimiter.tryAcquire(timeoutSeconds, TimeUnit.SECONDS)) {
            throw new RuntimeException("Rate limit timeout");
        }

        try {
            return delegate.chat(message);
        } finally {
            rateLimiter.release();
        }
    }
}

Rate Limiting Guidance

Provider-Specific Rate Limits

  • OpenAI: Tier-dependent; historically around 3,500 requests/minute for GPT-4 and 10,000 for GPT-3.5
  • Anthropic: Varies by plan; typically around 1,000 requests/minute
  • Azure OpenAI: Configurable per deployment; default 240 requests/minute
  • Google: Varies by model; check your quotas in the Google Cloud console
  • Ollama: No provider-imposed rate limits (runs locally)
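
When sizing a client-side limiter against per-minute quotas like those above, convert the RPM figure to a per-second rate and leave headroom below the published limit, since other clients may share the same API key. A minimal helper (the quota figure and 80% safety factor in the example are illustrative):

```java
public class RateLimitSizing {

    // Convert a provider's requests-per-minute quota into a
    // requests-per-second rate, scaled down by a safety factor so
    // bursts do not trip 429 responses at the provider.
    public static double permitsPerSecond(int requestsPerMinute, double safetyFactor) {
        return (requestsPerMinute / 60.0) * safetyFactor;
    }

    public static void main(String[] args) {
        // e.g. a 3,500 RPM quota used at 80% capacity
        System.out.println(permitsPerSecond(3_500, 0.8)); // ~46.7 permits/sec
    }
}
```

The resulting value can be fed directly into a permits-per-second limiter such as the Guava RateLimiter shown below.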

Implementing Rate Limiting

import com.google.common.util.concurrent.RateLimiter; // Guava (marked @Beta)
import dev.langchain4j.model.chat.ChatModel;

public class RateLimitedService {
    private final RateLimiter rateLimiter;
    private final ChatModel chatModel;

    public RateLimitedService(ChatModel chatModel, double requestsPerSecond) {
        this.chatModel = chatModel;
        this.rateLimiter = RateLimiter.create(requestsPerSecond);
    }

    public String chat(String message) {
        rateLimiter.acquire(); // Blocks until permit available
        return chatModel.chat(message);
    }
}

Handling 429 Rate Limit Errors

import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class RateLimitHandler {

    public static <T> T handleRateLimits(Supplier<T> operation) {
        int maxRetries = 5;

        for (int i = 0; i < maxRetries; i++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                if (isRateLimitError(e) && i < maxRetries - 1) {
                    long waitTime = calculateWaitTime(e, i);
                    try {
                        TimeUnit.MILLISECONDS.sleep(waitTime);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new RuntimeException("Interrupted", ie);
                    }
                } else {
                    throw e;
                }
            }
        }

        throw new RuntimeException("Should not reach here");
    }

    private static boolean isRateLimitError(RuntimeException e) {
        return e.getMessage() != null &&
            (e.getMessage().contains("429") ||
             e.getMessage().contains("rate limit"));
    }

    private static long calculateWaitTime(RuntimeException e, int attempt) {
        // A production version would parse the provider's Retry-After
        // value out of the exception; here we default to exponential
        // backoff: 1s, 2s, 4s, 8s, 16s.
        return 1000L * (long) Math.pow(2, attempt);
    }
}
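
Plain exponential backoff can synchronize retries across many clients that were all rate-limited at the same moment, producing repeated thundering herds. A sketch of the "full jitter" variant, which randomizes the wait up to a capped exponential ceiling (the class and method names here are my own):

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {

    // Full-jitter backoff: pick a uniform random wait in
    // [0, min(cap, base * 2^attempt)]. Randomizing the whole interval
    // spreads retries out so clients do not retry in lockstep.
    public static long waitMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = Math.min(capMillis, baseMillis * (1L << attempt));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

Swapping this into calculateWaitTime above keeps the same retry loop while avoiding synchronized retry bursts.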

Model Capability Checking Patterns

Dynamic Feature Detection

import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.Capability;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.TokenCountEstimator;
import dev.langchain4j.model.chat.request.ChatRequest;
import java.util.List;

public class FeatureDetector {

    public static class ModelCapabilities {
        public final boolean supportsJsonSchema;
        public final boolean supportsStreaming;
        public final boolean supportsTokenEstimation;
        public final boolean supportsFunctionCalling;

        public ModelCapabilities(ChatModel model) {
            this.supportsJsonSchema = model.supportedCapabilities()
                .contains(Capability.RESPONSE_FORMAT_JSON_SCHEMA);
            this.supportsStreaming = model instanceof StreamingChatModel;
            this.supportsTokenEstimation = model instanceof TokenCountEstimator;
            this.supportsFunctionCalling = checkFunctionCalling(model);
        }

        private boolean checkFunctionCalling(ChatModel model) {
            // Probe by sending a request that includes tool specifications.
            // Note: this issues a real (potentially billed) call to the provider.
            try {
                model.chat(ChatRequest.builder()
                    .messages(UserMessage.from("test"))
                    .toolSpecifications(List.of())
                    .build());
                return true;
            } catch (UnsupportedOperationException e) {
                return false;
            } catch (Exception e) {
                // Other error, assume supported
                return true;
            }
        }
    }

    public static ModelCapabilities detect(ChatModel model) {
        return new ModelCapabilities(model);
    }
}

Graceful Degradation

import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.request.ChatRequest;
import dev.langchain4j.model.chat.request.ResponseFormat;

public class AdaptiveService {
    private final ChatModel model;
    private final FeatureDetector.ModelCapabilities capabilities;

    public AdaptiveService(ChatModel model) {
        this.model = model;
        this.capabilities = FeatureDetector.detect(model);
    }

    public String processWithStructuredOutput(String prompt) {
        if (capabilities.supportsJsonSchema) {
            // Use native JSON schema support
            return processWithNativeJson(prompt);
        } else {
            // Fallback to prompt engineering
            return processWithPromptEngineering(prompt);
        }
    }

    private String processWithNativeJson(String prompt) {
        ChatRequest request = ChatRequest.builder()
            .messages(UserMessage.from(prompt))
            .responseFormat(ResponseFormat.JSON)
            .build();
        return model.chat(request).aiMessage().text();
    }

    private String processWithPromptEngineering(String prompt) {
        String enhancedPrompt = prompt +
            "\n\nPlease respond in valid JSON format.";
        return model.chat(enhancedPrompt);
    }
}

These patterns round out the guidance above, covering how to use LangChain4j models safely and efficiently in production: resilience (circuit breaker, fallback), rate limiting, and capability-aware graceful degradation.

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j
