tessl/maven-io-quarkiverse-langchain4j--quarkus-langchain4j-anthropic

Quarkus extension for integrating Anthropic Claude LLM models into Quarkus applications via LangChain4j


Advanced Features

The quarkus-langchain4j-anthropic extension provides advanced capabilities including extended thinking mode, prompt caching, streaming responses, and tool/function calling.

Extended Thinking Mode

Claude's extended thinking mode enables the model to perform internal reasoning before generating a response, improving performance on complex tasks.

ThinkingConfig

Configuration interface for thinking mode:

package io.quarkiverse.langchain4j.anthropic.runtime.config;

import java.util.Optional;

@ConfigGroup
interface ThinkingConfig { // nested inside ChatModelConfig as ChatModelConfig.ThinkingConfig
    /**
     * The thinking type to enable Claude's reasoning process.
     * Set to "enabled" to activate extended thinking.
     */
    Optional<String> type();

    /**
     * The token budget for the model's thinking process.
     * This allocates tokens specifically for internal reasoning
     * before generating the final response.
     * Typical values: 5000-15000 tokens
     */
    Optional<Integer> budgetTokens();

    /**
     * Whether thinking results should be returned in the response.
     * When true, the thinking content is included in the AiMessage.
     * Default: false
     */
    Optional<Boolean> returnThinking();

    /**
     * Whether previously stored thinking should be sent in follow-up requests.
     * Maintains reasoning context across multiple interactions.
     * Default: true
     */
    Optional<Boolean> sendThinking();

    /**
     * Enable interleaved thinking for Claude 4 models.
     * Allows reasoning between tool calls.
     * Requires Claude 4 model (e.g., claude-opus-4-20250514)
     * and thinking.type: enabled.
     * Default: false
     */
    Optional<Boolean> interleaved();
}

Configuration Example

# Enable thinking
quarkus.langchain4j.anthropic.chat-model.thinking.type=enabled

# Allocate 10000 tokens for thinking
quarkus.langchain4j.anthropic.chat-model.thinking.budget-tokens=10000

# Return thinking content in response
quarkus.langchain4j.anthropic.chat-model.thinking.return-thinking=true

# Send thinking in follow-up messages
quarkus.langchain4j.anthropic.chat-model.thinking.send-thinking=true

# Enable interleaved thinking for Claude 4
quarkus.langchain4j.anthropic.chat-model.thinking.interleaved=true

Usage Example

import jakarta.inject.Inject;
import jakarta.enterprise.context.ApplicationScoped;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.data.message.UserMessage;

@ApplicationScoped
public class ComplexReasoningService {
    @Inject
    ChatModel chatModel;

    public String solveComplexProblem(String problem) {
        // With thinking enabled, Claude will reason internally
        // before generating the final answer
        ChatResponse response = chatModel.chat(UserMessage.from(problem));

        // Access thinking content if return-thinking=true
        String thinking = response.aiMessage().thinking();
        String answer = response.aiMessage().text();

        System.out.println("Thinking: " + thinking);
        return answer;
    }
}

When to Use Thinking Mode

  • Complex problem-solving: Mathematical problems, logic puzzles, code debugging
  • Multi-step reasoning: Tasks requiring planning and step-by-step analysis
  • Accuracy-critical tasks: When correctness is more important than speed
  • Development/debugging: View reasoning process during development

Thinking Mode Considerations

  • Latency: Adds processing time (thinking tokens are generated first)
  • Cost: Thinking tokens count toward total token usage
  • Token budget: Set appropriate budgetTokens based on problem complexity
  • Model support: Best results with Claude 4 models (Opus, Sonnet)
  • Return thinking: Enable during development, disable in production if not needed
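The cost consideration above can be made concrete with simple arithmetic. This sketch assumes thinking tokens are billed like output tokens (they count toward total usage, as noted above); the rates and token counts are illustrative placeholders, not real Anthropic pricing, and `ThinkingCostEstimate` is a hypothetical helper, not part of the extension.

```java
// Rough cost model when thinking is enabled. Thinking tokens are generated
// before the answer and count toward output-token usage, so a large
// budget-tokens value directly increases per-request cost.
public class ThinkingCostEstimate {
    static double estimateCost(int inputTokens, int thinkingTokens, int answerTokens,
                               double inputRatePer1K, double outputRatePer1K) {
        double inputCost = inputTokens / 1000.0 * inputRatePer1K;
        // Thinking tokens are billed alongside the visible answer.
        double outputCost = (thinkingTokens + answerTokens) / 1000.0 * outputRatePer1K;
        return inputCost + outputCost;
    }

    public static void main(String[] args) {
        // 500 prompt tokens, an 8000-token thinking budget mostly consumed,
        // and a 700-token final answer, at placeholder per-1K rates.
        double cost = estimateCost(500, 8000, 700, 0.03, 0.15);
        System.out.printf("Estimated cost: $%.2f%n", cost); // thinking dominates the bill
    }
}
```

Comparing the same call with `thinkingTokens = 0` shows why `budget-tokens` should be sized to the problem rather than set to a blanket maximum.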

Prompt Caching

Prompt caching reduces costs and latency by caching portions of prompts that are reused across requests.

Configuration

# Cache system messages
quarkus.langchain4j.anthropic.chat-model.cache-system-messages=true

# Cache tool definitions
quarkus.langchain4j.anthropic.chat-model.cache-tools=true

Configuration API

package io.quarkiverse.langchain4j.anthropic.runtime.config;

@ConfigGroup
interface ChatModelConfig {
    /**
     * Cache system messages to reduce costs for repeated prompts.
     * Requires minimum 1024 tokens (Claude Opus/Sonnet) or
     * 2048-4096 tokens (Haiku).
     * Supported models: Claude Opus 4.1, Sonnet 4.5, Haiku 4.5, and later.
     * Default: false
     */
    Boolean cacheSystemMessages();

    /**
     * Cache tool definitions to reduce costs.
     * Requires minimum 1024 tokens (Claude Opus/Sonnet) or
     * 2048-4096 tokens (Haiku).
     * Supported models: Claude Opus 4.1, Sonnet 4.5, Haiku 4.5, and later.
     * Default: false
     */
    Boolean cacheTools();
}

When to Use Prompt Caching

  • Long system messages: When system messages exceed minimum token threshold
  • Repeated tool sets: When using the same tool definitions across requests
  • Large context: When providing extensive background information
  • High-volume applications: Where cache hit rates justify overhead

Caching Requirements

  • Minimum tokens: 1024 (Opus/Sonnet) or 2048-4096 (Haiku)
  • Supported models: Claude Opus 4.1+, Sonnet 4.5+, Haiku 4.5+
  • Cache duration: Typically 5 minutes
  • Cost savings: ~90% reduction for cached tokens
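The ~90% figure above translates into a simple back-of-envelope calculation. In this sketch, cached reads are modeled as 10% of the base input rate; actual rates and cache-write surcharges vary by model, so treat the numbers as placeholders, and note that `CacheSavingsEstimate` is illustrative, not part of the extension.

```java
// Back-of-envelope estimate of prompt-caching savings for a system prompt
// that is re-sent on every request. A cache hit is modeled as a ~90% discount
// on the cached tokens' input rate.
public class CacheSavingsEstimate {
    static double promptCost(int promptTokens, double inputRatePer1K, boolean cacheHit) {
        double effectiveRate = cacheHit ? inputRatePer1K * 0.1 : inputRatePer1K;
        return promptTokens / 1000.0 * effectiveRate;
    }

    public static void main(String[] args) {
        int systemPromptTokens = 4000; // above the 1024-token minimum for Opus/Sonnet
        double uncached = promptCost(systemPromptTokens, 0.03, false);
        double cached = promptCost(systemPromptTokens, 0.03, true);
        System.out.printf("uncached=$%.3f cached=$%.3f per request%n", uncached, cached);
    }
}
```

At high request volume the per-request difference compounds quickly, which is why caching pays off mainly for long, stable system prompts and tool sets.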

Streaming Responses

Streaming enables real-time response generation, providing partial updates as tokens are generated.

StreamingChatResponseHandler

Interface for handling streaming responses:

package dev.langchain4j.model.chat.response;

/**
 * Handler for streaming chat responses
 */
interface StreamingChatResponseHandler {
    /**
     * Called for each partial text response
     *
     * @param response Partial text response
     * @param context Context including stream handle
     */
    void onPartialResponse(
        PartialResponse response,
        PartialResponseContext context
    );

    /**
     * Called for each partial thinking token (if thinking enabled and returned)
     *
     * @param thinking Partial thinking content
     * @param context Context including stream handle
     */
    void onPartialThinking(
        PartialThinking thinking,
        PartialThinkingContext context
    );

    /**
     * Called for each partial tool call during streaming
     *
     * @param toolCall Partial tool call information
     * @param context Context including stream handle
     */
    void onPartialToolCall(
        PartialToolCall toolCall,
        PartialToolCallContext context
    );

    /**
     * Called when a tool call is complete
     *
     * @param toolCall Complete tool call with all arguments
     */
    void onCompleteToolCall(CompleteToolCall toolCall);

    /**
     * Called when streaming completes successfully
     *
     * @param response Complete chat response
     */
    void onCompleteResponse(ChatResponse response);

    /**
     * Called when an error occurs during streaming
     *
     * @param error The error that occurred
     */
    void onError(Throwable error);
}

Related Streaming Types

/**
 * Partial text response
 */
class PartialResponse {
    public String text();
}

/**
 * Context for partial responses
 */
class PartialResponseContext {
    public StreamingHandle streamingHandle();
}

/**
 * Partial thinking content
 */
class PartialThinking {
    public String thinking();
}

/**
 * Context for partial thinking
 */
class PartialThinkingContext {
    public StreamingHandle streamingHandle();
}

/**
 * Context for partial tool calls
 */
class PartialToolCallContext {
    public StreamingHandle streamingHandle();
}

/**
 * Partial tool call
 */
class PartialToolCall {
    public int index();
    public String id();
    public String name();
    public String partialArguments();
}

/**
 * Complete tool call
 */
class CompleteToolCall {
    public int index();
    public dev.langchain4j.agent.tool.ToolExecutionRequest toolExecutionRequest();
}

/**
 * Handle to control streaming
 */
interface StreamingHandle {
    void cancel();
    boolean isCancelled();
}

Basic Streaming Example

import jakarta.inject.Inject;
import jakarta.enterprise.context.ApplicationScoped;
import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.*;

@ApplicationScoped
public class StreamingService {
    @Inject
    StreamingChatModel streamingChatModel;

    public void streamResponse(String prompt) {
        streamingChatModel.chat(prompt, new StreamingChatResponseHandler() {
            @Override
            public void onPartialResponse(PartialResponse response,
                                         PartialResponseContext context) {
                // Print each token as it arrives
                System.out.print(response.text());
            }

            @Override
            public void onCompleteResponse(ChatResponse response) {
                System.out.println("\n--- Complete ---");
                System.out.println("Total tokens: " +
                    response.tokenUsage().totalTokenCount());
            }

            @Override
            public void onError(Throwable error) {
                System.err.println("Error: " + error.getMessage());
            }
        });
    }
}

Streaming with Thinking

streamingChatModel.chat(complexProblem, new StreamingChatResponseHandler() {
    @Override
    public void onPartialThinking(PartialThinking thinking,
                                 PartialThinkingContext context) {
        // Called when thinking tokens arrive (if return-thinking=true)
        // The thinking parameter contains the partial thinking content
        System.out.print("[Thinking received] ");
    }

    @Override
    public void onPartialResponse(PartialResponse response,
                                 PartialResponseContext context) {
        // Stream response tokens
        System.out.print(response.text());
    }

    @Override
    public void onCompleteResponse(ChatResponse response) {
        // Access complete thinking and response
        String fullThinking = response.aiMessage().thinking();
        String fullResponse = response.aiMessage().text();
        System.out.println("\n--- Analysis Complete ---");
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace();
    }
});

Canceling Streams

streamingChatModel.chat(prompt, new StreamingChatResponseHandler() {
    private int tokenCount = 0;
    private static final int MAX_TOKENS = 100;
    private boolean cancelled = false;

    @Override
    public void onPartialResponse(PartialResponse response,
                                 PartialResponseContext context) {
        System.out.print(response.text());
        tokenCount++;

        // Cancel stream after 100 tokens
        if (tokenCount >= MAX_TOKENS) {
            context.streamingHandle().cancel();
            cancelled = true;
            System.out.println("\n[Stream cancelled]");
        }
    }

    @Override
    public void onCompleteResponse(ChatResponse response) {
        if (!cancelled) {
            System.out.println("\n[Stream completed]");
        }
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace();
    }
});

Streaming Best Practices

  • UI Updates: Stream to provide real-time feedback in user interfaces
  • Long Responses: Stream for responses that may take time to generate
  • Token Limits: Use cancellation to enforce token limits
  • Error Handling: Always implement onError for robust applications
  • Buffering: Buffer partial responses if needed for processing
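The buffering point above can be sketched without the extension at all: accumulate partial chunks and emit only complete units. `SentenceBuffer` below is a hypothetical helper (not part of LangChain4j); in a real handler you would call `onPartial` from `onPartialResponse`.

```java
import java.util.ArrayList;
import java.util.List;

// Buffers streamed chunks and flushes whole sentences, so downstream
// consumers (TTS, translation, logging) never see a half-finished sentence.
public class SentenceBuffer {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> sentences = new ArrayList<>();

    void onPartial(String chunk) {
        buffer.append(chunk);
        int end;
        // Flush every complete sentence currently in the buffer.
        while ((end = indexOfTerminator(buffer)) >= 0) {
            sentences.add(buffer.substring(0, end + 1).trim());
            buffer.delete(0, end + 1);
        }
    }

    private static int indexOfTerminator(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '.' || c == '!' || c == '?') return i;
        }
        return -1;
    }

    List<String> sentences() { return sentences; }

    public static void main(String[] args) {
        SentenceBuffer b = new SentenceBuffer();
        for (String chunk : new String[] {"Hel", "lo world. How", " are you?"}) {
            b.onPartial(chunk); // chunks arrive with arbitrary boundaries
        }
        System.out.println(b.sentences()); // [Hello world., How are you?]
    }
}
```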

Tool/Function Calling

The extension supports declarative tool definitions for function calling.

Declarative AI Service with Tools

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.agent.tool.Tool;
import io.quarkiverse.langchain4j.RegisterAiService;
import jakarta.enterprise.context.ApplicationScoped;

// Define tools
@ApplicationScoped
public class WeatherTools {
    @Tool("Get current weather for a location")
    public String getWeather(String location) {
        // Implementation
        return "Sunny, 72°F in " + location;
    }

    @Tool("Get weather forecast for next N days")
    public String getForecast(String location, int days) {
        // Implementation
        return days + "-day forecast for " + location;
    }
}

// AI Service with tools
@RegisterAiService(tools = WeatherTools.class)
public interface WeatherAssistant {
    @SystemMessage("You are a weather assistant. Use the provided tools to answer questions.")
    String chat(String message);
}

// Usage
@ApplicationScoped
public class WeatherService {
    @Inject
    WeatherAssistant assistant;

    public String getWeatherInfo(String query) {
        // Claude will automatically call tools as needed
        return assistant.chat(query);
    }
}

Tool Calling with Streaming

streamingChatModel.chat(request, new StreamingChatResponseHandler() {
    @Override
    public void onPartialToolCall(PartialToolCall toolCall,
                                 PartialToolCallContext context) {
        System.out.println("Calling tool: " + toolCall.name());
        System.out.println("Arguments (partial): " + toolCall.partialArguments());
    }

    @Override
    public void onCompleteToolCall(CompleteToolCall toolCall) {
        var request = toolCall.toolExecutionRequest();
        System.out.println("Tool: " + request.name());
        System.out.println("Arguments: " + request.arguments());
        System.out.println("ID: " + request.id());
    }

    @Override
    public void onPartialResponse(PartialResponse response,
                                 PartialResponseContext context) {
        System.out.print(response.text());
    }

    @Override
    public void onCompleteResponse(ChatResponse response) {
        List<ToolExecutionRequest> toolRequests =
            response.aiMessage().toolExecutionRequests();

        if (!toolRequests.isEmpty()) {
            System.out.println("Tools to execute:");
            for (var tool : toolRequests) {
                System.out.println("- " + tool.name());
            }
        }
    }

    @Override
    public void onError(Throwable error) {
        error.printStackTrace();
    }
});

Tool Caching

Enable tool caching for applications with many tool definitions:

# Cache tool definitions across requests
quarkus.langchain4j.anthropic.chat-model.cache-tools=true

This reduces costs when using the same tools across multiple requests.

Response Attributes

Access additional response metadata through attributes:

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.data.message.AiMessage;
import java.util.List;
import java.util.Map;

ChatResponse response = chatModel.chat(request);
AiMessage aiMessage = response.aiMessage();

// Get thinking signature (if thinking enabled)
Map<String, Object> attributes = aiMessage.attributes();
if (attributes != null) {
    String thinkingSignature = (String) attributes.get("thinking_signature");
    List<String> redactedThinkings = (List<String>) attributes.get("redacted_thinking");

    if (thinkingSignature != null) {
        System.out.println("Thinking signature: " + thinkingSignature);
    }
}

Token Usage Tracking

Monitor token usage for cost management:

import dev.langchain4j.model.chat.response.ChatResponse;
import dev.langchain4j.model.output.TokenUsage;

ChatResponse response = chatModel.chat(request);
TokenUsage usage = response.tokenUsage();

System.out.println("Input tokens: " + usage.inputTokenCount());
System.out.println("Output tokens: " + usage.outputTokenCount());
System.out.println("Total tokens: " + usage.totalTokenCount());

// Calculate approximate cost (example rates)
double inputCost = usage.inputTokenCount() * 0.00003;  // $0.03 per 1K tokens
double outputCost = usage.outputTokenCount() * 0.00015; // $0.15 per 1K tokens
double totalCost = inputCost + outputCost;

System.out.printf("Estimated cost: $%.4f%n", totalCost);

Stop Sequences

Define custom stop sequences to control response termination:

# Stop generation when these sequences are encountered
quarkus.langchain4j.anthropic.chat-model.stop-sequences=STOP,END,###

Configuration API:

package io.quarkiverse.langchain4j.anthropic.runtime.config;

@ConfigGroup
interface ChatModelConfig {
    /**
     * Custom text sequences that will cause the model to stop generating.
     * When the model generates any of these sequences, it will stop
     * and return the response up to that point.
     */
    Optional<List<String>> stopSequences();
}

Usage:

// When configured with stop-sequences=STOP,END
// The model will stop generating when it outputs "STOP" or "END"
String response = chatModel.chat("List items, then output STOP");
// Response will end at "STOP"
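The semantics can be modeled client-side for clarity. The Anthropic API applies stop sequences server-side before the response reaches you; `applyStopSequences` below is purely illustrative.

```java
import java.util.List;

// Models stop-sequence behavior: generation is cut at the first occurrence
// of any configured stop sequence, and the stop sequence itself is excluded.
public class StopSequenceSketch {
    static String applyStopSequences(String generated, List<String> stops) {
        int cut = generated.length();
        for (String stop : stops) {
            int i = generated.indexOf(stop);
            if (i >= 0 && i < cut) cut = i; // earliest match wins
        }
        return generated.substring(0, cut);
    }

    public static void main(String[] args) {
        String raw = "1. apples\n2. pears\nSTOP and more text";
        // Keeps everything before "STOP"
        System.out.println(applyStopSequences(raw, List.of("STOP", "END")));
    }
}
```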

Multi-Turn Conversations

Maintain context across multiple turns:

import jakarta.inject.Inject;
import jakarta.enterprise.context.ApplicationScoped;
import dev.langchain4j.data.message.*;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.response.ChatResponse;
import java.util.ArrayList;
import java.util.List;

@ApplicationScoped
public class ConversationService {
    @Inject
    ChatModel chatModel;

    public String conversation() {
        List<ChatMessage> messages = new ArrayList<>();

        // Turn 1
        messages.add(UserMessage.from("What is the capital of France?"));
        ChatResponse response1 = chatModel.chat(messages.toArray(new ChatMessage[0]));
        messages.add(response1.aiMessage());
        System.out.println("AI: " + response1.aiMessage().text());

        // Turn 2 (with context)
        messages.add(UserMessage.from("What is its population?"));
        ChatResponse response2 = chatModel.chat(messages.toArray(new ChatMessage[0]));
        messages.add(response2.aiMessage());
        System.out.println("AI: " + response2.aiMessage().text());

        // Turn 3 (with full context including thinking if enabled)
        messages.add(UserMessage.from("Compare it to London."));
        ChatResponse response3 = chatModel.chat(messages.toArray(new ChatMessage[0]));
        return response3.aiMessage().text();
    }
}

Thinking with Tool Calls (Interleaved)

Claude 4 models support interleaved thinking, allowing reasoning between tool calls:

# Requires Claude 4 model
quarkus.langchain4j.anthropic.chat-model.model-name=claude-opus-4-20250514

# Enable thinking
quarkus.langchain4j.anthropic.chat-model.thinking.type=enabled
quarkus.langchain4j.anthropic.chat-model.thinking.budget-tokens=10000

# Enable interleaved thinking
quarkus.langchain4j.anthropic.chat-model.thinking.interleaved=true

This allows Claude to:

  1. Think about which tool to call
  2. Execute the tool
  3. Think about the tool's result
  4. Decide on next actions
  5. Generate the final response

Complete Advanced Example

import jakarta.inject.Inject;
import jakarta.enterprise.context.ApplicationScoped;
import dev.langchain4j.model.chat.ChatModel;
import dev.langchain4j.model.chat.StreamingChatModel;
import dev.langchain4j.model.chat.response.*;
import dev.langchain4j.data.message.*;
import io.quarkiverse.langchain4j.ModelName;
import java.util.ArrayList;
import java.util.List;

@ApplicationScoped
public class AdvancedService {
    @Inject
    @ModelName("advanced")
    ChatModel chatModel;

    @Inject
    @ModelName("advanced")
    StreamingChatModel streamingChatModel;

    /**
     * Complex reasoning with thinking, caching, and token tracking
     */
    public String solveComplexProblem(String problem) {
        List<ChatMessage> messages = new ArrayList<>();

        // System message (will be cached if cache-system-messages=true)
        String systemPrompt = """
            You are an expert problem solver with advanced reasoning capabilities.
            Break down complex problems step by step.
            Use your thinking process to explore different approaches.
            """;
        messages.add(SystemMessage.from(systemPrompt));

        // User message
        messages.add(UserMessage.from(problem));

        // Get response with thinking
        ChatResponse response = chatModel.chat(messages.toArray(new ChatMessage[0]));

        // Access thinking and response
        String thinking = response.aiMessage().thinking();
        String answer = response.aiMessage().text();

        // Track token usage
        var usage = response.tokenUsage();
        System.out.println("Tokens: " + usage.totalTokenCount());

        return answer;
    }

    /**
     * Streaming with thinking and cancellation
     */
    public void streamWithControl(String prompt) {
        streamingChatModel.chat(prompt, new StreamingChatResponseHandler() {
            private StringBuilder thinkingBuffer = new StringBuilder();
            private StringBuilder responseBuffer = new StringBuilder();

            @Override
            public void onPartialThinking(PartialThinking partial,
                                         PartialThinkingContext context) {
                thinkingBuffer.append(partial.thinking());
                System.out.print("[Thinking] " + partial.thinking());

                // Cancel if thinking is too verbose
                if (thinkingBuffer.length() > 5000) {
                    context.streamingHandle().cancel();
                }
            }

            @Override
            public void onPartialResponse(PartialResponse partial,
                                         PartialResponseContext context) {
                responseBuffer.append(partial.text());
                System.out.print(partial.text());
            }

            @Override
            public void onPartialToolCall(PartialToolCall toolCall,
                                         PartialToolCallContext context) {
                System.out.println("\n[Tool call] " + toolCall.name());
            }

            @Override
            public void onCompleteToolCall(CompleteToolCall toolCall) {
                System.out.println("[Tool ready] " + toolCall.toolExecutionRequest().name());
            }

            @Override
            public void onCompleteResponse(ChatResponse response) {
                System.out.println("\n--- Complete ---");
                System.out.println("Thinking length: " + thinkingBuffer.length());
                System.out.println("Response length: " + responseBuffer.length());
                System.out.println("Tokens used: " +
                    response.tokenUsage().totalTokenCount());
            }

            @Override
            public void onError(Throwable error) {
                System.err.println("Error: " + error.getMessage());
            }
        });
    }
}

Configuration for Advanced Example

# Advanced model configuration
quarkus.langchain4j.anthropic.advanced.api-key=sk-ant-...
quarkus.langchain4j.anthropic.advanced.chat-model.model-name=claude-opus-4-20250514
quarkus.langchain4j.anthropic.advanced.chat-model.max-tokens=4096
quarkus.langchain4j.anthropic.advanced.chat-model.temperature=0.7

# Extended thinking with interleaving
quarkus.langchain4j.anthropic.advanced.chat-model.thinking.type=enabled
quarkus.langchain4j.anthropic.advanced.chat-model.thinking.budget-tokens=15000
quarkus.langchain4j.anthropic.advanced.chat-model.thinking.return-thinking=true
quarkus.langchain4j.anthropic.advanced.chat-model.thinking.interleaved=true

# Prompt caching for efficiency
quarkus.langchain4j.anthropic.advanced.chat-model.cache-system-messages=true
quarkus.langchain4j.anthropic.advanced.chat-model.cache-tools=true

# Debugging
quarkus.langchain4j.anthropic.advanced.log-requests=true

Install with Tessl CLI

npx tessl i tessl/maven-io-quarkiverse-langchain4j--quarkus-langchain4j-anthropic@1.7.0
