Spring Boot-compatible Ollama integration providing ChatModel and EmbeddingModel implementations for running large language models locally with support for streaming, tool calling, model management, and observability.
Configuration options for Ollama chat model operations.
OllamaChatOptions provides comprehensive configuration for chat model behavior, including model selection, generation parameters, GPU/memory management, sampling control, and tool calling capabilities.
```java
package org.springframework.ai.ollama.api;

public class OllamaChatOptions implements ToolCallingChatOptions
```

Implements: `org.springframework.ai.model.tool.ToolCallingChatOptions`
```java
// Using builder
OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.LLAMA3.id())
    .temperature(0.7)
    .build();

// Copy from existing options
OllamaChatOptions copy = OllamaChatOptions.fromOptions(existingOptions);
```

The builder provides overloaded methods for convenient configuration.
Model Selection:

```java
// Accepts String model name
public Builder model(String model);

// Accepts OllamaModel enum
public Builder model(OllamaModel model);
```

```java
// Using String
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("llama3")
    .build();

// Using enum (recommended)
OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.MISTRAL)
    .build();
```

Tool Callbacks:

```java
// Accepts List
public Builder toolCallbacks(List<ToolCallback> toolCallbacks);

// Accepts varargs
public Builder toolCallbacks(ToolCallback... toolCallbacks);
```

```java
// Using List
.toolCallbacks(List.of(callback1, callback2))

// Using varargs
.toolCallbacks(callback1, callback2, callback3)
```

Tool Names:

```java
// Accepts Set
public Builder toolNames(Set<String> toolNames);

// Accepts varargs
public Builder toolNames(String... toolNames);
```

```java
// Using Set
.toolNames(Set.of("getTool1", "getTool2"))

// Using varargs
.toolNames("getTool1", "getTool2")
```

Controls which model to use and the response format.
```java
OllamaChatOptions options = OllamaChatOptions.builder()
    // Model name (required)
    .model("llama3")
    // Response format: "json" or a JSON Schema Map
    .format("json")
    // How long to keep the model in memory (e.g., "5m", "1h")
    .keepAlive("10m")
    // Truncate inputs to fit the context length
    .truncate(true)
    .build();
```

Parameters:

- model (String): Model name from the Ollama library
- format (Object): Response format - the String "json" or a Map containing a JSON Schema (see the sketch after this list)
- keepAlive (String): Duration in Go format (e.g., "5m", "30s", "1h")
- truncate (Boolean): Auto-truncate to context length (default: true)
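Since `format` also accepts a Map containing a JSON Schema, here is a minimal sketch of the Map variant; the schema layout follows Ollama's structured-outputs convention, and the exact keys shown are illustrative assumptions:

```java
// Illustrative JSON Schema passed as a Map to constrain the response structure
Map<String, Object> schema = Map.of(
    "type", "object",
    "properties", Map.of(
        "name", Map.of("type", "string"),
        "hex", Map.of("type", "string")),
    "required", List.of("name", "hex"));

OllamaChatOptions options = OllamaChatOptions.builder()
    .model("llama3")
    .format(schema) // Map variant instead of the String "json"
    .build();
```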
Control text generation behavior.

```java
OllamaChatOptions options = OllamaChatOptions.builder()
    // Sampling temperature (0.0 - 2.0)
    .temperature(0.8)
    // Maximum tokens to generate
    .numPredict(256)
    // Random seed for reproducibility
    .seed(42)
    // Top-k sampling
    .topK(40)
    // Top-p (nucleus) sampling
    .topP(0.9)
    // Minimum probability threshold
    .minP(0.05)
    // Repetition penalties
    .repeatPenalty(1.1)
    .presencePenalty(0.0)
    .frequencyPenalty(0.0)
    // Stop sequences
    .stop(List.of("Human:", "Assistant:"))
    .build();
```

Key Parameters:

- temperature (Double): Creativity control (default: 0.8)
- numPredict (Integer): Max tokens (default: 128; -1 = infinite, -2 = fill context)
- seed (Integer): Random seed (default: -1 for random; see the reproducibility sketch below)
- topK (Integer): Consider top K tokens (default: 40)
- topP (Double): Nucleus sampling threshold (default: 0.9)
- minP (Double): Minimum probability relative to the top token (default: 0.0)
- repeatPenalty (Double): Penalize repetitions (default: 1.1)
- presencePenalty (Double): Presence penalty (default: 0.0)
- frequencyPenalty (Double): Frequency penalty (default: 0.0)
- stop (List<String>): Stop sequences
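Because `seed` defaults to -1 (random), pinning it alongside a low temperature is the usual way to get repeatable output. A brief sketch; note that exact reproducibility across hardware is not guaranteed, so treat this as best-effort:

```java
// Best-effort reproducibility: fixed seed plus near-greedy sampling
OllamaChatOptions deterministic = OllamaChatOptions.builder()
    .model("llama3")
    .temperature(0.0)
    .seed(12345)
    .build();
```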
Fine-grained control over token sampling.

```java
OllamaChatOptions options = OllamaChatOptions.builder()
    // Tail-free sampling
    .tfsZ(1.0f)
    // Typical sampling
    .typicalP(1.0f)
    // Repetition context window
    .repeatLastN(64)
    // Mirostat sampling (0=disabled, 1=Mirostat, 2=Mirostat 2.0)
    .mirostat(0)
    .mirostatTau(5.0f)
    .mirostatEta(0.1f)
    // Penalize newlines in output
    .penalizeNewline(true)
    // Number of tokens to keep from the prompt
    .numKeep(4)
    .build();
```

Advanced Parameters:

- tfsZ (Float): Tail-free sampling (default: 1.0 = disabled)
- typicalP (Float): Typical sampling (default: 1.0)
- repeatLastN (Integer): Look-back window for penalties (default: 64; 0 = disabled, -1 = numCtx)
- mirostat (Integer): Mirostat mode (0/1/2; enabled in the sketch below)
- mirostatTau (Float): Target entropy (default: 5.0)
- mirostatEta (Float): Learning rate (default: 0.1)
- penalizeNewline (Boolean): Penalize newlines (default: true)
- numKeep (Integer): Tokens to keep from the prompt (default: 4)
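The example above leaves Mirostat disabled (`mirostat(0)`); setting mode 2 switches sampling to the adaptive Mirostat 2.0 algorithm, with `mirostatTau` and `mirostatEta` controlling target entropy and learning rate:

```java
// Enable Mirostat 2.0 adaptive sampling (tau/eta shown at their defaults)
OllamaChatOptions mirostatOptions = OllamaChatOptions.builder()
    .model("llama3")
    .mirostat(2)
    .mirostatTau(5.0f)
    .mirostatEta(0.1f)
    .build();
```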
Configure hardware resource usage.

```java
OllamaChatOptions options = OllamaChatOptions.builder()
    // Context window size
    .numCtx(4096)
    // Batch size for prompt processing
    .numBatch(512)
    // GPU layers (-1 = auto, 0 = CPU only)
    .numGPU(-1)
    // Main GPU for multi-GPU setups
    .mainGPU(0)
    // Low VRAM mode
    .lowVRAM(false)
    // FP16 for the KV cache
    .f16KV(true)
    // Return logits for all tokens
    .logitsAll(false)
    // Load only the vocabulary
    .vocabOnly(false)
    // Memory mapping
    .useMMap(true)
    .useMLock(false)
    // NUMA support
    .useNUMA(false)
    // Thread count (default: auto-detect)
    .numThread(8)
    .build();
```

Hardware Parameters:

- numCtx (Integer): Context window tokens (default: 2048)
- numBatch (Integer): Prompt batch size (default: 512)
- numGPU (Integer): GPU layers (default: -1 = auto, 0 = CPU only)
- mainGPU (Integer): Primary GPU index (default: 0)
- lowVRAM (Boolean): Low VRAM mode (default: false)
- f16KV (Boolean): Use FP16 for the KV cache (default: true)
- logitsAll (Boolean): Return logits for all tokens, not just the last one; required for completions to return logprobs (default: not set/null)
- vocabOnly (Boolean): Load only the vocabulary, not the weights (default: not set/null)
- useMMap (Boolean): Memory-map the model (default: null)
- useMLock (Boolean): Lock the model in RAM (default: false)
- useNUMA (Boolean): Enable NUMA (default: false)
- numThread (Integer): CPU threads (default: auto)

Enable thinking mode for reasoning models.
```java
// Boolean enable/disable (Qwen 3, DeepSeek-v3.1, DeepSeek R1)
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("qwen3:4b-thinking")
    .enableThinking() // Enable reasoning traces
    .build();
```

```java
// Disable thinking explicitly
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("qwen3:4b-thinking")
    .disableThinking()
    .build();
```

```java
// String levels (GPT-OSS model)
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("gpt-oss")
    .thinkHigh() // or .thinkLow(), .thinkMedium()
    .build();
```

Thinking Methods:

- enableThinking(): Enable reasoning (returns ThinkOption.ThinkBoolean.ENABLED)
- disableThinking(): Disable reasoning
- thinkLow(): Low thinking level (GPT-OSS)
- thinkMedium(): Medium thinking level (GPT-OSS)
- thinkHigh(): High thinking level (GPT-OSS)
- thinkOption(ThinkOption): Set a custom think option (see the sketch below)

See thinking.md for detailed usage.
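For cases the convenience methods don't cover, `thinkOption(ThinkOption)` takes the option value directly. A sketch assuming the `ThinkOption.ThinkBoolean` constant named in the list above (its exact import path may differ):

```java
// Equivalent to .enableThinking(), passing the ThinkOption constant directly
// (ThinkOption.ThinkBoolean.ENABLED per the method list above)
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("qwen3:4b-thinking")
    .thinkOption(ThinkOption.ThinkBoolean.ENABLED)
    .build();
```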
Configure tools that the model can use.
```java
OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.LLAMA3)
    // Register tool callbacks
    .toolCallbacks(List.of(
        FunctionToolCallback.builder("getWeather", weatherService)
            .description("Get weather for a location")
            .inputType(WeatherRequest.class)
            .build()
    ))
    // Specify which tools to enable
    .toolNames("getWeather", "getTime")
    // Enable internal tool execution
    .internalToolExecutionEnabled(true)
    // Tool context (shared data)
    .toolContext(Map.of("apiKey", "xyz123"))
    .build();
```

Tool Parameters:

- toolCallbacks (List<ToolCallback>): Tool implementations
- toolNames (Set<String>): Enabled tool names
- internalToolExecutionEnabled (Boolean): Auto-execute tools
- toolContext (Map<String, Object>): Shared context data

See tool-calling.md for detailed usage.
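The `weatherService` referenced above can be a plain `java.util.function.Function`. A hypothetical sketch of types that would satisfy the `getWeather` callback (the names and fields are illustrative, not part of the library):

```java
// Hypothetical request/response types backing the "getWeather" callback above
record WeatherRequest(String location) {}
record WeatherResponse(double temperature, String conditions) {}

// Any Function<WeatherRequest, WeatherResponse> can be passed to the builder
Function<WeatherRequest, WeatherResponse> weatherService =
    request -> new WeatherResponse(22.5, "Sunny in " + request.location());
```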
```java
OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.LLAMA3.id())
    .temperature(0.7)
    .numPredict(512)
    .build();

OllamaChatModel chatModel = OllamaChatModel.builder()
    .ollamaApi(ollamaApi)
    .defaultOptions(options)
    .build();

ChatResponse response = chatModel.call(new Prompt("Hello!"));
```

```java
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("llama3")
    .format("json")
    .build();

String prompt = "List 3 colors as JSON array with 'name' and 'hex' fields";
ChatResponse response = chatModel.call(new Prompt(prompt, options));
// Response will be valid JSON
```
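Since the response body is plain JSON text, it can be mapped onto Java types with any JSON library. A sketch using Jackson; the `Color` record is an illustrative assumption matching the fields requested in the prompt above:

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrative record mirroring the 'name'/'hex' fields from the prompt
record Color(String name, String hex) {}

ObjectMapper mapper = new ObjectMapper();
String json = response.getResult().getOutput().getText();
// readValue throws JsonProcessingException; handle or declare it
List<Color> colors = mapper.readValue(json, new TypeReference<List<Color>>() {});
```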
.model("llama3")
.numCtx(8192) // Large context window
.numBatch(1024) // Large batch size
.numGPU(-1) // Use all GPU layers
.useMLock(true) // Lock in RAM for speed
.numThread(16) // Use 16 CPU threads
.keepAlive("30m") // Keep model loaded longer
.build();// Default options
OllamaChatOptions defaultOptions = OllamaChatOptions.builder()
.model("llama3")
.temperature(0.7)
.build();
// Override for specific request
OllamaChatOptions requestOptions = OllamaChatOptions.builder()
.temperature(0.2) // More deterministic for this request
.numPredict(100)
.build();
ChatResponse response = chatModel.call(
new Prompt("Summarize this text...", requestOptions)
);OllamaChatOptions provides several static and instance utility methods for working with options.
```java
// Filter non-supported fields from an options map
public static Map<String, Object> filterNonSupportedFields(Map<String, Object> options);

// Create from existing options (deep copy)
public static OllamaChatOptions fromOptions(OllamaChatOptions options);
```

```java
// Convert options to a Map for API requests
public Map<String, Object> toMap();

// Create a copy of these options
public OllamaChatOptions copy();
```

```java
OllamaChatOptions options = OllamaChatOptions.builder()
    .temperature(0.8)
    .topP(0.9)
    .build();

Map<String, Object> optionsMap = options.toMap();
// Use in API requests or serialization
```
.model("llama3")
.temperature(0.7)
.build();
// Create a copy (instance method)
OllamaChatOptions copy = original.copy();
// Or use static fromOptions method
OllamaChatOptions copy2 = OllamaChatOptions.fromOptions(original);
// Modify the copy
copy.setTemperature(0.9);Removes fields that are not part of the Ollama options API but are managed separately in the request (model, format, keep_alive, truncate).
```java
Map<String, Object> allOptions = Map.of(
    "temperature", 0.8,
    "model", "llama3",   // Non-supported - part of request
    "format", "json",    // Non-supported - part of request
    "keep_alive", "5m",  // Non-supported - part of request
    "truncate", true,    // Non-supported - part of request
    "top_p", 0.9
);

// Remove fields that aren't part of the Ollama options API
Map<String, Object> filtered = OllamaChatOptions.filterNonSupportedFields(allOptions);
// Returns only: {"temperature": 0.8, "top_p": 0.9}
```

Non-Supported Fields: The following fields are filtered out because they are part of the ChatRequest, not the options:

- model
- format
- keep_alive
- truncate
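Because `toMap()` serializes every field, including these synthetic ones, combining the two utilities is one way to hand-build a raw Ollama options payload. This pairing is an assumption, not a documented recipe:

```java
// Assumed pattern: serialize all options, then strip the request-level
// fields (model, format, keep_alive, truncate) before sending as "options"
Map<String, Object> payload =
    OllamaChatOptions.filterNonSupportedFields(options.toMap());
```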
The following defaults are used when options are not explicitly set:

- numCtx: 2048
- numBatch: 512
- numGPU: -1 (auto)
- mainGPU: 0
- lowVRAM: false
- f16KV: true
- numKeep: 4
- seed: -1
- numPredict: 128
- topK: 40
- topP: 0.9
- minP: 0.0
- temperature: 0.8
- repeatPenalty: 1.1
- presencePenalty: 0.0
- frequencyPenalty: 0.0
- mirostat: 0
- mirostatTau: 5.0
- mirostatEta: 0.1
- penalizeNewline: true
- truncate: true

```java
// Factual, deterministic responses
.temperature(0.1)

// Balanced (default)
.temperature(0.8)

// Creative writing
.temperature(1.2)
```

```java
// Small context for simple queries (faster, less memory)
.numCtx(2048)

// Large context for long conversations or documents
.numCtx(8192)

// Match context to model capabilities
.numCtx(32768) // For models that support it
```

```java
// CPU-only execution
.numGPU(0)

// Auto-detect optimal GPU layers
.numGPU(-1)

// Manual GPU allocation for specific models
.numGPU(35) // Load a specific number of layers onto the GPU
```

Notes:

- OllamaChatOptions are separate from request-level options in ChatRequest
- Some options (model, format, keepAlive, truncate) are "synthetic" - they are part of the request but managed through options for convenience
- Tool-calling options (toolCallbacks, toolNames, etc.) are inherited from ToolCallingChatOptions
- Thinking models (e.g., qwen3:4b-thinking) auto-enable thinking by default in Ollama 0.12+