tessl/maven-org-springframework-ai--spring-ai-ollama

Spring Boot-compatible Ollama integration providing ChatModel and EmbeddingModel implementations for running large language models locally with support for streaming, tool calling, model management, and observability.

docs/reference/chat-options.md

OllamaChatOptions

Configuration options for Ollama chat model operations.

Overview

OllamaChatOptions provides comprehensive configuration for chat model behavior, including model selection, generation parameters, GPU/memory management, sampling control, and tool calling capabilities.

Class Information

package org.springframework.ai.ollama.api;

public class OllamaChatOptions implements ToolCallingChatOptions

Implements: org.springframework.ai.model.tool.ToolCallingChatOptions

Creating Options

Basic Creation

// Using builder
OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.LLAMA3.id())
    .temperature(0.7)
    .build();

// Copy from existing options
OllamaChatOptions copy = OllamaChatOptions.fromOptions(existingOptions);

Builder Method Overloads

The builder provides overloaded methods for convenient configuration.

Model Selection:

// Accepts String model name
public Builder model(String model);

// Accepts OllamaModel enum
public Builder model(OllamaModel model);

// Using String
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("llama3")
    .build();

// Using enum (recommended)
OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.MISTRAL)
    .build();

Tool Callbacks:

// Accepts List
public Builder toolCallbacks(List<ToolCallback> toolCallbacks);

// Accepts varargs
public Builder toolCallbacks(ToolCallback... toolCallbacks);

// Using List
.toolCallbacks(List.of(callback1, callback2))

// Using varargs
.toolCallbacks(callback1, callback2, callback3)

Tool Names:

// Accepts Set
public Builder toolNames(Set<String> toolNames);

// Accepts varargs
public Builder toolNames(String... toolNames);

// Using Set
.toolNames(Set.of("getTool1", "getTool2"))

// Using varargs
.toolNames("getTool1", "getTool2")

Configuration Categories

Model Selection

Controls which model to use and response format.

OllamaChatOptions options = OllamaChatOptions.builder()
    // Model name (required)
    .model("llama3")

    // Response format: "json" or JSON Schema Map
    .format("json")

    // How long to keep model in memory (e.g., "5m", "1h")
    .keepAlive("10m")

    // Truncate inputs to fit context length
    .truncate(true)
    .build();

Parameters:

  • model (String): Model name from Ollama library
  • format (Object): Response format - String "json" or a Map containing a JSON Schema (see the sketch after this list)
  • keepAlive (String): Duration in Go format (e.g., "5m", "30s", "1h")
  • truncate (Boolean): Auto-truncate to context length (default: true)
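
The Map form of format enables structured outputs. A minimal sketch, assuming the builder's format(Object) overload accepts a JSON Schema as a nested Map, as the parameter list above indicates (the schema content here is illustrative):

Map<String, Object> schema = Map.of(
    "type", "object",
    "properties", Map.of(
        "name", Map.of("type", "string"),
        "hex", Map.of("type", "string")
    ),
    "required", List.of("name", "hex")
);

OllamaChatOptions options = OllamaChatOptions.builder()
    .model("llama3")
    .format(schema)  // Map-based JSON Schema instead of the plain "json" string
    .build();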

Generation Parameters

Control text generation behavior.

OllamaChatOptions options = OllamaChatOptions.builder()
    // Sampling temperature (0.0 - 2.0)
    .temperature(0.8)

    // Maximum tokens to generate
    .numPredict(256)

    // Random seed for reproducibility
    .seed(42)

    // Top-k sampling
    .topK(40)

    // Top-p (nucleus) sampling
    .topP(0.9)

    // Minimum probability threshold
    .minP(0.05)

    // Repetition penalties
    .repeatPenalty(1.1)
    .presencePenalty(0.0)
    .frequencyPenalty(0.0)

    // Stop sequences
    .stop(List.of("Human:", "Assistant:"))
    .build();

Key Parameters:

  • temperature (Double): Creativity control (default: 0.8)
    • Lower (0.0-0.5): More focused and deterministic
    • Higher (0.8-2.0): More creative and diverse
  • numPredict (Integer): Max tokens (default: 128, -1=infinite, -2=fill context)
  • seed (Integer): Random seed (default: -1 for random; see the reproducibility sketch after this list)
  • topK (Integer): Consider top K tokens (default: 40)
  • topP (Double): Nucleus sampling threshold (default: 0.9)
  • minP (Double): Minimum probability relative to top token (default: 0.0)
  • repeatPenalty (Double): Penalize repetitions (default: 1.1)
  • presencePenalty (Double): Presence penalty (default: 0.0)
  • frequencyPenalty (Double): Frequency penalty (default: 0.0)
  • stop (List<String>): Stop sequences
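
For reproducible output in tests, combine a fixed seed with a low temperature. A minimal sketch using only the parameters documented above:

// Deterministic configuration for regression tests
OllamaChatOptions deterministic = OllamaChatOptions.builder()
    .model("llama3")
    .temperature(0.0)  // effectively greedy decoding
    .seed(42)          // fixed seed so repeated calls yield the same text
    .numPredict(64)    // bound the output length
    .build();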

Advanced Sampling

Fine-grained control over token sampling.

OllamaChatOptions options = OllamaChatOptions.builder()
    // Tail-free sampling
    .tfsZ(1.0f)

    // Typical sampling
    .typicalP(1.0f)

    // Repetition context window
    .repeatLastN(64)

    // Mirostat sampling (0=disabled, 1=Mirostat, 2=Mirostat 2.0)
    .mirostat(0)
    .mirostatTau(5.0f)
    .mirostatEta(0.1f)

    // Penalize newlines in output
    .penalizeNewline(true)

    // Number of tokens to keep from prompt
    .numKeep(4)
    .build();

Advanced Parameters:

  • tfsZ (Float): Tail-free sampling (default: 1.0, disabled)
  • typicalP (Float): Typical sampling (default: 1.0)
  • repeatLastN (Integer): Look-back window for penalties (default: 64, 0=disabled, -1=numCtx)
  • mirostat (Integer): Mirostat mode (0/1/2)
  • mirostatTau (Float): Target entropy (default: 5.0)
  • mirostatEta (Float): Learning rate (default: 0.1)
  • penalizeNewline (Boolean): Penalize newlines (default: true)
  • numKeep (Integer): Tokens to keep from prompt (default: 4)

Memory and GPU Management

Configure hardware resource usage.

OllamaChatOptions options = OllamaChatOptions.builder()
    // Context window size
    .numCtx(4096)

    // Batch size for prompt processing
    .numBatch(512)

    // GPU layers (-1 = auto, 0 = CPU only)
    .numGPU(-1)

    // Main GPU for multi-GPU setups
    .mainGPU(0)

    // Low VRAM mode
    .lowVRAM(false)

    // FP16 for KV cache
    .f16KV(true)

    // Return logits for all tokens
    .logitsAll(false)

    // Load only vocabulary
    .vocabOnly(false)

    // Memory mapping
    .useMMap(true)
    .useMLock(false)

    // NUMA support
    .useNUMA(false)

    // Thread count (default: auto-detect)
    .numThread(8)
    .build();

Hardware Parameters:

  • numCtx (Integer): Context window tokens (default: 2048)
  • numBatch (Integer): Prompt batch size (default: 512)
  • numGPU (Integer): GPU layers (default: -1 auto, 0=CPU)
  • mainGPU (Integer): Primary GPU index (default: 0)
  • lowVRAM (Boolean): Low VRAM mode (default: false)
  • f16KV (Boolean): Use FP16 for KV cache (default: true)
  • logitsAll (Boolean): Return logits for all tokens, not just the last one. Required for completions to return logprobs (default: not set/null)
  • vocabOnly (Boolean): Load only the vocabulary, not the weights (default: not set/null)
  • useMMap (Boolean): Memory-map model (default: null)
  • useMLock (Boolean): Lock model in RAM (default: false)
  • useNUMA (Boolean): Enable NUMA (default: false)
  • numThread (Integer): CPU threads (default: auto)

Thinking/Reasoning Models

Enable thinking mode for reasoning models.

// Boolean enable/disable (Qwen 3, DeepSeek-v3.1, DeepSeek R1)
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("qwen3:4b-thinking")
    .enableThinking()  // Enable reasoning traces
    .build();

// Disable thinking explicitly
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("qwen3:4b-thinking")
    .disableThinking()
    .build();

// String levels (GPT-OSS model)
OllamaChatOptions options = OllamaChatOptions.builder()
    .model("gpt-oss")
    .thinkHigh()  // or .thinkLow(), .thinkMedium()
    .build();

Thinking Methods:

  • enableThinking(): Enable reasoning (sets ThinkOption.ThinkBoolean.ENABLED)
  • disableThinking(): Disable reasoning
  • thinkLow(): Low thinking level (GPT-OSS)
  • thinkMedium(): Medium thinking level (GPT-OSS)
  • thinkHigh(): High thinking level (GPT-OSS)
  • thinkOption(ThinkOption): Set custom think option

See thinking.md for detailed usage.

Tool/Function Calling

Configure tools that the model can use.

OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.LLAMA3)

    // Register tool callbacks
    .toolCallbacks(List.of(
        FunctionToolCallback.builder("getWeather", weatherService)
            .description("Get weather for a location")
            .inputType(WeatherRequest.class)
            .build()
    ))

    // Specify which tools to enable
    .toolNames("getWeather", "getTime")

    // Enable internal tool execution
    .internalToolExecutionEnabled(true)

    // Tool context (shared data)
    .toolContext(Map.of("apiKey", "xyz123"))
    .build();
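
The weatherService and WeatherRequest references above are placeholders. A minimal sketch of what they might look like, assuming FunctionToolCallback wraps a plain java.util.function.Function (the types and names here are hypothetical):

import java.util.function.Function;

// Hypothetical request/response types for the example tool
record WeatherRequest(String location) {}
record WeatherResponse(String forecast) {}

// Any Function<WeatherRequest, WeatherResponse> can back the callback
Function<WeatherRequest, WeatherResponse> weatherService =
    request -> new WeatherResponse("Sunny in " + request.location());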

Tool Parameters:

  • toolCallbacks (List<ToolCallback>): Tool implementations
  • toolNames (Set<String>): Enabled tool names
  • internalToolExecutionEnabled (Boolean): Auto-execute tools
  • toolContext (Map<String, Object>): Shared context data

See tool-calling.md for detailed usage.

Usage Examples

Basic Chat

OllamaChatOptions options = OllamaChatOptions.builder()
    .model(OllamaModel.LLAMA3.id())
    .temperature(0.7)
    .numPredict(512)
    .build();

OllamaChatModel chatModel = OllamaChatModel.builder()
    .ollamaApi(ollamaApi)
    .defaultOptions(options)
    .build();

ChatResponse response = chatModel.call(new Prompt("Hello!"));

JSON Output Format

OllamaChatOptions options = OllamaChatOptions.builder()
    .model("llama3")
    .format("json")
    .build();

String prompt = "List 3 colors as JSON array with 'name' and 'hex' fields";
ChatResponse response = chatModel.call(new Prompt(prompt, options));
// Response will be valid JSON
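
When requesting JSON output, also instruct the model in the prompt to respond with JSON (as the prompt above does); Ollama's own documentation notes that omitting this instruction can lead to degenerate output such as long runs of whitespace.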

High-Performance Configuration

OllamaChatOptions options = OllamaChatOptions.builder()
    .model("llama3")
    .numCtx(8192)          // Large context window
    .numBatch(1024)        // Large batch size
    .numGPU(-1)            // Use all GPU layers
    .useMLock(true)        // Lock in RAM for speed
    .numThread(16)         // Use 16 CPU threads
    .keepAlive("30m")      // Keep model loaded longer
    .build();

Per-Request Override

// Default options (applied when building the chat model)
OllamaChatOptions defaultOptions = OllamaChatOptions.builder()
    .model("llama3")
    .temperature(0.7)
    .build();

OllamaChatModel chatModel = OllamaChatModel.builder()
    .ollamaApi(ollamaApi)
    .defaultOptions(defaultOptions)
    .build();

// Override for specific request
OllamaChatOptions requestOptions = OllamaChatOptions.builder()
    .temperature(0.2)      // More deterministic for this request
    .numPredict(100)
    .build();

ChatResponse response = chatModel.call(
    new Prompt("Summarize this text...", requestOptions)
);
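
Per-request options are merged with the chat model's default options at call time, with request-level values taking precedence; fields left unset here (such as model) fall back to the defaults.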

Utility Methods

OllamaChatOptions provides several static and instance utility methods for working with options.

Static Methods

// Filter non-supported fields from options map
public static Map<String, Object> filterNonSupportedFields(Map<String, Object> options);

// Create from existing options (deep copy)
public static OllamaChatOptions fromOptions(OllamaChatOptions options);

Instance Methods

// Convert options to Map for API requests
public Map<String, Object> toMap();

// Create a copy of these options
public OllamaChatOptions copy();

Convert to Map

OllamaChatOptions options = OllamaChatOptions.builder()
    .temperature(0.8)
    .topP(0.9)
    .build();

Map<String, Object> optionsMap = options.toMap();
// Use in API requests or serialization
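// Expected contents (assuming unset fields are omitted, with the snake_case
// keys shown in the filter example below): {"temperature": 0.8, "top_p": 0.9}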

Copy Options

OllamaChatOptions original = OllamaChatOptions.builder()
    .model("llama3")
    .temperature(0.7)
    .build();

// Create a copy (instance method)
OllamaChatOptions copy = original.copy();

// Or use static fromOptions method
OllamaChatOptions copy2 = OllamaChatOptions.fromOptions(original);

// Modify the copy
copy.setTemperature(0.9);
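// The original is unaffected: copy() and fromOptions() return independent instances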

Filter Non-Supported Fields

Removes fields that are not part of the Ollama options API but are managed separately in the request (model, format, keep_alive, truncate).

Map<String, Object> allOptions = Map.of(
    "temperature", 0.8,
    "model", "llama3",       // Non-supported - part of request
    "format", "json",        // Non-supported - part of request
    "keep_alive", "5m",      // Non-supported - part of request
    "truncate", true,        // Non-supported - part of request
    "top_p", 0.9
);

// Remove fields that aren't part of Ollama options API
Map<String, Object> filtered = OllamaChatOptions.filterNonSupportedFields(allOptions);
// Returns only: {"temperature": 0.8, "top_p": 0.9}

Non-Supported Fields: The following fields are filtered out because they're part of the ChatRequest, not the options:

  • model
  • format
  • keep_alive
  • truncate

Default Values

The following defaults are used when options are not explicitly set (see the sketch after this list):

  • numCtx: 2048
  • numBatch: 512
  • numGPU: -1 (auto)
  • mainGPU: 0
  • lowVRAM: false
  • f16KV: true
  • numKeep: 4
  • seed: -1
  • numPredict: 128
  • topK: 40
  • topP: 0.9
  • minP: 0.0
  • temperature: 0.8
  • repeatPenalty: 1.1
  • presencePenalty: 0.0
  • frequencyPenalty: 0.0
  • mirostat: 0
  • mirostatTau: 5.0
  • mirostatEta: 0.1
  • penalizeNewline: true
  • truncate: true
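
These defaults are applied by the Ollama runtime; the options object itself leaves unset fields empty, so they are omitted from the request. A minimal sketch, assuming toMap() skips unset fields (consistent with the filter example above):

OllamaChatOptions sparse = OllamaChatOptions.builder()
    .temperature(0.5)
    .build();

Map<String, Object> map = sparse.toMap();
// Only {"temperature": 0.5} is sent; numCtx, topK, etc. remain unset,
// so the Ollama runtime falls back to the defaults listed above.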

Best Practices

Temperature Selection

// Factual, deterministic responses
.temperature(0.1)

// Balanced (default)
.temperature(0.8)

// Creative writing
.temperature(1.2)

Context Window Management

// Small context for simple queries (faster, less memory)
.numCtx(2048)

// Large context for long conversations or documents
.numCtx(8192)

// Match context to model capabilities
.numCtx(32768)  // For models that support it

GPU Optimization

// CPU-only execution
.numGPU(0)

// Auto-detect optimal GPU layers
.numGPU(-1)

// Manual GPU allocation for specific models
.numGPU(35)  // Load specific number of layers to GPU

Related Documentation

  • thinking.md - Thinking mode for reasoning models
  • tool-calling.md - Tool/function calling configuration

Notes

  1. Not all options are supported by all models
  2. Options in OllamaChatOptions are separate from request-level options in ChatRequest
  3. Some fields (model, format, keepAlive, truncate) are "synthetic" - they're part of the request but managed through options for convenience
  4. Tool-related options (toolCallbacks, toolNames, etc.) are inherited from ToolCallingChatOptions
  5. Thinking models (e.g., qwen3:4b-thinking) auto-enable thinking by default in Ollama 0.12+