OllamaEmbeddingOptions

Configuration options for Ollama embedding model operations.

Overview

OllamaEmbeddingOptions provides configuration for embedding generation, including model selection, memory management, and GPU allocation. It's a simpler configuration than chat options since embeddings don't require generation parameters.

Class Information

package org.springframework.ai.ollama.api;

public class OllamaEmbeddingOptions implements EmbeddingOptions

Implements: org.springframework.ai.embedding.EmbeddingOptions
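
Because the class implements Spring AI's portable EmbeddingOptions interface, an instance can be passed anywhere generic embedding options are accepted. A minimal sketch:

// Usable through the portable interface
EmbeddingOptions portable = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .build();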

Creating Options

Basic Creation

// Using builder
OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model(OllamaModel.NOMIC_EMBED_TEXT.id())
    .keepAlive("10m")
    .build();

// Copy from existing options
OllamaEmbeddingOptions copy = OllamaEmbeddingOptions.fromOptions(existingOptions);

Builder Method Overloads

The builder provides overloaded methods for convenient model selection.

Model Selection:

// Accepts String model name
public Builder model(String model);

// Accepts OllamaModel enum
public Builder model(OllamaModel model);

Usage:

// Using String
OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .build();

// Using enum (recommended)
OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model(OllamaModel.MXBAI_EMBED_LARGE)
    .build();

Configuration Parameters

Model Selection

Controls which embedding model to use.

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    // Model name (required)
    .model("nomic-embed-text")

    // How long to keep model in memory
    .keepAlive("5m")

    // Truncate inputs to fit context length
    .truncate(true)
    .build();

Parameters:

  • model (String): Embedding model name from Ollama library
  • keepAlive (String): Duration in Go format (e.g., "5m", "30s", "1h")
  • truncate (Boolean): Auto-truncate to context length (default: true)

Memory and GPU Management

Configure hardware resource usage for embedding generation.

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    // GPU configuration
    .numGPU(-1)           // -1 = auto, 0 = CPU only
    .mainGPU(0)           // Main GPU for multi-GPU setups
    .lowVRAM(false)       // Low VRAM mode

    // Memory management
    .useMMap(true)        // Memory-map model files
    .useMLock(false)      // Lock model in RAM
    .useNUMA(false)       // Enable NUMA

    // Processing configuration
    .numBatch(512)        // Batch size for processing
    .numThread(8)         // CPU threads to use
    .vocabOnly(false)     // Load only vocabulary
    .build();

Hardware Parameters:

  • numGPU (Integer): Number of GPU layers to offload (default: -1 for auto-detect; 0 forces CPU-only)
  • mainGPU (Integer): Primary GPU index for multi-GPU (default: 0)
  • lowVRAM (Boolean): Enable low VRAM mode (default: false)
  • useMMap (Boolean): Memory-map model files (default: null)
  • useMLock (Boolean): Lock model in RAM (default: false)
  • useNUMA (Boolean): Enable NUMA support (default: false)
  • numBatch (Integer): Batch size (default: 512)
  • numThread (Integer): CPU threads (default: auto-detect)
  • vocabOnly (Boolean): Load only vocabulary, not weights (default: false)

Usage Examples

Basic Embedding Generation

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model(OllamaModel.NOMIC_EMBED_TEXT.id())
    .build();

OllamaEmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
    .ollamaApi(ollamaApi)
    .defaultOptions(options)
    .build();

float[] embedding = embeddingModel.embed("Hello, world!");

Batch Processing

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .numBatch(1024)       // Larger batch size for throughput
    .keepAlive("15m")     // Keep model loaded for batch jobs
    .build();

List<String> texts = List.of("text1", "text2", "text3");
EmbeddingResponse response = embeddingModel.call(
    new EmbeddingRequest(texts, options)
);
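
Each input string yields one embedding in the response, in request order. A short sketch of reading the vectors back, assuming Spring AI's standard Embedding result type:

// Each result wraps one float[] vector, ordered like the input texts
List<float[]> vectors = response.getResults().stream()
    .map(Embedding::getOutput)
    .toList();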

CPU-Only Embedding

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .numGPU(0)           // Use CPU only
    .numThread(16)       // Use 16 CPU threads
    .useMMap(true)       // Memory-map for efficient loading
    .build();

High-Performance GPU Configuration

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("mxbai-embed-large")
    .numGPU(-1)          // Use all available GPU layers
    .numBatch(2048)      // Large batch size
    .useMLock(true)      // Lock in RAM for faster access
    .keepAlive("30m")    // Keep loaded for multiple operations
    .build();

Per-Request Override

// Default options
OllamaEmbeddingOptions defaultOptions = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .numBatch(512)
    .build();

OllamaEmbeddingModel embeddingModel = OllamaEmbeddingModel.builder()
    .ollamaApi(ollamaApi)
    .defaultOptions(defaultOptions)
    .build();

// Override for specific request
OllamaEmbeddingOptions requestOptions = OllamaEmbeddingOptions.builder()
    .truncate(false)     // Don't truncate for this request
    .build();

EmbeddingResponse response = embeddingModel.call(
    new EmbeddingRequest(List.of("long text..."), requestOptions)
);

Utility Methods

OllamaEmbeddingOptions provides several static and instance utility methods for working with options.

Static Methods

// Filter non-supported fields from options map
public static Map<String, Object> filterNonSupportedFields(Map<String, Object> options);

// Create from existing options (deep copy)
public static OllamaEmbeddingOptions fromOptions(OllamaEmbeddingOptions options);

Instance Methods

// Convert options to Map for API requests
public Map<String, Object> toMap();

// Create a copy of these options
public OllamaEmbeddingOptions copy();

// Get embedding dimensions (always returns null for Ollama)
public Integer getDimensions();

Convert to Map

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .numGPU(-1)
    .build();

Map<String, Object> optionsMap = options.toMap();
// Use in API requests or serialization

Copy Options

OllamaEmbeddingOptions original = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .numBatch(512)
    .build();

// Create a copy (instance method)
OllamaEmbeddingOptions copy = original.copy();

// Or use static fromOptions method
OllamaEmbeddingOptions copy2 = OllamaEmbeddingOptions.fromOptions(original);

// Modify the copy
copy.setNumBatch(1024);

Filter Non-Supported Fields

Removes fields that are not part of the Ollama options API but are managed separately in the request (model, keep_alive, truncate).

Map<String, Object> allOptions = Map.of(
    "num_batch", 512,
    "model", "nomic-embed-text",  // Non-supported - part of request
    "keep_alive", "5m",            // Non-supported - part of request
    "truncate", true,              // Non-supported - part of request
    "num_gpu", -1
);

// Remove fields that aren't part of Ollama options API
Map<String, Object> filtered = OllamaEmbeddingOptions.filterNonSupportedFields(allOptions);
// Returns only: {"num_batch": 512, "num_gpu": -1}

Non-Supported Fields: The following fields are filtered out because they're part of the EmbeddingsRequest, not the options:

  • model
  • keep_alive
  • truncate

Dimensions Method

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .build();

// getDimensions() always returns null for Ollama
// Ollama determines dimensions based on the model
Integer dimensions = options.getDimensions();  // null
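
To discover the actual dimensionality at runtime, embed a probe text and inspect the vector length. A minimal sketch (the value varies by model; 768 applies to nomic-embed-text):

float[] probe = embeddingModel.embed("probe");
int dimensions = probe.length;  // e.g. 768 for nomic-embed-text

If the model class extends Spring AI's AbstractEmbeddingModel, its dimensions() method performs the same probe internally.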

Default Values

When options are not explicitly set:

  • numBatch: 512
  • numGPU: -1 (auto-detect)
  • mainGPU: 0
  • lowVRAM: false
  • useMMap: null (Ollama default)
  • useMLock: false
  • useNUMA: false
  • numThread: auto-detect
  • vocabOnly: false
  • truncate: true
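
In practice, a builder that sets only the model is often sufficient; everything else falls back to the defaults above:

// Only the model is set explicitly; all other values use the defaults listed above
OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .build();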

Recommended Embedding Models

Nomic Embed Text

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model(OllamaModel.NOMIC_EMBED_TEXT.id())  // "nomic-embed-text"
    .build();
// High-quality, large context window (8192 tokens)

MixedBread AI Large

OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model(OllamaModel.MXBAI_EMBED_LARGE.id())  // "mxbai-embed-large"
    .build();
// Large, high-accuracy embedding model (1024-dimensional vectors)

Best Practices

Memory Management

// For large batches or long-running processes
OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .useMLock(true)       // Lock in RAM to avoid swapping
    .keepAlive("1h")      // Keep loaded for extended operations
    .build();

// For memory-constrained environments
OllamaEmbeddingOptions options = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .lowVRAM(true)        // Reduce VRAM usage
    .numGPU(0)            // Use CPU if needed
    .keepAlive("1m")      // Unload quickly after use
    .build();

Batch Size Tuning

// Small batches (lower memory, higher latency)
.numBatch(256)

// Default batch size
.numBatch(512)

// Large batches (higher memory, better throughput)
.numBatch(2048)

GPU Optimization

// CPU-only (no GPU required)
.numGPU(0)
.numThread(8)

// Auto-detect optimal GPU usage
.numGPU(-1)

// Multi-GPU setup
.numGPU(-1)
.mainGPU(0)  // Use GPU 0 for small tensors

Truncation Handling

// Auto-truncate long texts (default, safer)
.truncate(true)

// Error on texts exceeding context length (stricter)
.truncate(false)
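
With truncate(false), an input longer than the model's context window is rejected by the server instead of being silently cut. A hedged sketch of handling that case (veryLongText is a placeholder; the concrete exception type depends on the underlying HTTP client, so this catches broadly):

OllamaEmbeddingOptions strict = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .truncate(false)
    .build();

try {
    embeddingModel.call(new EmbeddingRequest(List.of(veryLongText), strict));
} catch (RuntimeException e) {
    // The server rejected the over-length input; shorten or split the text and retry
}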

Comparison with Chat Options

OllamaEmbeddingOptions is simpler than OllamaChatOptions because embeddings don't require:

  • Generation parameters (temperature, top-k, top-p, etc.)
  • Sampling control (mirostat, penalties, etc.)
  • Stop sequences
  • Thinking/reasoning options
  • Tool calling

Both share common parameters:

  • Model selection
  • GPU/memory configuration
  • Keep-alive duration
  • Truncation behavior
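
For illustration, the shared parameters configure identically on both builders; this sketch assumes OllamaChatOptions exposes the same model, GPU, and keep-alive builder methods:

// Shared hardware and lifetime settings look the same on both option types
OllamaEmbeddingOptions embeddingOptions = OllamaEmbeddingOptions.builder()
    .model("nomic-embed-text")
    .numGPU(-1)
    .keepAlive("10m")
    .build();

OllamaChatOptions chatOptions = OllamaChatOptions.builder()
    .model("llama3.2")
    .numGPU(-1)
    .keepAlive("10m")
    .temperature(0.7)  // generation parameter: exists only on chat options
    .build();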

Related Documentation

  • OllamaEmbeddingModel - Using the embedding model
  • OllamaModel - Available embedding model constants
  • OllamaChatOptions - Chat model options for comparison
  • API Types - EmbeddingsRequest and EmbeddingsResponse

Notes

  1. getDimensions() always returns null - Ollama determines embedding dimensions based on the model
  2. Embedding models are typically smaller and faster than chat models
  3. Not all Ollama models support embeddings - use dedicated embedding models like nomic-embed-text
  4. The vocabOnly option is rarely used - it loads only vocabulary without weights
  5. Fields model, keepAlive, and truncate are "synthetic" - they're part of the request but managed through options
  6. Embedding generation doesn't support streaming