tessl/maven-dev-langchain4j--langchain4j-pgvector

LangChain4j PGVector integration for PostgreSQL-based vector embedding storage and retrieval

Embedding Operations

Name: tessl/maven-dev-langchain4j--langchain4j-pgvector
Author: tessl

Add, remove, and manage embeddings with support for single and batch operations, including text segments and metadata.

Capabilities

Add Single Embedding

Add a single embedding to the store with auto-generated ID.

/**
 * Adds an embedding to the store with auto-generated ID
 * @param embedding The embedding to be added to the store
 * @return The auto-generated ID (UUID) associated with the added embedding
 */
String add(Embedding embedding);

Usage Example:

import dev.langchain4j.data.embedding.Embedding;

Embedding embedding = embeddingModel.embed("sample text").content();
String id = embeddingStore.add(embedding);
System.out.println("Added embedding with ID: " + id);

Add Embedding with ID

Add a single embedding to the store with a specific ID.

/**
 * Adds an embedding to the store with a specific ID
 * If an embedding with this ID already exists, it will be replaced (upsert behavior)
 * @param id The unique identifier for the embedding to be added
 * @param embedding The embedding to be added to the store
 */
void add(String id, Embedding embedding);

Usage Example:

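// The pgvector table stores IDs in a UUID column, so the ID must be a valid UUID string
String customId = "123e4567-e89b-12d3-a456-426614174000";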
Embedding embedding = embeddingModel.embed("sample text").content();
embeddingStore.add(customId, embedding);

Add Embedding with Text Segment

Add an embedding along with the original text content and metadata.

/**
 * Adds an embedding and the corresponding content that has been embedded to the store
 * @param embedding The embedding to be added to the store
 * @param textSegment Original content that was embedded, including text and optional metadata
 * @return The auto-generated ID (UUID) associated with the added embedding
 */
String add(Embedding embedding, TextSegment textSegment);

Usage Example:

import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.data.document.Metadata;

String text = "LangChain4j is a Java framework for building LLM applications";
Metadata metadata = new Metadata();
metadata.put("source", "documentation");
metadata.put("page", 1);

TextSegment segment = TextSegment.from(text, metadata);
Embedding embedding = embeddingModel.embed(text).content();

String id = embeddingStore.add(embedding, segment);

Add Multiple Embeddings

Add multiple embeddings in a single batch operation.

/**
 * Adds multiple embeddings to the store with auto-generated IDs
 * More efficient than adding embeddings one by one
 * @param embeddings A list of embeddings to be added to the store
 * @return A list of auto-generated IDs (UUIDs) associated with the added embeddings
 */
List<String> addAll(List<Embedding> embeddings);

Usage Example:

import java.util.List;
import java.util.stream.Collectors;

List<String> texts = List.of("text 1", "text 2", "text 3");
List<Embedding> embeddings = texts.stream()
    .map(text -> embeddingModel.embed(text).content())
    .collect(Collectors.toList());

List<String> ids = embeddingStore.addAll(embeddings);
System.out.println("Added " + ids.size() + " embeddings");

Add Multiple Embeddings with Details

Add multiple embeddings with their IDs and optional text segments.

/**
 * Adds multiple embeddings with their IDs and optional text segments
 * Performs upsert - if an ID already exists, the embedding is replaced
 * @param ids List of unique identifiers for the embeddings
 * @param embeddings List of embeddings to be added
 * @param embedded List of text segments (can be null, or individual elements can be null)
 * @throws IllegalArgumentException if ids and embeddings sizes don't match, or if embedded is non-null and size doesn't match
 */
void addAll(List<String> ids, List<Embedding> embeddings, List<TextSegment> embedded);

Usage Example:

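import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

List<String> texts = List.of("text 1", "text 2", "text 3");

// The pgvector table stores IDs in a UUID column, so derive a stable UUID from each application key
List<String> ids = IntStream.range(0, texts.size())
    .mapToObj(i -> UUID.nameUUIDFromBytes(("doc-" + i).getBytes(StandardCharsets.UTF_8)).toString())
    .collect(Collectors.toList());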

List<Embedding> embeddings = texts.stream()
    .map(text -> embeddingModel.embed(text).content())
    .collect(Collectors.toList());

List<TextSegment> segments = texts.stream()
    .map(TextSegment::from)
    .collect(Collectors.toList());

embeddingStore.addAll(ids, embeddings, segments);
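
The size rules in the Javadoc above can also be checked before any embeddings are computed, so that a mismatch fails fast. The following is a minimal sketch of such a pre-check; `BatchArgs` and `requireSameSize` are hypothetical helper names, not part of LangChain4j, and the messages are illustrative.

```java
import java.util.List;

final class BatchArgs {

    // Rejects parallel lists whose sizes disagree; a null 'embedded' list is allowed.
    static void requireSameSize(List<?> ids, List<?> embeddings, List<?> embedded) {
        if (ids.size() != embeddings.size()) {
            throw new IllegalArgumentException(
                "ids size (" + ids.size() + ") != embeddings size (" + embeddings.size() + ")");
        }
        if (embedded != null && embedded.size() != embeddings.size()) {
            throw new IllegalArgumentException(
                "embedded size (" + embedded.size() + ") != embeddings size (" + embeddings.size() + ")");
        }
    }
}
```

Calling `BatchArgs.requireSameSize(ids, embeddings, segments)` just before `addAll` keeps the failure close to the code that built the lists.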

Remove Single Embedding by ID

Remove a specific embedding by its ID.

/**
 * Removes a single embedding by its ID
 * This is a convenience method equivalent to removeAll(Collections.singleton(id))
 * @param id The ID of the embedding to remove
 */
void remove(String id);

Usage Example:

// IDs are UUID strings, for example one returned earlier by add(...)
String idToRemove = "123e4567-e89b-12d3-a456-426614174000";
embeddingStore.remove(idToRemove);

Remove All Embeddings

Remove all embeddings from the store (truncates the table).

/**
 * Removes all embeddings from the store
 * This operation truncates the table and cannot be undone
 */
void removeAll();

Usage Example:

// Clear all embeddings
embeddingStore.removeAll();

Remove Embeddings by IDs

Remove specific embeddings by their IDs.

/**
 * Removes embeddings by their IDs
 * @param ids Collection of embedding IDs to remove
 * @throws IllegalArgumentException if ids collection is null or empty
 */
void removeAll(Collection<String> ids);

Usage Example:

import java.util.List;

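// IDs are UUID strings, for example those returned earlier by add(...) or addAll(...)
List<String> idsToRemove = List.of(
    "123e4567-e89b-12d3-a456-426614174000",
    "123e4567-e89b-12d3-a456-426614174001");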
embeddingStore.removeAll(idsToRemove);

Remove Embeddings by Filter

Remove embeddings that match specific metadata filter criteria.

/**
 * Removes all embeddings that match the specified filter
 * The filter is applied to metadata fields
 * @param filter Filter to match embeddings for removal
 * @throws IllegalArgumentException if filter is null
 */
void removeAll(Filter filter);

Usage Example:

import dev.langchain4j.store.embedding.filter.Filter;
import dev.langchain4j.store.embedding.filter.MetadataFilterBuilder;

// Remove all embeddings from a specific source
Filter filter = MetadataFilterBuilder.metadataKey("source").isEqualTo("outdated_docs");
embeddingStore.removeAll(filter);

// Remove embeddings older than a certain date
// (assumes created_date is stored as an ISO-8601 string, so string order matches date order)
Filter dateFilter = MetadataFilterBuilder.metadataKey("created_date")
    .isLessThan("2024-01-01");
embeddingStore.removeAll(dateFilter);
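
Whether a string comparison like the one above matches chronological order depends on how the date was written into the metadata. The sketch below assumes the value is stored as an ISO-8601 `yyyy-MM-dd` string and that the store compares it as text; both are assumptions to verify against your setup. `DateKeys` is a hypothetical helper, not part of LangChain4j.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class DateKeys {

    // Formats a date as yyyy-MM-dd, a form whose text order equals date order.
    static String toKey(LocalDate date) {
        return date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // Mirrors a text-based "less than" comparison between two such keys.
    static boolean isBefore(String key, String cutoffKey) {
        return key.compareTo(cutoffKey) < 0;
    }
}
```

Writing the metadata value with `DateKeys.toKey(...)` at ingestion time and building the cutoff the same way keeps both sides of the comparison in the same format.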

Batch Operations Best Practices

Performance Optimization

For large datasets, use batch operations instead of single operations:

// Less efficient - many individual operations
for (String text : largeTextList) {
    Embedding embedding = embeddingModel.embed(text).content();
    embeddingStore.add(embedding);
}

// More efficient - all embeddings are written to the store in a single batch operation
List<Embedding> embeddings = largeTextList.stream()
    .map(text -> embeddingModel.embed(text).content())
    .collect(Collectors.toList());
embeddingStore.addAll(embeddings);

Document Ingestion Workflow

Complete workflow for ingesting documents:

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import java.util.List;
import java.util.stream.Collectors;

// Load document
Document document = FileSystemDocumentLoader.loadDocument("/path/to/document.txt");

// Split into chunks
DocumentSplitter splitter = DocumentSplitters.recursive(300, 50);
List<TextSegment> segments = splitter.split(document);

// Generate embeddings for all segments in one call
List<Embedding> embeddings = embeddingModel.embedAll(segments).content();

// Generate IDs
List<String> ids = segments.stream()
    .map(segment -> java.util.UUID.randomUUID().toString())
    .collect(Collectors.toList());

// Store all at once
embeddingStore.addAll(ids, embeddings, segments);
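
Random UUIDs mean that ingesting the same document twice stores every segment twice. One possible alternative, sketched below, derives the ID from the content, so that a repeat ingestion reuses the same IDs and the upsert behavior replaces rows instead of duplicating them. `SegmentIds` and its `source` parameter are hypothetical names for illustration, not part of LangChain4j.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class SegmentIds {

    // Derives a stable, name-based (version 3) UUID string from a source label and the segment text.
    static String idFor(String source, String text) {
        String key = source + "\n" + text;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

In the workflow above, the ID step could then map each segment to `SegmentIds.idFor("/path/to/document.txt", segment.text())`. Note that two segments with identical text from the same source would share an ID under this scheme.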

Upsert Behavior

The add and addAll methods with explicit IDs perform upsert operations:

// id1, id2 and id3 are UUID strings (the pgvector table stores IDs in a UUID column)

// First insert
embeddingStore.add(id1, embedding1);

// Update with same ID - replaces the existing embedding
embeddingStore.add(id1, embedding2);

// Batch upsert
List<String> ids = List.of(id1, id2, id3);
List<Embedding> embeddings = List.of(emb1, emb2, emb3);
List<TextSegment> segments = List.of(seg1, seg2, seg3);

// Will replace the row for id1, insert id2 and id3
embeddingStore.addAll(ids, embeddings, segments);
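
As a mental model only, the replace-or-insert rule behaves like `Map.put` keyed by ID. The toy class below is not the store's implementation (the real store works against a PostgreSQL table); `UpsertModel` is a hypothetical name used purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UpsertModel {

    private final Map<String, float[]> rows = new LinkedHashMap<>();

    // Inserts each (id, vector) pair; an existing ID has its vector replaced.
    void addAll(List<String> ids, List<float[]> vectors) {
        if (ids.size() != vectors.size()) {
            throw new IllegalArgumentException("ids and vectors must have the same size");
        }
        for (int i = 0; i < ids.size(); i++) {
            rows.put(ids.get(i), vectors.get(i));
        }
    }

    int size() {
        return rows.size();
    }

    float[] get(String id) {
        return rows.get(id);
    }
}
```

Adding ID `a` once and then adding `a` and `b` together leaves two rows, with `a` holding the newer vector.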

Error Handling

Handle potential errors during operations:

import java.sql.SQLException;

try {
    embeddingStore.add(embedding, textSegment);
} catch (RuntimeException e) {
    if (e.getCause() instanceof SQLException) {
        // Handle database connection or constraint errors
        // (logger is assumed to be a logger, for example SLF4J, defined by the enclosing class)
        logger.error("Database error: " + e.getMessage(), e);
    } else {
        throw e;
    }
}
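
The check above inspects only the direct cause. If the exception has been wrapped more than once, the `SQLException` may sit deeper in the chain. A minimal sketch of a helper that walks the chain follows; `Causes` and `findSqlException` are hypothetical names, and the depth cap of 32 is an arbitrary safeguard.

```java
import java.sql.SQLException;
import java.util.Optional;

final class Causes {

    // Walks the cause chain and returns the first SQLException found, if any.
    static Optional<SQLException> findSqlException(Throwable error) {
        Throwable current = error;
        int depth = 0;
        // The depth cap guards against cyclic cause chains
        while (current != null && depth < 32) {
            if (current instanceof SQLException) {
                return Optional.of((SQLException) current);
            }
            current = current.getCause();
            depth++;
        }
        return Optional.empty();
    }
}
```

Inside the `catch` block, `Causes.findSqlException(e)` could replace the `instanceof` test, and `getSQLState()` on the result helps distinguish connection failures from constraint violations.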

Metadata Storage

Metadata is stored according to the configured MetadataStorageConfig:

import dev.langchain4j.data.document.Metadata;

// Create metadata
Metadata metadata = new Metadata();
metadata.put("source", "documentation");
metadata.put("page", 42);
metadata.put("section", "installation");

// Create text segment with metadata
TextSegment segment = TextSegment.from("Installation instructions...", metadata);

// Add with metadata (stored according to configuration)
embeddingStore.add(embedding, segment);

Important Notes

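ID Format: Auto-generated IDs are UUIDs in string format; explicitly supplied IDs should also be valid UUID strings, since the pgvector table stores them in a UUID column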
Upsert Behavior: Adding an embedding with an existing ID replaces the old embedding
Batch Size: For very large batches (>10,000), consider chunking the operations
Metadata: Only included when adding with a TextSegment; null metadata is stored as NULL in the database
Empty Lists: addAll(List<Embedding>) with an empty list is a no-op
Transaction: Each operation runs in its own database transaction
Connection: A database connection is obtained from the pool for each operation
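
The batch-size note above can be followed with a small partitioning helper. This is a sketch under stated assumptions: `BatchChunks` is a hypothetical class, and the right chunk size depends on vector dimension and database settings rather than on any fixed rule.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchChunks {

    // Splits a list into consecutive sublists of at most chunkSize elements.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy so each chunk is independent of the source list
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```

Partitioning the ID, embedding, and segment lists with the same chunk size yields aligned chunks, so the chunk at each index can be passed to `addAll(ids, embeddings, embedded)` in turn. Since each operation runs in its own transaction, a failure partway through leaves earlier chunks stored.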

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j-pgvector
