tessl/maven-dev-langchain4j--langchain4j-chroma

LangChain4j integration for the Chroma embedding store, enabling storage, retrieval, and similarity search of vector embeddings, with metadata filtering support for both API V1 and V2.

docs/guides/batch-operations.md

Batch Operations Guide

Efficient batch processing patterns for ChromaEmbeddingStore.

Why Batch Operations

Single operations:

// N HTTP requests, high overhead
for (Embedding emb : embeddings) {
    store.add(emb);
}

Batch operations:

// 1 HTTP request, efficient
List<String> ids = store.addAll(embeddings);

Performance difference: typically 10-100x faster for large datasets, because the fixed per-request overhead is paid once instead of N times.
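The gap comes almost entirely from per-request overhead. A back-of-the-envelope sketch (the 20 ms round-trip and 0.5 ms per-item figures are illustrative assumptions, not measured values):

```java
public class Overhead {
    /** Estimated total time: fixed cost per request plus marginal cost per item. */
    static double totalMs(int items, int requests, double roundTripMs, double perItemMs) {
        return requests * roundTripMs + items * perItemMs;
    }

    public static void main(String[] args) {
        int items = 1000;
        double rtt = 20.0, perItem = 0.5;                        // assumed costs
        double individual = totalMs(items, items, rtt, perItem); // one request per item
        double batched    = totalMs(items, 1, rtt, perItem);     // one request total
        System.out.printf("individual: %.0f ms, batched: %.0f ms, ~%.0fx faster%n",
                individual, batched, individual / batched);
    }
}
```

With these figures, 1000 individual adds cost about 20.5 seconds versus roughly half a second batched.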

Adding in Batches

Basic Batch Add

List<Embedding> embeddings = generateEmbeddings(documents);
List<String> ids = store.addAll(embeddings);

With Text Segments

List<Embedding> embeddings = new ArrayList<>();
List<TextSegment> segments = new ArrayList<>();

for (String doc : documents) {
    embeddings.add(embeddingModel.embed(doc).content());
    segments.add(TextSegment.from(doc));
}

List<String> ids = store.addAll(embeddings, segments);

With Custom IDs and Metadata

List<String> ids = new ArrayList<>();
List<Embedding> embeddings = new ArrayList<>();
List<TextSegment> segments = new ArrayList<>();

for (Document doc : documents) {
    ids.add(doc.id());

    Embedding emb = embeddingModel.embed(doc.text()).content();
    embeddings.add(emb);

    Metadata meta = new Metadata()
        .put("author", doc.author())
        .put("year", doc.year());

    segments.add(TextSegment.from(doc.text(), meta));
}

store.addAll(ids, embeddings, segments);

Optimal Batch Sizes

Guidelines

// Small batches: real-time, low latency
int smallBatch = 50;

// Medium batches: balanced
int mediumBatch = 200;

// Large batches: maximum throughput
int largeBatch = 500;

Factors to Consider

  1. Network latency - Higher latency → larger batches
  2. Memory constraints - Limited memory → smaller batches
  3. Error isolation - Need to identify failures → smaller batches
  4. Timeout - Short timeout → smaller batches
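The memory factor can be turned into a starting batch size by estimating the payload per embedding and dividing it into a budget. A minimal sketch (the 4-bytes-per-float estimate, the budget figure, and the `batchSizeFor` helper are illustrative assumptions, not library constants):

```java
public class BatchSizer {
    /**
     * Derives a batch size from an approximate per-item payload and a memory
     * budget, clamped to a sane [min, max] range.
     */
    static int batchSizeFor(int embeddingDimensions, long memoryBudgetBytes,
                            int min, int max) {
        long bytesPerEmbedding = embeddingDimensions * 4L; // float32 vector, rough estimate
        long raw = memoryBudgetBytes / Math.max(1, bytesPerEmbedding);
        return (int) Math.max(min, Math.min(max, raw));
    }

    public static void main(String[] args) {
        // 1536-dim vectors under a 4 MiB budget: capped at the 500 maximum
        System.out.println(batchSizeFor(1536, 4L * 1024 * 1024, 50, 500));
    }
}
```

Treat the result as a starting point and tune it against the measured throughput of your own deployment.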

Chunked Processing

Basic Chunking

public void processInChunks(
    List<String> documents,
    int batchSize
) {
    for (int i = 0; i < documents.size(); i += batchSize) {
        int end = Math.min(i + batchSize, documents.size());
        List<String> batch = documents.subList(i, end);

        List<Embedding> embeddings = batch.stream()
            .map(doc -> embeddingModel.embed(doc).content())
            .collect(Collectors.toList());

        List<TextSegment> segments = batch.stream()
            .map(TextSegment::from)
            .collect(Collectors.toList());

        store.addAll(embeddings, segments);

        System.out.println("Processed " + end + " / " + documents.size());
    }
}

With Progress Tracking

public void processWithProgress(
    List<String> documents,
    int batchSize,
    Consumer<Progress> progressCallback
) {
    int total = documents.size();

    for (int i = 0; i < total; i += batchSize) {
        int end = Math.min(i + batchSize, total);
        List<String> batch = documents.subList(i, end);

        processBatch(batch);

        Progress progress = new Progress(end, total);
        progressCallback.accept(progress);
    }
}

record Progress(int current, int total) {
    public double percentage() {
        return (current * 100.0) / total;
    }
}

Error Handling in Batches

Retry Failed Batches

public void processWithRetry(
    List<String> documents,
    int batchSize,
    int maxRetries
) throws InterruptedException {
    for (int i = 0; i < documents.size(); i += batchSize) {
        int end = Math.min(i + batchSize, documents.size());
        List<String> batch = documents.subList(i, end);

        boolean success = false;
        int attempt = 0;

        while (!success && attempt < maxRetries) {
            try {
                processBatch(batch);
                success = true;
            } catch (Exception e) {
                attempt++;
                if (attempt >= maxRetries) {
                    System.err.println("Batch failed after " + maxRetries +
                                     " attempts: " + e.getMessage());
                    // Log failed batch for manual processing
                    logFailedBatch(batch, e);
                } else {
                    Thread.sleep(1000 * attempt);  // Backoff
                }
            }
        }
    }
}

Fallback to Individual Processing

public List<String> processWithFallback(
    List<Embedding> embeddings,
    List<TextSegment> segments
) {
    try {
        // Try batch first
        return store.addAll(embeddings, segments);

    } catch (Exception e) {
        System.err.println("Batch failed, processing individually");

        // Fallback to individual adds
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < embeddings.size(); i++) {
            try {
                String id = store.add(embeddings.get(i), segments.get(i));
                ids.add(id);
            } catch (Exception itemError) {
                System.err.println("Item " + i + " failed: " +
                                 itemError.getMessage());
                ids.add(null);  // Mark failure
            }
        }
        return ids;
    }
}

Parallel Batch Processing

Parallel Chunking

public void processParallel(
    List<String> documents,
    int batchSize
) {
    int numBatches = (documents.size() + batchSize - 1) / batchSize;

    IntStream.range(0, numBatches)
        .parallel()
        .forEach(batchIndex -> {
            int start = batchIndex * batchSize;
            int end = Math.min(start + batchSize, documents.size());
            List<String> batch = documents.subList(start, end);

            processBatch(batch);
        });
}

Warning: Ensure thread-safe access to store if processing in parallel.
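If the client is not safe for concurrent use, one option is to keep the parallelism but bound it with a fixed thread pool and wait for all chunks to finish. A generic sketch with no Chroma types, where `processBatch` stands in for the store call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class ParallelBatches {
    /** Splits items into chunks of batchSize and processes each chunk on a fixed pool. */
    static <T> void processParallel(List<T> items, int batchSize, int threads,
                                    Consumer<List<T>> processBatch) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < items.size(); i += batchSize) {
                List<T> chunk = items.subList(i, Math.min(i + batchSize, items.size()));
                tasks.add(() -> { processBatch.accept(chunk); return null; });
            }
            pool.invokeAll(tasks); // blocks until every chunk has completed
        } finally {
            pool.shutdown();
        }
    }
}
```

The pool size caps how many requests hit the server at once, which is usually a better throttle than an unbounded parallel stream.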

Streaming Large Datasets

File Streaming

public void processLargeFile(
    Path filePath,
    int batchSize
) throws IOException {
    try (Stream<String> lines = Files.lines(filePath)) {
        List<String> batch = new ArrayList<>();

        lines.forEach(line -> {
            batch.add(line);

            if (batch.size() >= batchSize) {
                processBatch(new ArrayList<>(batch));
                batch.clear();
            }
        });

        // Process remaining
        if (!batch.isEmpty()) {
            processBatch(batch);
        }
    }
}

Batch Removal

Remove Multiple IDs

List<String> idsToRemove = Arrays.asList("id1", "id2", "id3");
store.removeAll(idsToRemove);

Remove by Filter

Filter filter = metadataKey("status").isEqualTo("outdated");
store.removeAll(filter);  // Batch remove on server side

Measuring Performance

Batch Performance Metrics

public void measureBatchPerformance(
    List<String> documents,
    int[] batchSizes
) {
    for (int batchSize : batchSizes) {
        long start = System.currentTimeMillis();

        processInChunks(documents, batchSize);

        long duration = System.currentTimeMillis() - start;
        double throughput = documents.size() / (duration / 1000.0);

        System.out.println("Batch size " + batchSize +
                         ": " + duration + "ms" +
                         ", throughput: " + throughput + " docs/sec");

        // Clear for next test
        store.removeAll();
    }
}

Best Practices

  1. Always use batch operations - never loop single adds for multiple items
  2. Choose an appropriate batch size - test with your own data
  3. Handle errors gracefully - retry or fall back to individual adds
  4. Monitor progress - especially for long-running operations
  5. Test at scale - verify performance with production-sized data
  6. Adjust timeouts - increase them for large batches
  7. Watch memory - larger batches consume more of it
  8. Stream large datasets - avoid loading everything into memory
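The chunking, retry, and progress patterns above compose into a single generic helper. A sketch under the same assumptions as earlier sections (`processBatch` stands in for the store call; the backoff figures are illustrative):

```java
import java.util.List;
import java.util.function.Consumer;

public class BatchPipeline {
    /** Chunks items, retries each chunk with linear backoff, and reports progress. */
    static <T> void run(List<T> items, int batchSize, int maxRetries,
                        Consumer<List<T>> processBatch,
                        Consumer<Integer> onProgress) throws InterruptedException {
        for (int i = 0; i < items.size(); i += batchSize) {
            List<T> chunk = items.subList(i, Math.min(i + batchSize, items.size()));
            for (int attempt = 1; ; attempt++) {
                try {
                    processBatch.accept(chunk);
                    break;
                } catch (RuntimeException e) {
                    if (attempt >= maxRetries) throw e; // give up: surface the last error
                    Thread.sleep(100L * attempt);       // linear backoff between attempts
                }
            }
            onProgress.accept(Math.min(i + batchSize, items.size()));
        }
    }
}
```

Progress is only reported after a chunk succeeds, so the callback doubles as a checkpoint for resuming a failed run.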

Related

  • Add Operations - Batch add API
  • Performance Guide - Optimization strategies
  • Error Handling - Error strategies

See: Add Operations Guide for complete API details.

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j-chroma
