tessl/maven-io-quarkiverse-langchain4j--quarkus-langchain4j-easy-rag

Easy RAG extension for Quarkus LangChain4j that dramatically simplifies implementing Retrieval Augmented Generation pipelines with automatic document ingestion and embedding store management


Document Ingestion

The EasyRagIngestor class handles the complete document ingestion pipeline, including loading, parsing, splitting, embedding generation, and storage. While typically used internally by the extension, understanding its behavior is important for troubleshooting and advanced use cases.

API

package io.quarkiverse.langchain4j.easyrag.runtime;

import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.data.segment.TextSegment;

/**
 * Handles document loading, splitting, embedding, and storage.
 */
public class EasyRagIngestor {

    /**
     * Creates an EasyRagIngestor with the specified dependencies.
     *
     * @param embeddingModel Model for generating embeddings
     * @param embeddingStore Store for persisting embeddings
     * @param config Configuration for ingestion behavior
     */
    public EasyRagIngestor(
        EmbeddingModel embeddingModel,
        EmbeddingStore<TextSegment> embeddingStore,
        EasyRagConfig config
    );

    /**
     * Performs complete document ingestion process.
     *
     * Loads documents from configured path, parses them with Apache Tika,
     * splits into segments, generates embeddings, and stores in embedding store.
     * Supports embeddings reuse when configured with InMemoryEmbeddingStore.
     */
    public void ingest();
}

Ingestion Pipeline

When ingest() is called, the following steps occur:

1. Document Loading

Documents are loaded based on configuration:

Filesystem Loading (path-type=FILESYSTEM):

// Recursive loading (default)
List<Document> documents = FileSystemDocumentLoader
    .loadDocumentsRecursively(path, pathMatcher);

// Non-recursive loading
List<Document> documents = FileSystemDocumentLoader
    .loadDocuments(path, pathMatcher);

Classpath Loading (path-type=CLASSPATH):

// Recursive loading (default)
List<Document> documents = ClassPathDocumentLoader
    .loadDocumentsRecursively(path, pathMatcher);

// Non-recursive loading
List<Document> documents = ClassPathDocumentLoader
    .loadDocuments(path, pathMatcher);

2. Document Parsing

Documents are parsed using Apache Tika, which automatically detects and handles many formats:

Supported Formats:

  • Plain text (.txt, .md, .csv)
  • PDF documents (.pdf)
  • Microsoft Office (.docx, .xlsx, .pptx)
  • HTML files (.html, .htm)
  • XML files (.xml)
  • Images with text via OCR (.jpg, .png, .tiff) - requires Tesseract
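
The parsing step goes through LangChain4j's Apache Tika parser. For troubleshooting a single file outside the pipeline, you can invoke the same parser directly; the sketch below assumes the langchain4j-document-parser-apache-tika artifact is on the classpath and uses a hypothetical file path:

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;

import java.nio.file.Path;

public class TikaParseCheck {
    public static void main(String[] args) {
        // Tika auto-detects the format (PDF, DOCX, HTML, ...) from the file content
        Document document = FileSystemDocumentLoader.loadDocument(
                Path.of("/data/documents/report.pdf"),   // hypothetical file
                new ApacheTikaDocumentParser());
        System.out.println(document.text());
    }
}
```

If a document parses here but fails during ingestion, the problem is likely elsewhere in the pipeline (splitting, embedding, or storage).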

Configuration for OCR: If you need OCR capabilities, install Tesseract and configure Tika:

# Install Tesseract (Ubuntu/Debian)
sudo apt-get install tesseract-ocr

# Install Tesseract (macOS)
brew install tesseract

# Point the JVM at a custom Tika config if needed (JVM system property, not a shell command)
-Dtika.config=/path/to/tika-config.xml

3. Document Splitting

Documents are split into segments using a recursive text splitter:

DocumentSplitter splitter = DocumentSplitters.recursive(
    maxSegmentSize,      // From config (default: 300 tokens)
    maxOverlapSize,      // From config (default: 30 tokens)
    new HuggingFaceTokenCountEstimator()
);

List<TextSegment> segments = splitter.splitAll(documents);

Token Estimation: Uses HuggingFace's token count estimator for accurate token-based splitting.

Why Splitting Matters:

  • LLMs have limited context windows
  • Smaller segments enable more precise retrieval
  • Overlap preserves context across boundaries

4. Embedding Generation

Each segment is converted to an embedding vector:

EmbeddingStoreIngestor.builder()
    .embeddingModel(embeddingModel)
    .embeddingStore(embeddingStore)
    .build()
    .ingest(segments);

The embedding model generates a vector representation of each segment's semantic meaning.
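
Conceptually, what the ingestor does for each segment is equivalent to this sketch (not the extension's actual code; `segments`, `embeddingModel`, and `embeddingStore` stand in for the pipeline's state):

```java
// Per-segment equivalent of the EmbeddingStoreIngestor call above (sketch)
for (TextSegment segment : segments) {
    Embedding embedding = embeddingModel.embed(segment).content();
    embeddingStore.add(embedding, segment);
}
```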

5. Storage

Embeddings are stored in the configured EmbeddingStore:

  • InMemoryEmbeddingStore: Stored in application memory
  • Redis: Persisted to Redis database
  • Chroma: Persisted to Chroma vector database
  • Infinispan: Persisted to Infinispan cache

6. Embeddings Reuse (Optional)

When reuse-embeddings.enabled=true and using InMemoryEmbeddingStore:

First Run:

// Ingest documents normally
ingest();

// Serialize embeddings to file
embeddingStore.serializeToFile(embeddingsFilePath);

Subsequent Runs:

// Load embeddings from file instead of recomputing
InMemoryEmbeddingStore store = InMemoryEmbeddingStore.fromFile(embeddingsFilePath);

This significantly reduces startup time during development.
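
Put together, the reuse decision amounts to something like the following sketch (the extension performs this internally; the file name here is hypothetical):

```java
// Reuse sketch: load cached embeddings if present, otherwise ingest and cache
Path embeddingsFile = Path.of("easy-rag-embeddings.json"); // hypothetical location

InMemoryEmbeddingStore<TextSegment> store;
if (Files.exists(embeddingsFile)) {
    // Subsequent runs: skip re-embedding entirely
    store = InMemoryEmbeddingStore.fromFile(embeddingsFile);
} else {
    // First run: ingest into the store, then cache the result
    store = new InMemoryEmbeddingStore<>();
    // ... run ingestion against `store` ...
    store.serializeToFile(embeddingsFile);
}
```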

When Ingestion Occurs

Ingestion timing is controlled by the ingestion-strategy configuration:

ON (Default)

Ingestion happens automatically at application startup:

quarkus.langchain4j.easy-rag.ingestion-strategy=ON

The extension calls ingest() during startup after all beans are initialized.

OFF

No ingestion occurs:

quarkus.langchain4j.easy-rag.ingestion-strategy=OFF

Use cases:

  • Persistent embedding store already populated
  • Documents don't change
  • Using pre-computed embeddings

MANUAL

Ingestion waits for explicit trigger:

quarkus.langchain4j.easy-rag.ingestion-strategy=MANUAL

See Manual Ingestion Control for usage.

Logging

The ingestor logs its progress:

INFO  Ingesting documents from filesystem: /data/documents, path matcher = glob:**, recursive = true
INFO  Ingested 42 files as 318 documents

For embeddings reuse:

INFO  Reading embeddings from /path/to/embeddings.json
INFO  Writing embeddings to /path/to/embeddings.json

Configuration Impact

Path Configuration

# Filesystem path
quarkus.langchain4j.easy-rag.path=/data/documents
quarkus.langchain4j.easy-rag.path-type=FILESYSTEM

# Classpath path
quarkus.langchain4j.easy-rag.path=knowledge-base
quarkus.langchain4j.easy-rag.path-type=CLASSPATH

Filtering Configuration

# Match only specific file types
quarkus.langchain4j.easy-rag.path-matcher=glob:**.{txt,md,pdf}

# Non-recursive scanning
quarkus.langchain4j.easy-rag.recursive=false
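
The path-matcher value uses java.nio.file glob syntax, so a pattern can be sanity-checked locally with a plain PathMatcher before wiring it into configuration (illustrative paths):

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobCheck {

    // Same glob syntax as quarkus.langchain4j.easy-rag.path-matcher
    static final PathMatcher MATCHER =
            FileSystems.getDefault().getPathMatcher("glob:**.{txt,md,pdf}");

    static boolean matches(String relativePath) {
        return MATCHER.matches(Path.of(relativePath));
    }

    public static void main(String[] args) {
        System.out.println(matches("notes/guide.md"));  // true: ** crosses directory boundaries
        System.out.println(matches("logo.png"));        // false: extension not in the set
    }
}
```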

Splitting Configuration

# Larger segments for more context
quarkus.langchain4j.easy-rag.max-segment-size=500
quarkus.langchain4j.easy-rag.max-overlap-size=50

# Smaller segments for precise retrieval
quarkus.langchain4j.easy-rag.max-segment-size=200
quarkus.langchain4j.easy-rag.max-overlap-size=20

Advanced Usage: Direct Instantiation

While uncommon, you can manually create and use an EasyRagIngestor:

import io.quarkiverse.langchain4j.easyrag.runtime.EasyRagIngestor;
import io.quarkiverse.langchain4j.easyrag.runtime.EasyRagConfig;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.data.segment.TextSegment;
import jakarta.inject.Inject;

public class CustomIngestion {

    @Inject
    EmbeddingModel embeddingModel;

    @Inject
    EmbeddingStore<TextSegment> embeddingStore;

    @Inject
    EasyRagConfig config;

    public void performCustomIngestion() {
        EasyRagIngestor ingestor = new EasyRagIngestor(
            embeddingModel,
            embeddingStore,
            config
        );

        ingestor.ingest();
    }
}

Use cases:

  • Custom ingestion logic
  • Multiple document sources
  • Dynamic configuration

Error Handling

The ingest() method may throw exceptions:

Common Errors

Path Not Found:

java.nio.file.NoSuchFileException: /data/documents

Solution: Verify the path exists and is accessible.

Parse Errors:

org.apache.tika.exception.TikaException: Failed to parse document

Solution: Check document format and Tika configuration.

Embedding Errors (the exact exception type depends on the embedding model provider):

Rate limit exceeded

Solution: Throttle ingestion, retry with backoff, or upgrade the provider's API plan.

Storage Errors:

java.io.IOException: Connection refused to Redis

Solution: Verify embedding store is accessible.

Handling Errors

Because ingest() declares no checked exceptions, failures from loading and parsing arrive wrapped in runtime exceptions. Catch unchecked types and inspect the cause when you need to react differently:

import org.jboss.logging.Logger;

private static final Logger LOG = Logger.getLogger(MyClass.class);

try {
    ingestor.ingest();
} catch (RuntimeException e) {
    if (e.getCause() instanceof java.nio.file.NoSuchFileException) {
        LOG.error("Document path not found", e);
    } else {
        LOG.error("Ingestion failed", e);
    }
}

Performance Considerations

Ingestion Time

Ingestion time depends on:

  • Number of documents: More documents = longer time
  • Document size: Larger documents take longer to parse and split
  • Embedding model: Remote models add network latency
  • Embedding store: Remote stores add network latency

Example timings (for 100 documents, ~1MB total):

  • In-memory store + local embeddings: ~10-30 seconds
  • In-memory store + OpenAI embeddings: ~30-60 seconds
  • Redis store + OpenAI embeddings: ~40-70 seconds

Optimization Tips

Use embeddings reuse in development:

quarkus.langchain4j.easy-rag.reuse-embeddings.enabled=true

Use local embedding models:

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-embedding-onnx</artifactId>
</dependency>

Reduce document size:

# Filter to specific file types
quarkus.langchain4j.easy-rag.path-matcher=glob:**.txt

Increase segment size (fewer embeddings to generate):

quarkus.langchain4j.easy-rag.max-segment-size=500

Memory Usage

In-memory stores hold all embeddings in RAM:

Memory estimate:

  • Embedding dimension: 384 (typical)
  • Float size: 4 bytes
  • Per embedding: 384 floats × 4 bytes ≈ 1.5 KB for the vector (text and metadata add more)
  • 1000 segments: ~1.5 MB
  • 10,000 segments: ~15 MB
  • 100,000 segments: ~150 MB

For large document sets, use a persistent store (Redis, Chroma).
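
The estimate above is simple arithmetic; the vector itself dominates at dimension × 4 bytes per segment:

```java
public class EmbeddingMemoryEstimate {

    /** Raw bytes of one embedding vector: `dimension` floats at 4 bytes each. */
    static long vectorBytes(int dimension) {
        return (long) dimension * Float.BYTES;
    }

    public static void main(String[] args) {
        long perVector = vectorBytes(384);        // 1536 bytes, i.e. ~1.5 KB
        long tenThousand = perVector * 10_000;    // 15,360,000 bytes, ~15 MB
        System.out.println(perVector + " bytes per vector");
        System.out.println(tenThousand / 1_000_000 + " MB for 10,000 segments");
    }
}
```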

Integration with Persistent Stores

Redis Store

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-redis</artifactId>
</dependency>
quarkus.langchain4j.easy-rag.path=/data/documents
quarkus.langchain4j.redis.dimension=384
quarkus.redis.hosts=redis://localhost:6379

When Redis extension is present, ingestion automatically uses Redis instead of in-memory store.

Chroma Store

<dependency>
    <groupId>io.quarkiverse.langchain4j</groupId>
    <artifactId>quarkus-langchain4j-chroma</artifactId>
</dependency>
quarkus.langchain4j.easy-rag.path=/data/documents
quarkus.langchain4j.chroma.base-url=http://localhost:8000
quarkus.langchain4j.chroma.collection-name=my-documents

Avoiding Duplicate Ingestion

With persistent stores, use OFF strategy after initial ingestion:

# First run: ingest documents
%dev.quarkus.langchain4j.easy-rag.ingestion-strategy=ON

# Subsequent runs: skip ingestion
%prod.quarkus.langchain4j.easy-rag.ingestion-strategy=OFF

Or use MANUAL strategy and trigger re-ingestion only when documents change.

Document Metadata

Each ingested segment preserves metadata from the original document:

TextSegment {
    String text;              // Segment content
    Metadata metadata;        // Document source, file name, etc.
}

This metadata is stored with embeddings and returned during retrieval, allowing you to:

  • Display source documents to users
  • Filter results by document properties
  • Track which documents provided context

Custom Document Processing

For advanced scenarios, you can implement custom processing:

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import jakarta.inject.Inject;
import java.util.List;

public class CustomDocumentProcessor {

    @Inject
    EmbeddingModel embeddingModel;

    @Inject
    EmbeddingStore<TextSegment> embeddingStore;

    public void ingestWithCustomProcessing(List<Document> documents) {
        // Custom pre-processing
        List<Document> processed = documents.stream()
            .map(this::preprocess)
            .toList();

        // Custom splitting (wrap each segment back into a Document,
        // preserving its metadata for retrieval)
        DocumentSplitter splitter = DocumentSplitters.recursive(400, 40);
        List<Document> segments = splitter.splitAll(processed)
            .stream()
            .map(split -> Document.document(split.text(), split.metadata()))
            .toList();

        // Ingest
        EmbeddingStoreIngestor.builder()
            .embeddingModel(embeddingModel)
            .embeddingStore(embeddingStore)
            .build()
            .ingest(segments);
    }

    private Document preprocess(Document doc) {
        // Custom preprocessing: remove headers, footers, etc.
        String cleaned = cleanText(doc.text());
        return Document.document(cleaned, doc.metadata());
    }

    private String cleanText(String text) {
        // Implement custom cleaning logic
        return text;
    }
}

Related Documentation

  • architecture.md
  • configuration.md
  • index.md
  • manual-ingestion.md
  • retrieval-augmentor.md