Easy RAG extension for Quarkus LangChain4j that simplifies building Retrieval-Augmented Generation (RAG) pipelines with automatic document ingestion and embedding store management.
The EasyRagIngestor class handles the complete document ingestion pipeline, including loading, parsing, splitting, embedding generation, and storage. While typically used internally by the extension, understanding its behavior is important for troubleshooting and advanced use cases.
package io.quarkiverse.langchain4j.easyrag.runtime;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.data.segment.TextSegment;
/**
* Handles document loading, splitting, embedding, and storage.
*/
public class EasyRagIngestor {
/**
* Creates an EasyRagIngestor with the specified dependencies.
*
* @param embeddingModel Model for generating embeddings
* @param embeddingStore Store for persisting embeddings
* @param config Configuration for ingestion behavior
*/
public EasyRagIngestor(
EmbeddingModel embeddingModel,
EmbeddingStore<TextSegment> embeddingStore,
EasyRagConfig config
);
/**
* Performs complete document ingestion process.
*
* Loads documents from configured path, parses them with Apache Tika,
* splits into segments, generates embeddings, and stores in embedding store.
* Supports embeddings reuse when configured with InMemoryEmbeddingStore.
*/
public void ingest();
}

When ingest() is called, the following steps occur:
Documents are loaded based on configuration:
Filesystem Loading (path-type=FILESYSTEM):
// Recursive loading (default)
List<Document> documents = FileSystemDocumentLoader
.loadDocumentsRecursively(path, pathMatcher);
// Non-recursive loading
List<Document> documents = FileSystemDocumentLoader
.loadDocuments(path, pathMatcher);

Classpath Loading (path-type=CLASSPATH):
// Recursive loading (default)
List<Document> documents = ClassPathDocumentLoader
.loadDocumentsRecursively(path, pathMatcher);
// Non-recursive loading
List<Document> documents = ClassPathDocumentLoader
.loadDocuments(path, pathMatcher);

Documents are parsed using Apache Tika, which automatically detects and handles many formats:
Supported Formats: Tika handles common formats including plain text, HTML, XML, PDF, Microsoft Office documents, and OpenDocument files, among many others.
Configuration for OCR: If you need OCR capabilities, install Tesseract and configure Tika:
# Install Tesseract (Ubuntu/Debian)
sudo apt-get install tesseract-ocr
# Install Tesseract (macOS)
brew install tesseract

# Point to custom Tika config if needed
-Dtika.config=/path/to/tika-config.xml

Documents are split into segments using a recursive text splitter:
DocumentSplitter splitter = DocumentSplitters.recursive(
maxSegmentSize, // From config (default: 300 tokens)
maxOverlapSize, // From config (default: 30 tokens)
new HuggingFaceTokenCountEstimator()
);
List<TextSegment> segments = splitter.splitAll(documents);

Token Estimation: Uses HuggingFace's token count estimator for accurate token-based splitting.
Why Splitting Matters: embedding models accept only a limited amount of text per input, and smaller segments make retrieval more precise because each vector represents a narrower piece of content.
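A simplified, character-based sketch of what the size and overlap settings do (the real splitter works on tokens and recurses over paragraphs and sentences; this toy version only illustrates the overlap mechanics):

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSplitDemo {
    // Split text into chunks of at most `size` chars, each overlapping the
    // previous chunk by `overlap` chars. Assumes size > overlap.
    static List<String> split(String text, int size, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < text.length(); start += step) {
            chunks.add(text.substring(start, Math.min(start + size, text.length())));
            if (start + size >= text.length()) break; // last chunk reached the end
        }
        return chunks;
    }

    public static void main(String[] args) {
        // With size 4 and overlap 1, adjacent chunks share one character.
        System.out.println(split("abcdefghij", 4, 1)); // [abcd, defg, ghij]
    }
}
```

The overlap ensures that a sentence cut at a chunk boundary still appears (at least partially) in both neighboring segments, which reduces the chance of losing context at the cut point.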
Each segment is converted to an embedding vector:
EmbeddingStoreIngestor.builder()
    .embeddingModel(embeddingModel)
    .embeddingStore(embeddingStore)
    .documentSplitter(splitter)
    .build()
    .ingest(documents);

The embedding model generates a vector representation of each segment's semantic meaning.
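At retrieval time, the query's embedding is compared against the stored segment embeddings, most commonly by cosine similarity. A self-contained sketch of that comparison, using toy 3-dimensional vectors rather than real model output:

```java
public class CosineDemo {
    // Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero vectors.
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f, 1f};
        float[] close = {1f, 0.1f, 0.9f}; // points in nearly the same direction
        float[] far   = {0f, 1f, 0f};     // orthogonal to the query
        System.out.println(cosine(query, close) > cosine(query, far)); // true
    }
}
```

Real embedding stores perform this comparison (or an approximate version of it) over hundreds of dimensions and many thousands of vectors.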
Embeddings are stored in the configured EmbeddingStore:
When reuse-embeddings.enabled=true and using InMemoryEmbeddingStore:
First Run:
// Ingest documents normally
ingest();
// Serialize embeddings to file
embeddingStore.serializeToFile(embeddingsFilePath);

Subsequent Runs:
// Load embeddings from file instead of recomputing
InMemoryEmbeddingStore<TextSegment> store = InMemoryEmbeddingStore.fromFile(embeddingsFilePath);

This significantly reduces startup time during development.
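Stripped of the LangChain4j types, the reuse logic is a plain load-or-compute pattern; everything in this sketch (the file name, the computed string) is an illustrative stand-in:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LoadOrComputeDemo {
    // Load previously saved data if present; otherwise compute and persist it.
    static String loadOrCompute(Path cache) {
        try {
            if (Files.exists(cache)) {
                return Files.readString(cache);   // fast path: reuse saved result
            }
            String computed = "expensive-result"; // stands in for real ingestion
            Files.writeString(cache, computed);   // persist for the next run
            return computed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Run twice against a fresh temp file: first call computes, second reads.
    static boolean demo() {
        try {
            Path cache = Files.createTempDirectory("demo").resolve("embeddings.json");
            String first = loadOrCompute(cache);
            String second = loadOrCompute(cache);
            return first.equals(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

The trade-off is the usual one for caches: if the source documents change, the saved file is stale and must be deleted or regenerated.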
Ingestion timing is controlled by the ingestion-strategy configuration:
Ingestion happens automatically at application startup:
quarkus.langchain4j.easy-rag.ingestion-strategy=ON

The extension calls ingest() during startup after all beans are initialized.
No ingestion occurs:
quarkus.langchain4j.easy-rag.ingestion-strategy=OFF

Use cases: the embedding store is already populated (for example, a persistent store filled on a previous run), or ingestion is handled by a separate process.
Ingestion waits for explicit trigger:
quarkus.langchain4j.easy-rag.ingestion-strategy=MANUAL

See Manual Ingestion Control for usage.
The ingestor logs its progress:
INFO Ingesting documents from filesystem: /data/documents, path matcher = glob:**, recursive = true
INFO Ingested 42 files as 318 documents

For embeddings reuse:
INFO Reading embeddings from /path/to/embeddings.json
INFO Writing embeddings to /path/to/embeddings.json

# Filesystem path
quarkus.langchain4j.easy-rag.path=/data/documents
quarkus.langchain4j.easy-rag.path-type=FILESYSTEM
# Classpath path
quarkus.langchain4j.easy-rag.path=knowledge-base
quarkus.langchain4j.easy-rag.path-type=CLASSPATH

# Match only specific file types
quarkus.langchain4j.easy-rag.path-matcher=glob:**.{txt,md,pdf}
# Non-recursive scanning
quarkus.langchain4j.easy-rag.recursive=false

# Larger segments for more context
quarkus.langchain4j.easy-rag.max-segment-size=500
quarkus.langchain4j.easy-rag.max-overlap-size=50
# Smaller segments for precise retrieval
quarkus.langchain4j.easy-rag.max-segment-size=200
quarkus.langchain4j.easy-rag.max-overlap-size=20

While uncommon, you can manually create and use an EasyRagIngestor:
import io.quarkiverse.langchain4j.easyrag.runtime.EasyRagIngestor;
import io.quarkiverse.langchain4j.easyrag.runtime.EasyRagConfig;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.data.segment.TextSegment;
import jakarta.inject.Inject;
public class CustomIngestion {
@Inject
EmbeddingModel embeddingModel;
@Inject
EmbeddingStore<TextSegment> embeddingStore;
@Inject
EasyRagConfig config;
public void performCustomIngestion() {
EasyRagIngestor ingestor = new EasyRagIngestor(
embeddingModel,
embeddingStore,
config
);
ingestor.ingest();
}
}

Use cases: re-ingesting after documents change at runtime, ingesting from multiple paths, or controlling exactly when ingestion happens.
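One common reason to hold an ingestor reference is periodic re-ingestion. The scheduling half of that can be sketched with a plain ScheduledExecutorService; here the ingest() call is replaced by a stub task:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicIngestDemo {
    // Run `task` repeatedly at the given period and wait until it has
    // executed twice (or give up after five seconds).
    static boolean runTwice(Runnable task, long periodMs) {
        CountDownLatch ranTwice = new CountDownLatch(2);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> { task.run(); ranTwice.countDown(); },
                0, periodMs, TimeUnit.MILLISECONDS);
        try {
            return ranTwice.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // In real code the task would be ingestor::ingest.
        System.out.println(runTwice(() -> System.out.println("re-ingesting..."), 50));
    }
}
```

In a Quarkus application you would more likely use the quarkus-scheduler extension than a raw executor, but the shape of the logic is the same.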
The ingest() method may throw exceptions:
Path Not Found:
java.nio.file.NoSuchFileException: /data/documents

Solution: Verify the path exists and is accessible.
Parse Errors:
org.apache.tika.exception.TikaException: Failed to parse document

Solution: Check document format and Tika configuration.
Embedding Errors:
dev.langchain4j.model.embedding.EmbeddingModelException: Rate limit exceeded

Solution: Reduce ingestion rate or upgrade API plan.
Storage Errors:
java.io.IOException: Connection refused to Redis

Solution: Verify embedding store is accessible.
A defensive ingestion call can catch and log each failure mode separately:

import org.jboss.logging.Logger;
private static final Logger LOG = Logger.getLogger(MyClass.class);
try {
ingestor.ingest();
} catch (NoSuchFileException e) {
LOG.error("Document path not found", e);
} catch (TikaException e) {
LOG.error("Failed to parse document", e);
} catch (Exception e) {
LOG.error("Ingestion failed", e);
}

Ingestion time depends on the number and size of your documents, the segment size, and the embedding model's throughput (local models avoid network round trips; remote APIs add latency and rate limits).
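When tuning, measure rather than guess; a coarse wall-clock timer around the ingestion call is usually enough. A minimal helper (the measured Runnable is a stand-in for ingestor::ingest):

```java
public class TimingDemo {
    // Measure wall-clock duration of a task in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Stand-in for ingestor.ingest(); sleep simulates the work.
            try { Thread.sleep(20); } catch (InterruptedException ignored) { }
        });
        System.out.println("ingestion took " + elapsed + " ms");
    }
}
```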
Use embeddings reuse in development:
quarkus.langchain4j.easy-rag.reuse-embeddings.enabled=true

Use local embedding models:
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-embedding-onnx</artifactId>
</dependency>

Reduce document size:
# Filter to specific file types
quarkus.langchain4j.easy-rag.path-matcher=glob:**.txt

Increase segment size (fewer embeddings to generate):
quarkus.langchain4j.easy-rag.max-segment-size=500

In-memory stores hold all embeddings in RAM:
Memory estimate: each embedding needs about 4 bytes per vector dimension (float32), plus the segment text and metadata. For example, 100,000 segments with 384-dimensional vectors need roughly 150 MB for the vectors alone.
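That estimate is simple arithmetic (vector storage only; segment text and metadata come on top):

```java
public class EmbeddingMemoryDemo {
    // Rough memory for the raw vectors: segments * dimension * 4 bytes (float32).
    static long vectorBytes(long segments, int dimension) {
        return segments * dimension * 4L;
    }

    public static void main(String[] args) {
        long bytes = vectorBytes(100_000, 384);
        System.out.println(bytes / 1_000_000 + " MB"); // 153 MB
    }
}
```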
For large document sets, use a persistent store (Redis, Chroma).
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-redis</artifactId>
</dependency>

quarkus.langchain4j.easy-rag.path=/data/documents
quarkus.langchain4j.redis.dimension=384
quarkus.redis.hosts=redis://localhost:6379

When the Redis extension is present, ingestion automatically uses Redis instead of the in-memory store.
<dependency>
<groupId>io.quarkiverse.langchain4j</groupId>
<artifactId>quarkus-langchain4j-chroma</artifactId>
</dependency>

quarkus.langchain4j.easy-rag.path=/data/documents
quarkus.langchain4j.chroma.base-url=http://localhost:8000
quarkus.langchain4j.chroma.collection-name=my-documents

With persistent stores, use the OFF strategy after the initial ingestion:
# First run: ingest documents
%dev.quarkus.langchain4j.easy-rag.ingestion-strategy=ON
# Subsequent runs: skip ingestion
%prod.quarkus.langchain4j.easy-rag.ingestion-strategy=OFF

Or use the MANUAL strategy and trigger re-ingestion only when documents change.
Each ingested segment preserves metadata from the original document:
TextSegment {
String text; // Segment content
Metadata metadata; // Document source, file name, etc.
}

This metadata is stored with the embeddings and returned during retrieval, allowing you to trace answers back to their source documents, filter retrieved segments, or show citations to users.
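A toy illustration of one of those uses, grouping retrieved segments by their source file to render citations (plain Java records stand in for the actual LangChain4j types):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MetadataDemo {
    // Stand-in for TextSegment: text plus a source-file metadata entry.
    record Segment(String text, String sourceFile) { }

    // Group retrieved segments by source, e.g. to show "from guide.md" citations.
    static Map<String, List<String>> bySource(List<Segment> retrieved) {
        return retrieved.stream().collect(Collectors.groupingBy(
                Segment::sourceFile,
                Collectors.mapping(Segment::text, Collectors.toList())));
    }

    public static void main(String[] args) {
        var hits = List.of(new Segment("RAG intro", "guide.md"),
                           new Segment("Setup steps", "setup.md"),
                           new Segment("RAG details", "guide.md"));
        System.out.println(bySource(hits).get("guide.md")); // [RAG intro, RAG details]
    }
}
```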
For advanced scenarios, you can implement custom processing:
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import jakarta.inject.Inject;
import java.util.List;
public class CustomDocumentProcessor {
@Inject
EmbeddingModel embeddingModel;
@Inject
EmbeddingStore<TextSegment> embeddingStore;
public void ingestWithCustomProcessing(List<Document> documents) {
// Custom pre-processing
List<Document> processed = documents.stream()
.map(this::preprocess)
.toList();
// Custom splitting
DocumentSplitter splitter = DocumentSplitters.recursive(400, 40);
List<Document> segments = splitter.splitAll(processed)
.stream()
.map(split -> Document.document(split.text()))
.toList();
// Ingest
EmbeddingStoreIngestor.builder()
.embeddingModel(embeddingModel)
.embeddingStore(embeddingStore)
.build()
.ingest(segments);
}
private Document preprocess(Document doc) {
// Custom preprocessing: remove headers, footers, etc.
String cleaned = cleanText(doc.text());
return Document.document(cleaned, doc.metadata());
}
private String cleanText(String text) {
// Implement custom cleaning logic
return text;
}
}

Install with the Tessl CLI:
npx tessl i tessl/maven-io-quarkiverse-langchain4j--quarkus-langchain4j-easy-rag@1.7.0