Zero-configuration RAG package that bundles document parsing, embedding, and splitting for easy Retrieval-Augmented Generation in Java applications
Core class for ingesting documents into an embedding store with automatic parsing, splitting, and embedding.
```java
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;

public static IngestionResult ingest(
    Document document,
    EmbeddingStore<TextSegment> embeddingStore
)
```

Parameters:
document - Document to ingest
embeddingStore - Where to store generated embeddings

Returns: IngestionResult with token usage information
Automatic Behavior: with no custom configuration, the document is parsed, split by the SPI-discovered RecursiveDocumentSplitterFactory (300 tokens per chunk, 30-token overlap), and embedded with the SPI-discovered BgeSmallEnV15QuantizedEmbeddingModel before being stored.
Example:

```java
Document doc = FileSystemDocumentLoader.loadDocument(path);
EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
IngestionResult result = EmbeddingStoreIngestor.ingest(doc, store);
```

```java
public static IngestionResult ingest(
    List<Document> documents,
    EmbeddingStore<TextSegment> embeddingStore
)
```

Parameters:
documents - List of documents to ingest
embeddingStore - Where to store generated embeddings

Returns: IngestionResult with aggregated token usage
Example:

```java
List<Document> docs = FileSystemDocumentLoader.loadDocumentsRecursively(dir);
IngestionResult result = EmbeddingStoreIngestor.ingest(docs, store);
```

```java
public static Builder builder()
```

Returns: Builder for configuring EmbeddingStoreIngestor
Example:

```java
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .embeddingStore(store)
        .documentSplitter(customSplitter) // Optional
        .embeddingModel(customModel)      // Optional
        .build();
```

```java
public Builder documentTransformer(DocumentTransformer documentTransformer)
```

Transforms documents before splitting. Use to preprocess, filter, or enrich documents.
Example:
DocumentTransformer addMetadata = doc -> {
doc.metadata().put("source", "internal");
return doc;
};
builder.documentTransformer(addMetadata);public Builder documentSplitter(DocumentSplitter documentSplitter)Custom splitter for chunking documents. If not provided, uses SPI-discovered RecursiveDocumentSplitterFactory (300 tokens, 30 overlap).
Example:

```java
import dev.langchain4j.data.document.splitter.DocumentSplitters;

DocumentSplitter customSplitter = DocumentSplitters.recursive(
        500, // max tokens per chunk
        50   // token overlap
);
builder.documentSplitter(customSplitter);
```

```java
public Builder textSegmentTransformer(TextSegmentTransformer textSegmentTransformer)
```

Transforms text segments after splitting. Use to enrich metadata, filter segments, or modify text.
Example:

```java
TextSegmentTransformer enricher = segment -> {
    segment.metadata().put("length", segment.text().length());
    return segment;
};
builder.textSegmentTransformer(enricher);
```

```java
public Builder embeddingModel(EmbeddingModel embeddingModel)
```

Custom embedding model. If not provided, uses the SPI-discovered BgeSmallEnV15QuantizedEmbeddingModel.
Example:

```java
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

EmbeddingModel model = OpenAiEmbeddingModel.builder()
        .apiKey(apiKey)
        .modelName("text-embedding-3-small")
        .build();
builder.embeddingModel(model);
```

```java
public Builder embeddingStore(EmbeddingStore<TextSegment> embeddingStore)
```

Required. The store where embeddings will be saved.
Example:

```java
EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
builder.embeddingStore(store);
```

```java
public EmbeddingStoreIngestor build()
```

Builds the configured EmbeddingStoreIngestor.

Throws: IllegalArgumentException if embeddingStore is not set
```java
public IngestionResult ingest(Document document)
```

Ingests a single document using the configured pipeline.

Returns: IngestionResult with token usage

```java
public IngestionResult ingest(List<Document> documents)
```

Ingests multiple documents using the configured pipeline.

Returns: IngestionResult with aggregated token usage

```java
public IngestionResult ingest(Document... documents)
```

Ingests multiple documents (varargs) using the configured pipeline.

Returns: IngestionResult with aggregated token usage
```java
import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.IngestionResult;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

// Custom configuration
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .documentSplitter(DocumentSplitters.recursive(500, 50))
        .textSegmentTransformer(segment -> {
            // Add metadata to each segment
            segment.metadata().put("ingested_at", System.currentTimeMillis());
            return segment;
        })
        .embeddingStore(new InMemoryEmbeddingStore<>())
        .build();

// Ingest documents
List<Document> documents = loadDocuments();
IngestionResult result = ingestor.ingest(documents);
System.out.println("Processed " + result.tokenUsage().totalTokenCount() + " tokens");
```

Install with Tessl CLI:

```shell
npx tessl i tessl/maven-dev-langchain4j--langchain4j-easy-rag
```