Zero-configuration RAG package that bundles document parsing, embedding, and splitting for easy Retrieval-Augmented Generation in Java applications
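A minimal usage sketch, assuming the easy-rag dependency is on the classpath and documents live in a local docs/ directory (the path and variable names are illustrative):

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.util.List;

public class EasyRagQuickstart {
    public static void main(String[] args) {
        // Parse every file in the directory with the bundled Apache Tika parser
        List<Document> documents = FileSystemDocumentLoader.loadDocuments("docs");

        // Split, embed, and store using the package defaults described below
        InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
        EmbeddingStoreIngestor.ingest(documents, store);
    }
}
```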
Default settings and customization options for easy-rag components.
Implementation: RecursiveDocumentSplitter
Settings:
- Maximum chunk size: 300 tokens
- Chunk overlap: 30 tokens
- Token counting: HuggingFaceTokenCountEstimator

Rationale:

Splitting Strategy Priority:
1. Paragraphs (\n\n) - Preserves semantic sections
2. Sentences (., !, ?) - Keeps complete sentences

Example behavior:
Document: "This is paragraph one.\n\nThis is paragraph two.\n\nThis is paragraph three."
Chunk 1: "This is paragraph one." (if under 300 tokens)
Chunk 2: "This is paragraph two." (with 30-token overlap from previous)
Chunk 3: "This is paragraph three." (with 30-token overlap from previous)
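For reference, a sketch of what configuring this splitter explicitly looks like; the 300/30 values mirror the defaults listed above:

```java
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;

// Equivalent of the default configuration: 300-token chunks with a 30-token overlap
DocumentSplitter defaultLikeSplitter = DocumentSplitters.recursive(300, 30);
```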
Implementation: BgeSmallEnV15QuantizedEmbeddingModel

Settings:
Rationale:
Performance Characteristics:
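The default model can also be instantiated directly when building a custom pipeline; a sketch assuming the langchain4j-embeddings-bge-small-en-v15-q artifact is on the classpath (the exact import path varies between langchain4j versions):

```java
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;

// In-process ONNX model: no API key, no network calls
EmbeddingModel embeddingModel = new BgeSmallEnV15QuantizedEmbeddingModel();
```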
Implementation: ApacheTikaDocumentParser
Settings:
Supported Formats: 200+ including PDF, Microsoft Office (DOC/DOCX, XLS/XLSX, PPT/PPTX), OpenDocument, HTML, XML, RTF, and plain text.
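A sketch of invoking the parser explicitly when loading a single file (the file path is illustrative):

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentParser;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.apache.tika.ApacheTikaDocumentParser;

import java.nio.file.Paths;

// Parse one file with Apache Tika instead of relying on parser auto-discovery
DocumentParser parser = new ApacheTikaDocumentParser();
Document document = FileSystemDocumentLoader.loadDocument(Paths.get("docs/report.pdf"), parser);
```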
import dev.langchain4j.data.document.splitter.DocumentSplitters;
// Larger chunks for more context
DocumentSplitter largeSplitter = DocumentSplitters.recursive(
500, // 500 tokens per chunk
50 // 50 token overlap (10%)
);
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.documentSplitter(largeSplitter)
.embeddingStore(store)
.build();

When to increase chunk size:
When to decrease chunk size:
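The larger-chunk example above covers the first case; for the second, a smaller configuration might look like this (150/15 are illustrative values, not library defaults):

```java
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;

// Smaller chunks trade context per chunk for more precise retrieval
DocumentSplitter smallSplitter = DocumentSplitters.recursive(
    150, // 150 tokens per chunk
    15   // 15 token overlap (10%)
);
```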
// More overlap for better boundary coverage
DocumentSplitter highOverlap = DocumentSplitters.recursive(
300, // Same chunk size
60 // 20% overlap (more than default)
);
// No overlap for maximum efficiency
DocumentSplitter noOverlap = DocumentSplitters.recursive(
300,
0 // No overlap
);

High overlap (20-30%):
Low/No overlap (0-5%):
import dev.langchain4j.data.document.splitter.DocumentBySentenceSplitter;
// Split by sentences, group into chunks
DocumentSplitter sentenceSplitter = new DocumentBySentenceSplitter(
300, // max tokens
30 // overlap
);

Use when:
import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;
// Split by paragraphs, group into chunks
DocumentSplitter paragraphSplitter = new DocumentByParagraphSplitter(
500, // larger chunks for paragraphs
50
);

Use when:
import dev.langchain4j.data.document.splitter.DocumentByCharacterSplitter;
// Split by character count
DocumentSplitter charSplitter = new DocumentByCharacterSplitter(
1000, // characters
100 // overlap in characters
);

Use when:
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
EmbeddingModel openAiModel = OpenAiEmbeddingModel.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.modelName("text-embedding-3-small") // 1536 dimensions
.build();
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.embeddingModel(openAiModel)
.embeddingStore(store)
.build();

Advantages:
Tradeoffs:
import dev.langchain4j.model.azure.AzureOpenAiEmbeddingModel;
EmbeddingModel azureModel = AzureOpenAiEmbeddingModel.builder()
.endpoint(azureEndpoint)
.apiKey(azureApiKey)
.deploymentName("text-embedding-ada-002")
.build();

Use when:
import dev.langchain4j.model.cohere.CohereEmbeddingModel;
EmbeddingModel cohereModel = CohereEmbeddingModel.builder()
.apiKey(cohereApiKey)
.modelName("embed-english-v3.0")
.build();

Use when:
import dev.langchain4j.data.document.DocumentTransformer;
DocumentTransformer metadataAdder = document -> {
document.metadata().put("source", "internal_docs");
document.metadata().put("indexed_at", Instant.now().toString());
document.metadata().put("version", "1.0");
return document;
};
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.documentTransformer(metadataAdder)
.embeddingStore(store)
.build();

// Only process documents meeting criteria
DocumentTransformer filter = document -> {
String text = document.text();
// Skip empty or very short documents
if (text == null || text.length() < 100) {
return null; // null means skip this document
}
// Skip documents with certain markers
if (text.contains("[DRAFT]") || text.contains("[DEPRECATED]")) {
return null;
}
return document;
};

DocumentTransformer cleaner = document -> {
String cleaned = document.text()
.replaceAll("\\s+", " ") // Normalize whitespace
.replaceAll("[\\x00-\\x1F]", "") // Remove control characters
.trim();
return Document.from(cleaned, document.metadata());
};

import dev.langchain4j.data.segment.TextSegmentTransformer;
TextSegmentTransformer enricher = segment -> {
Metadata meta = segment.metadata();
// Add segment-specific metadata
meta.put("length", segment.text().length());
meta.put("word_count", segment.text().split("\\s+").length);
meta.put("chunk_hash", segment.text().hashCode());
// Copy document metadata if present
if (meta.containsKey("document_id")) {
meta.put("parent_doc", meta.get("document_id"));
}
return segment;
};
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.textSegmentTransformer(enricher)
.embeddingStore(store)
.build();

// Skip segments that don't meet criteria
TextSegmentTransformer filter = segment -> {
String text = segment.text();
// Skip very short segments
if (text.length() < 50) {
return null;
}
// Skip segments that are mostly numbers/symbols
long alphaCount = text.chars().filter(Character::isLetter).count();
if (alphaCount < text.length() * 0.5) {
return null;
}
return segment;
};

import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
// Custom document transformer
DocumentTransformer docTransformer = document -> {
// Add source metadata
document.metadata().put("source", "knowledge_base");
document.metadata().put("ingested_at", Instant.now().toString());
// Clean text
String cleaned = document.text()
.replaceAll("\\s+", " ")
.trim();
return Document.from(cleaned, document.metadata());
};
// Custom splitter with larger chunks
DocumentSplitter splitter = DocumentSplitters.recursive(500, 50);
// Custom segment transformer
TextSegmentTransformer segmentTransformer = segment -> {
// Add segment metadata
segment.metadata().put("length", segment.text().length());
segment.metadata().put("chunk_id", UUID.randomUUID().toString());
return segment;
};
// Production embedding model
EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.modelName("text-embedding-3-small")
.build();
// Build complete pipeline
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.documentTransformer(docTransformer)
.documentSplitter(splitter)
.textSegmentTransformer(segmentTransformer)
.embeddingModel(embeddingModel)
.embeddingStore(store)
.build();
// Ingest documents
IngestionResult result = ingestor.ingest(documents);

// Fast iteration with defaults
EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, store);

Characteristics:
// Use defaults but persist store
InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, store);
// Persist to avoid re-embedding
store.serializeToFile("embeddings-v1.json");

Characteristics:
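On later runs the serialized file can be loaded back instead of re-embedding; a sketch using the file name from the snippet above:

```java
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

// Restore previously computed embeddings from disk
InMemoryEmbeddingStore<TextSegment> restored =
    InMemoryEmbeddingStore.fromFile("embeddings-v1.json");
```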
// OpenAI embeddings with custom chunks
EmbeddingModel model = OpenAiEmbeddingModel.builder()
.apiKey(apiKey)
.modelName("text-embedding-3-small")
.build();
DocumentSplitter splitter = DocumentSplitters.recursive(500, 50);
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.documentSplitter(splitter)
.embeddingModel(model)
.embeddingStore(vectorDatabase) // e.g., Pinecone, Weaviate
.build();

Characteristics:
// Use multilingual embedding model
EmbeddingModel multilingualModel = OpenAiEmbeddingModel.builder()
.apiKey(apiKey)
.modelName("text-embedding-3-small") // Supports 100+ languages
.build();
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.embeddingModel(multilingualModel)
.embeddingStore(store)
.build();

// Medical/legal/technical domain
// Use domain-specific embedding model and custom chunking
// Custom splitter for technical content
DocumentSplitter technicalSplitter = DocumentSplitters.recursive(
400, // Larger chunks for technical context
60 // More overlap for technical continuity
);
// Domain-specific embedding model
EmbeddingModel domainModel = CustomDomainEmbeddingModel.create(); // placeholder for a domain-tuned model
// Add domain metadata
DocumentTransformer domainEnricher = doc -> {
doc.metadata().put("domain", "medical");
doc.metadata().put("requires_expertise", true);
return doc;
};
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.documentTransformer(domainEnricher)
.documentSplitter(technicalSplitter)
.embeddingModel(domainModel)
.embeddingStore(store)
.build();

When processing many documents, ingest in batches:
List<Document> allDocuments = loadAllDocuments(); // e.g., 10,000 docs
int batchSize = 100;
for (int i = 0; i < allDocuments.size(); i += batchSize) {
int end = Math.min(i + batchSize, allDocuments.size());
List<Document> batch = allDocuments.subList(i, end);
IngestionResult result = ingestor.ingest(batch);
System.out.println("Processed batch " + (i/batchSize + 1) +
" - tokens: " + result.tokenUsage().totalTokenCount());
}

For large documents, consider streaming:
// Process files one at a time instead of loading all
Path docsDir = Paths.get("documents");
Files.walk(docsDir)
.filter(Files::isRegularFile)
.forEach(path -> {
Document doc = FileSystemDocumentLoader.loadDocument(path);
ingestor.ingest(doc);
// Document and segments eligible for GC after ingestion
});

import java.util.concurrent.ForkJoinPool;
List<Document> documents = loadDocuments();
// Process documents in parallel
ForkJoinPool customPool = new ForkJoinPool(4); // 4 threads
customPool.submit(() ->
documents.parallelStream().forEach(doc ->
ingestor.ingest(doc)
)
).join();

Note: Ensure the embedding store implementation is thread-safe before writing to it from multiple threads.
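If the store's thread-safety is uncertain, one option is to parallelize only the embedding step and perform store writes from a single thread; a rough sketch reusing the splitter, embeddingModel, and store from earlier examples, and assuming the embedding model itself is safe for concurrent use:

```java
import java.util.List;
import java.util.stream.Collectors;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;

List<TextSegment> segments = splitter.splitAll(documents);

// Embed concurrently; Collectors.toList() preserves encounter order,
// so embeddings stay aligned with their segments
List<Embedding> embeddings = segments.parallelStream()
    .map(segment -> embeddingModel.embed(segment).content())
    .collect(Collectors.toList());

// Write to the embedding store from the calling thread only
store.addAll(embeddings, segments);
```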
Install with Tessl CLI
npx tessl i tessl/maven-dev-langchain4j--langchain4j-easy-rag