Zero-configuration RAG package that bundles document parsing, embedding, and splitting for easy Retrieval-Augmented Generation in Java applications
How easy-rag achieves zero-configuration RAG through Java's Service Provider Interface (SPI).
Easy-rag enables RAG without explicit configuration by registering its RecursiveDocumentSplitterFactory through SPI and by bundling a document parser and an embedding model as transitive dependencies.
When you use EmbeddingStoreIngestor.ingest() without explicit configuration, the framework automatically discovers and uses easy-rag's bundled components.
Application Code
       ↓
EmbeddingStoreIngestor (orchestrator from langchain4j-core)
       ↓
┌──────────────┬────────────────┬─────────────────┐
│ Document     │ Document       │ Embedding       │
│ Parser       │ Splitter       │ Model           │
│ (Tika)       │ (Recursive)    │ (BGE-small)     │
└──────────────┴────────────────┴─────────────────┘
       ↓
EmbeddingStore (user-provided)
What easy-rag provides:
RecursiveDocumentSplitterFactory class (in easy-rag JAR)
What core framework provides:
EmbeddingStoreIngestor orchestration logic
Registration: easy-rag JAR contains:
META-INF/services/dev.langchain4j.spi.data.document.splitter.DocumentSplitterFactory
This file contains: dev.langchain4j.data.document.splitter.recursive.RecursiveDocumentSplitterFactory
Discovery: When EmbeddingStoreIngestor initializes without explicit documentSplitter:
ServiceLoader<DocumentSplitterFactory> loader =
    ServiceLoader.load(DocumentSplitterFactory.class);
Loading: Framework calls factory.create() to get the configured splitter
Fallback: If no SPI implementation is found, the framework throws an exception
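The discovery and selection steps can be sketched with the JDK's ServiceLoader alone. This is a minimal illustration, not langchain4j's actual code: `SplitterFactory`, `SpiDiscoverySketch`, `discover`, and `selectSingle` are hypothetical names, and throwing IllegalStateException when no provider is found is an assumption (this page only says an exception is thrown).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Hypothetical stand-in for DocumentSplitterFactory; not the langchain4j type.
interface SplitterFactory {
    String create(); // returns a label instead of a real splitter to keep the sketch small
}

class SpiDiscoverySketch {

    // Collects every provider that ServiceLoader finds for the given SPI type.
    // Providers are found through META-INF/services/<fully qualified interface name> files.
    static <T> List<T> discover(Class<T> spiType) {
        List<T> found = new ArrayList<>();
        for (T provider : ServiceLoader.load(spiType)) {
            found.add(provider);
        }
        return found;
    }

    // Requires exactly one provider: none or several are both treated as errors.
    // The exception type for the "none" case is an assumption of this sketch.
    static <T> T selectSingle(List<T> providers, String spiName) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("No implementation found for " + spiName);
        }
        if (providers.size() > 1) {
            throw new IllegalStateException("Multiple implementations found for " + spiName);
        }
        return providers.get(0);
    }

    public static void main(String[] args) {
        // Without a META-INF/services registration for SplitterFactory, nothing is discovered.
        List<SplitterFactory> onClasspath = discover(SplitterFactory.class);
        System.out.println("Providers discovered: " + onClasspath.size());

        // Simulate a classpath that contains exactly one registered factory.
        SplitterFactory bundled = () -> "recursive(300, 30)";
        SplitterFactory chosen = selectSingle(List.of(bundled), "SplitterFactory");
        System.out.println("Using splitter: " + chosen.create());
    }
}
```

Keeping discovery separate from selection makes the single-provider rule easy to test without touching the classpath.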
Single Implementation: Works automatically
// Only easy-rag on classpath
EmbeddingStoreIngestor.ingest(doc, store); // Uses easy-rag defaults
Multiple Implementations: Framework throws IllegalStateException
// Both easy-rag and custom-splitter on classpath
EmbeddingStoreIngestor.ingest(doc, store); // ERROR: Multiple implementations
Solution: Explicitly configure the component:
EmbeddingStoreIngestor.builder()
    .documentSplitter(yourPreferredSplitter)
    .embeddingStore(store)
    .build();
No Implementation: Framework throws exception
// No splitter implementation on classpath
EmbeddingStoreIngestor.ingest(doc, store); // ERROR: No implementation found
Package: dev.langchain4j.data.document.splitter.recursive
Provided by: easy-rag JAR
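Creates: Recursive document splitter with these settings:
Max segment size: 300 tokens
Max overlap: 30 tokens
Token count estimator: HuggingFaceTokenCountEstimator
Splitting Strategy:
Paragraphs (\n\n)
Sentences (., !, ?)
Why these defaults: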
Dependency: langchain4j-document-parser-apache-tika v1.11.0-beta19
Tika Version: 3.2.3
Provided by: Transitive dependency
SPI Registration: ApacheTikaDocumentParserFactory
Capabilities:
Common Formats:
Usage: Automatically used by FileSystemDocumentLoader when dependency present.
Dependency: langchain4j-embeddings-bge-small-en-v15-q v1.11.0-beta19
Provided by: Transitive dependency
SPI Registration: BgeSmallEnV15QuantizedEmbeddingModelFactory
Specifications:
Advantages:
Tradeoffs:
Performance:
langchain4j-easy-rag (your dependency)
├── langchain4j (core framework)
│   └── langchain4j-core
├── langchain4j-document-parser-apache-tika
│   ├── apache-tika-core 3.2.3
│   └── apache-tika-parsers (200+ format support)
└── langchain4j-embeddings-bge-small-en-v15-q
    └── onnxruntime (for model execution)
Total Size: Approximately 80-100MB including all transitive dependencies
✅ Prototyping: Get RAG working in minutes
✅ Learning: Focus on concepts, not configuration
✅ Small Projects: Defaults sufficient for scope
✅ English Content: BGE-small optimized for English
✅ Standard Formats: Common document types (PDF, DOCX, TXT)
✅ Development: In-process model eliminates external dependencies
✅ Privacy-Sensitive: All processing happens locally
❌ Large Scale: High volume needs faster embedding models
❌ Non-English: Need multilingual or language-specific models
❌ Domain-Specific: Specialized content (legal, medical, code)
❌ Performance: Need GPU acceleration or API-based models
❌ Custom Chunking: Different chunk sizes or strategies
❌ Production SLAs: Need predictable performance guarantees
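When one of these cases applies, the usual approach is to set that component explicitly, which bypasses discovery for that component only. The precedence rule can be sketched as follows; `OverridePrecedenceSketch` and `resolve` are hypothetical names used for illustration, and components are represented as plain strings rather than langchain4j types.

```java
import java.util.Optional;
import java.util.function.Supplier;

class OverridePrecedenceSketch {

    // Returns the explicitly configured component if there is one;
    // otherwise falls back to whatever discovery finds, or fails if nothing is found.
    static <T> T resolve(T explicit, Supplier<Optional<T>> discovery, String name) {
        if (explicit != null) {
            return explicit; // explicit configuration wins, discovery is never consulted
        }
        return discovery.get().orElseThrow(
                () -> new IllegalStateException("No " + name + " configured or discovered"));
    }

    public static void main(String[] args) {
        // Stand-ins for what SPI discovery would find on the classpath.
        Supplier<Optional<String>> discoveredSplitter = () -> Optional.of("recursive splitter (discovered)");
        Supplier<Optional<String>> discoveredModel = () -> Optional.of("BGE-small (discovered)");

        // Custom splitter set explicitly, embedding model left to discovery.
        String splitter = resolve("custom splitter (explicit)", discoveredSplitter, "document splitter");
        String model = resolve(null, discoveredModel, "embedding model");

        System.out.println(splitter);
        System.out.println(model);
    }
}
```

Because each component is resolved independently, overriding the splitter does not affect how the embedding model is chosen.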
import dev.langchain4j.data.document.splitter.DocumentSplitters;

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(DocumentSplitters.recursive(500, 50))
    .embeddingStore(store)
    .build();
// Still uses auto-discovered embedding model

import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

EmbeddingModel model = OpenAiEmbeddingModel.builder()
    .apiKey(apiKey)
    .modelName("text-embedding-3-small")
    .build();
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .embeddingModel(model)
    .embeddingStore(store)
    .build();
// Still uses auto-discovered splitter

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(customSplitter)
    .embeddingModel(customModel)
    .embeddingStore(store)
    .build();
// No auto-discovery used

// Start with easy-rag defaults
EmbeddingStoreIngestor.ingest(docs, store);
// Later: Add custom splitter
ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(optimizedSplitter)
    .embeddingStore(store)
    .build();
// Later: Add production embedding model
ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(optimizedSplitter)
    .embeddingModel(productionModel)
    .embeddingStore(store)
    .build();

package dev.langchain4j.spi.data.document.splitter;

public interface DocumentSplitterFactory {
    DocumentSplitter create();
}
Purpose: SPI interface for document splitter providers
Implementation in easy-rag:
package dev.langchain4j.data.document.splitter.recursive;

public class RecursiveDocumentSplitterFactory implements DocumentSplitterFactory {
    @Override
    public DocumentSplitter create() {
        return DocumentSplitters.recursive(
                300, // maxSegmentSizeInTokens
                30,  // maxOverlapSizeInTokens
                new HuggingFaceTokenCountEstimator()
        );
    }
}

package dev.langchain4j.data.document;

public interface DocumentSplitter {
    List<TextSegment> split(Document document);
    default List<TextSegment> splitAll(List<Document> documents);
}
Purpose: Interface for splitting documents into chunks
Used by: EmbeddingStoreIngestor during ingestion
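To make the roles of the maximum segment size and the overlap concrete, here is a deliberately simplified sketch. It is a plain sliding window over whitespace-separated words, not the recursive paragraph-then-sentence strategy and not real token counting; `OverlapSplitSketch` and its `split` method are hypothetical and unrelated to the langchain4j interface of the same purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverlapSplitSketch {

    // Splits text into segments of at most maxWords words.
    // Each segment after the first repeats the last `overlap` words of the previous one.
    static List<String> split(String text, int maxWords, int overlap) {
        if (maxWords <= 0 || overlap < 0 || overlap >= maxWords) {
            throw new IllegalArgumentException("require 0 <= overlap < maxWords");
        }
        List<String> segments = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return segments;
        }
        List<String> words = Arrays.asList(trimmed.split("\\s+"));
        int step = maxWords - overlap; // how far the window advances each time
        for (int start = 0; start < words.size(); start += step) {
            int end = Math.min(start + maxWords, words.size());
            segments.add(String.join(" ", words.subList(start, end)));
            if (end == words.size()) {
                break; // the last window already reached the end of the text
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Ten words, windows of four words, one word shared between neighbours.
        for (String segment : split("a b c d e f g h i j", 4, 1)) {
            System.out.println(segment);
        }
        // prints:
        // a b c d
        // d e f g
        // g h i j
    }
}
```

The overlap keeps some shared context between neighbouring segments, so a sentence cut at a boundary is still partly visible in both.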
package dev.langchain4j.model.embedding;

public interface EmbeddingModel {
    Response<List<Embedding>> embedAll(List<TextSegment> textSegments);
    default Response<Embedding> embed(String text);
    default Response<Embedding> embed(TextSegment textSegment);
    default int dimension();
}
Purpose: Interface for embedding models
Provided by easy-rag: BgeSmallEnV15QuantizedEmbeddingModel (via transitive dependency)
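How a splitter, an embedding model, and a store cooperate during ingestion can be sketched with plain functions. This is an illustrative outline under simplifying assumptions, not the EmbeddingStoreIngestor implementation: `IngestPipelineSketch`, `Stored`, and `ingest` are hypothetical names, the splitter and embedder are toy lambdas, and the store is an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class IngestPipelineSketch {

    // Hypothetical pairing of a text segment with its vector; not a langchain4j type.
    record Stored(String segment, float[] vector) {}

    // Splits the document, embeds every segment, and appends the pairs to the store.
    // Returns the number of segments that were stored.
    static int ingest(String document,
                      Function<String, List<String>> splitter,
                      Function<String, float[]> embedder,
                      List<Stored> store) {
        List<String> segments = splitter.apply(document);
        for (String segment : segments) {
            store.add(new Stored(segment, embedder.apply(segment)));
        }
        return segments.size();
    }

    public static void main(String[] args) {
        // Toy splitter: one segment per paragraph (paragraphs are separated by blank lines).
        Function<String, List<String>> splitter = doc -> List.of(doc.split("\\n\\s*\\n"));
        // Toy embedder: a one-dimensional vector holding the text length.
        // A real embedding model returns learned, high-dimensional vectors.
        Function<String, float[]> embedder = text -> new float[] { text.length() };

        List<Stored> store = new ArrayList<>();
        int count = ingest("First paragraph.\n\nSecond paragraph.", splitter, embedder, store);
        System.out.println(count + " segments stored");
    }
}
```

Swapping the splitter or the embedder only changes the function passed in, which mirrors how an explicitly configured component replaces an auto-discovered one.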
langchain4j-easy-rag.jar
├── META-INF/
│   └── services/
│       └── dev.langchain4j.spi.data.document.splitter.DocumentSplitterFactory
│           (contains: dev.langchain4j...RecursiveDocumentSplitterFactory)
├── dev/langchain4j/data/document/splitter/recursive/
│   └── RecursiveDocumentSplitterFactory.class
└── pom.xml (declares transitive dependencies)
Goals:
Non-Goals:
Install with Tessl CLI
npx tessl i tessl/maven-dev-langchain4j--langchain4j-easy-rag@1.11.0