tessl/maven-dev-langchain4j--langchain4j-easy-rag

Zero-configuration RAG package that bundles document parsing, embedding, and splitting for easy Retrieval-Augmented Generation in Java applications

Architecture

How easy-rag achieves zero-configuration RAG through Java's Service Provider Interface (SPI).

Zero-Configuration Design

Easy-rag enables RAG without explicit configuration by:

  1. Providing SPI implementation - RecursiveDocumentSplitterFactory
  2. Bundling dependencies - Apache Tika parser and BGE-small embedding model
  3. Automatic discovery - Core framework loads implementations via SPI

When you use EmbeddingStoreIngestor.ingest() without explicit configuration, the framework automatically discovers and uses easy-rag's bundled components.

Component Stack

Application Code
       ↓
EmbeddingStoreIngestor (orchestrator from langchain4j-core)
       ↓
┌──────────────┬────────────────┬─────────────────┐
│   Document   │   Document     │   Embedding     │
│   Parser     │   Splitter     │   Model         │
│   (Tika)     │   (Recursive)  │   (BGE-small)   │
└──────────────┴────────────────┴─────────────────┘
       ↓
EmbeddingStore (user-provided)

What easy-rag provides:

  • RecursiveDocumentSplitterFactory class (in easy-rag JAR)
  • Apache Tika dependency (transitive)
  • BGE-small-en-v1.5 dependency (transitive)

What core framework provides:

  • EmbeddingStoreIngestor orchestration logic
  • SPI discovery mechanism
  • All interfaces and base types
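The stack above can be read as a four-step pipeline. The following is a minimal sketch of that flow using only the Java standard library. The class IngestionPipelineSketch, its Entry type, and the function-typed parameters are hypothetical stand-ins for the parser, splitter, embedding model, and store; they are not the langchain4j types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the orchestration step; not the real EmbeddingStoreIngestor.
final class IngestionPipelineSketch {

    /** One stored entry: the embedding vector plus the segment it was computed from. */
    static final class Entry {
        final float[] vector;
        final String segment;

        Entry(float[] vector, String segment) {
            this.vector = vector;
            this.segment = segment;
        }
    }

    static List<Entry> ingest(String rawDocument,
                              Function<String, String> parser,
                              Function<String, List<String>> splitter,
                              Function<String, float[]> embeddingModel) {
        List<Entry> store = new ArrayList<>();
        String text = parser.apply(rawDocument);             // 1. parse raw input to text
        for (String segment : splitter.apply(text)) {        // 2. split text into segments
            float[] vector = embeddingModel.apply(segment);  // 3. embed each segment
            store.add(new Entry(vector, segment));           // 4. store vector with its segment
        }
        return store;
    }
}
```

Each stage is passed in as a function, which mirrors the idea that any one component can be swapped without touching the others.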

SPI Discovery Mechanism

How SPI Works

  1. Registration: easy-rag JAR contains:

    META-INF/services/dev.langchain4j.spi.data.document.splitter.DocumentSplitterFactory

    This file contains: dev.langchain4j.data.document.splitter.recursive.RecursiveDocumentSplitterFactory

  2. Discovery: When EmbeddingStoreIngestor initializes without explicit documentSplitter:

    ServiceLoader<DocumentSplitterFactory> loader =
        ServiceLoader.load(DocumentSplitterFactory.class);

  3. Loading: Framework calls factory.create() to get the configured splitter

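  4. Fallback: If no splitter implementation is found, no splitter is configured and each document is ingested as a single unsplit segment. A missing embedding model, by contrast, causes an exception because the embedding model is required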
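The discovery steps above can be sketched with the standard java.util.ServiceLoader. This is an illustration under stated assumptions, not the framework's code: SplitterFactory is a hypothetical stand-in for DocumentSplitterFactory, and selectSingle is a hypothetical helper that returns null when nothing is registered and rejects more than one provider.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class SpiDiscoverySketch {

    /** Hypothetical stand-in for an SPI factory interface such as DocumentSplitterFactory. */
    interface SplitterFactory {
        String create(); // returns a description instead of a real splitter
    }

    /** Returns the single provider, null if there is none, and rejects ambiguity. */
    static <T> T selectSingle(Iterable<T> providers) {
        List<T> found = new ArrayList<>();
        providers.forEach(found::add);
        if (found.size() > 1) {
            throw new IllegalStateException("Multiple implementations found: " + found.size());
        }
        return found.isEmpty() ? null : found.get(0);
    }

    /** ServiceLoader scans META-INF/services/<interface name> entries on the classpath. */
    static SplitterFactory discover() {
        return selectSingle(ServiceLoader.load(SplitterFactory.class));
    }
}
```

Because ServiceLoader is an Iterable, the same selection rule can be exercised with a plain list in tests, without any META-INF/services file.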

SPI Conflict Resolution

Single Implementation: Works automatically

// Only easy-rag on classpath
EmbeddingStoreIngestor.ingest(doc, store);  // Uses easy-rag defaults

Multiple Implementations: Framework throws a runtime exception reporting the conflict

// Both easy-rag and custom-splitter on classpath
EmbeddingStoreIngestor.ingest(doc, store);  // ERROR: Multiple implementations

Solution: Explicitly configure the component:

EmbeddingStoreIngestor.builder()
    .documentSplitter(yourPreferredSplitter)
    .embeddingStore(store)
    .build();

No Implementation: Documents are ingested without splitting (a missing embedding model, however, throws an exception)

// No splitter implementation on classpath
EmbeddingStoreIngestor.ingest(doc, store);  // Each document becomes a single segment

Bundled Components

RecursiveDocumentSplitterFactory

Package: dev.langchain4j.data.document.splitter.recursive

Provided by: easy-rag JAR

Creates: Recursive document splitter with these settings:

  • Chunk size: 300 tokens
  • Overlap: 30 tokens (10%)
  • Token estimator: HuggingFaceTokenCountEstimator
  • Splitting strategy: Recursive

Splitting Strategy:

  1. Attempt split on paragraph boundaries (\n\n)
  2. Fallback to line boundaries (\n)
  3. Fallback to sentence boundaries (., !, ?)
  4. Fallback to word boundaries (whitespace)
  5. Fallback to character boundaries if necessary
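The fallback chain can be illustrated with a simplified four-level sketch in plain Java (paragraph, sentence, word, character). It is not the library's splitter: RecursiveSplitSketch is a hypothetical class, sizes are measured in characters rather than tokens, and overlap is left out to keep the recursion easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

final class RecursiveSplitSketch {

    // Separators from coarsest to finest: paragraph, sentence, word.
    private static final String[] SEPARATORS = {"\\n\\s*\\n", "(?<=[.!?])\\s+", "\\s+"};
    // Text used to re-join parts that were split at the matching level.
    private static final String[] JOINERS = {"\n\n", " ", " "};

    static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> out = new ArrayList<>();
        split(text, maxChars, 0, out);
        return out;
    }

    private static void split(String text, int maxChars, int level, List<String> out) {
        if (text.length() <= maxChars) {              // already fits
            if (!text.isEmpty()) out.add(text);
            return;
        }
        if (level == SEPARATORS.length) {             // last resort: cut by character count
            for (int i = 0; i < text.length(); i += maxChars) {
                out.add(text.substring(i, Math.min(text.length(), i + maxChars)));
            }
            return;
        }
        StringBuilder current = new StringBuilder();
        for (String part : text.split(SEPARATORS[level])) {
            if (part.isEmpty()) continue;
            if (part.length() > maxChars) {           // too big: flush, then go one level finer
                flush(current, out);
                split(part, maxChars, level + 1, out);
                continue;
            }
            if (current.length() > 0
                    && current.length() + JOINERS[level].length() + part.length() > maxChars) {
                flush(current, out);                  // adding this part would overflow
            }
            if (current.length() > 0) current.append(JOINERS[level]);
            current.append(part);
        }
        flush(current, out);
    }

    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            out.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Parts that fit are packed together greedily at the coarsest level possible; only an oversized part is handed down to the next finer separator.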

Why these defaults:

  • 300 tokens: Good balance between context and granularity
  • 10% overlap: Preserves context at chunk boundaries
  • Recursive: Respects natural text structure
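To make the overlap arithmetic concrete: with a 300-token chunk and a 30-token overlap, each new chunk starts 270 tokens after the previous one. The sketch below computes such windows over token positions. It is a simplification that ignores text boundaries, which the recursive splitter does respect, so real chunk counts can differ; OverlapWindowSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlapWindowSketch {

    /** Returns [start, end) token ranges for fixed-size windows that share `overlap` tokens. */
    static List<int[]> windows(int totalTokens, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("require size > 0 and 0 <= overlap < size");
        }
        List<int[]> ranges = new ArrayList<>();
        int stride = size - overlap;                  // 300 - 30 = 270 with the defaults
        for (int start = 0; start < totalTokens; start += stride) {
            int end = Math.min(start + size, totalTokens);
            ranges.add(new int[]{start, end});
            if (end == totalTokens) break;            // last window reached the end
        }
        return ranges;
    }
}
```

For a 1000-token document with the default settings this yields four windows starting at tokens 0, 270, 540, and 810.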

Apache Tika Document Parser

Dependency: langchain4j-document-parser-apache-tika v1.11.0-beta19

Tika Version: 3.2.3

Provided by: Transitive dependency

SPI Registration: ApacheTikaDocumentParserFactory

Capabilities:

  • 200+ document formats
  • Automatic format detection
  • Metadata extraction
  • Text content extraction
  • Error handling for malformed files

Common Formats:

  • Documents: PDF, DOC, DOCX, ODT, RTF
  • Spreadsheets: XLS, XLSX, ODS
  • Presentations: PPT, PPTX, ODP
  • Text: TXT, MD, HTML, XML, CSV
  • Archives: ZIP, TAR, GZ
  • Email: MSG, EML, MBOX

Usage: FileSystemDocumentLoader uses this parser automatically when the dependency is on the classpath.

BGE-small-en-v1.5 Embedding Model

Dependency: langchain4j-embeddings-bge-small-en-v15-q v1.11.0-beta19

Provided by: Transitive dependency

SPI Registration: BgeSmallEnV15QuantizedEmbeddingModelFactory

Specifications:

  • Type: ONNX quantized BERT-based model
  • Dimensions: 384
  • Model Size: ~24MB
  • Execution: In-process within JVM (no external API)
  • Parallelization: Cached thread pool (threads = CPU cores)
  • Language: Optimized for English text
  • Query Prefix: "Represent this sentence for searching relevant passages:"
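The parallelization point can be sketched as follows: segments are submitted to a pool sized to the number of CPU cores and results are collected in input order. This is an assumption-laden illustration, not the model's implementation: ParallelEmbeddingSketch and the embedOne function are hypothetical, and a fixed-size pool stands in for the cached pool named above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelEmbeddingSketch {

    /** Embeds every segment on a pool with one thread per CPU core, keeping input order. */
    static List<float[]> embedAll(List<String> segments, Function<String, float[]> embedOne) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<float[]>> futures = new ArrayList<>();
            for (String segment : segments) {
                Callable<float[]> task = () -> embedOne.apply(segment);
                futures.add(pool.submit(task));
            }
            List<float[]> vectors = new ArrayList<>();
            for (Future<float[]> future : futures) {
                vectors.add(future.get());            // blocks until that segment is done
            }
            return vectors;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while embedding", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Embedding task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Since the work is CPU-bound, more threads than cores would add scheduling overhead without increasing throughput.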

Advantages:

  • No external API required (fully offline)
  • No API keys or network calls
  • Consistent latency
  • No usage costs
  • Privacy (data doesn't leave your system)

Tradeoffs:

  • CPU-bound (no GPU acceleration)
  • Slower than API-based models
  • Fixed 384 dimensions
  • English-optimized only
  • Quantized (slightly lower quality vs full precision)

Performance:

  • Speed: ~50-200 segments/second (CPU-dependent)
  • Quality: Good for general English text
  • Best for: Development, testing, small-scale production

Dependency Tree

langchain4j-easy-rag (your dependency)
├── langchain4j (core framework)
│   └── langchain4j-core
├── langchain4j-document-parser-apache-tika
│   └── apache-tika-core 3.2.3
│       └── apache-tika-parsers (200+ format support)
└── langchain4j-embeddings-bge-small-en-v15-q
    └── onnxruntime (for model execution)

Total Size: Approximately 80-100MB including all transitive dependencies

When Zero-Configuration Works

Ideal Scenarios

Prototyping: Get RAG working in minutes

Learning: Focus on concepts, not configuration

Small Projects: Defaults sufficient for scope

English Content: BGE-small optimized for English

Standard Formats: Common document types (PDF, DOCX, TXT)

Development: In-process model eliminates external dependencies

Privacy-Sensitive: All processing happens locally

When to Customize

Large Scale: High volume needs faster embedding models

Non-English: Need multilingual or language-specific models

Domain-Specific: Specialized content (legal, medical, code)

Performance: Need GPU acceleration or API-based models

Custom Chunking: Different chunk sizes or strategies

Production SLAs: Need predictable performance guarantees

Customization Patterns

Override Splitter Only

import dev.langchain4j.data.document.splitter.DocumentSplitters;

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    // Two-argument overload: sizes are in characters, not tokens
    .documentSplitter(DocumentSplitters.recursive(500, 50))
    .embeddingStore(store)
    .build();
// Still uses auto-discovered embedding model

Override Embedding Model Only

import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

EmbeddingModel model = OpenAiEmbeddingModel.builder()
    .apiKey(apiKey)
    .modelName("text-embedding-3-small")
    .build();

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .embeddingModel(model)
    .embeddingStore(store)
    .build();
// Still uses auto-discovered splitter

Override Both

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(customSplitter)
    .embeddingModel(customModel)
    .embeddingStore(store)
    .build();
// No auto-discovery used
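The three override patterns above share one rule: a component set explicitly on the builder wins, and discovery is consulted only for what was left unset. The sketch below illustrates that rule with plain Java. OverridePrecedenceSketch, its Builder, and the supplier parameters are hypothetical, and strings stand in for the real splitter and model objects.

```java
import java.util.function.Supplier;

final class OverridePrecedenceSketch {

    final String splitter;
    final String embeddingModel;

    private OverridePrecedenceSketch(String splitter, String embeddingModel) {
        this.splitter = splitter;
        this.embeddingModel = embeddingModel;
    }

    static final class Builder {
        private String splitter;
        private String embeddingModel;

        Builder splitter(String splitter) {
            this.splitter = splitter;
            return this;
        }

        Builder embeddingModel(String embeddingModel) {
            this.embeddingModel = embeddingModel;
            return this;
        }

        /** Discovery suppliers are only invoked for components that were not set explicitly. */
        OverridePrecedenceSketch build(Supplier<String> discoveredSplitter,
                                       Supplier<String> discoveredModel) {
            return new OverridePrecedenceSketch(
                    splitter != null ? splitter : discoveredSplitter.get(),
                    embeddingModel != null ? embeddingModel : discoveredModel.get());
        }
    }
}
```

Using suppliers keeps discovery lazy, so an explicitly configured component never triggers a classpath lookup or a conflict check for that slot.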

Gradual Migration

// Start with easy-rag defaults
EmbeddingStoreIngestor.ingest(docs, store);

// Later: Add custom splitter
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(optimizedSplitter)
    .embeddingStore(store)
    .build();

// Later: Add production embedding model
ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(optimizedSplitter)
    .embeddingModel(productionModel)
    .embeddingStore(store)
    .build();

Component Interfaces

DocumentSplitterFactory

package dev.langchain4j.spi.data.document.splitter;

public interface DocumentSplitterFactory {
    DocumentSplitter create();
}

Purpose: SPI interface for document splitter providers

Implementation in easy-rag:

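package dev.langchain4j.data.document.splitter.recursive;

import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.spi.data.document.splitter.DocumentSplitterFactory;

public class RecursiveDocumentSplitterFactory implements DocumentSplitterFactory {
    @Override
    public DocumentSplitter create() {
        return DocumentSplitters.recursive(
            300,  // maxSegmentSizeInTokens
            30,   // maxOverlapSizeInTokens
            new HuggingFaceTokenCountEstimator()
        );
    }
}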

DocumentSplitter

package dev.langchain4j.data.document;

public interface DocumentSplitter {
    List<TextSegment> split(Document document);

    // Default method in the interface; body omitted here
    List<TextSegment> splitAll(List<Document> documents);
}

Purpose: Interface for splitting documents into chunks

Used by: EmbeddingStoreIngestor during ingestion

EmbeddingModel

package dev.langchain4j.model.embedding;

public interface EmbeddingModel {
    Response<List<Embedding>> embedAll(List<TextSegment> textSegments);

    // Default methods in the interface; bodies omitted here
    Response<Embedding> embed(String text);
    Response<Embedding> embed(TextSegment textSegment);
    int dimension();
}

Purpose: Interface for embedding models

Provided by easy-rag: BgeSmallEnV15QuantizedEmbeddingModel (via transitive dependency)

Package Structure

langchain4j-easy-rag.jar
├── META-INF/
│   └── services/
│       └── dev.langchain4j.spi.data.document.splitter.DocumentSplitterFactory
│           (contains: dev.langchain4j...RecursiveDocumentSplitterFactory)
├── dev/langchain4j/data/document/splitter/recursive/
│   └── RecursiveDocumentSplitterFactory.class
└── pom.xml (declares transitive dependencies)

Design Philosophy

Goals:

  1. Minimize friction: RAG in 3 lines of code
  2. Sensible defaults: Work well for common cases
  3. Easy override: Customize what you need
  4. Transparent: Clear what's happening behind the scenes
  5. Production-ready: Defaults work for many real applications

Non-Goals:

  1. One-size-fits-all: Not optimal for all use cases
  2. Zero dependencies: Bundles needed components
  3. Highest performance: Tradeoffs for simplicity
  4. All features: Focus on core RAG workflow

Related Documentation

  • Configuration - Default settings details
  • Document Ingestion API - Using EmbeddingStoreIngestor
  • Quick Start - Get started quickly
  • Troubleshooting - SPI conflicts and issues

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j-easy-rag
