tessl/maven-dev-langchain4j--langchain4j-easy-rag

Zero-configuration RAG package that bundles document parsing, embedding, and splitting for easy Retrieval-Augmented Generation in Java applications

Configuration

Default settings and customization options for easy-rag components.

Default Configuration

Document Splitter

Implementation: recursive splitter created via DocumentSplitters.recursive(...)

Settings:

  • Chunk Size: 300 tokens
  • Overlap: 30 tokens (10% of chunk size)
  • Token Estimator: HuggingFaceTokenCountEstimator
  • Splitting Strategy: Recursive with fallbacks

Rationale:

  • 300 tokens provides good context window for most use cases
  • 10% overlap preserves context at boundaries
  • Recursive strategy respects document structure
  • HuggingFace estimator compatible with many models

Splitting Strategy Priority:

  1. Paragraph boundaries (\n\n) - Preserves semantic sections
  2. Sentence boundaries (., !, ?) - Keeps complete sentences
  3. Word boundaries (whitespace) - Avoids mid-word splits
  4. Character boundaries - Last resort for very long words/tokens

Example behavior:

Document: "This is paragraph one.\n\nThis is paragraph two.\n\nThis is paragraph three."

Chunk 1: "This is paragraph one."  (if under 300 tokens)
Chunk 2: "This is paragraph two."  (with 30-token overlap from previous)
Chunk 3: "This is paragraph three." (with 30-token overlap from previous)
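The interaction between chunk size and overlap can be sketched with a little arithmetic (plain Java, independent of langchain4j): each new chunk advances the document by chunk size minus overlap, so with the defaults a chunk contributes 270 new tokens.

```java
public class ChunkMath {
    // Number of chunks needed to cover totalTokens, given chunk size and overlap.
    // Each chunk after the first advances by (chunkSize - overlap) tokens.
    static int chunkCount(int totalTokens, int chunkSize, int overlap) {
        if (totalTokens <= chunkSize) return 1;
        int stride = chunkSize - overlap;  // new content per additional chunk
        return 1 + (int) Math.ceil((totalTokens - chunkSize) / (double) stride);
    }

    public static void main(String[] args) {
        // Defaults: 300-token chunks, 30-token overlap -> 270-token stride
        System.out.println(chunkCount(1000, 300, 30)); // 1 + ceil(700/270) = 4
        System.out.println(chunkCount(250, 300, 30));  // fits in one chunk = 1
    }
}
```

This is an idealized model (real splits land on paragraph/sentence boundaries, so actual chunk counts vary), but it is useful for estimating storage and embedding cost before ingesting a large corpus.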

Embedding Model

Implementation: BgeSmallEnV15QuantizedEmbeddingModel

Settings:

  • Model: BGE-small-en-v1.5 (quantized)
  • Dimensions: 384
  • Model Type: ONNX quantized BERT-based
  • Model Size: ~24MB
  • Execution: In-process within JVM
  • Thread Pool: Cached (size = CPU cores)
  • Query Prefix: "Represent this sentence for searching relevant passages:"

Rationale:

  • Small model size enables bundling
  • In-process execution eliminates external dependencies
  • No API keys or network required
  • Good quality for general English text
  • Quantization reduces size with acceptable quality loss

Performance Characteristics:

  • Speed: ~50-200 segments/second (CPU-dependent)
  • Memory: ~100MB during execution
  • Latency: Consistent (no network variability)
  • Cost: Free (no per-token charges)
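The default model can also be instantiated explicitly, which is handy when you want to reuse it outside the ingestor (e.g., to embed queries yourself). A minimal sketch; note the package name below is taken from recent langchain4j-embeddings-bge-small-en-v15-q releases and may differ in older versions:

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;

public class DefaultModelDemo {
    public static void main(String[] args) {
        // Same model that easy-rag discovers automatically from the classpath
        EmbeddingModel model = new BgeSmallEnV15QuantizedEmbeddingModel();

        // embed(String) returns Response<Embedding>; content() unwraps it
        Embedding embedding = model.embed("What is easy-rag?").content();
        System.out.println(embedding.dimension()); // 384
    }
}
```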

Document Parser

Implementation: ApacheTikaDocumentParser

Settings:

  • Tika Version: 3.2.3
  • Auto-detection: Enabled (format detected automatically)
  • Metadata Extraction: Enabled
  • Error Handling: Graceful (returns partial content on errors)

Supported Formats: 200+ including:

  • Documents: PDF, DOC, DOCX, ODT, RTF, Pages
  • Spreadsheets: XLS, XLSX, ODS, Numbers
  • Presentations: PPT, PPTX, ODP, Keynote
  • Text: TXT, MD, HTML, XML, CSV, JSON
  • Archives: ZIP, TAR, GZ, RAR (extracts contents)
  • Email: MSG, EML, MBOX

Customization Guide

Custom Chunk Size

import dev.langchain4j.data.document.splitter.DocumentSplitters;

// Larger chunks for more context
DocumentSplitter largeSplitter = DocumentSplitters.recursive(
    500,  // 500 tokens per chunk
    50    // 50 token overlap (10%)
);

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(largeSplitter)
    .embeddingStore(store)
    .build();

When to increase chunk size:

  • Need more context per chunk
  • Documents have long-form content
  • Queries are complex/broad
  • Using larger context window models

When to decrease chunk size:

  • Precise retrieval needed
  • Short, focused content
  • Limited embedding dimensions
  • Memory constraints

Custom Overlap

// More overlap for better boundary coverage
DocumentSplitter highOverlap = DocumentSplitters.recursive(
    300,  // Same chunk size
    60    // 20% overlap (more than default)
);

// No overlap for maximum efficiency
DocumentSplitter noOverlap = DocumentSplitters.recursive(
    300,
    0     // No overlap
);

High overlap (20-30%):

  • ✅ Better context preservation
  • ✅ Reduces boundary issues
  • ❌ More storage required
  • ❌ Redundant information

Low/No overlap (0-5%):

  • ✅ Storage efficient
  • ✅ Less redundancy
  • ❌ May lose boundary context
  • ❌ Potential information gaps
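The storage cost of overlap is easy to quantify (plain Java, independent of langchain4j): each stored chunk holds the full chunk size in tokens, but only advances the document by chunk size minus overlap, so the multiplier on stored/embedded tokens is chunkSize / (chunkSize - overlap).

```java
public class OverlapCost {
    // Approximate storage/embedding multiplier caused by overlap.
    static double storageMultiplier(int chunkSize, int overlap) {
        return chunkSize / (double) (chunkSize - overlap);
    }

    public static void main(String[] args) {
        System.out.println(storageMultiplier(300, 30)); // default 10%: ~1.11 (~11% extra)
        System.out.println(storageMultiplier(300, 60)); // 20% overlap: 1.25 (25% extra)
        System.out.println(storageMultiplier(300, 0));  // no overlap: 1.0
    }
}
```

This helps pick an overlap budget: going from the default 10% to 30% overlap raises stored tokens by roughly a third, which matters at scale for both storage and embedding cost.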

Alternative Splitting Strategies

Sentence-Based Splitting

import dev.langchain4j.data.document.splitter.DocumentBySentenceSplitter;

// Split by sentences, group into chunks
DocumentSplitter sentenceSplitter = new DocumentBySentenceSplitter(
    300,  // max tokens
    30    // overlap
);

Use when:

  • Content is naturally sentence-based
  • Need clean semantic boundaries
  • Processing dialog or conversations

Paragraph-Based Splitting

import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;

// Split by paragraphs, group into chunks
DocumentSplitter paragraphSplitter = new DocumentByParagraphSplitter(
    500,  // larger chunks for paragraphs
    50
);

Use when:

  • Content has clear paragraph structure
  • Want to preserve topic boundaries
  • Processing articles or documentation

Character-Based Splitting

import dev.langchain4j.data.document.splitter.DocumentByCharacterSplitter;

// Split by character count
DocumentSplitter charSplitter = new DocumentByCharacterSplitter(
    1000,  // characters
    100    // overlap in characters
);

Use when:

  • Simple, predictable chunks
  • Token counting not critical
  • Processing uniform text

Custom Embedding Models

OpenAI Embeddings

import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

EmbeddingModel openAiModel = OpenAiEmbeddingModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName("text-embedding-3-small")  // 1536 dimensions
    .build();

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .embeddingModel(openAiModel)
    .embeddingStore(store)
    .build();

Advantages:

  • Higher quality embeddings
  • Larger dimensions (1536)
  • API-based (no local compute)
  • Multilingual support

Tradeoffs:

  • Requires API key
  • Network dependency
  • Usage costs
  • Latency variability

Azure OpenAI Embeddings

import dev.langchain4j.model.azure.AzureOpenAiEmbeddingModel;

EmbeddingModel azureModel = AzureOpenAiEmbeddingModel.builder()
    .endpoint(azureEndpoint)
    .apiKey(azureApiKey)
    .deploymentName("text-embedding-ada-002")
    .build();

Use when:

  • Using Azure infrastructure
  • Need enterprise SLAs
  • Data residency requirements

Cohere Embeddings

import dev.langchain4j.model.cohere.CohereEmbeddingModel;

EmbeddingModel cohereModel = CohereEmbeddingModel.builder()
    .apiKey(cohereApiKey)
    .modelName("embed-english-v3.0")
    .build();

Use when:

  • Need specialized retrieval embeddings
  • Using Cohere's reranking
  • Multi-language support needed

Document Transformers

Add Metadata Before Splitting

import dev.langchain4j.data.document.DocumentTransformer;
import java.time.Instant;

DocumentTransformer metadataAdder = document -> {
    document.metadata().put("source", "internal_docs");
    document.metadata().put("indexed_at", Instant.now().toString());
    document.metadata().put("version", "1.0");
    return document;
};

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentTransformer(metadataAdder)
    .embeddingStore(store)
    .build();

Filter Documents

// Only process documents meeting criteria
DocumentTransformer filter = document -> {
    String text = document.text();

    // Skip empty or very short documents
    if (text == null || text.length() < 100) {
        return null;  // null means skip this document
    }

    // Skip documents with certain markers
    if (text.contains("[DRAFT]") || text.contains("[DEPRECATED]")) {
        return null;
    }

    return document;
};

Clean/Normalize Text

DocumentTransformer cleaner = document -> {
    String cleaned = document.text()
        .replaceAll("\\s+", " ")           // Normalize whitespace
        .replaceAll("[\\x00-\\x1F]", "")   // Remove control characters
        .trim();

    return Document.from(cleaned, document.metadata());
};

Text Segment Transformers

Enrich Segments After Splitting

import dev.langchain4j.data.segment.TextSegmentTransformer;

TextSegmentTransformer enricher = segment -> {
    Metadata meta = segment.metadata();

    // Add segment-specific metadata
    meta.put("length", segment.text().length());
    meta.put("word_count", segment.text().split("\\s+").length);
    meta.put("chunk_hash", segment.text().hashCode());

    // Copy document metadata if present
    if (meta.containsKey("document_id")) {
        meta.put("parent_doc", meta.get("document_id"));
    }

    return segment;
};

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .textSegmentTransformer(enricher)
    .embeddingStore(store)
    .build();

Filter Segments

// Skip segments that don't meet criteria
TextSegmentTransformer filter = segment -> {
    String text = segment.text();

    // Skip very short segments
    if (text.length() < 50) {
        return null;
    }

    // Skip segments that are mostly numbers/symbols
    long alphaCount = text.chars().filter(Character::isLetter).count();
    if (alphaCount < text.length() * 0.5) {
        return null;
    }

    return segment;
};

Complete Custom Pipeline

import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;

// Custom document transformer
DocumentTransformer docTransformer = document -> {
    // Add source metadata
    document.metadata().put("source", "knowledge_base");
    document.metadata().put("ingested_at", Instant.now().toString());

    // Clean text
    String cleaned = document.text()
        .replaceAll("\\s+", " ")
        .trim();

    return Document.from(cleaned, document.metadata());
};

// Custom splitter with larger chunks
DocumentSplitter splitter = DocumentSplitters.recursive(500, 50);

// Custom segment transformer
TextSegmentTransformer segmentTransformer = segment -> {
    // Add segment metadata
    segment.metadata().put("length", segment.text().length());
    segment.metadata().put("chunk_id", UUID.randomUUID().toString());
    return segment;
};

// Production embedding model
EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName("text-embedding-3-small")
    .build();

// Build complete pipeline
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentTransformer(docTransformer)
    .documentSplitter(splitter)
    .textSegmentTransformer(segmentTransformer)
    .embeddingModel(embeddingModel)
    .embeddingStore(store)
    .build();

// Ingest documents
IngestionResult result = ingestor.ingest(documents);

Configuration Recipes

Development Setup

// Fast iteration with defaults
EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, store);

Characteristics:

  • Zero configuration
  • Fast to set up
  • In-process model (no API keys)
  • Good enough quality

Production Setup (Cost-Optimized)

// Use defaults but persist store
InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, store);

// Persist to avoid re-embedding
store.serializeToFile("embeddings-v1.json");

Characteristics:

  • No API costs
  • One-time embedding computation
  • Persistent across restarts
  • Good for small-medium scale
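On the next startup, the persisted store can be reloaded instead of re-embedding everything. A sketch using InMemoryEmbeddingStore's built-in serialization (fromFile is the counterpart to serializeToFile):

```java
import java.nio.file.Files;
import java.nio.file.Path;

import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class StoreReload {
    public static void main(String[] args) {
        Path file = Path.of("embeddings-v1.json");

        // Reload if a previous run persisted embeddings; otherwise start empty
        // (and ingest documents as usual before serving queries)
        InMemoryEmbeddingStore<TextSegment> store =
                Files.exists(file)
                        ? InMemoryEmbeddingStore.fromFile(file)
                        : new InMemoryEmbeddingStore<>();

        System.out.println("Store ready: " + store);
    }
}
```

Versioning the file name (as "embeddings-v1.json" suggests) lets you re-embed into a new file when the corpus or chunking settings change, without invalidating the running store.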

Production Setup (Quality-Optimized)

// OpenAI embeddings with custom chunks
EmbeddingModel model = OpenAiEmbeddingModel.builder()
    .apiKey(apiKey)
    .modelName("text-embedding-3-small")
    .build();

DocumentSplitter splitter = DocumentSplitters.recursive(500, 50);

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentSplitter(splitter)
    .embeddingModel(model)
    .embeddingStore(vectorDatabase)  // e.g., Pinecone, Weaviate
    .build();

Characteristics:

  • High-quality embeddings
  • Optimized chunk size
  • Scalable vector database
  • Production-grade

Multilingual Setup

// Use multilingual embedding model
EmbeddingModel multilingualModel = OpenAiEmbeddingModel.builder()
    .apiKey(apiKey)
    .modelName("text-embedding-3-small")  // Supports 100+ languages
    .build();

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .embeddingModel(multilingualModel)
    .embeddingStore(store)
    .build();

Domain-Specific Setup

// Medical/legal/technical domain
// Use domain-specific embedding model and custom chunking

// Custom splitter for technical content
DocumentSplitter technicalSplitter = DocumentSplitters.recursive(
    400,  // Larger chunks for technical context
    60    // More overlap for technical continuity
);

// Domain-specific embedding model
EmbeddingModel domainModel = CustomDomainEmbeddingModel.create();

// Add domain metadata
DocumentTransformer domainEnricher = doc -> {
    doc.metadata().put("domain", "medical");
    doc.metadata().put("requires_expertise", true);
    return doc;
};

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
    .documentTransformer(domainEnricher)
    .documentSplitter(technicalSplitter)
    .embeddingModel(domainModel)
    .embeddingStore(store)
    .build();

Performance Tuning

Batch Size

When processing many documents, ingest in batches:

List<Document> allDocuments = loadAllDocuments();  // e.g., 10,000 docs
int batchSize = 100;

for (int i = 0; i < allDocuments.size(); i += batchSize) {
    int end = Math.min(i + batchSize, allDocuments.size());
    List<Document> batch = allDocuments.subList(i, end);

    IngestionResult result = ingestor.ingest(batch);

    System.out.println("Processed batch " + (i/batchSize + 1) +
                      " - tokens: " + result.tokenUsage().totalTokenCount());
}

Memory Management

For large documents, consider streaming:

// Process files one at a time instead of loading all
Path docsDir = Paths.get("documents");

Files.walk(docsDir)
    .filter(Files::isRegularFile)
    .forEach(path -> {
        Document doc = FileSystemDocumentLoader.loadDocument(path);
        ingestor.ingest(doc);

        // Document and segments eligible for GC after ingestion
    });

Parallel Processing

import java.util.concurrent.ForkJoinPool;

List<Document> documents = loadDocuments();

// Process documents in parallel
ForkJoinPool customPool = new ForkJoinPool(4);  // 4 threads

customPool.submit(() ->
    documents.parallelStream().forEach(doc ->
        ingestor.ingest(doc)
    )
).join();

Note: Ensure the embedding store implementation is thread-safe before ingesting from multiple threads.

Related Documentation

  • Architecture - How defaults are discovered
  • Document Ingestion API - EmbeddingStoreIngestor API
  • Quick Start - Quick start examples
  • Troubleshooting - Common configuration issues

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j-easy-rag
