tessl/maven-dev-langchain4j--langchain4j-easy-rag

Zero-configuration RAG package that bundles document parsing, embedding, and splitting for easy Retrieval-Augmented Generation in Java applications

Reference

Dependencies, limitations, and external resources for easy-rag.

Maven Coordinates

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-easy-rag</artifactId>
    <version>1.11.0-beta19</version>
</dependency>

Package Identifier: pkg:maven/dev.langchain4j/langchain4j-easy-rag@1.11.0-beta19

License: Apache-2.0

Language: Java

Minimum Java Version: Java 17+

Transitive Dependencies

Easy-rag automatically includes these dependencies:

LangChain4j Core

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j</artifactId>
    <version>1.11.0</version>
</dependency>

Provides core interfaces and orchestration:

  • EmbeddingStoreIngestor
  • EmbeddingStore interface
  • Document, TextSegment, Embedding types
  • SPI discovery mechanism

Apache Tika Document Parser

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-document-parser-apache-tika</artifactId>
    <version>1.11.0-beta19</version>
</dependency>

Tika Version: 3.2.3

Provides document parsing for 200+ formats:

  • ApacheTikaDocumentParser
  • ApacheTikaDocumentParserFactory (SPI)

Tika Sub-dependencies:

  • tika-core - Core parsing framework
  • tika-parsers-standard-package - Format-specific parsers
  • Apache Commons (compress, lang3, logging, io, codec)
  • Bouncy Castle (bcprov-jdk18on) - Cryptography for encrypted documents

BGE-small-en-v1.5 Embedding Model

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-embeddings-bge-small-en-v15-q</artifactId>
    <version>1.11.0-beta19</version>
</dependency>

Provides in-process embedding model:

  • BgeSmallEnV15QuantizedEmbeddingModel
  • BgeSmallEnV15QuantizedEmbeddingModelFactory (SPI)
  • ONNX Runtime (~20MB) - Model execution engine
  • Quantized model weights (~24MB)

Total Dependency Size: Approximately 80-100MB including all transitive dependencies

Supported Document Formats

Via Apache Tika, easy-rag supports 200+ formats including:

Documents

  • PDF - Portable Document Format
  • DOC/DOCX - Microsoft Word
  • ODT - OpenDocument Text
  • RTF - Rich Text Format
  • Pages - Apple Pages
  • WordPerfect - WPD

Spreadsheets

  • XLS/XLSX - Microsoft Excel
  • ODS - OpenDocument Spreadsheet
  • Numbers - Apple Numbers
  • CSV - Comma-Separated Values

Presentations

  • PPT/PPTX - Microsoft PowerPoint
  • ODP - OpenDocument Presentation
  • Keynote - Apple Keynote

Text Formats

  • TXT - Plain text
  • MD - Markdown
  • HTML/XHTML - Web pages
  • XML - Extensible Markup Language
  • JSON - JavaScript Object Notation
  • YAML - YAML Ain't Markup Language

Archives

  • ZIP - ZIP archive (extracts contents)
  • TAR - Tape archive
  • GZ/BZ2 - Compressed files
  • RAR - RAR archive
  • 7Z - 7-Zip archive

Email

  • MSG - Outlook message
  • EML - Email message
  • MBOX - Email mailbox

Code & Markup

  • Java, Python, JavaScript, C++, etc. (as plain text)
  • CSS, SQL, Shell scripts

Other

  • EPUB - eBook format
  • PS - PostScript
  • Images with OCR (requires Tesseract configuration)

Format Detection: Automatic based on content (not just file extension)
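
To illustrate what content-based detection means, here is a minimal sketch that guesses a media type from the leading "magic" bytes of a file. It is not Tika's detection algorithm, which is far more thorough; the class name MagicByteSniffer and the three signatures checked are chosen purely for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: guesses a media type from leading "magic" bytes. */
class MagicByteSniffer {

    static String sniff(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        // ZIP local file header; DOCX, XLSX, PPTX, ODT and EPUB are ZIP containers too,
        // so a real detector has to look inside the archive to tell them apart.
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(head, new byte[] {(byte) 0x1F, (byte) 0x8B})) {
            return "application/gzip";
        }
        return "application/octet-stream"; // unknown content
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHead));
    }
}
```

Because the decision is made from the bytes, a PDF saved with a .txt extension would still be recognized as a PDF in this sketch.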

Default Configuration Values

Document Splitting

  • Chunk Size: 300 tokens
  • Overlap: 30 tokens (10%)
  • Token Estimator: HuggingFaceTokenCountEstimator
  • Splitting Strategy: Recursive (paragraph → line → sentence → word → character)
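
For capacity planning, the chunk size and overlap above can be turned into a rough segment count. The sketch below assumes a plain sliding window; the recursive splitter respects paragraph and sentence boundaries, so real counts will differ somewhat. ChunkEstimate is a hypothetical helper, not part of the library.

```java
/** Back-of-the-envelope chunk count for a fixed-size sliding window. */
class ChunkEstimate {

    static int estimateChunks(int totalTokens, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        if (totalTokens <= 0) {
            return 0;
        }
        if (totalTokens <= chunkSize) {
            return 1;
        }
        int stride = chunkSize - overlap;          // new tokens covered by each extra chunk
        int remaining = totalTokens - chunkSize;   // tokens left after the first chunk
        return 1 + (remaining + stride - 1) / stride; // ceiling division
    }

    public static void main(String[] args) {
        // Defaults from this page: 300-token chunks with a 30-token overlap.
        System.out.println(estimateChunks(10_000, 300, 30));
    }
}
```

With the defaults, each chunk after the first contributes 270 new tokens, so a 10,000-token document comes out at roughly 37 segments under this model.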

Embedding Model

  • Model: BGE-small-en-v1.5 (quantized ONNX)
  • Dimensions: 384
  • Model Size: ~24MB
  • Execution: In-process (CPU)
  • Thread Pool: Cached (threads = CPU cores)
  • Language: English-optimized
  • Query Prefix: "Represent this sentence for searching relevant passages:"

Content Retrieval (when using from() factory)

  • Max Results: 3
  • Min Score: 0.0 (no threshold)
  • Filter: null (no filtering)
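
The effect of the two numeric defaults can be sketched in plain Java: drop anything below the minimum score, sort by score, and keep at most the maximum number of results. The Scored class and select method are hypothetical stand-ins, not the library's retriever types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RetrievalDefaults {

    /** Hypothetical holder for a segment's text and its relevance score. */
    static final class Scored {
        final String text;
        final double score;

        Scored(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    /** Keeps candidates scoring at least minScore, best first, capped at maxResults. */
    static List<Scored> select(List<Scored> candidates, int maxResults, double minScore) {
        List<Scored> kept = new ArrayList<>();
        for (Scored candidate : candidates) {
            if (candidate.score >= minScore) {
                kept.add(candidate);
            }
        }
        kept.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return new ArrayList<>(kept.subList(0, Math.min(maxResults, kept.size())));
    }

    public static void main(String[] args) {
        List<Scored> candidates = Arrays.asList(
                new Scored("a", 0.9), new Scored("b", 0.4),
                new Scored("c", 0.7), new Scored("d", 0.6));
        // Defaults from this page: maxResults = 3, minScore = 0.0.
        for (Scored s : select(candidates, 3, 0.0)) {
            System.out.println(s.text + " " + s.score);
        }
    }
}
```

With a minimum score of 0.0 the threshold never excludes anything, so the three best matches are always returned even when none of them is actually relevant. Raising the threshold is the usual way to suppress weak matches.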

Limitations

Document Processing

Fixed Chunk Size:

  • The 300-token default applies whenever the splitter is loaded through SPI
  • The SPI factory does not expose a chunk size parameter
  • To change the chunk size, supply a splitter explicitly through the builder

Single Splitter Strategy:

  • Only recursive splitter provided by default
  • Must explicitly provide alternative splitters (sentence-based, paragraph-based, etc.)

No Incremental Loading:

  • Must load entire document into memory
  • No streaming support for very large files
  • Document parsed completely before splitting

Embedding Model

English-Only Optimization:

  • BGE-small-en-v1.5 trained primarily on English text
  • Lower quality for non-English languages
  • No multilingual support out of the box

Fixed Dimensions:

  • 384-dimensional embeddings
  • Cannot change without switching models
  • Smaller than many production models (1536-3072 dimensions)

CPU-Bound Execution:

  • No GPU acceleration
  • Slower than API-based models
  • Performance depends on CPU cores

In-Process Only:

  • Model runs in JVM
  • Cannot distribute embedding generation
  • Memory overhead for model (~100MB during execution)

Quantization Tradeoffs:

  • Quantized for size (24MB vs larger full-precision)
  • Slightly lower quality than full-precision model
  • Acceptable for most general use cases

Storage

No Built-in Vector Database:

  • Only InMemoryEmbeddingStore provided
  • Not suitable for large-scale production
  • All embeddings in memory
  • Lost on restart unless serialized
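
To get a feel for what "all embeddings in memory" costs, the raw vector payload can be estimated from the embedding count and dimension. This sketch counts only the float values (4 bytes each); JVM object overhead and the segment text stored alongside each vector come on top. EmbeddingMemoryEstimate is a hypothetical helper.

```java
/** Rough lower bound on heap needed for the vectors alone. */
class EmbeddingMemoryEstimate {

    static long rawVectorBytes(long embeddingCount, int dimensions) {
        return embeddingCount * dimensions * (long) Float.BYTES;
    }

    public static void main(String[] args) {
        // 100k embeddings at the 384 dimensions used by the bundled model.
        long bytes = rawVectorBytes(100_000, 384);
        System.out.println(bytes + " bytes");
    }
}
```

At 384 dimensions each vector takes 1,536 bytes, so 100,000 embeddings need about 154 MB for the vectors alone.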

Linear Search:

  • InMemoryEmbeddingStore uses brute-force similarity search
  • O(n) time complexity
  • Slow for large embedding counts (>100k)
  • No indexing or approximate nearest neighbor
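
A brute-force search of this kind can be sketched as a single pass that compares the query against every stored vector. This is an illustration of the technique, not the store's actual source; BruteForceSearch and its method names are hypothetical.

```java
/** Linear scan over all stored vectors using cosine similarity. */
class BruteForceSearch {

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // convention chosen for this sketch when a vector is all zeros
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the index of the most similar stored vector, or -1 if there are none. */
    static int nearest(float[] query, float[][] stored) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < stored.length; i++) { // every vector is visited: O(n)
            double score = cosine(query, stored[i]);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[][] stored = {{1f, 0f}, {0f, 1f}, {1f, 1f}};
        System.out.println(nearest(new float[] {0.9f, 0.1f}, stored));
    }
}
```

Each query costs one similarity computation per stored embedding, which is why latency grows linearly with store size and why an indexed vector database becomes attractive at larger scales.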

Serialization Format:

  • JSON-based serialization
  • Not optimized for size
  • Can be slow for large stores
  • No compression by default

SPI Conflicts

Single Implementation Required:

  • Cannot have multiple DocumentSplitterFactory on classpath
  • Framework throws exception on conflict
  • Must explicitly choose implementation
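
The conflict rule can be pictured with java.util.ServiceLoader: collect every registered provider and refuse to guess when more than one is found. The sketch below is an assumption about the general pattern, not the framework's code; HypotheticalSplitterFactory stands in for a real SPI interface such as DocumentSplitterFactory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

class SingleProviderLoader {

    /** Hypothetical SPI used only for this illustration. */
    interface HypotheticalSplitterFactory {
        String name();
    }

    /** Returns the only provider, empty if there is none, and fails on ambiguity. */
    static <T> Optional<T> pickSingle(List<T> providers) {
        if (providers.size() > 1) {
            throw new IllegalStateException(
                    "Conflict: " + providers.size() + " implementations found; choose one explicitly");
        }
        return providers.isEmpty() ? Optional.empty() : Optional.of(providers.get(0));
    }

    /** Discovers providers registered under META-INF/services for the given interface. */
    static <T> Optional<T> loadSingle(Class<T> spi) {
        List<T> found = new ArrayList<>();
        for (T provider : ServiceLoader.load(spi)) {
            found.add(provider);
        }
        return pickSingle(found);
    }

    public static void main(String[] args) {
        // No provider is registered for the hypothetical SPI, so nothing is found.
        System.out.println(loadSingle(HypotheticalSplitterFactory.class).isPresent());
    }
}
```

Passing the splitter or embedding model explicitly through the builder sidesteps discovery altogether, which is the practical fix when two providers end up on the classpath.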

No Priority/Ordering:

  • Cannot specify preferred implementation if multiple present
  • All-or-nothing: either automatic or explicit

Performance

Embedding Speed:

  • ~50-200 segments/second (CPU-dependent)
  • Much slower than API-based models (1000-5000 segments/second)
  • Single-threaded per document by default
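
The throughput range above translates directly into ingestion time. The sketch below is simple arithmetic under the assumption that throughput stays constant for the whole run; IngestionTimeEstimate is a hypothetical helper.

```java
/** Estimates embedding time from a segment count and an assumed steady throughput. */
class IngestionTimeEstimate {

    static double seconds(long segments, double segmentsPerSecond) {
        if (segmentsPerSecond <= 0) {
            throw new IllegalArgumentException("segmentsPerSecond must be positive");
        }
        return segments / segmentsPerSecond;
    }

    public static void main(String[] args) {
        long segments = 50_000;
        System.out.println("slow CPU: " + seconds(segments, 50) + " s");
        System.out.println("fast CPU: " + seconds(segments, 200) + " s");
    }
}
```

For 50,000 segments this gives between about 4 and 17 minutes of embedding time across the quoted range, which is worth knowing before re-ingesting on every startup.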

Memory Usage:

  • All documents loaded into memory during processing
  • No streaming or chunked loading
  • Can cause OOM with large document sets

No Caching:

  • Re-computes embeddings if run multiple times
  • Must manually persist and reload to avoid recomputation
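
Within a single run, recomputation can be avoided by memoizing embeddings by segment text. The sketch below wraps an arbitrary embedding function; EmbeddingCache is a hypothetical class and the Function passed in stands in for a real model call. Avoiding recomputation across restarts would still require persisting the store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Memoizes embeddings so identical text is embedded only once per process. */
class EmbeddingCache {

    private final Map<String, float[]> cache = new ConcurrentHashMap<>();
    private final Function<String, float[]> embedder;

    EmbeddingCache(Function<String, float[]> embedder) {
        this.embedder = embedder;
    }

    /** Returns the cached vector, invoking the embedder only on a cache miss. */
    float[] embed(String text) {
        return cache.computeIfAbsent(text, embedder);
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        // Placeholder embedder: a one-element vector holding the text length.
        EmbeddingCache cache = new EmbeddingCache(text -> new float[] {text.length()});
        cache.embed("hello");
        cache.embed("hello"); // served from the cache
        System.out.println(cache.size());
    }
}
```

Keying by the full text is the simplest choice; for long segments a content hash would keep the keys smaller at the cost of an extra hashing step.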

Content Quality

No Reranking:

  • Simple cosine similarity ranking
  • No semantic reranking
  • May return sub-optimal results

No Query Expansion:

  • No automatic query reformulation
  • No synonym expansion
  • Retrieval relies solely on embedding similarity to the query as written

Context Boundaries:

  • Overlapping chunks can cause redundancy
  • Information may be split across chunks
  • No cross-chunk reasoning

Production Considerations

When Easy-RAG Defaults Are Sufficient

  • Development and prototyping
  • Learning RAG concepts
  • Small to medium datasets (< 50k embeddings)
  • English-language content
  • Standard document formats
  • Privacy-sensitive applications (on-premise processing)
  • No external API dependencies
  • Simple single-server deployments

When to Customize or Upgrade

Large scale (>100k embeddings) → Use vector database (Pinecone, Weaviate, Qdrant)

High performance requirements → Use API-based embedding models (OpenAI, Cohere)

Multilingual content → Use multilingual embedding models

Specialized domains → Use domain-specific embedding models

Production SLAs → Consider managed services with guarantees

Distributed systems → Use scalable vector databases

Real-time updates → Use databases with update/delete capabilities

Advanced retrieval → Add reranking, query expansion, hybrid search

Alternative Embedding Models

OpenAI

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-open-ai</artifactId>
    <version>1.11.0</version>
</dependency>

Models:

  • text-embedding-3-small - 1536 dimensions, fast, cost-effective
  • text-embedding-3-large - 3072 dimensions, highest quality
  • text-embedding-ada-002 - 1536 dimensions, previous generation

Advantages: High quality, multilingual, fast API

Tradeoffs: Requires API key, usage costs, network dependency

Cohere

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-cohere</artifactId>
    <version>1.11.0</version>
</dependency>

Models:

  • embed-english-v3.0 - English, retrieval-optimized
  • embed-multilingual-v3.0 - 100+ languages

Advantages: Retrieval-optimized, compression options

Tradeoffs: Requires API key, usage costs

Azure OpenAI

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-azure-open-ai</artifactId>
    <version>1.11.0</version>
</dependency>

Advantages: Enterprise SLAs, data residency, private endpoints

Tradeoffs: Azure setup required, same API costs

Local Models (via Ollama)

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-ollama</artifactId>
    <version>1.11.0</version>
</dependency>

Models: Various open-source models (nomic-embed-text, mxbai-embed-large)

Advantages: Local execution, no API costs, privacy

Tradeoffs: Requires Ollama setup, variable quality

Alternative Vector Stores

Pinecone

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-pinecone</artifactId>
    <version>1.11.0</version>
</dependency>

Type: Fully managed cloud vector database

Advantages: Scalable, fast, managed, enterprise features

Tradeoffs: Requires account, usage costs

Weaviate

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-weaviate</artifactId>
    <version>1.11.0</version>
</dependency>

Type: Open-source vector database (self-hosted or cloud)

Advantages: Open source, GraphQL API, hybrid search

Tradeoffs: Requires setup, operational overhead (if self-hosted)

Qdrant

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-qdrant</artifactId>
    <version>1.11.0</version>
</dependency>

Type: Open-source vector database (self-hosted or cloud)

Advantages: Fast, efficient, easy to deploy

Tradeoffs: Operational overhead (if self-hosted)

Milvus

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-milvus</artifactId>
    <version>1.11.0</version>
</dependency>

Type: Open-source vector database for large-scale

Advantages: Very scalable, GPU support, distributed

Tradeoffs: Complex setup, operational overhead

Redis

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-redis</artifactId>
    <version>1.11.0</version>
</dependency>

Type: Redis with vector search (RediSearch module)

Advantages: Familiar Redis infrastructure, fast

Tradeoffs: Requires Redis Stack or Cloud

Elasticsearch

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-elasticsearch</artifactId>
    <version>1.11.0</version>
</dependency>

Type: Elasticsearch with vector search

Advantages: Existing Elasticsearch infrastructure, hybrid search

Tradeoffs: Requires Elasticsearch 8.0+

External Resources

Official Documentation

  • LangChain4j Documentation: https://docs.langchain4j.dev/
  • RAG Tutorial: https://docs.langchain4j.dev/tutorials/rag/
  • GitHub Repository: https://github.com/langchain4j/langchain4j
  • API Javadoc: https://javadoc.io/doc/dev.langchain4j/langchain4j

Version History

1.11.0-beta19:

  • Initial easy-rag package release
  • Bundles Apache Tika 3.2.3
  • Bundles BGE-small-en-v1.5 quantized model
  • Recursive document splitter with 300 token chunks

Future versions:

  • Check Maven Central for latest version
  • Review release notes on GitHub

Related Documentation

  • Quick Start - Get started quickly
  • Architecture - How zero-configuration works
  • Configuration - Customization options
  • Troubleshooting - Common issues
  • Examples - Complete working applications
