tessl/maven-dev-langchain4j--langchain4j

Build LLM-powered applications in Java with support for chatbots, agents, RAG, tools, and much more

Document Processing

Loaders, parsers, splitters, and sources for working with documents. Supports loading from file system, classpath, and URLs, with various splitting strategies for creating text segments.

Capabilities

Document Loaders

FileSystemDocumentLoader

Load documents from the file system.

package dev.langchain4j.data.document.loader;

/**
 * DocumentLoader for loading documents from the file system
 */
public class FileSystemDocumentLoader {
    /**
     * Load a single document from path
     * @param filePath Path to the document
     * @return Loaded document
     */
    public static Document loadDocument(Path filePath);

    /**
     * Load a single document from string path
     * @param filePath String path to the document
     * @return Loaded document
     */
    public static Document loadDocument(String filePath);

    /**
     * Load a single document with custom parser
     * @param filePath Path to the document
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document loadDocument(Path filePath, DocumentParser documentParser);

    /**
     * Load a single document with custom parser from string path
     * @param filePath String path to the document
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document loadDocument(String filePath, DocumentParser documentParser);

    /**
     * Load all documents from directory (non-recursive)
     * @param directoryPath Path to directory
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(Path directoryPath);

    /**
     * Load all documents from directory with custom parser (non-recursive)
     * @param directoryPath Path to directory
     * @param documentParser Parser to use for all documents
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(Path directoryPath, DocumentParser documentParser);

    /**
     * Load matching documents from directory (non-recursive)
     * @param directoryPath Path to directory
     * @param pathMatcher Matcher to filter files
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(Path directoryPath, PathMatcher pathMatcher);

    /**
     * Load documents recursively from directory
     * @param directoryPath Path to directory
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(Path directoryPath);

    /**
     * Load documents recursively with matcher and parser
     * @param directoryPath Path to directory
     * @param pathMatcher Matcher to filter files
     * @param documentParser Parser to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(
        Path directoryPath,
        PathMatcher pathMatcher,
        DocumentParser documentParser
    );
}
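The `PathMatcher` overloads accept standard `java.nio` glob matchers. A minimal JDK-only sketch of building one (the file names here are illustrative; note the case-sensitivity pitfall below when running on different platforms):

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;

// Build the kind of PathMatcher that loadDocuments(Path, PathMatcher)
// accepts. "glob:*.txt" matches plain .txt file names; "glob:**.txt"
// would also match files inside subdirectories.
public class GlobFilterDemo {
    public static List<String> keepTxt(List<String> fileNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:*.txt");
        return fileNames.stream()
                .filter(name -> matcher.matches(Path.of(name)))
                .collect(Collectors.toList());
    }
}
```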

Thread Safety: All methods are static and stateless. Safe for concurrent use across threads. However, when loading the same file concurrently, OS-level file locks may apply. The Document objects returned are immutable after construction.

Common Pitfalls:

  • Loading binary files without appropriate parser causes encoding errors
  • Loading extremely large files (>1GB) can cause OutOfMemoryError
  • No built-in filtering for hidden files or system files (e.g., .DS_Store, Thumbs.db)
  • PathMatcher is case-sensitive on Linux/Unix, case-insensitive on Windows
  • Non-recursive methods silently skip subdirectories

Edge Cases:

  • Empty files return Document with empty text content
  • Symbolic links are followed by default; can cause infinite loops if circular
  • Files without read permissions throw AccessDeniedException
  • Non-existent paths throw NoSuchFileException
  • Directories passed to loadDocument() throw IOException
  • File modified during read may return inconsistent content

Performance Notes:

  • I/O bound operations; consider parallel processing for multiple files
  • loadDocumentsRecursively() walks entire tree before loading; use PathMatcher to filter early
  • Each file read opens a new FileInputStream; OS file descriptor limits apply
  • Default buffer size is 8KB; consider custom parsers for large files

Cost Considerations:

  • Large documents should be split before embedding; each segment incurs embedding API cost
  • Recursive loading of deep directories can load thousands of files unintentionally
  • Consider document count limits based on embedding budget

Exception Handling:

  • NoSuchFileException - File or directory does not exist
  • AccessDeniedException - Insufficient permissions to read file
  • IOException - Generic I/O errors (disk full, network mount issues)
  • OutOfMemoryError - File too large to load into memory
  • MalformedInputException - Invalid character encoding in file

Related APIs: ClassPathDocumentLoader, UrlDocumentLoader, FileSystemSource, TextDocumentParser


ClassPathDocumentLoader

Load documents from classpath resources.

package dev.langchain4j.data.document.loader;

/**
 * DocumentLoader implementation for loading documents using ClassPathSource
 */
public class ClassPathDocumentLoader {
    /**
     * Load document from classpath
     * @param pathOnClasspath Path to resource on classpath
     * @return Loaded document
     */
    public static Document loadDocument(String pathOnClasspath);

    /**
     * Load document from classpath with custom classloader
     * @param pathOnClasspath Path to resource on classpath
     * @param classLoader ClassLoader to use
     * @return Loaded document
     */
    public static Document loadDocument(String pathOnClasspath, ClassLoader classLoader);

    /**
     * Load document from classpath with custom parser
     * @param pathOnClasspath Path to resource on classpath
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document loadDocument(String pathOnClasspath, DocumentParser documentParser);

    /**
     * Load document from classpath with parser and classloader
     * @param pathOnClasspath Path to resource on classpath
     * @param documentParser Parser to use
     * @param classLoader ClassLoader to use
     * @return Loaded document
     */
    public static Document loadDocument(
        String pathOnClasspath,
        DocumentParser documentParser,
        ClassLoader classLoader
    );

    /**
     * Load all documents from directory on classpath (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath);

    /**
     * Load all documents from directory with custom classloader (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @param classLoader ClassLoader to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath, ClassLoader classLoader);

    /**
     * Load documents from directory with custom parser (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @param documentParser Parser to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath, DocumentParser documentParser);

    /**
     * Load matching documents from directory (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @param pathMatcher Matcher to filter files
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath, PathMatcher pathMatcher);

    /**
     * Load documents recursively from directory
     * @param directoryOnClasspath Path to directory on classpath
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(String directoryOnClasspath);

    /**
     * Load documents recursively with matcher and parser
     * @param directoryOnClasspath Path to directory on classpath
     * @param pathMatcher Matcher to filter files
     * @param documentParser Parser to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(
        String directoryOnClasspath,
        PathMatcher pathMatcher,
        DocumentParser documentParser
    );
}
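The leading-slash pitfall below can be demonstrated with the JDK alone, with no langchain4j classes involved. The `.class` file of `java.lang.String` is used here only because it is always locatable via the system class loader:

```java
// ClassLoader-relative resource paths must NOT start with "/".
// The same name with a leading slash is simply not found.
public class ClassPathSlashDemo {
    public static boolean found(String path) {
        return ClassLoader.getSystemClassLoader().getResource(path) != null;
    }
}
```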

Thread Safety: All methods are static and thread-safe. ClassLoader instances are typically thread-safe. Safe for concurrent loading of different resources. Loading same resource concurrently is safe but inefficient.

Common Pitfalls:

  • Path must NOT start with "/" in most cases (e.g., "documents/file.txt" not "/documents/file.txt")
  • Resources inside JARs are read-only; cannot modify
  • Wrong ClassLoader may not find resources (use Thread.currentThread().getContextClassLoader() if unsure)
  • Directory loading only works if JAR manifest includes directory entries
  • No built-in caching; same resource loaded multiple times reads from JAR each time

Edge Cases:

  • Resource not found yields a null URL internally; the failure surfaces as a NullPointerException when loading
  • Empty resources return Document with empty content
  • Resources in JAR files have no lastModified timestamp (uses JAR timestamp)
  • Nested JARs (JAR in JAR) may not be accessible depending on classloader
  • Resources from the file system vs a JAR have different URL schemes (file: vs jar:)

Performance Notes:

  • Loading from JAR requires decompression; slower than file system
  • Each load opens new InputStream; consider caching loaded documents
  • Recursive loading of large JARs can be slow; filter with PathMatcher
  • ClassLoader.getResources() enumerates all JARs on classpath

Cost Considerations:

  • Embedded resources increase JAR size and application startup time
  • Large resources in classpath affect Docker image size
  • Multiple copies of same resource across JARs waste memory when loaded

Exception Handling:

  • NullPointerException - Resource not found on classpath
  • IOException - Error reading from JAR file
  • IllegalArgumentException - Invalid path format
  • OutOfMemoryError - Resource too large to load

Related APIs: FileSystemDocumentLoader, ClassPathSource, UrlDocumentLoader


UrlDocumentLoader

Load documents from URLs.

package dev.langchain4j.data.document.loader;

/**
 * DocumentLoader for loading documents from URLs
 */
public class UrlDocumentLoader {
    /**
     * Load document from URL
     * @param url URL to load from
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document load(URL url, DocumentParser documentParser);

    /**
     * Load document from string URL
     * @param url String URL to load from
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document load(String url, DocumentParser documentParser);
}
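For the special-character pitfall below, the multi-argument `java.net.URI` constructor percent-encodes illegal characters, producing a valid URL string before it is handed to the loader. A JDK-only sketch (host and path are made up):

```java
import java.net.URI;
import java.net.URISyntaxException;

// The multi-argument URI constructor quotes characters that are illegal
// in a URL, e.g. spaces become %20.
public class UrlEncodeDemo {
    public static String encode(String host, String path) throws URISyntaxException {
        return new URI("https", host, path, null).toString();
    }
}
```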

Thread Safety: Static methods are thread-safe. However, underlying HTTP client uses default configuration which may have connection pool limits. Concurrent loads share connection pool.

Common Pitfalls:

  • No timeout configuration; may hang indefinitely on slow connections
  • No retry logic; transient network failures cause immediate exception
  • No authentication support; protected URLs fail with 401/403
  • Redirects followed automatically; may load unexpected content
  • No content-type validation; binary files may be parsed as text
  • Large files loaded into memory entirely; can cause OutOfMemoryError

Edge Cases:

  • HTTP 404/500 errors throw IOException
  • HTTPS with invalid certificates throw SSLException
  • URLs with special characters need proper encoding
  • Data URLs (data:text/plain;base64,...) are supported if URL class handles them
  • File URLs (file://) work, but prefer FileSystemDocumentLoader for better error handling
  • Empty response body returns Document with empty content

Performance Notes:

  • Network I/O bound; much slower than file system
  • No connection pooling configuration exposed
  • Each load creates new connection; consider caching
  • Large downloads limited by available memory
  • DNS resolution occurs for each unique hostname

Cost Considerations:

  • External API calls may have rate limits or usage costs
  • Cloud storage URLs (S3, GCS) may incur egress charges
  • Large document downloads consume bandwidth

Exception Handling:

  • MalformedURLException - Invalid URL format
  • IOException - Network errors, HTTP errors (404, 500)
  • UnknownHostException - DNS resolution failure
  • SocketTimeoutException - Connection or read timeout
  • SSLException - HTTPS certificate validation failure
  • OutOfMemoryError - Response too large for memory

Related APIs: UrlSource, FileSystemDocumentLoader, ClassPathDocumentLoader


Document Parsers

TextDocumentParser

Parse plain text documents.

package dev.langchain4j.data.document.parser;

/**
 * DocumentParser implementation for parsing plain text documents
 */
public class TextDocumentParser implements DocumentParser {
    /**
     * Constructor with default UTF-8 charset
     */
    public TextDocumentParser();

    /**
     * Constructor with custom charset
     * @param charset Charset to use for reading text
     */
    public TextDocumentParser(Charset charset);

    /**
     * Parse input stream into document
     * @param inputStream Input stream to parse
     * @return Parsed document
     */
    public Document parse(InputStream inputStream);
}
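The encoding pitfalls below are easy to reproduce with the JDK alone: the byte 0xE9 is "é" in ISO-8859-1 but an invalid sequence in UTF-8, where it decodes to the replacement character U+FFFD. Passing the correct `Charset` to the `TextDocumentParser(Charset)` constructor avoids this; the sketch uses plain `new String(...)` to show the underlying behavior:

```java
import java.nio.charset.Charset;

// Decoding the same bytes with the wrong charset silently produces
// replacement characters rather than throwing.
public class CharsetDemo {
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }
}
```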

Thread Safety: Instances are stateless and thread-safe. Safe to share single instance across threads. Parse method is reentrant.

Common Pitfalls:

  • Default UTF-8 may fail on files with different encoding (e.g., Windows-1252, ISO-8859-1)
  • Binary files parsed as text produce gibberish with replacement characters
  • BOM (Byte Order Mark) included in parsed content if present
  • No line ending normalization; mixed \r\n, \n, \r preserved as-is
  • Entire content read into memory; large files cause OutOfMemoryError

Edge Cases:

  • Empty InputStream returns Document with empty text
  • InputStream with only whitespace returns Document with whitespace
  • Invalid byte sequences replaced with � (U+FFFD) in UTF-8
  • InputStream not at position 0 reads from current position
  • InputStream closed after parsing; reuse requires new stream

Performance Notes:

  • Reads entire stream into memory using BufferedReader
  • 8KB default buffer size; performance degrades for very large files
  • Character decoding CPU-intensive for large files
  • No streaming support; entire document must fit in memory

Cost Considerations:

  • Memory usage is roughly 2× file size (raw bytes plus the decoded char array)
  • Large files (>100MB) better split at file system level before loading

Exception Handling:

  • IOException - Stream read errors
  • MalformedInputException - Invalid character encoding
  • UnmappableCharacterException - Characters not supported in charset
  • OutOfMemoryError - File too large for available heap

Related APIs: DocumentParser interface, ApachePdfBoxParser, ApacheTikaParser, TextDocumentParser subclasses


Document Sources

FileSystemSource

Document source for file system files.

package dev.langchain4j.data.document.source;

/**
 * DocumentSource for file system sources
 */
public class FileSystemSource implements DocumentSource {
    /**
     * Constructor
     * @param path Path to file
     */
    public FileSystemSource(Path path);

    /**
     * Create from path
     * @param filePath Path to file
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(Path filePath);

    /**
     * Create from string path
     * @param filePath String path to file
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(String filePath);

    /**
     * Create from URI
     * @param fileUri URI to file
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(URI fileUri);

    /**
     * Create from File
     * @param file File object
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(File file);

    /**
     * Get input stream
     * @return InputStream for reading file
     */
    public InputStream inputStream();

    /**
     * Get metadata
     * @return Metadata for the source
     */
    public Metadata metadata();
}
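Since `inputStream()` returns an unbuffered stream that the caller must close, the safe consumption pattern is try-with-resources around a `BufferedInputStream`. A JDK-only sketch of that pattern, with `Files.newInputStream` standing in for `FileSystemSource.inputStream()` (which behaves the same way for this purpose):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Wrap the raw stream for buffering and let try-with-resources close it,
// so no file descriptor leaks even if reading throws.
public class BufferedReadDemo {
    public static int countBytes(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int count = 0;
            while (in.read() != -1) count++;
            return count;
        }
    }
}
```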

Thread Safety: Immutable after construction. Safe to share across threads. Each inputStream() call creates new FileInputStream, allowing concurrent reads.

Common Pitfalls:

  • InputStream must be closed by caller; resource leak if forgotten
  • Multiple inputStream() calls on same instance each open new file handle
  • Metadata extracted only once at construction; changes to file not reflected
  • Symbolic links resolved automatically; metadata reflects target file
  • Relative paths resolved against JVM working directory, not classpath

Edge Cases:

  • File deleted between construction and inputStream() throws NoSuchFileException
  • File modified between construction and read may have inconsistent metadata
  • Empty files return valid InputStream with 0 bytes available
  • Directories passed to constructor throw IOException on inputStream()
  • Files without read permission throw AccessDeniedException

Performance Notes:

  • Metadata extraction requires stat() system call at construction
  • Each inputStream() opens new file descriptor; OS limits apply (typically 1024-4096)
  • No buffering applied; wrap with BufferedInputStream for better performance
  • Network mounted files (NFS, SMB) have high latency

Cost Considerations:

  • File descriptor leaks prevent other files from being opened
  • Large files should be streamed, not loaded entirely into memory

Exception Handling:

  • NoSuchFileException - File does not exist
  • AccessDeniedException - Insufficient permissions
  • IOException - Generic I/O errors
  • FileSystemException - File system specific errors

Related APIs: FileSystemDocumentLoader, UrlSource, ClassPathSource, DocumentSource interface


ClassPathSource

Document source for classpath resources.

package dev.langchain4j.data.document.source;

/**
 * DocumentSource specialization that reads from classpath
 */
public class ClassPathSource implements DocumentSource {
    /**
     * Create from classpath resource
     * @param classPathResource Path to resource on classpath
     * @return ClassPathSource instance
     */
    public static ClassPathSource from(String classPathResource);

    /**
     * Create with custom classloader
     * @param classPathResource Path to resource on classpath
     * @param classLoader ClassLoader to use
     * @return ClassPathSource instance
     */
    public static ClassPathSource from(String classPathResource, ClassLoader classLoader);

    /**
     * Get the URL
     * @return URL of the resource
     */
    public URL url();

    /**
     * Get the classloader
     * @return ClassLoader used
     */
    public ClassLoader classLoader();

    /**
     * Check if inside archive (JAR)
     * @return true if resource is inside a JAR file
     */
    public boolean isInsideArchive();

    /**
     * Get input stream
     * @return InputStream for reading resource
     */
    public InputStream inputStream();

    /**
     * Get metadata
     * @return Metadata for the source
     */
    public Metadata metadata();
}
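A rough approximation of what `isInsideArchive()` reports can be read off the resource URL's scheme: resources served from a JAR use the "jar" scheme, JDK-internal resources use "jrt", and exploded directories use "file". This is an illustration of the idea, not the library's actual check:

```java
import java.net.URL;

// Inspect the URL scheme of a classpath resource to guess whether it
// lives inside an archive.
public class SchemeDemo {
    public static String scheme(String resource) {
        URL url = ClassLoader.getSystemClassLoader().getResource(resource);
        return url == null ? null : url.getProtocol();
    }
}
```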

Thread Safety: Immutable after construction. Thread-safe for concurrent access. Each inputStream() call creates independent stream.

Common Pitfalls:

  • Resource path must NOT start with "/" for most classloaders
  • Fails silently if resource not found (returns null, then NPE on parse)
  • Wrong classloader won't find resources in specific JARs
  • Resources inside nested JARs may not be accessible
  • isInsideArchive() only checks JAR, not ZIP or other archives

Edge Cases:

  • Resource not found throws NullPointerException on inputStream()
  • Empty resources return valid InputStream with 0 bytes
  • Resources from exploded directories vs JARs have different URL schemes
  • Metadata lastModified uses JAR timestamp, not resource timestamp
  • ClassLoader hierarchy may find different resource than expected

Performance Notes:

  • Resources in JARs require ZIP decompression
  • Repeated access to same resource re-reads from JAR each time
  • No caching at framework level; consider caching loaded Documents
  • isInsideArchive() parses URL string; mildly expensive

Cost Considerations:

  • JAR resources increase application package size
  • Large resources in classpath affect startup time and memory

Exception Handling:

  • NullPointerException - Resource not found on classpath
  • IOException - Error reading from JAR
  • IllegalArgumentException - Invalid resource path
  • OutOfMemoryError - Resource too large

Related APIs: ClassPathDocumentLoader, FileSystemSource, UrlSource


UrlSource

Document source for URLs.

package dev.langchain4j.data.document.source;

/**
 * DocumentSource for URL sources
 */
public class UrlSource implements DocumentSource {
    /**
     * Constructor
     * @param url URL to load from
     */
    public UrlSource(URL url);

    /**
     * Create from string URL
     * @param url String URL
     * @return UrlSource instance
     */
    public static UrlSource from(String url);

    /**
     * Create from URL
     * @param url URL object
     * @return UrlSource instance
     */
    public static UrlSource from(URL url);

    /**
     * Create from URI
     * @param uri URI object
     * @return UrlSource instance
     */
    public static UrlSource from(URI uri);

    /**
     * Get input stream
     * @return InputStream for reading from URL
     */
    public InputStream inputStream();

    /**
     * Get metadata
     * @return Metadata for the source
     */
    public Metadata metadata();
}
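A common workaround for the missing-timeout pitfall below is to open the connection yourself, set explicit timeouts, and hand the resulting stream to a parser directly instead of going through `UrlSource`. A hedged JDK-only sketch (the URL is made up; `openConnection()` performs no network I/O, so the configuration is applied before any request is sent):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Configure explicit timeouts before any bytes are fetched.
public class TimeoutDemo {
    public static HttpURLConnection configured(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5_000); // fail after 5s if no TCP connection
        conn.setReadTimeout(10_000);   // fail if the server stalls mid-response
        return conn;
    }
}
```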

Thread Safety: Immutable after construction. Thread-safe for concurrent access. Each inputStream() call makes new HTTP request.

Common Pitfalls:

  • No timeout configuration; may hang on slow connections
  • No authentication support built-in
  • No retry logic for transient failures
  • HTTP errors (404, 500) not detected until inputStream() called
  • Redirects followed automatically; may fetch unexpected URL
  • No connection pooling; each inputStream() opens new connection

Edge Cases:

  • HTTP 404/500 throw IOException from inputStream()
  • Empty response body returns valid InputStream with 0 bytes
  • HTTPS with invalid certificate throws SSLException
  • Network unreachable throws UnknownHostException
  • URL connection timeout defaults to infinite

Performance Notes:

  • Network latency much higher than file system
  • Each inputStream() call makes new HTTP request; very inefficient for multiple reads
  • No caching; same URL fetched repeatedly
  • Large responses loaded entirely into memory by some parsers

Cost Considerations:

  • External API calls may have rate limits or per-request costs
  • Cloud storage egress charges for S3, GCS, Azure Blob
  • Bandwidth costs for large documents

Exception Handling:

  • MalformedURLException - Invalid URL format
  • IOException - Network errors, HTTP errors
  • UnknownHostException - DNS failure
  • SSLException - HTTPS certificate errors
  • SocketTimeoutException - Connection timeout
  • OutOfMemoryError - Response too large

Related APIs: UrlDocumentLoader, FileSystemSource, ClassPathSource


Document Splitters

DocumentSplitters Utility

Factory methods for recommended document splitters.

package dev.langchain4j.data.document.splitter;

/**
 * Utility class providing factory methods for recommended document splitters
 */
public class DocumentSplitters {
    /**
     * Create recursive splitter with token limits (recommended for generic text)
     * Splits by paragraphs, then lines, then sentences, then words, then characters
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     * @return Configured document splitter
     */
    public static DocumentSplitter recursive(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Create recursive splitter with character limits
     * Splits by paragraphs, then lines, then sentences, then words, then characters
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @return Configured document splitter
     */
    public static DocumentSplitter recursive(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars
    );
}
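The cost notes below can be made concrete with a rough, hypothetical estimator (not part of the library): with ideal packing, each segment after the first contributes `maxSegment - overlap` new tokens, so the segment count is approximately `ceil((total - overlap) / (maxSegment - overlap))`:

```java
// Back-of-the-envelope segment count under ideal packing; real splitters
// produce more segments because they break at paragraph/sentence
// boundaries rather than exact token positions.
public class SegmentEstimate {
    public static int estimate(int totalTokens, int maxSegment, int overlap) {
        if (totalTokens <= maxSegment) return 1;
        int step = maxSegment - overlap;
        return (int) Math.ceil((totalTokens - overlap) / (double) step);
    }
}
```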

Thread Safety: Factory methods are static and thread-safe. Returned DocumentSplitter instances are stateless and thread-safe. Safe to share splitter instance across threads.

Common Pitfalls:

  • Token-based splitting requires TokenCountEstimator; forgetting causes NPE
  • Overlap size >= segment size causes infinite loop or empty segments
  • Character count != token count; models have token limits not character limits
  • OpenAI tokenizers differ (GPT-3.5 vs GPT-4); use matching tokenizer
  • Recursive splitting can be slow on very large documents (>1MB)

Edge Cases:

  • Document with no paragraph/line breaks falls back to sentence splitting
  • Document with no sentence breaks falls back to word splitting
  • Single long word exceeding maxSegmentSize splits by characters
  • Empty document returns empty list of segments
  • Document smaller than maxSegmentSize returns single segment

Performance Notes:

  • Token-based splitting requires tokenization; 10-50x slower than character-based
  • Sentence detection uses Apache OpenNLP; loads model on first use (~50ms)
  • Recursive strategy tries each level; worst case processes text 5 times
  • Large overlap ratios cause redundant processing and storage

Cost Considerations:

  • More segments = more embedding API calls = higher cost
  • Overlap increases total token count sent to embedding API
  • Smaller segments improve retrieval precision but increase storage and costs
  • Typical sweet spot: 300-500 tokens per segment, 10% overlap

Exception Handling:

  • IllegalArgumentException - Invalid parameters (negative sizes, overlap > segment size)
  • NullPointerException - Null tokenizer for token-based splitting
  • OutOfMemoryError - Document too large with very small segment size

Related APIs: DocumentByParagraphSplitter, DocumentBySentenceSplitter, HierarchicalDocumentSplitter


HierarchicalDocumentSplitter

Base class for hierarchical document splitters.

package dev.langchain4j.data.document.splitter;

/**
 * Base class for hierarchical document splitters
 * Provides machinery for sub-splitting documents when a single segment is too long
 */
public abstract class HierarchicalDocumentSplitter implements DocumentSplitter {
    /**
     * Split document into segments
     * @param document Document to split
     * @return List of text segments
     */
    public List<TextSegment> split(Document document);

    /**
     * Split text implementation (abstract)
     * @param text Text to split
     * @return Array of split parts
     */
    protected abstract String[] split(String text);

    /**
     * Get join delimiter (abstract)
     * @return Delimiter used to join parts
     */
    protected abstract String joinDelimiter();

    /**
     * Get default sub-splitter (abstract)
     * @return Default sub-splitter to use if segment is too large
     */
    protected abstract DocumentSplitter defaultSubSplitter();

    /**
     * Get overlap region at end of segment
     * @param segmentText Segment text
     * @return Overlap text
     */
    protected String overlapFrom(String segmentText);

    /**
     * Estimate size in tokens or characters
     * @param text Text to estimate
     * @return Estimated size
     */
    protected int estimateSize(String text);

    /**
     * Create segment with metadata
     * @param text Segment text
     * @param document Source document
     * @param index Segment index
     * @return Text segment with metadata
     */
    protected static TextSegment createSegment(String text, Document document, int index);
}
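The core repacking machinery can be illustrated with a simplified JDK-only sketch: split the text into parts, then greedily fit as many parts as possible into each segment, joining them with the delimiter. This is an approximation (character-based sizes, no overlap, no sub-splitting); the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Greedily pack parts into segments of at most maxChars characters,
// joined by the given delimiter. A single part longer than maxChars is
// kept whole here; the real splitter would delegate it to a sub-splitter.
public class GreedyPacker {
    public static List<String> pack(String[] parts, String delimiter, int maxChars) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : parts) {
            int extra = current.length() == 0 ? part.length()
                                              : delimiter.length() + part.length();
            if (current.length() > 0 && current.length() + extra > maxChars) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(delimiter);
            current.append(part);
        }
        if (current.length() > 0) segments.add(current.toString());
        return segments;
    }
}
```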

Thread Safety: Implementations are stateless and thread-safe if TokenCountEstimator is thread-safe. Safe to share across threads for splitting different documents concurrently.

Common Pitfalls:

  • Sub-splitter must handle segments that still exceed maxSegmentSize
  • Infinite recursion possible if sub-splitter doesn't make progress
  • Metadata copied to all segments; large metadata multiplies memory usage
  • Overlap calculation at boundary may cut words/sentences mid-way
  • Join delimiter added between parts; affects final character/token count

Edge Cases:

  • Zero-length segments filtered out automatically
  • Segment exactly at maxSegmentSize does not trigger sub-splitting
  • Empty document returns empty list
  • Document with only whitespace may produce empty segments
  • Very small maxSegmentSize (< 10) may cause all text to be dropped

Performance Notes:

  • Recursive sub-splitting can process text multiple times
  • Overlap extraction scans segment from end; O(segment_length)
  • Creating segments with metadata involves string copying
  • Token counting called repeatedly; cache if possible

Cost Considerations:

  • More aggressive splitting (smaller segments) = more embedding calls
  • Overlap duplicates content in embeddings; increases storage and cost

Exception Handling:

  • IllegalArgumentException - Invalid configuration (overlap > segment size)
  • StackOverflowError - Infinite sub-splitting recursion
  • OutOfMemoryError - Too many segments generated

Related APIs: DocumentSplitters, DocumentByParagraphSplitter, DocumentBySentenceSplitter


DocumentByParagraphSplitter

Split documents by paragraphs.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into paragraphs and fits as many as possible into a single TextSegment
 * Paragraph boundaries detected by double newlines
 * Default sub-splitter is DocumentBySentenceSplitter
 */
public class DocumentByParagraphSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByParagraphSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large paragraphs
     */
    public DocumentByParagraphSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByParagraphSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Split text by paragraphs
     * @param text Text to split
     * @return Array of paragraphs
     */
    protected String[] split(String text);

    /**
     * Get join delimiter
     * @return "\n\n" (double newline)
     */
    protected String joinDelimiter();

    /**
     * Get default sub-splitter
     * @return DocumentBySentenceSplitter instance
     */
    protected DocumentSplitter defaultSubSplitter();
}

Thread Safety: Stateless and thread-safe. Safe to share instance across threads. Token counter must be thread-safe if used.
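
The paragraph boundary rule can be illustrated in plain Java. This sketch mimics the documented behavior (split on two or more newlines, filter empty paragraphs); it is an illustration, not the library's internal implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParagraphBoundaryDemo {
    // Two or more consecutive newlines mark a paragraph boundary;
    // a single newline stays inside a paragraph.
    public static List<String> paragraphs(String text) {
        return Arrays.stream(text.split("\\n{2,}"))
                .map(String::strip)
                .filter(p -> !p.isEmpty()) // empty paragraphs filtered out
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "First paragraph,\nstill first.\n\nSecond.\n\n\nThird.";
        // Three paragraphs: the single \n stays inside the first,
        // and "\n\n\n" splits the same way as "\n\n"
        System.out.println(paragraphs(text).size()); // 3
    }
}
```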

Common Pitfalls:

  • Paragraphs detected by "\n\n" only; single newline not recognized
  • Mixed line endings (\r\n, \n, \r) may not be handled consistently
  • Documents without paragraph breaks treated as single paragraph, fall back to sub-splitter
  • Very long paragraphs (> maxSegmentSize) always trigger sub-splitting
  • Trailing whitespace in paragraphs preserved; affects size calculations

Edge Cases:

  • Document with only "\n\n" produces empty paragraphs (filtered out)
  • Three or more newlines treated same as two (paragraph boundary)
  • Paragraph consisting of only whitespace may be preserved or dropped
  • Single line document with no "\n\n" falls back to sentence splitting
  • Empty paragraphs filtered automatically

Performance Notes:

  • Paragraph detection via string split is fast (O(n))
  • Sub-splitting large paragraphs more expensive (sentence detection)
  • Token-based limits require tokenizing each paragraph
  • Character-based limits are ~10x faster than token-based

Cost Considerations:

  • Paragraph boundaries preserve semantic coherence; better for RAG quality
  • Fewer segments than sentence splitting = lower embedding costs
  • Overlap at paragraph level includes entire paragraphs in adjacent segments

Exception Handling:

  • IllegalArgumentException - Invalid size parameters
  • NullPointerException - Null text or tokenizer
  • OutOfMemoryError - Too many small paragraphs with large document

Related APIs: DocumentBySentenceSplitter, DocumentByLineSplitter, HierarchicalDocumentSplitter


DocumentByLineSplitter

Split documents by lines.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into lines and fits as many as possible into a single TextSegment
 * Line boundaries detected by newline characters
 * Default sub-splitter is DocumentBySentenceSplitter
 */
public class DocumentByLineSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByLineSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large lines
     */
    public DocumentByLineSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByLineSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Split text by lines
     * @param text Text to split
     * @return Array of lines
     */
    protected String[] split(String text);

    /**
     * Get join delimiter
     * @return "\n" (newline)
     */
    protected String joinDelimiter();

    /**
     * Get default sub-splitter
     * @return DocumentBySentenceSplitter instance
     */
    protected DocumentSplitter defaultSubSplitter();
}

Thread Safety: Stateless and thread-safe. Safe for concurrent use. TokenCountEstimator must be thread-safe.
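
Because mixed line endings are a recurring pitfall for line-based splitting, a common precaution is to normalize them before loading text into a splitter. A plain-Java sketch (assuming \n is treated as the line boundary):

```java
public class LineEndingDemo {
    // Normalize CRLF and bare CR to LF so line-based splitting is consistent
    public static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windowsText = "row1\r\nrow2\r\nrow3";
        String[] lines = normalize(windowsText).split("\n");
        System.out.println(lines.length); // 3 clean lines, no trailing \r
    }
}
```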

Common Pitfalls:

  • Mixed line endings (\r\n vs \n) may cause inconsistent splits
  • Empty lines preserved as empty segments (then filtered)
  • Very long lines (e.g., minified code) exceed maxSegmentSize and trigger sub-splitting
  • Good for structured data (CSV, logs); poor for prose text
  • Windows CRLF (\r\n) endings leave a trailing \r on each line when only \n is treated as the boundary; normalize line endings first

Edge Cases:

  • Empty lines filtered automatically
  • Lines with only whitespace may be preserved
  • No trailing newline: last line still included
  • Consecutive newlines create empty line segments (filtered)
  • Single character per line with small maxSegmentSize causes many segments

Performance Notes:

  • Line splitting is very fast (O(n) string split)
  • Good for line-oriented formats (logs, CSV, code)
  • Sub-splitting long lines uses sentence detection (slower)

Cost Considerations:

  • Line-based splitting often creates more segments than paragraph-based
  • Good for structured data where lines are semantic units
  • Poor for prose where sentences span multiple lines

Exception Handling:

  • IllegalArgumentException - Invalid parameters
  • NullPointerException - Null input
  • OutOfMemoryError - Too many lines

Related APIs: DocumentByParagraphSplitter, DocumentBySentenceSplitter, HierarchicalDocumentSplitter


DocumentBySentenceSplitter

Split documents by sentences.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into sentences and fits as many as possible into a single TextSegment
 * Uses Apache OpenNLP for sentence detection
 */
public class DocumentBySentenceSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large sentences
     */
    public DocumentBySentenceSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentBySentenceSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );
}

Thread Safety: Sentence detector is NOT thread-safe (OpenNLP limitation). Do NOT share instance across threads. Create one instance per thread or synchronize access.
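
Since the sentence detector cannot be shared across threads, the usual workaround is one splitter instance per thread. A generic sketch of that pattern; the supplier passed in would be the real constructor call, e.g. `() -> new DocumentBySentenceSplitter(500, 50)`:

```java
import java.util.function.Supplier;

// Lazily creates one instance per thread and reuses it on later calls,
// so no two threads ever touch the same underlying object.
public class PerThreadHolder<T> {
    private final ThreadLocal<T> holder;

    public PerThreadHolder(Supplier<T> factory) {
        this.holder = ThreadLocal.withInitial(factory);
    }

    public T get() {
        return holder.get();
    }
}
```

With this in place, calling `splitters.get().split(doc)` inside a parallel stream never shares a sentence detector between threads.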

Common Pitfalls:

  • Requires Apache OpenNLP dependency; missing JAR causes ClassNotFoundException
  • Sentence model (en-sent.bin) loaded lazily; first split is slow (~50ms)
  • Not thread-safe; concurrent splits corrupt sentence detector state
  • Abbreviations (Dr., Inc.) may cause incorrect sentence boundaries
  • Languages other than English not supported by default model
  • URLs and emails containing periods may split incorrectly

Edge Cases:

  • Single sentence document returns single segment
  • Document with no sentence endings falls back to word splitting
  • Sentence exceeding maxSegmentSize triggers sub-splitter
  • Empty sentences after whitespace trimming filtered out
  • Ellipsis (...) may or may not be sentence boundary

Performance Notes:

  • Sentence detection ~10x slower than paragraph/line splitting
  • OpenNLP model loaded once and cached
  • Good for natural language text; overkill for structured data
  • Sub-splitting long sentences uses word splitter (fast)

Cost Considerations:

  • More segments than paragraph splitting = higher embedding costs
  • Better semantic coherence improves retrieval quality
  • Overlap at sentence level provides good context

Exception Handling:

  • ClassNotFoundException - OpenNLP dependency missing
  • IOException - Sentence model file not found
  • IllegalArgumentException - Invalid parameters
  • ConcurrentModificationException - Concurrent access (not thread-safe)

Related APIs: DocumentByWordSplitter, DocumentByParagraphSplitter, HierarchicalDocumentSplitter


DocumentByWordSplitter

Split documents by words.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into words and fits as many as possible into a single TextSegment
 */
public class DocumentByWordSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByWordSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large words
     */
    public DocumentByWordSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByWordSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );
}

Thread Safety: Stateless and thread-safe. Safe for concurrent use across threads.
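
The whitespace boundary rule described here, sketched in plain Java (an illustration of the documented behavior, not the library's code):

```java
import java.util.Arrays;
import java.util.List;

public class WordBoundaryDemo {
    // Any run of whitespace is one boundary; punctuation stays attached
    public static List<String> words(String text) {
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return List.of(); // whitespace-only input yields no words
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(words("don't   stop self-driving cars."));
        // [don't, stop, self-driving, cars.]
    }
}
```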

Common Pitfalls:

  • Word boundaries defined by whitespace; punctuation attached to words
  • Contractions (don't, I'll) treated as single words
  • Hyphenated words (e.g., self-driving) treated as single word
  • Very long words (URLs, base64) may exceed maxSegmentSize and trigger character splitting
  • Multiple spaces between words collapsed to single space in output

Edge Cases:

  • Empty string returns empty list
  • String with only whitespace returns empty list
  • Single word exceeding maxSegmentSize triggers character sub-splitter
  • Consecutive whitespace treated as single word boundary
  • Punctuation-only "words" preserved

Performance Notes:

  • Very fast; simple whitespace split (O(n))
  • Sub-splitter (character) only invoked for extremely long words
  • Good fallback when sentence detection fails

Cost Considerations:

  • Breaks sentence coherence; worse for semantic search
  • Many segments = higher costs
  • Rarely used as primary splitter; usually sub-splitter

Exception Handling:

  • IllegalArgumentException - Invalid parameters
  • NullPointerException - Null input

Related APIs: DocumentByCharacterSplitter, DocumentBySentenceSplitter, HierarchicalDocumentSplitter


DocumentByCharacterSplitter

Split documents by characters.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into characters and fits as many as possible into a single TextSegment
 * Supports character or token-based limits
 */
public class DocumentByCharacterSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter (typically null for character splitter)
     */
    public DocumentByCharacterSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByCharacterSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Full constructor with token limits and sub-splitter
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     * @param subSplitter Sub-splitter (typically null)
     */
    public DocumentByCharacterSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator,
        DocumentSplitter subSplitter
    );

    /**
     * Split text implementation
     * @param text Text to split
     * @return Array of characters as strings
     */
    protected String[] split(String text);

    /**
     * Get join delimiter
     * @return "" (empty string)
     */
    protected String joinDelimiter();

    /**
     * Get default sub-splitter
     * @return null (no sub-splitter)
     */
    protected DocumentSplitter defaultSubSplitter();
}

Thread Safety: Stateless and thread-safe. Safe for concurrent use.

Common Pitfalls:

  • Destroys all semantic structure (words, sentences, paragraphs)
  • Poor for retrieval quality; segments often meaningless
  • Sizes count Java chars (UTF-16 code units), not bytes or user-visible characters
  • Emoji and special Unicode may span multiple Java chars (surrogate pairs)
  • No default sub-splitter; maxSegmentSize is effectively a hard limit

Edge Cases:

  • Empty string returns empty list
  • Single character document returns single segment
  • Segment size = 1 splits every character separately
  • Unicode surrogate pairs may be split incorrectly
  • Overlap must be < segment size to make progress
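
The surrogate-pair caveat is easy to see in plain Java: character counts measure UTF-16 code units, so a naive cut can land inside an emoji:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "a😀b"; // the emoji is one code point but two Java chars
        System.out.println(s.length());                      // 4
        System.out.println(s.codePointCount(0, s.length())); // 3
        // A cut between index 1 and 2 would separate the emoji's two halves:
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```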

Performance Notes:

  • Extremely fast; no parsing logic
  • Last resort fallback in hierarchical splitters
  • Rarely used as primary splitter

Cost Considerations:

  • Destroys semantic meaning; worst for RAG quality
  • Use only when all other splitters fail (e.g., binary data as text)

Exception Handling:

  • IllegalArgumentException - Invalid parameters (overlap >= segment size)
  • NullPointerException - Null input

Related APIs: DocumentByWordSplitter, HierarchicalDocumentSplitter


DocumentByRegexSplitter

Split documents using custom regex pattern.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents using a custom regex pattern
 */
public class DocumentByRegexSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param regex Regular expression pattern for splitting
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByRegexSplitter(String regex, int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param regex Regular expression pattern for splitting
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large segments
     */
    public DocumentByRegexSplitter(
        String regex,
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param regex Regular expression pattern for splitting
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByRegexSplitter(
        String regex,
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );
}

Thread Safety: Pattern compiled at construction time. Thread-safe if Pattern.split() is used correctly (stateless). Safe for concurrent use.

Common Pitfalls:

  • Invalid regex throws PatternSyntaxException at construction
  • Regex must match delimiters, not content (use lookahead/lookbehind if needed)
  • Greedy vs non-greedy matching affects results
  • Regex performance degrades with catastrophic backtracking
  • Delimiter not preserved in segments (unless using lookaround assertions)
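
The last pitfall deserves a concrete illustration: a zero-width lookahead matches only the boundary, so the delimiter text survives at the start of the next segment. The same regex string would be passed to DocumentByRegexSplitter; shown here with plain `String.split`:

```java
import java.util.Arrays;

public class LookaheadSplitDemo {
    public static void main(String[] args) {
        String log = "2024-01-01 started\n2024-01-02 stopped";

        // Plain pattern consumes the matched delimiter text:
        System.out.println(Arrays.toString(log.split("\\n\\d{4}")));
        // [2024-01-01 started, -01-02 stopped]

        // Lookahead matches only the newline, so the timestamp survives:
        System.out.println(Arrays.toString(log.split("\\n(?=\\d{4})")));
        // [2024-01-01 started, 2024-01-02 stopped]
    }
}
```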

Edge Cases:

  • Regex never matches: document not split at all (becomes single segment)
  • Regex matches everywhere: produces many empty segments (filtered)
  • Empty segments after split are filtered automatically
  • Regex matching newlines may interact poorly with line-based logic

Performance Notes:

  • Regex compilation happens once at construction
  • Complex regex can be slow (O(n²) or worse with backtracking)
  • Simple patterns (literal strings) almost as fast as string.split()

Cost Considerations:

  • Custom splitting can preserve domain-specific structure
  • Good for log files, structured text, code with special delimiters

Exception Handling:

  • PatternSyntaxException - Invalid regex pattern
  • IllegalArgumentException - Invalid size parameters
  • StackOverflowError - Catastrophic regex backtracking

Related APIs: Pattern class, HierarchicalDocumentSplitter, DocumentByLineSplitter


Chunking Strategy Guide

Choosing the Right Splitter

Use Cases by Content Type:

Content TypeRecommended SplitterReasoning
Documentation, articlesDocumentSplitters.recursive()Preserves semantic structure (paragraphs > sentences > words)
Code filesDocumentByLineSplitterCode structure aligned with lines
Log filesDocumentByRegexSplitterCustom delimiters (timestamps, log levels)
CSV/TSVDocumentByLineSplitterEach line is semantic unit
Legal documentsDocumentByParagraphSplitterParagraph = logical unit
Chat transcriptsDocumentByRegexSplitterSplit by speaker or timestamp
MarkdownDocumentByParagraphSplitterRespects document structure
JSON/XMLCustom parser + DocumentByLineSplitterParse first, then split logical blocks

Segment Size Guidelines

Token-based sizing (recommended):

  • Small segments (100-200 tokens): High precision, more segments, higher cost
  • Medium segments (300-500 tokens): Balanced precision and context
  • Large segments (800-1000 tokens): More context, lower precision, risk of exceeding model limits

Character-based sizing:

  • 1 token ≈ 4 characters (English text)
  • 500 characters ≈ 125 tokens
  • Use character-based for simplicity; token-based for accuracy
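
The ≈4 chars/token rule of thumb above as a tiny helper, handy when picking character limits without wiring in a tokenizer. Note this heuristic holds for English prose only; code and non-Latin scripts tokenize differently:

```java
public class TokenEstimate {
    // Rough English-text heuristic: 1 token ≈ 4 characters
    public static int approxTokens(int chars) {
        return Math.round(chars / 4.0f);
    }

    public static int charsForTokenBudget(int tokens) {
        return tokens * 4;
    }

    public static void main(String[] args) {
        System.out.println(approxTokens(500));        // 125
        System.out.println(charsForTokenBudget(300)); // 1200
    }
}
```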

Overlap Strategy

Overlap benefits:

  • Prevents information loss at segment boundaries
  • Improves retrieval when query terms span boundary
  • Provides context continuity

Overlap sizing:

  • 10-20% of segment size is typical
  • Paragraph splitter: 1-2 paragraphs overlap
  • Sentence splitter: 1-3 sentences overlap
  • Too much overlap: redundant storage and embedding costs
  • Too little overlap: information loss at boundaries

When to skip overlap:

  • Structured data where boundaries are clear (CSV rows)
  • Storage/cost constrained scenarios
  • Segments already large enough to provide context

Hierarchical Splitting Strategy

Recursive splitting hierarchy:

  1. Paragraph (double newline) - preserves major structure
  2. Line (single newline) - preserves minor structure
  3. Sentence (OpenNLP) - preserves semantic units
  4. Word (whitespace) - preserves lexical units
  5. Character (fallback) - guaranteed progress

Custom hierarchy example:

DocumentSplitter customSplitter = new DocumentByRegexSplitter(
    "\\n---\\n", // Custom section delimiter
    1000,
    100,
    new DocumentByParagraphSplitter(1000, 100)
);

Performance Optimization

Parallel processing pattern:

List<Document> documents = loadDocuments();
List<TextSegment> allSegments = documents.parallelStream()
    .flatMap(doc -> splitter.split(doc).stream())
    .collect(Collectors.toList());

Batching for embedding:

int batchSize = 100;
for (int i = 0; i < segments.size(); i += batchSize) {
    List<TextSegment> batch = segments.subList(
        i,
        Math.min(i + batchSize, segments.size())
    );
    List<Embedding> embeddings = embeddingModel.embedAll(batch).content();
    // Store embeddings
}

File Type Handling Patterns

Plain Text Files

Supported encodings:

  • UTF-8 (default)
  • UTF-16 (with BOM detection)
  • ISO-8859-1 (Latin-1)
  • Windows-1252
  • Custom Charset via TextDocumentParser(charset)

Pattern:

// Default parser (UTF-8)
Document doc = FileSystemDocumentLoader.loadDocument("file.txt");

// Explicit encoding
Document doc2 = FileSystemDocumentLoader.loadDocument(
    "file.txt",
    new TextDocumentParser(StandardCharsets.ISO_8859_1)
);

Markdown Files

Pattern:

// Load as text (preserves markdown syntax)
Document doc = FileSystemDocumentLoader.loadDocument("README.md");

// Split by headers (custom regex)
DocumentSplitter splitter = new DocumentByRegexSplitter(
    "\\n##? ",  // Split on ## or # headers
    2000,
    200
);

Code Files

Pattern:

// Load source code
Document code = FileSystemDocumentLoader.loadDocument("App.java");

// Split by lines (preserves structure)
DocumentSplitter splitter = new DocumentByLineSplitter(500, 50);

// Or split by functions (custom regex for Java)
DocumentSplitter functionSplitter = new DocumentByRegexSplitter(
    "\\n\\s*(public|private|protected)\\s+",
    1000,
    100
);

CSV Files

Pattern:

// Load CSV
Document csv = FileSystemDocumentLoader.loadDocument("data.csv");

// Split by lines (each row is a segment)
DocumentSplitter splitter = new DocumentByLineSplitter(
    1000, // Max chars per segment
    0     // No overlap for structured data
);

// Drop the first segment if it contains the header row
// (a segment may hold several rows, so stripping the header from the
// document text before splitting is more reliable)
List<TextSegment> segments = splitter.split(csv).stream()
    .skip(1) // skips the first *segment*, not strictly the first row
    .collect(Collectors.toList());

Log Files

Pattern:

// Load log file
Document logs = FileSystemDocumentLoader.loadDocument("app.log");

// Split by timestamp pattern
DocumentSplitter splitter = new DocumentByRegexSplitter(
    "\\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}",  // ISO timestamp
    2000,
    0  // No overlap for logs
);

JSON Files

Pattern:

// Parse JSON first, then create documents per object
String jsonContent = Files.readString(Path.of("data.json"));
JsonArray array = JsonParser.parseString(jsonContent).getAsJsonArray();

List<Document> documents = new ArrayList<>();
for (JsonElement element : array) {
    String text = element.toString();
    documents.add(Document.from(text));
}

// Split each document
List<TextSegment> segments = documents.stream()
    .flatMap(doc -> splitter.split(doc).stream())
    .collect(Collectors.toList());

PDF Files

Pattern (requires Apache PDFBox):

// Add dependency: dev.langchain4j:langchain4j-document-parser-apache-pdfbox
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;

Document pdf = FileSystemDocumentLoader.loadDocument(
    "document.pdf",
    new ApachePdfBoxDocumentParser()
);

// Split with token-based limits (PDFs often verbose)
DocumentSplitter splitter = DocumentSplitters.recursive(500, 50, tokenizer);

Binary Files

Pattern:

  • Do NOT load binary files (images, videos, executables) with text loaders
  • Use specialized parsers or skip binary files
  • Filter by extension:
PathMatcher textFilesOnly = FileSystems.getDefault().getPathMatcher(
    "glob:*.{txt,md,java,py,js,json,xml,csv,log}"
);

List<Document> docs = FileSystemDocumentLoader.loadDocuments(
    Path.of("/path/to/dir"),
    textFilesOnly
);

Testing Patterns

Unit Testing Document Loaders

import org.junit.jupiter.api.Test;
import static org.assertj.core.api.Assertions.*;

class DocumentLoaderTest {
    @Test
    void testLoadSingleDocument() {
        // Given
        Path testFile = Path.of("src/test/resources/test.txt");

        // When
        Document doc = FileSystemDocumentLoader.loadDocument(testFile);

        // Then
        assertThat(doc.text()).isNotEmpty();
        assertThat(doc.metadata().get("file_name")).isEqualTo("test.txt");
    }

    @Test
    void testLoadNonExistentFile() {
        // Given
        Path nonExistent = Path.of("does-not-exist.txt");

        // When/Then
        assertThatThrownBy(() -> FileSystemDocumentLoader.loadDocument(nonExistent))
            .isInstanceOf(NoSuchFileException.class);
    }

    @Test
    void testLoadWithCustomCharset() {
        // Given
        Path latin1File = Path.of("src/test/resources/latin1.txt");
        TextDocumentParser parser = new TextDocumentParser(StandardCharsets.ISO_8859_1);

        // When
        Document doc = FileSystemDocumentLoader.loadDocument(latin1File, parser);

        // Then
        assertThat(doc.text()).contains("café"); // Correctly decoded
    }
}

Unit Testing Document Splitters

class DocumentSplitterTest {
    private DocumentSplitter splitter;

    @BeforeEach
    void setUp() {
        splitter = DocumentSplitters.recursive(100, 10);
    }

    @Test
    void testSplitSmallDocument() {
        // Given
        Document doc = Document.from("Short text.");

        // When
        List<TextSegment> segments = splitter.split(doc);

        // Then
        assertThat(segments).hasSize(1);
        assertThat(segments.get(0).text()).isEqualTo("Short text.");
    }

    @Test
    void testSplitLargeDocument() {
        // Given
        String longText = "A ".repeat(100); // 200 characters
        Document doc = Document.from(longText);

        // When
        List<TextSegment> segments = splitter.split(doc);

        // Then
        assertThat(segments).hasSizeGreaterThan(1);
        assertThat(segments).allMatch(s -> s.text().length() <= 100);
    }

    @Test
    void testOverlapBetweenSegments() {
        // Given
        String text = "Sentence one. Sentence two. Sentence three. Sentence four.";
        Document doc = Document.from(text);
        DocumentSplitter splitterWithOverlap = new DocumentBySentenceSplitter(30, 10);

        // When
        List<TextSegment> segments = splitterWithOverlap.split(doc);

        // Then
        assertThat(segments.size()).isGreaterThan(1);
        // Verify overlap exists
        for (int i = 0; i < segments.size() - 1; i++) {
            String currentEnd = segments.get(i).text().substring(
                Math.max(0, segments.get(i).text().length() - 10)
            );
            String nextStart = segments.get(i + 1).text().substring(0,
                Math.min(10, segments.get(i + 1).text().length())
            );
            // Some overlap should exist (strip first so an empty string
            // produced by split() cannot match trivially)
            assertThat(nextStart).containsAnyOf(currentEnd.strip().split("\\s+"));
        }
    }

    @Test
    void testMetadataPreserved() {
        // Given
        Metadata metadata = new Metadata();
        metadata.put("source", "test.txt");
        Document doc = Document.from("Text content", metadata);

        // When
        List<TextSegment> segments = splitter.split(doc);

        // Then
        assertThat(segments).allMatch(s ->
            s.metadata().get("source").equals("test.txt")
        );
    }
}

Integration Testing RAG Pipeline

class RAGPipelineTest {
    private EmbeddingModel embeddingModel;
    private EmbeddingStore<TextSegment> embeddingStore;
    private DocumentSplitter splitter;

    @BeforeEach
    void setUp() {
        embeddingModel = new AllMiniLmL6V2EmbeddingModel();
        embeddingStore = new InMemoryEmbeddingStore<>();
        splitter = DocumentSplitters.recursive(300, 30,
            new OpenAiTokenizer("gpt-3.5-turbo"));
    }

    @Test
    void testCompleteRAGPipeline() {
        // Given: Load and split documents
        List<Document> docs = FileSystemDocumentLoader.loadDocuments(
            Path.of("src/test/resources/docs")
        );

        List<TextSegment> segments = docs.stream()
            .flatMap(doc -> splitter.split(doc).stream())
            .collect(Collectors.toList());

        // Index segments
        for (TextSegment segment : segments) {
            Embedding embedding = embeddingModel.embed(segment).content();
            embeddingStore.add(embedding, segment);
        }

        // When: Search
        String query = "What is document processing?";
        Embedding queryEmbedding = embeddingModel.embed(query).content();
        List<EmbeddingMatch<TextSegment>> matches =
            embeddingStore.findRelevant(queryEmbedding, 3);

        // Then: Verify results
        assertThat(matches).isNotEmpty();
        assertThat(matches).hasSizeLessThanOrEqualTo(3);
        assertThat(matches.get(0).score()).isGreaterThan(0.5);
        assertThat(matches.get(0).embedded().text()).containsIgnoringCase("document");
    }
}

Testing Error Handling

class ErrorHandlingTest {
    @Test
    void testLargeFileHandling() {
        // Given: Simulate large file
        Path largeFile = createLargeTestFile(1_000_000_000); // 1GB

        // When/Then: Should handle gracefully or throw OOME
        assertThatThrownBy(() ->
            FileSystemDocumentLoader.loadDocument(largeFile)
        ).isInstanceOfAny(OutOfMemoryError.class, IOException.class);

        // Cleanup
        Files.deleteIfExists(largeFile);
    }

    @Test
    void testInvalidEncodingHandling() {
        // Given: File with invalid UTF-8
        Path invalidFile = Path.of("src/test/resources/invalid-utf8.txt");

        // When: Load with UTF-8 parser
        Document doc = FileSystemDocumentLoader.loadDocument(invalidFile);

        // Then: Should contain replacement characters
        assertThat(doc.text()).contains("\uFFFD"); // Replacement character
    }

    @Test
    void testEmptyFileHandling() {
        // Given: Empty file
        Path emptyFile = Files.createTempFile("empty", ".txt");

        // When
        Document doc = FileSystemDocumentLoader.loadDocument(emptyFile);

        // Then
        assertThat(doc.text()).isEmpty();

        // Cleanup
        Files.deleteIfExists(emptyFile);
    }
}

Usage Examples

Loading Documents

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import java.nio.file.Path;
import java.util.List;

// Load single document
Document doc = FileSystemDocumentLoader.loadDocument(Path.of("/path/to/file.txt"));

// Load with custom parser
Document doc2 = FileSystemDocumentLoader.loadDocument(
    Path.of("/path/to/file.txt"),
    new TextDocumentParser()
);

// Load all documents from directory
List<Document> docs = FileSystemDocumentLoader.loadDocuments(Path.of("/path/to/dir"));

// Load recursively
List<Document> allDocs = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/dir")
);

Loading from Classpath

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.ClassPathDocumentLoader;

// Load single file from classpath
Document doc = ClassPathDocumentLoader.loadDocument("documents/guide.txt");

// Load all documents from classpath directory
List<Document> docs = ClassPathDocumentLoader.loadDocuments("documents");

// Load recursively
List<Document> allDocs = ClassPathDocumentLoader.loadDocumentsRecursively("documents");

Splitting Documents

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiTokenizer;
import java.util.List;

// Recommended: recursive splitter with token limits
DocumentSplitter splitter = DocumentSplitters.recursive(
    500, // max tokens per segment
    50,  // overlap tokens
    new OpenAiTokenizer()
);

List<TextSegment> segments = splitter.split(document);

// Simple: recursive splitter with character limits
DocumentSplitter charSplitter = DocumentSplitters.recursive(2000, 200);
List<TextSegment> charSegments = charSplitter.split(document);
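The overlap parameter means consecutive segments repeat each other's boundary characters, which helps preserve context across segment edges. A rough pure-JDK sketch of fixed-window splitting under character limits (langchain4j's recursive splitter is more sophisticated and prefers paragraph and sentence boundaries):

```java
import java.util.ArrayList;
import java.util.List;

public class CharSplitSketch {
    // Fixed-window splitting with overlap; a simplification of what
    // the real recursive splitter does
    static List<String> split(String text, int maxChars, int overlap) {
        List<String> segments = new ArrayList<>();
        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            segments.add(text.substring(start, end));
            if (end == text.length()) break;
        }
        return segments;
    }

    public static void main(String[] args) {
        // Each segment shares its last character with the next (overlap = 1)
        System.out.println(split("abcdefghij", 4, 1)); // [abcd, defg, ghij]
    }
}
```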

Custom Splitting Strategy

import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;
import dev.langchain4j.data.document.splitter.DocumentBySentenceSplitter;

// Split by paragraphs with custom sub-splitter
DocumentSplitter splitter = new DocumentByParagraphSplitter(
    1000, // max characters per segment
    100,  // overlap characters
    new DocumentBySentenceSplitter(1000, 100) // sub-splitter for large paragraphs
);

List<TextSegment> segments = splitter.split(document);

Complete RAG Pipeline Example

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.splitter.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// 1. Load documents
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/docs")
);

// 2. Split into segments
DocumentSplitter splitter = DocumentSplitters.recursive(300, 30); // character limits; pass a tokenizer for token-based limits
List<TextSegment> segments = new ArrayList<>();
for (Document doc : documents) {
    segments.addAll(splitter.split(doc));
}

// 3. Embed segments (assumes an EmbeddingModel `embeddingModel` configured elsewhere)
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
for (TextSegment segment : segments) {
    Embedding embedding = embeddingModel.embed(segment).content();
    embeddingStore.add(embedding, segment);
}

// 4. Use with AI service for RAG (assumes a configured chat model and Assistant interface)
ContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
    .embeddingStore(embeddingStore)
    .embeddingModel(embeddingModel)
    .maxResults(3)
    .build();

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(chatModel)
    .contentRetriever(contentRetriever)
    .build();
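At query time, EmbeddingStoreContentRetriever ranks stored segments by vector similarity to the query embedding and returns the top maxResults. A toy pure-JDK illustration of that ranking idea, with made-up two-dimensional vectors standing in for real embeddings:

```java
import java.util.Comparator;
import java.util.Map;

public class RetrievalSketch {
    // Cosine similarity between two vectors
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Toy "embedding store": segment text -> made-up 2-d vector
        Map<String, double[]> store = Map.of(
                "cats", new double[]{1.0, 0.1},
                "finance", new double[]{0.1, 1.0});
        double[] query = {0.9, 0.2}; // pretend embedding of a cat-related question

        // Like maxResults(1): pick the most similar segment
        String best = store.entrySet().stream()
                .max(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> cosine(query, e.getValue())))
                .orElseThrow()
                .getKey();
        System.out.println(best); // cats
    }
}
```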

Filtering Files by Type

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;

// Create matcher for text-like files, including those in subdirectories
PathMatcher textFiles = FileSystems.getDefault().getPathMatcher(
    "glob:**.{txt,md,java,py,js}" // "**" crosses directory boundaries; "*" would not
);

// Load only matching files
List<Document> docs = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/code"),
    textFiles,
    new TextDocumentParser()
);
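The glob semantics are worth a quick pure-JDK check: `*` does not cross directory separators, so for recursive loading a `**` pattern is usually what you want.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobDemo {
    public static void main(String[] args) {
        PathMatcher single = FileSystems.getDefault().getPathMatcher("glob:*.txt");
        PathMatcher crossing = FileSystems.getDefault().getPathMatcher("glob:**.txt");

        Path nested = Path.of("sub/dir/notes.txt");
        // "*" stops at directory separators; "**" crosses them
        System.out.println(single.matches(nested));   // false
        System.out.println(crossing.matches(nested)); // true
    }
}
```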

Parallel Processing Pattern

import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

// Load documents in parallel
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/docs")
);

// Split in parallel
ForkJoinPool customThreadPool = new ForkJoinPool(4);
List<TextSegment> allSegments = customThreadPool.submit(() ->
    documents.parallelStream()
        .flatMap(doc -> splitter.split(doc).stream())
        .collect(Collectors.toList())
).join();

customThreadPool.shutdown();
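The submit-to-a-custom-pool pattern above can be exercised with only the JDK; here a whitespace split stands in for the document splitter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class ParallelSplitDemo {
    public static void main(String[] args) throws Exception {
        List<String> docs = List.of("a b c", "d e", "f");

        // Bound parallelism by submitting the parallel stream to a custom pool
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            List<String> segments = pool.submit(() ->
                    docs.parallelStream()
                            .flatMap(d -> Arrays.stream(d.split(" ")))
                            .collect(Collectors.toList())
            ).get();
            System.out.println(segments.size()); // 6
        } finally {
            pool.shutdown();
        }
    }
}
```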

Handling Different Encodings

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Latin-1 encoded file
Document latin1Doc = FileSystemDocumentLoader.loadDocument(
    Path.of("latin1-file.txt"),
    new TextDocumentParser(StandardCharsets.ISO_8859_1)
);

// Windows-1252 encoded file
Document windowsDoc = FileSystemDocumentLoader.loadDocument(
    Path.of("windows-file.txt"),
    new TextDocumentParser(Charset.forName("Windows-1252"))
);
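Why the charset argument matters: the same bytes decode to different strings under different charsets. A pure-JDK illustration:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // Write "café" as Latin-1: the é becomes the single byte 0xE9
        Path f = Files.createTempFile("latin1", ".txt");
        Files.writeString(f, "caf\u00e9", StandardCharsets.ISO_8859_1);

        byte[] bytes = Files.readAllBytes(f);
        // 0xE9 is not valid UTF-8, so decoding with the wrong charset
        // replaces it with U+FFFD
        String right = new String(bytes, StandardCharsets.ISO_8859_1);
        String wrong = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9")); // true
        System.out.println(wrong.equals("caf\u00e9")); // false
        Files.deleteIfExists(f);
    }
}
```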

Custom Regex Splitter for Logs

import dev.langchain4j.data.document.splitter.DocumentByRegexSplitter;

// Split log file by timestamp entries
DocumentByRegexSplitter logSplitter = new DocumentByRegexSplitter(
    "\\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}",  // ISO timestamp pattern
    "\n",  // delimiter used to re-join parts when fitting segments
    2000,  // max chars per segment
    0      // no overlap for logs
);

Document logs = FileSystemDocumentLoader.loadDocument("application.log");
List<TextSegment> logEntries = logSplitter.split(logs);
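If you want to sanity-check the timestamp pattern independently of the splitter, plain `java.util.regex` works; a zero-width lookahead keeps each timestamp attached to its own entry:

```java
import java.util.regex.Pattern;

public class LogSplitDemo {
    public static void main(String[] args) {
        String log = "2024-01-01 10:00:00 started\n"
                + "2024-01-01 10:00:05 ready\n"
                + "2024-01-01 10:01:00 done";

        // Zero-width lookahead splits *before* each timestamp, so the
        // timestamp stays with its own entry instead of being consumed
        Pattern p = Pattern.compile("(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})");
        String[] entries = p.split(log);
        System.out.println(entries.length); // 3
    }
}
```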

Batch Embedding for Cost Efficiency

// Batch segments for embedding API calls
int batchSize = 100;
List<TextSegment> allSegments = splitter.split(document);

for (int i = 0; i < allSegments.size(); i += batchSize) {
    List<TextSegment> batch = allSegments.subList(
        i,
        Math.min(i + batchSize, allSegments.size())
    );

    // Embed entire batch in one API call
    List<Embedding> embeddings = embeddingModel.embedAll(batch).content();

    // Store embeddings; embeddingStore.addAll(embeddings, batch) does the same in one call
    for (int j = 0; j < batch.size(); j++) {
        embeddingStore.add(embeddings.get(j), batch.get(j));
    }
}
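The batching arithmetic generalizes to any list; a pure-JDK helper showing the subList windows (250 items in batches of 100 gives 3 batches, the last holding 50):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class BatchDemo {
    // Window a list into consecutive subList views of at most batchSize elements
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            result.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> items = IntStream.range(0, 250).boxed().toList();
        List<List<Integer>> b = batches(items, 100);
        System.out.println(b.size());        // 3
        System.out.println(b.get(2).size()); // 50
    }
}
```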

Related APIs

Document Loading:

  • FileSystemDocumentLoader - Load from file system
  • ClassPathDocumentLoader - Load from classpath
  • UrlDocumentLoader - Load from URLs
  • FileSystemSource, ClassPathSource, UrlSource - Document source abstractions

Document Parsing:

  • TextDocumentParser - Parse plain text
  • ApachePdfBoxDocumentParser - Parse PDF files (separate dependency)
  • ApacheTikaDocumentParser - Parse multiple formats (separate dependency)

Document Splitting:

  • DocumentSplitters - Factory for recursive splitters
  • DocumentByParagraphSplitter - Split by paragraphs
  • DocumentByLineSplitter - Split by lines
  • DocumentBySentenceSplitter - Split by sentences
  • DocumentByWordSplitter - Split by words
  • DocumentByCharacterSplitter - Split by characters
  • DocumentByRegexSplitter - Split by custom regex
  • HierarchicalDocumentSplitter - Base class for hierarchical splitting

Tokenization:

  • OpenAiTokenizer - OpenAI token counting
  • TokenCountEstimator - Interface for token estimation

Embedding:

  • EmbeddingModel - Generate embeddings
  • EmbeddingStore - Store and retrieve embeddings
  • EmbeddingStoreContentRetriever - RAG retrieval from embedding store

Data Types:

  • Document - Represents loaded document
  • TextSegment - Represents document segment
  • Metadata - Key-value metadata storage
  • Embedding - Vector embedding representation

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j@1.11.0
