tessl/maven-dev-langchain4j--langchain4j

Build LLM-powered applications in Java with support for chatbots, agents, RAG, tools, and much more

Document Processing

Loaders, parsers, splitters, and sources for working with documents. Supports loading from file system, classpath, and URLs, with various splitting strategies for creating text segments.

Capabilities

Document Loaders

FileSystemDocumentLoader

Load documents from the file system.

package dev.langchain4j.data.document.loader;

/**
 * DocumentLoader for loading documents from the file system
 */
public class FileSystemDocumentLoader {
    /**
     * Load a single document from path
     * @param filePath Path to the document
     * @return Loaded document
     */
    public static Document loadDocument(Path filePath);

    /**
     * Load a single document from string path
     * @param filePath String path to the document
     * @return Loaded document
     */
    public static Document loadDocument(String filePath);

    /**
     * Load a single document with custom parser
     * @param filePath Path to the document
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document loadDocument(Path filePath, DocumentParser documentParser);

    /**
     * Load a single document with custom parser from string path
     * @param filePath String path to the document
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document loadDocument(String filePath, DocumentParser documentParser);

    /**
     * Load all documents from directory (non-recursive)
     * @param directoryPath Path to directory
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(Path directoryPath);

    /**
     * Load all documents from directory with custom parser (non-recursive)
     * @param directoryPath Path to directory
     * @param documentParser Parser to use for all documents
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(Path directoryPath, DocumentParser documentParser);

    /**
     * Load matching documents from directory (non-recursive)
     * @param directoryPath Path to directory
     * @param pathMatcher Matcher to filter files
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(Path directoryPath, PathMatcher pathMatcher);

    /**
     * Load documents recursively from directory
     * @param directoryPath Path to directory
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(Path directoryPath);

    /**
     * Load documents recursively with matcher and parser
     * @param directoryPath Path to directory
     * @param pathMatcher Matcher to filter files
     * @param documentParser Parser to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(
        Path directoryPath,
        PathMatcher pathMatcher,
        DocumentParser documentParser
    );
}
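The `PathMatcher` overloads accept standard `java.nio` glob matchers. A minimal JDK-only sketch of building one (the file names here are illustrative; note the case-sensitivity pitfall below when running on different platforms):

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;

// Build the kind of PathMatcher that loadDocuments(Path, PathMatcher)
// accepts. "glob:*.txt" matches plain .txt file names; "glob:**.txt"
// would also match files inside subdirectories.
public class GlobFilterDemo {
    public static List<String> keepTxt(List<String> fileNames) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:*.txt");
        return fileNames.stream()
                .filter(name -> matcher.matches(Path.of(name)))
                .collect(Collectors.toList());
    }
}
```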

Thread Safety: All methods are static and stateless. Safe for concurrent use across threads. However, when loading the same file concurrently, OS-level file locks may apply. The Document objects returned are immutable after construction.

Common Pitfalls:

  • Loading binary files without appropriate parser causes encoding errors
  • Loading extremely large files (>1GB) can cause OutOfMemoryError
  • No built-in filtering for hidden files or system files (e.g., .DS_Store, Thumbs.db)
  • PathMatcher is case-sensitive on Linux/Unix, case-insensitive on Windows
  • Non-recursive methods silently skip subdirectories

Edge Cases:

  • Empty files return Document with empty text content
  • Symbolic links are followed by default; can cause infinite loops if circular
  • Files without read permissions throw AccessDeniedException
  • Non-existent paths throw NoSuchFileException
  • Directories passed to loadDocument() throw IOException
  • File modified during read may return inconsistent content

Performance Notes:

  • I/O bound operations; consider parallel processing for multiple files
  • loadDocumentsRecursively() walks entire tree before loading; use PathMatcher to filter early
  • Each file read opens a new FileInputStream; OS file descriptor limits apply
  • Default buffer size is 8KB; consider custom parsers for large files

Cost Considerations:

  • Large documents should be split before embedding; each segment incurs embedding API cost
  • Recursive loading of deep directories can load thousands of files unintentionally
  • Consider document count limits based on embedding budget

Exception Handling:

  • NoSuchFileException - File or directory does not exist
  • AccessDeniedException - Insufficient permissions to read file
  • IOException - Generic I/O errors (disk full, network mount issues)
  • OutOfMemoryError - File too large to load into memory
  • MalformedInputException - Invalid character encoding in file

Related APIs: ClassPathDocumentLoader, UrlDocumentLoader, FileSystemSource, TextDocumentParser


ClassPathDocumentLoader

Load documents from classpath resources.

package dev.langchain4j.data.document.loader;

/**
 * DocumentLoader implementation for loading documents using ClassPathSource
 */
public class ClassPathDocumentLoader {
    /**
     * Load document from classpath
     * @param pathOnClasspath Path to resource on classpath
     * @return Loaded document
     */
    public static Document loadDocument(String pathOnClasspath);

    /**
     * Load document from classpath with custom classloader
     * @param pathOnClasspath Path to resource on classpath
     * @param classLoader ClassLoader to use
     * @return Loaded document
     */
    public static Document loadDocument(String pathOnClasspath, ClassLoader classLoader);

    /**
     * Load document from classpath with custom parser
     * @param pathOnClasspath Path to resource on classpath
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document loadDocument(String pathOnClasspath, DocumentParser documentParser);

    /**
     * Load document from classpath with parser and classloader
     * @param pathOnClasspath Path to resource on classpath
     * @param documentParser Parser to use
     * @param classLoader ClassLoader to use
     * @return Loaded document
     */
    public static Document loadDocument(
        String pathOnClasspath,
        DocumentParser documentParser,
        ClassLoader classLoader
    );

    /**
     * Load all documents from directory on classpath (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath);

    /**
     * Load all documents from directory with custom classloader (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @param classLoader ClassLoader to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath, ClassLoader classLoader);

    /**
     * Load documents from directory with custom parser (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @param documentParser Parser to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath, DocumentParser documentParser);

    /**
     * Load matching documents from directory (non-recursive)
     * @param directoryOnClasspath Path to directory on classpath
     * @param pathMatcher Matcher to filter files
     * @return List of loaded documents
     */
    public static List<Document> loadDocuments(String directoryOnClasspath, PathMatcher pathMatcher);

    /**
     * Load documents recursively from directory
     * @param directoryOnClasspath Path to directory on classpath
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(String directoryOnClasspath);

    /**
     * Load documents recursively with matcher and parser
     * @param directoryOnClasspath Path to directory on classpath
     * @param pathMatcher Matcher to filter files
     * @param documentParser Parser to use
     * @return List of loaded documents
     */
    public static List<Document> loadDocumentsRecursively(
        String directoryOnClasspath,
        PathMatcher pathMatcher,
        DocumentParser documentParser
    );
}
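The leading-slash pitfall below can be demonstrated with the JDK alone, with no langchain4j classes involved. The `.class` file of `java.lang.String` is used here only because it is always locatable via the system class loader:

```java
// ClassLoader-relative resource paths must NOT start with "/".
// The same name with a leading slash is simply not found.
public class ClassPathSlashDemo {
    public static boolean found(String path) {
        return ClassLoader.getSystemClassLoader().getResource(path) != null;
    }
}
```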

Thread Safety: All methods are static and thread-safe. ClassLoader instances are typically thread-safe. Safe for concurrent loading of different resources. Loading same resource concurrently is safe but inefficient.

Common Pitfalls:

  • Path must NOT start with "/" in most cases (e.g., "documents/file.txt" not "/documents/file.txt")
  • Resources inside JARs are read-only; cannot modify
  • Wrong ClassLoader may not find resources (use Thread.currentThread().getContextClassLoader() if unsure)
  • Directory loading only works if JAR manifest includes directory entries
  • No built-in caching; same resource loaded multiple times reads from JAR each time

Edge Cases:

  • Resource not found yields a null URL internally; the failure surfaces as a NullPointerException when loading
  • Empty resources return Document with empty content
  • Resources in JAR files have no lastModified timestamp (uses JAR timestamp)
  • Nested JARs (JAR in JAR) may not be accessible depending on classloader
  • Resources from the file system vs a JAR have different URL schemes (file: vs jar:)

Performance Notes:

  • Loading from JAR requires decompression; slower than file system
  • Each load opens new InputStream; consider caching loaded documents
  • Recursive loading of large JARs can be slow; filter with PathMatcher
  • ClassLoader.getResources() enumerates all JARs on classpath

Cost Considerations:

  • Embedded resources increase JAR size and application startup time
  • Large resources in classpath affect Docker image size
  • Multiple copies of same resource across JARs waste memory when loaded

Exception Handling:

  • NullPointerException - Resource not found on classpath
  • IOException - Error reading from JAR file
  • IllegalArgumentException - Invalid path format
  • OutOfMemoryError - Resource too large to load

Related APIs: FileSystemDocumentLoader, ClassPathSource, UrlDocumentLoader


UrlDocumentLoader

Load documents from URLs.

package dev.langchain4j.data.document.loader;

/**
 * DocumentLoader for loading documents from URLs
 */
public class UrlDocumentLoader {
    /**
     * Load document from URL
     * @param url URL to load from
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document load(URL url, DocumentParser documentParser);

    /**
     * Load document from string URL
     * @param url String URL to load from
     * @param documentParser Parser to use
     * @return Loaded document
     */
    public static Document load(String url, DocumentParser documentParser);
}
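For the special-character pitfall below, the multi-argument `java.net.URI` constructor percent-encodes illegal characters, producing a valid URL string before it is handed to the loader. A JDK-only sketch (host and path are made up):

```java
import java.net.URI;
import java.net.URISyntaxException;

// The multi-argument URI constructor quotes characters that are illegal
// in a URL, e.g. spaces become %20.
public class UrlEncodeDemo {
    public static String encode(String host, String path) throws URISyntaxException {
        return new URI("https", host, path, null).toString();
    }
}
```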

Thread Safety: Static methods are thread-safe. However, underlying HTTP client uses default configuration which may have connection pool limits. Concurrent loads share connection pool.

Common Pitfalls:

  • No timeout configuration; may hang indefinitely on slow connections
  • No retry logic; transient network failures cause immediate exception
  • No authentication support; protected URLs fail with 401/403
  • Redirects followed automatically; may load unexpected content
  • No content-type validation; binary files may be parsed as text
  • Large files loaded into memory entirely; can cause OutOfMemoryError

Edge Cases:

  • HTTP 404/500 errors throw IOException
  • HTTPS with invalid certificates throw SSLException
  • URLs with special characters need proper encoding
  • Data URLs (data:text/plain;base64,...) are supported if URL class handles them
  • File URLs (file://) work, but prefer FileSystemDocumentLoader for better error handling
  • Empty response body returns Document with empty content

Performance Notes:

  • Network I/O bound; much slower than file system
  • No connection pooling configuration exposed
  • Each load creates new connection; consider caching
  • Large downloads limited by available memory
  • DNS resolution occurs for each unique hostname

Cost Considerations:

  • External API calls may have rate limits or usage costs
  • Cloud storage URLs (S3, GCS) may incur egress charges
  • Large document downloads consume bandwidth

Exception Handling:

  • MalformedURLException - Invalid URL format
  • IOException - Network errors, HTTP errors (404, 500)
  • UnknownHostException - DNS resolution failure
  • SocketTimeoutException - Connection or read timeout
  • SSLException - HTTPS certificate validation failure
  • OutOfMemoryError - Response too large for memory

Related APIs: UrlSource, FileSystemDocumentLoader, ClassPathDocumentLoader


Document Parsers

TextDocumentParser

Parse plain text documents.

package dev.langchain4j.data.document.parser;

/**
 * DocumentParser implementation for parsing plain text documents
 */
public class TextDocumentParser implements DocumentParser {
    /**
     * Constructor with default UTF-8 charset
     */
    public TextDocumentParser();

    /**
     * Constructor with custom charset
     * @param charset Charset to use for reading text
     */
    public TextDocumentParser(Charset charset);

    /**
     * Parse input stream into document
     * @param inputStream Input stream to parse
     * @return Parsed document
     */
    public Document parse(InputStream inputStream);
}
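The encoding pitfalls below are easy to reproduce with the JDK alone: the byte 0xE9 is "é" in ISO-8859-1 but an invalid sequence in UTF-8, where it decodes to the replacement character U+FFFD. Passing the correct `Charset` to the `TextDocumentParser(Charset)` constructor avoids this; the sketch uses plain `new String(...)` to show the underlying behavior:

```java
import java.nio.charset.Charset;

// Decoding the same bytes with the wrong charset silently produces
// replacement characters rather than throwing.
public class CharsetDemo {
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }
}
```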

Thread Safety: Instances are stateless and thread-safe. Safe to share single instance across threads. Parse method is reentrant.

Common Pitfalls:

  • Default UTF-8 may fail on files with different encoding (e.g., Windows-1252, ISO-8859-1)
  • Binary files parsed as text produce gibberish with replacement characters
  • BOM (Byte Order Mark) included in parsed content if present
  • No line ending normalization; mixed \r\n, \n, \r preserved as-is
  • Entire content read into memory; large files cause OutOfMemoryError

Edge Cases:

  • Empty InputStream returns Document with empty text
  • InputStream with only whitespace returns Document with whitespace
  • Invalid byte sequences replaced with � (U+FFFD) in UTF-8
  • InputStream not at position 0 reads from current position
  • InputStream closed after parsing; reuse requires new stream

Performance Notes:

  • Reads entire stream into memory using BufferedReader
  • 8KB default buffer size; performance degrades for very large files
  • Character decoding CPU-intensive for large files
  • No streaming support; entire document must fit in memory

Cost Considerations:

  • Memory usage is roughly 2× file size (raw bytes plus the decoded char array)
  • Large files (>100MB) better split at file system level before loading

Exception Handling:

  • IOException - Stream read errors
  • MalformedInputException - Invalid character encoding
  • UnmappableCharacterException - Characters not supported in charset
  • OutOfMemoryError - File too large for available heap

Related APIs: DocumentParser interface, ApachePdfBoxParser, ApacheTikaParser, TextDocumentParser subclasses


Document Sources

FileSystemSource

Document source for file system files.

package dev.langchain4j.data.document.source;

/**
 * DocumentSource for file system sources
 */
public class FileSystemSource implements DocumentSource {
    /**
     * Constructor
     * @param path Path to file
     */
    public FileSystemSource(Path path);

    /**
     * Create from path
     * @param filePath Path to file
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(Path filePath);

    /**
     * Create from string path
     * @param filePath String path to file
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(String filePath);

    /**
     * Create from URI
     * @param fileUri URI to file
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(URI fileUri);

    /**
     * Create from File
     * @param file File object
     * @return FileSystemSource instance
     */
    public static FileSystemSource from(File file);

    /**
     * Get input stream
     * @return InputStream for reading file
     */
    public InputStream inputStream();

    /**
     * Get metadata
     * @return Metadata for the source
     */
    public Metadata metadata();
}
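Since `inputStream()` returns an unbuffered stream that the caller must close, the safe consumption pattern is try-with-resources around a `BufferedInputStream`. A JDK-only sketch of that pattern, with `Files.newInputStream` standing in for `FileSystemSource.inputStream()` (which behaves the same way for this purpose):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Wrap the raw stream for buffering and let try-with-resources close it,
// so no file descriptor leaks even if reading throws.
public class BufferedReadDemo {
    public static int countBytes(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int count = 0;
            while (in.read() != -1) count++;
            return count;
        }
    }
}
```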

Thread Safety: Immutable after construction. Safe to share across threads. Each inputStream() call creates new FileInputStream, allowing concurrent reads.

Common Pitfalls:

  • InputStream must be closed by caller; resource leak if forgotten
  • Multiple inputStream() calls on same instance each open new file handle
  • Metadata extracted only once at construction; changes to file not reflected
  • Symbolic links resolved automatically; metadata reflects target file
  • Relative paths resolved against JVM working directory, not classpath

Edge Cases:

  • File deleted between construction and inputStream() throws NoSuchFileException
  • File modified between construction and read may have inconsistent metadata
  • Empty files return valid InputStream with 0 bytes available
  • Directories passed to constructor throw IOException on inputStream()
  • Files without read permission throw AccessDeniedException

Performance Notes:

  • Metadata extraction requires stat() system call at construction
  • Each inputStream() opens new file descriptor; OS limits apply (typically 1024-4096)
  • No buffering applied; wrap with BufferedInputStream for better performance
  • Network mounted files (NFS, SMB) have high latency

Cost Considerations:

  • File descriptor leaks prevent other files from being opened
  • Large files should be streamed, not loaded entirely into memory

Exception Handling:

  • NoSuchFileException - File does not exist
  • AccessDeniedException - Insufficient permissions
  • IOException - Generic I/O errors
  • FileSystemException - File system specific errors

Related APIs: FileSystemDocumentLoader, UrlSource, ClassPathSource, DocumentSource interface


ClassPathSource

Document source for classpath resources.

package dev.langchain4j.data.document.source;

/**
 * DocumentSource specialization that reads from classpath
 */
public class ClassPathSource implements DocumentSource {
    /**
     * Create from classpath resource
     * @param classPathResource Path to resource on classpath
     * @return ClassPathSource instance
     */
    public static ClassPathSource from(String classPathResource);

    /**
     * Create with custom classloader
     * @param classPathResource Path to resource on classpath
     * @param classLoader ClassLoader to use
     * @return ClassPathSource instance
     */
    public static ClassPathSource from(String classPathResource, ClassLoader classLoader);

    /**
     * Get the URL
     * @return URL of the resource
     */
    public URL url();

    /**
     * Get the classloader
     * @return ClassLoader used
     */
    public ClassLoader classLoader();

    /**
     * Check if inside archive (JAR)
     * @return true if resource is inside a JAR file
     */
    public boolean isInsideArchive();

    /**
     * Get input stream
     * @return InputStream for reading resource
     */
    public InputStream inputStream();

    /**
     * Get metadata
     * @return Metadata for the source
     */
    public Metadata metadata();
}
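A rough approximation of what `isInsideArchive()` reports can be read off the resource URL's scheme: resources served from a JAR use the "jar" scheme, JDK-internal resources use "jrt", and exploded directories use "file". This is an illustration of the idea, not the library's actual check:

```java
import java.net.URL;

// Inspect the URL scheme of a classpath resource to guess whether it
// lives inside an archive.
public class SchemeDemo {
    public static String scheme(String resource) {
        URL url = ClassLoader.getSystemClassLoader().getResource(resource);
        return url == null ? null : url.getProtocol();
    }
}
```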

Thread Safety: Immutable after construction. Thread-safe for concurrent access. Each inputStream() call creates independent stream.

Common Pitfalls:

  • Resource path must NOT start with "/" for most classloaders
  • Fails silently if resource not found (returns null, then NPE on parse)
  • Wrong classloader won't find resources in specific JARs
  • Resources inside nested JARs may not be accessible
  • isInsideArchive() only checks JAR, not ZIP or other archives

Edge Cases:

  • Resource not found throws NullPointerException on inputStream()
  • Empty resources return valid InputStream with 0 bytes
  • Resources from exploded directories vs JARs have different URL schemes
  • Metadata lastModified uses JAR timestamp, not resource timestamp
  • ClassLoader hierarchy may find different resource than expected

Performance Notes:

  • Resources in JARs require ZIP decompression
  • Repeated access to same resource re-reads from JAR each time
  • No caching at framework level; consider caching loaded Documents
  • isInsideArchive() parses URL string; mildly expensive

Cost Considerations:

  • JAR resources increase application package size
  • Large resources in classpath affect startup time and memory

Exception Handling:

  • NullPointerException - Resource not found on classpath
  • IOException - Error reading from JAR
  • IllegalArgumentException - Invalid resource path
  • OutOfMemoryError - Resource too large

Related APIs: ClassPathDocumentLoader, FileSystemSource, UrlSource


UrlSource

Document source for URLs.

package dev.langchain4j.data.document.source;

/**
 * DocumentSource for URL sources
 */
public class UrlSource implements DocumentSource {
    /**
     * Constructor
     * @param url URL to load from
     */
    public UrlSource(URL url);

    /**
     * Create from string URL
     * @param url String URL
     * @return UrlSource instance
     */
    public static UrlSource from(String url);

    /**
     * Create from URL
     * @param url URL object
     * @return UrlSource instance
     */
    public static UrlSource from(URL url);

    /**
     * Create from URI
     * @param uri URI object
     * @return UrlSource instance
     */
    public static UrlSource from(URI uri);

    /**
     * Get input stream
     * @return InputStream for reading from URL
     */
    public InputStream inputStream();

    /**
     * Get metadata
     * @return Metadata for the source
     */
    public Metadata metadata();
}
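A common workaround for the missing-timeout pitfall below is to open the connection yourself, set explicit timeouts, and hand the resulting stream to a parser directly instead of going through `UrlSource`. A hedged JDK-only sketch (the URL is made up; `openConnection()` performs no network I/O, so the configuration is applied before any request is sent):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Configure explicit timeouts before any bytes are fetched.
public class TimeoutDemo {
    public static HttpURLConnection configured(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5_000); // fail after 5s if no TCP connection
        conn.setReadTimeout(10_000);   // fail if the server stalls mid-response
        return conn;
    }
}
```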

Thread Safety: Immutable after construction. Thread-safe for concurrent access. Each inputStream() call makes new HTTP request.

Common Pitfalls:

  • No timeout configuration; may hang on slow connections
  • No authentication support built-in
  • No retry logic for transient failures
  • HTTP errors (404, 500) not detected until inputStream() called
  • Redirects followed automatically; may fetch unexpected URL
  • No connection pooling; each inputStream() opens new connection

Edge Cases:

  • HTTP 404/500 throw IOException from inputStream()
  • Empty response body returns valid InputStream with 0 bytes
  • HTTPS with invalid certificate throws SSLException
  • Network unreachable throws UnknownHostException
  • URL connection timeout defaults to infinite

Performance Notes:

  • Network latency much higher than file system
  • Each inputStream() call makes new HTTP request; very inefficient for multiple reads
  • No caching; same URL fetched repeatedly
  • Large responses loaded entirely into memory by some parsers

Cost Considerations:

  • External API calls may have rate limits or per-request costs
  • Cloud storage egress charges for S3, GCS, Azure Blob
  • Bandwidth costs for large documents

Exception Handling:

  • MalformedURLException - Invalid URL format
  • IOException - Network errors, HTTP errors
  • UnknownHostException - DNS failure
  • SSLException - HTTPS certificate errors
  • SocketTimeoutException - Connection timeout
  • OutOfMemoryError - Response too large

Related APIs: UrlDocumentLoader, FileSystemSource, ClassPathSource


Document Splitters

DocumentSplitters Utility

Factory methods for recommended document splitters.

package dev.langchain4j.data.document.splitter;

/**
 * Utility class providing factory methods for recommended document splitters
 */
public class DocumentSplitters {
    /**
     * Create recursive splitter with token limits (recommended for generic text)
     * Splits by paragraphs, then lines, then sentences, then words, then characters
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     * @return Configured document splitter
     */
    public static DocumentSplitter recursive(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Create recursive splitter with character limits
     * Splits by paragraphs, then lines, then sentences, then words, then characters
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @return Configured document splitter
     */
    public static DocumentSplitter recursive(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars
    );
}
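The cost notes below can be made concrete with a rough, hypothetical estimator (not part of the library): with ideal packing, each segment after the first contributes `maxSegment - overlap` new tokens, so the segment count is approximately `ceil((total - overlap) / (maxSegment - overlap))`:

```java
// Back-of-the-envelope segment count under ideal packing; real splitters
// produce more segments because they break at paragraph/sentence
// boundaries rather than exact token positions.
public class SegmentEstimate {
    public static int estimate(int totalTokens, int maxSegment, int overlap) {
        if (totalTokens <= maxSegment) return 1;
        int step = maxSegment - overlap;
        return (int) Math.ceil((totalTokens - overlap) / (double) step);
    }
}
```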

Thread Safety: Factory methods are static and thread-safe. Returned DocumentSplitter instances are stateless and thread-safe. Safe to share splitter instance across threads.

Common Pitfalls:

  • Token-based splitting requires TokenCountEstimator; forgetting causes NPE
  • Overlap size >= segment size causes infinite loop or empty segments
  • Character count != token count; models have token limits not character limits
  • OpenAI tokenizers differ (GPT-3.5 vs GPT-4); use matching tokenizer
  • Recursive splitting can be slow on very large documents (>1MB)

Edge Cases:

  • Document with no paragraph/line breaks falls back to sentence splitting
  • Document with no sentence breaks falls back to word splitting
  • Single long word exceeding maxSegmentSize splits by characters
  • Empty document returns empty list of segments
  • Document smaller than maxSegmentSize returns single segment

Performance Notes:

  • Token-based splitting requires tokenization; 10-50x slower than character-based
  • Sentence detection uses Apache OpenNLP; loads model on first use (~50ms)
  • Recursive strategy tries each level; worst case processes text 5 times
  • Large overlap ratios cause redundant processing and storage

Cost Considerations:

  • More segments = more embedding API calls = higher cost
  • Overlap increases total token count sent to embedding API
  • Smaller segments improve retrieval precision but increase storage and costs
  • Typical sweet spot: 300-500 tokens per segment, 10% overlap

Exception Handling:

  • IllegalArgumentException - Invalid parameters (negative sizes, overlap > segment size)
  • NullPointerException - Null tokenizer for token-based splitting
  • OutOfMemoryError - Document too large with very small segment size

Related APIs: DocumentByParagraphSplitter, DocumentBySentenceSplitter, HierarchicalDocumentSplitter


HierarchicalDocumentSplitter

Base class for hierarchical document splitters.

package dev.langchain4j.data.document.splitter;

/**
 * Base class for hierarchical document splitters
 * Provides machinery for sub-splitting documents when a single segment is too long
 */
public abstract class HierarchicalDocumentSplitter implements DocumentSplitter {
    /**
     * Split document into segments
     * @param document Document to split
     * @return List of text segments
     */
    public List<TextSegment> split(Document document);

    /**
     * Split text implementation (abstract)
     * @param text Text to split
     * @return Array of split parts
     */
    protected abstract String[] split(String text);

    /**
     * Get join delimiter (abstract)
     * @return Delimiter used to join parts
     */
    protected abstract String joinDelimiter();

    /**
     * Get default sub-splitter (abstract)
     * @return Default sub-splitter to use if segment is too large
     */
    protected abstract DocumentSplitter defaultSubSplitter();

    /**
     * Get overlap region at end of segment
     * @param segmentText Segment text
     * @return Overlap text
     */
    protected String overlapFrom(String segmentText);

    /**
     * Estimate size in tokens or characters
     * @param text Text to estimate
     * @return Estimated size
     */
    protected int estimateSize(String text);

    /**
     * Create segment with metadata
     * @param text Segment text
     * @param document Source document
     * @param index Segment index
     * @return Text segment with metadata
     */
    protected static TextSegment createSegment(String text, Document document, int index);
}
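The core repacking machinery can be illustrated with a simplified JDK-only sketch: split the text into parts, then greedily fit as many parts as possible into each segment, joining them with the delimiter. This is an approximation (character-based sizes, no overlap, no sub-splitting); the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Greedily pack parts into segments of at most maxChars characters,
// joined by the given delimiter. A single part longer than maxChars is
// kept whole here; the real splitter would delegate it to a sub-splitter.
public class GreedyPacker {
    public static List<String> pack(String[] parts, String delimiter, int maxChars) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String part : parts) {
            int extra = current.length() == 0 ? part.length()
                                              : delimiter.length() + part.length();
            if (current.length() > 0 && current.length() + extra > maxChars) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(delimiter);
            current.append(part);
        }
        if (current.length() > 0) segments.add(current.toString());
        return segments;
    }
}
```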

Thread Safety: Implementations are stateless and thread-safe if TokenCountEstimator is thread-safe. Safe to share across threads for splitting different documents concurrently.

Common Pitfalls:

  • Sub-splitter must handle segments that still exceed maxSegmentSize
  • Infinite recursion possible if sub-splitter doesn't make progress
  • Metadata copied to all segments; large metadata multiplies memory usage
  • Overlap calculation at boundary may cut words/sentences mid-way
  • Join delimiter added between parts; affects final character/token count

Edge Cases:

  • Zero-length segments filtered out automatically
  • Segment exactly at maxSegmentSize does not trigger sub-splitting
  • Empty document returns empty list
  • Document with only whitespace may produce empty segments
  • Very small maxSegmentSize (< 10) may cause all text to be dropped

Performance Notes:

  • Recursive sub-splitting can process text multiple times
  • Overlap extraction scans segment from end; O(segment_length)
  • Creating segments with metadata involves string copying
  • Token counting called repeatedly; cache if possible

Cost Considerations:

  • More aggressive splitting (smaller segments) = more embedding calls
  • Overlap duplicates content in embeddings; increases storage and cost

Exception Handling:

  • IllegalArgumentException - Invalid configuration (overlap > segment size)
  • StackOverflowError - Infinite sub-splitting recursion
  • OutOfMemoryError - Too many segments generated

Related APIs: DocumentSplitters, DocumentByParagraphSplitter, DocumentBySentenceSplitter


DocumentByParagraphSplitter

Split documents by paragraphs.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into paragraphs and fits as many as possible into a single TextSegment
 * Paragraph boundaries detected by double newlines
 * Default sub-splitter is DocumentBySentenceSplitter
 */
public class DocumentByParagraphSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByParagraphSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large paragraphs
     */
    public DocumentByParagraphSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByParagraphSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Split text by paragraphs
     * @param text Text to split
     * @return Array of paragraphs
     */
    protected String[] split(String text);

    /**
     * Get join delimiter
     * @return "\n\n" (double newline)
     */
    protected String joinDelimiter();

    /**
     * Get default sub-splitter
     * @return DocumentBySentenceSplitter instance
     */
    protected DocumentSplitter defaultSubSplitter();
}

Thread Safety: Stateless and thread-safe. Safe to share instance across threads. Token counter must be thread-safe if used.
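
The paragraph boundary rule can be illustrated in plain Java. This sketch mimics the documented behavior (split on two or more newlines, filter empty paragraphs); it is an illustration, not the library's internal implementation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParagraphBoundaryDemo {
    // Two or more consecutive newlines mark a paragraph boundary;
    // a single newline stays inside a paragraph.
    public static List<String> paragraphs(String text) {
        return Arrays.stream(text.split("\\n{2,}"))
                .map(String::strip)
                .filter(p -> !p.isEmpty()) // empty paragraphs filtered out
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "First paragraph,\nstill first.\n\nSecond.\n\n\nThird.";
        // Three paragraphs: the single \n stays inside the first,
        // and "\n\n\n" splits the same way as "\n\n"
        System.out.println(paragraphs(text).size()); // 3
    }
}
```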

Common Pitfalls:

  • Paragraphs detected by "\n\n" only; single newline not recognized
  • Mixed line endings (\r\n, \n, \r) may not be handled consistently
  • Documents without paragraph breaks treated as single paragraph, fall back to sub-splitter
  • Very long paragraphs (> maxSegmentSize) always trigger sub-splitting
  • Trailing whitespace in paragraphs preserved; affects size calculations

Edge Cases:

  • Document with only "\n\n" produces empty paragraphs (filtered out)
  • Three or more newlines treated same as two (paragraph boundary)
  • Paragraph consisting of only whitespace may be preserved or dropped
  • Single line document with no "\n\n" falls back to sentence splitting
  • Empty paragraphs filtered automatically

Performance Notes:

  • Paragraph detection via string split is fast (O(n))
  • Sub-splitting large paragraphs more expensive (sentence detection)
  • Token-based limits require tokenizing each paragraph
  • Character-based limits are ~10x faster than token-based

Cost Considerations:

  • Paragraph boundaries preserve semantic coherence; better for RAG quality
  • Fewer segments than sentence splitting = lower embedding costs
  • Overlap at paragraph level includes entire paragraphs in adjacent segments

Exception Handling:

  • IllegalArgumentException - Invalid size parameters
  • NullPointerException - Null text or tokenizer
  • OutOfMemoryError - Too many small paragraphs with large document

Related APIs: DocumentBySentenceSplitter, DocumentByLineSplitter, HierarchicalDocumentSplitter


DocumentByLineSplitter

Split documents by lines.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into lines and fits as many as possible into a single TextSegment
 * Line boundaries detected by newline characters
 * Default sub-splitter is DocumentBySentenceSplitter
 */
public class DocumentByLineSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByLineSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large lines
     */
    public DocumentByLineSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByLineSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Split text by lines
     * @param text Text to split
     * @return Array of lines
     */
    protected String[] split(String text);

    /**
     * Get join delimiter
     * @return "\n" (newline)
     */
    protected String joinDelimiter();

    /**
     * Get default sub-splitter
     * @return DocumentBySentenceSplitter instance
     */
    protected DocumentSplitter defaultSubSplitter();
}

Thread Safety: Stateless and thread-safe. Safe for concurrent use. TokenCountEstimator must be thread-safe.
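
Because mixed line endings are a recurring pitfall for line-based splitting, a common precaution is to normalize them before loading text into a splitter. A plain-Java sketch (assuming \n is treated as the line boundary):

```java
public class LineEndingDemo {
    // Normalize CRLF and bare CR to LF so line-based splitting is consistent
    public static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windowsText = "row1\r\nrow2\r\nrow3";
        String[] lines = normalize(windowsText).split("\n");
        System.out.println(lines.length); // 3 clean lines, no trailing \r
    }
}
```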

Common Pitfalls:

  • Mixed line endings (\r\n vs \n) may cause inconsistent splits
  • Empty lines preserved as empty segments (then filtered)
  • Very long lines (e.g., minified code) exceed maxSegmentSize and trigger sub-splitting
  • Good for structured data (CSV, logs); poor for prose text
  • Windows CRLF (\r\n) endings leave a trailing \r on each line when only \n is treated as the boundary; normalize line endings first

Edge Cases:

  • Empty lines filtered automatically
  • Lines with only whitespace may be preserved
  • No trailing newline: last line still included
  • Consecutive newlines create empty line segments (filtered)
  • Single character per line with small maxSegmentSize causes many segments

Performance Notes:

  • Line splitting is very fast (O(n) string split)
  • Good for line-oriented formats (logs, CSV, code)
  • Sub-splitting long lines uses sentence detection (slower)

Cost Considerations:

  • Line-based splitting often creates more segments than paragraph-based
  • Good for structured data where lines are semantic units
  • Poor for prose where sentences span multiple lines

Exception Handling:

  • IllegalArgumentException - Invalid parameters
  • NullPointerException - Null input
  • OutOfMemoryError - Too many lines

Related APIs: DocumentByParagraphSplitter, DocumentBySentenceSplitter, HierarchicalDocumentSplitter


DocumentBySentenceSplitter

Split documents by sentences.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into sentences and fits as many as possible into a single TextSegment
 * Uses Apache OpenNLP for sentence detection
 */
public class DocumentBySentenceSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentBySentenceSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large sentences
     */
    public DocumentBySentenceSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentBySentenceSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );
}

Thread Safety: Sentence detector is NOT thread-safe (OpenNLP limitation). Do NOT share instance across threads. Create one instance per thread or synchronize access.
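
Since the sentence detector cannot be shared across threads, the usual workaround is one splitter instance per thread. A generic sketch of that pattern; the supplier passed in would be the real constructor call, e.g. `() -> new DocumentBySentenceSplitter(500, 50)`:

```java
import java.util.function.Supplier;

// Lazily creates one instance per thread and reuses it on later calls,
// so no two threads ever touch the same underlying object.
public class PerThreadHolder<T> {
    private final ThreadLocal<T> holder;

    public PerThreadHolder(Supplier<T> factory) {
        this.holder = ThreadLocal.withInitial(factory);
    }

    public T get() {
        return holder.get();
    }
}
```

With this in place, calling `splitters.get().split(doc)` inside a parallel stream never shares a sentence detector between threads.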

Common Pitfalls:

  • Requires Apache OpenNLP dependency; missing JAR causes ClassNotFoundException
  • Sentence model (en-sent.bin) loaded lazily; first split is slow (~50ms)
  • Not thread-safe; concurrent splits corrupt sentence detector state
  • Abbreviations (Dr., Inc.) may cause incorrect sentence boundaries
  • Languages other than English not supported by default model
  • URLs and emails containing periods may split incorrectly

Edge Cases:

  • Single sentence document returns single segment
  • Document with no sentence endings falls back to word splitting
  • Sentence exceeding maxSegmentSize triggers sub-splitter
  • Empty sentences after whitespace trimming filtered out
  • Ellipsis (...) may or may not be sentence boundary

Performance Notes:

  • Sentence detection ~10x slower than paragraph/line splitting
  • OpenNLP model loaded once and cached
  • Good for natural language text; overkill for structured data
  • Sub-splitting long sentences uses word splitter (fast)

Cost Considerations:

  • More segments than paragraph splitting = higher embedding costs
  • Better semantic coherence improves retrieval quality
  • Overlap at sentence level provides good context

Exception Handling:

  • ClassNotFoundException - OpenNLP dependency missing
  • IOException - Sentence model file not found
  • IllegalArgumentException - Invalid parameters
  • ConcurrentModificationException - Concurrent access (not thread-safe)

Related APIs: DocumentByWordSplitter, DocumentByParagraphSplitter, HierarchicalDocumentSplitter


DocumentByWordSplitter

Split documents by words.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into words and fits as many as possible into a single TextSegment
 */
public class DocumentByWordSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByWordSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large words
     */
    public DocumentByWordSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByWordSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );
}

Thread Safety: Stateless and thread-safe. Safe for concurrent use across threads.
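
The whitespace boundary rule described here, sketched in plain Java (an illustration of the documented behavior, not the library's code):

```java
import java.util.Arrays;
import java.util.List;

public class WordBoundaryDemo {
    // Any run of whitespace is one boundary; punctuation stays attached
    public static List<String> words(String text) {
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return List.of(); // whitespace-only input yields no words
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(words("don't   stop self-driving cars."));
        // [don't, stop, self-driving, cars.]
    }
}
```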

Common Pitfalls:

  • Word boundaries defined by whitespace; punctuation attached to words
  • Contractions (don't, I'll) treated as single words
  • Hyphenated words (e.g., self-driving) treated as single word
  • Very long words (URLs, base64) may exceed maxSegmentSize and trigger character splitting
  • Multiple spaces between words collapsed to single space in output

Edge Cases:

  • Empty string returns empty list
  • String with only whitespace returns empty list
  • Single word exceeding maxSegmentSize triggers character sub-splitter
  • Consecutive whitespace treated as single word boundary
  • Punctuation-only "words" preserved

Performance Notes:

  • Very fast; simple whitespace split (O(n))
  • Sub-splitter (character) only invoked for extremely long words
  • Good fallback when sentence detection fails

Cost Considerations:

  • Breaks sentence coherence; worse for semantic search
  • Many segments = higher costs
  • Rarely used as primary splitter; usually sub-splitter

Exception Handling:

  • IllegalArgumentException - Invalid parameters
  • NullPointerException - Null input

Related APIs: DocumentByCharacterSplitter, DocumentBySentenceSplitter, HierarchicalDocumentSplitter


DocumentByCharacterSplitter

Split documents by characters.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents into characters and fits as many as possible into a single TextSegment
 * Supports character or token-based limits
 */
public class DocumentByCharacterSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByCharacterSplitter(int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter (typically null for character splitter)
     */
    public DocumentByCharacterSplitter(
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByCharacterSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );

    /**
     * Full constructor with token limits and sub-splitter
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     * @param subSplitter Sub-splitter (typically null)
     */
    public DocumentByCharacterSplitter(
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator,
        DocumentSplitter subSplitter
    );

    /**
     * Split text implementation
     * @param text Text to split
     * @return Array of characters as strings
     */
    protected String[] split(String text);

    /**
     * Get join delimiter
     * @return "" (empty string)
     */
    protected String joinDelimiter();

    /**
     * Get default sub-splitter
     * @return null (no sub-splitter)
     */
    protected DocumentSplitter defaultSubSplitter();
}

Thread Safety: Stateless and thread-safe. Safe for concurrent use.

Common Pitfalls:

  • Destroys all semantic structure (words, sentences, paragraphs)
  • Poor for retrieval quality; segments often meaningless
  • Sizes count Java chars (UTF-16 code units), not bytes or user-visible characters
  • Emoji and special Unicode may span multiple Java chars (surrogate pairs)
  • No default sub-splitter; maxSegmentSize is effectively a hard limit

Edge Cases:

  • Empty string returns empty list
  • Single character document returns single segment
  • Segment size = 1 splits every character separately
  • Unicode surrogate pairs may be split incorrectly
  • Overlap must be < segment size to make progress
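
The surrogate-pair caveat is easy to see in plain Java: character counts measure UTF-16 code units, so a naive cut can land inside an emoji:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "a😀b"; // the emoji is one code point but two Java chars
        System.out.println(s.length());                      // 4
        System.out.println(s.codePointCount(0, s.length())); // 3
        // A cut between index 1 and 2 would separate the emoji's two halves:
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```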

Performance Notes:

  • Extremely fast; no parsing logic
  • Last resort fallback in hierarchical splitters
  • Rarely used as primary splitter

Cost Considerations:

  • Destroys semantic meaning; worst for RAG quality
  • Use only when all other splitters fail (e.g., binary data as text)

Exception Handling:

  • IllegalArgumentException - Invalid parameters (overlap >= segment size)
  • NullPointerException - Null input

Related APIs: DocumentByWordSplitter, HierarchicalDocumentSplitter


DocumentByRegexSplitter

Split documents using custom regex pattern.

package dev.langchain4j.data.document.splitter;

/**
 * Splits documents using a custom regex pattern
 */
public class DocumentByRegexSplitter extends HierarchicalDocumentSplitter {
    /**
     * Constructor with character limits
     * @param regex Regular expression pattern for splitting
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     */
    public DocumentByRegexSplitter(String regex, int maxSegmentSizeInChars, int maxOverlapSizeInChars);

    /**
     * Constructor with sub-splitter
     * @param regex Regular expression pattern for splitting
     * @param maxSegmentSizeInChars Maximum segment size in characters
     * @param maxOverlapSizeInChars Maximum overlap size in characters
     * @param subSplitter Sub-splitter to use for large segments
     */
    public DocumentByRegexSplitter(
        String regex,
        int maxSegmentSizeInChars,
        int maxOverlapSizeInChars,
        DocumentSplitter subSplitter
    );

    /**
     * Constructor with token limits
     * @param regex Regular expression pattern for splitting
     * @param maxSegmentSizeInTokens Maximum segment size in tokens
     * @param maxOverlapSizeInTokens Maximum overlap size in tokens
     * @param tokenCountEstimator Token count estimator
     */
    public DocumentByRegexSplitter(
        String regex,
        int maxSegmentSizeInTokens,
        int maxOverlapSizeInTokens,
        TokenCountEstimator tokenCountEstimator
    );
}

Thread Safety: Pattern compiled at construction time. Thread-safe if Pattern.split() is used correctly (stateless). Safe for concurrent use.

Common Pitfalls:

  • Invalid regex throws PatternSyntaxException at construction
  • Regex must match delimiters, not content (use lookahead/lookbehind if needed)
  • Greedy vs non-greedy matching affects results
  • Regex performance degrades with catastrophic backtracking
  • Delimiter not preserved in segments (unless using lookaround assertions)
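
The last pitfall deserves a concrete illustration: a zero-width lookahead matches only the boundary, so the delimiter text survives at the start of the next segment. The same regex string would be passed to DocumentByRegexSplitter; shown here with plain `String.split`:

```java
import java.util.Arrays;

public class LookaheadSplitDemo {
    public static void main(String[] args) {
        String log = "2024-01-01 started\n2024-01-02 stopped";

        // Plain pattern consumes the matched delimiter text:
        System.out.println(Arrays.toString(log.split("\\n\\d{4}")));
        // [2024-01-01 started, -01-02 stopped]

        // Lookahead matches only the newline, so the timestamp survives:
        System.out.println(Arrays.toString(log.split("\\n(?=\\d{4})")));
        // [2024-01-01 started, 2024-01-02 stopped]
    }
}
```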

Edge Cases:

  • Regex never matches: document not split at all (becomes single segment)
  • Regex matches everywhere: produces many empty segments (filtered)
  • Empty segments after split are filtered automatically
  • Regex matching newlines may interact poorly with line-based logic

Performance Notes:

  • Regex compilation happens once at construction
  • Complex regex can be slow (O(n²) or worse with backtracking)
  • Simple patterns (literal strings) almost as fast as string.split()

Cost Considerations:

  • Custom splitting can preserve domain-specific structure
  • Good for log files, structured text, code with special delimiters

Exception Handling:

  • PatternSyntaxException - Invalid regex pattern
  • IllegalArgumentException - Invalid size parameters
  • StackOverflowError - Catastrophic regex backtracking

Related APIs: Pattern class, HierarchicalDocumentSplitter, DocumentByLineSplitter


Chunking Strategy Guide

Choosing the Right Splitter

Use Cases by Content Type:

Content TypeRecommended SplitterReasoning
Documentation, articlesDocumentSplitters.recursive()Preserves semantic structure (paragraphs > sentences > words)
Code filesDocumentByLineSplitterCode structure aligned with lines
Log filesDocumentByRegexSplitterCustom delimiters (timestamps, log levels)
CSV/TSVDocumentByLineSplitterEach line is semantic unit
Legal documentsDocumentByParagraphSplitterParagraph = logical unit
Chat transcriptsDocumentByRegexSplitterSplit by speaker or timestamp
MarkdownDocumentByParagraphSplitterRespects document structure
JSON/XMLCustom parser + DocumentByLineSplitterParse first, then split logical blocks

Segment Size Guidelines

Token-based sizing (recommended):

  • Small segments (100-200 tokens): High precision, more segments, higher cost
  • Medium segments (300-500 tokens): Balanced precision and context
  • Large segments (800-1000 tokens): More context, lower precision, risk of exceeding model limits

Character-based sizing:

  • 1 token ≈ 4 characters (English text)
  • 500 characters ≈ 125 tokens
  • Use character-based for simplicity; token-based for accuracy
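
The ≈4 chars/token rule of thumb above as a tiny helper, handy when picking character limits without wiring in a tokenizer. Note this heuristic holds for English prose only; code and non-Latin scripts tokenize differently:

```java
public class TokenEstimate {
    // Rough English-text heuristic: 1 token ≈ 4 characters
    public static int approxTokens(int chars) {
        return Math.round(chars / 4.0f);
    }

    public static int charsForTokenBudget(int tokens) {
        return tokens * 4;
    }

    public static void main(String[] args) {
        System.out.println(approxTokens(500));        // 125
        System.out.println(charsForTokenBudget(300)); // 1200
    }
}
```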

Overlap Strategy

Overlap benefits:

  • Prevents information loss at segment boundaries
  • Improves retrieval when query terms span boundary
  • Provides context continuity

Overlap sizing:

  • 10-20% of segment size is typical
  • Paragraph splitter: 1-2 paragraphs overlap
  • Sentence splitter: 1-3 sentences overlap
  • Too much overlap: redundant storage and embedding costs
  • Too little overlap: information loss at boundaries

When to skip overlap:

  • Structured data where boundaries are clear (CSV rows)
  • Storage/cost constrained scenarios
  • Segments already large enough to provide context

Hierarchical Splitting Strategy

Recursive splitting hierarchy:

  1. Paragraph (double newline) - preserves major structure
  2. Line (single newline) - preserves minor structure
  3. Sentence (OpenNLP) - preserves semantic units
  4. Word (whitespace) - preserves lexical units
  5. Character (fallback) - guaranteed progress

Custom hierarchy example:

DocumentSplitter customSplitter = new DocumentByRegexSplitter(
    "\\n---\\n", // Custom section delimiter
    1000,
    100,
    new DocumentByParagraphSplitter(1000, 100)
);

Performance Optimization

Parallel processing pattern:

List<Document> documents = loadDocuments();
List<TextSegment> allSegments = documents.parallelStream()
    .flatMap(doc -> splitter.split(doc).stream())
    .collect(Collectors.toList());

Batching for embedding:

int batchSize = 100;
for (int i = 0; i < segments.size(); i += batchSize) {
    List<TextSegment> batch = segments.subList(
        i,
        Math.min(i + batchSize, segments.size())
    );
    List<Embedding> embeddings = embeddingModel.embedAll(batch).content();
    // Store embeddings
}

File Type Handling Patterns

Plain Text Files

Supported encodings:

  • UTF-8 (default)
  • UTF-16 (with BOM detection)
  • ISO-8859-1 (Latin-1)
  • Windows-1252
  • Custom Charset via TextDocumentParser(charset)

Pattern:

// Default parser (UTF-8)
Document doc = FileSystemDocumentLoader.loadDocument("file.txt");

// Explicit encoding
Document doc2 = FileSystemDocumentLoader.loadDocument(
    "file.txt",
    new TextDocumentParser(StandardCharsets.ISO_8859_1)
);

Markdown Files

Pattern:

// Load as text (preserves markdown syntax)
Document doc = FileSystemDocumentLoader.loadDocument("README.md");

// Split by headers (custom regex)
DocumentSplitter splitter = new DocumentByRegexSplitter(
    "\\n##? ",  // Split on ## or # headers
    2000,
    200
);

Code Files

Pattern:

// Load source code
Document code = FileSystemDocumentLoader.loadDocument("App.java");

// Split by lines (preserves structure)
DocumentSplitter splitter = new DocumentByLineSplitter(500, 50);

// Or split by functions (custom regex for Java)
DocumentSplitter functionSplitter = new DocumentByRegexSplitter(
    "\\n\\s*(public|private|protected)\\s+",
    1000,
    100
);

CSV Files

Pattern:

// Load CSV
Document csv = FileSystemDocumentLoader.loadDocument("data.csv");

// Split by lines (each row is a segment)
DocumentSplitter splitter = new DocumentByLineSplitter(
    1000, // Max chars per segment
    0     // No overlap for structured data
);

// Drop the first segment if it contains the header row
// (a segment may hold several rows, so stripping the header from the
// document text before splitting is more reliable)
List<TextSegment> segments = splitter.split(csv).stream()
    .skip(1) // skips the first *segment*, not strictly the first row
    .collect(Collectors.toList());

Log Files

Pattern:

// Load log file
Document logs = FileSystemDocumentLoader.loadDocument("app.log");

// Split by timestamp pattern
DocumentSplitter splitter = new DocumentByRegexSplitter(
    "\\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}",  // ISO timestamp
    2000,
    0  // No overlap for logs
);

JSON Files

Pattern:

// Parse JSON first, then create documents per object
String jsonContent = Files.readString(Path.of("data.json"));
JsonArray array = JsonParser.parseString(jsonContent).getAsJsonArray();

List<Document> documents = new ArrayList<>();
for (JsonElement element : array) {
    String text = element.toString();
    documents.add(Document.from(text));
}

// Split each document
List<TextSegment> segments = documents.stream()
    .flatMap(doc -> splitter.split(doc).stream())
    .collect(Collectors.toList());

PDF Files

Pattern (requires Apache PDFBox):

// Add dependency: dev.langchain4j:langchain4j-document-parser-apache-pdfbox
import dev.langchain4j.data.document.parser.apache.pdfbox.ApachePdfBoxDocumentParser;

Document pdf = FileSystemDocumentLoader.loadDocument(
    "document.pdf",
    new ApachePdfBoxDocumentParser()
);

// Split with token-based limits (PDFs often verbose)
DocumentSplitter splitter = DocumentSplitters.recursive(500, 50, tokenizer);

Binary Files

Pattern:

  • Do NOT load binary files (images, videos, executables) with text loaders
  • Use specialized parsers or skip binary files
  • Filter by extension:
PathMatcher textFilesOnly = FileSystems.getDefault().getPathMatcher(
    "glob:*.{txt,md,java,py,js,json,xml,csv,log}"
);

List<Document> docs = FileSystemDocumentLoader.loadDocuments(
    Path.of("/path/to/dir"),
    textFilesOnly
);

Testing Patterns

Unit Testing Document Loaders

import org.junit.jupiter.api.Test;
import static org.assertj.core.api.Assertions.*;

class DocumentLoaderTest {
    @Test
    void testLoadSingleDocument() {
        // Given
        Path testFile = Path.of("src/test/resources/test.txt");

        // When
        Document doc = FileSystemDocumentLoader.loadDocument(testFile);

        // Then
        assertThat(doc.text()).isNotEmpty();
        assertThat(doc.metadata().get("file_name")).isEqualTo("test.txt");
    }

    @Test
    void testLoadNonExistentFile() {
        // Given
        Path nonExistent = Path.of("does-not-exist.txt");

        // When/Then
        assertThatThrownBy(() -> FileSystemDocumentLoader.loadDocument(nonExistent))
            .isInstanceOf(NoSuchFileException.class);
    }

    @Test
    void testLoadWithCustomCharset() {
        // Given
        Path latin1File = Path.of("src/test/resources/latin1.txt");
        TextDocumentParser parser = new TextDocumentParser(StandardCharsets.ISO_8859_1);

        // When
        Document doc = FileSystemDocumentLoader.loadDocument(latin1File, parser);

        // Then
        assertThat(doc.text()).contains("café"); // Correctly decoded
    }
}

Unit Testing Document Splitters

class DocumentSplitterTest {
    private DocumentSplitter splitter;

    @BeforeEach
    void setUp() {
        splitter = DocumentSplitters.recursive(100, 10);
    }

    @Test
    void testSplitSmallDocument() {
        // Given
        Document doc = Document.from("Short text.");

        // When
        List<TextSegment> segments = splitter.split(doc);

        // Then
        assertThat(segments).hasSize(1);
        assertThat(segments.get(0).text()).isEqualTo("Short text.");
    }

    @Test
    void testSplitLargeDocument() {
        // Given
        String longText = "A ".repeat(100); // 200 characters
        Document doc = Document.from(longText);

        // When
        List<TextSegment> segments = splitter.split(doc);

        // Then
        assertThat(segments).hasSizeGreaterThan(1);
        assertThat(segments).allMatch(s -> s.text().length() <= 100);
    }

    @Test
    void testOverlapBetweenSegments() {
        // Given
        String text = "Sentence one. Sentence two. Sentence three. Sentence four.";
        Document doc = Document.from(text);
        DocumentSplitter splitterWithOverlap = new DocumentBySentenceSplitter(30, 10);

        // When
        List<TextSegment> segments = splitterWithOverlap.split(doc);

        // Then
        assertThat(segments.size()).isGreaterThan(1);
        // Verify overlap exists
        for (int i = 0; i < segments.size() - 1; i++) {
            String currentEnd = segments.get(i).text().substring(
                Math.max(0, segments.get(i).text().length() - 10)
            );
            String nextStart = segments.get(i + 1).text().substring(0,
                Math.min(10, segments.get(i + 1).text().length())
            );
            // Some overlap should exist (strip first so an empty string
            // produced by split() cannot match trivially)
            assertThat(nextStart).containsAnyOf(currentEnd.strip().split("\\s+"));
        }
    }

    @Test
    void testMetadataPreserved() {
        // Given
        Metadata metadata = new Metadata();
        metadata.put("source", "test.txt");
        Document doc = Document.from("Text content", metadata);

        // When
        List<TextSegment> segments = splitter.split(doc);

        // Then
        assertThat(segments).allMatch(s ->
            s.metadata().get("source").equals("test.txt")
        );
    }
}

Integration Testing RAG Pipeline

class RAGPipelineTest {
    private EmbeddingModel embeddingModel;
    private EmbeddingStore<TextSegment> embeddingStore;
    private DocumentSplitter splitter;

    @BeforeEach
    void setUp() {
        embeddingModel = new AllMiniLmL6V2EmbeddingModel();
        embeddingStore = new InMemoryEmbeddingStore<>();
        splitter = DocumentSplitters.recursive(300, 30,
            new OpenAiTokenizer("gpt-3.5-turbo"));
    }

    @Test
    void testCompleteRAGPipeline() {
        // Given: Load and split documents
        List<Document> docs = FileSystemDocumentLoader.loadDocuments(
            Path.of("src/test/resources/docs")
        );

        List<TextSegment> segments = docs.stream()
            .flatMap(doc -> splitter.split(doc).stream())
            .collect(Collectors.toList());

        // Index segments
        for (TextSegment segment : segments) {
            Embedding embedding = embeddingModel.embed(segment).content();
            embeddingStore.add(embedding, segment);
        }

        // When: Search
        String query = "What is document processing?";
        Embedding queryEmbedding = embeddingModel.embed(query).content();
        List<EmbeddingMatch<TextSegment>> matches =
            embeddingStore.findRelevant(queryEmbedding, 3);

        // Then: Verify results
        assertThat(matches).isNotEmpty();
        assertThat(matches).hasSizeLessThanOrEqualTo(3);
        assertThat(matches.get(0).score()).isGreaterThan(0.5);
        assertThat(matches.get(0).embedded().text()).containsIgnoringCase("document");
    }
}

Testing Error Handling

class ErrorHandlingTest {
    @Test
    void testLargeFileHandling() {
        // Given: Simulate large file
        Path largeFile = createLargeTestFile(1_000_000_000); // 1GB

        // When/Then: Should handle gracefully or throw OOME
        assertThatThrownBy(() ->
            FileSystemDocumentLoader.loadDocument(largeFile)
        ).isInstanceOfAny(OutOfMemoryError.class, IOException.class);

        // Cleanup
        Files.deleteIfExists(largeFile);
    }

    @Test
    void testInvalidEncodingHandling() {
        // Given: File with invalid UTF-8
        Path invalidFile = Path.of("src/test/resources/invalid-utf8.txt");

        // When: Load with UTF-8 parser
        Document doc = FileSystemDocumentLoader.loadDocument(invalidFile);

        // Then: Should contain replacement characters
        assertThat(doc.text()).contains("\uFFFD"); // Replacement character
    }

    @Test
    void testEmptyFileHandling() {
        // Given: Empty file
        Path emptyFile = Files.createTempFile("empty", ".txt");

        // When
        Document doc = FileSystemDocumentLoader.loadDocument(emptyFile);

        // Then
        assertThat(doc.text()).isEmpty();

        // Cleanup
        Files.deleteIfExists(emptyFile);
    }
}

Usage Examples

Loading Documents

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import java.nio.file.Path;
import java.util.List;

// Load single document
Document doc = FileSystemDocumentLoader.loadDocument(Path.of("/path/to/file.txt"));

// Load with custom parser
Document doc2 = FileSystemDocumentLoader.loadDocument(
    Path.of("/path/to/file.txt"),
    new TextDocumentParser()
);

// Load all documents from directory
List<Document> docs = FileSystemDocumentLoader.loadDocuments(Path.of("/path/to/dir"));

// Load recursively
List<Document> allDocs = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/dir")
);

Loading from Classpath

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.ClassPathDocumentLoader;

// Load single file from classpath
Document doc = ClassPathDocumentLoader.loadDocument("documents/guide.txt");

// Load all documents from classpath directory
List<Document> docs = ClassPathDocumentLoader.loadDocuments("documents");

// Load recursively
List<Document> allDocs = ClassPathDocumentLoader.loadDocumentsRecursively("documents");

Splitting Documents

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiTokenizer;
import java.util.List;

// Recommended: recursive splitter with token limits
DocumentSplitter splitter = DocumentSplitters.recursive(
    500, // max tokens per segment
    50,  // overlap tokens
    new OpenAiTokenizer()
);

List<TextSegment> segments = splitter.split(document);

// Simple: recursive splitter with character limits
DocumentSplitter charSplitter = DocumentSplitters.recursive(2000, 200);
List<TextSegment> charSegments = charSplitter.split(document);
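The overlap parameter means consecutive segments repeat each other's boundary characters, which helps preserve context across segment edges. A rough pure-JDK sketch of fixed-window splitting under character limits (langchain4j's recursive splitter is more sophisticated and prefers paragraph and sentence boundaries):

```java
import java.util.ArrayList;
import java.util.List;

public class CharSplitSketch {
    // Fixed-window splitting with overlap; a simplification of what
    // the real recursive splitter does
    static List<String> split(String text, int maxChars, int overlap) {
        List<String> segments = new ArrayList<>();
        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            segments.add(text.substring(start, end));
            if (end == text.length()) break;
        }
        return segments;
    }

    public static void main(String[] args) {
        // Each segment shares its last character with the next (overlap = 1)
        System.out.println(split("abcdefghij", 4, 1)); // [abcd, defg, ghij]
    }
}
```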

Custom Splitting Strategy

import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;
import dev.langchain4j.data.document.splitter.DocumentBySentenceSplitter;

// Split by paragraphs with custom sub-splitter
DocumentSplitter splitter = new DocumentByParagraphSplitter(
    1000, // max characters per segment
    100,  // overlap characters
    new DocumentBySentenceSplitter(1000, 100) // sub-splitter for large paragraphs
);

List<TextSegment> segments = splitter.split(document);

Complete RAG Pipeline Example

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.splitter.DocumentSplitter;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// 1. Load documents
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/docs")
);

// 2. Split into segments
DocumentSplitter splitter = DocumentSplitters.recursive(300, 30); // character limits; pass a tokenizer for token-based limits
List<TextSegment> segments = new ArrayList<>();
for (Document doc : documents) {
    segments.addAll(splitter.split(doc));
}

// 3. Embed segments (assumes an EmbeddingModel `embeddingModel` configured elsewhere)
EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
for (TextSegment segment : segments) {
    Embedding embedding = embeddingModel.embed(segment).content();
    embeddingStore.add(embedding, segment);
}

// 4. Use with AI service for RAG (assumes a configured chat model and Assistant interface)
ContentRetriever contentRetriever = EmbeddingStoreContentRetriever.builder()
    .embeddingStore(embeddingStore)
    .embeddingModel(embeddingModel)
    .maxResults(3)
    .build();

Assistant assistant = AiServices.builder(Assistant.class)
    .chatModel(chatModel)
    .contentRetriever(contentRetriever)
    .build();
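At query time, EmbeddingStoreContentRetriever ranks stored segments by vector similarity to the query embedding and returns the top maxResults. A toy pure-JDK illustration of that ranking idea, with made-up two-dimensional vectors standing in for real embeddings:

```java
import java.util.Comparator;
import java.util.Map;

public class RetrievalSketch {
    // Cosine similarity between two vectors
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Toy "embedding store": segment text -> made-up 2-d vector
        Map<String, double[]> store = Map.of(
                "cats", new double[]{1.0, 0.1},
                "finance", new double[]{0.1, 1.0});
        double[] query = {0.9, 0.2}; // pretend embedding of a cat-related question

        // Like maxResults(1): pick the most similar segment
        String best = store.entrySet().stream()
                .max(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> cosine(query, e.getValue())))
                .orElseThrow()
                .getKey();
        System.out.println(best); // cats
    }
}
```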

Filtering Files by Type

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.parser.TextDocumentParser;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;

// Create matcher for text-like files, including those in subdirectories
PathMatcher textFiles = FileSystems.getDefault().getPathMatcher(
    "glob:**.{txt,md,java,py,js}" // "**" crosses directory boundaries; "*" would not
);

// Load only matching files
List<Document> docs = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/code"),
    textFiles,
    new TextDocumentParser()
);
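The glob semantics are worth a quick pure-JDK check: `*` does not cross directory separators, so for recursive loading a `**` pattern is usually what you want.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobDemo {
    public static void main(String[] args) {
        PathMatcher single = FileSystems.getDefault().getPathMatcher("glob:*.txt");
        PathMatcher crossing = FileSystems.getDefault().getPathMatcher("glob:**.txt");

        Path nested = Path.of("sub/dir/notes.txt");
        // "*" stops at directory separators; "**" crosses them
        System.out.println(single.matches(nested));   // false
        System.out.println(crossing.matches(nested)); // true
    }
}
```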

Parallel Processing Pattern

import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

// Load documents in parallel
List<Document> documents = FileSystemDocumentLoader.loadDocumentsRecursively(
    Path.of("/path/to/docs")
);

// Split in parallel
ForkJoinPool customThreadPool = new ForkJoinPool(4);
List<TextSegment> allSegments = customThreadPool.submit(() ->
    documents.parallelStream()
        .flatMap(doc -> splitter.split(doc).stream())
        .collect(Collectors.toList())
).join();

customThreadPool.shutdown();
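The submit-to-a-custom-pool pattern above can be exercised with only the JDK; here a whitespace split stands in for the document splitter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class ParallelSplitDemo {
    public static void main(String[] args) throws Exception {
        List<String> docs = List.of("a b c", "d e", "f");

        // Bound parallelism by submitting the parallel stream to a custom pool
        ForkJoinPool pool = new ForkJoinPool(4);
        try {
            List<String> segments = pool.submit(() ->
                    docs.parallelStream()
                            .flatMap(d -> Arrays.stream(d.split(" ")))
                            .collect(Collectors.toList())
            ).get();
            System.out.println(segments.size()); // 6
        } finally {
            pool.shutdown();
        }
    }
}
```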

Handling Different Encodings

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Latin-1 encoded file
Document latin1Doc = FileSystemDocumentLoader.loadDocument(
    Path.of("latin1-file.txt"),
    new TextDocumentParser(StandardCharsets.ISO_8859_1)
);

// Windows-1252 encoded file
Document windowsDoc = FileSystemDocumentLoader.loadDocument(
    Path.of("windows-file.txt"),
    new TextDocumentParser(Charset.forName("Windows-1252"))
);
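Why the charset argument matters: the same bytes decode to different strings under different charsets. A pure-JDK illustration:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class EncodingDemo {
    public static void main(String[] args) throws Exception {
        // Write "café" as Latin-1: the é becomes the single byte 0xE9
        Path f = Files.createTempFile("latin1", ".txt");
        Files.writeString(f, "caf\u00e9", StandardCharsets.ISO_8859_1);

        byte[] bytes = Files.readAllBytes(f);
        // 0xE9 is not valid UTF-8, so decoding with the wrong charset
        // replaces it with U+FFFD
        String right = new String(bytes, StandardCharsets.ISO_8859_1);
        String wrong = new String(bytes, StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9")); // true
        System.out.println(wrong.equals("caf\u00e9")); // false
        Files.deleteIfExists(f);
    }
}
```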

Custom Regex Splitter for Logs

import dev.langchain4j.data.document.splitter.DocumentByRegexSplitter;

// Split log file by timestamp entries
DocumentByRegexSplitter logSplitter = new DocumentByRegexSplitter(
    "\\n\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}",  // ISO timestamp pattern
    "\n",  // delimiter used to re-join parts when fitting segments
    2000,  // max chars per segment
    0      // no overlap for logs
);

Document logs = FileSystemDocumentLoader.loadDocument("application.log");
List<TextSegment> logEntries = logSplitter.split(logs);
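If you want to sanity-check the timestamp pattern independently of the splitter, plain `java.util.regex` works; a zero-width lookahead keeps each timestamp attached to its own entry:

```java
import java.util.regex.Pattern;

public class LogSplitDemo {
    public static void main(String[] args) {
        String log = "2024-01-01 10:00:00 started\n"
                + "2024-01-01 10:00:05 ready\n"
                + "2024-01-01 10:01:00 done";

        // Zero-width lookahead splits *before* each timestamp, so the
        // timestamp stays with its own entry instead of being consumed
        Pattern p = Pattern.compile("(?=\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})");
        String[] entries = p.split(log);
        System.out.println(entries.length); // 3
    }
}
```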

Batch Embedding for Cost Efficiency

// Batch segments for embedding API calls
int batchSize = 100;
List<TextSegment> allSegments = splitter.split(document);

for (int i = 0; i < allSegments.size(); i += batchSize) {
    List<TextSegment> batch = allSegments.subList(
        i,
        Math.min(i + batchSize, allSegments.size())
    );

    // Embed entire batch in one API call
    List<Embedding> embeddings = embeddingModel.embedAll(batch).content();

    // Store embeddings; embeddingStore.addAll(embeddings, batch) does the same in one call
    for (int j = 0; j < batch.size(); j++) {
        embeddingStore.add(embeddings.get(j), batch.get(j));
    }
}
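The batching arithmetic generalizes to any list; a pure-JDK helper showing the subList windows (250 items in batches of 100 gives 3 batches, the last holding 50):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class BatchDemo {
    // Window a list into consecutive subList views of at most batchSize elements
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            result.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> items = IntStream.range(0, 250).boxed().toList();
        List<List<Integer>> b = batches(items, 100);
        System.out.println(b.size());        // 3
        System.out.println(b.get(2).size()); // 50
    }
}
```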

Related APIs

Document Loading:

  • FileSystemDocumentLoader - Load from file system
  • ClassPathDocumentLoader - Load from classpath
  • UrlDocumentLoader - Load from URLs
  • FileSystemSource, ClassPathSource, UrlSource - Document source abstractions

Document Parsing:

  • TextDocumentParser - Parse plain text
  • ApachePdfBoxDocumentParser - Parse PDF files (separate dependency)
  • ApacheTikaDocumentParser - Parse multiple formats (separate dependency)

Document Splitting:

  • DocumentSplitters - Factory for recursive splitters
  • DocumentByParagraphSplitter - Split by paragraphs
  • DocumentByLineSplitter - Split by lines
  • DocumentBySentenceSplitter - Split by sentences
  • DocumentByWordSplitter - Split by words
  • DocumentByCharacterSplitter - Split by characters
  • DocumentByRegexSplitter - Split by custom regex
  • HierarchicalDocumentSplitter - Base class for hierarchical splitting

Tokenization:

  • OpenAiTokenizer - OpenAI token counting
  • TokenCountEstimator - Interface for token estimation

Embedding:

  • EmbeddingModel - Generate embeddings
  • EmbeddingStore - Store and retrieve embeddings
  • EmbeddingStoreContentRetriever - RAG retrieval from embedding store

Data Types:

  • Document - Represents loaded document
  • TextSegment - Represents document segment
  • Metadata - Key-value metadata storage
  • Embedding - Vector embedding representation

Install with Tessl CLI

npx tessl i tessl/maven-dev-langchain4j--langchain4j@1.11.0
