tessl/maven-org-springframework-ai--spring-ai-commons

Common classes used across Spring AI providing document processing, text transformation, embedding utilities, observability support, and tokenization capabilities for AI application development

docs/reference/document-processing.md

Document Processing

Document processing provides core interfaces and implementations for ETL (Extract, Transform, Load) operations on documents in AI pipelines.

Overview

The document processing layer consists of:

  • DocumentReader - Read documents from sources
  • DocumentWriter - Write documents to destinations
  • DocumentTransformer - Transform document lists
  • IdGenerator - Generate unique document IDs

These interfaces follow functional programming patterns and extend standard Java functional interfaces (Supplier, Consumer, Function).
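Because each interface extends a `java.util.function` type, every ETL stage can be supplied as a lambda. The sketch below illustrates this; `Document` and the three interfaces are simplified stand-ins (the real types live in `org.springframework.ai.document` and expose richer APIs such as `getText()` and metadata):

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Minimal stand-ins for the Spring AI types, so this sketch compiles on its own.
record Document(String text) {}
interface DocumentReader extends Supplier<List<Document>> {}
interface DocumentTransformer extends Function<List<Document>, List<Document>> {}
interface DocumentWriter extends Consumer<List<Document>> {}

public class FunctionalPipeline {
    // Each pipeline stage is just a lambda implementing the functional interface.
    static DocumentReader reader = () -> List.of(new Document("hello"));
    static DocumentTransformer upper = docs -> docs.stream()
            .map(d -> new Document(d.text().toUpperCase()))
            .toList();
    static DocumentWriter writer = docs -> docs.forEach(d -> System.out.println(d.text()));

    public static void main(String[] args) {
        // Chain the stages: read, transform, write.
        writer.accept(upper.apply(reader.get()));  // prints HELLO
    }
}
```

This also means the interfaces compose with any code that accepts a `Supplier`, `Consumer`, or `Function`, including `Function.andThen` for chaining transformers.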

Capabilities

DocumentReader Interface

Reads documents from a source and returns a list of Document objects.

package org.springframework.ai.document;

import java.util.List;
import java.util.function.Supplier;

interface DocumentReader extends Supplier<List<Document>> {
    /**
     * Read documents from the source.
     * Default implementation calls get().
     * @return list of documents
     */
    default List<Document> read() {
        return get();
    }

    /**
     * Get documents from the source (from Supplier interface).
     * @return list of documents
     */
    List<Document> get();
}

Usage

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.reader.JsonReader;
import org.springframework.core.io.ClassPathResource;
import java.util.List;

// Read text documents
DocumentReader textReader = new TextReader(new ClassPathResource("data.txt"));
List<Document> textDocs = textReader.get();

// Read JSON documents
DocumentReader jsonReader = new JsonReader(new ClassPathResource("data.json"));
List<Document> jsonDocs = jsonReader.read();

// Process documents
for (Document doc : textDocs) {
    System.out.println("ID: " + doc.getId());
    System.out.println("Content: " + doc.getText());
    System.out.println("Metadata: " + doc.getMetadata());
}

See Readers and Writers documentation for specific reader implementations (JsonReader, TextReader).

DocumentWriter Interface

Writes a list of Document instances to a destination.

package org.springframework.ai.document;

import java.util.List;
import java.util.function.Consumer;

interface DocumentWriter extends Consumer<List<Document>> {
    /**
     * Write documents to the destination.
     * Default implementation calls accept().
     * @param documents list of documents to write
     */
    default void write(List<Document> documents) {
        accept(documents);
    }

    /**
     * Accept documents (from Consumer interface).
     * @param documents list of documents to write
     */
    void accept(List<Document> documents);
}

Usage

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentWriter;
import org.springframework.ai.document.MetadataMode;
import org.springframework.ai.writer.FileDocumentWriter;
import java.util.List;

// Create documents
List<Document> docs = List.of(
    new Document("First document"),
    new Document("Second document"),
    new Document("Third document")
);

// Write to file
DocumentWriter writer = new FileDocumentWriter("output.txt");
writer.write(docs);

// Write with document markers and metadata
DocumentWriter writerWithMarkers = new FileDocumentWriter(
    "output-with-metadata.txt",
    true,  // with document markers
    MetadataMode.ALL,
    false  // don't append
);
writerWithMarkers.accept(docs);

See Readers and Writers documentation for specific writer implementations (FileDocumentWriter).

DocumentTransformer Interface

Transforms a list of documents into another list of documents.

package org.springframework.ai.document;

import java.util.List;
import java.util.function.Function;

interface DocumentTransformer extends Function<List<Document>, List<Document>> {
    /**
     * Transform documents.
     * Default implementation calls apply().
     * @param documents list of documents to transform
     * @return transformed list of documents
     */
    default List<Document> transform(List<Document> documents) {
        return apply(documents);
    }

    /**
     * Apply transformation (from Function interface).
     * @param input list of documents to transform
     * @return transformed list of documents
     */
    List<Document> apply(List<Document> input);
}

Usage

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.transformer.ContentFormatTransformer;
import org.springframework.ai.document.ContentFormatter;
import org.springframework.ai.document.DefaultContentFormatter;
import java.util.List;

// Split documents into chunks
DocumentTransformer splitter = TokenTextSplitter.builder()
    .withChunkSize(500)
    .build();

List<Document> originalDocs = List.of(
    new Document("Long document content that needs to be split...")
);

List<Document> chunks = splitter.apply(originalDocs);

// Apply content formatting
ContentFormatter formatter = DefaultContentFormatter.defaultConfig();
DocumentTransformer formatTransformer = new ContentFormatTransformer(formatter);
List<Document> formattedDocs = formatTransformer.transform(chunks);

// Chain transformers
List<Document> result = formatTransformer.apply(splitter.apply(originalDocs));

Common DocumentTransformer implementations:

  • TokenTextSplitter - Splits text based on token count (see Text Splitting)
  • ContentFormatTransformer - Applies ContentFormatter to documents (see Content Formatting)

IdGenerator Interface

Generates unique document IDs from content.

package org.springframework.ai.document.id;

interface IdGenerator {
    /**
     * Generate a unique ID from content.
     * @param contents variable content to generate ID from
     * @return unique ID string
     */
    String generateId(Object... contents);
}

RandomIdGenerator

Generates random UUID-based IDs.

package org.springframework.ai.document.id;

class RandomIdGenerator implements IdGenerator {
    RandomIdGenerator();

    /**
     * Generate a random UUID.
     * @param contents ignored
     * @return random UUID as string
     */
    String generateId(Object... contents);
}

JdkSha256HexIdGenerator

Generates IDs based on SHA-256 hash of content.

package org.springframework.ai.document.id;

import java.nio.charset.Charset;

class JdkSha256HexIdGenerator implements IdGenerator {
    /**
     * Create generator with SHA-256 and UTF-8.
     */
    JdkSha256HexIdGenerator();

    /**
     * Create generator with custom algorithm and charset.
     * @param algorithm hash algorithm (e.g., "SHA-256", "MD5")
     * @param charset character encoding
     */
    JdkSha256HexIdGenerator(String algorithm, Charset charset);

    /**
     * Generate ID from content hash.
     * @param contents content to hash
     * @return hash-based UUID string
     */
    String generateId(Object... contents);
}

Usage Examples

import org.springframework.ai.document.Document;
import org.springframework.ai.document.id.IdGenerator;
import org.springframework.ai.document.id.RandomIdGenerator;
import org.springframework.ai.document.id.JdkSha256HexIdGenerator;
import java.nio.charset.StandardCharsets;

// Random ID generator
IdGenerator randomGen = new RandomIdGenerator();
String id1 = randomGen.generateId("content");  // e.g., "a1b2c3d4-..."
String id2 = randomGen.generateId("content");  // Different ID

// SHA-256 based generator (deterministic)
IdGenerator sha256Gen = new JdkSha256HexIdGenerator();
String id3 = sha256Gen.generateId("same content");
String id4 = sha256Gen.generateId("same content");  // Same ID as id3
String id5 = sha256Gen.generateId("different content");  // Different ID

// Custom algorithm generator
IdGenerator md5Gen = new JdkSha256HexIdGenerator("MD5", StandardCharsets.UTF_8);
String id6 = md5Gen.generateId("content");

// Use with Document builder
Document doc1 = Document.builder()
    .idGenerator(sha256Gen)
    .text("Document content")
    .build();

Document doc2 = Document.builder()
    .idGenerator(randomGen)
    .text("Another document")
    .build();

// Multiple content inputs for hash
String compositeId = sha256Gen.generateId("part1", "part2", "part3");

When to Use Each Generator

RandomIdGenerator:

  • Default choice for most use cases
  • Each document gets a unique ID regardless of content
  • Good for documents that may have duplicate content
  • Non-deterministic - same content produces different IDs

JdkSha256HexIdGenerator:

  • Deterministic, content-based IDs - the same content always produces the same ID
  • Useful for deduplication
  • Enables idempotent document ingestion
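To make the deterministic behavior concrete, a content-hash ID can be derived with the JDK alone. This is a sketch of the idea, not the library's exact implementation; `generateId` and the UUID derivation here are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

public class ContentHashId {
    // Hash the content with SHA-256, then derive a UUID-formatted string
    // from the digest bytes so equal content always maps to the same ID.
    static String generateId(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return UUID.nameUUIDFromBytes(digest).toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Same input, same ID; different input, different ID.
        System.out.println(generateId("same content").equals(generateId("same content")));  // true
        System.out.println(generateId("same content").equals(generateId("other content"))); // false
    }
}
```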

Pipeline Patterns

Common patterns for chaining readers, transformers, and writers.

Read-Transform-Write Pipeline

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.document.DocumentWriter;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.writer.FileDocumentWriter;
import org.springframework.core.io.ClassPathResource;
import java.util.List;

// 1. Read
DocumentReader reader = new TextReader(new ClassPathResource("input.txt"));
List<Document> documents = reader.get();

// 2. Transform
DocumentTransformer splitter = TokenTextSplitter.builder()
    .withChunkSize(800)
    .build();
List<Document> chunks = splitter.apply(documents);

// 3. Write
DocumentWriter writer = new FileDocumentWriter("output.txt");
writer.write(chunks);

Multi-Stage Transformation

import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.transformer.ContentFormatTransformer;
import org.springframework.ai.document.DefaultContentFormatter;
import java.util.List;
import java.util.function.Function;

// Create transformers
TokenTextSplitter splitter = TokenTextSplitter.builder()
    .withChunkSize(500)
    .build();

ContentFormatTransformer formatter = new ContentFormatTransformer(
    DefaultContentFormatter.defaultConfig()
);

// Compose transformers
Function<List<Document>, List<Document>> pipeline =
    splitter.andThen(formatter);

// Apply pipeline
List<Document> input = List.of(new Document("Source document content"));  // source documents
List<Document> output = pipeline.apply(input);

Parallel Processing

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.TextReader;
import org.springframework.core.io.Resource;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

// Read from multiple sources in parallel
List<Resource> resources = List.of(/* multiple resources */);

List<CompletableFuture<List<Document>>> futures = resources.stream()
    .map(resource -> CompletableFuture.supplyAsync(() -> {
        DocumentReader reader = new TextReader(resource);
        return reader.get();
    }))
    .toList();

// Wait for all and flatten results
List<Document> allDocuments = futures.stream()
    .map(CompletableFuture::join)
    .flatMap(List::stream)
    .toList();

Conditional Transformation

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import java.util.List;
import java.util.stream.Stream;

class ConditionalTransformer implements DocumentTransformer {
    private final DocumentTransformer transformer;
    private final String metadataKey;
    private final Object requiredValue;

    public ConditionalTransformer(DocumentTransformer transformer,
                                   String metadataKey,
                                   Object requiredValue) {
        this.transformer = transformer;
        this.metadataKey = metadataKey;
        this.requiredValue = requiredValue;
    }

    @Override
    public List<Document> apply(List<Document> documents) {
        // Filter documents matching condition
        List<Document> matching = documents.stream()
            .filter(doc -> requiredValue.equals(doc.getMetadata().get(metadataKey)))
            .toList();

        // Transform only matching documents
        List<Document> transformed = transformer.apply(matching);

        // Combine with non-matching documents
        List<Document> nonMatching = documents.stream()
            .filter(doc -> !requiredValue.equals(doc.getMetadata().get(metadataKey)))
            .toList();

        return Stream.concat(transformed.stream(), nonMatching.stream()).toList();
    }
}

// Usage
DocumentTransformer conditionalSplitter = new ConditionalTransformer(
    TokenTextSplitter.builder().build(),
    "type",
    "long-form"
);

List<Document> result = conditionalSplitter.apply(documents);

Thread Safety and Performance

Thread Safety:

  • DocumentReader implementations: Generally thread-safe (stateless)
  • DocumentWriter implementations: Check the specific implementation (FileDocumentWriter is NOT thread-safe for concurrent writes to the same file)
  • DocumentTransformer implementations: Stateless and thread-safe
  • IdGenerator implementations: RandomIdGenerator and JdkSha256HexIdGenerator are thread-safe
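When a non-thread-safe writer must be shared across threads, one option is to serialize access with a synchronizing decorator. A minimal sketch, using simplified stand-in types (`Document`, `DocumentWriter`) and a hypothetical `SynchronizedWriter` helper that is not part of Spring AI:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Stand-ins so the sketch compiles without Spring AI on the classpath.
record Document(String text) {}
interface DocumentWriter extends Consumer<List<Document>> {}

// Decorator that serializes calls to a non-thread-safe writer.
class SynchronizedWriter implements DocumentWriter {
    private final DocumentWriter delegate;
    private final Object lock = new Object();

    SynchronizedWriter(DocumentWriter delegate) { this.delegate = delegate; }

    @Override
    public void accept(List<Document> documents) {
        synchronized (lock) {          // one writer at a time
            delegate.accept(documents);
        }
    }
}

public class SafeWrites {
    public static void main(String[] args) throws InterruptedException {
        List<String> sink = new ArrayList<>();   // not thread-safe, like a shared file
        DocumentWriter writer = new SynchronizedWriter(
                docs -> docs.forEach(d -> sink.add(d.text())));

        Thread t1 = new Thread(() -> writer.accept(List.of(new Document("a"))));
        Thread t2 = new Thread(() -> writer.accept(List.of(new Document("b"))));
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(sink.size());  // 2
    }
}
```

For higher throughput, writing each thread's output to a separate file avoids the lock entirely.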

Performance:

  • Document creation: O(1), lightweight
  • Reader operations: I/O bound, depends on source
  • Transformer operations: Typically O(n) where n is number of documents
  • Writer operations: I/O bound, depends on destination

Error Handling

Common Exceptions:

  • IOException: File not found, network errors, permission denied (readers/writers)
  • IllegalArgumentException: Invalid parameters (null documents, negative sizes)
  • RuntimeException: Unexpected processing errors (JSON parsing, encoding issues)

Edge Cases:

// Empty document list
List<Document> empty = List.of();
DocumentTransformer transformer = docs -> docs;  // any transformer (identity shown here)
List<Document> result = transformer.apply(empty);  // Returns empty list

// Null handling
try {
    transformer.apply(null);  // Throws NullPointerException
} catch (NullPointerException e) {
    // Handle null input
}

// Reader with missing resource
try {
    DocumentReader reader = new TextReader(new ClassPathResource("missing.txt"));
    List<Document> docs = reader.get();  // Throws IOException wrapped in RuntimeException
} catch (RuntimeException e) {
    // Handle missing resource
}

Best Practices

  1. Reuse Readers/Writers: Create once for multiple operations
  2. Handle I/O Exceptions: Always wrap reader/writer calls in try-catch
  3. Use Appropriate ID Generators: Random for uniqueness, SHA-256 for deduplication
  4. Chain Transformers: Use functional composition for pipelines
  5. Validate Inputs: Check for null/empty before processing

Install with Tessl CLI

npx tessl i tessl/maven-org-springframework-ai--spring-ai-commons
