tessl/maven-org-apache-tika--tika-core

Apache Tika Core provides the foundational APIs for detecting and extracting metadata and structured text content from various document formats.

Document Parsing

Name: tessl/maven-org-apache-tika--tika-core
Author: tessl

Core document parsing is built on the Parser interface and its implementations, which extract content and metadata from many document formats, with automatic format detection and configurable parse contexts.

Capabilities

Parser Interface

The fundamental interface for all document parsers in Tika, defining the contract for parsing documents into structured content with metadata extraction.

/**
 * Interface for document parsers that extract content and metadata from input streams
 */
public interface Parser {
    /**
     * Parses a document from the given input stream
     * @param stream Input stream containing the document to parse
     * @param handler Content handler to receive parsed content events
     * @param metadata Metadata object to populate with extracted metadata
     * @param context Parse context containing parser configuration and state
     * @throws IOException If an I/O error occurs during parsing
     * @throws SAXException If a SAX parsing error occurs
     * @throws TikaException If a Tika-specific parsing error occurs
     */
    void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
        
    /**
     * Returns the set of media types supported by this parser
     * @param context Parse context for configuration-dependent type support
     * @return Set of supported MediaType objects
     */
    Set<MediaType> getSupportedTypes(ParseContext context);
}

AutoDetectParser

The default parser implementation that automatically detects document type and delegates to appropriate specialized parsers, providing the most convenient entry point for parsing unknown document formats.

/**
 * Parser that automatically detects document type and delegates to appropriate parsers
 */
public class AutoDetectParser extends CompositeParser {
    /**
     * Creates an AutoDetectParser with default configuration
     */
    public AutoDetectParser();
    
    /**
     * Creates an AutoDetectParser with the specified Tika configuration
     * @param config TikaConfig instance containing parser and detector configuration
     */
    public AutoDetectParser(TikaConfig config);
    
    /**
     * Sets the fallback parser used when no suitable parser is found
     * @param fallback Parser to use as fallback, or null to disable fallback
     */
    public void setFallback(Parser fallback);
    
    /**
     * Gets the current fallback parser
     * @return The fallback parser, or null if no fallback is configured
     */
    public Parser getFallback();
    
    /**
     * Gets the detector used for content type detection
     * @return The Detector instance used by this parser
     */
    public Detector getDetector();
    
    /**
     * Gets the map of parsers by media type
     * @return Map from MediaType to Parser instances
     */
    public Map<MediaType, Parser> getParsers();
}

Usage Examples:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.EmptyParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import java.io.FileInputStream;
import java.io.InputStream;

// Basic parsing with auto-detection
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
BodyContentHandler handler = new BodyContentHandler();

try (InputStream stream = new FileInputStream("document.pdf")) {
    parser.parse(stream, handler, metadata, context);
    String content = handler.toString();
    // Read standard Dublin Core properties via their typed keys
    String title = metadata.get(TikaCoreProperties.TITLE);
    String author = metadata.get(TikaCoreProperties.CREATOR);
}

// Custom configuration with fallback
AutoDetectParser customParser = new AutoDetectParser();
customParser.setFallback(EmptyParser.INSTANCE); // Use the shared empty parser as fallback

CompositeParser

A parser that delegates parsing to a collection of sub-parsers based on media type, allowing for modular parser composition and custom parser configurations.

/**
 * Parser that delegates to different parsers based on media type
 */
public class CompositeParser extends AbstractParser {
    /**
     * Creates an empty CompositeParser
     */
    public CompositeParser();
    
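    /**
     * Creates a CompositeParser from the given registry and component parsers
     * @param registry Media type registry used to resolve type aliases and supertypes
     * @param parsers Component parsers to delegate to
     */
    public CompositeParser(MediaTypeRegistry registry, Parser... parsers);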
    
    /**
     * Gets the map of parsers by media type
     * @return Map from MediaType to Parser instances
     */
    public Map<MediaType, Parser> getParsers();
    
    /**
     * Sets the parser mappings
     * @param parsers Map from MediaType to Parser instances
     */
    public void setParsers(Map<MediaType, Parser> parsers);
    
    /**
     * Gets all media types supported by the configured parsers
     * @param context Parse context for configuration
     * @return Set of supported MediaType objects
     */
    public Set<MediaType> getSupportedTypes(ParseContext context);
}
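
The delegation described above can be pictured with a small, library-free sketch. It is a simplified stand-in and not Tika's actual CompositeParser: media types are plain strings, "parsers" are functions from input text to output text, and the names MediaTypeDispatcher and MediaTypeDispatcherDemo are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified illustration of dispatch by media type; not Tika's CompositeParser.
// Media types are plain strings and "parsers" are functions from input to output text.
final class MediaTypeDispatcher {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();
    // Used when no handler is registered for the requested type; returns empty output by default.
    private Function<String, String> fallback = input -> "";

    void register(String mediaType, Function<String, String> handler) {
        handlers.put(mediaType, handler);
    }

    void setFallback(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    // Look up the handler for the media type, or use the fallback if none is registered.
    String handle(String mediaType, String input) {
        return handlers.getOrDefault(mediaType, fallback).apply(input);
    }
}

final class MediaTypeDispatcherDemo {
    public static void main(String[] args) {
        MediaTypeDispatcher dispatcher = new MediaTypeDispatcher();
        dispatcher.register("text/plain", String::trim);
        dispatcher.setFallback(input -> "[unsupported]");
        System.out.println(dispatcher.handle("text/plain", "  hello  "));
        System.out.println(dispatcher.handle("application/x-unknown", "data"));
    }
}
```

The point of the sketch is the lookup-then-fallback shape: callers never pick a parser themselves, they only supply the media type.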

ParseContext

Context object that carries configuration and state information during parsing operations, allowing parsers to share resources and configuration.

/**
 * Context object for parser configuration and state sharing
 */
public class ParseContext {
    /**
     * Creates an empty ParseContext
     */
    public ParseContext();
    
    /**
     * Sets a context object of the specified type
     * @param type Class type of the context object
     * @param context The context object to set
     * @param <T> Type parameter for the context object
     */
    public <T> void set(Class<T> type, T context);
    
    /**
     * Gets a context object of the specified type
     * @param type Class type of the context object to retrieve
     * @param <T> Type parameter for the context object
     * @return The context object, or null if not set
     */
    public <T> T get(Class<T> type);
    
    /**
     * Gets a context object of the specified type with a default value
     * @param type Class type of the context object to retrieve
     * @param defaultValue Default value to return if not set
     * @param <T> Type parameter for the context object
     * @return The context object, or defaultValue if not set
     */
    public <T> T get(Class<T> type, T defaultValue);
}
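
The set/get contract above amounts to a registry keyed by class. The following library-free sketch shows that idea under stated assumptions: TypeKeyedContext and PasswordSetting are hypothetical names for illustration, and the code is not Tika's ParseContext implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a type-keyed context; not Tika's actual ParseContext.
final class TypeKeyedContext {
    // Keyed by class name so each type maps to at most one object.
    private final Map<String, Object> entries = new HashMap<>();

    // Store the object under its type; passing null removes any existing entry.
    <T> void set(Class<T> type, T value) {
        if (value != null) {
            entries.put(type.getName(), value);
        } else {
            entries.remove(type.getName());
        }
    }

    // Return the object registered for the type, or null if none is set.
    <T> T get(Class<T> type) {
        return type.cast(entries.get(type.getName()));
    }

    // Return the object registered for the type, or the given default if none is set.
    <T> T get(Class<T> type, T defaultValue) {
        T value = get(type);
        return value != null ? value : defaultValue;
    }
}

// Hypothetical settings object used only for this illustration.
final class PasswordSetting {
    final String password;

    PasswordSetting(String password) {
        this.password = password;
    }
}

final class TypeKeyedContextDemo {
    public static void main(String[] args) {
        TypeKeyedContext context = new TypeKeyedContext();
        context.set(PasswordSetting.class, new PasswordSetting("secret"));
        System.out.println(context.get(PasswordSetting.class).password);
        System.out.println(context.get(Integer.class, 42));
    }
}
```

Keying by type lets unrelated components share one context object without agreeing on string keys, and the default-value overload spares callers a null check.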

ParsingReader

A Reader implementation that parses documents on-demand, providing character-based access to parsed content with automatic format detection.

/**
 * Reader that parses documents on-demand and provides character access to content
 */
public class ParsingReader extends Reader {
    /**
     * Creates a ParsingReader for the specified input stream
     * @param stream Input stream containing the document to parse
     */
    public ParsingReader(InputStream stream);
    
    /**
     * Creates a ParsingReader with custom parser and metadata
     * @param parser Parser to use for document parsing
     * @param stream Input stream containing the document
     * @param metadata Metadata object to populate during parsing
     * @param context Parse context for configuration
     */
    public ParsingReader(Parser parser, InputStream stream, Metadata metadata, ParseContext context);
    
    /**
     * Gets the metadata populated during parsing
     * @return Metadata object containing extracted metadata
     */
    public Metadata getMetadata();
}
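
One way to build a reader that produces text on demand is to run the producer on a background thread and connect it to the consumer through a pipe. The sketch below shows that pattern using only the JDK. It assumes this is a reasonable mental model for ParsingReader rather than documenting its internals, and OnDemandTextReader and TextProducer are hypothetical names.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;

// Illustrative pipe-based reader; a stand-in for the pattern, not Tika's ParsingReader.
final class OnDemandTextReader {
    // Hypothetical producer interface standing in for a parser that emits text.
    interface TextProducer {
        void produce(Writer out) throws IOException;
    }

    // Start the producer on a background thread and return the reading end of the pipe.
    static Reader open(TextProducer producer) throws IOException {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);
        Thread worker = new Thread(() -> {
            // Closing the writer signals end-of-stream to the reader.
            try (Writer out = writer) {
                producer.produce(out);
            } catch (IOException e) {
                // The reader was closed early or production failed; stop writing.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return reader;
    }

    // Drain the reader into a string.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[256];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }
}
```

Because the pipe has a bounded buffer, the producer blocks until the consumer reads, so text is generated only as fast as it is consumed.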

DefaultParser

A composite parser assembled from the Parser implementations available on the classpath, giving reasonable defaults for most parsing scenarios.

/**
 * Composite parser built from the parser implementations available at runtime
 */
public class DefaultParser extends CompositeParser {
    /**
     * Creates a DefaultParser with standard configuration
     */
    public DefaultParser();
    
    /**
     * Creates a DefaultParser with the specified configuration
     * @param config TikaConfig instance for parser configuration
     */
    public DefaultParser(TikaConfig config);
}

AutoDetectParserConfig

Configuration class for customizing AutoDetectParser behavior with various parsing options and limits.

/**
 * Configuration options for AutoDetectParser
 */
public class AutoDetectParserConfig {
    /**
     * Creates default configuration
     */
    public AutoDetectParserConfig();
    
    /**
     * Sets the maximum string length for text extraction
     * @param maxStringLength Maximum length in characters
     */
    public void setMaxStringLength(int maxStringLength);
    
    /**
     * Gets the maximum string length for text extraction
     * @return Maximum length in characters
     */
    public int getMaxStringLength();
}

AbstractParser

Base class for parser implementations providing common functionality and utilities for custom parser development.

/**
 * Abstract base class for parser implementations
 */
public abstract class AbstractParser implements Parser {
    /**
     * Gets the supported types for this parser
     * @param context Parse context for configuration
     * @return Set of supported MediaType objects
     */
    public abstract Set<MediaType> getSupportedTypes(ParseContext context);
    
    /**
     * Parses the document with the given parameters
     * @param stream Input stream containing the document
     * @param handler Content handler to receive parsed content
     * @param metadata Metadata object to populate
     * @param context Parse context for configuration
     */
    public abstract void parse(InputStream stream, ContentHandler handler, 
                              Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
}

Advanced Parsing Features

Parse Limits and Resource Management

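import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;

// Cap extracted text at 100,000 characters by wrapping the content handler
BodyContentHandler handler = new BodyContentHandler(new WriteOutContentHandler(100000));

// Register the extractor used for embedded documents (attachments, archive entries)
ParseContext context = new ParseContext();
context.set(EmbeddedDocumentExtractor.class, new ParsingEmbeddedDocumentExtractor(context));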
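
To make the idea of a write limit concrete, here is a library-free sketch of a text collector that accepts characters up to a fixed limit and then signals that the limit was reached. LimitedTextCollector and LimitReachedException are hypothetical names. The sketch illustrates the general technique and is not the implementation of WriteOutContentHandler.

```java
// Hypothetical stand-in for a write-limited text sink; not Tika's WriteOutContentHandler.
final class LimitedTextCollector {
    // Thrown once more text arrives than the configured limit allows.
    static final class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int limit;

    LimitedTextCollector(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        this.limit = limit;
    }

    // Append a chunk; keep only what fits, then signal that the limit was hit.
    void append(String chunk) {
        int remaining = limit - text.length();
        if (chunk.length() > remaining) {
            text.append(chunk, 0, remaining);
            throw new LimitReachedException(limit);
        }
        text.append(chunk);
    }

    // Text collected so far, never longer than the limit.
    String text() {
        return text.toString();
    }
}
```

A caller would typically catch the exception, keep the truncated text collected so far, and stop feeding the sink, which bounds memory use even for very large documents.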

Custom Parser Integration

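import java.util.HashMap;
import java.util.Map;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.Parser;

// Map a custom media type to its parser (CustomParser is a placeholder for your own Parser)
Map<MediaType, Parser> parsers = new HashMap<>();
parsers.put(MediaType.parse("application/custom"), new CustomParser());
CompositeParser compositeParser = new CompositeParser();
compositeParser.setParsers(parsers);
// Hand the composite to AutoDetectParser so detected types are routed to it
AutoDetectParser parser = new AutoDetectParser(compositeParser);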

Error Handling

Common exceptions thrown during parsing operations:

TikaException: General parsing errors
EncryptedDocumentException: Document is password-protected
UnsupportedFormatException: No parser available for document format
CorruptedFileException: Document structure is corrupted
WriteLimitReachedException: Content extraction limit exceeded

Install with Tessl CLI

npx tessl i tessl/maven-org-apache-tika--tika-core
