tessl/maven-org-apache-tika--tika-core

Apache Tika Core provides the foundational APIs for detecting and extracting metadata and structured text content from various document formats.

Content Type Detection

Tika's detection system identifies document formats and MIME types. It supports several strategies: magic numbers, resource names and file extensions, content type hints, neural network models, and composite detectors that combine them.

Capabilities

Detector Interface

The fundamental interface for content type detection, providing the contract for identifying document formats from input streams and metadata.

/**
 * Interface for detecting the media type of documents
 */
public interface Detector {
    /**
     * Detects the media type of the given document
     * @param input Input stream containing document data (may be null)
     * @param metadata Metadata containing hints like filename or content type
     * @return MediaType representing the detected content type
     * @throws IOException If an I/O error occurs during detection
     */
    MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

DefaultDetector

The primary detector implementation that combines multiple detection strategies in a layered approach for robust content type identification.

/**
 * Default composite detector combining multiple detection strategies
 */
public class DefaultDetector extends CompositeDetector {
    /**
     * Creates a DefaultDetector with standard detection strategies
     */
    public DefaultDetector();
    
    /**
     * Creates a DefaultDetector with custom MIME types registry
     * @param types MimeTypes registry for magic number detection
     */
    public DefaultDetector(MimeTypes types);
    
    /**
     * Creates a DefaultDetector with custom class loader for service discovery
     * @param loader ClassLoader for discovering detector services
     */
    public DefaultDetector(ClassLoader loader);
    
    /**
     * Creates a DefaultDetector with custom types and class loader
     * @param types MimeTypes registry for magic number detection
     * @param loader ClassLoader for discovering detector services
     */
    public DefaultDetector(MimeTypes types, ClassLoader loader);
}

Usage Examples:

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

// Basic content type detection
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();
// The resource name is only a hint. On Tika 1.x the constant is Metadata.RESOURCE_NAME_KEY.
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "document.pdf");

// Detectors mark and reset the stream. FileInputStream does not support
// marks, so wrap it in a BufferedInputStream before detection.
try (InputStream stream = new BufferedInputStream(new FileInputStream("document.pdf"))) {
    MediaType mediaType = detector.detect(stream, metadata);
    System.out.println("Detected type: " + mediaType);
}

// Detection from filename only
Metadata filenameMetadata = new Metadata();
filenameMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "spreadsheet.xlsx");
MediaType typeFromName = detector.detect(null, filenameMetadata);

CompositeDetector

A detector that combines multiple detection strategies, allowing for layered detection approaches with fallback mechanisms.

/**
 * Detector that combines multiple detection strategies
 */
public class CompositeDetector implements Detector {
    /**
     * Creates a CompositeDetector with the specified detectors
     * @param detectors List of detectors to combine, applied in order
     */
    public CompositeDetector(List<Detector> detectors);
    
    /**
     * Creates a CompositeDetector with the specified detectors
     * @param detectors Array of detectors to combine, applied in order
     */
    public CompositeDetector(Detector... detectors);
    
    /**
     * Gets the list of detectors used by this composite
     * @return List of Detector instances in application order
     */
    public List<Detector> getDetectors();
}
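
The fallback idea behind a composite can be sketched with plain JDK types. This is a simplified illustration, not Tika's implementation: `CompositeSketch` and its string-based detectors are hypothetical names, and the rule used here (the first non-generic answer wins) is an assumption chosen for brevity.

```java
import java.util.List;
import java.util.function.Function;

public class CompositeSketch {
    /** Generic fallback type reported when no detector has a better answer. */
    public static final String GENERIC = "application/octet-stream";

    /**
     * Runs the detectors in order and returns the first answer that is
     * neither null nor the generic fallback type.
     */
    public static String detect(List<Function<String, String>> detectors, String resourceName) {
        for (Function<String, String> detector : detectors) {
            String type = detector.apply(resourceName);
            if (type != null && !GENERIC.equals(type)) {
                return type;
            }
        }
        return GENERIC;
    }
}
```

Ordering matters in a chain like this: detectors placed earlier take priority, so the most trustworthy strategy usually goes first.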

TypeDetector

A detector that trusts the content type hint supplied in the input metadata, useful when the caller already knows the declared type (for example from an HTTP Content-Type header).

/**
 * Detector based on the content type hint in the metadata
 */
public class TypeDetector implements Detector {
    /**
     * Creates a TypeDetector
     */
    public TypeDetector();
    
    /**
     * Detects media type from the Content-Type metadata entry
     * @param input Input stream (ignored by this detector)
     * @param metadata Metadata containing a content type hint
     * @return MediaType parsed from the hint, or OCTET_STREAM if the hint is missing or invalid
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

NameDetector

A filename-based detector that matches the resource name against a set of regular expression patterns supplied by the caller.

/**
 * Detector based on resource name patterns
 */
public class NameDetector implements Detector {
    /**
     * Creates a NameDetector from a set of name patterns
     * @param patterns Map from regular expression patterns to media types
     */
    public NameDetector(Map<Pattern, MediaType> patterns);
    
    /**
     * Detects media type based on filename patterns
     * @param input Input stream (not used by this detector)
     * @param metadata Metadata containing the resource name
     * @return MediaType of a matching pattern, or OCTET_STREAM if none match
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
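
Name-based detection amounts to a pattern lookup on the last path segment. The sketch below uses only the JDK; `NamePatternSketch`, its `add` helper, and the string media types are hypothetical and stand in for Tika's own types.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class NamePatternSketch {
    // Insertion order is preserved so earlier patterns are tried first.
    private final Map<Pattern, String> patterns = new LinkedHashMap<>();

    /** Registers a case-insensitive regular expression for a media type. */
    public NamePatternSketch add(String regex, String mediaType) {
        patterns.put(Pattern.compile(regex, Pattern.CASE_INSENSITIVE), mediaType);
        return this;
    }

    /** Returns the type of the first pattern matching the file name, or the generic type. */
    public String detect(String resourceName) {
        if (resourceName == null) {
            return "application/octet-stream";
        }
        // Strip any directory part so patterns only see the file name.
        int slash = Math.max(resourceName.lastIndexOf('/'), resourceName.lastIndexOf('\\'));
        String name = resourceName.substring(slash + 1);
        for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
            if (entry.getKey().matcher(name).matches()) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }
}
```

A name is only a hint: a file called `report.pdf` may contain anything, which is why name-based results are usually combined with content-based checks.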

TextDetector

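A detector that checks whether the beginning of a stream looks like plain text, based on the absence of ASCII control bytes. It does not identify character encodings; see EncodingDetector for that.

/**
 * Detector for identifying plain text content
 */
public class TextDetector implements Detector {
    /**
     * Creates a TextDetector with default configuration
     */
    public TextDetector();
    
    /**
     * Creates a TextDetector that examines a custom number of bytes
     * @param bytesToTest Number of bytes to examine from the start of the stream
     */
    public TextDetector(int bytesToTest);
    
    /**
     * Detects whether the content looks like plain text
     * @param input Input stream containing potential text data
     * @param metadata Metadata (ignored by this detector)
     * @return text/plain if the content looks like text, otherwise OCTET_STREAM
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}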
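
A control-byte check of this kind can be sketched in a few lines. The version below is deliberately stricter and simpler than a production heuristic: the set of allowed control bytes and the treatment of empty input are assumptions of this sketch, and `TextHeuristicSketch` is a hypothetical name.

```java
public class TextHeuristicSketch {
    /**
     * Returns true if the first {@code length} bytes contain no ASCII control
     * bytes other than tab, line feed, carriage return, form feed, and escape.
     * Empty input is reported as not text in this sketch.
     */
    public static boolean looksLikeText(byte[] prefix, int length) {
        if (length == 0) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            int b = prefix[i] & 0xFF; // treat the byte as unsigned
            boolean allowed = b == '\t' || b == '\n' || b == '\r' || b == 0x0C || b == 0x1B;
            if (b < 0x20 && !allowed) {
                return false;
            }
        }
        return true;
    }
}
```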

MagicDetector

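A detector that matches a magic number pattern or byte signature near the start of the stream. Each instance recognizes a single media type; to check all known signatures at once, use the MimeTypes registry (for example through DefaultDetector).

/**
 * Detector that matches one byte signature for one media type
 */
public class MagicDetector implements Detector {
    /**
     * Creates a detector for input that starts with the given pattern
     * @param type Media type to report on a match
     * @param pattern Byte pattern expected at the beginning of the stream
     */
    public MagicDetector(MediaType type, byte[] pattern);
    
    /**
     * Creates a detector for input that contains the pattern at a fixed offset
     * @param type Media type to report on a match
     * @param pattern Byte pattern to look for
     * @param offset Offset of the pattern from the beginning of the stream
     */
    public MagicDetector(MediaType type, byte[] pattern, int offset);
    
    /**
     * Detects media type by matching the pattern against the stream prefix
     * @param input Input stream to analyze (must support mark/reset)
     * @param metadata Metadata (ignored by this detector)
     * @return The configured MediaType on a match, or OCTET_STREAM otherwise
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}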
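
Signature matching itself is a small operation: mark the stream, read the prefix, compare, and reset. The following sketch shows that sequence with JDK classes only. `MagicSketch.matches` is a hypothetical helper, and wrapping `IOException` in `UncheckedIOException` is a convenience of the sketch rather than a recommendation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class MagicSketch {
    /**
     * Returns true if the stream contains {@code pattern} at {@code offset}.
     * The stream position is restored before returning.
     */
    public static boolean matches(InputStream in, byte[] pattern, int offset) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        int needed = offset + pattern.length;
        try {
            in.mark(needed);
            byte[] buf = new byte[needed];
            int read = 0;
            // A single read may return fewer bytes than requested, so loop.
            while (read < needed) {
                int n = in.read(buf, read, needed - read);
                if (n < 0) {
                    break; // end of stream before the pattern could fit
                }
                read += n;
            }
            in.reset();
            if (read < needed) {
                return false;
            }
            return Arrays.equals(Arrays.copyOfRange(buf, offset, needed), pattern);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, a PDF file begins with the ASCII bytes `%PDF-`, so a call with that pattern at offset 0 returns true for PDF input and leaves the stream ready for a parser to read from the start.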

EncodingDetector Interface

Interface for character encoding detection, used to identify text encoding in documents and streams.

/**
 * Interface for detecting character encodings
 */
public interface EncodingDetector {
    /**
     * Detects the character encoding of the given text stream
     * @param input Input stream containing text data
     * @param metadata Metadata with encoding hints
     * @return Charset representing the detected encoding, or null if unknown
     * @throws IOException If an I/O error occurs during detection
     */
    Charset detect(InputStream input, Metadata metadata) throws IOException;
}
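
One simple strategy that fits this contract is byte order mark (BOM) sniffing, returning null when no BOM is present. The sketch below is a standalone illustration and does not implement the Tika interface: `BomSketch` is a hypothetical name, it works on a byte prefix rather than a stream, and it ignores UTF-32 (whose little-endian BOM begins with the same two bytes as UTF-16LE).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSketch {
    /** Returns the charset indicated by a leading BOM, or null if none is found. */
    public static Charset detect(byte[] prefix) {
        if (prefix.length >= 3
                && (prefix[0] & 0xFF) == 0xEF
                && (prefix[1] & 0xFF) == 0xBB
                && (prefix[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (prefix.length >= 2 && (prefix[0] & 0xFF) == 0xFE && (prefix[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (prefix.length >= 2 && (prefix[0] & 0xFF) == 0xFF && (prefix[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // unknown: let another strategy decide
    }
}
```

Returning null rather than a default lets a chain of strategies try the next one, which is the usual reason for a nullable result in this kind of interface.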

DefaultEncodingDetector

Default implementation of character encoding detection using multiple detection strategies.

/**
 * Default character encoding detector
 */
public class DefaultEncodingDetector implements EncodingDetector {
    /**
     * Creates a DefaultEncodingDetector with standard detection algorithms
     */
    public DefaultEncodingDetector();
    
    /**
     * Detects character encoding using multiple strategies
     * @param input Input stream containing text data
     * @param metadata Metadata containing encoding hints
     * @return Charset representing detected encoding
     */
    public Charset detect(InputStream input, Metadata metadata) throws IOException;
}

AutoDetectReader

A Reader implementation that automatically detects character encoding and provides transparent text access with proper encoding handling.

/**
 * Reader with automatic encoding detection
 */
public class AutoDetectReader extends Reader {
    /**
     * Creates an AutoDetectReader for the given input stream
     * @param input Input stream containing text data
     */
    public AutoDetectReader(InputStream input);
    
    /**
     * Creates an AutoDetectReader with custom encoding detector
     * @param input Input stream containing text data
     * @param detector EncodingDetector to use for encoding detection
     */
    public AutoDetectReader(InputStream input, EncodingDetector detector);
    
    /**
     * Creates an AutoDetectReader with metadata hints
     * @param input Input stream containing text data
     * @param metadata Metadata containing encoding hints
     */
    public AutoDetectReader(InputStream input, Metadata metadata);
    
    /**
     * Gets the detected character encoding
     * @return Charset representing the detected encoding
     */
    public Charset getCharset();
}

Neural Network Detection

Advanced detectors using machine learning models for content type identification.

/**
 * Interface for trained detection models
 */
public interface TrainedModel {
    /**
     * Predicts content type using the trained model
     * @param input Byte array containing document data
     * @return Probability distribution over content types
     */
    float[] predict(byte[] input);
    
    /**
     * Gets the content types supported by this model
     * @return Array of MediaType objects supported by the model
     */
    MediaType[] getSupportedTypes();
}

/**
 * Neural network-based trained model implementation
 */
public class NNTrainedModel implements TrainedModel {
    /**
     * Creates an NNTrainedModel from model data
     * @param modelData Byte array containing the trained model
     */
    public NNTrainedModel(byte[] modelData);
    
    /**
     * Loads a model from resources
     * @param modelPath Path to model resource
     * @return NNTrainedModel instance
     */
    public static NNTrainedModel loadFromResource(String modelPath);
}

/**
 * Detector using neural network models
 */
public class NNExampleModelDetector implements Detector {
    /**
     * Creates an NN detector with default model
     */
    public NNExampleModelDetector();
    
    /**
     * Creates an NN detector with custom model
     * @param model TrainedModel to use for detection
     */
    public NNExampleModelDetector(TrainedModel model);
}

Specialized Detectors

/**
 * Fallback detector that returns application/octet-stream for every document
 */
public class EmptyDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector that can override other detectors based on metadata
 */
public class OverrideDetector implements Detector {
    public OverrideDetector(Detector originalDetector);
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector for zero-byte files
 */
public class ZeroSizeFileDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector using system file command (Unix/Linux)
 */
public class FileCommandDetector implements Detector {
    public FileCommandDetector();
    public boolean isAvailable();
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

Text Analysis Utilities

/**
 * Byte-level statistics used to decide whether content looks like text
 */
public class TextStatistics {
    /**
     * Adds a block of bytes to the statistics
     * @param buffer Byte buffer containing the data
     * @param offset Start offset within the buffer
     * @param length Number of bytes to add
     */
    public void addData(byte[] buffer, int offset, int length);
    
    /**
     * Checks whether most of the bytes seen so far are plain ASCII
     * @return true if the content is mostly ASCII
     */
    public boolean isMostlyAscii();
    
    /**
     * Checks whether the byte distribution is consistent with UTF-8
     * @return true if the content looks like UTF-8 text
     */
    public boolean looksLikeUTF8();
    
    /**
     * Gets the total number of bytes seen so far
     * @return Number of bytes added
     */
    public int count();
}

Detection Strategies

Layered Detection Approach

The DefaultDetector uses a layered approach combining multiple strategies:

  1. Magic Number Detection: Analyzes byte patterns at file beginning
  2. Filename Extension: Uses file extension for type hints
  3. Content Analysis: Examines document structure and patterns
  4. Neural Network Models: Uses trained models for complex detection
  5. Metadata Hints: Considers existing content-type information

Custom Detection Configuration

// Create custom detector chain
List<Detector> detectors = Arrays.asList(
    MimeTypes.getDefaultMimeTypes(), // Magic numbers and filename globs from the default registry
    new TypeDetector(),              // Content-Type hint from the metadata
    new NNExampleModelDetector(),    // Use ML for ambiguous cases
    new EmptyDetector()              // Generic application/octet-stream fallback
);

CompositeDetector customDetector = new CompositeDetector(detectors);

Performance Considerations

  • Stream Buffering: Detectors typically read only the first few KB
  • Mark/Reset: Input streams must support mark/reset so detectors can rewind after reading the header; wrap plain streams such as FileInputStream in a BufferedInputStream
  • Caching: Detection results can be cached based on content hashes
  • Resource Management: Some detectors (like FileCommandDetector) use external processes
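
The buffering and mark/reset points above can be illustrated with a small JDK-only helper. `PeekSketch`, `ensureMarkSupported`, and `peek` are hypothetical names introduced for this sketch; the helper reads at most `limit` bytes and then rewinds, so later consumers still see the stream from the beginning.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class PeekSketch {
    /** Wraps the stream in a BufferedInputStream only if it cannot mark/reset itself. */
    public static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to {@code limit} bytes from the current position, then rewinds the stream. */
    public static byte[] peek(InputStream in, int limit) {
        try {
            in.mark(limit);
            byte[] buf = new byte[limit];
            int read = 0;
            while (read < limit) {
                int n = in.read(buf, read, limit - read);
                if (n < 0) {
                    break; // stream is shorter than the limit
                }
                read += n;
            }
            in.reset();
            return Arrays.copyOf(buf, read);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Callers should keep using the stream returned by `ensureMarkSupported` afterwards, because bytes already pulled into the buffer are no longer available from the original stream.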

Install with Tessl CLI

npx tessl i tessl/maven-org-apache-tika--tika-core
