Apache Tika Core provides the foundational APIs for detecting and extracting metadata and structured text content from various document formats.
npx @tessl/cli install tessl/maven-org-apache-tika--tika-core@3.2.00
# Apache Tika Core

Apache Tika Core is the foundational library of the Apache Tika toolkit, providing essential functionality for detecting and extracting metadata and structured text content from various document formats. As the base module from which all other Tika modules inherit functionality, it defines the core APIs, interfaces, and architectural components for document processing, content type identification, metadata handling, and content extraction.

## Package Information

- **Package Name**: tika-core (org.apache.tika:tika-core)
- **Package Type**: maven
- **Language**: Java
- **Installation**: Add to Maven dependencies:
  ```xml
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>3.2.2</version>
  </dependency>
  ```
- **Gradle**: `implementation 'org.apache.tika:tika-core:3.2.2'`

## Core Imports

```java
import org.apache.tika.Tika;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.config.TikaConfig;
```

## Basic Usage

```java
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

// Simple facade usage
Tika tika = new Tika();

// Detect content type
String mimeType = tika.detect(new File("document.pdf"));
System.out.println("MIME type: " + mimeType);

// Extract text content
String text = tika.parseToString(new File("document.pdf"));
System.out.println("Extracted text: " + text);

// Parse with metadata extraction
try (InputStream stream = new FileInputStream("document.pdf")) {
    Metadata metadata = new Metadata();
    String content = tika.parseToString(stream, metadata);

    // Access metadata (titles and authors are stored under Dublin Core keys)
    String title = metadata.get("dc:title");
    String author = metadata.get("dc:creator");
}
```

## Architecture

Apache Tika Core is built around several key architectural components:

- **Tika Facade**: The `org.apache.tika.Tika` class provides simplified access to all Tika functionality
- **Parser Framework**: `Parser` interface and implementations for document parsing
- **Detection System**: `Detector` interface and implementations for content type detection
- **Metadata System**: `Metadata` class and property interfaces for document metadata
- **Content Processing**: SAX-based content handlers for text extraction and processing
- **Configuration**: `TikaConfig` for advanced setup and service loading
- **I/O Utilities**: Enhanced streams and utilities for efficient document processing

## Capabilities

### Document Parsing

Core document parsing functionality using the Parser interface and AutoDetectParser for automatic format detection. Supports parsing of documents into structured content with metadata extraction.

```java { .api }
public interface Parser {
    void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
    Set<MediaType> getSupportedTypes(ParseContext context);
}

public class AutoDetectParser implements Parser {
    public AutoDetectParser();
    public AutoDetectParser(TikaConfig config);
    public void setFallback(Parser fallback);
    public Parser getFallback();
}
```

[Document Parsing](./parsing.md)
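
To show how the parser, content handler, metadata, and parse context fit together, here is a minimal sketch that parses an in-memory byte array. The class name `ParseSketch` and the helper `extractText` are hypothetical names used only for illustration. Note that `tika-core` on its own ships very few parser implementations, so the returned text may be empty unless a parser module (for example `tika-parsers-standard-package`) is also on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class ParseSketch {

    // Hypothetical helper: parses the bytes and returns the body text.
    // Properties found during parsing (such as Content-Type) end up in `metadata`.
    static String extractText(byte[] data, Metadata metadata)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        // -1 disables the default limit on the number of extracted characters
        BodyContentHandler handler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();
        // Registering the parser lets embedded documents be parsed recursively
        context.set(Parser.class, parser);
        try (InputStream stream = new ByteArrayInputStream(data)) {
            parser.parse(stream, handler, metadata, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        byte[] data = "Hello, Tika!".getBytes(StandardCharsets.UTF_8);
        String text = extractText(data, metadata);
        System.out.println("Content-Type: " + metadata.get("Content-Type"));
        System.out.println("Text: " + text);
    }
}
```

Passing the `Metadata` object in from the caller keeps the helper small while still exposing everything the parser recorded.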

### Content Type Detection

Detection system for identifying document formats and MIME types using various detection strategies including magic numbers, file extensions, and neural network models.

```java { .api }
public interface Detector {
    MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

public class DefaultDetector extends CompositeDetector {
    public DefaultDetector();
    public DefaultDetector(MimeTypes types);
}
```

[Content Type Detection](./detection.md)
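
As an illustration of combining magic-number and file-name hints, the following sketch runs `DefaultDetector` over a byte array and optionally supplies a file name through the metadata. `DetectSketch` and its `detect` helper are hypothetical names; the stream must support `mark`/`reset`, which `ByteArrayInputStream` does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;

public class DetectSketch {

    // Hypothetical helper: detects the media type from content bytes,
    // using the file name (if given) as an additional hint.
    static MediaType detect(byte[] data, String fileName) throws IOException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
        if (fileName != null) {
            metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, fileName);
        }
        // Detectors read a prefix of the stream and then reset it
        try (InputStream stream = new ByteArrayInputStream(data)) {
            return detector.detect(stream, metadata);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader, null));

        byte[] notes = "just some plain text".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(notes, "notes.txt"));
    }
}
```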

### Metadata Management

Comprehensive metadata system for extracting, storing, and manipulating document properties with support for standard metadata schemas and custom properties.

```java { .api }
public class Metadata implements Serializable {
    public String get(String name);
    public String[] getValues(String name);
    public void set(String name, String value);
    public void add(String name, String value);
    public void remove(String name);
    public String[] names();
}
```

[Metadata Management](./metadata.md)
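
The difference between `set` (replace) and `add` (append) is easiest to see in a small example. This sketch uses made-up `custom:` keys; `MetadataSketch` and `build` are hypothetical names, not part of the library.

```java
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;

public class MetadataSketch {

    // Hypothetical helper: builds a Metadata object with custom properties.
    static Metadata build() {
        Metadata metadata = new Metadata();
        // set() stores a single value, replacing anything already present
        metadata.set("custom:project", "alpha");
        // add() appends, so the property becomes multi-valued
        metadata.add("custom:reviewer", "Alice");
        metadata.add("custom:reviewer", "Bob");
        // remove() drops the property entirely
        metadata.set("custom:draft", "true");
        metadata.remove("custom:draft");
        return metadata;
    }

    public static void main(String[] args) {
        Metadata metadata = build();
        // get() returns the first value of a multi-valued property
        System.out.println(metadata.get("custom:reviewer"));
        System.out.println(Arrays.toString(metadata.getValues("custom:reviewer")));
        System.out.println(Arrays.toString(metadata.names()));
    }
}
```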

### Content Processing

SAX-based content handler system for extracting, transforming, and processing document content with support for various output formats and specialized processing needs.

```java { .api }
public class BodyContentHandler extends WriteOutContentHandler {
    public BodyContentHandler();
    public BodyContentHandler(Writer writer);
    public BodyContentHandler(int writeLimit);
}

public class ToXMLContentHandler extends ContentHandlerDecorator {
    public ToXMLContentHandler();
    public ToXMLContentHandler(ContentHandler handler, String encoding);
}
```

[Content Processing](./content-processing.md)
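
Because parsers emit XHTML as SAX events, swapping the handler changes the output format without touching the parsing code. The sketch below, with the hypothetical names `XhtmlSketch` and `toXhtml`, captures the full XHTML event stream as a string instead of plain body text. It assumes only the no-argument `ToXMLContentHandler` constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.SAXException;

public class XhtmlSketch {

    // Hypothetical helper: parses the bytes and returns the XHTML
    // that the parser emitted, serialized as a string.
    static String toXhtml(byte[] data) throws IOException, SAXException, TikaException {
        ToXMLContentHandler handler = new ToXMLContentHandler();
        try (InputStream stream = new ByteArrayInputStream(data)) {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "Hello, Tika!".getBytes(StandardCharsets.UTF_8);
        System.out.println(toXhtml(data));
    }
}
```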

### MIME Type System

Comprehensive MIME type registry and media type handling with support for type relationships, detection patterns, and custom type definitions.

```java { .api }
public final class MediaType implements Serializable {
    public static MediaType parse(String type);
    public String getType();
    public String getSubtype();
    public String toString();
}

public class MimeTypes {
    public static MimeTypes getDefaultMimeTypes();
    public MediaType detect(InputStream input, String name) throws IOException;
    public MimeType forName(String name) throws MimeTypeException;
}
```

[MIME Type System](./mime-types.md)
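
A short sketch of the two classes in use: `MediaType.parse` splits a type string into its parts, and the default `MimeTypes` registry maps a type name to its registered file extension. `MimeSketch`, `describe`, and `extensionFor` are hypothetical names chosen for this example.

```java
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.mime.MimeTypes;

public class MimeSketch {

    // Hypothetical helper: splits a media type string into type and subtype.
    static String describe(String raw) {
        // parse() returns null when the string is not a valid media type
        MediaType type = MediaType.parse(raw);
        if (type == null) {
            return "invalid";
        }
        return type.getType() + " / " + type.getSubtype();
    }

    // Hypothetical helper: looks up the preferred file extension for a type name.
    static String extensionFor(String typeName) throws MimeTypeException {
        MimeType mimeType = MimeTypes.getDefaultMimeTypes().forName(typeName);
        return mimeType.getExtension();
    }

    public static void main(String[] args) throws MimeTypeException {
        System.out.println(describe("text/html; charset=UTF-8"));
        System.out.println(describe("not a type"));
        System.out.println(extensionFor("application/pdf"));
    }
}
```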

### Configuration and Service Loading

Configuration management system with support for custom parsers, detectors, and service loading with parameter configuration and initialization handling.

```java { .api }
public class TikaConfig {
    public static TikaConfig getDefaultConfig();
    public Parser getParser();
    public Detector getDetector();
    public Translator getTranslator();
}
```

[Configuration](./configuration.md)
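
One common pattern is to load a `TikaConfig` once and share it between the facade and lower-level components, so they all use the same parsers and detectors. The following is a minimal sketch of that idea using the default configuration; `ConfigSketch` and `detectWithSharedConfig` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;

public class ConfigSketch {

    // Hypothetical helper: detects a type using a facade built from the default config.
    static String detectWithSharedConfig(byte[] prefix) {
        TikaConfig config = TikaConfig.getDefaultConfig();
        Tika tika = new Tika(config);
        return tika.detect(prefix);
    }

    public static void main(String[] args) {
        TikaConfig config = TikaConfig.getDefaultConfig();

        // The composite parser and detector assembled by the configuration
        Parser parser = config.getParser();
        Detector detector = config.getDetector();
        System.out.println(parser.getClass().getName());
        System.out.println(detector.getClass().getName());

        // The same configuration can back an AutoDetectParser
        AutoDetectParser autoDetect = new AutoDetectParser(config);
        System.out.println(autoDetect.getFallback().getClass().getName());

        byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectWithSharedConfig(pdfHeader));
    }
}
```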

### Language Processing

Language detection and translation capabilities for identifying document languages and translating text content with pluggable translator implementations.

```java { .api }
public class LanguageIdentifier {
    public LanguageIdentifier(String text);
    public String getLanguage();
    public boolean isReasonablyCertain();
}

public interface Translator {
    String translate(String text, String sourceLanguage, String targetLanguage)
        throws TikaException, IOException;
    boolean isAvailable();
}
```

[Language Processing](./language.md)

### Batch Processing (Pipes)

Enterprise-grade batch processing framework using the Fetcher/Emitter pattern for scalable document processing with support for async operations and error handling.

```java { .api }
public interface Fetcher<T extends FetchKey> {
    InputStream fetch(String fetchKey, Metadata metadata) throws IOException, TikaException;
    String getName();
}

public interface Emitter {
    void emit(String emitKey, List<Metadata> metadataList) throws IOException, TikaException;
    String getName();
}
```

[Batch Processing](./pipes.md)

### Exception Handling

Comprehensive exception hierarchy for handling various error conditions in document processing with specific exceptions for encryption, corruption, and format issues.

```java { .api }
public class TikaException extends Exception {
    public TikaException(String message);
    public TikaException(String message, Throwable cause);
}

public class EncryptedDocumentException extends TikaException;
public class UnsupportedFormatException extends TikaException;
public class CorruptedFileException extends TikaException;
```

[Exception Handling](./exceptions.md)
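
Since the specific exceptions all extend `TikaException`, they need to be caught before the general case. The sketch below shows one possible ordering; the class `ExceptionSketch`, the helper `tryExtract`, and the status strings it returns are hypothetical and only serve the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.EncryptedDocumentException;
import org.apache.tika.exception.TikaException;

public class ExceptionSketch {

    // Hypothetical helper: attempts extraction and maps failures to a status string.
    static String tryExtract(byte[] data) {
        Tika tika = new Tika();
        try (InputStream stream = new ByteArrayInputStream(data)) {
            tika.parseToString(stream);
            return "ok";
        } catch (EncryptedDocumentException e) {
            // Most specific case first: the document needs a password
            return "encrypted";
        } catch (TikaException e) {
            // Any other parsing failure
            return "parse-error";
        } catch (IOException e) {
            // The underlying stream could not be read
            return "io-error";
        }
    }

    public static void main(String[] args) {
        byte[] data = "plain text".getBytes(StandardCharsets.UTF_8);
        System.out.println(tryExtract(data));
    }
}
```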

### Document Rendering

Framework for rendering documents into visual representations such as images, with support for page-based rendering and custom render requests.

```java { .api }
public interface Renderer extends Serializable {
    Set<MediaType> getSupportedTypes(ParseContext context);
    RenderResults render(InputStream is, Metadata metadata, ParseContext parseContext,
                        RenderRequest... requests) throws IOException, TikaException;
}

public class CompositeRenderer implements Renderer, Initializable {
    public CompositeRenderer(ServiceLoader serviceLoader);
    public CompositeRenderer(List<Renderer> renderers);
}
```

[Document Rendering](./rendering.md)

### Embedded Document Extraction

Framework for extracting embedded documents and resources from container formats with support for selective extraction and custom processing strategies.

```java { .api }
public interface EmbeddedDocumentExtractor {
    boolean shouldParseEmbedded(Metadata metadata);
    void parseEmbedded(InputStream stream, ContentHandler handler,
                      Metadata metadata, boolean outputHtml) throws SAXException, IOException;
}

public interface ContainerExtractor extends Serializable {
    boolean isSupported(TikaInputStream input) throws IOException;
    void extract(TikaInputStream stream, ContainerExtractor recurseExtractor,
                EmbeddedResourceHandler handler) throws IOException, TikaException;
}
```

[Embedded Document Extraction](./embedded-extraction.md)

### Document Embedding

Framework for embedding metadata into documents, allowing modification and insertion of metadata properties into existing files.

```java { .api }
public interface Embedder extends Serializable {
    Set<MediaType> getSupportedEmbedTypes(ParseContext context);
    void embed(Metadata metadata, InputStream originalStream, OutputStream outputStream,
              ParseContext context) throws IOException, TikaException;
}

public class ExternalEmbedder implements Embedder {
    public void setCommand(String... command);
    public void setMetadataCommandArguments(Map<Property, String[]> arguments);
}
```

[Document Embedding](./embedding.md)

### Process Forking Infrastructure

Advanced infrastructure for running document parsing operations in separate JVM processes to provide isolation, memory management, and fault tolerance.

```java { .api }
public class ForkParser implements Parser, Closeable {
    public ForkParser();
    public ForkParser(ClassLoader loader, Parser parser);
    public void setPoolSize(int poolSize);
    public void setServerParseTimeoutMillis(long serverParseTimeoutMillis);
}

public interface ForkResource {
    Throwable process(DataInputStream input, DataOutputStream output) throws IOException;
}
```

[Process Forking Infrastructure](./process-forking.md)

### I/O and Utilities

Enhanced I/O streams and utility classes for efficient document processing with support for temporary resources, bounded streams, and system integration.

```java { .api }
public class TikaInputStream extends TaggedInputStream {
    public static TikaInputStream get(InputStream stream);
    public static TikaInputStream get(File file);
    public static TikaInputStream get(Path path);
    public static TikaInputStream get(URL url);
    public boolean hasFile();
    public File getFile() throws IOException;
}
```

[I/O and Utilities](./io-utilities.md)
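
The temporary-resource handling can be observed directly: a `TikaInputStream` wrapped around a plain stream has no backing file until `getFile()` is called, at which point the content is spooled to a temporary file that is cleaned up on close. The sketch below illustrates this under that assumption; `TikaInputStreamSketch` and `spool` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.tika.io.TikaInputStream;

public class TikaInputStreamSketch {

    // Hypothetical helper: returns {hasFile before spooling, hasFile after spooling,
    // whether the temporary file still exists once the stream is closed}.
    static boolean[] spool(byte[] data) throws IOException {
        boolean before;
        boolean after;
        File file;
        try (TikaInputStream tis = TikaInputStream.get(new ByteArrayInputStream(data))) {
            before = tis.hasFile();   // nothing spooled yet
            file = tis.getFile();     // spools the stream to a temporary file
            after = tis.hasFile();    // now backed by that file
        }
        // Closing the stream releases its temporary resources
        return new boolean[] {before, after, file.exists()};
    }

    public static void main(String[] args) throws IOException {
        boolean[] result = spool("some bytes".getBytes(StandardCharsets.UTF_8));
        System.out.println("hasFile before getFile(): " + result[0]);
        System.out.println("hasFile after getFile():  " + result[1]);
        System.out.println("temp file exists after close: " + result[2]);
    }
}
```

Closing the stream with try-with-resources matters here, because that is what triggers removal of the temporary file.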