tessl/maven-org-apache-tika--tika-core

Apache Tika Core provides the foundational APIs for detecting and extracting metadata and structured text content from various document formats.

- **Workspace**: tessl
- **Visibility**: Public
- **Describes**: mavenpkg:maven/org.apache.tika/tika-core@3.2.x

To install, run

npx @tessl/cli install tessl/maven-org-apache-tika--tika-core@3.2.0

# Apache Tika Core

Apache Tika Core is the base library of the Apache Tika toolkit. It detects document types and extracts metadata and structured text content from many document formats. Every other Tika module builds on it: it defines the core APIs, interfaces, and architectural components for document processing, content type identification, metadata handling, and content extraction.

## Package Information

- **Package Name**: tika-core (org.apache.tika:tika-core)
- **Package Type**: maven
- **Language**: Java
- **Installation**: Add to Maven dependencies:

  ```xml
  <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>3.2.2</version>
  </dependency>
  ```

- **Gradle**: `implementation 'org.apache.tika:tika-core:3.2.2'`

## Core Imports

```java
import org.apache.tika.Tika;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.config.TikaConfig;
```

## Basic Usage

```java
import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class BasicUsage {
    // The Tika calls below throw checked exceptions (IOException, TikaException)
    public static void main(String[] args) throws Exception {
        // Simple facade usage
        Tika tika = new Tika();

        // Detect content type
        String mimeType = tika.detect(new File("document.pdf"));
        System.out.println("MIME type: " + mimeType);

        // Extract text content
        String text = tika.parseToString(new File("document.pdf"));
        System.out.println("Extracted text: " + text);

        // Parse with metadata extraction
        try (InputStream stream = new FileInputStream("document.pdf")) {
            Metadata metadata = new Metadata();
            String content = tika.parseToString(stream, metadata);

            // Access metadata through the Dublin Core properties
            // ("dc:title" and "dc:creator")
            String title = metadata.get(TikaCoreProperties.TITLE);
            String author = metadata.get(TikaCoreProperties.CREATOR);
        }
    }
}
```

## Architecture

Apache Tika Core is built around several key architectural components:

- **Tika Facade**: The `org.apache.tika.Tika` class provides simplified access to all Tika functionality
- **Parser Framework**: `Parser` interface and implementations for document parsing
- **Detection System**: `Detector` interface and implementations for content type detection
- **Metadata System**: `Metadata` class and property interfaces for document metadata
- **Content Processing**: SAX-based content handlers for text extraction and processing
- **Configuration**: `TikaConfig` for advanced setup and service loading
- **I/O Utilities**: Enhanced streams and utilities for efficient document processing

## Capabilities

### Document Parsing

The `Parser` interface is the core parsing abstraction, and `AutoDetectParser` adds automatic format detection. Parsing turns a document into structured content and extracts its metadata.

```java { .api }
public interface Parser {
    void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException;
    Set<MediaType> getSupportedTypes(ParseContext context);
}

public class AutoDetectParser implements Parser {
    public AutoDetectParser();
    public AutoDetectParser(TikaConfig config);
    public void setFallback(Parser fallback);
    public Parser getFallback();
}
```

[Document Parsing](./parsing.md)

### Content Type Detection

Detectors identify a document's format and MIME type. Detection strategies include magic numbers, file extensions, and neural network models.

```java { .api }
public interface Detector {
    MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

public class DefaultDetector extends CompositeDetector {
    public DefaultDetector();
    public DefaultDetector(MimeTypes types);
}
```

[Content Type Detection](./detection.md)
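
To make the magic-number strategy concrete, here is a minimal JDK-only sketch. `MagicSniffSketch` and its two-entry signature table are hypothetical and are not part of Tika; the sketch only shows the general idea of peeking at a stream's leading bytes and then resetting it so the stream can still be read from the start afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that illustrates magic-number sniffing; not a Tika class. */
final class MagicSniffSketch {

    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    /** Peeks at up to 8 leading bytes, then resets the stream to where it started. */
    static String sniff(InputStream input) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        input.mark(8);
        byte[] head;
        try {
            head = input.readNBytes(8);
        } finally {
            input.reset();
        }
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream"; // generic fallback for unknown content
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "%PDF-1.7 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(sample)));
    }
}
```

A real detector would combine many more signatures with name-based hints; the two entries here are only enough to show the mark/reset discipline.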

### Metadata Management

The `Metadata` class stores and manipulates extracted document properties, and supports both standard metadata schemas and custom properties.

```java { .api }
public class Metadata implements Serializable {
    public String get(String name);
    public String[] getValues(String name);
    public void set(String name, String value);
    public void add(String name, String value);
    public void remove(String name);
    public String[] names();
}
```

[Metadata Management](./metadata.md)
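
The listing pairs `set` with `add` and `get` with `getValues`, which suggests a multi-valued store. The sketch below models that reading with plain JDK collections. `MultiValuedMetadataSketch` is a hypothetical stand-in, not Tika's `Metadata`, and the semantics it encodes (replace on `set`, append on `add`, first value from `get`) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued metadata store; not Tika's Metadata class. */
final class MultiValuedMetadataSketch {

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Replaces any existing values for the name with a single value. */
    void set(String name, String value) {
        List<String> list = new ArrayList<>();
        list.add(value);
        values.put(name, list);
    }

    /** Appends a value and keeps the earlier ones. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value, or null when the name is absent. */
    String get(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(0);
    }

    /** Returns all values in insertion order, or an empty array when the name is absent. */
    String[] getValues(String name) {
        return values.getOrDefault(name, List.of()).toArray(new String[0]);
    }

    /** Returns the stored names in insertion order. */
    String[] names() {
        return values.keySet().toArray(new String[0]);
    }
}
```

Under these assumptions, calling `add("dc:creator", ...)` twice leaves two values, while a later `set` collapses them to one.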

### Content Processing

SAX-based content handlers extract, transform, and process document content, with support for several output formats and specialized processing needs.

```java { .api }
public class BodyContentHandler extends WriteOutContentHandler {
    public BodyContentHandler();
    public BodyContentHandler(Writer writer);
    public BodyContentHandler(int writeLimit);
}

public class ToXMLContentHandler extends ContentHandlerDecorator {
    public ToXMLContentHandler();
    public ToXMLContentHandler(ContentHandler handler, String encoding);
}
```

[Content Processing](./content-processing.md)
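
As a rough illustration of what "SAX-based" means here, the following JDK-only sketch collects the character data found inside a `<body>` element of an XHTML event stream. `BodyTextSketch` is hypothetical and much simpler than the handlers listed above; it only shows how a handler reacts to SAX callbacks.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that keeps only text inside <body>; not Tika's BodyContentHandler. */
final class BodyTextSketch extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // 0 means "outside <body>"

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }

    /** Parses an XHTML string with the JDK SAX parser and returns the body text. */
    static String extract(String xhtml) throws Exception {
        BodyTextSketch handler = new BodyTextSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }
}
```

With Tika, a parser rather than the JDK XML parser would produce the events, but a handler consumes them through the same `ContentHandler` callbacks.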

### MIME Type System

A MIME type registry and media type handling, with support for type relationships, detection patterns, and custom type definitions.

```java { .api }
public final class MediaType implements Serializable {
    public static MediaType parse(String type);
    public String getType();
    public String getSubtype();
    public String toString();
}

public class MimeTypes {
    public static MimeTypes getDefaultMimeTypes();
    public MediaType detect(InputStream input, String name) throws IOException;
    public MimeType forName(String name) throws MimeTypeException;
}
```

[MIME Type System](./mime-types.md)

### Configuration and Service Loading

Configuration management with support for custom parsers, detectors, and service loading, including parameter configuration and initialization handling.

```java { .api }
public class TikaConfig {
    public static TikaConfig getDefaultConfig();
    public Parser getParser();
    public Detector getDetector();
    public Translator getTranslator();
}
```

[Configuration](./configuration.md)

### Language Processing

Language detection and translation: identify a document's language and translate text content through pluggable translator implementations.

```java { .api }
public class LanguageIdentifier {
    public LanguageIdentifier(String text);
    public String getLanguage();
    public boolean isReasonablyCertain();
}

public interface Translator {
    String translate(String text, String sourceLanguage, String targetLanguage)
        throws TikaException, IOException;
    boolean isAvailable();
}
```

[Language Processing](./language.md)

### Batch Processing (Pipes)

A batch processing framework built on the Fetcher/Emitter pattern for scalable document processing, with support for asynchronous operation and error handling.

```java { .api }
public interface Fetcher<T extends FetchKey> {
    InputStream fetch(String fetchKey, Metadata metadata) throws IOException, TikaException;
    String getName();
}

public interface Emitter {
    void emit(String emitKey, List<Metadata> metadataList) throws IOException, TikaException;
    String getName();
}
```

[Batch Processing](./pipes.md)
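
The Fetcher/Emitter pattern can be sketched without Tika as a fetch, process, emit loop. Everything in the block below is hypothetical: `SimpleFetcher` and `SimpleEmitter` are simplified stand-ins for the interfaces above, upper-casing stands in for parsing, and recording per-key failures is one assumed way to keep a single bad input from stopping a batch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical fetch -> process -> emit loop; not Tika's pipes implementation. */
final class FetchEmitSketch {

    /** Simplified stand-in for a fetcher: resolves a key to a stream. */
    interface SimpleFetcher {
        InputStream fetch(String fetchKey) throws IOException;
    }

    /** Simplified stand-in for an emitter: stores the processed result under a key. */
    interface SimpleEmitter {
        void emit(String emitKey, String content) throws IOException;
    }

    /** Processes every key and returns a map of key to error message for the ones that failed. */
    static Map<String, String> run(List<String> keys, SimpleFetcher fetcher, SimpleEmitter emitter) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String key : keys) {
            try (InputStream in = fetcher.fetch(key)) {
                String content = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                // Upper-casing is a placeholder for real document parsing
                emitter.emit(key, content.toUpperCase(Locale.ROOT));
            } catch (IOException e) {
                failures.put(key, e.getMessage());
            }
        }
        return failures;
    }
}
```

Keeping fetching and emitting behind separate interfaces lets the same loop read from one storage system and write to another.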

### Exception Handling

An exception hierarchy for error conditions in document processing, with specific exceptions for encryption, corruption, and format issues.

```java { .api }
public class TikaException extends Exception {
    public TikaException(String message);
    public TikaException(String message, Throwable cause);
}

public class EncryptedDocumentException extends TikaException;
public class UnsupportedFormatException extends TikaException;
public class CorruptedFileException extends TikaException;
```

[Exception Handling](./exceptions.md)

### Document Rendering

Framework for rendering documents into visual representations such as images, with support for page-based rendering and custom render requests.

```java { .api }
public interface Renderer extends Serializable {
    Set<MediaType> getSupportedTypes(ParseContext context);
    RenderResults render(InputStream is, Metadata metadata, ParseContext parseContext,
                         RenderRequest... requests) throws IOException, TikaException;
}

public class CompositeRenderer implements Renderer, Initializable {
    public CompositeRenderer(ServiceLoader serviceLoader);
    public CompositeRenderer(List<Renderer> renderers);
}
```

[Document Rendering](./rendering.md)

### Embedded Document Extraction

Framework for extracting embedded documents and resources from container formats with support for selective extraction and custom processing strategies.

```java { .api }
public interface EmbeddedDocumentExtractor {
    boolean shouldParseEmbedded(Metadata metadata);
    void parseEmbedded(InputStream stream, ContentHandler handler,
                       Metadata metadata, boolean outputHtml) throws SAXException, IOException;
}

public interface ContainerExtractor extends Serializable {
    boolean isSupported(TikaInputStream input) throws IOException;
    void extract(TikaInputStream stream, ContainerExtractor recurseExtractor,
                 EmbeddedResourceHandler handler) throws IOException, TikaException;
}
```

[Embedded Document Extraction](./embedded-extraction.md)

279

280

### Document Embedding

281

282

Framework for embedding metadata into documents, allowing modification and insertion of metadata properties into existing files.

283

284

```java { .api }

285

public interface Embedder extends Serializable {

286

Set<MediaType> getSupportedEmbedTypes(ParseContext context);

287

void embed(Metadata metadata, InputStream originalStream, OutputStream outputStream,

288

ParseContext context) throws IOException, TikaException;

289

}

290

291

public class ExternalEmbedder implements Embedder {

292

public void setCommand(String... command);

293

public void setMetadataCommandArguments(Map<Property, String[]> arguments);

294

}

295

```

296

297

[Document Embedding](./embedding.md)

298

299

### Process Forking Infrastructure

300

301

Advanced infrastructure for running document parsing operations in separate JVM processes to provide isolation, memory management, and fault tolerance.

302

303

```java { .api }

304

public class ForkParser implements Parser, Closeable {

305

public ForkParser();

306

public ForkParser(ClassLoader loader, Parser parser);

307

public void setPoolSize(int poolSize);

308

public void setServerParseTimeoutMillis(long serverParseTimeoutMillis);

309

}

310

311

public interface ForkResource {

312

Throwable process(DataInputStream input, DataOutputStream output) throws IOException;

313

}

314

```

315

316

[Process Forking Infrastructure](./process-forking.md)

### I/O and Utilities

Enhanced I/O streams and utility classes for efficient document processing with support for temporary resources, bounded streams, and system integration.

```java { .api }
public class TikaInputStream extends TaggedInputStream {
    public static TikaInputStream get(InputStream stream);
    public static TikaInputStream get(File file);
    public static TikaInputStream get(Path path);
    public static TikaInputStream get(URL url);
    public boolean hasFile();
    public File getFile() throws IOException;
}
```

[I/O and Utilities](./io-utilities.md)
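
The pairing of `hasFile()` with `getFile()` hints at how temporary resources come into play: a stream that is not backed by a file can be spooled to a temporary one when a file is needed. The JDK-only sketch below shows that idea under stated assumptions. `SpoolSketch` and `spoolToTempFile` are hypothetical names, and leaving cleanup to the caller is a simplification made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper that spools a stream to a temporary file; not a Tika class. */
final class SpoolSketch {

    /** Copies the stream into a new temporary file and returns its path; the caller deletes it. */
    static Path spoolToTempFile(InputStream input) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            // createTempFile already created the file, so the copy must overwrite it
            Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
    }
}
```

A caller would typically wrap the returned path in a try/finally block and call `Files.deleteIfExists` once the file is no longer needed.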