# Content Type Detection

Detection system for identifying document formats and MIME types using various detection strategies including magic numbers, file extensions, neural network models, and composite detection approaches.

## Capabilities

### Detector Interface

The fundamental interface for content type detection, providing the contract for identifying document formats from input streams and metadata.

```java { .api }
/**
 * Interface for detecting the media type of documents
 */
public interface Detector {
    /**
     * Detects the media type of the given document
     * @param input Input stream containing document data (may be null)
     * @param metadata Metadata containing hints like filename or content type
     * @return MediaType representing the detected content type
     * @throws IOException If an I/O error occurs during detection
     */
    MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### DefaultDetector

The primary detector implementation that combines multiple detection strategies in a layered approach for robust content type identification.

```java { .api }
/**
 * Default composite detector combining multiple detection strategies
 */
public class DefaultDetector extends CompositeDetector {
    /**
     * Creates a DefaultDetector with standard detection strategies
     */
    public DefaultDetector();

    /**
     * Creates a DefaultDetector with custom MIME types registry
     * @param types MimeTypes registry for magic number detection
     */
    public DefaultDetector(MimeTypes types);

    /**
     * Creates a DefaultDetector with custom class loader for service discovery
     * @param loader ClassLoader for discovering detector services
     */
    public DefaultDetector(ClassLoader loader);

    /**
     * Creates a DefaultDetector with custom types and class loader
     * @param types MimeTypes registry for magic number detection
     * @param loader ClassLoader for discovering detector services
     */
    public DefaultDetector(MimeTypes types, ClassLoader loader);
}
```

**Usage Examples:**

```java
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

// Basic content type detection
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();
// Tika 2.x and later; in Tika 1.x the same key is Metadata.RESOURCE_NAME_KEY
metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "document.pdf");

// Detectors peek at the stream and rewind it, so the stream must support mark/reset
try (InputStream stream = new BufferedInputStream(new FileInputStream("document.pdf"))) {
    MediaType mediaType = detector.detect(stream, metadata);
    System.out.println("Detected type: " + mediaType);
}

// Detection from filename only
Metadata filenameMetadata = new Metadata();
filenameMetadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, "spreadsheet.xlsx");
MediaType typeFromName = detector.detect(null, filenameMetadata);
```

### CompositeDetector

A detector that combines multiple detection strategies, allowing for layered detection approaches with fallback mechanisms.

```java { .api }
/**
 * Detector that combines multiple detection strategies
 */
public class CompositeDetector implements Detector {
    /**
     * Creates a CompositeDetector with the specified detectors
     * @param detectors List of detectors to combine, applied in order
     */
    public CompositeDetector(List<Detector> detectors);

    /**
     * Creates a CompositeDetector with the specified detectors
     * @param detectors Array of detectors to combine, applied in order
     */
    public CompositeDetector(Detector... detectors);

    /**
     * Gets the list of detectors used by this composite
     * @return List of Detector instances in application order
     */
    public List<Detector> getDetectors();
}
```
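
The idea of applying detectors in order with a fallback can be pictured with a small standalone sketch. It does not use the library's classes: `SimpleDetector`, `FirstMatchComposite`, and the string media types are hypothetical stand-ins, and the rule shown (the first answer other than `application/octet-stream` wins) is a simplifying assumption, not a statement of how `CompositeDetector` resolves conflicting answers.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical composite: asks each detector in turn and keeps the first specific answer. */
final class FirstMatchComposite implements SimpleDetector {
    static final String FALLBACK = "application/octet-stream";
    private final List<SimpleDetector> detectors;

    FirstMatchComposite(SimpleDetector... detectors) {
        this.detectors = Arrays.asList(detectors);
    }

    @Override
    public String detect(byte[] prefix, String name) {
        for (SimpleDetector d : detectors) {
            String type = d.detect(prefix, name);
            if (!FALLBACK.equals(type)) {
                return type; // a specific answer ends the search
            }
        }
        return FALLBACK; // no detector recognised the input
    }

    public static void main(String[] args) {
        // Layer 1: signature check for "%PDF"; layer 2: extension check for ".txt"
        SimpleDetector magic = (p, n) ->
                p.length >= 4 && p[0] == '%' && p[1] == 'P' && p[2] == 'D' && p[3] == 'F'
                        ? "application/pdf" : FALLBACK;
        SimpleDetector byName = (p, n) ->
                n != null && n.endsWith(".txt") ? "text/plain" : FALLBACK;
        SimpleDetector composite = new FirstMatchComposite(magic, byName);
        System.out.println(composite.detect("%PDF-1.7".getBytes(), "report.txt")); // application/pdf
        System.out.println(composite.detect("hello".getBytes(), "notes.txt"));     // text/plain
    }
}

/** Hypothetical stand-in for a detector: maps a byte prefix and a name to a media type string. */
interface SimpleDetector {
    String detect(byte[] prefix, String name);
}
```

Putting the content-based check first means a misleading file extension cannot override a recognised signature, which is the usual reason for ordering layers this way.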

### TypeDetector

A detector that trusts a content type hint already present in the metadata (the `Content-Type` entry) and never reads the stream. It is useful when an upstream source, such as an HTTP response header, has already declared the type.

```java { .api }
/**
 * Detector based on a content type hint in the metadata
 */
public class TypeDetector implements Detector {
    /**
     * Creates a TypeDetector
     */
    public TypeDetector();

    /**
     * Detects media type from the Content-Type metadata hint
     * @param input Input stream (ignored by this detector)
     * @param metadata Metadata containing a Content-Type hint
     * @return MediaType parsed from the hint, or OCTET_STREAM if the hint is missing or invalid
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### NameDetector

A filename-based detector that matches the resource name in the metadata against a caller-supplied table of regular-expression patterns and the media types they indicate.

```java { .api }
/**
 * Detector based on filename patterns
 */
public class NameDetector implements Detector {
    /**
     * Creates a NameDetector from a pattern table
     * @param patterns Map from filename patterns to the media types they indicate
     */
    public NameDetector(Map<Pattern, MediaType> patterns);

    /**
     * Detects media type based on filename patterns
     * @param input Input stream (not used by this detector)
     * @param metadata Metadata containing filename or resource name
     * @return MediaType of the matching pattern, or OCTET_STREAM if none matches
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```
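
Pattern-based name matching can be illustrated without the library. The sketch below is a standalone approximation: `NamePatternSketch`, its pattern table, and the string media types are hypothetical, and the path handling is an assumption about what a name-based detector might reasonably do.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a resource name to a media type string via regex patterns. */
final class NamePatternSketch {
    static final String FALLBACK = "application/octet-stream";

    /** Returns the type of the first pattern that matches the file name, ignoring any directory part. */
    static String detect(Map<Pattern, String> patterns, String resourceName) {
        if (resourceName == null) {
            return FALLBACK;
        }
        // Keep only the last path segment so "/tmp/a.b/readme" is judged by "readme"
        int slash = Math.max(resourceName.lastIndexOf('/'), resourceName.lastIndexOf('\\'));
        String name = resourceName.substring(slash + 1);
        for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
            if (entry.getKey().matcher(name).matches()) {
                return entry.getValue();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        Map<Pattern, String> patterns = new LinkedHashMap<>(); // insertion order acts as priority
        patterns.put(Pattern.compile(".*\\.xlsx", Pattern.CASE_INSENSITIVE),
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        patterns.put(Pattern.compile("(?i)readme(\\..*)?"), "text/plain");
        System.out.println(detect(patterns, "/data/Budget.XLSX"));
        System.out.println(detect(patterns, "README"));
        System.out.println(detect(patterns, "archive.bin"));
    }
}
```

A `LinkedHashMap` is used so that more specific patterns can be listed first and are tried in that order.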

### TextDetector

A detector that inspects the beginning of the stream and reports `text/plain` when the bytes look like plain text, that is, when they contain no (or very few) control bytes. It does not identify the character encoding; see the encoding detectors below for that.

```java { .api }
/**
 * Detector for identifying plain text content
 */
public class TextDetector implements Detector {
    /**
     * Creates a TextDetector with default configuration
     */
    public TextDetector();

    /**
     * Detects whether the content looks like plain text
     * @param input Input stream containing potential text data
     * @param metadata Metadata with additional hints
     * @return text/plain if the content looks like text, otherwise OCTET_STREAM
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```
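
A control-byte heuristic of this kind can be sketched in a few lines. The class below is a standalone illustration, not the library's implementation: `PlainTextSniffer`, the 512-byte look-ahead, and the exact set of tolerated control bytes are all assumptions.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical heuristic: a prefix with no binary control bytes is treated as text. */
final class PlainTextSniffer {
    /** Assumed limit on how many leading bytes are inspected. */
    static final int PREFIX_LENGTH = 512;

    /** True when the inspected prefix is non-empty and free of binary control bytes. */
    static boolean looksLikeText(byte[] data) {
        int n = Math.min(data.length, PREFIX_LENGTH);
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF; // treat bytes as unsigned
            if (b < 0x20 && !isTextControl(b)) {
                return false; // NUL or another control byte that text rarely contains
            }
        }
        return n > 0;
    }

    /** Control bytes that routinely appear in text: tab, LF, FF, CR, ESC. */
    private static boolean isTextControl(int b) {
        return b == '\t' || b == '\n' || b == 0x0C || b == '\r' || b == 0x1B;
    }

    public static void main(String[] args) {
        byte[] csv = "name,qty\nbolt,4\n".getBytes(StandardCharsets.UTF_8);
        byte[] zipHeader = {'P', 'K', 3, 4, 0, 0};
        System.out.println(looksLikeText(csv));       // true
        System.out.println(looksLikeText(zipHeader)); // false
    }
}
```

Bytes above `0x7F` are accepted here so that UTF-8 and Latin-1 text passes; a stricter variant could cap the share of such bytes.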

### MagicDetector

A detector that looks for one byte signature (a "magic number") at a given offset in the stream and reports a single media type when the signature is found. Magic detection across the whole type registry is performed by `MimeTypes`, which itself implements `Detector`.

```java { .api }
/**
 * Detector using magic numbers and byte signatures
 */
public class MagicDetector implements Detector {
    /**
     * Creates a MagicDetector that looks for the pattern at the start of the stream
     * @param type MediaType to report when the pattern matches
     * @param pattern Byte signature to look for
     */
    public MagicDetector(MediaType type, byte[] pattern);

    /**
     * Creates a MagicDetector that looks for the pattern at a fixed offset
     * @param type MediaType to report when the pattern matches
     * @param pattern Byte signature to look for
     * @param offset Offset of the pattern from the start of the stream
     */
    public MagicDetector(MediaType type, byte[] pattern, int offset);

    /**
     * Detects media type using magic number analysis
     * @param input Input stream to analyze for the magic pattern (must support mark/reset)
     * @param metadata Metadata (ignored by this detector)
     * @return The configured MediaType if the pattern matches, otherwise OCTET_STREAM
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```
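
Signature matching depends on reading a few bytes and then rewinding the stream so later consumers see it from the start. The following standalone sketch shows that mark/read/reset cycle using only `java.io`; `SignatureSniffer` and its method are hypothetical names, not part of the library.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: tests for a byte signature without consuming the stream. */
final class SignatureSniffer {
    /** Checks whether the stream holds the signature at the given offset, then rewinds the stream. */
    static boolean matches(InputStream in, byte[] signature, int offset) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        int needed = offset + signature.length;
        in.mark(needed);
        try {
            byte[] head = new byte[needed];
            int read = 0;
            while (read < needed) { // read() may return fewer bytes than requested
                int r = in.read(head, read, needed - read);
                if (r < 0) {
                    return false; // stream ended before the signature could fit
                }
                read += r;
            }
            for (int i = 0; i < signature.length; i++) {
                if (head[offset + i] != signature[i]) {
                    return false;
                }
            }
            return true;
        } finally {
            in.reset(); // leave the stream where the caller had it
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfMagic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] doc = "%PDF-1.7 sample".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));
        System.out.println(matches(in, pdfMagic, 0)); // true
        System.out.println((char) in.read());         // '%', because the stream was rewound
    }
}
```

The `finally` block guarantees the rewind on every path, including a mismatch or a short stream.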

### EncodingDetector Interface

Interface for character encoding detection, used to identify text encoding in documents and streams.

```java { .api }
/**
 * Interface for detecting character encodings
 */
public interface EncodingDetector {
    /**
     * Detects the character encoding of the given text stream
     * @param input Input stream containing text data
     * @param metadata Metadata with encoding hints
     * @return Charset representing the detected encoding, or null if unknown
     * @throws IOException If an I/O error occurs during detection
     */
    Charset detect(InputStream input, Metadata metadata) throws IOException;
}
```
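
One of the simplest encoding strategies is to look for a byte order mark (BOM). The sketch below is a standalone illustration of that single strategy and follows the "null if unknown" convention of the interface above; `BomSniffer` is a hypothetical name, and UTF-32 BOMs are deliberately left out to keep the example short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: recognises UTF-8 and UTF-16 byte order marks. */
final class BomSniffer {
    /** Returns the charset announced by a byte order mark, or null when no BOM is present. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE; // note: a UTF-32LE BOM also begins with FF FE
        }
        return null; // unknown: let another strategy decide
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(withBom));                                         // UTF-8
        System.out.println(detect("plain".getBytes(StandardCharsets.US_ASCII)));     // null
    }
}
```

Returning `null` rather than a default lets a caller chain several strategies and stop at the first one that gives an answer.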

### DefaultEncodingDetector

Default implementation of character encoding detection using multiple detection strategies.

```java { .api }
/**
 * Default character encoding detector
 */
public class DefaultEncodingDetector implements EncodingDetector {
    /**
     * Creates a DefaultEncodingDetector with standard detection algorithms
     */
    public DefaultEncodingDetector();

    /**
     * Detects character encoding using multiple strategies
     * @param input Input stream containing text data
     * @param metadata Metadata containing encoding hints
     * @return Charset representing detected encoding
     */
    public Charset detect(InputStream input, Metadata metadata) throws IOException;
}
```

### AutoDetectReader

A Reader implementation that automatically detects character encoding and provides transparent text access with proper encoding handling.

```java { .api }
/**
 * Reader with automatic encoding detection
 */
public class AutoDetectReader extends Reader {
    /**
     * Creates an AutoDetectReader for the given input stream
     * @param input Input stream containing text data
     */
    public AutoDetectReader(InputStream input);

    /**
     * Creates an AutoDetectReader with custom encoding detector
     * @param input Input stream containing text data
     * @param detector EncodingDetector to use for encoding detection
     */
    public AutoDetectReader(InputStream input, EncodingDetector detector);

    /**
     * Creates an AutoDetectReader with metadata hints
     * @param input Input stream containing text data
     * @param metadata Metadata containing encoding hints
     */
    public AutoDetectReader(InputStream input, Metadata metadata);

    /**
     * Gets the detected character encoding
     * @return Charset representing the detected encoding
     */
    public Charset getCharset();
}
```

### Neural Network Detection

Advanced detectors using machine learning models for content type identification.

```java { .api }
/**
 * Interface for trained detection models
 */
public interface TrainedModel {
    /**
     * Predicts content type using the trained model
     * @param input Byte array containing document data
     * @return Probability distribution over content types
     */
    float[] predict(byte[] input);

    /**
     * Gets the content types supported by this model
     * @return Array of MediaType objects supported by the model
     */
    MediaType[] getSupportedTypes();
}

/**
 * Neural network-based trained model implementation
 */
public class NNTrainedModel implements TrainedModel {
    /**
     * Creates an NNTrainedModel from model data
     * @param modelData Byte array containing the trained model
     */
    public NNTrainedModel(byte[] modelData);

    /**
     * Loads a model from resources
     * @param modelPath Path to model resource
     * @return NNTrainedModel instance
     */
    public static NNTrainedModel loadFromResource(String modelPath);
}

/**
 * Detector using neural network models
 */
public class NNExampleModelDetector implements Detector {
    /**
     * Creates an NN detector with default model
     */
    public NNExampleModelDetector();

    /**
     * Creates an NN detector with custom model
     * @param model TrainedModel to use for detection
     */
    public NNExampleModelDetector(TrainedModel model);
}
```
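
A model that returns a probability distribution still needs a rule for turning scores into one answer. The standalone sketch below shows one plausible rule: take the highest-scoring type, and fall back to the generic type when the model is not confident enough. `TopPrediction`, the string labels, and the confidence threshold are assumptions for illustration, not behaviour documented for the classes above.

```java
/** Hypothetical helper: converts per-type scores into a single media type string. */
final class TopPrediction {
    static final String FALLBACK = "application/octet-stream";

    /** Picks the label with the highest score, or the fallback when no score reaches minConfidence. */
    static String choose(float[] probabilities, String[] labels, float minConfidence) {
        if (probabilities.length != labels.length) {
            throw new IllegalArgumentException("one score per label is required");
        }
        int best = -1;
        for (int i = 0; i < probabilities.length; i++) {
            if (best < 0 || probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        if (best < 0 || probabilities[best] < minConfidence) {
            return FALLBACK; // nothing to choose from, or the model is unsure
        }
        return labels[best];
    }

    public static void main(String[] args) {
        String[] labels = {"application/pdf", "text/html", "image/png"};
        System.out.println(choose(new float[] {0.05f, 0.90f, 0.05f}, labels, 0.5f)); // text/html
        System.out.println(choose(new float[] {0.40f, 0.35f, 0.25f}, labels, 0.5f)); // application/octet-stream
    }
}
```

A threshold keeps a weak guess from the model from masking the answer of a more reliable detector further down a chain.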

### Specialized Detectors

```java { .api }
/**
 * Dummy detector that returns application/octet-stream for every document
 */
public class EmptyDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector that can override other detectors based on metadata
 */
public class OverrideDetector implements Detector {
    public OverrideDetector(Detector originalDetector);
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector for zero-byte files
 */
public class ZeroSizeFileDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector using system file command (Unix/Linux)
 */
public class FileCommandDetector implements Detector {
    public FileCommandDetector();
    public boolean isAvailable();
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### Text Analysis Utilities

```java { .api }
/**
 * Statistical analysis of text content
 */
public class TextStatistics {
    /**
     * Analyzes text statistics from input stream
     * @param input Input stream containing text data
     * @return TextStatistics object with analysis results
     */
    public static TextStatistics calculate(InputStream input) throws IOException;

    /**
     * Gets the fraction of printable characters
     * @return Fraction (0.0 to 1.0) of printable characters
     */
    public double getPrintableRatio();

    /**
     * Gets the average line length
     * @return Average number of characters per line
     */
    public double getAverageLineLength();

    /**
     * Determines if content appears to be text
     * @return true if content appears to be text
     */
    public boolean isText();
}
```
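
The statistics named above are simple to compute over a byte buffer. The following standalone sketch shows one way to do it; `TextStatsSketch` is a hypothetical class, "printable" is assumed to mean printable ASCII plus tab, LF and CR, and the 0.95 threshold in `isText` is an arbitrary illustrative choice.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical byte-level statistics over a buffer. */
final class TextStatsSketch {
    private final int total;     // bytes examined
    private final int printable; // printable ASCII plus tab, LF, CR
    private final int newlines;  // LF bytes
    private final int lines;     // LF-terminated lines plus a trailing unterminated one

    private TextStatsSketch(int total, int printable, int newlines, int lines) {
        this.total = total;
        this.printable = printable;
        this.newlines = newlines;
        this.lines = lines;
    }

    static TextStatsSketch of(byte[] data) {
        int printable = 0;
        int newlines = 0;
        for (byte raw : data) {
            int b = raw & 0xFF;
            if (b == '\n') {
                newlines++;
            }
            if ((b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r') {
                printable++;
            }
        }
        boolean trailingLine = data.length > 0 && data[data.length - 1] != '\n';
        return new TextStatsSketch(data.length, printable, newlines, newlines + (trailingLine ? 1 : 0));
    }

    double printableRatio() {
        return total == 0 ? 0.0 : (double) printable / total;
    }

    /** Average bytes per line, not counting the LF terminators themselves. */
    double averageLineLength() {
        return lines == 0 ? 0.0 : (double) (total - newlines) / lines;
    }

    boolean isText() {
        return total > 0 && printableRatio() >= 0.95; // assumed threshold
    }

    public static void main(String[] args) {
        TextStatsSketch s = of("ab\ncdef\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(s.printableRatio());    // 1.0
        System.out.println(s.averageLineLength()); // 3.0
        System.out.println(s.isText());            // true
    }
}
```

Because the sketch counts bytes rather than decoded characters, non-ASCII text lowers the printable ratio; a fuller version would decode the buffer first.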

## Detection Strategies

### Layered Detection Approach

The DefaultDetector uses a layered approach combining multiple strategies:

1. **Magic Number Detection**: Analyzes byte patterns at the beginning of the file
2. **Filename Extension**: Uses the file extension as a type hint
3. **Content Analysis**: Examines document structure and patterns
4. **Neural Network Models**: Uses trained models for complex detection
5. **Metadata Hints**: Considers existing content-type information

### Custom Detection Configuration

```java
import java.util.Arrays;
import java.util.List;
import org.apache.tika.detect.*;
import org.apache.tika.mime.MimeTypes;

// Create custom detector chain
List<Detector> detectors = Arrays.asList(
    MimeTypes.getDefaultMimeTypes(), // Magic numbers and filename globs from the default registry
    new TypeDetector(),              // Trust a Content-Type hint in the metadata
    new NNExampleModelDetector(),    // Use ML for ambiguous cases
    new ZeroSizeFileDetector()       // Report zero-byte streams
);

CompositeDetector customDetector = new CompositeDetector(detectors);
```

## Performance Considerations

- **Stream Buffering**: Detectors typically read only the first few KB of the stream
- **Mark/Reset**: Input streams should support mark/reset so detectors can rewind after peeking at the header; wrap plain file streams in a `BufferedInputStream`
- **Caching**: Detection results can be cached, keyed by a hash of the content
- **Resource Management**: Some detectors (like FileCommandDetector) launch external processes, which is slower than in-process detection
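
The caching point can be made concrete with a small standalone sketch. It keys results by a SHA-256 digest of the content and runs the underlying detector only on a miss. `DetectionCache` is a hypothetical class, the detector is modelled as a plain `Function<byte[], String>`, and hashing the whole content is an assumption; hashing only a prefix would be cheaper but risks collisions between files that share a header.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache: remembers the detected type for each distinct content hash. */
final class DetectionCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<byte[], String> detector;

    DetectionCache(Function<byte[], String> detector) {
        this.detector = detector;
    }

    /** Returns the cached type for identical content, running the detector only on a cache miss. */
    String detect(byte[] content) {
        return cache.computeIfAbsent(sha256Hex(content), key -> detector.apply(content));
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        DetectionCache cache = new DetectionCache(bytes -> {
            calls.incrementAndGet(); // count how often the real detector runs
            return "text/plain";
        });
        cache.detect("same content".getBytes());
        cache.detect("same content".getBytes());
        System.out.println(calls.get()); // 1
    }
}
```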