
# Content Type Detection

Detection system for identifying document formats and MIME types. It covers several strategies: magic numbers, file extensions, neural network models, and composite detection.

## Capabilities

### Detector Interface

The fundamental interface for content type detection. It defines the contract for identifying document formats from input streams and metadata.

```java { .api }
/**
 * Interface for detecting the media type of documents
 */
public interface Detector {
    /**
     * Detects the media type of the given document
     * @param input Input stream containing document data (may be null)
     * @param metadata Metadata containing hints like filename or content type
     * @return MediaType representing the detected content type
     * @throws IOException If an I/O error occurs during detection
     */
    MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### DefaultDetector

The primary detector implementation. It layers multiple detection strategies to identify content types.

```java { .api }
/**
 * Default composite detector combining multiple detection strategies
 */
public class DefaultDetector extends CompositeDetector {
    /**
     * Creates a DefaultDetector with standard detection strategies
     */
    public DefaultDetector();

    /**
     * Creates a DefaultDetector with custom MIME types registry
     * @param types MimeTypes registry for magic number detection
     */
    public DefaultDetector(MimeTypes types);

    /**
     * Creates a DefaultDetector with custom class loader for service discovery
     * @param loader ClassLoader for discovering detector services
     */
    public DefaultDetector(ClassLoader loader);

    /**
     * Creates a DefaultDetector with custom types and class loader
     * @param types MimeTypes registry for magic number detection
     * @param loader ClassLoader for discovering detector services
     */
    public DefaultDetector(MimeTypes types, ClassLoader loader);
}
```

**Usage Examples:**

```java
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

// Basic content type detection
Detector detector = new DefaultDetector();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, "document.pdf");

// Detectors peek at the stream via mark/reset, which a bare FileInputStream
// does not support, so wrap it in a BufferedInputStream.
try (InputStream stream = new BufferedInputStream(new FileInputStream("document.pdf"))) {
    MediaType mediaType = detector.detect(stream, metadata);
    System.out.println("Detected type: " + mediaType);
}

// Detection from filename only
Metadata filenameMetadata = new Metadata();
filenameMetadata.set(Metadata.RESOURCE_NAME_KEY, "spreadsheet.xlsx");
MediaType typeFromName = detector.detect(null, filenameMetadata);
```

### CompositeDetector

A detector that combines multiple detection strategies, allowing for layered detection approaches with fallback mechanisms.

```java { .api }
/**
 * Detector that combines multiple detection strategies
 */
public class CompositeDetector implements Detector {
    /**
     * Creates a CompositeDetector with the specified detectors
     * @param detectors List of detectors to combine, applied in order
     */
    public CompositeDetector(List<Detector> detectors);

    /**
     * Creates a CompositeDetector with the specified detectors
     * @param detectors Array of detectors to combine, applied in order
     */
    public CompositeDetector(Detector... detectors);

    /**
     * Gets the list of detectors used by this composite
     * @return List of Detector instances in application order
     */
    public List<Detector> getDetectors();
}
```

### TypeDetector

A detector that identifies content types based solely on file extensions and naming patterns, useful for quick filename-based detection.

```java { .api }
/**
 * Detector based on file extensions and naming patterns
 */
public class TypeDetector implements Detector {
    /**
     * Creates a TypeDetector with default MIME types registry
     */
    public TypeDetector();

    /**
     * Creates a TypeDetector with custom MIME types registry
     * @param types MimeTypes registry containing type mappings
     */
    public TypeDetector(MimeTypes types);

    /**
     * Detects media type based on filename extension
     * @param input Input stream (ignored by this detector)
     * @param metadata Metadata containing filename information
     * @return MediaType based on file extension, or OCTET_STREAM if unknown
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### NameDetector

A more sophisticated filename-based detector that uses pattern matching and heuristics for filename analysis.

```java { .api }
/**
 * Detector based on filename patterns and heuristics
 */
public class NameDetector implements Detector {
    /**
     * Creates a NameDetector with default configuration
     */
    public NameDetector();

    /**
     * Detects media type based on filename patterns
     * @param input Input stream (not used by this detector)
     * @param metadata Metadata containing filename or resource name
     * @return MediaType based on filename analysis
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### TextDetector

A detector that identifies text content and attempts to determine specific text formats and encodings.

```java { .api }
/**
 * Detector for identifying text content and formats
 */
public class TextDetector implements Detector {
    /**
     * Creates a TextDetector with default configuration
     */
    public TextDetector();

    /**
     * Detects text content types and formats
     * @param input Input stream containing potential text data
     * @param metadata Metadata with additional hints
     * @return MediaType for detected text format
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### MagicDetector

A detector that uses magic number patterns and byte signatures to identify file formats, providing the most reliable binary-based detection.

```java { .api }
/**
 * Detector using magic numbers and byte signatures
 */
public class MagicDetector implements Detector {
    /**
     * Creates a MagicDetector with default MIME types registry
     */
    public MagicDetector();

    /**
     * Creates a MagicDetector with custom MIME types registry
     * @param types MimeTypes registry containing magic patterns
     */
    public MagicDetector(MimeTypes types);

    /**
     * Detects media type using magic number analysis
     * @param input Input stream to analyze for magic patterns
     * @param metadata Metadata (may provide additional context)
     * @return MediaType based on magic number detection
     */
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```
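
To make the idea of byte signatures concrete, here is a minimal sketch of prefix matching using only the Java standard library. It is not the library's implementation: the class name `MagicSketch`, the `sniff` method, and the three-entry signature table are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative stand-in for magic-number matching; hypothetical, not a Tika class. */
class MagicSketch {
    // Hypothetical signature table: leading bytes -> MIME type name.
    // The map is only iterated, never looked up by key, so byte[] keys are fine here.
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        SIGNATURES.put(new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}, "image/png");
        SIGNATURES.put(new byte[] {'P', 'K', 0x03, 0x04}, "application/zip");
    }

    /** Returns the first matching type, or application/octet-stream when nothing matches. */
    static String sniff(byte[] prefix) {
        for (Map.Entry<byte[], String> entry : SIGNATURES.entrySet()) {
            if (startsWith(prefix, entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    /** True when data begins with every byte of magic, in order. */
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Only the leading bytes are needed, which is why signature-based detection can work on a short prefix of the stream.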

### EncodingDetector Interface

Interface for character encoding detection, used to identify text encoding in documents and streams.

```java { .api }
/**
 * Interface for detecting character encodings
 */
public interface EncodingDetector {
    /**
     * Detects the character encoding of the given text stream
     * @param input Input stream containing text data
     * @param metadata Metadata with encoding hints
     * @return Charset representing the detected encoding, or null if unknown
     * @throws IOException If an I/O error occurs during detection
     */
    Charset detect(InputStream input, Metadata metadata) throws IOException;
}
```
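
The contract above (return a `Charset`, or null when unknown, without consuming the stream) can be illustrated with a byte-order-mark check. This is a sketch under stated assumptions, using only the standard library: `BomSketch` is a hypothetical name, it takes no metadata argument, and real encoding detection usually needs far more than a BOM check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustrative BOM sniffer; hypothetical, not a Tika class. */
class BomSketch {
    /** Returns the charset implied by a byte-order mark, or null if none is recognized. */
    static Charset detect(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return null; // cannot peek without consuming the stream
        }
        byte[] head = new byte[3];
        input.mark(head.length);
        int n;
        try {
            n = input.readNBytes(head, 0, head.length);
        } finally {
            input.reset(); // leave the stream where the caller had it
        }
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // Limitation of this sketch: a UTF-32LE BOM (FF FE 00 00) also matches here.
        if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }
}
```

Returning null rather than a default keeps "unknown" distinguishable from a positive result, so a caller can fall through to another strategy.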

### DefaultEncodingDetector

Default implementation of character encoding detection using multiple detection strategies.

```java { .api }
/**
 * Default character encoding detector
 */
public class DefaultEncodingDetector implements EncodingDetector {
    /**
     * Creates a DefaultEncodingDetector with standard detection algorithms
     */
    public DefaultEncodingDetector();

    /**
     * Detects character encoding using multiple strategies
     * @param input Input stream containing text data
     * @param metadata Metadata containing encoding hints
     * @return Charset representing detected encoding
     */
    public Charset detect(InputStream input, Metadata metadata) throws IOException;
}
```

### AutoDetectReader

A Reader implementation that automatically detects character encoding and provides transparent text access with proper encoding handling.

```java { .api }
/**
 * Reader with automatic encoding detection
 */
public class AutoDetectReader extends Reader {
    /**
     * Creates an AutoDetectReader for the given input stream
     * @param input Input stream containing text data
     */
    public AutoDetectReader(InputStream input);

    /**
     * Creates an AutoDetectReader with custom encoding detector
     * @param input Input stream containing text data
     * @param detector EncodingDetector to use for encoding detection
     */
    public AutoDetectReader(InputStream input, EncodingDetector detector);

    /**
     * Creates an AutoDetectReader with metadata hints
     * @param input Input stream containing text data
     * @param metadata Metadata containing encoding hints
     */
    public AutoDetectReader(InputStream input, Metadata metadata);

    /**
     * Gets the detected character encoding
     * @return Charset representing the detected encoding
     */
    public Charset getCharset();
}
```

### Neural Network Detection

Advanced detectors using machine learning models for content type identification.

```java { .api }
/**
 * Interface for trained detection models
 */
public interface TrainedModel {
    /**
     * Predicts content type using the trained model
     * @param input Byte array containing document data
     * @return Probability distribution over content types
     */
    float[] predict(byte[] input);

    /**
     * Gets the content types supported by this model
     * @return Array of MediaType objects supported by the model
     */
    MediaType[] getSupportedTypes();
}

/**
 * Neural network-based trained model implementation
 */
public class NNTrainedModel implements TrainedModel {
    /**
     * Creates an NNTrainedModel from model data
     * @param modelData Byte array containing the trained model
     */
    public NNTrainedModel(byte[] modelData);

    /**
     * Loads a model from resources
     * @param modelPath Path to model resource
     * @return NNTrainedModel instance
     */
    public static NNTrainedModel loadFromResource(String modelPath);
}

/**
 * Detector using neural network models
 */
public class NNExampleModelDetector implements Detector {
    /**
     * Creates an NN detector with default model
     */
    public NNExampleModelDetector();

    /**
     * Creates an NN detector with custom model
     * @param model TrainedModel to use for detection
     */
    public NNExampleModelDetector(TrainedModel model);
}
```
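
A probability distribution like the one `predict` is documented to return still has to be turned into a single answer. The sketch below shows one plausible way to do that with an argmax and a confidence threshold. It assumes the probabilities are index-aligned with the supported types, which the listing above does not state; `PredictionSketch`, `mostLikely`, and the threshold parameter are hypothetical, and plain strings stand in for `MediaType`.

```java
/** Illustrative helper for reading a model's output; hypothetical, not a Tika class. */
class PredictionSketch {
    /**
     * Returns the label with the highest probability, or the fallback
     * when the best probability is below the threshold.
     */
    static String mostLikely(float[] probabilities, String[] labels, float threshold, String fallback) {
        if (probabilities.length != labels.length) {
            throw new IllegalArgumentException("probabilities and labels must have the same length");
        }
        int best = -1;
        for (int i = 0; i < probabilities.length; i++) {
            if (best < 0 || probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        // An empty array leaves best at -1, which also yields the fallback.
        return (best >= 0 && probabilities[best] >= threshold) ? labels[best] : fallback;
    }
}
```

The threshold lets a low-confidence prediction defer to a generic type instead of asserting a wrong specific one.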

### Specialized Detectors

```java { .api }
/**
 * Detector for empty files
 */
public class EmptyDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector that can override other detectors based on metadata
 */
public class OverrideDetector implements Detector {
    public OverrideDetector(Detector originalDetector);
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector for zero-byte files
 */
public class ZeroSizeFileDetector implements Detector {
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}

/**
 * Detector using system file command (Unix/Linux)
 */
public class FileCommandDetector implements Detector {
    public FileCommandDetector();
    public boolean isAvailable();
    public MediaType detect(InputStream input, Metadata metadata) throws IOException;
}
```

### Text Analysis Utilities

```java { .api }
/**
 * Statistical analysis of text content
 */
public class TextStatistics {
    /**
     * Analyzes text statistics from input stream
     * @param input Input stream containing text data
     * @return TextStatistics object with analysis results
     */
    public static TextStatistics calculate(InputStream input) throws IOException;

    /**
     * Gets the fraction of printable characters
     * @return Fraction (0.0 to 1.0) of printable characters
     */
    public double getPrintableRatio();

    /**
     * Gets the average line length
     * @return Average number of characters per line
     */
    public double getAverageLineLength();

    /**
     * Determines if content appears to be text
     * @return true if content appears to be text
     */
    public boolean isText();
}
```
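
As a rough illustration of what a printable-character ratio measures, the following standard-library sketch counts printable ASCII bytes in a buffer. It is not the library's algorithm: `TextRatioSketch`, the ASCII-only definition of "printable", and the 0.9 cutoff are all assumptions made for the example.

```java
/** Illustrative printable-ratio heuristic; hypothetical, not a Tika class. */
class TextRatioSketch {
    // Assumed cutoff for "looks like text"; chosen for illustration only.
    private static final double TEXT_THRESHOLD = 0.9;

    /** Fraction (0.0 to 1.0) of bytes that are printable ASCII, tab, CR, or LF. */
    static double printableRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int printable = 0;
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            boolean whitespace = c == '\t' || c == '\n' || c == '\r';
            if (whitespace || (c >= 0x20 && c < 0x7F)) {
                printable++;
            }
        }
        return (double) printable / data.length;
    }

    /** True when the printable ratio reaches the assumed threshold. */
    static boolean looksLikeText(byte[] data) {
        return printableRatio(data) >= TEXT_THRESHOLD;
    }
}
```

Because this sketch only recognizes ASCII, multi-byte UTF-8 text would score lower than it should; a production heuristic would need to account for that.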

## Detection Strategies

### Layered Detection Approach

The DefaultDetector uses a layered approach combining multiple strategies:

1. **Magic Number Detection**: Analyzes byte patterns at file beginning
2. **Filename Extension**: Uses file extension for type hints
3. **Content Analysis**: Examines document structure and patterns
4. **Neural Network Models**: Uses trained models for complex detection
5. **Metadata Hints**: Considers existing content-type information
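
The layering listed above can be sketched as a chain of fallbacks. This is a simplified illustration using only the standard library, with two hypothetical layers and a first-match rule; the library itself may resolve disagreements between layers differently. `LayeredSketch` and its methods are assumed names, and strings stand in for `MediaType`.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;
import java.util.function.BiFunction;

/** Illustrative first-match fallback chain; hypothetical, not a Tika class. */
class LayeredSketch {
    static final String UNKNOWN = "application/octet-stream";

    /** Layer 1 (assumed): match a leading byte signature. */
    static String byMagic(byte[] prefix, String name) {
        byte[] pdf = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (prefix.length < pdf.length) {
            return UNKNOWN;
        }
        for (int i = 0; i < pdf.length; i++) {
            if (prefix[i] != pdf[i]) {
                return UNKNOWN;
            }
        }
        return "application/pdf";
    }

    /** Layer 2 (assumed): fall back to the file extension. */
    static String byExtension(byte[] prefix, String name) {
        if (name == null) {
            return UNKNOWN;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return "application/pdf";
        }
        if (lower.endsWith(".csv")) {
            return "text/csv";
        }
        return UNKNOWN;
    }

    private static final List<BiFunction<byte[], String, String>> LAYERS =
            List.of(LayeredSketch::byMagic, LayeredSketch::byExtension);

    /** Runs each layer in order and returns the first answer that is not UNKNOWN. */
    static String detect(byte[] prefix, String name) {
        for (BiFunction<byte[], String, String> layer : LAYERS) {
            String type = layer.apply(prefix, name);
            if (!UNKNOWN.equals(type)) {
                return type;
            }
        }
        return UNKNOWN;
    }
}
```

Putting the content-based layer first means a misleading filename cannot override what the bytes say.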

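### Custom Detection Configuration

```java
import java.util.Arrays;
import java.util.List;
import org.apache.tika.detect.CompositeDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.detect.EmptyDetector;
import org.apache.tika.detect.MagicDetector;
import org.apache.tika.detect.NNExampleModelDetector;
import org.apache.tika.detect.TypeDetector;

// Create custom detector chain
List<Detector> detectors = Arrays.asList(
    new MagicDetector(),           // Prioritize magic numbers
    new TypeDetector(),            // Fall back to filename
    new NNExampleModelDetector(),  // Use ML for ambiguous cases
    new EmptyDetector()            // Handle empty files
);

CompositeDetector customDetector = new CompositeDetector(detectors);
```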

## Performance Considerations

- **Stream Buffering**: Detectors typically read only the first few KB
- **Mark/Reset**: Input streams should support mark/reset so detectors can peek at the leading bytes and rewind; wrap streams that do not (for example, a bare `FileInputStream`) in a `BufferedInputStream`
- **Caching**: Detection results can be cached based on content hashes
- **Resource Management**: Some detectors (like FileCommandDetector) use external processes
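
The caching point can be sketched with a map keyed by a content digest. This is an illustration using only the standard library, not a feature of the library being documented: `DetectionCacheSketch` is a hypothetical wrapper, the wrapped detector is modeled as a plain function from bytes to a type name, and SHA-256 is an assumed choice of hash.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Illustrative content-hash cache around a detection function; hypothetical. */
class DetectionCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<byte[], String> detector;

    DetectionCacheSketch(Function<byte[], String> detector) {
        this.detector = detector;
    }

    /** Returns the cached type for identical content, running the detector only on a miss. */
    String detect(byte[] content) {
        return cache.computeIfAbsent(sha256(content), key -> detector.apply(content));
    }

    /** Base64-encoded SHA-256 digest used as the cache key. */
    static String sha256(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(digest.digest(content));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Hashing requires reading all the bytes that are hashed, so this only pays off when detection is more expensive than the digest or when the same content recurs often.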