# File Processing

Document converters and preprocessors for handling PDF, DOCX, HTML, images, and other file formats with text extraction and cleaning.

## Core Imports

```python
from haystack.nodes import PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.nodes.file_converter.base import BaseConverter
```

## Base Converter

```python { .api }
from haystack.nodes.file_converter.base import BaseConverter
from haystack.schema import Document
from pathlib import Path
from typing import List, Dict, Any, Optional

class BaseConverter:
    def convert(self, file_path: Path, meta: Optional[Dict[str, Any]] = None,
                encoding: Optional[str] = None, **kwargs) -> List[Document]:
        """
        Convert file to Document objects.

        Args:
            file_path: Path to file to convert
            meta: Additional metadata for documents
            encoding: Text encoding for file reading

        Returns:
            List of Document objects with extracted content
        """
```
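
To make the `convert` contract concrete, here is a minimal sketch of a plain-text converter that runs without Haystack installed. `SimpleDocument` and `convert_text_file` are hypothetical stand-ins (not part of Haystack) for `Document` and a `convert` implementation, and the UTF-8 fallback when `encoding` is `None` is an assumption rather than documented behavior.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for haystack.schema.Document."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def convert_text_file(file_path: Path, meta: Optional[Dict[str, Any]] = None,
                      encoding: Optional[str] = None) -> List[SimpleDocument]:
    """Read one text file and wrap its content in a single document."""
    # Assumption: fall back to UTF-8 when no encoding is given.
    text = Path(file_path).read_text(encoding=encoding or "utf-8")
    # Copy the caller's metadata so the input dict is never mutated.
    doc_meta = dict(meta or {})
    return [SimpleDocument(content=text, meta=doc_meta)]


# Usage: write a temporary file, then convert it.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "note.txt"
    sample_path.write_text("Hello converters.", encoding="utf-8")
    docs = convert_text_file(sample_path, meta={"source": "demo"})
```

Returning a list even for a single file mirrors the `List[Document]` return type above, so callers can treat one-document and multi-document converters the same way.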

## PDF Converter

```python { .api }
from haystack.nodes import PDFToTextConverter
from typing import List, Optional

class PDFToTextConverter(BaseConverter):
    def __init__(self, remove_numeric_tables: bool = False,
                 valid_languages: Optional[List[str]] = None):
        """
        Initialize PDF to text converter.

        Args:
            remove_numeric_tables: Remove tables with mostly numeric content
            valid_languages: List of valid languages for language detection
        """
```
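
The `remove_numeric_tables` option can be pictured as a per-line filter. The sketch below is one plausible heuristic, not necessarily the library's exact implementation: the helper names `is_mostly_numeric` and `drop_numeric_rows`, the 0.4 threshold, and the rule that lines ending in a period are kept as prose are all assumptions made for illustration.

```python
from typing import List


def is_mostly_numeric(line: str, threshold: float = 0.4) -> bool:
    """Return True if more than `threshold` of the words contain a digit."""
    words = line.split()
    if not words:
        return False
    numeric_words = [w for w in words if any(ch.isdigit() for ch in w)]
    return len(numeric_words) / len(words) > threshold


def drop_numeric_rows(text: str, threshold: float = 0.4) -> str:
    """Remove lines that look like rows of a numeric table."""
    kept: List[str] = []
    for line in text.splitlines():
        # Lines ending in a period are treated as prose and always kept.
        if is_mostly_numeric(line, threshold) and not line.strip().endswith("."):
            continue
        kept.append(line)
    return "\n".join(kept)


page = (
    "Quarterly revenue by region\n"
    "2021 10.5 11.2 9.8\n"
    "2022 12.1 13.4 10.0\n"
    "Revenue grew in 2022."
)
cleaned = drop_numeric_rows(page)
```

In this example the two table rows are dropped, while the heading and the closing sentence survive.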

## DOCX Converter

```python { .api }
from haystack.nodes import DocxToTextConverter
from typing import List, Optional

class DocxToTextConverter(BaseConverter):
    def __init__(self, remove_numeric_tables: bool = False,
                 valid_languages: Optional[List[str]] = None):
        """
        Initialize DOCX to text converter.

        Args:
            remove_numeric_tables: Remove tables with mostly numeric content
            valid_languages: List of valid languages for language detection
        """
```

## PreProcessor

```python { .api }
from haystack.nodes import PreProcessor
from haystack.nodes.base import BaseComponent
from haystack.schema import Document
from typing import List

class PreProcessor(BaseComponent):
    def __init__(self, clean_empty_lines: bool = True,
                 clean_whitespace: bool = True,
                 clean_header_footer: bool = False,
                 split_by: str = "word",
                 split_length: int = 1000,
                 split_overlap: int = 0,
                 split_respect_sentence_boundary: bool = True,
                 language: str = "en"):
        """
        Initialize document preprocessor.

        Args:
            clean_empty_lines: Remove empty lines
            clean_whitespace: Normalize whitespace
            clean_header_footer: Remove headers/footers
            split_by: Splitting unit ("word", "sentence", "page")
            split_length: Length of splits
            split_overlap: Overlap between splits
            split_respect_sentence_boundary: Keep sentence boundaries
            language: Language for sentence splitting
        """

    def process(self, documents: List[Document]) -> List[Document]:
        """
        Process and clean documents.

        Args:
            documents: List of documents to process

        Returns:
            List of processed Document objects
        """
```
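
To illustrate how the cleaning flags and the `split_length` / `split_overlap` parameters interact, here is a small pure-Python sketch. It is not the library's implementation: `clean_text` and `split_by_word` are hypothetical helpers, the exact cleaning rules are assumptions, and the sketch deliberately ignores `split_respect_sentence_boundary` and `clean_header_footer`.

```python
import re
from typing import List


def clean_text(text: str, clean_empty_lines: bool = True,
               clean_whitespace: bool = True) -> str:
    """Apply simplified versions of the two default cleaning steps."""
    if clean_whitespace:
        # Strip leading and trailing whitespace from every line.
        text = "\n".join(line.strip() for line in text.split("\n"))
    if clean_empty_lines:
        # Collapse runs of blank lines into a single blank line.
        text = re.sub(r"\n\n+", "\n\n", text)
    return text


def split_by_word(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of `split_length` words sharing `split_overlap` words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    words = text.split()
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk already reaches the last word.
        if start + split_length >= len(words):
            break
    return chunks


cleaned = clean_text("  First line.  \n\n\n\n  Second line here.  ")
chunks = split_by_word("a b c d e f g", split_length=4, split_overlap=1)
```

With `split_length=4` and `split_overlap=1`, each chunk starts three words after the previous one, so the last word of one chunk reappears as the first word of the next.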