# File Processing

Document converters and preprocessors for handling PDF, DOCX, HTML, images, and other file formats with text extraction and cleaning.

## Core Imports

```python
from haystack.nodes import PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.nodes.file_converter.base import BaseConverter
```

## Base Converter

```python { .api }
from haystack.nodes.file_converter.base import BaseConverter
from haystack.schema import Document
from pathlib import Path
from typing import List, Dict, Any, Optional

class BaseConverter:
    def convert(self, file_path: Path, meta: Optional[Dict[str, Any]] = None,
                encoding: Optional[str] = None, **kwargs) -> List[Document]:
        """
        Convert file to Document objects.

        Args:
            file_path: Path to file to convert
            meta: Additional metadata for documents
            encoding: Text encoding for file reading

        Returns:
            List of Document objects with extracted content
        """
```
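
The `convert` contract above can be illustrated with a small stand-alone sketch. To keep it runnable without Haystack installed, it uses a hypothetical `SimpleDocument` dataclass in place of `haystack.schema.Document` and a hypothetical `PlainTextConverter` class; neither name is part of the library, and the UTF-8 fallback is an assumption of this example.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for haystack.schema.Document (content + meta only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


class PlainTextConverter:
    """Hypothetical converter that mirrors the convert() signature shown above."""

    def convert(self, file_path: Path, meta: Optional[Dict[str, Any]] = None,
                encoding: Optional[str] = None, **kwargs) -> List[SimpleDocument]:
        # Fall back to UTF-8 when no encoding is given (an assumption of this sketch).
        text = Path(file_path).read_text(encoding=encoding or "utf-8")
        # Copy the caller's metadata so the input dict is never mutated.
        doc_meta = dict(meta or {})
        doc_meta.setdefault("name", Path(file_path).name)
        return [SimpleDocument(content=text, meta=doc_meta)]
```

A real converter would subclass `BaseConverter` and return `Document` objects; the shape of the call (a path in, a list of documents with merged metadata out) is the part this sketch is meant to show.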

## PDF Converter

```python { .api }
from haystack.nodes import PDFToTextConverter

class PDFToTextConverter(BaseConverter):
    def __init__(self, remove_numeric_tables: bool = False,
                 valid_languages: Optional[List[str]] = None):
        """
        Initialize PDF to text converter.

        Args:
            remove_numeric_tables: Remove tables with mostly numeric content
            valid_languages: List of valid languages for language detection
        """
```
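
To make the `remove_numeric_tables` option more concrete, the following is a rough sketch of one way "mostly numeric" rows could be filtered out of extracted text. The helper name `drop_numeric_lines` and the 0.4 threshold are assumptions made for illustration, not the converter's actual implementation.

```python
from typing import List


def drop_numeric_lines(text: str, max_numeric_ratio: float = 0.4) -> str:
    """Drop lines in which more than max_numeric_ratio of the words contain a digit.

    Hypothetical helper: the default threshold is an assumption of this sketch.
    """
    kept: List[str] = []
    for line in text.splitlines():
        words = line.split()
        numeric = [w for w in words if any(ch.isdigit() for ch in w)]
        # Keep empty lines and lines that are mostly non-numeric words.
        if not words or len(numeric) / len(words) <= max_numeric_ratio:
            kept.append(line)
    return "\n".join(kept)
```

A line-based heuristic like this is cheap and needs no layout information, but it can also discard ordinary prose that happens to be dense with figures, which is one reason the option is off by default in the signature above.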

## DOCX Converter

```python { .api }
from haystack.nodes import DocxToTextConverter

class DocxToTextConverter(BaseConverter):
    def __init__(self, remove_numeric_tables: bool = False,
                 valid_languages: Optional[List[str]] = None):
        """
        Initialize DOCX to text converter.

        Args:
            remove_numeric_tables: Remove tables with mostly numeric content
            valid_languages: List of valid languages for language detection
        """
```

## PreProcessor

```python { .api }
from haystack.nodes import PreProcessor
from haystack.nodes.base import BaseComponent

class PreProcessor(BaseComponent):
    def __init__(self, clean_empty_lines: bool = True,
                 clean_whitespace: bool = True,
                 clean_header_footer: bool = False,
                 split_by: str = "word",
                 split_length: int = 1000,
                 split_overlap: int = 0,
                 split_respect_sentence_boundary: bool = True,
                 language: str = "en"):
        """
        Initialize document preprocessor.

        Args:
            clean_empty_lines: Remove empty lines
            clean_whitespace: Normalize whitespace
            clean_header_footer: Remove headers/footers
            split_by: Splitting unit ("word", "sentence", "page")
            split_length: Length of splits
            split_overlap: Overlap between splits
            split_respect_sentence_boundary: Keep sentence boundaries
            language: Language for sentence splitting
        """

    def process(self, documents: List[Document]) -> List[Document]:
        """
        Process and clean documents.

        Args:
            documents: List of documents to process

        Returns:
            List of processed Document objects
        """
```
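
How `split_length` and `split_overlap` interact is easier to see in a small stand-alone sketch. The helpers below, `clean_text` and `split_by_word`, are hypothetical names written for this illustration; they approximate the cleaning and word-based splitting described above and deliberately ignore `split_respect_sentence_boundary`.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Strip per-line whitespace and collapse runs of empty lines (rough sketch)."""
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


def split_by_word(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most split_length words, where consecutive
    chunks share split_overlap words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    words = text.split()
    # Each new chunk starts this many words after the previous one.
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks
```

With `split_length=3` and `split_overlap=1`, each chunk begins two words after the previous one, so the last word of one chunk is repeated as the first word of the next. Overlap trades some duplicated text for a lower chance that a relevant passage is cut in half at a chunk boundary.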