Tessl Tile for pypi/farm-haystack@1.26.0

# File Processing

Document converters and preprocessors that extract text from PDF, DOCX, HTML, image, and other file formats, then clean and split it.

## Core Imports

```python
from haystack.nodes import PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.nodes.file_converter.base import BaseConverter
```

## Base Converter

```python { .api }
from haystack.nodes.file_converter.base import BaseConverter
from haystack.schema import Document
from pathlib import Path
from typing import List, Dict, Any, Optional

class BaseConverter:
    def convert(self, file_path: Path, meta: Optional[Dict[str, Any]] = None,
                encoding: Optional[str] = None, **kwargs) -> List[Document]:
        """
        Convert file to Document objects.

        Args:
            file_path: Path to file to convert
            meta: Additional metadata for documents
            encoding: Text encoding for file reading

        Returns:
            List of Document objects with extracted content
        """
```
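
Because every converter exposes the same `convert(file_path, meta, ...)` method, calling code can pick a converter per file type and handle the results uniformly. The following is a minimal sketch of that idea, not part of Haystack itself: `CONVERTER_BY_SUFFIX`, `converter_name_for`, and `convert_file` are hypothetical helpers introduced here, and the Haystack import is deferred so the module loads even where `farm-haystack` is not installed.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional

# Hypothetical mapping for this sketch: file suffix -> name of a converter
# class exported by haystack.nodes (both names appear under Core Imports).
CONVERTER_BY_SUFFIX: Dict[str, str] = {
    ".pdf": "PDFToTextConverter",
    ".docx": "DocxToTextConverter",
}


def converter_name_for(file_path: Path) -> str:
    """Return the converter class name registered for the file's suffix."""
    suffix = file_path.suffix.lower()
    if suffix not in CONVERTER_BY_SUFFIX:
        raise ValueError(f"No converter registered for '{suffix}' files")
    return CONVERTER_BY_SUFFIX[suffix]


def convert_file(file_path: Path, meta: Optional[Dict[str, Any]] = None) -> List[Any]:
    """Convert one file with the matching converter (requires farm-haystack)."""
    # Imported lazily so the helpers above work without farm-haystack installed.
    import haystack.nodes as nodes

    converter_cls = getattr(nodes, converter_name_for(file_path))
    converter = converter_cls()
    return converter.convert(file_path=file_path, meta=meta)
```

With Haystack installed, `convert_file(Path("report.pdf"), meta={"source": "upload"})` would return the list of `Document` objects produced by the PDF converter; the file name and metadata key are placeholders.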

## PDF Converter

```python { .api }
from haystack.nodes import PDFToTextConverter

class PDFToTextConverter(BaseConverter):
    def __init__(self, remove_numeric_tables: bool = False,
                 valid_languages: Optional[List[str]] = None):
        """
        Initialize PDF to text converter.

        Args:
            remove_numeric_tables: Remove tables with mostly numeric content
            valid_languages: List of valid languages for language detection
        """
```
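
The `remove_numeric_tables` flag is described only as removing tables with mostly numeric content. As a rough illustration of what such a filter might look like, the sketch below drops lines in which a large share of the words contain digits. It is an assumption-laden stand-in, not a statement of how Haystack implements the flag: `drop_numeric_lines`, the 0.4 threshold, and the rule that lines ending in a period are kept are all choices made for this example.

```python
from typing import List


def drop_numeric_lines(text: str, threshold: float = 0.4) -> str:
    """Drop lines where more than `threshold` of the words contain a digit.

    Lines that end with a period are kept, on the assumption that they are
    sentences rather than table rows. Both rules are illustrative only.
    """
    kept: List[str] = []
    for line in text.splitlines():
        words = line.split()
        numeric = [word for word in words if any(ch.isdigit() for ch in word)]
        looks_like_table_row = (
            bool(words)
            and len(numeric) / len(words) > threshold
            and not line.strip().endswith(".")
        )
        if not looks_like_table_row:
            kept.append(line)
    return "\n".join(kept)
```

A filter of this kind trades recall for cleanliness: it removes rows that would add noise to retrieval, at the risk of dropping short numeric lines that carry meaning, which is one reason the flag defaults to `False`.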

## DOCX Converter

```python { .api }
from haystack.nodes import DocxToTextConverter

class DocxToTextConverter(BaseConverter):
    def __init__(self, remove_numeric_tables: bool = False,
                 valid_languages: Optional[List[str]] = None):
        """
        Initialize DOCX to text converter.

        Args:
            remove_numeric_tables: Remove tables with mostly numeric content
            valid_languages: List of valid languages for language detection
        """
```
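
Converting a folder of Word files usually means constructing the converter once and calling `convert` for each file. The sketch below assumes that pattern. `find_docx_files` and `convert_folder` are hypothetical helpers, the `name` metadata key is an illustrative choice, and the Haystack import is deferred so the file-listing helper runs without `farm-haystack` installed.

```python
from pathlib import Path
from typing import Any, List


def find_docx_files(folder: Path) -> List[Path]:
    """Return .docx files directly inside `folder`, sorted by path.

    Files starting with "~$" are skipped; Word creates these as lock files
    while a document is open, and they are not valid documents.
    """
    return sorted(
        path for path in folder.glob("*.docx") if not path.name.startswith("~$")
    )


def convert_folder(folder: Path) -> List[Any]:
    """Convert every .docx file in `folder` (requires farm-haystack)."""
    # Imported lazily so find_docx_files works without farm-haystack installed.
    from haystack.nodes import DocxToTextConverter

    converter = DocxToTextConverter(remove_numeric_tables=False, valid_languages=None)
    documents: List[Any] = []
    for path in find_docx_files(folder):
        # Attach the file name so each Document can be traced back to its source.
        documents.extend(converter.convert(file_path=path, meta={"name": path.name}))
    return documents
```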

## PreProcessor

```python { .api }
from haystack.nodes import PreProcessor
from haystack.nodes.base import BaseComponent

class PreProcessor(BaseComponent):
    def __init__(self, clean_empty_lines: bool = True,
                 clean_whitespace: bool = True,
                 clean_header_footer: bool = False,
                 split_by: str = "word",
                 split_length: int = 1000,
                 split_overlap: int = 0,
                 split_respect_sentence_boundary: bool = True,
                 language: str = "en"):
        """
        Initialize document preprocessor.

        Args:
            clean_empty_lines: Remove empty lines
            clean_whitespace: Normalize whitespace
            clean_header_footer: Remove headers/footers
            split_by: Splitting unit ("word", "sentence", "page")
            split_length: Length of splits
            split_overlap: Overlap between splits
            split_respect_sentence_boundary: Keep sentence boundaries
            language: Language for sentence splitting
        """

    def process(self, documents: List[Document]) -> List[Document]:
        """
        Process and clean documents.

        Args:
            documents: List of documents to process

        Returns:
            List of processed Document objects
        """
```
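
To make the interaction between `split_length` and `split_overlap` concrete, here is a small pure-Python sketch of word-based splitting with a sliding window, followed by a helper that runs the real `PreProcessor` with explicit settings. `split_words` is a simplified illustration written for this page, not Haystack's implementation; in particular it ignores sentence boundaries, which `PreProcessor` respects when `split_respect_sentence_boundary=True`. `clean_and_split` and its parameter values are likewise assumptions chosen for the example.

```python
from typing import Any, List


def split_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split `text` into chunks of `split_length` words.

    Consecutive chunks share `split_overlap` words, so the window advances by
    `split_length - split_overlap` words each step.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once a chunk has reached the final word, to avoid a trailing
        # chunk that only repeats the overlap.
        if start + split_length >= len(words):
            break
    return chunks


def clean_and_split(documents: List[Any]) -> List[Any]:
    """Run Haystack's PreProcessor on `documents` (requires farm-haystack)."""
    # Imported lazily so split_words works without farm-haystack installed.
    from haystack.nodes import PreProcessor

    preprocessor = PreProcessor(
        clean_empty_lines=True,
        clean_whitespace=True,
        clean_header_footer=False,
        split_by="word",
        split_length=100,   # illustrative value, set explicitly
        split_overlap=10,   # illustrative value
        split_respect_sentence_boundary=True,
    )
    return preprocessor.process(documents)
```

For instance, `split_words("a b c d e f g", split_length=3, split_overlap=1)` yields `["a b c", "c d e", "e f g"]`: each chunk starts two words after the previous one and repeats its last word. A non-zero overlap helps keep context that would otherwise be cut at a chunk boundary, at the cost of some duplicated text in the index.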
