Tessl Tile for pypi/langchain-text-splitters@0.3.0

# Document Structure-Aware Splitting

Document structure-aware splitting provides specialized text segmentation that understands and preserves the structural elements of various document formats. These splitters maintain semantic context by respecting document hierarchy, headers, and formatting while creating appropriately sized chunks.

## Capabilities

### HTML Document Splitting

Specialized splitters for HTML content that preserve document structure and semantic elements.

#### HTML Header Text Splitter

Splits HTML content based on header tags while preserving document hierarchy and metadata.

```python { .api }
class HTMLHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_element: bool = False
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def split_text_from_url(
        self,
        url: str,
        timeout: int = 10,
        **kwargs: Any
    ) -> list[Document]: ...

    def split_text_from_file(self, file: Any) -> list[Document]: ...
```

**Parameters:**
- `headers_to_split_on`: List of tuples `(header_tag, header_name)` defining split points
- `return_each_element`: Whether to return each element separately (default: `False`)

**Usage:**

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# Split HTML text
html_content = """
<h1>Chapter 1</h1>
<p>Content of chapter 1...</p>
<h2>Section 1.1</h2>
<p>Content of section 1.1...</p>
"""
documents = html_splitter.split_text(html_content)

# Split HTML from URL
url_docs = html_splitter.split_text_from_url("https://example.com", timeout=30)

# Split HTML from file
with open("document.html", "r") as f:
    file_docs = html_splitter.split_text_from_file(f)
```
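
To make the idea of header-based splitting with hierarchical metadata more concrete, here is a minimal standard-library sketch. It is not the library's implementation: the `HeaderSectionParser` class and `split_by_headers` function are hypothetical names, and the sketch assumes the tracked tags are `h1` to `h6` so that the digit in the tag name can serve as the nesting level.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Group text under the headers currently in scope (illustrative only)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}         # tag -> header text currently in scope
        self.sections = []       # (metadata, text) pairs
        self._open_header = None  # tracked header tag we are inside, if any
        self._buffer = []

    def flush_body(self):
        """Emit the buffered body text with a snapshot of the active headers."""
        text = " ".join(self._buffer)
        if text:
            metadata = {self.header_names[t]: v for t, v in self.active.items()}
            self.sections.append((metadata, text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.flush_body()
            self._open_header = tag

    def handle_endtag(self, tag):
        if tag == self._open_header:
            title = " ".join(self._buffer)
            level = int(tag[1:])  # assumes h1..h6 tag names
            # A new header closes every header at the same or a deeper level.
            self.active = {
                t: v for t, v in self.active.items() if int(t[1:]) < level
            }
            self.active[tag] = title
            self._buffer = []
            self._open_header = None

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self._buffer.append(stripped)


def split_by_headers(html, headers_to_split_on):
    parser = HeaderSectionParser(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush_body()  # emit whatever follows the last header
    return parser.sections


sample = "<h1>Chapter 1</h1><p>Intro text.</p><h2>Section 1.1</h2><p>Details.</p>"
for metadata, text in split_by_headers(sample, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(metadata, text)
```

Each section carries the headers above it, which is what lets a downstream retriever know that a paragraph belongs to "Section 1.1" inside "Chapter 1".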

#### HTML Section Splitter

Advanced HTML splitting based on tags and font sizes, requiring lxml for enhanced processing.

```python { .api }
class HTMLSectionSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        **kwargs: Any
    ) -> None: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

    def split_text(self, text: str) -> list[Document]: ...

    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...

    def split_html_by_headers(self, html_doc: str) -> list[dict[str, Optional[str]]]: ...

    def convert_possible_tags_to_header(self, html_content: str) -> str: ...

    def split_text_from_file(self, file: Any) -> list[Document]: ...
```

#### HTML Semantic Preserving Splitter

Beta-stage advanced HTML splitter that preserves semantic structure with media handling capabilities.

```python { .api }
class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        *,
        max_chunk_size: int = 1000,
        chunk_overlap: int = 0,
        separators: Optional[list[str]] = None,
        elements_to_preserve: Optional[list[str]] = None,
        preserve_links: bool = False,
        preserve_images: bool = False,
        preserve_videos: bool = False,
        preserve_audio: bool = False,
        custom_handlers: Optional[dict[str, Callable[[Any], str]]] = None,
        stopword_removal: bool = False,
        stopword_lang: str = "english",
        normalize_text: bool = False,
        external_metadata: Optional[dict[str, str]] = None,
        allowlist_tags: Optional[list[str]] = None,
        denylist_tags: Optional[list[str]] = None,
        preserve_parent_metadata: bool = False,
        keep_separator: Union[bool, Literal["start", "end"]] = True
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> list[Document]: ...
```

**Parameters:**
- `max_chunk_size`: Maximum size of each chunk (default: `1000`)
- `chunk_overlap`: Number of characters to overlap between chunks (default: `0`)
- `separators`: Delimiters used by RecursiveCharacterTextSplitter for further splitting
- `elements_to_preserve`: HTML tags to remain intact during splitting
- `preserve_links`: Whether to convert `<a>` tags to Markdown links (default: `False`)
- `preserve_images`: Whether to convert `<img>` tags to Markdown images (default: `False`)
- `preserve_videos`: Whether to convert `<video>` tags to Markdown video links (default: `False`)
- `preserve_audio`: Whether to convert `<audio>` tags to Markdown audio links (default: `False`)
- `custom_handlers`: Custom element handlers for specific tags
- `stopword_removal`: Whether to remove stopwords from text (default: `False`)
- `stopword_lang`: Language for stopword removal (default: `"english"`)
- `normalize_text`: Whether to normalize text during processing (default: `False`)
- `external_metadata`: Additional metadata to include in all documents
- `allowlist_tags`: HTML tags to specifically include in processing
- `denylist_tags`: HTML tags to exclude from processing
- `preserve_parent_metadata`: Whether to preserve metadata from parent elements (default: `False`)
- `keep_separator`: Whether to keep separators and where to place them (default: `True`)
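
As a rough illustration of what converting `<a>` tags to Markdown links (the behavior described for `preserve_links`) can look like, the following standard-library sketch rewrites anchors as `[text](href)` while passing other text through. The `LinkToMarkdown` class and `links_to_markdown` function are hypothetical names used only here; the library's own conversion may differ in details such as whitespace handling.

```python
from html.parser import HTMLParser


class LinkToMarkdown(HTMLParser):
    """Replace <a href="..."> elements with Markdown links (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.parts = []        # output fragments, in document order
        self._href = None      # href of the anchor we are inside, if any
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._link_text).strip()
            self.parts.append(f"[{label}]({self._href})")
            self._href = None

    def handle_data(self, data):
        # Text inside an anchor becomes the link label; other text is copied.
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.parts.append(data)


def links_to_markdown(html):
    parser = LinkToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(links_to_markdown('<p>See the <a href="https://example.com">docs</a> for details.</p>'))
```

Keeping the link target inline means the URL survives chunking even though the surrounding markup is discarded.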

### Markdown Document Splitting

Specialized splitters for Markdown content that understand heading hierarchy and structure.

#### Markdown Text Splitter

Basic Markdown splitting that extends recursive character splitting with Markdown-specific separators.

```python { .api }
class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```
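
The general idea of recursive splitting with Markdown-oriented separators can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's algorithm: the separator list below is an assumption chosen for the example, `recursive_split` is a hypothetical helper, and chunk overlap is not modeled.

```python
# Illustrative separators, coarsest first. This list is an assumption for the
# sketch and is not taken from the library.
MARKDOWN_SEPARATORS = ["\n## ", "\n\n", "\n", " "]


def recursive_split(text, separators, chunk_size):
    """Split text so that no chunk exceeds chunk_size characters."""
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks = []
    for index, piece in enumerate(text.split(separator)):
        # Re-attach the separator to the piece that followed it, so no
        # characters are lost.
        if index > 0:
            piece = separator + piece
        chunks.extend(recursive_split(piece, finer, chunk_size))

    # Greedily merge neighbouring chunks back together while they still fit.
    merged = []
    for chunk in chunks:
        if merged and len(merged[-1]) + len(chunk) <= chunk_size:
            merged[-1] += chunk
        else:
            merged.append(chunk)
    return merged


doc = "# Title\n\nIntro paragraph with several words.\n\n## Section A\n\nBody of section A."
for chunk in recursive_split(doc, MARKDOWN_SEPARATORS, 40):
    print(repr(chunk))
```

Trying coarse separators first keeps headings and paragraphs together where possible, and finer separators are only used for pieces that are still too large.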

#### Markdown Header Text Splitter

Splits Markdown content based on header levels while preserving document structure.

```python { .api }
class MarkdownHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,
        strip_headers: bool = True,
        custom_header_patterns: Optional[dict[int, str]] = None
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]: ...
```

**Parameters:**
- `headers_to_split_on`: List of tuples `(header_marker, header_name)`, for example `("#", "Header 1")`
- `return_each_line`: Whether to return each line as a separate document (default: `False`)
- `strip_headers`: Whether to remove header lines from the chunk content (default: `True`)
- `custom_header_patterns`: Custom regex patterns for header detection (default: `None`)

**Usage:**

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

markdown_content = """
# Chapter 1
Content of chapter 1...

## Section 1.1
Content of section 1.1...

### Subsection 1.1.1
Content of subsection...
"""

documents = markdown_splitter.split_text(markdown_content)
```
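
To show how header levels can turn into per-chunk metadata, here is a small pure-Python sketch of the bookkeeping involved. It is only an approximation under stated assumptions: `split_markdown_by_headers` is a hypothetical function, headers are always dropped from the content, and fenced code blocks that contain `#` lines are not handled.

```python
def split_markdown_by_headers(text, headers_to_split_on):
    """Return dicts with "metadata" and "content" keys (illustrative only)."""
    active = {}    # header level -> (metadata name, header title)
    sections = []
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {name: title for name, title in active.values()}
            sections.append({"metadata": metadata, "content": content})
        body.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for marker, name in headers_to_split_on:
            # Requiring a space after the marker keeps "#" from matching "##".
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or deeper level.
                for closed in [lv for lv in active if lv >= level]:
                    del active[closed]
                active[level] = (name, stripped[len(marker):].strip())
                break
        else:
            body.append(line)
    flush()
    return sections


sample = "# Chapter 1\nIntro.\n\n## Section 1.1\nDetails."
for section in split_markdown_by_headers(sample, [("#", "Header 1"), ("##", "Header 2")]):
    print(section)
```

The returned dictionaries have the same two keys as the `LineType` definition in this document, which makes the relationship between header tracking and chunk metadata easy to see.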

#### Experimental Markdown Syntax Text Splitter

Advanced experimental Markdown splitter with exact whitespace retention and structured metadata extraction.

```python { .api }
class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(
        self,
        headers_to_split_on: Optional[list[tuple[str, str]]] = None,
        return_each_line: bool = False,
        strip_headers: bool = True
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...
```

### JSON Data Splitting

Specialized splitter for JSON data that preserves structure while creating manageable chunks.

```python { .api }
class RecursiveJsonSplitter:
    def __init__(
        self,
        max_chunk_size: int = 2000,
        min_chunk_size: Optional[int] = None
    ) -> None: ...

    def split_json(
        self,
        json_data: dict,
        convert_lists: bool = False
    ) -> list[dict]: ...

    def split_text(
        self,
        json_data: dict,
        convert_lists: bool = False,
        ensure_ascii: bool = True
    ) -> list[str]: ...

    def create_documents(
        self,
        texts: list[dict],
        convert_lists: bool = False,
        ensure_ascii: bool = True,
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...
```

**Parameters:**
- `max_chunk_size`: Maximum size of each JSON chunk (default: `2000`)
- `min_chunk_size`: Minimum size a chunk should reach before a new chunk is started (default: `None`)

**Methods:**
- `split_json()`: Split JSON into dictionary chunks
- `split_text()`: Split JSON into string chunks
- `create_documents()`: Create Document objects from JSON

**Usage:**

```python
from langchain_text_splitters import RecursiveJsonSplitter
import json

json_splitter = RecursiveJsonSplitter(max_chunk_size=1000)

# Large JSON data
large_json = {
    "users": [
        {"id": 1, "name": "Alice", "data": {}},  # nested data elided
        {"id": 2, "name": "Bob", "data": {}},  # nested data elided
        # ... many more users
    ],
    "metadata": {"version": "1.0", "created": "2023-01-01"}
}

# Split into dictionary chunks
dict_chunks = json_splitter.split_json(large_json)

# Split into string chunks
string_chunks = json_splitter.split_text(large_json, ensure_ascii=False)

# Create Document objects
documents = json_splitter.create_documents([large_json])
```
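
The sketch below illustrates one way to split nested JSON into size-bounded chunks while keeping each value's key path intact. It is a simplified greedy packer written for this explanation, not the library's algorithm: `iter_leaves`, `set_nested`, and `split_json_sketch` are hypothetical helpers, lists are treated as indivisible values, and `min_chunk_size` is not modeled.

```python
import copy
import json


def iter_leaves(data, path=()):
    """Yield (path, value) for every non-dict value; lists count as leaves."""
    for key, value in data.items():
        if isinstance(value, dict) and value:
            yield from iter_leaves(value, path + (key,))
        else:
            yield path + (key,), value


def set_nested(target, path, value):
    """Store value under the given key path, creating dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data, max_chunk_size):
    """Greedily pack leaves into chunks whose serialized size stays bounded."""
    chunks = [{}]
    for path, value in iter_leaves(data):
        candidate = copy.deepcopy(chunks[-1])
        set_nested(candidate, path, value)
        if chunks[-1] and len(json.dumps(candidate)) > max_chunk_size:
            # Does not fit: start a new chunk that repeats the key path.
            fresh = {}
            set_nested(fresh, path, value)
            chunks.append(fresh)
        else:
            chunks[-1] = candidate
    return [chunk for chunk in chunks if chunk]


sample = {
    "users": {"alice": {"id": 1, "role": "admin"}, "bob": {"id": 2, "role": "viewer"}},
    "metadata": {"version": "1.0"},
}
for chunk in split_json_sketch(sample, max_chunk_size=60):
    print(json.dumps(chunk))
```

Because every chunk repeats the keys leading to its values, each one remains valid JSON that can be read on its own. A single leaf larger than `max_chunk_size` still ends up in a chunk by itself, so the bound is best-effort in that case.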

## Type Definitions

Document structure splitters use several type definitions for metadata and configuration:

```python { .api }
class ElementType(TypedDict):
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str
```

## Best Practices

1. **Choose appropriate headers**: Select header levels that represent logical document divisions
2. **Preserve metadata**: Document structure splitters maintain hierarchical metadata for context
3. **Handle nested structures**: JSON splitter respects nested object and array boundaries
4. **Configure chunk sizes**: Balance between context preservation and manageable chunk sizes
5. **Test with your documents**: Different document structures may require different splitting strategies
6. **Use semantic preservation**: For HTML, consider using the semantic preserving splitter for better structure retention
