# Document Structure-Aware Splitting

Structure-aware splitters segment text along the structural elements of a document format, such as HTML, Markdown, or JSON. They preserve semantic context by respecting document hierarchy, headers, and formatting while producing appropriately sized chunks.

## Capabilities

### HTML Document Splitting

Specialized splitters for HTML content that preserve document structure and semantic elements.

#### HTML Header Text Splitter

Splits HTML content based on header tags while preserving document hierarchy and metadata.

```python { .api }
class HTMLHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_element: bool = False
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def split_text_from_url(
        self,
        url: str,
        timeout: int = 10,
        **kwargs: Any
    ) -> list[Document]: ...

    def split_text_from_file(self, file: Any) -> list[Document]: ...
```

**Parameters:**
- `headers_to_split_on`: List of `(header_tag, header_name)` tuples defining split points; `header_name` is used as the metadata key for that header
- `return_each_element`: Whether to return each element separately (default: `False`)

**Usage:**

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# Split HTML text
html_content = """
<h1>Chapter 1</h1>
<p>Content of chapter 1...</p>
<h2>Section 1.1</h2>
<p>Content of section 1.1...</p>
"""
documents = html_splitter.split_text(html_content)

# Split HTML from URL (performs a network request)
url_docs = html_splitter.split_text_from_url("https://example.com", timeout=30)

# Split HTML from file
with open("document.html", "r") as f:
    file_docs = html_splitter.split_text_from_file(f)
```
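
To make the idea of header-based splitting with hierarchical metadata concrete, here is a minimal standard-library sketch. It is not the library's implementation, and the real splitter's output may differ; the names `HeaderSketch` and `sketch_split` are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class HeaderSketch(HTMLParser):
    """Illustrative only: collects text under the most recent headers."""

    def __init__(self, headers_to_split_on):
        super().__init__(convert_charrefs=True)
        self.names = dict(headers_to_split_on)  # tag -> metadata key
        self.active = {}                        # tag -> current header title
        self.sections = []
        self._text = []
        self._header_tag = None
        self._header_text = []

    def flush(self):
        # Collapse whitespace and emit the buffered text with its header path.
        content = " ".join("".join(self._text).split())
        if content:
            metadata = {self.names[t]: title for t, title in sorted(self.active.items())}
            self.sections.append({"content": content, "metadata": metadata})
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self.flush()
            self._header_tag = tag
            self._header_text = []

    def handle_data(self, data):
        # Header text is buffered separately from body text.
        target = self._header_text if self._header_tag else self._text
        target.append(data)

    def handle_endtag(self, tag):
        if tag == self._header_tag:
            # A new header replaces open headers at the same or a deeper level.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = "".join(self._header_text).strip()
            self._header_tag = None


def sketch_split(html, headers_to_split_on):
    parser = HeaderSketch(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit whatever follows the last header
    return parser.sections


sections = sketch_split(
    "<h1>Chapter 1</h1><p>Intro text.</p><h2>Section 1.1</h2><p>Details.</p>",
    [("h1", "Header 1"), ("h2", "Header 2")],
)
for section in sections:
    print(section["metadata"], "->", section["content"])
```

Each section carries the titles of all headers that enclose it, which is what lets a downstream retriever recover where a chunk sat in the document.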

#### HTML Section Splitter

Splits HTML into sections based on tags and font sizes. Requires `lxml`.

```python { .api }
class HTMLSectionSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        **kwargs: Any
    ) -> None: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

    def split_text(self, text: str) -> list[Document]: ...

    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...

    def split_html_by_headers(self, html_doc: str) -> list[dict[str, Optional[str]]]: ...

    def convert_possible_tags_to_header(self, html_content: str) -> str: ...

    def split_text_from_file(self, file: Any) -> list[Document]: ...
```

#### HTML Semantic Preserving Splitter

A beta-stage HTML splitter that preserves semantic structure and can retain links and media (images, video, audio) as Markdown.

```python { .api }
class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        *,
        max_chunk_size: int = 1000,
        chunk_overlap: int = 0,
        separators: Optional[list[str]] = None,
        elements_to_preserve: Optional[list[str]] = None,
        preserve_links: bool = False,
        preserve_images: bool = False,
        preserve_videos: bool = False,
        preserve_audio: bool = False,
        custom_handlers: Optional[dict[str, Callable[[Any], str]]] = None,
        stopword_removal: bool = False,
        stopword_lang: str = "english",
        normalize_text: bool = False,
        external_metadata: Optional[dict[str, str]] = None,
        allowlist_tags: Optional[list[str]] = None,
        denylist_tags: Optional[list[str]] = None,
        preserve_parent_metadata: bool = False,
        keep_separator: Union[bool, Literal["start", "end"]] = True
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> list[Document]: ...
```

**Parameters:**
- `max_chunk_size`: Maximum size of each chunk (default: `1000`)
- `chunk_overlap`: Number of characters to overlap between chunks (default: `0`)
- `separators`: Delimiters used by `RecursiveCharacterTextSplitter` for further splitting
- `elements_to_preserve`: HTML tags to keep intact during splitting
- `preserve_links`: Whether to convert `<a>` tags to Markdown links (default: `False`)
- `preserve_images`: Whether to convert `<img>` tags to Markdown images (default: `False`)
- `preserve_videos`: Whether to convert `<video>` tags to Markdown video links (default: `False`)
- `preserve_audio`: Whether to convert `<audio>` tags to Markdown audio links (default: `False`)
- `custom_handlers`: Custom element handlers for specific tags, mapping a tag name to a callable that returns a string
- `stopword_removal`: Whether to remove stopwords from text (default: `False`)
- `stopword_lang`: Language for stopword removal (default: `"english"`)
- `normalize_text`: Whether to normalize text during processing (default: `False`)
- `external_metadata`: Additional metadata to include in all documents
- `allowlist_tags`: HTML tags to specifically include in processing
- `denylist_tags`: HTML tags to exclude from processing
- `preserve_parent_metadata`: Whether to preserve metadata from parent elements (default: `False`)
- `keep_separator`: Whether to keep separators and where to place them: `True`, `False`, `"start"`, or `"end"` (default: `True`)
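
As an illustration of what converting `<a>` tags to Markdown links involves, the following is a minimal standard-library sketch. It is not how the splitter is implemented; `LinkToMarkdown` and `links_to_markdown` are hypothetical names, and the sketch handles only simple, non-nested anchors.

```python
from html.parser import HTMLParser
from typing import Optional


class LinkToMarkdown(HTMLParser):
    """Hypothetical helper: rewrites <a href="..."> tags as Markdown links."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._href: Optional[str] = None
        self._link_text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._link_text = []

    def handle_data(self, data):
        # Text inside an anchor is buffered; everything else passes through.
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._link_text).strip()
            self.parts.append(f"[{text}]({self._href})")
            self._href = None


def links_to_markdown(html: str) -> str:
    parser = LinkToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(links_to_markdown('<p>See the <a href="https://example.com">docs</a> for details.</p>'))
```

Keeping the link target inline means the URL survives chunking instead of being dropped along with the tag.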

### Markdown Document Splitting

Specialized splitters for Markdown content that understand heading hierarchy and structure.

#### Markdown Text Splitter

Basic Markdown splitting that extends recursive character splitting with Markdown-specific separators.

```python { .api }
class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```
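
The sketch below shows the general idea of recursive splitting with Markdown-flavored separators in plain Python. It is a simplification under stated assumptions: the separator list `MARKDOWN_SEPARATORS` is an assumed example, not the library's actual list, the function `recursive_split` is hypothetical, and overlap between chunks is ignored.

```python
# Assumed, simplified separators ordered from coarsest to finest.
MARKDOWN_SEPARATORS = ["\n## ", "\n\n", "\n", " "]


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator available, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, finer, chunk_size)

    chunks: list[str] = []
    current = ""
    for index, piece in enumerate(pieces):
        # Re-attach the separator so headings stay with the text they introduce.
        piece = piece if index == 0 else sep + piece
        if len(current) + len(piece) <= chunk_size:
            current += piece
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current.strip():
        chunks.append(current)
    return chunks


sample = (
    "# Title\n\nIntro paragraph.\n\n"
    "## Section A\n\nBody of section A.\n\n"
    "## Section B\n\nBody of section B."
)
chunks = recursive_split(sample, MARKDOWN_SEPARATORS, chunk_size=40)
for chunk in chunks:
    print(repr(chunk.strip()))
```

Trying heading separators before paragraph and line breaks is what keeps each chunk aligned with a section boundary whenever the section fits within the size limit.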

#### Markdown Header Text Splitter

Splits Markdown content based on header levels while preserving document structure.

```python { .api }
class MarkdownHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,
        strip_headers: bool = True,
        custom_header_patterns: Optional[dict[int, str]] = None
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]: ...
```

**Parameters:**
- `headers_to_split_on`: List of `(header_level, header_name)` tuples, for example `("#", "Header 1")`
- `return_each_line`: Whether to return each line as a separate document (default: `False`)
- `strip_headers`: Whether to remove header text from the chunk content (default: `True`)
- `custom_header_patterns`: Custom regex patterns for header detection (default: `None`)

**Usage:**

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

markdown_content = """
# Chapter 1
Content of chapter 1...

## Section 1.1
Content of section 1.1...

### Subsection 1.1.1
Content of subsection...
"""

documents = markdown_splitter.split_text(markdown_content)
```
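
To show how header-based splitting can attach the enclosing header path to each chunk, here is a small pure-Python sketch. It is an approximation for illustration, not the library's algorithm: the function `split_by_headers` is hypothetical, it always strips headers from the content, and it does not treat fenced code blocks specially.

```python
import re


def split_by_headers(text: str, headers_to_split_on: list[tuple[str, str]]) -> list[dict]:
    """Group lines under their enclosing headers and record the header path."""
    names = dict(headers_to_split_on)        # e.g. {"#": "Header 1"}
    active: dict[int, tuple[str, str]] = {}  # header level -> (metadata key, title)
    chunks: list[dict] = []
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"content": content, "metadata": metadata})
        body.clear()

    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and match.group(1) in names:
            flush()  # text gathered so far belongs to the previous header path
            level = len(match.group(1))
            # A new header closes any open headers at the same or a deeper level.
            for open_level in [lvl for lvl in active if lvl >= level]:
                del active[open_level]
            active[level] = (names[match.group(1)], match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


example = "# Chapter 1\nIntro.\n\n## Section 1.1\nDetails.\n\n# Chapter 2\nMore."
result = split_by_headers(example, [("#", "Header 1"), ("##", "Header 2")])
for chunk in result:
    print(chunk["metadata"], "->", chunk["content"])
```

Note how the chunk under `# Chapter 2` no longer carries `Header 2`: a header at a given level resets everything nested beneath the previous header of that level.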

#### Experimental Markdown Syntax Text Splitter

Advanced experimental Markdown splitter with exact whitespace retention and structured metadata extraction.

```python { .api }
class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(
        self,
        headers_to_split_on: Optional[list[tuple[str, str]]] = None,
        return_each_line: bool = False,
        strip_headers: bool = True
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...
```

### JSON Data Splitting

Specialized splitter for JSON data that preserves structure while creating manageable chunks.

```python { .api }
class RecursiveJsonSplitter:
    def __init__(
        self,
        max_chunk_size: int = 2000,
        min_chunk_size: Optional[int] = None
    ) -> None: ...

    def split_json(
        self,
        json_data: dict,
        convert_lists: bool = False
    ) -> list[dict]: ...

    def split_text(
        self,
        json_data: dict,
        convert_lists: bool = False,
        ensure_ascii: bool = True
    ) -> list[str]: ...

    def create_documents(
        self,
        texts: list[dict],
        convert_lists: bool = False,
        ensure_ascii: bool = True,
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...
```

**Parameters:**
- `max_chunk_size`: Maximum size of JSON chunks (default: `2000`)
- `min_chunk_size`: Minimum size for chunk splitting (default: `None`)

**Methods:**
- `split_json()`: Split JSON into dictionary chunks
- `split_text()`: Split JSON into string chunks
- `create_documents()`: Create Document objects from JSON

**Usage:**

```python
from langchain_text_splitters import RecursiveJsonSplitter

json_splitter = RecursiveJsonSplitter(max_chunk_size=1000)

# Large JSON data (the "data" payloads are elided and left empty here)
large_json = {
    "users": [
        {"id": 1, "name": "Alice", "data": {}},
        {"id": 2, "name": "Bob", "data": {}},
        # ... many more users
    ],
    "metadata": {"version": "1.0", "created": "2023-01-01"}
}

# Split into dictionary chunks
dict_chunks = json_splitter.split_json(large_json)

# Split into string chunks
string_chunks = json_splitter.split_text(large_json, ensure_ascii=False)

# Create Document objects
documents = json_splitter.create_documents([large_json])
```
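
The following pure-Python sketch illustrates one way a nested dictionary can be split into size-bounded chunks while keeping each value under its original key path. It is a simplified approximation, not the library's algorithm: `set_nested` and `sketch_split_json` are hypothetical names, lists are treated as indivisible values, and there is no equivalent of `min_chunk_size` or `convert_lists`.

```python
import copy
import json


def set_nested(target: dict, path: list[str], value) -> None:
    """Write value into target at the nested key path, creating parents as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def sketch_split_json(data: dict, max_chunk_size: int) -> list[dict]:
    """Greedily pack leaf values into chunks bounded by their serialized length."""
    chunks: list[dict] = [{}]

    def walk(node: dict, path: list[str]) -> None:
        for key, value in node.items():
            if isinstance(value, dict) and value:
                walk(value, path + [key])  # descend into nested objects
                continue
            candidate = copy.deepcopy(chunks[-1])
            set_nested(candidate, path + [key], value)
            if len(json.dumps(candidate)) > max_chunk_size and chunks[-1]:
                # The current chunk is full: start a new one with this value.
                chunks.append({})
                set_nested(chunks[-1], path + [key], value)
            else:
                # Fits, or the chunk was empty and the value alone is oversized.
                chunks[-1] = candidate

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


data = {"a": {"x": "1" * 10, "y": "2" * 10}, "b": {"z": "3" * 10}}
parts = sketch_split_json(data, max_chunk_size=40)
for part in parts:
    print(len(json.dumps(part)), part)
```

Because every chunk repeats the full key path of the values it holds, each one remains a valid JSON object that can be interpreted on its own.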

## Type Definitions

Document structure splitters use several type definitions for metadata and configuration:

```python { .api }
class ElementType(TypedDict):
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str
```
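
As a brief sketch of how such `TypedDict` shapes behave, the example below re-declares two of them locally so that it runs without the library installed. At runtime the values are plain dictionaries; the class definitions only guide static type checkers. The sample values are illustrative.

```python
from typing import TypedDict


class HeaderType(TypedDict):
    level: int
    name: str
    data: str


class LineType(TypedDict):
    metadata: dict[str, str]
    content: str


# A header record and a line that carries it as metadata.
header: HeaderType = {"level": 2, "name": "Header 2", "data": "Section 1.1"}
line: LineType = {
    "metadata": {header["name"]: header["data"]},
    "content": "Content of section 1.1...",
}

print(type(line) is dict)
print(line["metadata"])
```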

## Best Practices

1. **Choose appropriate headers**: Select header levels that represent logical document divisions
2. **Preserve metadata**: Document structure splitters maintain hierarchical metadata for context
3. **Handle nested structures**: JSON splitter respects nested object and array boundaries
4. **Configure chunk sizes**: Balance between context preservation and manageable chunk sizes
5. **Test with your documents**: Different document structures may require different splitting strategies
6. **Use semantic preservation**: For HTML, consider using the semantic preserving splitter for better structure retention