# Document Structure-Aware Splitting

Document structure-aware splitting provides specialized text segmentation that understands and preserves the structural elements of various document formats. These splitters maintain semantic context by respecting document hierarchy, headers, and formatting while creating appropriately sized chunks.

## Capabilities

### HTML Document Splitting

Specialized splitters for HTML content that preserve document structure and semantic elements.

#### HTML Header Text Splitter

Splits HTML content based on header tags while preserving document hierarchy and metadata.

```python { .api }
class HTMLHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_element: bool = False
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def split_text_from_url(
        self,
        url: str,
        timeout: int = 10,
        **kwargs: Any
    ) -> list[Document]: ...

    def split_text_from_file(self, file: Any) -> list[Document]: ...
```

**Parameters:**

- `headers_to_split_on`: List of tuples `(header_tag, header_name)` defining split points
- `return_each_element`: Whether to return each element separately (default: `False`)

**Usage:**

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# Split HTML text
html_content = """
<h1>Chapter 1</h1>
<p>Content of chapter 1...</p>
<h2>Section 1.1</h2>
<p>Content of section 1.1...</p>
"""
documents = html_splitter.split_text(html_content)

# Split HTML from URL
url_docs = html_splitter.split_text_from_url("https://example.com", timeout=30)

# Split HTML from file
with open("document.html", "r") as f:
    file_docs = html_splitter.split_text_from_file(f)
```

#### HTML Section Splitter

Advanced HTML splitting based on tags and font sizes; requires `lxml` for enhanced processing.

```python { .api }
class HTMLSectionSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        **kwargs: Any
    ) -> None: ...

    def split_documents(self, documents: Iterable[Document]) -> list[Document]: ...

    def split_text(self, text: str) -> list[Document]: ...

    def create_documents(
        self,
        texts: list[str],
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...

    def split_html_by_headers(self, html_doc: str) -> list[dict[str, Optional[str]]]: ...

    def convert_possible_tags_to_header(self, html_content: str) -> str: ...

    def split_text_from_file(self, file: Any) -> list[Document]: ...
```

#### HTML Semantic Preserving Splitter

Beta-stage advanced HTML splitter that preserves semantic structure, with media handling capabilities.

```python { .api }
class HTMLSemanticPreservingSplitter(BaseDocumentTransformer):
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        *,
        max_chunk_size: int = 1000,
        chunk_overlap: int = 0,
        separators: Optional[list[str]] = None,
        elements_to_preserve: Optional[list[str]] = None,
        preserve_links: bool = False,
        preserve_images: bool = False,
        preserve_videos: bool = False,
        preserve_audio: bool = False,
        custom_handlers: Optional[dict[str, Callable[[Any], str]]] = None,
        stopword_removal: bool = False,
        stopword_lang: str = "english",
        normalize_text: bool = False,
        external_metadata: Optional[dict[str, str]] = None,
        allowlist_tags: Optional[list[str]] = None,
        denylist_tags: Optional[list[str]] = None,
        preserve_parent_metadata: bool = False,
        keep_separator: Union[bool, Literal["start", "end"]] = True
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def transform_documents(
        self,
        documents: Sequence[Document],
        **kwargs: Any
    ) -> list[Document]: ...
```

**Parameters:**

- `max_chunk_size`: Maximum size of each chunk (default: `1000`)
- `chunk_overlap`: Number of characters to overlap between chunks (default: `0`)
- `separators`: Delimiters used by `RecursiveCharacterTextSplitter` for further splitting
- `elements_to_preserve`: HTML tags to remain intact during splitting
- `preserve_links`: Whether to convert `<a>` tags to Markdown links (default: `False`)
- `preserve_images`: Whether to convert `<img>` tags to Markdown images (default: `False`)
- `preserve_videos`: Whether to convert `<video>` tags to Markdown video links (default: `False`)
- `preserve_audio`: Whether to convert `<audio>` tags to Markdown audio links (default: `False`)
- `custom_handlers`: Custom element handlers for specific tags
- `stopword_removal`: Whether to remove stopwords from text (default: `False`)
- `stopword_lang`: Language for stopword removal (default: `"english"`)
- `normalize_text`: Whether to normalize text during processing (default: `False`)
- `external_metadata`: Additional metadata to include in all documents
- `allowlist_tags`: HTML tags to specifically include in processing
- `denylist_tags`: HTML tags to exclude from processing
- `preserve_parent_metadata`: Whether to preserve metadata from parent elements (default: `False`)
- `keep_separator`: Whether to keep separators and where to place them (default: `True`)

### Markdown Document Splitting

Specialized splitters for Markdown content that understand heading hierarchy and structure.

#### Markdown Text Splitter

Basic Markdown splitting that extends recursive character splitting with Markdown-specific separators.

```python { .api }
class MarkdownTextSplitter(RecursiveCharacterTextSplitter):
    def __init__(self, **kwargs: Any) -> None: ...
```

#### Markdown Header Text Splitter

Splits Markdown content based on header levels while preserving document structure.

```python { .api }
class MarkdownHeaderTextSplitter:
    def __init__(
        self,
        headers_to_split_on: list[tuple[str, str]],
        return_each_line: bool = False,
        strip_headers: bool = True,
        custom_header_patterns: Optional[dict[int, str]] = None
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...

    def aggregate_lines_to_chunks(self, lines: list[LineType]) -> list[Document]: ...
```

**Parameters:**

- `headers_to_split_on`: List of tuples `(header_level, header_name)`
- `return_each_line`: Whether to return each line as a separate document
- `strip_headers`: Whether to remove header text from content
- `custom_header_patterns`: Custom regex patterns for header detection

**Usage:**

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

markdown_content = """
# Chapter 1
Content of chapter 1...

## Section 1.1
Content of section 1.1...

### Subsection 1.1.1
Content of subsection...
"""

documents = markdown_splitter.split_text(markdown_content)
```

#### Experimental Markdown Syntax Text Splitter

Advanced experimental Markdown splitter with exact whitespace retention and structured metadata extraction.

```python { .api }
class ExperimentalMarkdownSyntaxTextSplitter:
    def __init__(
        self,
        headers_to_split_on: Optional[list[tuple[str, str]]] = None,
        return_each_line: bool = False,
        strip_headers: bool = True
    ) -> None: ...

    def split_text(self, text: str) -> list[Document]: ...
```

### JSON Data Splitting

Specialized splitter for JSON data that preserves structure while creating manageable chunks.

```python { .api }
class RecursiveJsonSplitter:
    def __init__(
        self,
        max_chunk_size: int = 2000,
        min_chunk_size: Optional[int] = None
    ) -> None: ...

    def split_json(
        self,
        json_data: dict,
        convert_lists: bool = False
    ) -> list[dict]: ...

    def split_text(
        self,
        json_data: dict,
        convert_lists: bool = False,
        ensure_ascii: bool = True
    ) -> list[str]: ...

    def create_documents(
        self,
        texts: list[dict],
        convert_lists: bool = False,
        ensure_ascii: bool = True,
        metadatas: Optional[list[dict[Any, Any]]] = None
    ) -> list[Document]: ...
```

**Parameters:**

- `max_chunk_size`: Maximum size of JSON chunks
- `min_chunk_size`: Minimum size for chunk splitting

**Methods:**

- `split_json()`: Split JSON into dictionary chunks
- `split_text()`: Split JSON into string chunks
- `create_documents()`: Create Document objects from JSON

**Usage:**

```python
from langchain_text_splitters import RecursiveJsonSplitter

json_splitter = RecursiveJsonSplitter(max_chunk_size=1000)

# Large JSON data
large_json = {
    "users": [
        {"id": 1, "name": "Alice", "data": {...}},
        {"id": 2, "name": "Bob", "data": {...}},
        # ... many more users
    ],
    "metadata": {"version": "1.0", "created": "2023-01-01"}
}

# Split into dictionary chunks
dict_chunks = json_splitter.split_json(large_json)

# Split into string chunks
string_chunks = json_splitter.split_text(large_json, ensure_ascii=False)

# Create Document objects
documents = json_splitter.create_documents([large_json])
```

## Type Definitions

Document structure splitters use several type definitions for metadata and configuration:

```python { .api }
class ElementType(TypedDict):
    url: str
    xpath: str
    content: str
    metadata: dict[str, str]

class HeaderType(TypedDict):
    level: int
    name: str
    data: str

class LineType(TypedDict):
    metadata: dict[str, str]
    content: str
```

## Best Practices

1. **Choose appropriate headers**: Select header levels that represent logical document divisions
2. **Preserve metadata**: Document structure splitters maintain hierarchical metadata for context
3. **Handle nested structures**: The JSON splitter respects nested object and array boundaries
4. **Configure chunk sizes**: Balance context preservation against manageable chunk sizes
5. **Test with your documents**: Different document structures may require different splitting strategies
6. **Use semantic preservation**: For HTML, consider the semantic preserving splitter for better structure retention