# Document Format Splitting

Specialized splitters for specific document formats such as Markdown and LaTeX. Each one splits preferentially at the format's structural boundaries, so chunks tend to follow the document's structure rather than cut across it.

## Capabilities

### MarkdownTextSplitter Class

Splitter for Markdown documents that prefers to break at headings, ends of code blocks, and horizontal rules before falling back to paragraphs, lines, and words.

```typescript { .api }
/**
 * Text splitter optimized for Markdown documents.
 * Prefers split points at headings and ends of code blocks.
 */
class MarkdownTextSplitter extends RecursiveCharacterTextSplitter implements MarkdownTextSplitterParams {
  constructor(fields?: Partial<MarkdownTextSplitterParams>);
}

type MarkdownTextSplitterParams = TextSplitterParams;
```

**Usage Examples:**

```typescript
import { MarkdownTextSplitter } from "@langchain/textsplitters";

// Basic Markdown splitting
const markdownSplitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const markdownContent = `# Main Title

This is the introduction paragraph with some **bold text** and *italic text*.

## Section One

Here's content in section one with a [link](https://example.com).

### Subsection

More detailed content here.

\`\`\`javascript
// Code block that should be preserved
function example() {
  return "Hello World";
}
\`\`\`

## Section Two

Final section with a list:

- Item one
- Item two
- Item three

> This is a blockquote that should be preserved.
`;

const chunks = await markdownSplitter.splitText(markdownContent);
// Chunks break preferentially at H2-H6 headings, code block ends, and
// horizontal rules; sections smaller than chunkSize are merged together.
```

### LatexTextSplitter Class

Splitter for LaTeX documents that prefers to break at sectioning commands and at the start of common list, quote, and math environments.

```typescript { .api }
/**
 * Text splitter optimized for LaTeX documents.
 * Prefers split points at sectioning commands and environment starts.
 */
class LatexTextSplitter extends RecursiveCharacterTextSplitter implements LatexTextSplitterParams {
  constructor(fields?: Partial<LatexTextSplitterParams>);
}

type LatexTextSplitterParams = TextSplitterParams;
```

**Usage Examples:**

```typescript
import { LatexTextSplitter } from "@langchain/textsplitters";

// Basic LaTeX splitting
const latexSplitter = new LatexTextSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});

const latexContent = `\\documentclass{article}
\\usepackage{amsmath}

\\title{Research Paper Title}
\\author{Author Name}
\\date{}

\\begin{document}

\\maketitle

\\section{Introduction}

This is the introduction section with some mathematical notation: $E = mc^2$.

\\subsection{Background}

Some background information with an equation:

\\begin{equation}
f(x) = \\int_{-\\infty}^{\\infty} g(t) e^{-2\\pi i x t} dt
\\end{equation}

\\section{Methodology}

The methodology section describes our approach.

\\begin{itemize}
\\item First step of the process
\\item Second step with more details
\\item Final step and conclusions
\\end{itemize}

\\section{Results}

Results are presented in this section.

\\begin{align}
y &= mx + b \\\\
z &= ax^2 + bx + c
\\end{align}

\\section{Conclusion}

Final conclusions and future work.

\\end{document}`;

const latexChunks = await latexSplitter.splitText(latexContent);
// Chunks break preferentially at sectioning commands and at the start of the
// listed environments; pieces smaller than chunkSize are merged together.
```

### Markdown-Specific Features

The Markdown splitter tries a fixed list of separators in priority order, from structural boundaries down to single characters:

**Markdown Separator Hierarchy:**
```typescript
// Internal separator order used by MarkdownTextSplitter
const markdownSeparators = [
  "\n## ",        // H2 headings (H1 is not in the list)
  "\n### ",       // H3 headings
  "\n#### ",      // H4 headings
  "\n##### ",     // H5 headings
  "\n###### ",    // H6 headings
  "```\n\n",      // End of code blocks
  "\n\n***\n\n",  // Horizontal rules (asterisk)
  "\n\n---\n\n",  // Horizontal rules (dash)
  "\n\n___\n\n",  // Horizontal rules (underscore)
  "\n\n",         // Paragraph breaks
  "\n",           // Line breaks
  " ",            // Spaces
  ""              // Characters
];
```
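
As a rough illustration of how such a priority list is consulted, the sketch below picks the first separator that actually occurs in a text and splits on it. This is a simplified stand-in, not the library's implementation: `pickSeparator` and `splitOnce` are hypothetical helper names, the separator list is shortened, and the real splitter additionally recurses into oversized pieces and merges small ones up to `chunkSize`.

```typescript
// Hypothetical helpers illustrating priority-ordered separator selection.
// Shortened separator list; an assumption made for the example only.
const separators: string[] = ["\n## ", "\n### ", "\n\n", "\n", " ", ""];

// Return the first separator in priority order that occurs in the text.
// The empty string always matches, so it acts as the final fallback.
function pickSeparator(text: string, seps: string[]): string {
  for (const sep of seps) {
    if (sep === "" || text.includes(sep)) return sep;
  }
  return "";
}

// Split once on the chosen separator, dropping empty pieces.
function splitOnce(text: string, seps: string[]): string[] {
  const sep = pickSeparator(text, seps);
  return text.split(sep).filter((piece) => piece !== "");
}

const sample = "# Title\n\nIntro.\n## One\nBody one.\n## Two\nBody two.";
const pieces = splitOnce(sample, separators);
console.log(pieces);
```

Because `"\n## "` occurs in the sample, it wins over the paragraph break even though both are present, which is why the H1 title and the intro end up in the same piece.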

**Advanced Markdown Usage:**

```typescript
// Custom configuration for documentation
const docSplitter = new MarkdownTextSplitter({
  chunkSize: 1500,
  chunkOverlap: 150,
  keepSeparator: true, // Keep the separator (e.g. the heading marker) in the chunk text
});

// Process technical documentation
const technicalDoc = `# API Reference

## Authentication

All API requests require authentication using Bearer tokens.

\`\`\`bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com/users
\`\`\`

## Endpoints

### GET /users

Retrieves a list of users.

**Parameters:**
- \`limit\` (optional): Maximum number of users to return
- \`offset\` (optional): Number of users to skip

**Response:**
\`\`\`json
{
  "users": [...],
  "total": 100,
  "limit": 20,
  "offset": 0
}
\`\`\`

### POST /users

Creates a new user.`;

const docChunks = await docSplitter.splitText(technicalDoc);

// Create structured documents
const docSections = await docSplitter.createDocuments(
  [technicalDoc],
  [{ type: "api_docs", version: "1.0" }],
  {
    chunkHeader: "=== API Documentation Section ===\n",
    appendChunkOverlapHeader: true
  }
);
```
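
To make the effect of `keepSeparator` concrete, the following sketch splits on a heading separator with and without keeping it. It is an illustration using plain string operations, not the library's code; `splitKeeping`, `splitDropping`, and `escapeRegExp` are hypothetical names.

```typescript
// Escape regex metacharacters so the separator is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Split just before each occurrence of the separator, so every piece after
// the first still begins with the separator text (e.g. the heading marker).
function splitKeeping(text: string, separator: string): string[] {
  const pattern = new RegExp(`(?=${escapeRegExp(separator)})`);
  return text.split(pattern).filter((piece) => piece !== "");
}

// Split on the separator and discard it.
function splitDropping(text: string, separator: string): string[] {
  return text.split(separator).filter((piece) => piece !== "");
}

const doc = "# API\nIntro\n## Auth\nTokens\n## Endpoints\nList";
const kept = splitKeeping(doc, "\n## ");
const dropped = splitDropping(doc, "\n## ");
console.log(kept);
console.log(dropped);
```

With the separator kept, a piece such as the `Auth` section still starts with its `## ` marker, so a reader of the chunk can tell it is a heading; with the separator dropped, only the bare heading text remains.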

### LaTeX-Specific Features

The LaTeX splitter tries separators that follow the structure of typical LaTeX documents, from sectioning commands down to single characters:

**LaTeX Separator Hierarchy:**
```typescript
// Internal separator order used by LatexTextSplitter
const latexSeparators = [
  "\n\\chapter{",            // Chapter divisions
  "\n\\section{",            // Section divisions
  "\n\\subsection{",         // Subsection divisions
  "\n\\subsubsection{",      // Subsubsection divisions
  "\n\\begin{enumerate}",    // List environments
  "\n\\begin{itemize}",      // List environments
  "\n\\begin{description}",  // Description lists
  "\n\\begin{list}",         // Generic lists
  "\n\\begin{quote}",        // Quote environments
  "\n\\begin{quotation}",    // Quotation environments
  "\n\\begin{verse}",        // Verse environments
  "\n\\begin{verbatim}",     // Verbatim environments
  "\n\\begin{align}",        // Math environments (only align is listed)
  "$$",                      // Display math
  "$",                       // Inline math
  "\n\n",                    // Paragraph breaks
  "\n",                      // Line breaks
  " ",                       // Spaces
  ""                         // Characters
];
```
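
Since only some environments appear in this list, it can help to check which environments in a given document have no dedicated separator. The helper below is a hypothetical sketch (`uncoveredEnvironments` is not part of the library): the set of covered names is copied from the list above, and the function reports any other `\begin{...}` names it finds. For a complete file it would also report `document`.

```typescript
// Environment names that have a dedicated "\n\\begin{...}" separator above.
const coveredEnvironments = new Set<string>([
  "enumerate", "itemize", "description", "list",
  "quote", "quotation", "verse", "verbatim", "align",
]);

// Report environments used in the source that have no dedicated separator.
function uncoveredEnvironments(tex: string): string[] {
  const found = new Set<string>();
  const pattern = /\\begin\{([A-Za-z*]+)\}/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(tex)) !== null) {
    const name = match[1];
    if (!coveredEnvironments.has(name)) found.add(name);
  }
  return Array.from(found).sort();
}

const tex = [
  "\\section{Intro}",
  "\\begin{equation}",
  "E = mc^2",
  "\\end{equation}",
  "\\begin{itemize}",
  "\\item one",
  "\\end{itemize}",
  "\\begin{align*}",
  "y &= mx + b",
  "\\end{align*}",
].join("\n");

console.log(uncoveredEnvironments(tex));
```

Here `equation` and the starred `align*` are reported, because neither matches a listed separator exactly; content inside them is only split by the lower-priority separators such as `$`, paragraph breaks, or line breaks.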

**Advanced LaTeX Usage:**

```typescript
// Configuration for academic papers
const academicSplitter = new LatexTextSplitter({
  chunkSize: 2000,     // Longer chunks for academic content
  chunkOverlap: 200,   // Overlap to carry context between chunks
  keepSeparator: true, // Preserve LaTeX commands
});

// Process research paper
const researchPaper = `\\section{Literature Review}

Previous work in this area includes studies by \\cite{smith2020} and \\cite{jones2021}.

\\subsection{Theoretical Framework}

The theoretical framework is based on the following principles:

\\begin{enumerate}
\\item First principle with mathematical foundation
\\item Second principle involving:
\\begin{equation}
\\mathbf{X} = \\mathbf{A}\\mathbf{B} + \\mathbf{C}
\\end{equation}
\\item Third principle with experimental validation
\\end{enumerate}

\\subsection{Experimental Design}

Our experimental approach follows established protocols.`;

const paperChunks = await academicSplitter.splitText(researchPaper);

// Process with metadata for citation tracking
const paperSections = await academicSplitter.createDocuments(
  [researchPaper],
  [{
    paper_id: "smith2023_ml_approach",
    authors: ["Smith, J.", "Doe, A."],
    journal: "AI Research Quarterly"
  }]
);
```

### Document Processing Integration

Both format splitters extend the same recursive text splitter, so they accept LangChain `Document` objects through `transformDocuments` in the same way:

```typescript
import { MarkdownTextSplitter, LatexTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Placeholder inputs; in practice these would be loaded from files
const readmeContent = "# Project\n\nOverview text.\n\n## Usage\n\nUsage notes.";
const paperContent = "\\section{Introduction}\n\nIntroductory text.";

// Process mixed document types
const markdownDocs = [
  new Document({
    pageContent: readmeContent,
    metadata: { type: "readme", language: "markdown" }
  })
];

const latexDocs = [
  new Document({
    pageContent: paperContent,
    metadata: { type: "paper", language: "latex" }
  })
];

// Split with appropriate splitters
const markdownSplitter = new MarkdownTextSplitter({ chunkSize: 1000 });
const latexSplitter = new LatexTextSplitter({ chunkSize: 1500 });

const [splitMarkdown, splitLatex] = await Promise.all([
  markdownSplitter.transformDocuments(markdownDocs),
  latexSplitter.transformDocuments(latexDocs)
]);

// Combine results maintaining document type information
const allSplitDocs = [...splitMarkdown, ...splitLatex];
```
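
Because each chunk should carry the metadata of its source document, a combined array can still be separated by type later. A minimal sketch follows, using a hypothetical `ChunkLike` shape in place of the real `Document` class so that it runs without any dependencies; `groupByType` is likewise an illustrative name.

```typescript
// Minimal stand-in for a split document; only the fields used here.
interface ChunkLike {
  pageContent: string;
  metadata: { type?: string };
}

// Group chunks by metadata.type, using "unknown" when the field is missing.
function groupByType(chunks: ChunkLike[]): Map<string, ChunkLike[]> {
  const groups = new Map<string, ChunkLike[]>();
  for (const chunk of chunks) {
    const key = chunk.metadata.type ?? "unknown";
    const bucket = groups.get(key) ?? [];
    bucket.push(chunk);
    groups.set(key, bucket);
  }
  return groups;
}

const sampleChunks: ChunkLike[] = [
  { pageContent: "# Project", metadata: { type: "readme" } },
  { pageContent: "## Usage", metadata: { type: "readme" } },
  { pageContent: "\\section{Introduction}", metadata: { type: "paper" } },
];

const grouped = groupByType(sampleChunks);
console.log(Array.from(grouped.keys()));
```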

### Format Detection and Processing

A simple workflow that guesses the format from the content and picks a matching splitter:

```typescript
import {
  MarkdownTextSplitter,
  LatexTextSplitter,
  RecursiveCharacterTextSplitter,
  type TextSplitterParams,
} from "@langchain/textsplitters";

function createFormatSplitter(content: string, options: Partial<TextSplitterParams> = {}) {
  // Simple substring-based format detection; it can misclassify plain text
  if (content.includes('\\documentclass') || content.includes('\\begin{document}')) {
    return new LatexTextSplitter(options);
  } else if (content.includes('# ') || content.includes('```')) {
    // '# ' also matches deeper headings such as '## '
    return new MarkdownTextSplitter(options);
  } else {
    // Fall back to recursive character splitter
    return new RecursiveCharacterTextSplitter(options);
  }
}

// Process documents with automatic format detection
async function processDocuments(
  documents: Array<{ content: string; metadata: Record<string, unknown> }>
) {
  const results = [];

  for (const doc of documents) {
    const splitter = createFormatSplitter(doc.content, {
      chunkSize: 1000,
      chunkOverlap: 100
    });

    const chunks = await splitter.createDocuments(
      [doc.content],
      [{ ...doc.metadata, detected_format: splitter.constructor.name }]
    );

    results.push(...chunks);
  }

  return results;
}
```
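
Substring checks such as `content.includes('# ')` can misfire on plain text, for example a sentence that mentions "C# ". One possible refinement, sketched here as an assumption rather than a recommended implementation, anchors the Markdown checks to line starts with regular expressions; `detectFormat` and `DetectedFormat` are hypothetical names.

```typescript
type DetectedFormat = "latex" | "markdown" | "plain";

function detectFormat(content: string): DetectedFormat {
  // LaTeX: a \documentclass or \begin{document} command anywhere in the text.
  if (/\\documentclass\b|\\begin\{document\}/.test(content)) return "latex";
  // Markdown: an ATX heading (1-6 '#' followed by whitespace) or a code
  // fence, each at the start of a line.
  if (/^#{1,6}\s/m.test(content) || /^```/m.test(content)) return "markdown";
  return "plain";
}

console.log(detectFormat("\\documentclass{article}\n\\begin{document}"));
console.log(detectFormat("# Title\n\nText"));
console.log(detectFormat("C# is a language"));
```

This remains a heuristic: LaTeX fragments without a preamble and Markdown without headings or fences would still be classified as plain text.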