Tessl Tile for npm/@langchain/textsplitters@0.1.0

# Document Format Splitting

Specialized splitters for specific document formats such as Markdown and LaTeX. Each one tries format-specific boundaries (headings, sections, environments) before falling back to paragraphs, lines, and characters, so chunks tend to follow the document's structure.

## Capabilities

### MarkdownTextSplitter Class

Splitter for Markdown documents that prefers to break at headings, after code blocks, and at horizontal rules.

```typescript { .api }
/**
 * Text splitter optimized for Markdown documents
 * Prefers to split at headings and after code blocks
 */
class MarkdownTextSplitter extends RecursiveCharacterTextSplitter implements MarkdownTextSplitterParams {
  constructor(fields?: Partial<MarkdownTextSplitterParams>);
}

type MarkdownTextSplitterParams = TextSplitterParams;
```

**Usage Examples:**

```typescript
import { MarkdownTextSplitter } from "@langchain/textsplitters";

// Basic Markdown splitting
const markdownSplitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const markdownContent = `# Main Title

This is the introduction paragraph with some **bold text** and *italic text*.

## Section One

Here's content in section one with a [link](https://example.com).

### Subsection

More detailed content here.

\`\`\`javascript
// Code block that should be preserved
function example() {
  return "Hello World";
}
\`\`\`

## Section Two

Final section with a list:

- Item one
- Item two
- Item three

> This is a blockquote that should be preserved.
`;

const chunks = await markdownSplitter.splitText(markdownContent);
// Splits preferentially at heading boundaries and after code blocks.
// This sample is shorter than chunkSize, so it should come back as a single chunk;
// lower chunkSize to see splits at the "## " headings.
```
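Whether a fenced code block survives intact depends on the chunk size: the splitter breaks after a closing fence when it can, but a block longer than `chunkSize` is still cut at lower-priority separators. The following is a minimal sketch of a sanity check for that case. `hasUnbalancedFence` and `findBrokenChunks` are hypothetical helpers, not part of `@langchain/textsplitters`, and they only count lines that begin with three backticks.

```typescript
// Hypothetical helpers (not part of @langchain/textsplitters).
const FENCE = "```";

// A chunk with an odd number of fence lines starts or ends inside a code block.
function hasUnbalancedFence(chunk: string): boolean {
  const fenceLines = chunk
    .split("\n")
    .filter((line) => line.trim().indexOf(FENCE) === 0);
  return fenceLines.length % 2 === 1;
}

// Report the indices of chunks that cut through a fenced code block.
function findBrokenChunks(chunks: string[]): number[] {
  const broken: number[] = [];
  chunks.forEach((chunk, index) => {
    if (hasUnbalancedFence(chunk)) broken.push(index);
  });
  return broken;
}
```

Running `findBrokenChunks` on the output of `splitText` gives a quick signal that `chunkSize` may be too small for the code samples in a document.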

### LatexTextSplitter Class

Splitter for LaTeX documents that prefers to break at sectioning commands and at the start of list, quote, and math environments.

```typescript { .api }
/**
 * Text splitter optimized for LaTeX documents
 * Prefers to split at sections and environment boundaries
 */
class LatexTextSplitter extends RecursiveCharacterTextSplitter implements LatexTextSplitterParams {
  constructor(fields?: Partial<LatexTextSplitterParams>);
}

type LatexTextSplitterParams = TextSplitterParams;
```

**Usage Examples:**

```typescript
import { LatexTextSplitter } from "@langchain/textsplitters";

// Basic LaTeX splitting
const latexSplitter = new LatexTextSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});

const latexContent = `\\documentclass{article}
\\usepackage{amsmath}

\\title{Research Paper Title}
\\author{Author Name}
\\date{}

\\begin{document}

\\maketitle

\\section{Introduction}

This is the introduction section with some mathematical notation: $E = mc^2$.

\\subsection{Background}

Some background information with an equation:

\\begin{equation}
f(x) = \\int_{-\\infty}^{\\infty} g(t) e^{-2\\pi i x t} dt
\\end{equation}

\\section{Methodology}

The methodology section describes our approach.

\\begin{itemize}
\\item First step of the process
\\item Second step with more details
\\item Final step and conclusions
\\end{itemize}

\\section{Results}

Results are presented in this section.

\\begin{align}
y &= mx + b \\\\
z &= ax^2 + bx + c
\\end{align}

\\section{Conclusion}

Final conclusions and future work.

\\end{document}`;

const latexChunks = await latexSplitter.splitText(latexContent);
// Splits preferentially at section boundaries and at the start of environments
```

### Markdown-Specific Features

The Markdown splitter tries separators in priority order, from structural boundaries down to single characters:

**Markdown Separator Hierarchy:**
```typescript
// Internal separator order used by MarkdownTextSplitter
const markdownSeparators = [
  "\n## ",        // H2 headings
  "\n### ",       // H3 headings
  "\n#### ",      // H4 headings
  "\n##### ",     // H5 headings
  "\n###### ",    // H6 headings
  "```\n\n",      // End of code blocks
  "\n\n***\n\n",  // Horizontal rules (asterisk)
  "\n\n---\n\n",  // Horizontal rules (dash)
  "\n\n___\n\n",  // Horizontal rules (underscore)
  "\n\n",         // Paragraph breaks
  "\n",           // Line breaks
  " ",            // Spaces
  ""              // Characters
];
```
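To make the priority order concrete, here is a simplified sketch of how such a list can drive splitting: use the first separator that occurs in the text, and pass any piece that is still too long down to the remaining separators. This is an illustration under stated assumptions, not the library's implementation. `splitBySeparators` is a hypothetical function, and it leaves out the steps that merge small pieces back up to `chunkSize` and add overlap.

```typescript
// Simplified sketch of priority-ordered splitting (no chunk merging, no overlap).
// Hypothetical illustration; not the code used by @langchain/textsplitters.
function splitBySeparators(
  text: string,
  separators: string[],
  chunkSize: number
): string[] {
  if (text.length <= chunkSize) return [text];

  // Find the highest-priority separator that occurs in the text.
  const index = separators.findIndex((s) => s === "" || text.includes(s));
  if (index === -1) return [text]; // nothing left to split on
  const separator = separators[index];
  const remaining = separators.slice(index + 1);

  // The empty separator means "split anywhere": cut fixed-size slices.
  if (separator === "") {
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }

  // Split while keeping each separator at the start of the piece that follows it.
  const pieces: string[] = [];
  let start = 0;
  let next = text.indexOf(separator, 1);
  while (next !== -1) {
    pieces.push(text.slice(start, next));
    start = next;
    next = text.indexOf(separator, start + 1);
  }
  pieces.push(text.slice(start));

  // Pieces that are still too long fall through to lower-priority separators.
  const result: string[] = [];
  for (const piece of pieces) {
    result.push(...splitBySeparators(piece, remaining, chunkSize));
  }
  return result;
}
```

With a heading-first list, a document breaks at `## ` headings as long as each section fits within `chunkSize`. Only oversized sections are cut at paragraph breaks, lines, or spaces.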

**Advanced Markdown Usage:**

```typescript
// Custom configuration for documentation
const docSplitter = new MarkdownTextSplitter({
  chunkSize: 1500,
  chunkOverlap: 150,
  keepSeparator: true, // Keep headings with content
});

// Process technical documentation
const technicalDoc = `# API Reference

## Authentication

All API requests require authentication using Bearer tokens.

\`\`\`bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com/users
\`\`\`

## Endpoints

### GET /users

Retrieves a list of users.

**Parameters:**
- \`limit\` (optional): Maximum number of users to return
- \`offset\` (optional): Number of users to skip

**Response:**
\`\`\`json
{
  "users": [...],
  "total": 100,
  "limit": 20,
  "offset": 0
}
\`\`\`

### POST /users

Creates a new user.`;

const docChunks = await docSplitter.splitText(technicalDoc);

// Create structured documents
const docSections = await docSplitter.createDocuments(
  [technicalDoc],
  [{ type: "api_docs", version: "1.0" }],
  {
    chunkHeader: "=== API Documentation Section ===\n",
    appendChunkOverlapHeader: true
  }
);
```

### LaTeX-Specific Features

The LaTeX splitter tries separators that follow LaTeX document structure, in priority order:

**LaTeX Separator Hierarchy:**
```typescript
// Internal separator order used by LatexTextSplitter
const latexSeparators = [
  "\n\\chapter{",           // Chapter divisions
  "\n\\section{",           // Section divisions
  "\n\\subsection{",        // Subsection divisions
  "\n\\subsubsection{",     // Subsubsection divisions
  "\n\\begin{enumerate}",   // List environments
  "\n\\begin{itemize}",     // List environments
  "\n\\begin{description}", // Description lists
  "\n\\begin{list}",        // Generic lists
  "\n\\begin{quote}",       // Quote environments
  "\n\\begin{quotation}",   // Quotation environments
  "\n\\begin{verse}",       // Verse environments
  "\n\\begin{verbatim}",    // Verbatim environments
  "\n\\begin{align}",       // Math environments
  "$$",                     // Display math
  "$",                      // Inline math
  "\n\n",                   // Paragraph breaks
  "\n",                     // Line breaks
  " ",                      // Spaces
  ""                        // Characters
];
```
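The list above breaks before `\begin{...}` but has no `\end{...}` entries, so an environment longer than `chunkSize` can still be cut in the middle. The following is a minimal sketch of a check for that. `unclosedEnvironments` is a hypothetical helper, not part of `@langchain/textsplitters`, and it ignores stray `\end{...}` commands and anything inside comments or verbatim text.

```typescript
// Hypothetical helper (not part of @langchain/textsplitters).
// Returns the names of environments opened with \begin{...} but not closed
// with a matching \end{...} inside the same chunk.
function unclosedEnvironments(chunk: string): string[] {
  const open: string[] = [];
  const pattern = /\\(begin|end)\{([^}]+)\}/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(chunk)) !== null) {
    const kind = match[1];
    const name = match[2];
    if (kind === "begin") {
      open.push(name);
    } else {
      // Close the most recent open environment with this name, if any.
      const last = open.lastIndexOf(name);
      if (last !== -1) open.splice(last, 1);
    }
  }
  return open;
}
```

A non-empty result for a chunk suggests raising `chunkSize` or post-processing that chunk before it is embedded or rendered.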

**Advanced LaTeX Usage:**

```typescript
// Configuration for academic papers
const academicSplitter = new LatexTextSplitter({
  chunkSize: 2000,      // Longer chunks for academic content
  chunkOverlap: 200,    // Good overlap for context
  keepSeparator: true,  // Preserve LaTeX commands
});

// Process research paper
const researchPaper = `\\section{Literature Review}

Previous work in this area includes studies by \\cite{smith2020} and \\cite{jones2021}.

\\subsection{Theoretical Framework}

The theoretical framework is based on the following principles:

\\begin{enumerate}
\\item First principle with mathematical foundation
\\item Second principle involving:
  \\begin{equation}
  \\mathbf{X} = \\mathbf{A}\\mathbf{B} + \\mathbf{C}
  \\end{equation}
\\item Third principle with experimental validation
\\end{enumerate}

\\subsection{Experimental Design}

Our experimental approach follows established protocols.`;

const paperChunks = await academicSplitter.splitText(researchPaper);

// Process with metadata for citation tracking
const paperSections = await academicSplitter.createDocuments(
  [researchPaper],
  [{
    paper_id: "smith2023_ml_approach",
    authors: ["Smith, J.", "Doe, A."],
    journal: "AI Research Quarterly"
  }]
);
```
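For citation tracking at the chunk level, one option is to record which keys each chunk cites. The following is a minimal sketch. `extractCiteKeys` is a hypothetical helper, not part of `@langchain/textsplitters`. It assumes commands of the form `\cite{...}` or variants such as `\citep{...}`, and does not handle optional arguments in square brackets.

```typescript
// Hypothetical helper (not part of @langchain/textsplitters).
// Collects the unique citation keys used in a chunk, in order of first appearance.
function extractCiteKeys(chunk: string): string[] {
  const keys = new Set<string>();
  const pattern = /\\cite\w*\{([^}]+)\}/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(chunk)) !== null) {
    // A single command may list several comma-separated keys.
    for (const key of match[1].split(",")) {
      const trimmed = key.trim();
      if (trimmed) keys.add(trimmed);
    }
  }
  return Array.from(keys);
}
```

The result could be stored on each chunk's `metadata` so that downstream retrieval can filter or group by cited work.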

### Document Processing Integration

Both format splitters accept and return LangChain `Document` objects, so they fit into a document-processing pipeline:

```typescript
import { MarkdownTextSplitter, LatexTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Placeholder inputs; in practice these would be loaded from files
const readmeContent = "# Project\n\nShort description.";
const paperContent = "\\section{Introduction}\n\nShort introduction.";

// Process mixed document types
const markdownDocs = [
  new Document({
    pageContent: readmeContent,
    metadata: { type: "readme", language: "markdown" }
  })
];

const latexDocs = [
  new Document({
    pageContent: paperContent,
    metadata: { type: "paper", language: "latex" }
  })
];

// Split with appropriate splitters
const markdownSplitter = new MarkdownTextSplitter({ chunkSize: 1000 });
const latexSplitter = new LatexTextSplitter({ chunkSize: 1500 });

const [splitMarkdown, splitLatex] = await Promise.all([
  markdownSplitter.transformDocuments(markdownDocs),
  latexSplitter.transformDocuments(latexDocs)
]);

// Combine results; each chunk keeps its source document's metadata
const allSplitDocs = [...splitMarkdown, ...splitLatex];
```
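Because each chunk carries its source metadata, a combined list can be separated again later. The following is a minimal sketch of grouping chunks by a metadata key. `ChunkLike` and `groupByMetadataKey` are hypothetical names introduced here. `ChunkLike` is a structural stand-in with `pageContent` and `metadata` fields so that the sketch runs without LangChain installed.

```typescript
// Hypothetical structural type: anything with pageContent and metadata fits.
interface ChunkLike {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Group chunks by the string value of one metadata key.
// Chunks without that key are collected under "unknown".
function groupByMetadataKey(
  docs: ChunkLike[],
  key: string
): Map<string, ChunkLike[]> {
  const groups = new Map<string, ChunkLike[]>();
  for (const doc of docs) {
    const raw = doc.metadata[key];
    const value = raw === undefined || raw === null ? "unknown" : String(raw);
    const bucket = groups.get(value) || [];
    bucket.push(doc);
    groups.set(value, bucket);
  }
  return groups;
}
```

For the combined list above, `groupByMetadataKey(allSplitDocs, "language")` would return one bucket for `"markdown"` and one for `"latex"`.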

### Format Detection and Processing

A simple heuristic can pick a splitter based on the content:

```typescript
import {
  LatexTextSplitter,
  MarkdownTextSplitter,
  RecursiveCharacterTextSplitter,
} from "@langchain/textsplitters";

function createFormatSplitter(content: string, options = {}) {
  // Simple format detection based on substring checks.
  // Note: any text containing "# " is treated as Markdown.
  if (content.includes('\\documentclass') || content.includes('\\begin{document}')) {
    return new LatexTextSplitter(options);
  } else if (content.includes('# ') || content.includes('## ') || content.includes('```')) {
    return new MarkdownTextSplitter(options);
  } else {
    // Fall back to recursive character splitter
    return new RecursiveCharacterTextSplitter(options);
  }
}

// Process documents with automatic format detection
async function processDocuments(documents: Array<{content: string, metadata: any}>) {
  const results = [];

  for (const doc of documents) {
    const splitter = createFormatSplitter(doc.content, {
      chunkSize: 1000,
      chunkOverlap: 100
    });

    const chunks = await splitter.createDocuments(
      [doc.content],
      [{ ...doc.metadata, detected_format: splitter.constructor.name }]
    );

    results.push(...chunks);
  }

  return results;
}
```
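The substring checks above are easy to trip. For example, the sentence "C# is a language" contains `# ` and would be routed to the Markdown splitter, while a LaTeX fragment that starts at `\section{` has no `\documentclass` and would not be detected as LaTeX. The following is a sketch of a somewhat stricter heuristic that anchors patterns at the start of a line. `detectFormat` and `DetectedFormat` are hypothetical names, and this is still a heuristic rather than a parser.

```typescript
// Hypothetical helper: a line-anchored format heuristic.
type DetectedFormat = "latex" | "markdown" | "plain";

function detectFormat(content: string): DetectedFormat {
  // LaTeX: preamble, document body, or a sectioning command at the start of a line.
  const latexPattern = /^\s*\\(documentclass|begin\{document\}|chapter\{|(sub)*section\{)/m;
  if (latexPattern.test(content)) {
    return "latex";
  }
  // Markdown: an ATX heading (1 to 6 '#' then a space) or a code fence at line start.
  const markdownPattern = /^(#{1,6} |```)/m;
  if (markdownPattern.test(content)) {
    return "markdown";
  }
  return "plain";
}
```

The returned label can then select `LatexTextSplitter`, `MarkdownTextSplitter`, or `RecursiveCharacterTextSplitter` in the same way as `createFormatSplitter`.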
