# Document Format Splitting

Specialized splitters for Markdown and LaTeX documents. Each is a `RecursiveCharacterTextSplitter` preconfigured with format-specific separators, so chunks break at structural boundaries (headings, sections, environments) before falling back to paragraphs, lines, and words.

## Capabilities

### MarkdownTextSplitter Class

Splitter for Markdown documents that prefers to break at headings, the ends of code blocks, and horizontal rules.

```typescript { .api }
/**
 * Text splitter optimized for Markdown documents
 * Preserves heading structure and code blocks
 */
class MarkdownTextSplitter extends RecursiveCharacterTextSplitter implements MarkdownTextSplitterParams {
  constructor(fields?: Partial<MarkdownTextSplitterParams>);
}

type MarkdownTextSplitterParams = TextSplitterParams;
```

**Usage Examples:**

```typescript
import { MarkdownTextSplitter } from "@langchain/textsplitters";

// Basic Markdown splitting
const markdownSplitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const markdownContent = `# Main Title

This is the introduction paragraph with some **bold text** and *italic text*.

## Section One

Here's content in section one with a [link](https://example.com).

### Subsection

More detailed content here.

\`\`\`javascript
// Code block that should be preserved
function example() {
  return "Hello World";
}
\`\`\`

## Section Two

Final section with a list:

- Item one
- Item two
- Item three

> This is a blockquote that should be preserved.
`;

const chunks = await markdownSplitter.splitText(markdownContent);
// Chunks break at headings and after code blocks where possible; a section
// longer than chunkSize is split further at paragraph, line, or word boundaries.
// This sample is shorter than chunkSize, so it should come back as a single chunk.
```
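To see where a document was actually split, it can help to summarize the returned chunks. The sketch below is dependency-free and uses hand-written strings in place of real `splitText()` output; `summarizeChunks` and `ChunkSummary` are hypothetical names introduced here, not part of `@langchain/textsplitters`.

```typescript
// Hypothetical helper (not a library API): reports each chunk's size and
// first non-empty line so split points are easy to inspect.
interface ChunkSummary {
  index: number;
  length: number;
  firstLine: string;
  exceedsLimit: boolean;
}

function summarizeChunks(chunks: string[], chunkSize: number): ChunkSummary[] {
  return chunks.map((chunk, index) => ({
    index,
    length: chunk.length,
    // The first non-empty line is usually the heading the chunk starts with
    firstLine: chunk.split("\n").filter((line) => line.trim() !== "")[0] ?? "",
    exceedsLimit: chunk.length > chunkSize,
  }));
}

// Hand-written chunks standing in for the output of splitText()
const sampleChunks: string[] = [
  "# Main Title\n\nIntro paragraph.",
  "\n## Section One\n\nBody text.",
];

const summary = summarizeChunks(sampleChunks, 1000);
for (const item of summary) {
  console.log(`${item.index}: ${item.length} chars, starts with "${item.firstLine}"`);
}
```

With real output, pass the array returned by `splitText()` and the same `chunkSize` given to the splitter.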

### LatexTextSplitter Class

Splitter for LaTeX documents that prefers to break at sectioning commands, list and quote environments, and math delimiters.

```typescript { .api }
/**
 * Text splitter optimized for LaTeX documents
 * Preserves document structure, sections, and math environments
 */
class LatexTextSplitter extends RecursiveCharacterTextSplitter implements LatexTextSplitterParams {
  constructor(fields?: Partial<LatexTextSplitterParams>);
}

type LatexTextSplitterParams = TextSplitterParams;
```

**Usage Examples:**

```typescript
import { LatexTextSplitter } from "@langchain/textsplitters";

// Basic LaTeX splitting
const latexSplitter = new LatexTextSplitter({
  chunkSize: 800,
  chunkOverlap: 100,
});

const latexContent = `\\documentclass{article}
\\usepackage{amsmath}

\\title{Research Paper Title}
\\author{Author Name}
\\date{}

\\begin{document}

\\maketitle

\\section{Introduction}

This is the introduction section with some mathematical notation: $E = mc^2$.

\\subsection{Background}

Some background information with an equation:

\\begin{equation}
f(x) = \\int_{-\\infty}^{\\infty} g(t) e^{-2\\pi i x t} dt
\\end{equation}

\\section{Methodology}

The methodology section describes our approach.

\\begin{itemize}
\\item First step of the process
\\item Second step with more details
\\item Final step and conclusions
\\end{itemize}

\\section{Results}

Results are presented in this section.

\\begin{align}
y &= mx + b \\\\
z &= ax^2 + bx + c
\\end{align}

\\section{Conclusion}

Final conclusions and future work.

\\end{document}`;

const latexChunks = await latexSplitter.splitText(latexContent);
// Chunks break at section and environment boundaries where possible; the
// splitter does not guarantee that an environment stays within one chunk
```

### Markdown-Specific Features

The Markdown splitter tries its separators in order, from the largest structural unit down to single characters. The list starts at H2, so a level-1 `# ` title is not itself a split point.

**Markdown Separator Hierarchy:**

```typescript
// Internal separator order used by MarkdownTextSplitter
const markdownSeparators = [
  "\n## ",       // H2 headings
  "\n### ",      // H3 headings
  "\n#### ",     // H4 headings
  "\n##### ",    // H5 headings
  "\n###### ",   // H6 headings
  "```\n\n",     // End of code blocks
  "\n\n***\n\n", // Horizontal rules (asterisk)
  "\n\n---\n\n", // Horizontal rules (dash)
  "\n\n___\n\n", // Horizontal rules (underscore)
  "\n\n",        // Paragraph breaks
  "\n",          // Line breaks
  " ",           // Spaces
  ""             // Characters
];
```
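The ordering matters because a recursive splitter works with the first separator in the list that actually occurs in the text, and only moves to later ones for pieces that are still too long. The sketch below illustrates just that selection step; it is a simplification (no merging, no overlap), and `pickSeparator` is a hypothetical name rather than a library function.

```typescript
// Simplified illustration of separator selection. The real splitter also
// merges small pieces and applies chunkOverlap; this only shows which
// separator would be tried first for a given text.
const markdownSeparatorOrder: string[] = [
  "\n## ", "\n### ", "\n#### ", "\n##### ", "\n###### ",
  "```\n\n", "\n\n***\n\n", "\n\n---\n\n", "\n\n___\n\n",
  "\n\n", "\n", " ", "",
];

function pickSeparator(text: string, separators: string[]): string {
  for (const separator of separators) {
    // The empty string always matches, so it acts as the last resort
    if (separator === "" || text.indexOf(separator) !== -1) {
      return separator;
    }
  }
  return "";
}

const withHeadings = "# Title\n\nIntro.\n## Section\n\nBody.";
const paragraphsOnly = "First paragraph.\n\nSecond paragraph.";
const singleLine = "Just one line of text";

// JSON.stringify makes the newline characters visible in the output
console.log(JSON.stringify(pickSeparator(withHeadings, markdownSeparatorOrder)));
console.log(JSON.stringify(pickSeparator(paragraphsOnly, markdownSeparatorOrder)));
console.log(JSON.stringify(pickSeparator(singleLine, markdownSeparatorOrder)));
```

A document with H2 headings is therefore divided at those headings first, while heading-free text falls straight through to paragraph breaks.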

**Advanced Markdown Usage:**

```typescript
import { MarkdownTextSplitter } from "@langchain/textsplitters";

// Custom configuration for documentation
const docSplitter = new MarkdownTextSplitter({
  chunkSize: 1500,
  chunkOverlap: 150,
  keepSeparator: true, // Keep headings with content
});

// Process technical documentation
const technicalDoc = `# API Reference

## Authentication

All API requests require authentication using Bearer tokens.

\`\`\`bash
curl -H "Authorization: Bearer YOUR_TOKEN" https://api.example.com/users
\`\`\`

## Endpoints

### GET /users

Retrieves a list of users.

**Parameters:**
- \`limit\` (optional): Maximum number of users to return
- \`offset\` (optional): Number of users to skip

**Response:**
\`\`\`json
{
  "users": [...],
  "total": 100,
  "limit": 20,
  "offset": 0
}
\`\`\`

### POST /users

Creates a new user.`;

const docChunks = await docSplitter.splitText(technicalDoc);

// Create structured documents: one metadata object per input text
const docSections = await docSplitter.createDocuments(
  [technicalDoc],
  [{ type: "api_docs", version: "1.0" }],
  {
    chunkHeader: "=== API Documentation Section ===\n", // Prepended to each chunk
    appendChunkOverlapHeader: true
  }
);
```

### LaTeX-Specific Features

The LaTeX splitter tries sectioning commands first, then list, quote, and verbatim environments, then math delimiters, and finally generic whitespace:

**LaTeX Separator Hierarchy:**

```typescript
// Internal separator order used by LatexTextSplitter
const latexSeparators = [
  "\n\\chapter{",           // Chapter divisions
  "\n\\section{",           // Section divisions
  "\n\\subsection{",        // Subsection divisions
  "\n\\subsubsection{",     // Subsubsection divisions
  "\n\\begin{enumerate}",   // List environments
  "\n\\begin{itemize}",     // List environments
  "\n\\begin{description}", // Description lists
  "\n\\begin{list}",        // Generic lists
  "\n\\begin{quote}",       // Quote environments
  "\n\\begin{quotation}",   // Quotation environments
  "\n\\begin{verse}",       // Verse environments
  "\n\\begin{verbatim}",    // Verbatim environments
  "\n\\begin{align}",       // Math environments
  "$$",                     // Display math
  "$",                      // Inline math
  "\n\n",                   // Paragraph breaks
  "\n",                     // Line breaks
  " ",                      // Spaces
  ""                        // Characters
];
```
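Because most of these separators are the opening of a command such as `\section{`, it matters whether the separator text survives the split. The sketch below shows one way to split before each occurrence so the command stays at the start of the piece it introduces, which is the effect `keepSeparator: true` is meant to have. It is an illustration under that assumption, not the library's code; `splitKeepingSeparator` and `escapeRegExp` are hypothetical helpers.

```typescript
// Escape characters that have special meaning in a regular expression
function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: split *before* each separator using a lookahead, so
// the separator remains at the start of the piece that follows it
function splitKeepingSeparator(text: string, separator: string): string[] {
  const pattern = new RegExp(`(?=${escapeRegExp(separator)})`);
  return text.split(pattern).filter((piece) => piece !== "");
}

const latexSample = "\\section{Intro}\nSome text.\n\\section{Method}\nMore text.";
const pieces = splitKeepingSeparator(latexSample, "\n\\section{");

// Each piece after the first begins with the newline and \section{ command
console.log(pieces.map((piece) => JSON.stringify(piece)));
```

Dropping the separator instead would strip `\section{` from the text and leave a dangling title and closing brace, which is why keeping it is the safer choice for LaTeX.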

**Advanced LaTeX Usage:**

```typescript
import { LatexTextSplitter } from "@langchain/textsplitters";

// Configuration for academic papers
const academicSplitter = new LatexTextSplitter({
  chunkSize: 2000,     // Longer chunks for academic content
  chunkOverlap: 200,   // Good overlap for context
  keepSeparator: true, // Preserve LaTeX commands
});

// Process research paper
const researchPaper = `\\section{Literature Review}

Previous work in this area includes studies by \\cite{smith2020} and \\cite{jones2021}.

\\subsection{Theoretical Framework}

The theoretical framework is based on the following principles:

\\begin{enumerate}
\\item First principle with mathematical foundation
\\item Second principle involving:
\\begin{equation}
\\mathbf{X} = \\mathbf{A}\\mathbf{B} + \\mathbf{C}
\\end{equation}
\\item Third principle with experimental validation
\\end{enumerate}

\\subsection{Experimental Design}

Our experimental approach follows established protocols.`;

const paperChunks = await academicSplitter.splitText(researchPaper);

// Process with metadata for citation tracking (one metadata object per input text)
const paperSections = await academicSplitter.createDocuments(
  [researchPaper],
  [{
    paper_id: "smith2023_ml_approach",
    authors: ["Smith, J.", "Doe, A."],
    journal: "AI Research Quarterly"
  }]
);
```
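Since splitting is driven by separators and a size limit rather than by parsing LaTeX, a chunk can end partway through an environment. A quick post-check, sketched below, counts `\begin{...}` and `\end{...}` pairs per chunk and reports any environment that is left open or closed without being opened. `findUnbalancedEnvironments` is a hypothetical helper, and the regular expression is a rough heuristic that ignores comments and verbatim content.

```typescript
// Hypothetical helper: lists environments whose \begin and \end counts do
// not match within a single chunk
function findUnbalancedEnvironments(chunk: string): string[] {
  const counts: Record<string, number> = {};
  const pattern = /\\(begin|end)\{([^}]+)\}/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(chunk)) !== null) {
    const [, kind = "", name = ""] = match;
    // +1 for every \begin{name}, -1 for every \end{name}
    counts[name] = (counts[name] ?? 0) + (kind === "begin" ? 1 : -1);
  }
  return Object.keys(counts).filter((name) => counts[name] !== 0);
}

// A chunk cut off before \end{enumerate}
const truncatedChunk =
  "\\begin{enumerate}\n\\item A\n\\begin{equation}\nx = 1\n\\end{equation}";
// A chunk whose environments are all closed
const completeChunk = "\\begin{itemize}\n\\item B\n\\end{itemize}";

console.log(findUnbalancedEnvironments(truncatedChunk));
console.log(findUnbalancedEnvironments(completeChunk));
```

If the check flags many chunks, raising `chunkSize` is usually the first thing to try.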

### Document Processing Integration

Both splitters accept and return LangChain `Document` objects through `transformDocuments`, so they fit into an existing document pipeline:

```typescript
import { MarkdownTextSplitter, LatexTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Placeholder inputs; load these from your own files
const readmeContent = "# Project\n\n## Install\n\nRun the installer.";
const paperContent = "\\section{Introduction}\n\nOpening paragraph.";

// Process mixed document types
const markdownDocs = [
  new Document({
    pageContent: readmeContent,
    metadata: { type: "readme", language: "markdown" }
  })
];

const latexDocs = [
  new Document({
    pageContent: paperContent,
    metadata: { type: "paper", language: "latex" }
  })
];

// Split with appropriate splitters
const markdownSplitter = new MarkdownTextSplitter({ chunkSize: 1000 });
const latexSplitter = new LatexTextSplitter({ chunkSize: 1500 });

const [splitMarkdown, splitLatex] = await Promise.all([
  markdownSplitter.transformDocuments(markdownDocs),
  latexSplitter.transformDocuments(latexDocs)
]);

// Combine results; each chunk carries its source document's metadata,
// so the document type survives the merge
const allSplitDocs = [...splitMarkdown, ...splitLatex];
```
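Once chunks from different formats are combined into one array, the metadata is the only way to tell them apart again. The sketch below groups chunks by a metadata key. It declares a minimal local `SplitDoc` shape instead of importing LangChain's `Document` so that it runs without dependencies; `SplitDoc` and `groupByMetadata` are hypothetical names.

```typescript
// Minimal stand-in for the Document shape, declared locally for the sketch
interface SplitDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical helper: buckets chunks by the value of one metadata key
function groupByMetadata(docs: SplitDoc[], key: string): Record<string, SplitDoc[]> {
  const groups: Record<string, SplitDoc[]> = {};
  for (const doc of docs) {
    // Chunks without the key are collected under "unknown"
    const value = String(doc.metadata[key] ?? "unknown");
    const bucket = groups[value] ?? [];
    bucket.push(doc);
    groups[value] = bucket;
  }
  return groups;
}

// Hand-written chunks standing in for a combined split result
const combined: SplitDoc[] = [
  { pageContent: "## Install", metadata: { type: "readme", language: "markdown" } },
  { pageContent: "## Usage", metadata: { type: "readme", language: "markdown" } },
  { pageContent: "\\section{Intro}", metadata: { type: "paper", language: "latex" } },
];

const byType = groupByMetadata(combined, "type");
for (const type of Object.keys(byType)) {
  console.log(`${type}: ${(byType[type] ?? []).length} chunk(s)`);
}
```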

### Format Detection and Processing

A simple workflow that picks a splitter for each document based on its content:

```typescript
import {
  MarkdownTextSplitter,
  LatexTextSplitter,
  RecursiveCharacterTextSplitter,
  type TextSplitterParams,
} from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

function createFormatSplitter(content: string, options: Partial<TextSplitterParams> = {}) {
  // Naive format detection: substring checks can misfire, e.g. on "# " inside prose
  if (content.includes('\\documentclass') || content.includes('\\begin{document}')) {
    return new LatexTextSplitter(options);
  } else if (content.includes('# ') || content.includes('## ') || content.includes('```')) {
    return new MarkdownTextSplitter(options);
  } else {
    // Fall back to recursive character splitter
    return new RecursiveCharacterTextSplitter(options);
  }
}

// Process documents with automatic format detection
async function processDocuments(
  documents: Array<{ content: string; metadata: Record<string, unknown> }>
) {
  const results: Document[] = [];

  for (const doc of documents) {
    const splitter = createFormatSplitter(doc.content, {
      chunkSize: 1000,
      chunkOverlap: 100
    });

    // Record which splitter class handled the document
    const chunks = await splitter.createDocuments(
      [doc.content],
      [{ ...doc.metadata, detected_format: splitter.constructor.name }]
    );

    results.push(...chunks);
  }

  return results;
}
```
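Substring checks such as `content.includes('# ')` also match text like "C# is" or a shell comment, so plain text can be misread as Markdown. A slightly stricter variant, sketched below, anchors the Markdown checks to the start of a line. `detectFormat` and `DetectedFormat` are hypothetical names, and the patterns are heuristics that may still need tuning for a particular corpus.

```typescript
type DetectedFormat = "latex" | "markdown" | "plain";

// Hypothetical helper: heuristic detection with line-anchored patterns
function detectFormat(content: string): DetectedFormat {
  // LaTeX: a document class or document environment is a strong signal
  if (/\\documentclass\b|\\begin\{document\}/.test(content)) {
    return "latex";
  }
  // Markdown: an ATX heading or a code fence at the start of a line
  if (/^#{1,6} \S/m.test(content) || /^```/m.test(content)) {
    return "markdown";
  }
  return "plain";
}

console.log(detectFormat("\\documentclass{article}\n\\begin{document}\nHi\n\\end{document}"));
console.log(detectFormat("# Title\n\nSome text."));
console.log(detectFormat("C# is a language; see issue # 12."));
```

The result can then drive the choice of splitter in place of the `includes` checks.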