# Document Processing and Node Parsing

Text processing and document chunking functionality for preparing data for indexing and retrieval in LlamaIndex.TS.

## Import

```typescript
import { Document, SentenceSplitter, TextNode } from "llamaindex";
// Or from specific submodules
import { SentenceSplitter, MarkdownNodeParser } from "llamaindex/node-parser";
```
## Overview

Document processing in LlamaIndex.TS involves transforming raw text into structured nodes that can be indexed and retrieved. The system provides various node parsers for different text types and chunking strategies.

## Document Class

The Document class represents a source document with text content and metadata.

```typescript { .api }
class Document {
  constructor(init: {
    text: string;
    id_?: string;
    metadata?: Record<string, any>;
    mimetype?: string;
    relationships?: Record<string, any>;
  });

  text: string;
  id_: string;
  metadata: Record<string, any>;
  mimetype?: string;
  relationships?: Record<string, any>;

  getText(): string;
  setContent(value: string): void;
  asRelatedNodeInfo(): RelatedNodeInfo;
}
```
## Node Classes

### BaseNode

Base class for all node types in the system.

```typescript { .api }
class BaseNode {
  id_: string;
  text: string;
  metadata: Record<string, any>;
  relationships: Record<string, any>;

  getText(): string;
  setContent(value: string): void;
  asRelatedNodeInfo(): RelatedNodeInfo;
}
```

### TextNode

Represents a chunk of text extracted from a document.

```typescript { .api }
class TextNode extends BaseNode {
  constructor(init: {
    text: string;
    id_?: string;
    metadata?: Record<string, any>;
    relationships?: Record<string, any>;
  });

  startCharIdx?: number;
  endCharIdx?: number;
  textTemplate: string;
  metadataTemplate: string;
  metadataSeparator: string;

  getContent(metadataMode?: MetadataMode): string;
  getMetadataStr(mode?: MetadataMode): string;
  setContent(value: string): void;
}
```
## Node Parsers

### NodeParser Interface

Base interface for all node parsers.

```typescript { .api }
interface NodeParser {
  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
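A custom parser only needs to implement this interface. The sketch below is illustrative rather than production code: it defines minimal stand-in types instead of the real `Document` and `TextNode` classes, and treats each blank-line-separated paragraph as one node.

```typescript
// Minimal stand-in types for illustration only; the real classes live in llamaindex.
interface Doc { text: string; metadata: Record<string, any> }
interface Node { text: string; metadata: Record<string, any> }

// A toy parser that turns each paragraph into one node.
class ParagraphParser {
  splitText(text: string): string[] {
    return text
      .split(/\n{2,}/)              // split on one or more blank lines
      .map((p) => p.trim())
      .filter((p) => p.length > 0); // drop empty fragments
  }

  getNodesFromDocuments(documents: Doc[]): Node[] {
    return documents.flatMap((doc) =>
      // Copy the source document's metadata onto every node.
      this.splitText(doc.text).map((text) => ({ text, metadata: { ...doc.metadata } }))
    );
  }
}

const parser = new ParagraphParser();
const paragraphNodes = parser.getNodesFromDocuments([
  { text: "First paragraph.\n\nSecond paragraph.", metadata: { source: "demo" } },
]);
console.log(paragraphNodes.length); // 2
```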
### SentenceSplitter

The most commonly used node parser that splits text into sentences and chunks.

```typescript { .api }
class SentenceSplitter implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    tokenizer?: (text: string) => string[];
    paragraphSeparator?: string;
    chunkingTokenizerFn?: (text: string) => string[];
    secondaryChunkingRegex?: string;
    separator?: string;
  });

  chunkSize: number;
  chunkOverlap: number;
  separator: string;
  paragraphSeparator: string;

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
  splitTextMetadataAware(text: string, metadata: Record<string, any>): string[];
}
```
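The interaction of `chunkSize` and `chunkOverlap` can be pictured as a sliding window over tokens: each new chunk starts `chunkSize - chunkOverlap` tokens after the previous one. The standalone sketch below uses naive whitespace tokens to show the arithmetic; the real SentenceSplitter additionally respects sentence and paragraph boundaries.

```typescript
// Illustrative only: a sliding window over whitespace tokens.
function chunkTokens(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = chunkTokens("one two three four five six seven eight", 4, 1);
console.log(chunks);
// [ 'one two three four', 'four five six seven', 'seven eight' ]
```

Note how the last token of each chunk reappears at the start of the next one: that is the overlap preserving context across chunk boundaries.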
### TokenTextSplitter

Splits text based on token count rather than characters.

```typescript { .api }
class TokenTextSplitter implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    separator?: string;
    tokenizer?: (text: string) => string[];
    chunkingTokenizerFn?: (text: string) => string[];
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
### MarkdownNodeParser

Specialized parser for Markdown documents that preserves structure.

```typescript { .api }
class MarkdownNodeParser implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
### HTMLNodeParser

Parser for HTML documents that handles tags and structure.

```typescript { .api }
class HTMLNodeParser implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    tags?: string[];
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
### CodeSplitter

Specialized parser for source code that respects language syntax.

```typescript { .api }
class CodeSplitter implements NodeParser {
  constructor(options?: {
    language: string;
    chunkLines?: number;
    chunkLinesOverlap?: number;
    maxChars?: number;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
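The `chunkLines`/`chunkLinesOverlap` idea can be sketched as fixed-size line windows with overlap. This is an illustration of the chunking arithmetic only, not the library's implementation, which parses the source and avoids cutting through syntactic units.

```typescript
// Illustrative only: fixed-size line windows with overlap.
function chunkByLines(source: string, linesPerChunk: number, overlap: number): string[] {
  const lines = source.split("\n");
  const step = linesPerChunk - overlap; // lines advanced per chunk
  const chunks: string[] = [];
  for (let start = 0; start < lines.length; start += step) {
    chunks.push(lines.slice(start, start + linesPerChunk).join("\n"));
    if (start + linesPerChunk >= lines.length) break; // final window reached the end
  }
  return chunks;
}

const source = ["function a() {}", "function b() {}", "function c() {}", "function d() {}"].join("\n");
console.log(chunkByLines(source, 3, 1).length); // 2
```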
### SentenceWindowNodeParser

Creates overlapping windows of sentences for better context preservation.

```typescript { .api }
class SentenceWindowNodeParser implements NodeParser {
  constructor(options?: {
    windowSize?: number;
    windowMetadataKey?: string;
    originalTextMetadataKey?: string;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
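The sentence-window technique can be sketched in a few lines: each sentence becomes its own node, and the surrounding window of sentences is stored in the node's metadata so a retriever can swap in the wider context after matching on the single sentence. Illustrative only, with naive sentence splitting and plain objects in place of TextNode.

```typescript
// Illustrative only: naive sentence-window chunking.
function sentenceWindows(text: string, windowSize: number) {
  // Naive sentence split on terminal punctuation followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  return sentences.map((sentence, i) => ({
    text: sentence,
    metadata: {
      // The wider window a retriever can substitute for the single sentence.
      window: sentences.slice(Math.max(0, i - windowSize), i + windowSize + 1).join(" "),
      originalText: sentence,
    },
  }));
}

const windowNodes = sentenceWindows("A one. B two. C three. D four.", 1);
console.log(windowNodes[1].metadata.window); // "A one. B two. C three."
```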
## Basic Usage

### Creating Documents

```typescript
import { Document } from "llamaindex";

// Simple document
const doc = new Document({
  text: "This is the content of my document.",
  id_: "doc-1",
});

// Document with metadata
const docWithMetadata = new Document({
  text: "Financial report for Q3 2024...",
  id_: "financial-report-q3-2024",
  metadata: {
    author: "John Doe",
    department: "Finance",
    date: "2024-09-30",
    classification: "internal",
  },
});
```
### Basic Text Splitting

```typescript
import { SentenceSplitter, Document } from "llamaindex";

const documents = [
  new Document({ text: "Long document content here..." }),
];

// Create a splitter with explicit chunk settings
const splitter = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 20,
});

// Split documents into nodes
const nodes = splitter.getNodesFromDocuments(documents);

console.log(`Created ${nodes.length} text nodes`);
nodes.forEach((node, i) => {
  console.log(`Node ${i}: ${node.text.substring(0, 100)}...`);
});
```
### Advanced Splitting Configuration

```typescript
import { Document } from "llamaindex";
import { SentenceSplitter } from "llamaindex/node-parser";

const documents = [new Document({ text: "Long document content here..." })];

// Custom tokenizer and separators
const advancedSplitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
  separator: " ",
  paragraphSeparator: "\n\n",
  chunkingTokenizerFn: (text: string) => text.split(/\s+/), // Custom tokenizer
});

const nodes = advancedSplitter.getNodesFromDocuments(documents);
```
### Markdown Processing

```typescript
import { Document } from "llamaindex";
import { MarkdownNodeParser } from "llamaindex/node-parser";

const markdownDoc = new Document({
  text: `# Chapter 1\n\nThis is the introduction.\n\n## Section 1.1\n\nContent here...`,
});

const markdownParser = new MarkdownNodeParser({
  chunkSize: 1024,
});

const markdownNodes = markdownParser.getNodesFromDocuments([markdownDoc]);
```
### HTML Processing

```typescript
import { Document } from "llamaindex";
import { HTMLNodeParser } from "llamaindex/node-parser";

const htmlDoc = new Document({
  text: `<html><body><h1>Title</h1><p>Paragraph content...</p></body></html>`,
});

const htmlParser = new HTMLNodeParser({
  chunkSize: 512,
  tags: ["p", "h1", "h2", "div"], // Focus on specific tags
});

const htmlNodes = htmlParser.getNodesFromDocuments([htmlDoc]);
```
### Code Processing

```typescript
import { Document } from "llamaindex";
import { CodeSplitter } from "llamaindex/node-parser";

const codeDoc = new Document({
  text: `function example() {\n  return "Hello World";\n}\n\nclass MyClass {\n  constructor() {}\n}`,
});

const codeSplitter = new CodeSplitter({
  language: "javascript",
  chunkLines: 10,
  chunkLinesOverlap: 2,
  maxChars: 1000,
});

const codeNodes = codeSplitter.getNodesFromDocuments([codeDoc]);
```
## Configuration with Settings

### Global Node Parser

```typescript
import { Settings, SentenceSplitter } from "llamaindex";

// Set global node parser
Settings.nodeParser = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 20,
});

// All indexing operations will use this parser by default
```

### Temporary Node Parser Override

```typescript
import { Settings, TokenTextSplitter, VectorStoreIndex } from "llamaindex";

const documents = [/* your documents */];

// Use different parser for specific operation
const index = Settings.withNodeParser(
  new TokenTextSplitter({ chunkSize: 512 }),
  () => {
    return VectorStoreIndex.fromDocuments(documents);
  }
);
```
## Working with Node Metadata

### Accessing Node Information

```typescript
const nodes = splitter.getNodesFromDocuments(documents);

nodes.forEach(node => {
  console.log("Node ID:", node.id_);
  console.log("Text:", node.text);
  console.log("Metadata:", node.metadata);

  // Check relationships to source document
  if (node.relationships.SOURCE_NODE) {
    console.log("Source document ID:", node.relationships.SOURCE_NODE.nodeId);
  }

  // Check text positions if available
  if (node.startCharIdx !== undefined) {
    console.log(`Text span: ${node.startCharIdx}-${node.endCharIdx}`);
  }
});
```

### Preserving Document Metadata

Document metadata is automatically propagated to generated nodes:

```typescript
const docWithMeta = new Document({
  text: "Content here...",
  metadata: {
    source: "research-paper.pdf",
    page: 1,
    section: "introduction",
  },
});

const nodes = splitter.getNodesFromDocuments([docWithMeta]);

// Each node will contain the document metadata
nodes.forEach(node => {
  console.log(node.metadata); // { source: "research-paper.pdf", page: 1, section: "introduction" }
});
```
## Best Practices

### Choosing Chunk Size

```typescript
// For general text (articles, books)
const generalSplitter = new SentenceSplitter({
  chunkSize: 1024, // Good balance of context and specificity
  chunkOverlap: 20,
});

// For code
const codeSplitter = new CodeSplitter({
  language: "typescript",
  chunkLines: 15, // Functions or logical blocks
  chunkLinesOverlap: 3,
});

// For short-form content (tweets, messages)
const shortFormSplitter = new SentenceSplitter({
  chunkSize: 256, // Smaller chunks for focused retrieval
  chunkOverlap: 10,
});
```
### Handling Different Content Types

```typescript
const processDocumentByType = (doc: Document) => {
  const { mimetype } = doc;

  if (mimetype?.includes('html')) {
    return new HTMLNodeParser().getNodesFromDocuments([doc]);
  } else if (mimetype?.includes('markdown')) {
    return new MarkdownNodeParser().getNodesFromDocuments([doc]);
  } else if (doc.metadata.fileExtension === '.py') {
    return new CodeSplitter({ language: 'python' }).getNodesFromDocuments([doc]);
  } else {
    return new SentenceSplitter().getNodesFromDocuments([doc]);
  }
};
```