# Document Processing and Node Parsing

Text processing and document chunking functionality for preparing data for indexing and retrieval in LlamaIndex.TS.

## Import

```typescript
import { Document, SentenceSplitter, TextNode } from "llamaindex";
// Or from specific submodules
import { SentenceSplitter, MarkdownNodeParser } from "llamaindex/node-parser";
```
## Overview

Document processing in LlamaIndex.TS involves transforming raw text into structured nodes that can be indexed and retrieved. The system provides various node parsers for different text types and chunking strategies.

## Document Class

The Document class represents a source document with text content and metadata.

```typescript { .api }
class Document {
  constructor(init: {
    text: string;
    id_?: string;
    metadata?: Record<string, any>;
    mimetype?: string;
    relationships?: Record<string, any>;
  });

  text: string;
  id_: string;
  metadata: Record<string, any>;
  mimetype?: string;
  relationships?: Record<string, any>;

  getText(): string;
  setContent(value: string): void;
  asRelatedNodeInfo(): RelatedNodeInfo;
}
```
## Node Classes

### BaseNode

Base class for all node types in the system.

```typescript { .api }
class BaseNode {
  id_: string;
  text: string;
  metadata: Record<string, any>;
  relationships: Record<string, any>;

  getText(): string;
  setContent(value: string): void;
  asRelatedNodeInfo(): RelatedNodeInfo;
}
```

### TextNode

Represents a chunk of text extracted from a document.

```typescript { .api }
class TextNode extends BaseNode {
  constructor(init: {
    text: string;
    id_?: string;
    metadata?: Record<string, any>;
    relationships?: Record<string, any>;
  });

  startCharIdx?: number;
  endCharIdx?: number;
  textTemplate: string;
  metadataTemplate: string;
  metadataSeparator: string;

  getContent(metadataMode?: MetadataMode): string;
  getMetadataStr(mode?: MetadataMode): string;
  setContent(value: string): void;
}
```
## Node Parsers

### NodeParser Interface

Base interface for all node parsers.

```typescript { .api }
interface NodeParser {
  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
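A custom parser only needs to implement this interface. The sketch below is illustrative rather than production code: it defines minimal stand-in types instead of the real `Document` and `TextNode` classes, and treats each blank-line-separated paragraph as one node.

```typescript
// Minimal stand-in types for illustration only; the real classes live in llamaindex.
interface Doc { text: string; metadata: Record<string, any> }
interface Node { text: string; metadata: Record<string, any> }

// A toy parser that turns each paragraph into one node.
class ParagraphParser {
  splitText(text: string): string[] {
    return text
      .split(/\n{2,}/)              // split on one or more blank lines
      .map((p) => p.trim())
      .filter((p) => p.length > 0); // drop empty fragments
  }

  getNodesFromDocuments(documents: Doc[]): Node[] {
    return documents.flatMap((doc) =>
      // Copy the source document's metadata onto every node.
      this.splitText(doc.text).map((text) => ({ text, metadata: { ...doc.metadata } }))
    );
  }
}

const parser = new ParagraphParser();
const paragraphNodes = parser.getNodesFromDocuments([
  { text: "First paragraph.\n\nSecond paragraph.", metadata: { source: "demo" } },
]);
console.log(paragraphNodes.length); // 2
```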
### SentenceSplitter

The most commonly used node parser that splits text into sentences and chunks.

```typescript { .api }
class SentenceSplitter implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    tokenizer?: (text: string) => string[];
    paragraphSeparator?: string;
    chunkingTokenizerFn?: (text: string) => string[];
    secondaryChunkingRegex?: string;
    separator?: string;
  });

  chunkSize: number;
  chunkOverlap: number;
  separator: string;
  paragraphSeparator: string;

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
  splitTextMetadataAware(text: string, metadata: Record<string, any>): string[];
}
```
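The interaction of `chunkSize` and `chunkOverlap` can be pictured as a sliding window over tokens: each new chunk starts `chunkSize - chunkOverlap` tokens after the previous one. The standalone sketch below uses naive whitespace tokens to show the arithmetic; the real SentenceSplitter additionally respects sentence and paragraph boundaries.

```typescript
// Illustrative only: a sliding window over whitespace tokens.
function chunkTokens(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = chunkTokens("one two three four five six seven eight", 4, 1);
console.log(chunks);
// [ 'one two three four', 'four five six seven', 'seven eight' ]
```

Note how the last token of each chunk reappears at the start of the next one: that is the overlap preserving context across chunk boundaries.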
### TokenTextSplitter

Splits text based on token count rather than characters.

```typescript { .api }
class TokenTextSplitter implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    separator?: string;
    tokenizer?: (text: string) => string[];
    chunkingTokenizerFn?: (text: string) => string[];
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
### MarkdownNodeParser

Specialized parser for Markdown documents that preserves structure.

```typescript { .api }
class MarkdownNodeParser implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
### HTMLNodeParser

Parser for HTML documents that handles tags and structure.

```typescript { .api }
class HTMLNodeParser implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    tags?: string[];
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
### CodeSplitter

Specialized parser for source code that respects language syntax.

```typescript { .api }
class CodeSplitter implements NodeParser {
  constructor(options?: {
    language: string;
    chunkLines?: number;
    chunkLinesOverlap?: number;
    maxChars?: number;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
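The `chunkLines`/`chunkLinesOverlap` idea can be sketched as fixed-size line windows with overlap. This is an illustration of the chunking arithmetic only, not the library's implementation, which parses the source and avoids cutting through syntactic units.

```typescript
// Illustrative only: fixed-size line windows with overlap.
function chunkByLines(source: string, linesPerChunk: number, overlap: number): string[] {
  const lines = source.split("\n");
  const step = linesPerChunk - overlap; // lines advanced per chunk
  const chunks: string[] = [];
  for (let start = 0; start < lines.length; start += step) {
    chunks.push(lines.slice(start, start + linesPerChunk).join("\n"));
    if (start + linesPerChunk >= lines.length) break; // final window reached the end
  }
  return chunks;
}

const source = ["function a() {}", "function b() {}", "function c() {}", "function d() {}"].join("\n");
console.log(chunkByLines(source, 3, 1).length); // 2
```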
### SentenceWindowNodeParser

Creates overlapping windows of sentences for better context preservation.

```typescript { .api }
class SentenceWindowNodeParser implements NodeParser {
  constructor(options?: {
    windowSize?: number;
    windowMetadataKey?: string;
    originalTextMetadataKey?: string;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
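The sentence-window technique can be sketched in a few lines: each sentence becomes its own node, and the surrounding window of sentences is stored in the node's metadata so a retriever can swap in the wider context after matching on the single sentence. Illustrative only, with naive sentence splitting and plain objects in place of TextNode.

```typescript
// Illustrative only: naive sentence-window chunking.
function sentenceWindows(text: string, windowSize: number) {
  // Naive sentence split on terminal punctuation followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  return sentences.map((sentence, i) => ({
    text: sentence,
    metadata: {
      // The wider window a retriever can substitute for the single sentence.
      window: sentences.slice(Math.max(0, i - windowSize), i + windowSize + 1).join(" "),
      originalText: sentence,
    },
  }));
}

const windowNodes = sentenceWindows("A one. B two. C three. D four.", 1);
console.log(windowNodes[1].metadata.window); // "A one. B two. C three."
```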
## Basic Usage

### Creating Documents

```typescript
import { Document } from "llamaindex";

// Simple document
const doc = new Document({
  text: "This is the content of my document.",
  id_: "doc-1",
});

// Document with metadata
const docWithMetadata = new Document({
  text: "Financial report for Q3 2024...",
  id_: "financial-report-q3-2024",
  metadata: {
    author: "John Doe",
    department: "Finance",
    date: "2024-09-30",
    classification: "internal",
  },
});
```
### Basic Text Splitting

```typescript
import { SentenceSplitter, Document } from "llamaindex";

const documents = [
  new Document({ text: "Long document content here..." }),
];

// Create a splitter with explicit chunk settings
const splitter = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 20,
});

// Split documents into nodes
const nodes = splitter.getNodesFromDocuments(documents);

console.log(`Created ${nodes.length} text nodes`);
nodes.forEach((node, i) => {
  console.log(`Node ${i}: ${node.text.substring(0, 100)}...`);
});
```
### Advanced Splitting Configuration

```typescript
import { Document } from "llamaindex";
import { SentenceSplitter } from "llamaindex/node-parser";

const documents = [new Document({ text: "Long document content here..." })];

// Custom tokenizer and separators
const advancedSplitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
  separator: " ",
  paragraphSeparator: "\n\n",
  chunkingTokenizerFn: (text: string) => text.split(/\s+/), // Custom tokenizer
});

const nodes = advancedSplitter.getNodesFromDocuments(documents);
```
### Markdown Processing

```typescript
import { Document } from "llamaindex";
import { MarkdownNodeParser } from "llamaindex/node-parser";

const markdownDoc = new Document({
  text: `# Chapter 1\n\nThis is the introduction.\n\n## Section 1.1\n\nContent here...`,
});

const markdownParser = new MarkdownNodeParser({
  chunkSize: 1024,
});

const markdownNodes = markdownParser.getNodesFromDocuments([markdownDoc]);
```
### HTML Processing

```typescript
import { Document } from "llamaindex";
import { HTMLNodeParser } from "llamaindex/node-parser";

const htmlDoc = new Document({
  text: `<html><body><h1>Title</h1><p>Paragraph content...</p></body></html>`,
});

const htmlParser = new HTMLNodeParser({
  chunkSize: 512,
  tags: ["p", "h1", "h2", "div"], // Focus on specific tags
});

const htmlNodes = htmlParser.getNodesFromDocuments([htmlDoc]);
```
### Code Processing

```typescript
import { Document } from "llamaindex";
import { CodeSplitter } from "llamaindex/node-parser";

const codeDoc = new Document({
  text: `function example() {\n  return "Hello World";\n}\n\nclass MyClass {\n  constructor() {}\n}`,
});

const codeSplitter = new CodeSplitter({
  language: "javascript",
  chunkLines: 10,
  chunkLinesOverlap: 2,
  maxChars: 1000,
});

const codeNodes = codeSplitter.getNodesFromDocuments([codeDoc]);
```
## Configuration with Settings

### Global Node Parser

```typescript
import { Settings, SentenceSplitter } from "llamaindex";

// Set global node parser
Settings.nodeParser = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 20,
});

// All indexing operations will use this parser by default
```

### Temporary Node Parser Override

```typescript
import { Settings, TokenTextSplitter, VectorStoreIndex } from "llamaindex";

const documents = [/* your documents */];

// Use different parser for specific operation
const index = Settings.withNodeParser(
  new TokenTextSplitter({ chunkSize: 512 }),
  () => {
    return VectorStoreIndex.fromDocuments(documents);
  }
);
```
## Working with Node Metadata

### Accessing Node Information

```typescript
const nodes = splitter.getNodesFromDocuments(documents);

nodes.forEach(node => {
  console.log("Node ID:", node.id_);
  console.log("Text:", node.text);
  console.log("Metadata:", node.metadata);

  // Check relationships to source document
  if (node.relationships.SOURCE_NODE) {
    console.log("Source document ID:", node.relationships.SOURCE_NODE.nodeId);
  }

  // Check text positions if available
  if (node.startCharIdx !== undefined) {
    console.log(`Text span: ${node.startCharIdx}-${node.endCharIdx}`);
  }
});
```

### Preserving Document Metadata

Document metadata is automatically propagated to generated nodes:

```typescript
const docWithMeta = new Document({
  text: "Content here...",
  metadata: {
    source: "research-paper.pdf",
    page: 1,
    section: "introduction",
  },
});

const nodes = splitter.getNodesFromDocuments([docWithMeta]);

// Each node will contain the document metadata
nodes.forEach(node => {
  console.log(node.metadata); // { source: "research-paper.pdf", page: 1, section: "introduction" }
});
```
## Best Practices

### Choosing Chunk Size

```typescript
// For general text (articles, books)
const generalSplitter = new SentenceSplitter({
  chunkSize: 1024, // Good balance of context and specificity
  chunkOverlap: 20,
});

// For code
const codeSplitter = new CodeSplitter({
  language: "typescript",
  chunkLines: 15, // Functions or logical blocks
  chunkLinesOverlap: 3,
});

// For short-form content (tweets, messages)
const shortFormSplitter = new SentenceSplitter({
  chunkSize: 256, // Smaller chunks for focused retrieval
  chunkOverlap: 10,
});
```
### Handling Different Content Types

```typescript
const processDocumentByType = (doc: Document) => {
  const { mimetype } = doc;

  if (mimetype?.includes('html')) {
    return new HTMLNodeParser().getNodesFromDocuments([doc]);
  } else if (mimetype?.includes('markdown')) {
    return new MarkdownNodeParser().getNodesFromDocuments([doc]);
  } else if (doc.metadata.fileExtension === '.py') {
    return new CodeSplitter({ language: 'python' }).getNodesFromDocuments([doc]);
  } else {
    return new SentenceSplitter().getNodesFromDocuments([doc]);
  }
};
```