# Document Processing and Node Parsing

Text processing and document chunking functionality for preparing data for indexing and retrieval in LlamaIndex.TS.

## Import

```typescript
import { Document, SentenceSplitter, TextNode } from "llamaindex";

// Or from specific submodules
import { SentenceSplitter, MarkdownNodeParser } from "llamaindex/node-parser";
```

## Overview

Document processing in LlamaIndex.TS involves transforming raw text into structured nodes that can be indexed and retrieved. The system provides various node parsers for different text types and chunking strategies.
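To build intuition for what a chunking strategy does, here is a minimal stand-alone sketch of fixed-size chunking with overlap. It is a hypothetical illustration, not the library's actual algorithm: real parsers tokenize and respect sentence boundaries rather than splitting on whitespace.

```typescript
// Hypothetical sketch: slide a window of `chunkSize` words over the text,
// stepping forward by (chunkSize - chunkOverlap) so adjacent chunks share words.
function chunkWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, chunkSize - chunkOverlap);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window reached the end
  }
  return chunks;
}

// chunkWithOverlap("a b c d e f", 4, 1) → ["a b c d", "d e f"]
// ("d" appears in both chunks — that is the overlap preserving context)
```

The overlap is what lets a fact that straddles a chunk boundary still appear whole in at least one chunk.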

15

16

## Document Class

17

18

The Document class represents a source document with text content and metadata.

19

20

```typescript { .api }

21

class Document {

22

constructor(init: {

23

text: string;

24

id_?: string;

25

metadata?: Record<string, any>;

26

mimetype?: string;

27

relationships?: Record<string, any>;

28

});

29

30

text: string;

31

id_: string;

32

metadata: Record<string, any>;

33

mimetype?: string;

34

relationships?: Record<string, any>;

35

36

getText(): string;

37

setContent(value: string): void;

38

asRelatedNodeInfo(): RelatedNodeInfo;

39

}

40

```

41

42

## Node Classes

43

44

### BaseNode

45

46

Base class for all node types in the system.

47

48

```typescript { .api }

49

class BaseNode {

50

id_: string;

51

text: string;

52

metadata: Record<string, any>;

53

relationships: Record<string, any>;

54

55

getText(): string;

56

setContent(value: string): void;

57

asRelatedNodeInfo(): RelatedNodeInfo;

58

}

59

```

60

61

### TextNode

62

63

Represents a chunk of text extracted from a document.

64

65

```typescript { .api }

66

class TextNode extends BaseNode {

67

constructor(init: {

68

text: string;

69

id_?: string;

70

metadata?: Record<string, any>;

71

relationships?: Record<string, any>;

72

});

73

74

startCharIdx?: number;

75

endCharIdx?: number;

76

textTemplate: string;

77

metadataTemplate: string;

78

metadataSeparator: string;

79

80

getContent(metadataMode?: MetadataMode): string;

81

getMetadataStr(mode?: MetadataMode): string;

82

setContent(value: string): void;

83

}

84

```
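The template fields control how metadata is rendered alongside node text when content is requested with metadata included. The following is a hypothetical stand-alone sketch of that idea only; the real `textTemplate`/`metadataTemplate` defaults and substitution rules in LlamaIndex.TS differ.

```typescript
// Hypothetical sketch of combining a metadata block with node text,
// roughly analogous to what getContent does when metadata is included.
function renderNode(
  text: string,
  metadata: Record<string, string>,
  metadataSeparator = "\n"
): string {
  const metadataStr = Object.entries(metadata)
    .map(([key, value]) => `${key}: ${value}`)
    .join(metadataSeparator);
  return metadataStr ? `${metadataStr}\n\n${text}` : text;
}

// renderNode("body text", { source: "a.pdf", page: "1" })
// → "source: a.pdf\npage: 1\n\nbody text"
```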

## Node Parsers

### NodeParser Interface

Base interface for all node parsers.

```typescript { .api }
interface NodeParser {
  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
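Any object providing these two methods can act as a parser. A minimal hypothetical implementation of the shape, using local stand-in types rather than the real `Document` and `TextNode` classes, might look like this:

```typescript
// Stand-in types for illustration only (not the llamaindex classes).
interface Doc { text: string; metadata: Record<string, any> }
interface Node { text: string; metadata: Record<string, any> }

// A toy parser that treats each blank-line-separated paragraph as one node
// and copies document metadata onto every node it produces.
class ParagraphParser {
  splitText(text: string): string[] {
    return text.split(/\n{2,}/).map(s => s.trim()).filter(Boolean);
  }

  getNodesFromDocuments(documents: Doc[]): Node[] {
    return documents.flatMap(doc =>
      this.splitText(doc.text).map(text => ({
        text,
        metadata: { ...doc.metadata },
      }))
    );
  }
}
```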

### SentenceSplitter

The most commonly used node parser; it splits text on sentence boundaries and groups sentences into chunks.

```typescript { .api }
class SentenceSplitter implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    tokenizer?: (text: string) => string[];
    paragraphSeparator?: string;
    chunkingTokenizerFn?: (text: string) => string[];
    secondaryChunkingRegex?: string;
    separator?: string;
  });

  chunkSize: number;
  chunkOverlap: number;
  separator: string;
  paragraphSeparator: string;

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
  splitTextMetadataAware(text: string, metadata: Record<string, any>): string[];
}
```

### TokenTextSplitter

Splits text based on token count rather than character count.

```typescript { .api }
class TokenTextSplitter implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    separator?: string;
    tokenizer?: (text: string) => string[];
    chunkingTokenizerFn?: (text: string) => string[];
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```

### MarkdownNodeParser

Specialized parser for Markdown documents that preserves document structure.

```typescript { .api }
class MarkdownNodeParser implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```

### HTMLNodeParser

Parser for HTML documents that handles tags and structure.

```typescript { .api }
class HTMLNodeParser implements NodeParser {
  constructor(options?: {
    chunkSize?: number;
    chunkOverlap?: number;
    tags?: string[];
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```

### CodeSplitter

Specialized parser for source code that respects language syntax.

```typescript { .api }
class CodeSplitter implements NodeParser {
  constructor(options?: {
    language: string;
    chunkLines?: number;
    chunkLinesOverlap?: number;
    maxChars?: number;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```

### SentenceWindowNodeParser

Creates overlapping windows of sentences for better context preservation.

```typescript { .api }
class SentenceWindowNodeParser implements NodeParser {
  constructor(options?: {
    windowSize?: number;
    windowMetadataKey?: string;
    originalTextMetadataKey?: string;
  });

  getNodesFromDocuments(documents: Document[], showProgress?: boolean): TextNode[];
  splitText(text: string): string[];
}
```
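The idea behind sentence windowing can be sketched in a few lines. This is a hypothetical stand-in for the real parser: each sentence becomes its own node, and the surrounding window of sentences is stashed under a metadata key so a retriever can match on the precise sentence while the synthesizer sees the wider context. The sentence regex, default window size, and key name here are illustrative assumptions.

```typescript
// Hypothetical sketch: one node per sentence, with the surrounding
// `windowSize` sentences on each side stored in metadata.
function windowSentences(
  text: string,
  windowSize = 3,
  windowKey = "window"
): { text: string; metadata: Record<string, string> }[] {
  // Naive sentence split on ., !, ? (illustrative only).
  const sentences = text.match(/[^.!?]+[.!?]+/g)?.map(s => s.trim()) ?? [text];
  return sentences.map((sentence, i) => {
    const start = Math.max(0, i - windowSize);
    const end = Math.min(sentences.length, i + windowSize + 1);
    return {
      text: sentence,
      metadata: { [windowKey]: sentences.slice(start, end).join(" ") },
    };
  });
}
```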

## Basic Usage

### Creating Documents

```typescript
import { Document } from "llamaindex";

// Simple document
const doc = new Document({
  text: "This is the content of my document.",
  id_: "doc-1",
});

// Document with metadata
const docWithMetadata = new Document({
  text: "Financial report for Q3 2024...",
  id_: "financial-report-q3-2024",
  metadata: {
    author: "John Doe",
    department: "Finance",
    date: "2024-03-31",
    classification: "internal",
  },
});
```

### Basic Text Splitting

```typescript
import { SentenceSplitter, Document } from "llamaindex";

const documents = [
  new Document({ text: "Long document content here..." }),
];

// Create splitter with default settings
const splitter = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 20,
});

// Split documents into nodes
const nodes = splitter.getNodesFromDocuments(documents);

console.log(`Created ${nodes.length} text nodes`);
nodes.forEach((node, i) => {
  console.log(`Node ${i}: ${node.text.substring(0, 100)}...`);
});
```

### Advanced Splitting Configuration

```typescript
import { SentenceSplitter } from "llamaindex/node-parser";

// Custom tokenizer and separators
const advancedSplitter = new SentenceSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
  separator: " ",
  paragraphSeparator: "\n\n",
  chunkingTokenizerFn: (text: string) => text.split(/\s+/), // Custom tokenizer
});

const nodes = advancedSplitter.getNodesFromDocuments(documents);
```

### Markdown Processing

```typescript
import { MarkdownNodeParser } from "llamaindex/node-parser";

const markdownDoc = new Document({
  text: `# Chapter 1\n\nThis is the introduction.\n\n## Section 1.1\n\nContent here...`,
});

const markdownParser = new MarkdownNodeParser({
  chunkSize: 1024,
});

const markdownNodes = markdownParser.getNodesFromDocuments([markdownDoc]);
```

### HTML Processing

```typescript
import { HTMLNodeParser } from "llamaindex/node-parser";

const htmlDoc = new Document({
  text: `<html><body><h1>Title</h1><p>Paragraph content...</p></body></html>`,
});

const htmlParser = new HTMLNodeParser({
  chunkSize: 512,
  tags: ["p", "h1", "h2", "div"], // Focus on specific tags
});

const htmlNodes = htmlParser.getNodesFromDocuments([htmlDoc]);
```

### Code Processing

```typescript
import { CodeSplitter } from "llamaindex/node-parser";

const codeDoc = new Document({
  text: `function example() {\n  return "Hello World";\n}\n\nclass MyClass {\n  constructor() {}\n}`,
});

const codeSplitter = new CodeSplitter({
  language: "javascript",
  chunkLines: 10,
  chunkLinesOverlap: 2,
  maxChars: 1000,
});

const codeNodes = codeSplitter.getNodesFromDocuments([codeDoc]);
```

## Configuration with Settings

### Global Node Parser

```typescript
import { Settings, SentenceSplitter } from "llamaindex";

// Set global node parser
Settings.nodeParser = new SentenceSplitter({
  chunkSize: 1024,
  chunkOverlap: 20,
});

// All indexing operations will use this parser by default
```

### Temporary Node Parser Override

```typescript
import { Settings, TokenTextSplitter, VectorStoreIndex } from "llamaindex";

const documents = [/* your documents */];

// Use a different parser for a specific operation
const index = Settings.withNodeParser(
  new TokenTextSplitter({ chunkSize: 512 }),
  () => {
    return VectorStoreIndex.fromDocuments(documents);
  },
);
```

## Working with Node Metadata

### Accessing Node Information

```typescript
const nodes = splitter.getNodesFromDocuments(documents);

nodes.forEach(node => {
  console.log("Node ID:", node.id_);
  console.log("Text:", node.text);
  console.log("Metadata:", node.metadata);

  // Check relationships to the source document
  if (node.relationships.SOURCE_NODE) {
    console.log("Source document ID:", node.relationships.SOURCE_NODE.nodeId);
  }

  // Check text positions if available
  if (node.startCharIdx !== undefined) {
    console.log(`Text span: ${node.startCharIdx}-${node.endCharIdx}`);
  }
});
```

### Preserving Document Metadata

Document metadata is automatically propagated to generated nodes:

```typescript
const docWithMeta = new Document({
  text: "Content here...",
  metadata: {
    source: "research-paper.pdf",
    page: 1,
    section: "introduction",
  },
});

const nodes = splitter.getNodesFromDocuments([docWithMeta]);

// Each node will contain the document metadata
nodes.forEach(node => {
  console.log(node.metadata); // { source: "research-paper.pdf", page: 1, section: "introduction" }
});
```

## Best Practices

### Choosing Chunk Size

```typescript
// For general text (articles, books)
const generalSplitter = new SentenceSplitter({
  chunkSize: 1024, // Good balance of context and specificity
  chunkOverlap: 20,
});

// For code
const codeSplitter = new CodeSplitter({
  language: "typescript",
  chunkLines: 15, // Functions or logical blocks
  chunkLinesOverlap: 3,
});

// For short-form content (tweets, messages)
const shortFormSplitter = new SentenceSplitter({
  chunkSize: 256, // Smaller chunks for focused retrieval
  chunkOverlap: 10,
});
```

### Handling Different Content Types

```typescript
const processDocumentByType = (doc: Document) => {
  const { mimetype } = doc;

  if (mimetype?.includes("html")) {
    return new HTMLNodeParser().getNodesFromDocuments([doc]);
  } else if (mimetype?.includes("markdown")) {
    return new MarkdownNodeParser().getNodesFromDocuments([doc]);
  } else if (doc.metadata.fileExtension === ".py") {
    return new CodeSplitter({ language: "python" }).getNodesFromDocuments([doc]);
  } else {
    return new SentenceSplitter().getNodesFromDocuments([doc]);
  }
};
```