Various implementations of LangChain.js text splitters for retrieval-augmented generation (RAG) pipelines
npx @tessl/cli install tessl/npm-langchain--textsplitters@0.1.00
# LangChain Text Splitters

LangChain Text Splitters provides text-splitting implementations for LangChain.js, most commonly used in retrieval-augmented generation (RAG) pipelines. The library offers an abstract base class and concrete implementations that split text and documents into smaller chunks with configurable chunk size, overlap, and length function.

## Package Information

- **Package Name**: @langchain/textsplitters
- **Package Type**: npm
- **Language**: TypeScript
- **Installation**: `npm install @langchain/textsplitters @langchain/core js-tiktoken`

## Core Imports

```typescript
import {
  // Classes
  TextSplitter,
  CharacterTextSplitter,
  RecursiveCharacterTextSplitter,
  TokenTextSplitter,
  MarkdownTextSplitter,
  LatexTextSplitter,

  // Interfaces and Types
  TextSplitterParams,
  TextSplitterChunkHeaderOptions,
  CharacterTextSplitterParams,
  RecursiveCharacterTextSplitterParams,
  TokenTextSplitterParams,
  MarkdownTextSplitterParams,
  LatexTextSplitterParams,
  SupportedTextSplitterLanguage,
  SupportedTextSplitterLanguages // runtime constant, not a type
} from "@langchain/textsplitters";

// Type-only import for the tiktoken encoding names used by TokenTextSplitter
import type * as tiktoken from "js-tiktoken";
// Document class used by createDocuments and splitDocuments
import { Document } from "@langchain/core/documents";
```

For CommonJS (only runtime values can be required; TypeScript interfaces and type aliases are erased at compile time):

```javascript
const {
  // Classes
  TextSplitter,
  CharacterTextSplitter,
  RecursiveCharacterTextSplitter,
  TokenTextSplitter,
  MarkdownTextSplitter,
  LatexTextSplitter,

  // Runtime constant (interfaces and type aliases do not exist at runtime)
  SupportedTextSplitterLanguages
} = require("@langchain/textsplitters");

// Required for document processing
const { Document } = require("@langchain/core/documents");
```

## Basic Usage

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Create a text splitter with custom configuration
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

// Split text into chunks
const text = "Your long text content here...";
const chunks = await splitter.splitText(text);

// Create documents from text with metadata
const docs = await splitter.createDocuments(
  [text],
  [{ source: "example.txt" }]
);

// Split existing documents
const existingDocs = [
  new Document({ pageContent: text, metadata: { source: "doc1" } })
];
const splitDocs = await splitter.splitDocuments(existingDocs);
```

## Architecture

LangChain Text Splitters is built around several key components:

- **Abstract Base Class**: `TextSplitter` provides the core splitting functionality and the document transformation interface
- **Concrete Implementations**: Specialized splitters for different splitting strategies (character-based, recursive, token-based)
- **Document Integration**: Integration with LangChain's Document ecosystem via `BaseDocumentTransformer`
- **Metadata Preservation**: Automatic line number tracking and metadata propagation through splitting operations
- **Language-Aware Splitting**: Support for 16 languages and markup formats (the entries of `SupportedTextSplitterLanguages`) with language-specific separators

## Capabilities

### Basic Text Splitting

Splits text on a single, fixed separator string. Suited to basic document chunking where the separator pattern is predictable.

```typescript { .api }
class CharacterTextSplitter extends TextSplitter {
  constructor(fields?: Partial<CharacterTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
}

interface CharacterTextSplitterParams extends TextSplitterParams {
  separator: string;
}
```

[Character Text Splitting](./character-splitting.md)

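The general technique behind separator-based chunking can be sketched without the library: split on one separator, then greedily merge pieces into chunks of at most `chunkSize` characters while carrying up to `chunkOverlap` characters into the next chunk. The function `splitAndMerge` below is a hypothetical helper written for illustration; it is not the code of `CharacterTextSplitter`, and edge-case behavior may differ.

```typescript
// Hypothetical helper: split on one separator, then merge greedily.
function splitAndMerge(
  text: string,
  separator: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  // Step 1: split on the fixed separator and drop empty pieces.
  const pieces = text.split(separator).filter((piece) => piece.length > 0);

  const chunks: string[] = [];
  let current: string[] = []; // pieces of the chunk being built
  let total = 0; // length of `current` when joined with the separator

  for (const piece of pieces) {
    const joinCost = current.length > 0 ? separator.length : 0;
    if (current.length > 0 && total + joinCost + piece.length > chunkSize) {
      // The chunk is full: emit it, then keep a tail no longer than the overlap.
      chunks.push(current.join(separator));
      while (
        current.length > 0 &&
        (total > chunkOverlap ||
          total + separator.length + piece.length > chunkSize)
      ) {
        const removed = current.shift() as string;
        total -= removed.length + (current.length > 0 ? separator.length : 0);
      }
    }
    total += piece.length + (current.length > 0 ? separator.length : 0);
    current.push(piece);
  }
  // A single piece longer than chunkSize is emitted as-is in this sketch.
  if (current.length > 0) chunks.push(current.join(separator));
  return chunks;
}

console.log(splitAndMerge("aa bb cc dd ee", " ", 5, 2));
// [ 'aa bb', 'bb cc', 'cc dd', 'dd ee' ]
```

With `chunkOverlap` set to 0, the same input yields non-overlapping chunks, which is the usual trade-off between context continuity and index size.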
### Recursive Text Splitting

Splits text recursively using an ordered hierarchy of separators, falling back to finer separators only when a piece is still too large. This tends to keep paragraphs, sentences, and code constructs intact, and it supports language-aware splitting.

```typescript { .api }
class RecursiveCharacterTextSplitter extends TextSplitter {
  constructor(fields?: Partial<RecursiveCharacterTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static fromLanguage(
    language: SupportedTextSplitterLanguage,
    options?: Partial<RecursiveCharacterTextSplitterParams>
  ): RecursiveCharacterTextSplitter;
  static getSeparatorsForLanguage(language: SupportedTextSplitterLanguage): string[];
}

interface RecursiveCharacterTextSplitterParams extends TextSplitterParams {
  separators: string[];
}

const SupportedTextSplitterLanguages = [
  "cpp", "go", "java", "js", "php", "proto", "python", "rst",
  "ruby", "rust", "scala", "swift", "markdown", "latex", "html", "sol"
] as const;

type SupportedTextSplitterLanguage = (typeof SupportedTextSplitterLanguages)[number];
```

[Recursive Text Splitting](./recursive-splitting.md)

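The separator hierarchy can be illustrated with a short standalone sketch. The function `recursiveSplit` is hypothetical and deliberately simplified: it only descends through the separators and does not merge small neighboring pieces back together or apply overlap, which a full splitter would be expected to do.

```typescript
// Hypothetical, simplified recursion over an ordered list of separators.
function recursiveSplit(
  text: string,
  separators: string[],
  chunkSize: number
): string[] {
  // Base case: the piece already fits (empty pieces are dropped).
  if (text.length <= chunkSize) {
    return text.length > 0 ? [text] : [];
  }

  const [separator, ...rest] = separators;
  if (separator === undefined) {
    // No separators left: fall back to fixed-width slices.
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }

  // Split on the coarsest remaining separator and recurse with finer ones.
  const result: string[] = [];
  for (const piece of text.split(separator)) {
    result.push(...recursiveSplit(piece, rest, chunkSize));
  }
  return result;
}

console.log(recursiveSplit("one two\n\nthree four five", ["\n\n", " "], 8));
// [ 'one two', 'three', 'four', 'five' ]
```

The first paragraph fits within 8 characters and stays whole, while the second is too long and is broken on spaces, which shows why coarse separators are tried first.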
### Token-Based Splitting

Token-aware splitting using tiktoken encodings, so chunk size and overlap are measured in tokens rather than characters. Useful when chunks must fit a language model's token budget.

```typescript { .api }
class TokenTextSplitter extends TextSplitter {
  constructor(fields?: Partial<TokenTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
}

interface TokenTextSplitterParams extends TextSplitterParams {
  encodingName: tiktoken.TiktokenEncoding;
  allowedSpecial: "all" | Array<string>;
  disallowedSpecial: "all" | Array<string>;
}

// Tiktoken encoding types from js-tiktoken
namespace tiktoken {
  type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";

  interface Tiktoken {
    encode(text: string, allowedSpecial?: "all" | Array<string>, disallowedSpecial?: "all" | Array<string>): number[];
    decode(tokens: number[]): string;
  }
}
```

[Token-Based Splitting](./token-splitting.md)

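As a rough illustration of token-window chunking, the sketch below encodes the text, slides a window of `chunkSize` tokens forward by `chunkSize - chunkOverlap`, and decodes each window. To stay self-contained it uses a hypothetical word-level tokenizer (`createWordTokenizer`) in place of a real tiktoken encoding, so token counts will not match any actual model; `splitByTokens` is likewise an assumed name, not a library function.

```typescript
// Minimal tokenizer interface, mirroring the encode/decode pair shown above.
interface Tokenizer {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Hypothetical stand-in: one token per whitespace-separated word.
function createWordTokenizer(): Tokenizer {
  const vocab: string[] = [];
  return {
    encode(text: string): number[] {
      const words = text.split(/\s+/).filter((w) => w.length > 0);
      return words.map((word) => {
        let id = vocab.indexOf(word);
        if (id === -1) {
          id = vocab.length;
          vocab.push(word);
        }
        return id;
      });
    },
    decode(tokens: number[]): string {
      return tokens
        .map((token) => {
          const word = vocab[token];
          return word === undefined ? "" : word;
        })
        .join(" ");
    },
  };
}

// Slide a window of chunkSize tokens, stepping by chunkSize - chunkOverlap.
function splitByTokens(
  text: string,
  tokenizer: Tokenizer,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const tokens = tokenizer.encode(text);
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokenizer.decode(tokens.slice(start, start + chunkSize)));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}

console.log(splitByTokens("a b c d e f g", createWordTokenizer(), 3, 1));
// [ 'a b c', 'c d e', 'e f g' ]
```

Swapping the stand-in for a real encoder only requires an object with compatible `encode` and `decode` methods.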
### Document Format Splitting

Specialized splitters for Markdown and LaTeX, built on `RecursiveCharacterTextSplitter` with format-specific separators so that document structure is preserved where possible.

```typescript { .api }
class MarkdownTextSplitter extends RecursiveCharacterTextSplitter {
  constructor(fields?: Partial<MarkdownTextSplitterParams>);
}

class LatexTextSplitter extends RecursiveCharacterTextSplitter {
  constructor(fields?: Partial<LatexTextSplitterParams>);
}

type MarkdownTextSplitterParams = TextSplitterParams;
type LatexTextSplitterParams = TextSplitterParams;
```

[Document Format Splitting](./format-splitting.md)

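To show what structure-aware splitting means in practice, here is a small standalone sketch that starts a new section at every Markdown heading. `splitByMarkdownHeadings` is a hypothetical helper, not the separator list used by `MarkdownTextSplitter`, and it makes a simplifying assumption: lines starting with `#` inside fenced code blocks are also treated as headings.

```typescript
// Hypothetical helper: group lines into sections that each begin at a heading.
function splitByMarkdownHeadings(markdown: string): string[] {
  const sections: string[] = [];
  let current: string[] = [];

  for (const line of markdown.split("\n")) {
    // An ATX heading is one to six '#' characters followed by whitespace.
    const isHeading = /^#{1,6}\s/.test(line);
    if (isHeading && current.length > 0) {
      sections.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) {
    sections.push(current.join("\n").trim());
  }
  // Drop sections that contained only blank lines.
  return sections.filter((section) => section.length > 0);
}

const sample = "# Title\nIntro\n\n## Part\nBody";
console.log(splitByMarkdownHeadings(sample));
// [ '# Title\nIntro', '## Part\nBody' ]
```

Sections produced this way can still exceed the target chunk size, which is why a format-aware splitter typically continues with finer separators such as blank lines and spaces.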
## Core Types

```typescript { .api }
interface TextSplitterParams {
  chunkSize: number;
  chunkOverlap: number;
  keepSeparator: boolean;
  lengthFunction?: ((text: string) => number) | ((text: string) => Promise<number>);
}

type TextSplitterChunkHeaderOptions = {
  chunkHeader?: string;
  chunkOverlapHeader?: string;
  appendChunkOverlapHeader?: boolean;
};

// Base class from @langchain/core/documents
abstract class BaseDocumentTransformer {
  abstract transformDocuments(documents: Document[], ...args: any[]): Promise<Document[]>;
}

abstract class TextSplitter extends BaseDocumentTransformer implements TextSplitterParams {
  lc_namespace: string[];
  chunkSize: number;
  chunkOverlap: number;
  keepSeparator: boolean;
  lengthFunction: ((text: string) => number) | ((text: string) => Promise<number>);

  constructor(fields?: Partial<TextSplitterParams>);
  abstract splitText(text: string): Promise<string[]>;
  transformDocuments(
    documents: Document[],
    chunkHeaderOptions?: TextSplitterChunkHeaderOptions
  ): Promise<Document[]>;
  createDocuments(
    texts: string[],
    metadatas?: Record<string, any>[],
    chunkHeaderOptions?: TextSplitterChunkHeaderOptions
  ): Promise<Document[]>;
  splitDocuments(
    documents: Document[],
    chunkHeaderOptions?: TextSplitterChunkHeaderOptions
  ): Promise<Document[]>;
  protected splitOnSeparator(text: string, separator: string): string[];
  mergeSplits(splits: string[], separator: string): Promise<string[]>;
  private numberOfNewLines(text: string, start?: number, end?: number): number;
  private joinDocs(docs: string[], separator: string): string | null;
}
```
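
The chunk header options are easiest to understand through a small example. The sketch below shows one plausible way the three fields could combine: every chunk is prefixed with `chunkHeader`, and chunks after the first additionally receive `chunkOverlapHeader` when `appendChunkOverlapHeader` is true. The helper `applyChunkHeaders` and the default value `"(cont'd) "` are assumptions made for illustration and should be checked against the library before relying on them.

```typescript
// Local copy of the option shape so the sketch is self-contained.
interface ChunkHeaderOptions {
  chunkHeader?: string;
  chunkOverlapHeader?: string;
  appendChunkOverlapHeader?: boolean;
}

// Hypothetical helper: prefix chunks with headers according to the options.
function applyChunkHeaders(
  chunks: string[],
  options: ChunkHeaderOptions = {}
): string[] {
  const header = options.chunkHeader === undefined ? "" : options.chunkHeader;
  // Assumed default continuation marker; not confirmed by this document.
  const overlapHeader =
    options.chunkOverlapHeader === undefined
      ? "(cont'd) "
      : options.chunkOverlapHeader;
  const append = options.appendChunkOverlapHeader === true;

  return chunks.map((chunk, index) => {
    // Only chunks after the first are marked as continuations.
    const continuation = index > 0 && append ? overlapHeader : "";
    return header + continuation + chunk;
  });
}

console.log(
  applyChunkHeaders(["first part", "second part"], {
    chunkHeader: "DOC: ",
    appendChunkOverlapHeader: true,
  })
);
// [ 'DOC: first part', "DOC: (cont'd) second part" ]
```

Headers like these help a retriever or reader tell which source a chunk came from and whether it continues an earlier chunk.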