# Token-Based Splitting

Token-aware splitting uses tiktoken encodings to measure and split text by actual token count. This is essential for applications that need precise, token-based chunking for language models.
## Capabilities

### TokenTextSplitter Class

Splits text on actual token boundaries using a tiktoken encoding, providing accurate token-based chunking for language model applications.
```typescript { .api }
/**
 * Text splitter that splits text based on token count using tiktoken encoding
 */
class TokenTextSplitter extends TextSplitter implements TokenTextSplitterParams {
  encodingName: tiktoken.TiktokenEncoding;
  allowedSpecial: "all" | Array<string>;
  disallowedSpecial: "all" | Array<string>;
  private tokenizer: tiktoken.Tiktoken;

  constructor(fields?: Partial<TokenTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static lc_name(): string;
}

interface TokenTextSplitterParams extends TextSplitterParams {
  /** The tiktoken encoding to use (default: "gpt2") */
  encodingName: tiktoken.TiktokenEncoding;
  /** Special tokens that are allowed in the text (default: []) */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered (default: "all") */
  disallowedSpecial: "all" | Array<string>;
}

// Tiktoken encoding types from js-tiktoken
namespace tiktoken {
  type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";

  interface Tiktoken {
    encode(text: string, allowedSpecial?: "all" | Array<string>, disallowedSpecial?: "all" | Array<string>): number[];
    decode(tokens: number[]): string;
  }
}
```
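Under the hood, token-based splitting amounts to encoding the text, slicing the token array into overlapping windows, and decoding each window back to a string. A minimal sketch of the idea, assuming the `getEncoding` API from the js-tiktoken package (an illustration, not the library's exact implementation):

```typescript
import { getEncoding } from "js-tiktoken";

// Sketch: encode once, slice overlapping token windows, decode each window.
// Assumes chunkSize > chunkOverlap so the window always advances.
function splitByTokens(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const enc = getEncoding("cl100k_base");
  const ids = enc.encode(text);
  const chunks: string[] = [];
  let start = 0;
  while (start < ids.length) {
    chunks.push(enc.decode(ids.slice(start, start + chunkSize)));
    if (start + chunkSize >= ids.length) break;
    start += chunkSize - chunkOverlap; // step forward, keeping the overlap
  }
  return chunks;
}
```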
**Usage Examples:**

```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";

// Basic token-based splitting with GPT-2 encoding
const splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 100, // 100 tokens per chunk
  chunkOverlap: 20, // 20-token overlap
});

const text = `This is a sample text that will be split based on actual token boundaries
rather than character count. This ensures more accurate chunking for language model applications.`;

const chunks = await splitter.splitText(text);
// Each chunk contains approximately 100 tokens with a 20-token overlap

// GPT-3.5/GPT-4 compatible splitting
const gpt4Splitter = new TokenTextSplitter({
  encodingName: "cl100k_base", // Used by GPT-3.5 and GPT-4
  chunkSize: 500,
  chunkOverlap: 50,
});

const longText = `Your long content here that needs to be split into chunks
that respect the actual token boundaries used by modern language models...`;

const tokenChunks = await gpt4Splitter.splitText(longText);

// Different encoding options
const r50kSplitter = new TokenTextSplitter({
  encodingName: "r50k_base", // Used by older GPT-3 base models such as davinci and curie
  chunkSize: 200,
  chunkOverlap: 30,
});
```
### Encoding Options

Token text splitters support various tiktoken encodings for different language models:
```typescript { .api }
/**
 * Supported tiktoken encodings for different language models
 */
type TiktokenEncoding =
  | "gpt2" // The original GPT-2 vocabulary, used by older models
  | "r50k_base" // Used by GPT-3 base models (davinci, curie, babbage, ada)
  | "p50k_base" // Used by text-davinci-002/-003 and the Codex models
  | "cl100k_base"; // Used by GPT-3.5 and GPT-4 models
```
**Encoding Examples:**

```typescript
// For GPT-3.5-turbo and GPT-4 applications
const modernSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000, // Up to 1000 tokens per chunk
  chunkOverlap: 100,
});

// For legacy GPT-3 base models
const legacySplitter = new TokenTextSplitter({
  encodingName: "r50k_base",
  chunkSize: 2048, // Common context window size for these models
  chunkOverlap: 200,
});

// For GPT-2 applications or research
const gpt2Splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 512,
  chunkOverlap: 50,
});
```
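Because chunk boundaries are computed in token space, you can verify the budget by re-encoding each chunk. A quick check, assuming js-tiktoken and the `modernSplitter` above (`someLongText` is a placeholder for your input):

```typescript
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");

// Re-encode each chunk to confirm it stays within the configured chunkSize
const verifiedChunks = await modernSplitter.splitText(someLongText);
for (const chunk of verifiedChunks) {
  console.log(`${enc.encode(chunk).length} tokens`); // should be <= 1000
}
```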
### Special Token Handling

Control how special tokens (such as "<|endoftext|>") are handled during tokenization:
```typescript { .api }
interface TokenTextSplitterParams extends TextSplitterParams {
  /** Special tokens that are allowed in the text */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered */
  disallowedSpecial: "all" | Array<string>;
}
```

**Special Token Examples:**
```typescript
// Allow all special tokens
const permissiveSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: "all",
  disallowedSpecial: [],
});

// Strict special token handling - error on any special token
const strictSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: [],
  disallowedSpecial: "all",
});

// Allow specific special tokens (they must exist in the chosen encoding;
// cl100k_base defines "<|endoftext|>" and "<|endofprompt|>", among others)
const customSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: ["<|endoftext|>", "<|endofprompt|>"],
  disallowedSpecial: "all",
});

const textWithSpecialTokens = "Regular text <|endoftext|> More text <|endofprompt|> Final text";
const chunks = await customSplitter.splitText(textWithSpecialTokens);
```
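With `disallowedSpecial: "all"`, the underlying tokenizer throws when the input contains a special token, so strict configurations are best wrapped in error handling. A sketch using the `strictSplitter` above (the exact error message depends on the tokenizer version):

```typescript
try {
  await strictSplitter.splitText("Regular text <|endoftext|> more text");
} catch (err) {
  // js-tiktoken rejects input containing a disallowed special token
  console.error("Input contains a disallowed special token:", err);
}
```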
### Document Processing

Token text splitters integrate with LangChain's document processing pipeline:
```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Create documents with precise token management
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 300,
  chunkOverlap: 30,
});

// Split documents for RAG applications
const documents = [
  new Document({
    pageContent: "Long article content that needs token-based splitting...",
    metadata: { source: "article.txt", tokens: 1500 },
  }),
];

const splitDocs = await splitter.splitDocuments(documents);

// Each split document keeps the original metadata and adds line locations
splitDocs.forEach((doc) => {
  console.log(`Chunk: ${doc.pageContent.substring(0, 50)}...`);
  console.log(`Metadata:`, doc.metadata);
  // Includes original metadata plus { loc: { lines: { from: X, to: Y } } }
});

// Create documents with chunk headers for context
// (longArticleText is a placeholder for your source text)
const docsWithHeaders = await splitter.createDocuments(
  [longArticleText],
  [{ source: "research.pdf", page: 1 }],
  {
    chunkHeader: "=== DOCUMENT CHUNK ===\n",
    chunkOverlapHeader: "[CONTINUED] ",
    appendChunkOverlapHeader: true,
  }
);
```
### Practical Applications

Token-based splitting is particularly useful for:

**RAG Systems with Token Limits:**
```typescript
// Ensure chunks fit within model context windows
const ragSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500, // Leave room for the query and system prompt
  chunkOverlap: 50, // Maintain context between chunks
});

// documentTexts and metadatas are placeholders for your corpus
const knowledge = await ragSplitter.createDocuments(documentTexts, metadatas);
// Each chunk is guaranteed to fit within the token budget
```
**Prompt Engineering:**

```typescript
// Split prompts to fit model limits while preserving structure
const promptSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 2000, // Under 4K context limit
  chunkOverlap: 100, // Maintain instruction context
});

const longPrompt = "Complex multi-part instructions...";
const promptChunks = await promptSplitter.splitText(longPrompt);
```
**Content Processing Pipelines:**

```typescript
// Process large documents with precise token accounting
const pipelineSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000,
  chunkOverlap: 100,
  lengthFunction: async (text: string) => {
    // countTokens is a user-supplied helper (see the sketch below)
    const tokenCount = await countTokens(text);
    return tokenCount;
  },
});

const processedDocs = await pipelineSplitter.transformDocuments(inputDocs);
```
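The `countTokens` helper above is not part of the library; a minimal version, assuming the js-tiktoken package, might look like this:

```typescript
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");

// Hypothetical helper: counts tokens with the same encoding the splitter uses
async function countTokens(text: string): Promise<number> {
  return enc.encode(text).length;
}
```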
### Performance Considerations

Token-based splitting requires tokenizing the full text on every call, which has performance implications:
```typescript
// Reuse splitter instances to avoid repeated tokenizer initialization
const sharedSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
});

// Process multiple texts with the same splitter
const allChunks = await Promise.all(
  texts.map((text) => sharedSplitter.splitText(text))
);

// For high-volume processing, consider batching
const batchSize = 10;
for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const batchChunks = await Promise.all(
    batch.map((text) => sharedSplitter.splitText(text))
  );
  // Process batch results
}
```