# Token-Based Splitting

Token-aware splitting uses tiktoken encodings to measure and split text by actual token count. This is essential for applications that need precise, token-based chunking for language models.
## Capabilities

### TokenTextSplitter Class

Splits text on actual token boundaries using a tiktoken encoding, providing accurate token-based chunking for language model applications.
```typescript { .api }
/**
 * Text splitter that splits text based on token count using tiktoken encoding
 */
class TokenTextSplitter extends TextSplitter implements TokenTextSplitterParams {
  encodingName: tiktoken.TiktokenEncoding;
  allowedSpecial: "all" | Array<string>;
  disallowedSpecial: "all" | Array<string>;
  private tokenizer: tiktoken.Tiktoken;

  constructor(fields?: Partial<TokenTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static lc_name(): string;
}

interface TokenTextSplitterParams extends TextSplitterParams {
  /** The tiktoken encoding to use (default: "gpt2") */
  encodingName: tiktoken.TiktokenEncoding;
  /** Special tokens that are allowed in the text (default: []) */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered (default: "all") */
  disallowedSpecial: "all" | Array<string>;
}

// Tiktoken encoding types from js-tiktoken
namespace tiktoken {
  type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";

  interface Tiktoken {
    encode(text: string, allowedSpecial?: "all" | Array<string>, disallowedSpecial?: "all" | Array<string>): number[];
    decode(tokens: number[]): string;
  }
}
```
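Under the hood, token-based splitting amounts to encoding the text, slicing the token array into overlapping windows, and decoding each window back to a string. A minimal sketch of the idea, assuming the `getEncoding` API from the js-tiktoken package (an illustration, not the library's exact implementation):

```typescript
import { getEncoding } from "js-tiktoken";

// Sketch: encode once, slice overlapping token windows, decode each window.
// Assumes chunkSize > chunkOverlap so the window always advances.
function splitByTokens(text: string, chunkSize: number, chunkOverlap: number): string[] {
  const enc = getEncoding("cl100k_base");
  const ids = enc.encode(text);
  const chunks: string[] = [];
  let start = 0;
  while (start < ids.length) {
    chunks.push(enc.decode(ids.slice(start, start + chunkSize)));
    if (start + chunkSize >= ids.length) break;
    start += chunkSize - chunkOverlap; // step forward, keeping the overlap
  }
  return chunks;
}
```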
**Usage Examples:**

```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";

// Basic token-based splitting with GPT-2 encoding
const splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 100, // 100 tokens per chunk
  chunkOverlap: 20, // 20-token overlap
});

const text = `This is a sample text that will be split based on actual token boundaries
rather than character count. This ensures more accurate chunking for language model applications.`;

const chunks = await splitter.splitText(text);
// Each chunk contains approximately 100 tokens with a 20-token overlap

// GPT-3.5/GPT-4 compatible splitting
const gpt4Splitter = new TokenTextSplitter({
  encodingName: "cl100k_base", // Used by GPT-3.5 and GPT-4
  chunkSize: 500,
  chunkOverlap: 50,
});

const longText = `Your long content here that needs to be split into chunks
that respect the actual token boundaries used by modern language models...`;

const tokenChunks = await gpt4Splitter.splitText(longText);

// Different encoding options
const r50kSplitter = new TokenTextSplitter({
  encodingName: "r50k_base", // Used by older GPT-3 base models such as davinci and curie
  chunkSize: 200,
  chunkOverlap: 30,
});
```
### Encoding Options

Token text splitters support various tiktoken encodings for different language models:
```typescript { .api }
/**
 * Supported tiktoken encodings for different language models
 */
type TiktokenEncoding =
  | "gpt2" // The original GPT-2 vocabulary, used by older models
  | "r50k_base" // Used by GPT-3 base models (davinci, curie, babbage, ada)
  | "p50k_base" // Used by text-davinci-002/-003 and the Codex models
  | "cl100k_base"; // Used by GPT-3.5 and GPT-4 models
```
**Encoding Examples:**

```typescript
// For GPT-3.5-turbo and GPT-4 applications
const modernSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000, // Up to 1000 tokens per chunk
  chunkOverlap: 100,
});

// For legacy GPT-3 base models
const legacySplitter = new TokenTextSplitter({
  encodingName: "r50k_base",
  chunkSize: 2048, // Common context window size for these models
  chunkOverlap: 200,
});

// For GPT-2 applications or research
const gpt2Splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 512,
  chunkOverlap: 50,
});
```
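Because chunk boundaries are computed in token space, you can verify the budget by re-encoding each chunk. A quick check, assuming js-tiktoken and the `modernSplitter` above (`someLongText` is a placeholder for your input):

```typescript
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");

// Re-encode each chunk to confirm it stays within the configured chunkSize
const verifiedChunks = await modernSplitter.splitText(someLongText);
for (const chunk of verifiedChunks) {
  console.log(`${enc.encode(chunk).length} tokens`); // should be <= 1000
}
```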
### Special Token Handling

Control how special tokens (such as "<|endoftext|>") are handled during tokenization:
```typescript { .api }
interface TokenTextSplitterParams extends TextSplitterParams {
  /** Special tokens that are allowed in the text */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered */
  disallowedSpecial: "all" | Array<string>;
}
```

**Special Token Examples:**
```typescript
// Allow all special tokens
const permissiveSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: "all",
  disallowedSpecial: [],
});

// Strict special token handling - error on any special token
const strictSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: [],
  disallowedSpecial: "all",
});

// Allow specific special tokens (they must exist in the chosen encoding;
// cl100k_base defines "<|endoftext|>" and "<|endofprompt|>", among others)
const customSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: ["<|endoftext|>", "<|endofprompt|>"],
  disallowedSpecial: "all",
});

const textWithSpecialTokens = "Regular text <|endoftext|> More text <|endofprompt|> Final text";
const chunks = await customSplitter.splitText(textWithSpecialTokens);
```
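With `disallowedSpecial: "all"`, the underlying tokenizer throws when the input contains a special token, so strict configurations are best wrapped in error handling. A sketch using the `strictSplitter` above (the exact error message depends on the tokenizer version):

```typescript
try {
  await strictSplitter.splitText("Regular text <|endoftext|> more text");
} catch (err) {
  // js-tiktoken rejects input containing a disallowed special token
  console.error("Input contains a disallowed special token:", err);
}
```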
### Document Processing

Token text splitters integrate with LangChain's document processing pipeline:
```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Create documents with precise token management
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 300,
  chunkOverlap: 30,
});

// Split documents for RAG applications
const documents = [
  new Document({
    pageContent: "Long article content that needs token-based splitting...",
    metadata: { source: "article.txt", tokens: 1500 },
  }),
];

const splitDocs = await splitter.splitDocuments(documents);

// Each split document keeps the original metadata and adds line locations
splitDocs.forEach((doc) => {
  console.log(`Chunk: ${doc.pageContent.substring(0, 50)}...`);
  console.log(`Metadata:`, doc.metadata);
  // Includes original metadata plus { loc: { lines: { from: X, to: Y } } }
});

// Create documents with chunk headers for context
// (longArticleText is a placeholder for your source text)
const docsWithHeaders = await splitter.createDocuments(
  [longArticleText],
  [{ source: "research.pdf", page: 1 }],
  {
    chunkHeader: "=== DOCUMENT CHUNK ===\n",
    chunkOverlapHeader: "[CONTINUED] ",
    appendChunkOverlapHeader: true,
  }
);
```
### Practical Applications

Token-based splitting is particularly useful for:

**RAG Systems with Token Limits:**
```typescript
// Ensure chunks fit within model context windows
const ragSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500, // Leave room for the query and system prompt
  chunkOverlap: 50, // Maintain context between chunks
});

// documentTexts and metadatas are placeholders for your corpus
const knowledge = await ragSplitter.createDocuments(documentTexts, metadatas);
// Each chunk is guaranteed to fit within the token budget
```
**Prompt Engineering:**

```typescript
// Split prompts to fit model limits while preserving structure
const promptSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 2000, // Under 4K context limit
  chunkOverlap: 100, // Maintain instruction context
});

const longPrompt = "Complex multi-part instructions...";
const promptChunks = await promptSplitter.splitText(longPrompt);
```
**Content Processing Pipelines:**

```typescript
// Process large documents with precise token accounting
const pipelineSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000,
  chunkOverlap: 100,
  lengthFunction: async (text: string) => {
    // countTokens is a user-supplied helper (see the sketch below)
    const tokenCount = await countTokens(text);
    return tokenCount;
  },
});

const processedDocs = await pipelineSplitter.transformDocuments(inputDocs);
```
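The `countTokens` helper above is not part of the library; a minimal version, assuming the js-tiktoken package, might look like this:

```typescript
import { getEncoding } from "js-tiktoken";

const enc = getEncoding("cl100k_base");

// Hypothetical helper: counts tokens with the same encoding the splitter uses
async function countTokens(text: string): Promise<number> {
  return enc.encode(text).length;
}
```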
### Performance Considerations

Token-based splitting requires tokenizing the full text on every call, which has performance implications:
```typescript
// Reuse splitter instances to avoid repeated tokenizer initialization
const sharedSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
});

// Process multiple texts with the same splitter
const allChunks = await Promise.all(
  texts.map((text) => sharedSplitter.splitText(text))
);

// For high-volume processing, consider batching
const batchSize = 10;
for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const batchChunks = await Promise.all(
    batch.map((text) => sharedSplitter.splitText(text))
  );
  // Process batch results
}
```