# Token-Based Splitting

Token-aware splitting using tiktoken encodings for accurate token count management. Essential for applications that need precise token-based chunking for language models.

## Capabilities

### TokenTextSplitter Class

Splits text on actual token boundaries using a tiktoken encoding, providing accurate token-based chunking for language model applications.

```typescript { .api }
/**
 * Text splitter that splits text based on token count using tiktoken encoding
 */
class TokenTextSplitter extends TextSplitter implements TokenTextSplitterParams {
  encodingName: tiktoken.TiktokenEncoding;
  allowedSpecial: "all" | Array<string>;
  disallowedSpecial: "all" | Array<string>;
  private tokenizer: tiktoken.Tiktoken;

  constructor(fields?: Partial<TokenTextSplitterParams>);
  splitText(text: string): Promise<string[]>;
  static lc_name(): string;
}

interface TokenTextSplitterParams extends TextSplitterParams {
  /** The tiktoken encoding to use (default: "gpt2") */
  encodingName: tiktoken.TiktokenEncoding;
  /** Special tokens that are allowed in the text (default: []) */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered (default: "all") */
  disallowedSpecial: "all" | Array<string>;
}

// Tiktoken encoding types from js-tiktoken
namespace tiktoken {
  type TiktokenEncoding = "gpt2" | "r50k_base" | "p50k_base" | "cl100k_base";

  interface Tiktoken {
    encode(text: string, allowedSpecial?: "all" | Array<string>, disallowedSpecial?: "all" | Array<string>): number[];
    decode(tokens: number[]): string;
  }
}
```
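
To make the slicing behavior concrete, here is a minimal standalone sketch of the same token-windowing idea, assuming the `js-tiktoken` package and its `getEncoding` helper. This illustrates the technique, not the library's exact internals:

```typescript
import { getEncoding } from "js-tiktoken";

// Sketch: encode the whole text, take windows of token IDs that advance
// by (chunkSize - chunkOverlap), and decode each window back to text.
// Assumes chunkOverlap < chunkSize so the loop always advances.
function sketchTokenSplit(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  const enc = getEncoding("cl100k_base");
  const ids = enc.encode(text);
  const chunks: string[] = [];
  for (let start = 0; start < ids.length; start += chunkSize - chunkOverlap) {
    chunks.push(enc.decode(ids.slice(start, start + chunkSize)));
  }
  return chunks;
}
```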

**Usage Examples:**

```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";

// Basic token-based splitting with GPT-2 encoding
const splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 100, // 100 tokens per chunk
  chunkOverlap: 20, // 20-token overlap
});

const text = `This is a sample text that will be split based on actual token boundaries
rather than character count. This ensures more accurate chunking for language model applications.`;

const chunks = await splitter.splitText(text);
// Each chunk contains approximately 100 tokens with a 20-token overlap

// GPT-3.5/GPT-4 compatible splitting
const gpt4Splitter = new TokenTextSplitter({
  encodingName: "cl100k_base", // Used by GPT-3.5 and GPT-4
  chunkSize: 500,
  chunkOverlap: 50,
});

const longText = `Your long content here that needs to be split into chunks
that respect the actual token boundaries used by modern language models...`;

const tokenChunks = await gpt4Splitter.splitText(longText);

// Different encoding options
const r50kSplitter = new TokenTextSplitter({
  encodingName: "r50k_base", // Used by earlier GPT-3 models such as davinci
  chunkSize: 200,
  chunkOverlap: 30,
});
```

### Encoding Options

Token text splitters support several tiktoken encodings for different language models:

```typescript { .api }
/**
 * Supported tiktoken encodings for different language models
 */
type TiktokenEncoding =
  | "gpt2" // GPT-2; equivalent to r50k_base
  | "r50k_base" // Used by earlier GPT-3 models such as davinci
  | "p50k_base" // Used by Codex models and text-davinci-002/-003
  | "cl100k_base"; // Used by GPT-3.5 and GPT-4 models
```
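
The encoding choice changes how many tokens the same text occupies. As a quick illustration (assuming the `js-tiktoken` package; the exact counts depend on each encoding's vocabulary):

```typescript
import { getEncoding } from "js-tiktoken";

// Print the token count of one string under each supported encoding
const sample = "Token counts vary between encodings: résumé, naïve, 🚀";
for (const name of ["gpt2", "r50k_base", "p50k_base", "cl100k_base"] as const) {
  console.log(name, getEncoding(name).encode(sample).length);
}
```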

**Encoding Examples:**

```typescript
// For GPT-3.5-turbo and GPT-4 applications
const modernSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000, // Approximately 1000 tokens
  chunkOverlap: 100,
});

// For legacy GPT-3 models
const legacySplitter = new TokenTextSplitter({
  encodingName: "r50k_base",
  chunkSize: 2048, // Common context window size
  chunkOverlap: 200,
});

// For GPT-2 applications or research
const gpt2Splitter = new TokenTextSplitter({
  encodingName: "gpt2",
  chunkSize: 512,
  chunkOverlap: 50,
});
```

### Special Token Handling

Control how special tokens are handled during tokenization:

```typescript { .api }
interface TokenTextSplitterParams extends TextSplitterParams {
  /** Special tokens that are allowed in the text */
  allowedSpecial: "all" | Array<string>;
  /** Special tokens that should cause errors if encountered */
  disallowedSpecial: "all" | Array<string>;
}
```

**Special Token Examples:**

```typescript
// Allow all special tokens
const permissiveSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: "all",
  disallowedSpecial: [],
});

// Strict special token handling - error on any special tokens
const strictSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: [],
  disallowedSpecial: "all",
});

// Allow specific special tokens (they must exist in the chosen encoding;
// cl100k_base defines <|endoftext|> and <|endofprompt|>, for example)
const customSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
  allowedSpecial: ["<|endoftext|>", "<|endofprompt|>"],
  disallowedSpecial: "all",
});

const textWithSpecialTokens = "Regular text <|endoftext|> More text <|endofprompt|> Final text";
const chunks = await customSplitter.splitText(textWithSpecialTokens);
```
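
With the strict configuration above, text containing a special token should be rejected at encode time. A hedged sketch of handling that failure (the exact error type and message come from the underlying tiktoken implementation):

```typescript
try {
  // strictSplitter disallows all special tokens, so this should throw
  await strictSplitter.splitText("Some text <|endoftext|> more text");
} catch (err) {
  console.error("Special token rejected during tokenization:", err);
}
```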

### Document Processing

Token text splitters integrate with LangChain's document processing pipeline:

```typescript
import { TokenTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// Create documents with precise token management
const splitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 300,
  chunkOverlap: 30,
});

// Split documents for RAG applications
const documents = [
  new Document({
    pageContent: "Long article content that needs token-based splitting...",
    metadata: { source: "article.txt", tokens: 1500 },
  }),
];

const splitDocs = await splitter.splitDocuments(documents);

// Each split document keeps its original metadata and adds a line location
splitDocs.forEach((doc) => {
  console.log(`Chunk: ${doc.pageContent.substring(0, 50)}...`);
  console.log(`Metadata:`, doc.metadata);
  // Includes original metadata plus { loc: { lines: { from: X, to: Y } } }
});

// Create documents with chunk headers for context
const docsWithHeaders = await splitter.createDocuments(
  [longArticleText],
  [{ source: "research.pdf", page: 1 }],
  {
    chunkHeader: "=== DOCUMENT CHUNK ===\n",
    chunkOverlapHeader: "[CONTINUED] ",
    appendChunkOverlapHeader: true,
  }
);
```
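
To sanity-check how the headers are applied, you can inspect the first line of each resulting chunk. This is a quick sketch; the overlap marker should appear only on chunks that carry overlapped text:

```typescript
// Every chunk's pageContent starts with the chunk header; continued
// chunks additionally begin their body with "[CONTINUED] ".
for (const doc of docsWithHeaders) {
  console.log(doc.pageContent.split("\n")[0]); // "=== DOCUMENT CHUNK ==="
}
```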

### Practical Applications

Token-based splitting is particularly useful for:

**RAG Systems with Token Limits:**
```typescript
// Ensure chunks fit within model context windows
const ragSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500, // Leave room for query + system prompt
  chunkOverlap: 50, // Maintain context between chunks
});

const knowledge = await ragSplitter.createDocuments(documentTexts, metadatas);
// Each chunk is guaranteed to fit within the token budget
```
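
One way to pick `chunkSize` is to work backwards from the model's context window. The arithmetic below is a sketch with illustrative numbers; measure your own prompt sizes rather than reusing these:

```typescript
// Hypothetical token budget for a 4,096-token context window
const contextWindow = 4096;
const systemPromptTokens = 200; // measured size of your system prompt
const queryTokens = 100;        // typical user query size
const responseTokens = 1000;    // tokens reserved for the model's answer
const retrievedChunks = 5;      // chunks stuffed into each prompt

const chunkBudget = Math.floor(
  (contextWindow - systemPromptTokens - queryTokens - responseTokens) /
    retrievedChunks
); // 559 tokens per chunk in this example
```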

**Prompt Engineering:**
```typescript
// Split prompts to fit model limits while preserving structure
const promptSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 2000, // Under 4K context limit
  chunkOverlap: 100, // Maintain instruction context
});

const longPrompt = "Complex multi-part instructions...";
const promptChunks = await promptSplitter.splitText(longPrompt);
```

**Content Processing Pipelines:**
```typescript
// Process large documents with precise token accounting
const pipelineSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 1000,
  chunkOverlap: 100,
});
// Note: TokenTextSplitter measures length with its own tiktoken tokenizer
// (splitText slices the encoded token IDs directly), so a custom
// lengthFunction is not consulted here.

const processedDocs = await pipelineSplitter.transformDocuments(inputDocs);
```

### Performance Considerations

Token-based splitting must tokenize the full text, which has performance implications:

```typescript
// Reuse splitter instances to avoid repeated tokenizer initialization
const sharedSplitter = new TokenTextSplitter({
  encodingName: "cl100k_base",
  chunkSize: 500,
  chunkOverlap: 50,
});

// Process multiple texts with the same splitter
const allChunks = await Promise.all(
  texts.map((text) => sharedSplitter.splitText(text))
);

// For high-volume processing, consider batching
const batchSize = 10;
for (let i = 0; i < texts.length; i += batchSize) {
  const batch = texts.slice(i, i + batchSize);
  const batchChunks = await Promise.all(
    batch.map((text) => sharedSplitter.splitText(text))
  );
  // Process batch results
}
```