
# Token-Based Text Splitting

Token-based splitting provides advanced text segmentation based on tokenization models. This approach ensures precise control over chunk sizes in terms of tokens rather than characters, which is crucial for language model applications that have token-based context limits.
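Why tokens rather than characters? A self-contained illustration (the word-per-token tokenizer here is a deliberately simplistic stand-in for a real BPE tokenizer):

```python
# Hypothetical word-level tokenizer: one token per whitespace-separated word.
def count_tokens(text: str) -> int:
    return len(text.split())

short_words = "a b c d e f g h i j"  # 19 characters, 10 tokens
long_word = "internationalization"   # 20 characters, 1 token

# Similar character lengths, very different token counts: character-based
# chunking cannot guarantee that a chunk fits a token budget.
print(len(short_words), count_tokens(short_words))  # 19 10
print(len(long_word), count_tokens(long_word))      # 20 1
```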

## Capabilities

### OpenAI Token Splitting

Text splitting based on OpenAI's tiktoken tokenizer, supporting various encoding schemes and models.

```python { .api }
class TokenTextSplitter(TextSplitter):
    def __init__(
        self,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], set[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**

- `encoding_name`: Tiktoken encoding name (default: `"gpt2"`)
- `model_name`: Optional OpenAI model name used to determine the encoding
- `allowed_special`: Special tokens allowed during encoding
- `disallowed_special`: Special tokens that raise errors during encoding
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import TokenTextSplitter

# Basic token splitting with GPT-2 encoding
splitter = TokenTextSplitter(
    encoding_name="gpt2",
    chunk_size=512,  # 512 tokens per chunk
    chunk_overlap=50
)
chunks = splitter.split_text("Long text to be tokenized and split...")

# Model-specific token splitting
gpt4_splitter = TokenTextSplitter(
    model_name="gpt-4",
    chunk_size=1000,
    chunk_overlap=100
)

# Custom special token handling
custom_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-3.5/GPT-4 encoding
    allowed_special={"<|endoftext|>"},
    disallowed_special="all",
    chunk_size=800
)
```

### Sentence Transformer Token Splitting

Token splitting using sentence transformer models, optimized for embedding-based applications.

```python { .api }
class SentenceTransformersTokenTextSplitter(TextSplitter):
    def __init__(
        self,
        chunk_overlap: int = 50,
        model_name: str = "sentence-transformers/all-mpnet-base-v2",
        tokens_per_chunk: Optional[int] = None,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...

    def count_tokens(self, text: str) -> int: ...
```

**Parameters:**

- `chunk_overlap`: Token overlap between chunks (default: `50`)
- `model_name`: Sentence transformer model name (default: `"sentence-transformers/all-mpnet-base-v2"`)
- `tokens_per_chunk`: Maximum tokens per chunk (overrides `chunk_size`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Methods:**

- `count_tokens()`: Count tokens in text using the model's tokenizer

**Usage:**

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Basic sentence transformer splitting
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_overlap=50,
    tokens_per_chunk=384  # Common embedding model context size
)

text = "Document to be split for embedding..."
chunks = splitter.split_text(text)

# Count tokens in text
token_count = splitter.count_tokens("Sample text to count")

# Different embedding models
distilbert_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/distilbert-base-nli-mean-tokens",
    tokens_per_chunk=512
)

roberta_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-roberta-large-v1",
    tokens_per_chunk=256
)
```

### Factory Methods for Token Splitting

Convenient factory methods on the base `TextSplitter` class for creating token-based splitters. Because `TextSplitter` is abstract, call these methods on a concrete subclass such as `CharacterTextSplitter`.

```python { .api }
class TextSplitter:
    @classmethod
    def from_huggingface_tokenizer(
        cls,
        tokenizer: Any,
        **kwargs: Any
    ) -> "TextSplitter": ...

    @classmethod
    def from_tiktoken_encoder(
        cls,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any
    ) -> Self: ...
```

**Factory Methods:**

- `from_huggingface_tokenizer()`: Create splitter from HuggingFace tokenizer
- `from_tiktoken_encoder()`: Create splitter from tiktoken encoder

**Usage:**

```python
from langchain_text_splitters import CharacterTextSplitter
from transformers import AutoTokenizer

# The factory methods are defined on TextSplitter, but TextSplitter itself
# is abstract, so call them on a concrete subclass.

# Create splitter from HuggingFace tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
hf_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=hf_tokenizer,
    chunk_size=512,
    chunk_overlap=50
)

# Create splitter from tiktoken encoder
tiktoken_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=1000,
    chunk_overlap=100
)
```

### Tokenizer Configuration

Low-level tokenizer configuration for advanced use cases.

```python { .api }
@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]

def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]: ...
```

**Usage:**

```python
from langchain_text_splitters import Tokenizer, split_text_on_tokens
import tiktoken

# Create custom tokenizer configuration
encoding = tiktoken.get_encoding("gpt2")
custom_tokenizer = Tokenizer(
    chunk_overlap=50,
    tokens_per_chunk=500,
    decode=encoding.decode,
    encode=encoding.encode
)

# Use tokenizer to split text
text = "Text to be split using custom tokenizer..."
chunks = split_text_on_tokens(text=text, tokenizer=custom_tokenizer)
```
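The sliding-window behaviour these fields imply can be sketched with a toy word-level tokenizer. This is a self-contained sketch of the presumed algorithm (encode once, emit fixed-size windows that step forward by `tokens_per_chunk - chunk_overlap`), not the library implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ToyTokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]

def toy_split_on_tokens(*, text: str, tokenizer: ToyTokenizer) -> list[str]:
    # Encode once, then emit windows of tokens_per_chunk token ids that
    # step forward by (tokens_per_chunk - chunk_overlap), so consecutive
    # chunks share chunk_overlap tokens.
    ids = tokenizer.encode(text)
    step = tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start : start + tokenizer.tokens_per_chunk]
        chunks.append(tokenizer.decode(window))
        if start + tokenizer.tokens_per_chunk >= len(ids):
            break
    return chunks

# Toy vocabulary: each whitespace-separated word is one "token".
words = "one two three four five six seven eight".split()
vocab = {w: i for i, w in enumerate(words)}
tok = ToyTokenizer(
    chunk_overlap=1,
    tokens_per_chunk=3,
    decode=lambda ids: " ".join(words[i] for i in ids),
    encode=lambda s: [vocab[w] for w in s.split()],
)
chunks = toy_split_on_tokens(text="one two three four five six seven eight", tokenizer=tok)
# Each chunk holds up to 3 tokens and overlaps its neighbour by 1:
# ["one two three", "three four five", "five six seven", "seven eight"]
```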

## Supported Encodings and Models

### Tiktoken Encodings

- `gpt2`: GPT-2 and smaller GPT-3 models
- `r50k_base`: text-davinci-002, text-davinci-003
- `p50k_base`: Code models, text-davinci-edit-001, text-similarity-*
- `cl100k_base`: GPT-3.5, GPT-4, text-embedding-ada-002

### Popular Sentence Transformer Models

- `all-mpnet-base-v2`: High-quality general-purpose embeddings
- `all-MiniLM-L6-v2`: Fast and efficient embeddings
- `distilbert-base-nli-mean-tokens`: Lightweight BERT-based embeddings
- `all-roberta-large-v1`: High-quality RoBERTa-based embeddings

## Best Practices

1. **Match model encodings**: Use the same tokenizer as your target language model
2. **Account for context limits**: Set chunk sizes well below model context limits to leave room for prompts and completions
3. **Optimize for embeddings**: For RAG applications, use sentence transformer token splitting
4. **Consider special tokens**: Configure special token handling based on your use case
5. **Monitor token usage**: Use the `count_tokens()` method to verify chunk sizes
6. **Test with your data**: Different text types may tokenize differently
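Practices 2 and 5 combine naturally into a guard that checks every chunk against a token budget. A minimal sketch: `verify_chunks` is a hypothetical helper, and the word-counting lambda below stands in for a real counter such as `SentenceTransformersTokenTextSplitter.count_tokens`:

```python
from typing import Callable

def verify_chunks(
    chunks: list[str],
    count_tokens: Callable[[str], int],
    limit: int,
) -> list[int]:
    # Count tokens per chunk and fail loudly if any chunk exceeds the budget.
    counts = [count_tokens(chunk) for chunk in chunks]
    oversized = [i for i, n in enumerate(counts) if n > limit]
    if oversized:
        raise ValueError(f"chunks over {limit} tokens at indices {oversized}")
    return counts

# Toy counter: one token per whitespace-separated word, budget of 4 tokens.
counts = verify_chunks(["a b c", "d e f g"], lambda s: len(s.split()), 4)
# counts == [3, 4]
```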