# Token-Based Text Splitting

Token-based splitting segments text according to a tokenization model rather than character counts. This gives precise control over chunk sizes measured in tokens, which is essential for language model applications with token-based context limits.

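Character counts are a poor proxy for token counts, which is the core motivation for token-based splitting. A rough illustration of the gap, using a whitespace split as a stand-in tokenizer (real tokenizers such as tiktoken operate on sub-word units, so the relationship is even less predictable):

```python
# Illustration only: character length and token length diverge.
# A whitespace split stands in for a real tokenizer here; actual
# tokenizers (e.g. tiktoken) operate on sub-word units.

def toy_token_count(text: str) -> int:
    """Count 'tokens' by whitespace splitting (illustrative stand-in)."""
    return len(text.split())

text = "Large language models enforce limits in tokens, not characters."
print(len(text))              # 63 characters
print(toy_token_count(text))  # 9 whitespace 'tokens'
```

Splitting this text into 30-character chunks says nothing reliable about how many tokens each chunk consumes, which is why the splitters below measure chunks in tokens directly.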
## Capabilities

### OpenAI Token Splitting

Text splitting based on OpenAI's tiktoken tokenizer, supporting various encoding schemes and models.

```python { .api }
class TokenTextSplitter(TextSplitter):
    def __init__(
        self,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], set[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...
```

**Parameters:**
- `encoding_name`: Tiktoken encoding name (default: `"gpt2"`)
- `model_name`: Optional OpenAI model name used to determine the encoding
- `allowed_special`: Special tokens allowed during encoding
- `disallowed_special`: Special tokens that raise errors during encoding
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Usage:**

```python
from langchain_text_splitters import TokenTextSplitter

# Basic token splitting with GPT-2 encoding
splitter = TokenTextSplitter(
    encoding_name="gpt2",
    chunk_size=512,  # 512 tokens per chunk
    chunk_overlap=50,
)
chunks = splitter.split_text("Long text to be tokenized and split...")

# Model-specific token splitting
gpt4_splitter = TokenTextSplitter(
    model_name="gpt-4",
    chunk_size=1000,
    chunk_overlap=100,
)

# Custom special token handling
custom_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # GPT-3.5/GPT-4 encoding
    allowed_special={"<|endoftext|>"},
    disallowed_special="all",
    chunk_size=800,
)
```

### Sentence Transformer Token Splitting

Token splitting using sentence transformer models, optimized for embedding-based applications.

```python { .api }
class SentenceTransformersTokenTextSplitter(TextSplitter):
    def __init__(
        self,
        chunk_overlap: int = 50,
        model_name: str = "sentence-transformers/all-mpnet-base-v2",
        tokens_per_chunk: Optional[int] = None,
        **kwargs: Any
    ) -> None: ...

    def split_text(self, text: str) -> list[str]: ...

    def count_tokens(self, text: str) -> int: ...
```

**Parameters:**
- `chunk_overlap`: Token overlap between chunks (default: `50`)
- `model_name`: Sentence transformer model name (default: `"sentence-transformers/all-mpnet-base-v2"`)
- `tokens_per_chunk`: Maximum tokens per chunk (overrides `chunk_size`)
- `**kwargs`: Additional parameters passed to `TextSplitter.__init__()`

**Methods:**
- `count_tokens()`: Count tokens in text using the model's tokenizer

**Usage:**

```python
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Basic sentence transformer splitting
splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_overlap=50,
    tokens_per_chunk=384,  # Common embedding model context size
)

text = "Document to be split for embedding..."
chunks = splitter.split_text(text)

# Count tokens in text
token_count = splitter.count_tokens("Sample text to count")

# Different embedding models
distilbert_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/distilbert-base-nli-mean-tokens",
    tokens_per_chunk=512,
)

roberta_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-roberta-large-v1",
    tokens_per_chunk=256,
)
```

### Factory Methods for Token Splitting

Convenient factory methods defined on the base `TextSplitter` class and inherited by its concrete subclasses, for creating token-aware splitters.

```python { .api }
class TextSplitter:
    @classmethod
    def from_huggingface_tokenizer(
        cls,
        tokenizer: Any,
        **kwargs: Any
    ) -> "TextSplitter": ...

    @classmethod
    def from_tiktoken_encoder(
        cls,
        encoding_name: str = "gpt2",
        model_name: Optional[str] = None,
        allowed_special: Union[Literal["all"], AbstractSet[str]] = set(),
        disallowed_special: Union[Literal["all"], Collection[str]] = "all",
        **kwargs: Any
    ) -> Self: ...
```

**Factory Methods:**
- `from_huggingface_tokenizer()`: Create a splitter from a HuggingFace tokenizer
- `from_tiktoken_encoder()`: Create a splitter from a tiktoken encoder

**Usage:**

```python
from langchain_text_splitters import CharacterTextSplitter
from transformers import AutoTokenizer

# TextSplitter is abstract, so call the factory methods on a
# concrete subclass such as CharacterTextSplitter.

# Create splitter from HuggingFace tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
hf_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=hf_tokenizer,
    chunk_size=512,
    chunk_overlap=50,
)

# Create splitter from tiktoken encoder
tiktoken_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=1000,
    chunk_overlap=100,
)
```

### Tokenizer Configuration

Low-level tokenizer configuration for advanced use cases.

```python { .api }
@dataclass(frozen=True)
class Tokenizer:
    chunk_overlap: int
    tokens_per_chunk: int
    decode: Callable[[list[int]], str]
    encode: Callable[[str], list[int]]

def split_text_on_tokens(*, text: str, tokenizer: Tokenizer) -> list[str]: ...
```

**Usage:**

```python
from langchain_text_splitters import Tokenizer, split_text_on_tokens
import tiktoken

# Create custom tokenizer configuration
encoding = tiktoken.get_encoding("gpt2")
custom_tokenizer = Tokenizer(
    chunk_overlap=50,
    tokens_per_chunk=500,
    decode=encoding.decode,
    encode=encoding.encode,
)

# Use tokenizer to split text
text = "Text to be split using custom tokenizer..."
chunks = split_text_on_tokens(text=text, tokenizer=custom_tokenizer)
```

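For intuition, this style of splitting amounts to decoding overlapping windows over the encoded token ids. A self-contained sketch of that windowing logic, using a toy character-level codec so it runs without tiktoken (the function name and the toy encoder are illustrative, not library APIs):

```python
from typing import Callable

def sliding_window_split(
    text: str,
    encode: Callable[[str], list[int]],
    decode: Callable[[list[int]], str],
    tokens_per_chunk: int,
    chunk_overlap: int,
) -> list[str]:
    """Sketch of token-window splitting: encode the text once, then
    decode overlapping windows of token ids back into text chunks."""
    ids = encode(text)
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(ids), step):
        window = ids[start : start + tokens_per_chunk]
        chunks.append(decode(window))
        if start + tokens_per_chunk >= len(ids):
            break
    return chunks

# Toy character-level "tokenizer" so the sketch runs without tiktoken.
encode = lambda s: [ord(c) for c in s]
decode = lambda ids: "".join(chr(i) for i in ids)

chunks = sliding_window_split("abcdefghij", encode, decode,
                              tokens_per_chunk=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Note how each chunk shares `chunk_overlap` tokens with its neighbor, which is what preserves context across chunk boundaries.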
## Supported Encodings and Models

### Tiktoken Encodings
- `gpt2`: GPT-2 (equivalent to `r50k_base`)
- `r50k_base`: Base GPT-3 models (davinci, curie, babbage, ada) and first-generation similarity/search models
- `p50k_base`: Codex code models, text-davinci-002, text-davinci-003
- `cl100k_base`: GPT-3.5, GPT-4, text-embedding-ada-002

### Popular Sentence Transformer Models
- `all-mpnet-base-v2`: High-quality general-purpose embeddings
- `all-MiniLM-L6-v2`: Fast and efficient embeddings
- `distilbert-base-nli-mean-tokens`: Lightweight BERT-based embeddings
- `all-roberta-large-v1`: High-quality RoBERTa-based embeddings

## Best Practices

1. **Match model encodings**: Use the same tokenizer as your target language model
2. **Account for context limits**: Set chunk sizes well below model context limits
3. **Optimize for embeddings**: For RAG applications, use sentence transformer token splitting
4. **Consider special tokens**: Configure special token handling based on your use case
5. **Monitor token usage**: Use the `count_tokens()` method to verify chunk sizes
6. **Test with your data**: Different text types may tokenize differently

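As a complement to practice 5, a small helper (the name `oversized_chunks` is ours, not a library API) that flags chunks exceeding a token budget given any counting function, such as `SentenceTransformersTokenTextSplitter.count_tokens` or a tiktoken-backed counter:

```python
from typing import Callable

def oversized_chunks(
    chunks: list[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> list[int]:
    """Return the indices of chunks whose token count exceeds the budget.

    `count_tokens` can be splitter.count_tokens, a function built on a
    tiktoken encoding, or any other tokenizer-backed counter.
    """
    return [i for i, c in enumerate(chunks) if count_tokens(c) > max_tokens]

# Example with a whitespace counter standing in for a real tokenizer:
chunks = ["short chunk", "a much longer chunk with many more words in it"]
bad = oversized_chunks(chunks, lambda s: len(s.split()), max_tokens=5)
print(bad)  # [1]
```

Running a check like this after splitting catches chunks that would silently overflow the downstream model's context window.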