# Tokenization

Comprehensive tokenization support for 100+ tokenizers, covering subword tokenization, special tokens, efficient batch processing, and cross-framework compatibility. The tokenization system provides a consistent API across architectures while optimizing for speed and memory efficiency.

## Capabilities

### Auto Tokenizer

Automatic tokenizer selection based on model names or configurations.

```python { .api }
class AutoTokenizer:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        *inputs,
        cache_dir: Union[str, os.PathLike] = None,
        force_download: bool = False,
        local_files_only: bool = False,
        token: Union[str, bool] = None,
        revision: str = "main",
        use_fast: bool = True,
        tokenizer_type: Optional[str] = None,
        trust_remote_code: bool = False,
        **kwargs
    ) -> PreTrainedTokenizer:
        """
        Load a tokenizer, automatically detecting its type.

        Args:
            pretrained_model_name_or_path: Model name or path
            cache_dir: Custom cache directory
            force_download: Force fresh download
            local_files_only: Only use local files
            token: Authentication token
            revision: Model revision/branch
            use_fast: Use fast (Rust-based) tokenizer when available
            tokenizer_type: Override auto-detected tokenizer type
            trust_remote_code: Allow custom tokenizer code

        Returns:
            Loaded tokenizer instance
        """
```
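A minimal loading sketch using the parameters above (the checkpoint names are illustrative, and the commented private-repo call assumes a valid Hub token):

```python
from transformers import AutoTokenizer

# Default: picks the fast (Rust-based) tokenizer when one exists for the checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Force the pure-Python implementation instead
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

# Pin a specific revision; "my-org/private-model" and the token value are placeholders
# private_tokenizer = AutoTokenizer.from_pretrained(
#     "my-org/private-model", revision="v1.0", token="hf_..."
# )
```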
### Base Tokenizer Classes

Foundation classes for all tokenizer implementations.

```python { .api }
class PreTrainedTokenizer:
    """Base class for all Python tokenizers."""

    def __init__(
        self,
        model_max_length: int = None,
        padding_side: str = "right",
        truncation_side: str = "right",
        chat_template: str = None,
        model_input_names: List[str] = None,
        bos_token: Union[str, AddedToken] = None,
        eos_token: Union[str, AddedToken] = None,
        unk_token: Union[str, AddedToken] = None,
        sep_token: Union[str, AddedToken] = None,
        pad_token: Union[str, AddedToken] = None,
        cls_token: Union[str, AddedToken] = None,
        mask_token: Union[str, AddedToken] = None,
        additional_special_tokens: List[Union[str, AddedToken]] = None,
        **kwargs
    )

    def __call__(
        self,
        text: Union[str, List[str], List[List[str]]] = None,
        text_pair: Union[str, List[str], List[List[str]]] = None,
        text_target: Union[str, List[str], List[List[str]]] = None,
        text_pair_target: Union[str, List[str], List[List[str]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str] = False,
        truncation: Union[bool, str] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs
    ) -> BatchEncoding:
        """
        Main tokenization method with extensive options.

        Args:
            text: Input text(s) to tokenize
            text_pair: Paired text for sequence pair tasks
            add_special_tokens: Add model-specific special tokens
            padding: Padding strategy ("longest", "max_length", True, False)
            truncation: Truncation strategy (True, False, "longest_first", etc.)
            max_length: Maximum sequence length
            stride: Stride for overlapping windows
            is_split_into_words: Whether input is pre-tokenized
            pad_to_multiple_of: Pad length to multiple of this value
            return_tensors: Format of returned tensors ("pt", "tf", "np")
            return_token_type_ids: Include token type IDs
            return_attention_mask: Include attention mask
            return_overflowing_tokens: Return overflowing tokens
            return_special_tokens_mask: Mark special tokens
            return_offsets_mapping: Include character-to-token mapping
            return_length: Include sequence lengths

        Returns:
            BatchEncoding with tokenized inputs
        """

    def encode(
        self,
        text: Union[str, List[str], List[int]],
        text_pair: Optional[Union[str, List[str]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str] = False,
        truncation: Union[bool, str] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs
    ) -> List[int]:
        """
        Encode text to token IDs.

        Args:
            text: Text to encode
            text_pair: Paired text for sequence pairs
            add_special_tokens: Add special tokens
            padding: Padding strategy
            truncation: Truncation strategy
            max_length: Maximum sequence length
            stride: Stride for overlapping windows
            return_tensors: Format of returned tensors

        Returns:
            List of token IDs
        """

    def decode(
        self,
        token_ids: Union[int, List[int], torch.Tensor, tf.Tensor, np.ndarray],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = None,
        **kwargs
    ) -> str:
        """
        Decode token IDs back to text.

        Args:
            token_ids: Token IDs to decode
            skip_special_tokens: Skip special tokens in output
            clean_up_tokenization_spaces: Clean tokenization artifacts

        Returns:
            Decoded text string
        """

    def tokenize(
        self,
        text: str,
        pair: Optional[str] = None,
        add_special_tokens: bool = False,
        **kwargs
    ) -> List[str]:
        """
        Tokenize text into tokens (not IDs).

        Args:
            text: Text to tokenize
            pair: Paired text for sequence pairs
            add_special_tokens: Add special tokens

        Returns:
            List of token strings
        """

    def convert_tokens_to_ids(
        self,
        tokens: Union[str, List[str]]
    ) -> Union[int, List[int]]:
        """Convert tokens to corresponding IDs."""

    def convert_ids_to_tokens(
        self,
        ids: Union[int, List[int]],
        skip_special_tokens: bool = False
    ) -> Union[str, List[str]]:
        """Convert IDs to corresponding tokens."""

    def add_special_tokens(
        self,
        special_tokens_dict: Dict[str, Union[str, AddedToken]]
    ) -> int:
        """
        Add special tokens to the vocabulary.

        Args:
            special_tokens_dict: Dictionary of special tokens

        Returns:
            Number of tokens added
        """

    def save_pretrained(
        self,
        save_directory: Union[str, os.PathLike],
        legacy_format: Optional[bool] = None,
        filename_prefix: Optional[str] = None,
        push_to_hub: bool = False,
        **kwargs
    ) -> Tuple[str]:
        """Save tokenizer to directory."""

class PreTrainedTokenizerFast:
    """Base class for fast (Rust-based) tokenizers."""

    def __init__(
        self,
        tokenizer_object: Optional["Tokenizer"] = None,
        tokenizer_file: Optional[str] = None,
        **kwargs
    )

    # Provides the same methods as PreTrainedTokenizer, backed by optimized Rust implementations

    def train_new_from_iterator(
        self,
        text_iterator: Iterator[str],
        vocab_size: int,
        length: Optional[int] = None,
        new_special_tokens: Optional[List[str]] = None,
        special_tokens_map: Optional[Dict[str, str]] = None,
        **kwargs
    ) -> "PreTrainedTokenizerFast":
        """Train a new tokenizer from a text iterator."""

    def push_to_hub(
        self,
        repo_id: str,
        use_temp_dir: Optional[bool] = None,
        commit_message: Optional[str] = None,
        private: Optional[bool] = None,
        token: Union[bool, str] = None,
        **kwargs
    ) -> str:
        """Upload tokenizer to Hugging Face Hub."""
```
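As a sketch of `train_new_from_iterator`, the snippet below retrains a GPT-2-style tokenizer on an in-memory corpus; the corpus and vocabulary size are illustrative, and in practice you would stream lines from a real dataset:

```python
from transformers import AutoTokenizer

# Start from an existing fast tokenizer; its algorithm (byte-level BPE here) is reused for training
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

corpus = [
    "example sentence from the target domain",
    "another sentence with domain-specific vocabulary",
]  # illustrative data

new_tokenizer = base_tokenizer.train_new_from_iterator(iter(corpus), vocab_size=8000)
new_tokenizer.save_pretrained("./domain-tokenizer")
```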
### Batch Encoding

Container for tokenizer outputs with tensor conversion capabilities.

```python { .api }
class BatchEncoding:
    """Container for tokenized inputs with convenient methods."""

    def __init__(
        self,
        data: Optional[Dict[str, Any]] = None,
        encoding: Optional[List["EncodingFast"]] = None,
        tensor_type: Union[None, str, TensorType] = None,
        prepend_batch_axis: bool = False,
        n_sequences: Optional[int] = None
    )

    def __getitem__(self, item: Union[str, int]) -> Union[Any, List[Any]]:
        """Access tokenized data by key or index."""

    def __setitem__(self, key: str, value: Any) -> None:
        """Set tokenized data value."""

    def keys(self) -> List[str]:
        """Get all available keys."""

    def values(self) -> List[Any]:
        """Get all values."""

    def items(self) -> List[Tuple[str, Any]]:
        """Get key-value pairs."""

    def to(
        self,
        device: Union[str, torch.device, int]
    ) -> "BatchEncoding":
        """Move tensors to specified device."""

    def convert_to_tensors(
        self,
        tensor_type: Optional[Union[str, TensorType]] = None,
        prepend_batch_axis: bool = False
    ) -> "BatchEncoding":
        """Convert to specified tensor format."""

    @property
    def input_ids(self) -> Optional[List[List[int]]]:
        """Token IDs for input sequences."""

    @property
    def attention_mask(self) -> Optional[List[List[int]]]:
        """Attention mask (1 for real tokens, 0 for padding)."""

    @property
    def token_type_ids(self) -> Optional[List[List[int]]]:
        """Token type IDs for sequence pairs."""

    def char_to_token(
        self,
        batch_or_char_index: int,
        char_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[int]:
        """Convert character index to token index."""

    def token_to_chars(
        self,
        batch_or_token_index: int,
        token_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[Tuple[int, int]]:
        """Convert token index to character span."""

    def word_to_tokens(
        self,
        batch_or_word_index: int,
        word_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[Tuple[int, int]]:
        """Convert word index to token span."""
```
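A short sketch of working with a `BatchEncoding` (assumes PyTorch is installed and a fast tokenizer, since the position lookups need one):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world"], return_tensors="pt")

# Move every tensor in the encoding at once
device = "cuda" if torch.cuda.is_available() else "cpu"
batch = batch.to(device)
print(batch.input_ids.shape, batch["attention_mask"].shape)

# Position lookups (fast tokenizers only)
token_idx = batch.char_to_token(0, 6)           # character 6 of example 0 -> token index
char_span = batch.token_to_chars(0, token_idx)  # token index -> (start_char, end_char)
print(token_idx, char_span)
```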
### Popular Tokenizer Implementations

#### BERT Tokenizers
```python { .api }
class BertTokenizer(PreTrainedTokenizer):
    """BERT WordPiece tokenizer."""

class BertTokenizerFast(PreTrainedTokenizerFast):
    """Fast BERT tokenizer."""
```

#### GPT Tokenizers
```python { .api }
class GPT2Tokenizer(PreTrainedTokenizer):
    """GPT-2 BPE tokenizer."""

class GPT2TokenizerFast(PreTrainedTokenizerFast):
    """Fast GPT-2 tokenizer."""
```

#### T5 Tokenizers
```python { .api }
class T5Tokenizer(PreTrainedTokenizer):
    """T5 SentencePiece tokenizer."""

class T5TokenizerFast(PreTrainedTokenizerFast):
    """Fast T5 tokenizer."""
```

#### RoBERTa Tokenizers
```python { .api }
class RobertaTokenizer(PreTrainedTokenizer):
    """RoBERTa BPE tokenizer."""

class RobertaTokenizerFast(PreTrainedTokenizerFast):
    """Fast RoBERTa tokenizer."""
```
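Model-specific classes can also be loaded directly; the sketch below contrasts WordPiece and byte-level BPE on the same word (the exact splits depend on each checkpoint's vocabulary):

```python
from transformers import BertTokenizerFast, GPT2TokenizerFast

bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")               # byte-level BPE

print(bert_tok.tokenize("tokenization"))  # WordPiece split, e.g. ['token', '##ization']
print(gpt2_tok.tokenize("tokenization"))  # BPE split, e.g. ['token', 'ization']
```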
### Special Token Handling

```python { .api }
class AddedToken:
    """Represents a token that was added to the vocabulary."""

    def __init__(
        self,
        content: str,
        single_word: bool = False,
        lstrip: bool = False,
        rstrip: bool = False,
        normalized: bool = True,
        special: bool = False
    ):
        """
        Create an added token.

        Args:
            content: Token content
            single_word: Only match the token as a standalone word, never inside another word
            lstrip: Strip whitespace immediately to the left of the token when matching
            rstrip: Strip whitespace immediately to the right of the token when matching
            normalized: Match against the normalized form of the input text
            special: Whether this is a special token
        """
```
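For instance, a custom marker can be registered as an `AddedToken` (the `[ENTITY]` marker is illustrative; after extending the vocabulary, a downstream model's embedding matrix typically needs resizing as well):

```python
from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A marker that is never split and only matches as a whole word
marker = AddedToken("[ENTITY]", single_word=True, lstrip=False, rstrip=False)
num_added = tokenizer.add_special_tokens({"additional_special_tokens": [marker]})

print(num_added, tokenizer.convert_tokens_to_ids("[ENTITY]"))
# If the tokenizer feeds a model, remember e.g. model.resize_token_embeddings(len(tokenizer))
```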
### Tokenization Utilities

Helper functions for common tokenization tasks.

```python { .api }
def is_tokenizers_available() -> bool:
    """Check if the tokenizers library is available."""

def clean_up_tokenization(text: str) -> str:
    """Clean up tokenization artifacts in text."""

def get_pairs(word: Tuple[str, ...]) -> Set[Tuple[str, str]]:
    """Get all adjacent symbol pairs in a word (for BPE)."""
```
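As an illustration of what the BPE pair helper computes, here is a standalone re-implementation (a sketch, not the library's internal code) plus a sample call:

```python
from typing import Set, Tuple

def get_pairs(word: Tuple[str, ...]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent symbol pairs in a word, as scanned by BPE merges."""
    return set(zip(word, word[1:]))

print(get_pairs(("l", "o", "w", "e", "r")))
# {('l', 'o'), ('o', 'w'), ('w', 'e'), ('e', 'r')}
```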
## Tokenization Examples

Common tokenization patterns and use cases:

```python
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
text = "Hello, world!"
tokens = tokenizer.tokenize(text)
# Output: ['hello', ',', 'world', '!']

# Encode to IDs
token_ids = tokenizer.encode(text)
# Output: [101, 7592, 1010, 2088, 999, 102]  # [CLS] + tokens + [SEP]

# Decode back to text
decoded = tokenizer.decode(token_ids)
# Output: "[CLS] hello, world! [SEP]"

# Skip special tokens
decoded_clean = tokenizer.decode(token_ids, skip_special_tokens=True)
# Output: "hello, world!"

# Batch processing with padding
texts = ["Short text", "This is a much longer text that will be truncated"]
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=10,
    return_tensors="pt"
)
# Returns BatchEncoding with padded/truncated sequences

# Sequence pairs (for tasks like similarity, NLI)
result = tokenizer(
    "What is AI?",
    "Artificial Intelligence is machine learning.",
    padding=True,
    return_tensors="pt"
)

# Add custom special tokens
num_added = tokenizer.add_special_tokens({
    "additional_special_tokens": ["[CUSTOM]", "[SPECIAL]"]
})

# Character-to-token mapping (fast tokenizers only)
encoding = tokenizer("Hello world", return_offsets_mapping=True)
char_to_token = encoding.char_to_token(6)  # Character at position 6 -> token index
```
## Fast vs Slow Tokenizers

The library provides both Python-based ("slow") and Rust-based ("fast") tokenizers:

**Fast Tokenizers (Recommended):**
- Rust-based implementation for superior speed
- Better memory efficiency
- Additional features like offset mapping
- Parallel processing capabilities
- Available for most popular models

**Slow Tokenizers:**
- Pure Python implementation
- Full compatibility and customization
- Fallback when fast tokenizer unavailable
- Better for research and custom modifications

Use `use_fast=True` (default) to automatically select fast tokenizers when available.
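A quick way to confirm which implementation was loaded, and to see a fast-only feature in action (a sketch; it assumes the checkpoint ships both implementations):

```python
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("bert-base-uncased")                  # fast by default
slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)  # pure Python

print(fast.is_fast, slow.is_fast)  # True False

# Offset mapping is only available on fast tokenizers
enc = fast("New York City", return_offsets_mapping=True)
print(enc["offset_mapping"])
```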