# Tokenization

Comprehensive tokenization support for 100+ tokenizers, covering subword tokenization, special tokens, efficient batch processing, and cross-framework compatibility. The tokenization system provides a consistent API across architectures while optimizing for speed and memory efficiency.

## Capabilities

### Auto Tokenizer

Automatic tokenizer selection based on model names or configurations.

```python { .api }
class AutoTokenizer:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        *inputs,
        cache_dir: Union[str, os.PathLike] = None,
        force_download: bool = False,
        local_files_only: bool = False,
        token: Union[str, bool] = None,
        revision: str = "main",
        use_fast: bool = True,
        tokenizer_type: Optional[str] = None,
        trust_remote_code: bool = False,
        **kwargs
    ) -> PreTrainedTokenizer:
        """
        Load a tokenizer, automatically detecting its type.

        Args:
            pretrained_model_name_or_path: Model name or path
            cache_dir: Custom cache directory
            force_download: Force fresh download
            local_files_only: Only use local files
            token: Authentication token
            revision: Model revision/branch
            use_fast: Use fast (Rust-based) tokenizer when available
            tokenizer_type: Override auto-detected tokenizer type
            trust_remote_code: Allow custom tokenizer code

        Returns:
            Loaded tokenizer instance
        """
```
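A minimal loading sketch using the parameters above (the checkpoint names are illustrative, and the commented private-repo call assumes a valid Hub token):

```python
from transformers import AutoTokenizer

# Default: picks the fast (Rust-based) tokenizer when one exists for the checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Force the pure-Python implementation instead
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)

# Pin a specific revision; "my-org/private-model" and the token value are placeholders
# private_tokenizer = AutoTokenizer.from_pretrained(
#     "my-org/private-model", revision="v1.0", token="hf_..."
# )
```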
### Base Tokenizer Classes

Foundation classes for all tokenizer implementations.

```python { .api }
class PreTrainedTokenizer:
    """Base class for all Python tokenizers."""

    def __init__(
        self,
        model_max_length: int = None,
        padding_side: str = "right",
        truncation_side: str = "right",
        chat_template: str = None,
        model_input_names: List[str] = None,
        bos_token: Union[str, AddedToken] = None,
        eos_token: Union[str, AddedToken] = None,
        unk_token: Union[str, AddedToken] = None,
        sep_token: Union[str, AddedToken] = None,
        pad_token: Union[str, AddedToken] = None,
        cls_token: Union[str, AddedToken] = None,
        mask_token: Union[str, AddedToken] = None,
        additional_special_tokens: List[Union[str, AddedToken]] = None,
        **kwargs
    )

    def __call__(
        self,
        text: Union[str, List[str], List[List[str]]] = None,
        text_pair: Union[str, List[str], List[List[str]]] = None,
        text_target: Union[str, List[str], List[List[str]]] = None,
        text_pair_target: Union[str, List[str], List[List[str]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str] = False,
        truncation: Union[bool, str] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs
    ) -> BatchEncoding:
        """
        Main tokenization method with extensive options.

        Args:
            text: Input text(s) to tokenize
            text_pair: Paired text for sequence pair tasks
            add_special_tokens: Add model-specific special tokens
            padding: Padding strategy ("longest", "max_length", True, False)
            truncation: Truncation strategy (True, False, "longest_first", etc.)
            max_length: Maximum sequence length
            stride: Stride for overlapping windows
            is_split_into_words: Whether input is pre-tokenized
            pad_to_multiple_of: Pad length to multiple of this value
            return_tensors: Format of returned tensors ("pt", "tf", "np")
            return_token_type_ids: Include token type IDs
            return_attention_mask: Include attention mask
            return_overflowing_tokens: Return overflowing tokens
            return_special_tokens_mask: Mark special tokens
            return_offsets_mapping: Include character-to-token mapping
            return_length: Include sequence lengths

        Returns:
            BatchEncoding with tokenized inputs
        """

    def encode(
        self,
        text: Union[str, List[str], List[int]],
        text_pair: Optional[Union[str, List[str]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str] = False,
        truncation: Union[bool, str] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs
    ) -> List[int]:
        """
        Encode text to token IDs.

        Args:
            text: Text to encode
            text_pair: Paired text for sequence pairs
            add_special_tokens: Add special tokens
            padding: Padding strategy
            truncation: Truncation strategy
            max_length: Maximum sequence length
            stride: Stride for overlapping windows
            return_tensors: Format of returned tensors

        Returns:
            List of token IDs
        """

    def decode(
        self,
        token_ids: Union[int, List[int], torch.Tensor, tf.Tensor, np.ndarray],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = None,
        **kwargs
    ) -> str:
        """
        Decode token IDs back to text.

        Args:
            token_ids: Token IDs to decode
            skip_special_tokens: Skip special tokens in output
            clean_up_tokenization_spaces: Clean tokenization artifacts

        Returns:
            Decoded text string
        """

    def tokenize(
        self,
        text: str,
        pair: Optional[str] = None,
        add_special_tokens: bool = False,
        **kwargs
    ) -> List[str]:
        """
        Tokenize text into tokens (not IDs).

        Args:
            text: Text to tokenize
            pair: Paired text for sequence pairs
            add_special_tokens: Add special tokens

        Returns:
            List of token strings
        """

    def convert_tokens_to_ids(
        self,
        tokens: Union[str, List[str]]
    ) -> Union[int, List[int]]:
        """Convert tokens to corresponding IDs."""

    def convert_ids_to_tokens(
        self,
        ids: Union[int, List[int]],
        skip_special_tokens: bool = False
    ) -> Union[str, List[str]]:
        """Convert IDs to corresponding tokens."""

    def add_special_tokens(
        self,
        special_tokens_dict: Dict[str, Union[str, AddedToken]]
    ) -> int:
        """
        Add special tokens to the vocabulary.

        Args:
            special_tokens_dict: Dictionary of special tokens

        Returns:
            Number of tokens added
        """

    def save_pretrained(
        self,
        save_directory: Union[str, os.PathLike],
        legacy_format: Optional[bool] = None,
        filename_prefix: Optional[str] = None,
        push_to_hub: bool = False,
        **kwargs
    ) -> Tuple[str]:
        """Save tokenizer to directory."""

class PreTrainedTokenizerFast:
    """Base class for fast (Rust-based) tokenizers."""

    def __init__(
        self,
        tokenizer_object: Optional["Tokenizer"] = None,
        tokenizer_file: Optional[str] = None,
        **kwargs
    )

    # Provides the same methods as PreTrainedTokenizer, backed by optimized Rust implementations

    def train_new_from_iterator(
        self,
        text_iterator: Iterator[str],
        vocab_size: int,
        length: Optional[int] = None,
        new_special_tokens: Optional[List[str]] = None,
        special_tokens_map: Optional[Dict[str, str]] = None,
        **kwargs
    ) -> "PreTrainedTokenizerFast":
        """Train a new tokenizer from a text iterator."""

    def push_to_hub(
        self,
        repo_id: str,
        use_temp_dir: Optional[bool] = None,
        commit_message: Optional[str] = None,
        private: Optional[bool] = None,
        token: Union[bool, str] = None,
        **kwargs
    ) -> str:
        """Upload tokenizer to Hugging Face Hub."""
```
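As a sketch of `train_new_from_iterator`, the snippet below retrains a GPT-2-style tokenizer on an in-memory corpus; the corpus and vocabulary size are illustrative, and in practice you would stream lines from a real dataset:

```python
from transformers import AutoTokenizer

# Start from an existing fast tokenizer; its algorithm (byte-level BPE here) is reused for training
base_tokenizer = AutoTokenizer.from_pretrained("gpt2")

corpus = [
    "example sentence from the target domain",
    "another sentence with domain-specific vocabulary",
]  # illustrative data

new_tokenizer = base_tokenizer.train_new_from_iterator(iter(corpus), vocab_size=8000)
new_tokenizer.save_pretrained("./domain-tokenizer")
```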
### Batch Encoding

Container for tokenizer outputs with tensor conversion capabilities.

```python { .api }
class BatchEncoding:
    """Container for tokenized inputs with convenient methods."""

    def __init__(
        self,
        data: Optional[Dict[str, Any]] = None,
        encoding: Optional[List["EncodingFast"]] = None,
        tensor_type: Union[None, str, TensorType] = None,
        prepend_batch_axis: bool = False,
        n_sequences: Optional[int] = None
    )

    def __getitem__(self, item: Union[str, int]) -> Union[Any, List[Any]]:
        """Access tokenized data by key or index."""

    def __setitem__(self, key: str, value: Any) -> None:
        """Set tokenized data value."""

    def keys(self) -> List[str]:
        """Get all available keys."""

    def values(self) -> List[Any]:
        """Get all values."""

    def items(self) -> List[Tuple[str, Any]]:
        """Get key-value pairs."""

    def to(
        self,
        device: Union[str, torch.device, int]
    ) -> "BatchEncoding":
        """Move tensors to specified device."""

    def convert_to_tensors(
        self,
        tensor_type: Optional[Union[str, TensorType]] = None,
        prepend_batch_axis: bool = False
    ) -> "BatchEncoding":
        """Convert to specified tensor format."""

    @property
    def input_ids(self) -> Optional[List[List[int]]]:
        """Token IDs for input sequences."""

    @property
    def attention_mask(self) -> Optional[List[List[int]]]:
        """Attention mask (1 for real tokens, 0 for padding)."""

    @property
    def token_type_ids(self) -> Optional[List[List[int]]]:
        """Token type IDs for sequence pairs."""

    def char_to_token(
        self,
        batch_or_char_index: int,
        char_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[int]:
        """Convert character index to token index."""

    def token_to_chars(
        self,
        batch_or_token_index: int,
        token_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[Tuple[int, int]]:
        """Convert token index to character span."""

    def word_to_tokens(
        self,
        batch_or_word_index: int,
        word_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[Tuple[int, int]]:
        """Convert word index to token span."""
```
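A short sketch of working with a `BatchEncoding` (assumes PyTorch is installed and a fast tokenizer, since the position lookups need one):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Hello world"], return_tensors="pt")

# Move every tensor in the encoding at once
device = "cuda" if torch.cuda.is_available() else "cpu"
batch = batch.to(device)
print(batch.input_ids.shape, batch["attention_mask"].shape)

# Position lookups (fast tokenizers only)
token_idx = batch.char_to_token(0, 6)           # character 6 of example 0 -> token index
char_span = batch.token_to_chars(0, token_idx)  # token index -> (start_char, end_char)
print(token_idx, char_span)
```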
### Popular Tokenizer Implementations

#### BERT Tokenizers
```python { .api }
class BertTokenizer(PreTrainedTokenizer):
    """BERT WordPiece tokenizer."""

class BertTokenizerFast(PreTrainedTokenizerFast):
    """Fast BERT tokenizer."""
```

#### GPT Tokenizers
```python { .api }
class GPT2Tokenizer(PreTrainedTokenizer):
    """GPT-2 BPE tokenizer."""

class GPT2TokenizerFast(PreTrainedTokenizerFast):
    """Fast GPT-2 tokenizer."""
```

#### T5 Tokenizers
```python { .api }
class T5Tokenizer(PreTrainedTokenizer):
    """T5 SentencePiece tokenizer."""

class T5TokenizerFast(PreTrainedTokenizerFast):
    """Fast T5 tokenizer."""
```

#### RoBERTa Tokenizers
```python { .api }
class RobertaTokenizer(PreTrainedTokenizer):
    """RoBERTa BPE tokenizer."""

class RobertaTokenizerFast(PreTrainedTokenizerFast):
    """Fast RoBERTa tokenizer."""
```
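Model-specific classes can also be loaded directly; the sketch below contrasts WordPiece and byte-level BPE on the same word (the exact splits depend on each checkpoint's vocabulary):

```python
from transformers import BertTokenizerFast, GPT2TokenizerFast

bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")               # byte-level BPE

print(bert_tok.tokenize("tokenization"))  # WordPiece split, e.g. ['token', '##ization']
print(gpt2_tok.tokenize("tokenization"))  # BPE split, e.g. ['token', 'ization']
```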
### Special Token Handling

```python { .api }
class AddedToken:
    """Represents a token that was added to the vocabulary."""

    def __init__(
        self,
        content: str,
        single_word: bool = False,
        lstrip: bool = False,
        rstrip: bool = False,
        normalized: bool = True,
        special: bool = False
    ):
        """
        Create an added token.

        Args:
            content: Token content
            single_word: Only match the token as a standalone word, never inside another word
            lstrip: Strip whitespace immediately to the left of the token when matching
            rstrip: Strip whitespace immediately to the right of the token when matching
            normalized: Match against the normalized form of the input text
            special: Whether this is a special token
        """
```
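For instance, a custom marker can be registered as an `AddedToken` (the `[ENTITY]` marker is illustrative; after extending the vocabulary, a downstream model's embedding matrix typically needs resizing as well):

```python
from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A marker that is never split and only matches as a whole word
marker = AddedToken("[ENTITY]", single_word=True, lstrip=False, rstrip=False)
num_added = tokenizer.add_special_tokens({"additional_special_tokens": [marker]})

print(num_added, tokenizer.convert_tokens_to_ids("[ENTITY]"))
# If the tokenizer feeds a model, remember e.g. model.resize_token_embeddings(len(tokenizer))
```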
### Tokenization Utilities

Helper functions for common tokenization tasks.

```python { .api }
def is_tokenizers_available() -> bool:
    """Check if the tokenizers library is available."""

def clean_up_tokenization(text: str) -> str:
    """Clean up tokenization artifacts in text."""

def get_pairs(word: Tuple[str, ...]) -> Set[Tuple[str, str]]:
    """Get all adjacent symbol pairs in a word (for BPE)."""
```
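As an illustration of what the BPE pair helper computes, here is a standalone re-implementation (a sketch, not the library's internal code) plus a sample call:

```python
from typing import Set, Tuple

def get_pairs(word: Tuple[str, ...]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent symbol pairs in a word, as scanned by BPE merges."""
    return set(zip(word, word[1:]))

print(get_pairs(("l", "o", "w", "e", "r")))
# {('l', 'o'), ('o', 'w'), ('w', 'e'), ('e', 'r')}
```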
## Tokenization Examples

Common tokenization patterns and use cases:

```python
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
text = "Hello, world!"
tokens = tokenizer.tokenize(text)
# Output: ['hello', ',', 'world', '!']

# Encode to IDs
token_ids = tokenizer.encode(text)
# Output: [101, 7592, 1010, 2088, 999, 102]  # [CLS] + tokens + [SEP]

# Decode back to text
decoded = tokenizer.decode(token_ids)
# Output: "[CLS] hello, world! [SEP]"

# Skip special tokens
decoded_clean = tokenizer.decode(token_ids, skip_special_tokens=True)
# Output: "hello, world!"

# Batch processing with padding
texts = ["Short text", "This is a much longer text that will be truncated"]
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=10,
    return_tensors="pt"
)
# Returns BatchEncoding with padded/truncated sequences

# Sequence pairs (for tasks like similarity, NLI)
result = tokenizer(
    "What is AI?",
    "Artificial Intelligence is machine learning.",
    padding=True,
    return_tensors="pt"
)

# Add custom special tokens
num_added = tokenizer.add_special_tokens({
    "additional_special_tokens": ["[CUSTOM]", "[SPECIAL]"]
})

# Character-to-token mapping (fast tokenizers only)
encoding = tokenizer("Hello world", return_offsets_mapping=True)
char_to_token = encoding.char_to_token(6)  # Character at position 6 -> token index
```
## Fast vs Slow Tokenizers

The library provides both Python-based ("slow") and Rust-based ("fast") tokenizers:

**Fast Tokenizers (Recommended):**
- Rust-based implementation for superior speed
- Better memory efficiency
- Additional features like offset mapping
- Parallel processing capabilities
- Available for most popular models

**Slow Tokenizers:**
- Pure Python implementation
- Full compatibility and customization
- Fallback when fast tokenizer unavailable
- Better for research and custom modifications

Use `use_fast=True` (default) to automatically select fast tokenizers when available.
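A quick way to confirm which implementation was loaded, and to see a fast-only feature in action (a sketch; it assumes the checkpoint ships both implementations):

```python
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("bert-base-uncased")                  # fast by default
slow = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)  # pure Python

print(fast.is_fast, slow.is_fast)  # True False

# Offset mapping is only available on fast tokenizers
enc = fast("New York City", return_offsets_mapping=True)
print(enc["offset_mapping"])
```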