# Tokenization

The library supports more than 100 tokenizers and handles subword tokenization, special tokens, batch processing, and output as PyTorch, TensorFlow, or NumPy tensors. All tokenizers share a consistent API across model architectures, and Rust-based implementations are available where speed and memory use matter.

## Capabilities

### Auto Tokenizer

Automatic tokenizer selection based on model names or configurations.

```python { .api }
class AutoTokenizer:
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        *inputs,
        cache_dir: Union[str, os.PathLike] = None,
        force_download: bool = False,
        local_files_only: bool = False,
        token: Union[str, bool] = None,
        revision: str = "main",
        use_fast: bool = True,
        tokenizer_type: Optional[str] = None,
        trust_remote_code: bool = False,
        **kwargs
    ) -> PreTrainedTokenizer:
        """
        Load tokenizer automatically detecting the type.

        Args:
            pretrained_model_name_or_path: Model name or path
            cache_dir: Custom cache directory
            force_download: Force fresh download
            local_files_only: Only use local files
            token: Authentication token
            revision: Model revision/branch
            use_fast: Use fast (Rust-based) tokenizer when available
            tokenizer_type: Override auto-detected tokenizer type
            trust_remote_code: Allow custom tokenizer code

        Returns:
            Loaded tokenizer instance
        """
```
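To make the idea of automatic selection concrete, here is a minimal sketch of a lookup that prefers the fast class when `use_fast` is set. The `TOKENIZER_NAMES` table and `resolve_tokenizer_name` function are hypothetical names invented for illustration; they are not part of the library, and the real resolution logic also reads configuration files.

```python
from typing import Dict, Optional, Tuple

# Hypothetical registry: model type -> (slow class name, fast class name or None).
TOKENIZER_NAMES: Dict[str, Tuple[str, Optional[str]]] = {
    "bert": ("BertTokenizer", "BertTokenizerFast"),
    "gpt2": ("GPT2Tokenizer", "GPT2TokenizerFast"),
    "t5": ("T5Tokenizer", "T5TokenizerFast"),
    "roberta": ("RobertaTokenizer", "RobertaTokenizerFast"),
}


def resolve_tokenizer_name(model_type: str, use_fast: bool = True) -> str:
    """Pick a tokenizer class name for a model type, preferring the fast one."""
    if model_type not in TOKENIZER_NAMES:
        raise ValueError(f"Unknown model type: {model_type!r}")
    slow_name, fast_name = TOKENIZER_NAMES[model_type]
    # Fall back to the slow class when no fast variant is registered.
    if use_fast and fast_name is not None:
        return fast_name
    return slow_name


print(resolve_tokenizer_name("bert"))                  # BertTokenizerFast
print(resolve_tokenizer_name("bert", use_fast=False))  # BertTokenizer
```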

### Base Tokenizer Classes

Foundation classes for all tokenizer implementations.

```python { .api }
class PreTrainedTokenizer:
    """Base class for all Python tokenizers."""

    def __init__(
        self,
        model_max_length: int = None,
        padding_side: str = "right",
        truncation_side: str = "right",
        chat_template: str = None,
        model_input_names: List[str] = None,
        bos_token: Union[str, AddedToken] = None,
        eos_token: Union[str, AddedToken] = None,
        unk_token: Union[str, AddedToken] = None,
        sep_token: Union[str, AddedToken] = None,
        pad_token: Union[str, AddedToken] = None,
        cls_token: Union[str, AddedToken] = None,
        mask_token: Union[str, AddedToken] = None,
        additional_special_tokens: List[Union[str, AddedToken]] = None,
        **kwargs
    )

    def __call__(
        self,
        text: Union[str, List[str], List[List[str]]] = None,
        text_pair: Union[str, List[str], List[List[str]]] = None,
        text_target: Union[str, List[str], List[List[str]]] = None,
        text_pair_target: Union[str, List[str], List[List[str]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str] = False,
        truncation: Union[bool, str] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        is_split_into_words: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        **kwargs
    ) -> BatchEncoding:
        """
        Main tokenization method with extensive options.

        Args:
            text: Input text(s) to tokenize
            text_pair: Paired text for sequence pair tasks
            add_special_tokens: Add model-specific special tokens
            padding: Padding strategy ("longest", "max_length", True, False)
            truncation: Truncation strategy (True, False, "longest_first", etc.)
            max_length: Maximum sequence length
            stride: Stride for overlapping windows
            is_split_into_words: Whether input is pre-tokenized
            pad_to_multiple_of: Pad length to multiple of this value
            return_tensors: Format of returned tensors ("pt", "tf", "np")
            return_token_type_ids: Include token type IDs
            return_attention_mask: Include attention mask
            return_overflowing_tokens: Return overflowing tokens
            return_special_tokens_mask: Mark special tokens
            return_offsets_mapping: Include character-to-token mapping
            return_length: Include sequence lengths

        Returns:
            BatchEncoding with tokenized inputs
        """

    def encode(
        self,
        text: Union[str, List[str], List[int]],
        text_pair: Optional[Union[str, List[str]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str] = False,
        truncation: Union[bool, str] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs
    ) -> List[int]:
        """
        Encode text to token IDs.

        Args:
            text: Text to encode
            text_pair: Paired text for sequence pairs
            add_special_tokens: Add special tokens
            padding: Padding strategy
            truncation: Truncation strategy
            max_length: Maximum sequence length
            stride: Stride for overlapping windows
            return_tensors: Format of returned tensors

        Returns:
            List of token IDs
        """

    def decode(
        self,
        token_ids: Union[int, List[int], torch.Tensor, tf.Tensor, np.ndarray],
        skip_special_tokens: bool = False,
        clean_up_tokenization_spaces: bool = None,
        **kwargs
    ) -> str:
        """
        Decode token IDs back to text.

        Args:
            token_ids: Token IDs to decode
            skip_special_tokens: Skip special tokens in output
            clean_up_tokenization_spaces: Clean tokenization artifacts

        Returns:
            Decoded text string
        """

    def tokenize(
        self,
        text: str,
        pair: Optional[str] = None,
        add_special_tokens: bool = False,
        **kwargs
    ) -> List[str]:
        """
        Tokenize text into tokens (not IDs).

        Args:
            text: Text to tokenize
            pair: Paired text for sequence pairs
            add_special_tokens: Add special tokens

        Returns:
            List of token strings
        """

    def convert_tokens_to_ids(
        self,
        tokens: Union[str, List[str]]
    ) -> Union[int, List[int]]:
        """Convert tokens to corresponding IDs."""

    def convert_ids_to_tokens(
        self,
        ids: Union[int, List[int]],
        skip_special_tokens: bool = False
    ) -> Union[str, List[str]]:
        """Convert IDs to corresponding tokens."""

    def add_special_tokens(
        self,
        special_tokens_dict: Dict[str, Union[str, AddedToken]]
    ) -> int:
        """
        Add special tokens to vocabulary.

        Args:
            special_tokens_dict: Dictionary of special tokens

        Returns:
            Number of tokens added
        """

    def save_pretrained(
        self,
        save_directory: Union[str, os.PathLike],
        legacy_format: Optional[bool] = None,
        filename_prefix: Optional[str] = None,
        push_to_hub: bool = False,
        **kwargs
    ) -> Tuple[str]:
        """Save tokenizer to directory."""

class PreTrainedTokenizerFast:
    """Base class for fast (Rust-based) tokenizers."""

    def __init__(
        self,
        tokenizer_object: Optional["Tokenizer"] = None,
        tokenizer_file: Optional[str] = None,
        **kwargs
    )

    # Inherits most methods from PreTrainedTokenizer with optimized implementations

    def train_new_from_iterator(
        self,
        text_iterator: Iterator[str],
        vocab_size: int,
        length: Optional[int] = None,
        new_special_tokens: Optional[List[str]] = None,
        special_tokens_map: Optional[Dict[str, str]] = None,
        **kwargs
    ) -> "PreTrainedTokenizerFast":
        """Train new tokenizer from text iterator."""

    def push_to_hub(
        self,
        repo_id: str,
        use_temp_dir: Optional[bool] = None,
        commit_message: Optional[str] = None,
        private: Optional[bool] = None,
        token: Union[bool, str] = None,
        **kwargs
    ) -> str:
        """Upload tokenizer to Hugging Face Hub."""
```
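The interaction between `padding`, `truncation`, `max_length`, and `padding_side` is easier to follow with a small model of the behavior. The sketch below is plain Python, not the library's implementation: `pad_and_truncate` is a hypothetical helper, it truncates on the right only, and it ignores special-token handling (a real tokenizer keeps tokens such as `[SEP]` when truncating).

```python
from typing import Dict, List


def pad_and_truncate(
    batch: List[List[int]],
    max_length: int,
    pad_id: int = 0,
    padding_side: str = "right",
) -> Dict[str, List[List[int]]]:
    """Truncate each sequence to max_length, then pad to the longest one."""
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")
    # Right-side truncation: keep the first max_length IDs.
    truncated = [ids[:max_length] for ids in batch]
    target = max((len(ids) for ids in truncated), default=0)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        pads, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(ids)
        if padding_side == "right":
            input_ids.append(ids + pads)
            attention_mask.append(ones + zeros)
        else:
            input_ids.append(pads + ids)
            attention_mask.append(zeros + ones)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = pad_and_truncate([[101, 7592, 102], [101, 2023, 2003, 1037, 102]], max_length=4)
print(out["input_ids"])       # [[101, 7592, 102, 0], [101, 2023, 2003, 1037]]
print(out["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The attention mask marks real tokens with 1 and padding with 0, which is what lets a model ignore the padded positions.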

### Batch Encoding

Container for tokenizer outputs with tensor conversion capabilities.

```python { .api }
class BatchEncoding:
    """Container for tokenized inputs with convenient methods."""

    def __init__(
        self,
        data: Optional[Dict[str, Any]] = None,
        encoding: Optional[List["EncodingFast"]] = None,
        tensor_type: Union[None, str, TensorType] = None,
        prepend_batch_axis: bool = False,
        n_sequences: Optional[int] = None
    )

    def __getitem__(self, item: Union[str, int]) -> Union[Any, List[Any]]:
        """Access tokenized data by key or index."""

    def __setitem__(self, key: str, value: Any) -> None:
        """Set tokenized data value."""

    def keys(self) -> List[str]:
        """Get all available keys."""

    def values(self) -> List[Any]:
        """Get all values."""

    def items(self) -> List[Tuple[str, Any]]:
        """Get key-value pairs."""

    def to(
        self,
        device: Union[str, torch.device, int]
    ) -> "BatchEncoding":
        """Move tensors to specified device."""

    def convert_to_tensors(
        self,
        tensor_type: Optional[Union[str, TensorType]] = None,
        prepend_batch_axis: bool = False
    ) -> "BatchEncoding":
        """Convert to specified tensor format."""

    @property
    def input_ids(self) -> Optional[List[List[int]]]:
        """Token IDs for input sequences."""

    @property
    def attention_mask(self) -> Optional[List[List[int]]]:
        """Attention mask (1 for real tokens, 0 for padding)."""

    @property
    def token_type_ids(self) -> Optional[List[List[int]]]:
        """Token type IDs for sequence pairs."""

    def char_to_token(
        self,
        batch_or_char_index: int,
        char_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[int]:
        """Convert character index to token index."""

    def token_to_chars(
        self,
        batch_or_token_index: int,
        token_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[Tuple[int, int]]:
        """Convert token index to character span."""

    def word_to_tokens(
        self,
        batch_or_word_index: int,
        word_index: Optional[int] = None,
        sequence_index: int = 0
    ) -> Optional[Tuple[int, int]]:
        """Convert word index to token span."""
```
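The alignment methods (`char_to_token`, `token_to_chars`) rest on per-token character offsets. As a rough illustration of that idea, the sketch below looks up a token index from a list of `(start, end)` spans. It is a standalone function written for this example, not the `BatchEncoding` method, and the offsets are hand-written under the assumption that special tokens carry the empty span `(0, 0)`.

```python
from typing import List, Optional, Tuple


def char_to_token(offsets: List[Tuple[int, int]], char_index: int) -> Optional[int]:
    """Return the index of the token whose [start, end) span covers char_index."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None  # Whitespace between tokens, or an index past the text.


# Assumed offsets for "Hello world" encoded as [CLS] hello world [SEP].
offsets = [(0, 0), (0, 5), (6, 11), (0, 0)]
print(char_to_token(offsets, 6))  # 2
print(char_to_token(offsets, 5))  # None
```

Empty spans never match any character, so special tokens are skipped without extra checks.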

### Popular Tokenizer Implementations

#### BERT Tokenizers

```python { .api }
class BertTokenizer(PreTrainedTokenizer):
    """BERT WordPiece tokenizer."""

class BertTokenizerFast(PreTrainedTokenizerFast):
    """Fast BERT tokenizer."""
```
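WordPiece splits a word by repeatedly taking the longest vocabulary entry that matches at the current position, marking continuation pieces with `##`. The following is a minimal sketch of that greedy longest-match-first step, using a toy vocabulary invented for illustration; it is not the `BertTokenizer` source and it skips normalization and pre-tokenization.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Split one word into WordPiece tokens by greedy longest match."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # Continuation of the same word.
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # No piece fits: the whole word is unknown.
        pieces.append(match)
        start = end
    return pieces


vocab = {"un", "##aff", "##able", "hello"}  # Toy vocabulary (assumption).
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("xyz", vocab))        # ['[UNK]']
```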

#### GPT Tokenizers

```python { .api }
class GPT2Tokenizer(PreTrainedTokenizer):
    """GPT-2 BPE tokenizer."""

class GPT2TokenizerFast(PreTrainedTokenizerFast):
    """Fast GPT-2 tokenizer."""
```

#### T5 Tokenizers

```python { .api }
class T5Tokenizer(PreTrainedTokenizer):
    """T5 SentencePiece tokenizer."""

class T5TokenizerFast(PreTrainedTokenizerFast):
    """Fast T5 tokenizer."""
```

#### RoBERTa Tokenizers

```python { .api }
class RobertaTokenizer(PreTrainedTokenizer):
    """RoBERTa BPE tokenizer."""

class RobertaTokenizerFast(PreTrainedTokenizerFast):
    """Fast RoBERTa tokenizer."""
```

### Special Token Handling

```python { .api }
class AddedToken:
    """Represents a token that was added to the vocabulary."""

    def __init__(
        self,
        content: str,
        single_word: bool = False,
        lstrip: bool = False,
        rstrip: bool = False,
        normalized: bool = True,
        special: bool = False
    ):
        """
        Create an added token.

        Args:
            content: Token content
            single_word: Whether token represents a single word
            lstrip: Remove leading whitespace
            rstrip: Remove trailing whitespace
            normalized: Whether token is normalized
            special: Whether this is a special token
        """
```

### Tokenization Utilities

Helper functions for common tokenization tasks.

```python { .api }
def is_tokenizers_available() -> bool:
    """Check if tokenizers library is available."""

def clean_up_tokenization(text: str) -> str:
    """Clean up tokenization artifacts in text."""

def get_pairs(word: Tuple[str, ...]) -> Set[Tuple[str, str]]:
    """Get all adjacent symbol pairs in a word (for BPE)."""
```
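To show where `get_pairs` fits in byte-pair encoding, here is a small sketch that collects adjacent symbol pairs and applies one merge. `get_pairs` is re-implemented here to match the signature above, and `merge_pair` is a hypothetical helper added for this example; a real BPE tokenizer would pick the pair to merge from a learned rank table.

```python
from typing import Set, Tuple


def get_pairs(word: Tuple[str, ...]) -> Set[Tuple[str, str]]:
    """Return the set of adjacent symbol pairs in a word."""
    return set(zip(word, word[1:]))


def merge_pair(word: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Replace every occurrence of `pair` in `word` with the merged symbol."""
    merged, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2  # Skip both symbols of the merged pair.
        else:
            merged.append(word[i])
            i += 1
    return tuple(merged)


word = ("l", "o", "w", "e", "r")
print(sorted(get_pairs(word)))       # [('e', 'r'), ('l', 'o'), ('o', 'w'), ('w', 'e')]
print(merge_pair(word, ("l", "o")))  # ('lo', 'w', 'e', 'r')
```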

## Tokenization Examples

Common tokenization patterns and use cases:

```python
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization
text = "Hello, world!"
tokens = tokenizer.tokenize(text)
# Output: ['hello', ',', 'world', '!']

# Encode to IDs
token_ids = tokenizer.encode(text)
# Output: [101, 7592, 1010, 2088, 999, 102]  # [CLS] + tokens + [SEP]

# Decode back to text
decoded = tokenizer.decode(token_ids)
# Output: "[CLS] hello, world! [SEP]"

# Skip special tokens
decoded_clean = tokenizer.decode(token_ids, skip_special_tokens=True)
# Output: "hello, world!"

# Batch processing with padding
texts = ["Short text", "This is a much longer text that will be truncated"]
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=10,
    return_tensors="pt"
)
# Returns BatchEncoding with padded/truncated sequences

# Sequence pairs (for tasks like similarity, NLI)
result = tokenizer(
    "What is AI?",
    "Artificial Intelligence is machine learning.",
    padding=True,
    return_tensors="pt"
)

# Add custom special tokens
num_added = tokenizer.add_special_tokens({
    "additional_special_tokens": ["[CUSTOM]", "[SPECIAL]"]
})
# If a model uses this tokenizer, resize its embeddings afterwards:
# model.resize_token_embeddings(len(tokenizer))

# Character-to-token mapping (requires a fast tokenizer)
encoding = tokenizer("Hello world", return_offsets_mapping=True)
token_index = encoding.char_to_token(6)  # Character at position 6 -> token index
```
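The batch example above truncates long inputs outright. When `stride` is combined with `return_overflowing_tokens=True`, the overflow is instead returned as additional windows that overlap the previous one by `stride` tokens. The sketch below models only that windowing arithmetic in plain Python; `sliding_windows` is a hypothetical helper, and it ignores the special tokens a real tokenizer adds to each window.

```python
from typing import List


def sliding_windows(ids: List[int], max_length: int, stride: int) -> List[List[int]]:
    """Split ids into windows of max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # How far each new window advances.
    windows = []
    start = 0
    while True:
        windows.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break  # This window reached the end of the sequence.
        start += step
    return windows


print(sliding_windows(list(range(10)), max_length=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Overlap keeps context that would otherwise be cut at a window boundary, which matters for tasks such as extractive question answering over long documents.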

## Fast vs Slow Tokenizers

The library provides both Python-based ("slow") and Rust-based ("fast") tokenizers:

**Fast Tokenizers (Recommended):**

- Rust-based implementation for superior speed
- Better memory efficiency
- Additional features like offset mapping
- Parallel processing capabilities
- Available for most popular models

**Slow Tokenizers:**

- Pure Python implementation
- Full compatibility and customization
- Fallback when fast tokenizer unavailable
- Better for research and custom modifications

Use `use_fast=True` (default) to automatically select fast tokenizers when available.