# Other Transformer Models

Additional transformer architectures beyond BERT and GPT-2, including OpenAI GPT, Transformer-XL, XLNet, XLM, RoBERTa, and DistilBERT. Each architecture has specific design characteristics optimized for different NLP tasks and languages.

## Capabilities

### XLNet Models

XLNet uses permutation-based training and relative positional encodings, combining the best of autoregressive and autoencoding approaches.
#### XLNetConfig

```python { .api }
class XLNetConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=32000,
        d_model=1024,
        n_layer=24,
        n_head=16,
        d_inner=4096,
        ff_activation="gelu",
        untie_r=True,
        attn_type="bi",
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        dropout=0.1,
        mem_len=None,
        reuse_len=None,
        bi_data=False,
        clamp_len=-1,
        same_length=False,
        **kwargs
    ):
        """
        Configuration for XLNet models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - d_model (int): Hidden layer dimensionality
        - n_layer (int): Number of transformer layers
        - n_head (int): Number of attention heads
        - d_inner (int): Feed-forward layer dimensionality
        - ff_activation (str): Feed-forward activation function
        - untie_r (bool): Whether to untie relative position biases
        - attn_type (str): Attention type ("bi" for bidirectional)
        - dropout (float): Dropout probability
        - mem_len (int): Memory length for recurrence
        - reuse_len (int): Reuse length for recurrence
        """
```
#### XLNetModel

```python { .api }
class XLNetModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        mems=None,
        perm_mask=None,
        target_mapping=None,
        token_type_ids=None,
        input_mask=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through XLNet model.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - mems (List[torch.Tensor]): Memory from previous segments
        - perm_mask (torch.Tensor): Permutation mask for attention
        - target_mapping (torch.Tensor): Target mapping for partial prediction
        - token_type_ids (torch.Tensor): Segment token indices
        - input_mask (torch.Tensor): Input mask
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings

        Returns:
        XLNetModelOutput: Object with last_hidden_state and mems
        """
```
#### XLNetForSequenceClassification

```python { .api }
class XLNetForSequenceClassification(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        mems=None,
        perm_mask=None,
        target_mapping=None,
        token_type_ids=None,
        input_mask=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for XLNet sequence classification.

        Returns:
        SequenceClassifierOutput: Object with loss and logits
        """
```
#### XLNetTokenizer

```python { .api }
class XLNetTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        do_lower_case=False,
        remove_space=True,
        keep_accents=False,
        bos_token="<s>",
        eos_token="</s>",
        unk_token="<unk>",
        sep_token="<sep>",
        pad_token="<pad>",
        cls_token="<cls>",
        mask_token="<mask>",
        **kwargs
    ):
        """
        SentencePiece-based tokenizer for XLNet.
        """
```
#### SPIECE_UNDERLINE

```python { .api }
SPIECE_UNDERLINE: str = "▁"
# SentencePiece underline character used by XLNet tokenizer
# Represents the beginning of words in subword tokenization
```
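To make the marker's role concrete, here is a minimal sketch of how `SPIECE_UNDERLINE` lets pieces be joined back into text; the `pieces` list is a hypothetical output, not taken from a real tokenizer run:

```python
# SentencePiece marks the start of each word with "▁" (U+2581).
SPIECE_UNDERLINE = "\u2581"

# A hypothetical piece sequence of the kind XLNetTokenizer produces.
pieces = ["\u2581This", "\u2581is", "\u2581token", "ization"]

# Detokenize: concatenate pieces, then turn the markers back into spaces.
text = "".join(pieces).replace(SPIECE_UNDERLINE, " ").strip()
print(text)  # This is tokenization
```

Because word boundaries are encoded in the pieces themselves, detokenization is lossless even when a word is split into several subwords.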
**Usage Example:**

```python
import torch
from pytorch_transformers import XLNetForSequenceClassification, XLNetTokenizer

model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

text = "This is a great movie!"
# pytorch_transformers tokenizers are not callable; use encode()
input_ids = torch.tensor([tokenizer.encode(text)])
outputs = model(input_ids)
```
### RoBERTa Models

RoBERTa (Robustly Optimized BERT Pretraining Approach) improves upon BERT with better training procedures and hyperparameters.
#### RobertaConfig

```python { .api }
class RobertaConfig(BertConfig):
    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):
        """
        Configuration for RoBERTa models (extends BertConfig).

        Parameters:
        - pad_token_id (int): Padding token ID
        - bos_token_id (int): Beginning of sequence token ID
        - eos_token_id (int): End of sequence token ID
        """
```
#### RobertaModel

```python { .api }
class RobertaModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through RoBERTa model.

        Returns:
        BaseModelOutputWithPooling: Object with last_hidden_state and pooler_output
        """
```
#### RobertaForMaskedLM

```python { .api }
class RobertaForMaskedLM(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for RoBERTa masked language modeling.

        Returns:
        MaskedLMOutput: Object with loss and prediction_scores
        """
```
#### RobertaTokenizer

```python { .api }
class RobertaTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        merges_file,
        errors="replace",
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        add_prefix_space=False,
        **kwargs
    ):
        """
        RoBERTa tokenizer (inherits from GPT2Tokenizer with different special tokens).
        """
```
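RoBERTa's special-token layout differs from BERT's: a single sequence becomes `<s> A </s>`, and a pair becomes `<s> A </s></s> B </s>` (two separators between the segments). The sketch below reproduces that layout in plain Python, using the default IDs from `RobertaConfig` (`bos_token_id=0`, `eos_token_id=2`); the token IDs passed in are hypothetical:

```python
# <s> and </s> IDs from RobertaConfig defaults (bos_token_id=0, eos_token_id=2).
BOS, EOS = 0, 2

def with_special_tokens(ids_a, ids_b=None):
    """Single sequence: <s> A </s>; pair: <s> A </s></s> B </s>."""
    if ids_b is None:
        return [BOS] + ids_a + [EOS]
    return [BOS] + ids_a + [EOS, EOS] + ids_b + [EOS]

print(with_special_tokens([10, 11]))        # [0, 10, 11, 2]
print(with_special_tokens([10, 11], [12]))  # [0, 10, 11, 2, 2, 12, 2]
```

Note that RoBERTa reuses `</s>` as both end-of-sequence and separator token, which is why `sep_token="</s>"` and `cls_token="<s>"` in the signature above.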
### DistilBERT Models

DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT's language-understanding performance.
#### DistilBertConfig

```python { .api }
class DistilBertConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=30522,
        max_position_embeddings=512,
        sinusoidal_pos_embds=False,
        n_layers=6,
        n_heads=12,
        dim=768,
        hidden_dim=3072,
        dropout=0.1,
        attention_dropout=0.1,
        activation="gelu",
        initializer_range=0.02,
        **kwargs
    ):
        """
        Configuration for DistilBERT models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - max_position_embeddings (int): Maximum sequence length
        - sinusoidal_pos_embds (bool): Whether to use sinusoidal position embeddings
        - n_layers (int): Number of transformer layers
        - n_heads (int): Number of attention heads
        - dim (int): Hidden layer dimensionality
        - hidden_dim (int): Feed-forward layer dimensionality
        - dropout (float): Dropout probability
        - attention_dropout (float): Attention dropout probability
        - activation (str): Activation function
        """
```
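The defaults above are enough to see where DistilBERT's ~66M parameter count comes from. The back-of-the-envelope arithmetic below counts only the large weight matrices (biases and LayerNorm parameters are omitted, so the result slightly undercounts):

```python
# Rough parameter count from DistilBertConfig defaults (weights only).
vocab_size, max_pos = 30522, 512
dim, hidden_dim, n_layers = 768, 3072, 6

embeddings = vocab_size * dim + max_pos * dim  # token + position tables
attention = 4 * dim * dim                      # Q, K, V, output projections
ffn = 2 * dim * hidden_dim                     # two feed-forward matrices
total = embeddings + n_layers * (attention + ffn)

print(f"~{total / 1e6:.1f}M parameters")  # ~66.3M
```

The embedding tables alone account for roughly a third of the total, which is one reason halving the layer count (12 → 6) shrinks the model by about 40% rather than 50%.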
#### DistilBertModel

```python { .api }
class DistilBertModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through DistilBERT model.

        Returns:
        BaseModelOutput: Object with last_hidden_state
        """
```
#### DistilBertForSequenceClassification

```python { .api }
class DistilBertForSequenceClassification(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for DistilBERT sequence classification.

        Returns:
        SequenceClassifierOutput: Object with loss and logits
        """
```
#### DistilBertTokenizer

```python { .api }
class DistilBertTokenizer(PreTrainedTokenizer):
    # Identical to BertTokenizer - uses same WordPiece tokenization
    pass
```
### XLM Models

XLM (Cross-lingual Language Model) for multilingual understanding and cross-lingual transfer learning.
#### XLMConfig

```python { .api }
class XLMConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=30145,
        emb_dim=2048,
        n_layers=12,
        n_heads=16,
        dropout=0.1,
        attention_dropout=0.1,
        gelu_activation=True,
        sinusoidal_embeddings=False,
        causal=False,
        asm=False,
        n_langs=1,
        use_lang_emb=True,
        max_position_embeddings=512,
        **kwargs
    ):
        """
        Configuration for XLM models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - emb_dim (int): Embedding and hidden layer dimensionality
        - n_layers (int): Number of transformer layers
        - n_heads (int): Number of attention heads
        - dropout (float): Dropout probability
        - attention_dropout (float): Attention dropout probability
        - gelu_activation (bool): Use GELU (True) or ReLU (False) activations
        - sinusoidal_embeddings (bool): Whether to use sinusoidal position embeddings
        - causal (bool): Whether to use causal (unidirectional) attention
        - asm (bool): Whether to use an adaptive softmax output layer
        - n_langs (int): Number of languages
        - use_lang_emb (bool): Whether to add language embeddings to the input
        - max_position_embeddings (int): Maximum sequence length
        """
```
#### XLMModel

```python { .api }
class XLMModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        langs=None,
        token_type_ids=None,
        position_ids=None,
        lengths=None,
        cache=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through XLM model.

        Returns:
        XLMModelOutput: Object with last_hidden_state
        """
```
#### XLMTokenizer

```python { .api }
class XLMTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        merges_file,
        unk_token="<unk>",
        bos_token="<s>",
        sep_token="</s>",
        pad_token="<pad>",
        cls_token="</s>",
        mask_token="<special1>",
        **kwargs
    ):
        """
        BPE tokenizer for XLM multilingual models.
        """
```
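The `langs` argument to `XLMModel.forward` identifies the language of each token so the model can add the matching language embedding (when `use_lang_emb=True`). The sketch below shows the shape of that input in plain Python; the `lang2id` mapping and token IDs are illustrative, not values from a real checkpoint:

```python
# Illustrative language-ID mapping; real IDs come from the model's config.
lang2id = {"en": 0, "fr": 1}

input_ids = [45, 782, 19, 3]  # hypothetical token IDs for one sentence
# One language ID per token; monolingual input repeats the same ID.
langs = [lang2id["en"]] * len(input_ids)
print(langs)  # [0, 0, 0, 0]
```

In practice both lists would be wrapped in tensors of shape `(batch_size, sequence_length)` before being passed to the model.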
### Transformer-XL Models

Transformer-XL enables learning longer-term dependencies with recurrence mechanisms and relative positional encodings.
#### TransfoXLConfig

```python { .api }
class TransfoXLConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=267735,
        cutoffs=[20000, 40000, 200000],
        d_model=1024,
        d_embed=1024,
        n_head=16,
        d_head=64,
        d_inner=4096,
        div_val=4,
        pre_lnorm=False,
        n_layer=18,
        mem_len=1600,
        clamp_len=1000,
        same_length=True,
        **kwargs
    ):
        """
        Configuration for Transformer-XL models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - cutoffs (List[int]): Cutoffs for the adaptive softmax
        - d_model (int): Hidden layer dimensionality
        - d_embed (int): Embedding dimensionality
        - n_head (int): Number of attention heads
        - d_head (int): Dimensionality per attention head
        - d_inner (int): Feed-forward layer dimensionality
        - div_val (int): Divisor for adaptive embedding/softmax clusters
        - pre_lnorm (bool): Apply LayerNorm before attention and feed-forward blocks
        - n_layer (int): Number of transformer layers
        - mem_len (int): Length of the retained memory
        - clamp_len (int): Clamp relative distances beyond this length
        - same_length (bool): Use the same attention length for all tokens
        """
```
#### TransfoXLModel

```python { .api }
class TransfoXLModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        mems=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through Transformer-XL model.

        Returns:
        TransfoXLModelOutput: Object with last_hidden_state and mems
        """
```
#### TransfoXLTokenizer

```python { .api }
class TransfoXLTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        special=None,
        min_freq=0,
        max_size=None,
        lower_case=False,
        delimiter=None,
        vocab_file=None,
        **kwargs
    ):
        """
        Word-level tokenizer for Transformer-XL.
        """
```
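The recurrence behind `mems` can be sketched without tensors: after each segment, the new hidden states are appended to the memory and only the most recent `mem_len` steps are kept, so later segments can attend to context from earlier ones. Here plain lists of per-step states stand in for the per-layer tensors; `update_mems` is a simplified illustration, not the library's implementation:

```python
def update_mems(prev_mems, new_hidden, mem_len):
    """Append this segment's hidden states, keep the last mem_len steps."""
    combined = prev_mems + new_hidden
    return combined[-mem_len:]

mems = []
for segment in (["h1", "h2"], ["h3", "h4"], ["h5", "h6"]):
    mems = update_mems(mems, segment, mem_len=4)

print(mems)  # ['h3', 'h4', 'h5', 'h6']
```

With the real model, each call returns updated `mems` that are fed into the next call, giving an effective context far longer than any single segment.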
### OpenAI GPT Models

The original OpenAI GPT (Generative Pre-trained Transformer) model.
#### OpenAIGPTConfig

```python { .api }
class OpenAIGPTConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=40478,
        n_positions=512,
        n_ctx=512,
        n_embd=768,
        n_layer=12,
        n_head=12,
        afn="gelu",
        resid_pdrop=0.1,
        embd_pdrop=0.1,
        attn_pdrop=0.1,
        layer_norm_epsilon=1e-5,
        initializer_range=0.02,
        **kwargs
    ):
        """
        Configuration for OpenAI GPT models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - n_positions (int): Maximum sequence length
        - n_ctx (int): Context size
        - n_embd (int): Hidden layer dimensionality
        - n_layer (int): Number of transformer layers
        - n_head (int): Number of attention heads
        - afn (str): Activation function
        - resid_pdrop (float): Residual dropout probability
        - embd_pdrop (float): Embedding dropout probability
        - attn_pdrop (float): Attention dropout probability
        """
```
502
#### OpenAIGPTModel
503
504
```python { .api }
505
class OpenAIGPTModel(PreTrainedModel):
506
def forward(
507
self,
508
input_ids=None,
509
attention_mask=None,
510
token_type_ids=None,
511
position_ids=None,
512
head_mask=None,
513
inputs_embeds=None
514
):
515
"""
516
Forward pass through OpenAI GPT model.
517
518
Returns:
519
BaseModelOutput: Object with last_hidden_state
520
"""
521
```
522
523
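Unlike XLNet's bidirectional `attn_type="bi"`, GPT attends causally: position *i* may only look at positions *j* ≤ *i*. The lower-triangular mask that enforces this can be sketched in a few lines (1 = may attend, 0 = blocked); this is an illustration of the pattern, not the library's internal masking code:

```python
def causal_mask(n):
    """Lower-triangular attention mask for an n-token sequence."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(3):
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```

This is what makes GPT suitable for left-to-right generation: no token's representation depends on tokens that come after it.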
#### OpenAIGPTTokenizer

```python { .api }
class OpenAIGPTTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        merges_file,
        unk_token="<unk>",
        **kwargs
    ):
        """
        BPE tokenizer for OpenAI GPT.
        """
```
## Archive Maps and Model Names

### Available Pre-trained Models

**XLNet:**

- `xlnet-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters
- `xlnet-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters

**RoBERTa:**

- `roberta-base`: 12-layer, 768-hidden, 12-heads, 125M parameters
- `roberta-large`: 24-layer, 1024-hidden, 16-heads, 355M parameters

**DistilBERT:**

- `distilbert-base-uncased`: 6-layer, 768-hidden, 12-heads, 66M parameters
- `distilbert-base-cased`: 6-layer, 768-hidden, 12-heads, 65M parameters (cased)

**XLM:**

- `xlm-mlm-en-2048`: English MLM model, 12-layer, 2048-hidden, 16-heads
- `xlm-mlm-100-1280`: 100-language MLM model, 16-layer, 1280-hidden, 16-heads

**Transformer-XL:**

- `transfo-xl-wt103`: Trained on WikiText-103, 18-layer, 1024-hidden, 16-heads

**OpenAI GPT:**

- `openai-gpt`: 12-layer, 768-hidden, 12-heads, 117M parameters
## Usage Examples

```python
import torch

# XLNet for sequence classification
from pytorch_transformers import XLNetForSequenceClassification, XLNetTokenizer

xlnet_model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)
xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

# RoBERTa for masked language modeling
from pytorch_transformers import RobertaForMaskedLM, RobertaTokenizer

roberta_model = RobertaForMaskedLM.from_pretrained("roberta-base")
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# DistilBERT for efficient inference
from pytorch_transformers import DistilBertForSequenceClassification, DistilBertTokenizer

distilbert_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
distilbert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Process text (DistilBERT shown; the pattern is the same for the other models)
text = "This is an example sentence."
input_ids = torch.tensor([distilbert_tokenizer.encode(text)])
outputs = distilbert_model(input_ids)
```