# BERT Models

BERT (Bidirectional Encoder Representations from Transformers) models for a range of NLP tasks. BERT's self-attention attends to context on both the left and the right of each token, which makes it well suited to understanding-oriented tasks such as classification, question answering, and token-level prediction.

## Capabilities

### BertConfig

Configuration class for BERT models containing all hyperparameters and architecture specifications.

```python { .api }
class BertConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        **kwargs
    ):
        """
        Configuration for BERT models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - hidden_size (int): Hidden layer dimensionality
        - num_hidden_layers (int): Number of transformer layers
        - num_attention_heads (int): Number of attention heads per layer
        - intermediate_size (int): Feed-forward layer dimensionality
        - hidden_act (str): Activation function ("gelu", "relu", "swish")
        - hidden_dropout_prob (float): Dropout probability for hidden layers
        - attention_probs_dropout_prob (float): Dropout for attention probabilities
        - max_position_embeddings (int): Maximum sequence length
        - type_vocab_size (int): Number of token type embeddings
        - initializer_range (float): Weight initialization range
        - layer_norm_eps (float): Layer normalization epsilon
        """
```
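
To see how these hyperparameters determine model size, the following sketch estimates the parameter count of the base encoder from the configuration values alone. It assumes the standard BERT layout (word, position, and token-type embeddings, then per-layer self-attention and feed-forward blocks, then a pooler); `bert_param_count` is a hypothetical helper, not part of the library.

```python
def bert_param_count(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=2,
):
    """Estimate the trainable parameters of the base encoder (no task head)."""

    def linear(n_in, n_out):
        # Weight matrix plus bias vector
        return n_in * n_out + n_out

    layer_norm = 2 * hidden_size  # Scale and shift vectors

    embeddings = (
        vocab_size * hidden_size
        + max_position_embeddings * hidden_size
        + type_vocab_size * hidden_size
        + layer_norm
    )
    # Query, key, value, and output projections, followed by a LayerNorm
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm
    # Two linear layers around the activation, followed by a LayerNorm
    feed_forward = (
        linear(hidden_size, intermediate_size)
        + linear(intermediate_size, hidden_size)
        + layer_norm
    )
    pooler = linear(hidden_size, hidden_size)
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


base = bert_param_count()
large = bert_param_count(hidden_size=1024, num_hidden_layers=24, intermediate_size=4096)
print(f"base:  {base:,}")
print(f"large: {large:,}")
```

Note that `num_attention_heads` does not appear: the heads split the hidden dimension, so changing their number leaves the parameter count unchanged.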

### BertModel

Base BERT model for encoding sequences into contextualized representations.

```python { .api }
class BertModel(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT base model.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through BERT model.

        Parameters:
        - input_ids (torch.Tensor): Token IDs of shape (batch_size, sequence_length)
        - attention_mask (torch.Tensor): Mask that keeps attention off padding tokens
        - token_type_ids (torch.Tensor): Segment token indices for sentence pairs
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Mask to nullify selected heads
        - inputs_embeds (torch.Tensor): Pre-computed embeddings, passed instead of input_ids

        Returns:
        BaseModelOutputWithPooling: Object with last_hidden_state and pooler_output
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertModel, BertTokenizer
import torch

# Load model and tokenizer
model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare input
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)

# Access representations
last_hidden_state = outputs.last_hidden_state  # Shape: (1, seq_len, 768)
pooled_output = outputs.pooler_output  # Shape: (1, 768)

print(f"Sequence representation shape: {last_hidden_state.shape}")
print(f"Pooled representation shape: {pooled_output.shape}")
```

### BertPreTrainedModel

Abstract base class for all BERT models that handles weight initialization and provides a simple interface for downloading and loading pre-trained models.

```python { .api }
class BertPreTrainedModel(PreTrainedModel):
    config_class = BertConfig
    pretrained_model_archive_map = BERT_PRETRAINED_MODEL_ARCHIVE_MAP
    load_tf_weights = load_tf_weights_in_bert
    base_model_prefix = "bert"

    def _init_weights(self, module):
        """
        Initialize the weights for BERT models.

        Parameters:
        - module (nn.Module): Module to initialize
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertPreTrainedModel, BertConfig

# BertPreTrainedModel is typically used as a base class for custom BERT models
class CustomBertModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Custom model implementation

    def forward(self, input_ids):
        # Custom forward implementation
        pass

# Initialize with proper weight initialization
config = BertConfig()
model = CustomBertModel(config)
# Weights are automatically initialized according to BERT standards
```

### BertForPreTraining

BERT model for pre-training with both masked language modeling and next sentence prediction heads.

```python { .api }
class BertForPreTraining(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for pre-training with MLM and NSP heads.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        masked_lm_labels=None,
        next_sentence_label=None
    ):
        """
        Forward pass for pre-training with MLM and NSP tasks.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - masked_lm_labels (torch.Tensor): Labels for MLM loss
        - next_sentence_label (torch.Tensor): Labels for NSP loss

        Returns:
        BertForPreTrainingOutput: Object with prediction_logits, seq_relationship_logits, and losses
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertForPreTraining, BertTokenizer
import torch

# Load model and tokenizer
model = BertForPreTraining.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare pre-training data
text_a = "The cat sat on the"
text_b = "mat and slept peacefully"
inputs = tokenizer(text_a, text_b, return_tensors="pt")

# Copy the tensors so that masking does not overwrite the original token IDs
masked_inputs = {name: tensor.clone() for name, tensor in inputs.items()}
masked_inputs["input_ids"][0, 4] = tokenizer.mask_token_id  # Index 0 is [CLS], so index 4 is "on"

# Labels hold the original token at masked positions and the ignore index elsewhere
# (-1 is assumed here; some later releases of the library use -100 instead)
masked_lm_labels = inputs["input_ids"].clone()
masked_lm_labels[masked_inputs["input_ids"] != tokenizer.mask_token_id] = -1

# Add NSP label (0 = sentence B follows A, 1 = random sentence B)
next_sentence_label = torch.tensor([0])

# Forward pass
outputs = model(**masked_inputs,
                masked_lm_labels=masked_lm_labels,
                next_sentence_label=next_sentence_label)

# The returned loss combines the MLM and NSP objectives
print(f"Pre-training loss (MLM + NSP): {outputs.loss}")
print(f"NSP predictions: {torch.softmax(outputs.seq_relationship_logits, dim=-1)}")
```

### BertForNextSentencePrediction

BERT model with only a next sentence prediction head for determining if two sentences are consecutive.

```python { .api }
class BertForNextSentencePrediction(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for next sentence prediction task.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        next_sentence_label=None
    ):
        """
        Forward pass for next sentence prediction.

        Parameters:
        - input_ids (torch.Tensor): Token IDs for sentence pair
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices (0 for sentence A, 1 for sentence B)
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - next_sentence_label (torch.Tensor): Labels (0=consecutive, 1=random)

        Returns:
        NextSentencePredictorOutput: Object with loss and logits
        """
```
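
The segment layout behind `token_type_ids` can be sketched without the library. This is a minimal illustration, assuming the usual `[CLS] A [SEP] B [SEP]` packing and the `bert-base-uncased` ids 101 and 102 for `[CLS]` and `[SEP]`. `build_pair_inputs` and the word ids are hypothetical; in practice the tokenizer builds these lists for you.

```python
def build_pair_inputs(ids_a, ids_b, cls_id=101, sep_id=102):
    """Pack two already-tokenized sentences into one BERT input."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


# Toy word ids, not real vocabulary entries
input_ids, token_type_ids = build_pair_inputs([7, 8], [9])
print(input_ids)
print(token_type_ids)
```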

**Usage Example:**

```python
from pytorch_transformers import BertForNextSentencePrediction, BertTokenizer
import torch

# Load model and tokenizer
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare sentence pairs
sentence_a = "The weather is nice today"
sentence_b = "I think I'll go for a walk"  # Consecutive sentence
sentence_c = "Machine learning is fascinating"  # Random sentence

# Encode pairs
consecutive_inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
random_inputs = tokenizer(sentence_a, sentence_c, return_tensors="pt")

# Predict
with torch.no_grad():
    consecutive_outputs = model(**consecutive_inputs)
    random_outputs = model(**random_inputs)

# Get predictions (0=consecutive, 1=random)
consecutive_probs = torch.softmax(consecutive_outputs.logits, dim=-1)
random_probs = torch.softmax(random_outputs.logits, dim=-1)

print(f"Consecutive pair - P(consecutive): {consecutive_probs[0, 0]:.3f}")
print(f"Random pair - P(consecutive): {random_probs[0, 0]:.3f}")
```

### BertForMaskedLM

BERT model with a language modeling head for masked language modeling (MLM) tasks.

```python { .api }
class BertForMaskedLM(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for masked language modeling.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for masked language modeling.

        Parameters:
        - input_ids (torch.Tensor): Token IDs with [MASK] tokens
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): True token IDs for masked positions

        Returns:
        MaskedLMOutput: Object with loss and prediction_scores
        """
```
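
The relationship between `input_ids` and `labels` is easy to get wrong, so here is a library-free sketch of how the two are usually derived from the same sequence. `mask_tokens` is a hypothetical helper and the word ids are toy values. The ignore index of -1 is an assumption that depends on the library release, which is why it is a parameter.

```python
IGNORE_INDEX = -1  # Assumption: check which value your release's loss ignores


def mask_tokens(input_ids, positions, mask_token_id, ignore_index=IGNORE_INDEX):
    """Return (masked_ids, labels) for the given positions to mask."""
    masked_ids = list(input_ids)  # Copy so the original ids stay intact
    labels = [ignore_index] * len(input_ids)
    for pos in positions:
        labels[pos] = input_ids[pos]  # The loss sees the original token here
        masked_ids[pos] = mask_token_id  # The model sees [MASK] here
    return masked_ids, labels


# 101, 102, and 103 are [CLS], [SEP], and [MASK] in bert-base-uncased;
# the remaining ids are toy values
ids = [101, 11, 12, 13, 14, 11, 15, 102]
masked_ids, labels = mask_tokens(ids, positions=[4], mask_token_id=103)
print(masked_ids)
print(labels)
```

Only the masked position contributes to the loss; every other label carries the ignore index.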

### BertForSequenceClassification

BERT model with a classification head for sequence-level classification tasks.

```python { .api }
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for sequence classification.

        Parameters:
        - config (BertConfig): Model configuration with num_labels
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for sequence classification.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): Classification labels

        Returns:
        SequenceClassifierOutput: Object with loss and logits
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load model for binary classification
# (the classification head is newly initialized, so fine-tune before trusting its output)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare input
text = "This movie is fantastic!"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Index 1 is treated as the positive class here
print(f"Positive probability: {predictions[0][1].item():.3f}")
```

### BertForQuestionAnswering

BERT model with a span classification head for extractive question answering.

```python { .api }
class BertForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for question answering.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        start_positions=None,
        end_positions=None
    ):
        """
        Forward pass for question answering.

        Parameters:
        - input_ids (torch.Tensor): Token IDs for question and context
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment IDs (0 for question, 1 for context)
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - start_positions (torch.Tensor): Start positions of answer spans
        - end_positions (torch.Tensor): End positions of answer spans

        Returns:
        QuestionAnsweringModelOutput: Object with loss, start_logits, end_logits
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertForQuestionAnswering, BertTokenizer
import torch

# Load model and tokenizer
# (bert-base-uncased has no fine-tuned QA head; use a checkpoint fine-tuned
# on a QA dataset to get meaningful answers)
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare question and context
question = "What is the capital of France?"
context = "France is a country in Europe. The capital of France is Paris."

# Tokenize question and context as a single sentence pair
inputs = tokenizer.encode_plus(
    question,
    context,
    return_tensors="pt",
    max_length=512,
    truncation=True
)

# Get answer span predictions
with torch.no_grad():
    outputs = model(**inputs)
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

# Pick the highest-scoring start and end positions independently
start_idx = torch.argmax(start_scores).item()
end_idx = torch.argmax(end_scores).item()

# Extract answer (empty if end_idx falls before start_idx)
answer_tokens = inputs["input_ids"][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens)
print(f"Answer: {answer}")
```

### BertForTokenClassification

BERT model with a token classification head for token-level tasks like named entity recognition.

```python { .api }
class BertForTokenClassification(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for token classification.

        Parameters:
        - config (BertConfig): Model configuration with num_labels
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for token classification.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): Token-level labels

        Returns:
        TokenClassifierOutput: Object with loss and logits
        """
```
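
Because WordPiece can split one word into several sub-tokens, word-level annotations usually have to be expanded before they can serve as token-level `labels`. The sketch below shows one common convention, labeling the first sub-token and ignoring the rest; it is not necessarily what this library expects. `align_labels`, the toy splits, the label ids, and the ignore index of -1 are all assumptions for illustration.

```python
def align_labels(words, word_labels, tokenize, ignore_index=-1):
    """Expand word-level labels to sub-token level.

    The first sub-token of each word keeps the word's label; the remaining
    sub-tokens get ignore_index so that the loss can skip them.
    """
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # Nothing to label if the tokenizer drops the word
        tokens.extend(pieces)
        labels.extend([label] + [ignore_index] * (len(pieces) - 1))
    return tokens, labels


# Hypothetical sub-word splits standing in for a real WordPiece tokenizer
TOY_SPLITS = {
    "Ada": ["ada"],
    "visited": ["visited"],
    "Reykjavik": ["rey", "##kja", "##vik"],
}

words = ["Ada", "visited", "Reykjavik"]
word_labels = [1, 0, 2]  # Illustrative ids, e.g. 0 = O, 1 = PER, 2 = LOC

tokens, labels = align_labels(words, word_labels, TOY_SPLITS.__getitem__)
print(tokens)
print(labels)
```

Special tokens such as `[CLS]` and `[SEP]` would receive the ignore index as well once they are added around the sequence.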

### BertForMultipleChoice

BERT model for multiple choice tasks with a classification head over multiple choice options.

```python { .api }
class BertForMultipleChoice(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for multiple choice.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for multiple choice.

        Parameters:
        - input_ids (torch.Tensor): Token IDs of shape (batch_size, num_choices, seq_len)
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): Correct choice indices
        """
```
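
The three-dimensional `input_ids` layout differs from the other heads, so a small sketch of how such a batch might be assembled may help. It pads every choice to a common length and builds the matching attention mask using plain lists. `build_choice_batch` is a hypothetical helper, the word ids are toy values, and a pad id of 0 is assumed (the `[PAD]` id in `bert-base-uncased`).

```python
def build_choice_batch(encoded_choices, pad_id=0):
    """Pad per-choice token ids to shape (batch_size, num_choices, seq_len).

    encoded_choices: one list per example, each holding one token-id list
    per answer choice (context and choice already packed together).
    """
    max_len = max(len(ids) for example in encoded_choices for ids in example)
    input_ids, attention_mask = [], []
    for example in encoded_choices:
        input_ids.append(
            [ids + [pad_id] * (max_len - len(ids)) for ids in example]
        )
        # 1 marks real tokens, 0 marks padding
        attention_mask.append(
            [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in example]
        )
    return input_ids, attention_mask


# Two examples with two choices each; 101 and 102 are [CLS] and [SEP]
batch = [
    [[101, 7, 8, 102], [101, 7, 102]],
    [[101, 9, 102], [101, 9, 10, 11, 102]],
]
input_ids, attention_mask = build_choice_batch(batch)
shape = (len(input_ids), len(input_ids[0]), len(input_ids[0][0]))
print(shape)
```

Passing the nested lists to `torch.tensor` would give tensors of shape `(batch_size, num_choices, seq_len)`, with `labels` holding one correct-choice index per example.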

### BertTokenizer

WordPiece tokenizer for BERT models with proper handling of special tokens and subword tokenization.

```python { .api }
class BertTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        **kwargs
    ):
        """
        Initialize BERT tokenizer.

        Parameters:
        - vocab_file (str): Path to vocabulary file
        - do_lower_case (bool): Whether to lowercase input
        - do_basic_tokenize (bool): Whether to do basic tokenization
        - never_split (List[str]): Tokens never to split
        - unk_token (str): Unknown token
        - sep_token (str): Separator token
        - pad_token (str): Padding token
        - cls_token (str): Classification token
        - mask_token (str): Mask token
        - tokenize_chinese_chars (bool): Whether to tokenize Chinese characters
        """
```
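
To make the subword step concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word. It assumes the `##` continuation prefix used by BERT vocabularies and a tiny hypothetical vocabulary. The real tokenizer additionally handles lowercasing, punctuation, and special tokens, so treat this as an illustration rather than the library's implementation.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # Mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # No piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary for illustration only
vocab = {"un", "##aff", "##able", "play", "##ing"}

for word in ["unaffable", "playing", "xyz"]:
    print(word, wordpiece(word, vocab))
```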

## Utility Functions

### load_tf_weights_in_bert

```python { .api }
def load_tf_weights_in_bert(model, tf_checkpoint_path):
    """
    Load TensorFlow BERT checkpoint weights into a PyTorch BERT model.

    Parameters:
    - model (BertModel): PyTorch BERT model
    - tf_checkpoint_path (str): Path to TensorFlow checkpoint

    Returns:
    BertModel: Model with loaded weights
    """
```

## Archive Maps

```python { .api }
BERT_PRETRAINED_MODEL_ARCHIVE_MAP: Dict[str, str]
# Maps model names to download URLs for pre-trained weights

BERT_PRETRAINED_CONFIG_ARCHIVE_MAP: Dict[str, str]
# Maps model names to download URLs for configurations
```

**Available Pre-trained Models:**

- `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
- `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
- `bert-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters (cased)
- `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters (cased)
- `bert-base-multilingual-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters (multilingual)
- `bert-base-multilingual-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters (multilingual, cased)
- `bert-base-chinese`: 12-layer, 768-hidden, 12-heads, 110M parameters (Chinese)