# BERT Models

BERT (Bidirectional Encoder Representations from Transformers) models for various NLP tasks. BERT's self-attention conditions every token on both its left and right context, which makes it well suited to language-understanding tasks such as classification, question answering, and token-level prediction.

## Capabilities

### BertConfig

Configuration class for BERT models that holds all hyperparameters and architecture settings.

```python { .api }
class BertConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        **kwargs
    ):
        """
        Configuration for BERT models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - hidden_size (int): Hidden layer dimensionality
        - num_hidden_layers (int): Number of transformer layers
        - num_attention_heads (int): Number of attention heads per layer
        - intermediate_size (int): Feed-forward layer dimensionality
        - hidden_act (str): Activation function ("gelu", "relu", "swish")
        - hidden_dropout_prob (float): Dropout probability for hidden layers
        - attention_probs_dropout_prob (float): Dropout for attention probabilities
        - max_position_embeddings (int): Maximum sequence length
        - type_vocab_size (int): Number of token type embeddings
        - initializer_range (float): Weight initialization range
        - layer_norm_eps (float): Layer normalization epsilon
        """
```
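
To make the relationship between these hyperparameters concrete, the following sketch estimates the parameter count of the base encoder from the configuration values alone. It is a back-of-the-envelope calculation in plain Python, not part of the library: the helper name `bert_parameter_count` is hypothetical, and it assumes the standard BERT layout (three embedding tables, four attention projections and a two-layer feed-forward block per layer, and a pooler).

```python
def bert_parameter_count(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=2,
):
    """Estimate the parameter count of the base BERT encoder (hypothetical helper)."""
    # Each attention head works on an equal slice of the hidden vector.
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")

    layer_norm = 2 * hidden_size  # scale and bias

    # Word, position and token-type embedding tables plus one LayerNorm.
    embeddings = (
        (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
        + layer_norm
    )

    # Query, key, value and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm

    # Two dense layers of the feed-forward block plus LayerNorm.
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )

    # Dense layer applied to the first token to produce the pooled output.
    pooler = hidden_size * hidden_size + hidden_size

    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


head_dim = 768 // 12              # per-head dimensionality for the defaults
total = bert_parameter_count()    # defaults mirror BertConfig()
print(head_dim, f"{total / 1e6:.1f}M")  # 64 109.5M
```

With the default values this lands at roughly 110M parameters, which matches the size quoted for `bert-base-uncased`.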

### BertModel

Base BERT model for encoding sequences into contextualized representations.

```python { .api }
class BertModel(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT base model.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through BERT model.

        Parameters:
        - input_ids (torch.Tensor): Token IDs of shape (batch_size, sequence_length)
        - attention_mask (torch.Tensor): Mask that keeps attention away from padding tokens (1 = attend, 0 = ignore)
        - token_type_ids (torch.Tensor): Segment token indices for sentence pairs
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Mask to nullify selected heads
        - inputs_embeds (torch.Tensor): Pre-computed embeddings

        Returns:
        BaseModelOutputWithPooling: Object with last_hidden_state and pooler_output
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertModel, BertTokenizer
import torch

# Load model and tokenizer
model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare input
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)

# Access representations
last_hidden_state = outputs.last_hidden_state  # Shape: (1, seq_len, 768)
pooled_output = outputs.pooler_output  # Shape: (1, 768)

print(f"Sequence representation shape: {last_hidden_state.shape}")
print(f"Pooled representation shape: {pooled_output.shape}")
```
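
When several sequences of different lengths are batched together, the `attention_mask` parameter described above tells the model which positions are padding. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper, the token IDs are illustrative, and a padding ID of 0 is an assumption rather than something the library guarantees.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should not attend to.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


# Two already-tokenized sequences of different lengths (illustrative IDs).
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
input_ids, attention_mask = pad_batch(batch)

print(input_ids[0])       # [101, 7592, 102, 0, 0]
print(attention_mask[0])  # [1, 1, 1, 0, 0]
```

Both lists can then be converted to tensors of shape `(batch_size, sequence_length)` and passed as `input_ids` and `attention_mask`.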

### BertPreTrainedModel

Abstract base class for all BERT models that handles weight initialization and provides a simple interface for downloading and loading pre-trained models.

```python { .api }
class BertPreTrainedModel(PreTrainedModel):
    config_class = BertConfig
    pretrained_model_archive_map = BERT_PRETRAINED_MODEL_ARCHIVE_MAP
    load_tf_weights = load_tf_weights_in_bert
    base_model_prefix = "bert"

    def _init_weights(self, module):
        """
        Initialize the weights for BERT models.

        Parameters:
        - module (nn.Module): Module to initialize
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertConfig, BertModel, BertPreTrainedModel
import torch.nn as nn

# BertPreTrainedModel is typically used as a base class for custom BERT models
class CustomBertModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # The attribute name matches base_model_prefix ("bert")
        self.bert = BertModel(config)
        # Illustrative custom head on top of the pooled output
        self.head = nn.Linear(config.hidden_size, 1)
        # Apply BERT's _init_weights to every submodule
        self.init_weights()

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        return self.head(outputs.pooler_output)

# Build a randomly initialized model from the default configuration
config = BertConfig()
model = CustomBertModel(config)
```

### BertForPreTraining

BERT model for pre-training with both masked language modeling and next sentence prediction heads.

```python { .api }
class BertForPreTraining(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for pre-training with MLM and NSP heads.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        masked_lm_labels=None,
        next_sentence_label=None
    ):
        """
        Forward pass for pre-training with MLM and NSP tasks.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - masked_lm_labels (torch.Tensor): Labels for MLM loss
        - next_sentence_label (torch.Tensor): Labels for NSP loss

        Returns:
        BertForPreTrainingOutput: Object with prediction_logits, seq_relationship_logits, and losses
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertForPreTraining, BertTokenizer
import torch

# Load model and tokenizer
model = BertForPreTraining.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare pre-training data
text_a = "The cat sat on the"
text_b = "mat and slept peacefully"
inputs = tokenizer(text_a, text_b, return_tensors="pt")

# Keep the original token IDs as MLM labels before masking anything
masked_lm_labels = inputs["input_ids"].clone()

# Replace one token with [MASK]; position 0 is [CLS], so position 4 is "on"
inputs["input_ids"][0, 4] = tokenizer.mask_token_id

# Only masked positions contribute to the MLM loss; -1 marks positions to ignore
masked_lm_labels[inputs["input_ids"] != tokenizer.mask_token_id] = -1

# Add NSP label (0 = sentence B follows A, 1 = random sentence B)
next_sentence_label = torch.tensor([0])

# Forward pass
outputs = model(
    **inputs,
    masked_lm_labels=masked_lm_labels,
    next_sentence_label=next_sentence_label,
)

# The returned loss is the sum of the MLM and NSP losses
print(f"Total pre-training loss (MLM + NSP): {outputs.loss}")
print(f"NSP predictions: {torch.softmax(outputs.seq_relationship_logits, dim=-1)}")
```
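
The label construction in the example above is easy to get wrong, because the labels must be captured before the input is masked. The following plain-Python sketch isolates that step. `build_mlm_example` is a hypothetical helper, the token IDs are illustrative, and the ignore value of -1 simply follows the convention used in the example above. It is also a simplification: actual BERT pre-training selects positions at random and sometimes substitutes a random or unchanged token instead of `[MASK]`.

```python
def build_mlm_example(token_ids, positions_to_mask, mask_token_id, ignore_index=-1):
    """Return (masked_ids, labels) for masked language modeling."""
    masked_ids = list(token_ids)                # copy, so the input stays intact
    labels = [ignore_index] * len(token_ids)    # ignored unless masked
    for pos in positions_to_mask:
        labels[pos] = token_ids[pos]            # remember the true token
        masked_ids[pos] = mask_token_id         # hide it from the model
    return masked_ids, labels


# Illustrative IDs standing in for "[CLS] the cat sat on the [SEP]".
token_ids = [101, 1996, 4937, 2938, 2006, 1996, 102]
masked_ids, labels = build_mlm_example(
    token_ids, positions_to_mask=[4], mask_token_id=103
)

print(masked_ids)  # [101, 1996, 4937, 2938, 103, 1996, 102]
print(labels)      # [-1, -1, -1, -1, 2006, -1, -1]
```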

### BertForNextSentencePrediction

BERT model with only a next sentence prediction head for determining if two sentences are consecutive.

```python { .api }
class BertForNextSentencePrediction(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for next sentence prediction task.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        next_sentence_label=None
    ):
        """
        Forward pass for next sentence prediction.

        Parameters:
        - input_ids (torch.Tensor): Token IDs for sentence pair
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices (0 for sentence A, 1 for sentence B)
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - next_sentence_label (torch.Tensor): Labels (0=consecutive, 1=random)

        Returns:
        NextSentencePredictorOutput: Object with logits and loss
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertForNextSentencePrediction, BertTokenizer
import torch

# Load model and tokenizer
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare sentence pairs
sentence_a = "The weather is nice today"
sentence_b = "I think I'll go for a walk"  # Consecutive sentence
sentence_c = "Machine learning is fascinating"  # Random sentence

# Encode pairs
consecutive_inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
random_inputs = tokenizer(sentence_a, sentence_c, return_tensors="pt")

# Predict
with torch.no_grad():
    consecutive_outputs = model(**consecutive_inputs)
    random_outputs = model(**random_inputs)

# Get predictions (0=consecutive, 1=random)
consecutive_probs = torch.softmax(consecutive_outputs.logits, dim=-1)
random_probs = torch.softmax(random_outputs.logits, dim=-1)

print(f"Consecutive pair - P(consecutive): {consecutive_probs[0, 0]:.3f}")
print(f"Random pair - P(consecutive): {random_probs[0, 0]:.3f}")
```
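
The softmax step in the example above turns the two raw scores into probabilities for "consecutive" (index 0) and "random" (index 1). As a rough illustration of that conversion without any tensor library, here is a minimal sketch in plain Python. The logit values are made up, and `softmax` is a local helper, not a library function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores: index 0 = sentence B follows A, index 1 = random sentence B.
nsp_logits = [3.2, -1.1]
p_consecutive, p_random = softmax(nsp_logits)

# Pick the more probable class as the predicted label.
label = 0 if p_consecutive >= p_random else 1
print(f"P(consecutive) = {p_consecutive:.3f}, predicted label = {label}")
```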

### BertForMaskedLM

BERT model with a language modeling head for masked language modeling (MLM) tasks.

```python { .api }
class BertForMaskedLM(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for masked language modeling.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for masked language modeling.

        Parameters:
        - input_ids (torch.Tensor): Token IDs with [MASK] tokens
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): True token IDs for masked positions

        Returns:
        MaskedLMOutput: Object with loss and prediction_scores
        """
```
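
The model's prediction scores contain one score per vocabulary entry for every input position, so filling in a `[MASK]` comes down to ranking the scores at the masked position. A minimal sketch of that ranking step follows, using a five-word toy vocabulary and made-up scores. `top_k_tokens` is a hypothetical helper, not part of the library.

```python
def top_k_tokens(scores, id_to_token, k=3):
    """Return the k tokens with the highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [id_to_token[i] for i in ranked[:k]]


# Toy vocabulary and made-up scores for one masked position, as might be
# produced for "The capital of France is [MASK]."
id_to_token = ["[PAD]", "paris", "london", "berlin", "banana"]
mask_scores = [-4.0, 6.1, 3.3, 2.9, -1.5]

print(top_k_tokens(mask_scores, id_to_token, k=2))  # ['paris', 'london']
```

With real model output, the same ranking would be applied to the row of scores at the index of the `[MASK]` token, and the resulting IDs mapped back to tokens with the tokenizer.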

### BertForSequenceClassification

BERT model with a classification head for sequence-level classification tasks.

```python { .api }
class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for sequence classification.

        Parameters:
        - config (BertConfig): Model configuration with num_labels
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for sequence classification.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): Classification labels

        Returns:
        SequenceClassifierOutput: Object with loss and logits
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertForSequenceClassification, BertTokenizer
import torch

# Load model for binary classification
# Note: "bert-base-uncased" ships without a classification head, so the head
# is newly initialized here; fine-tune before trusting the probabilities
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare input
text = "This movie is fantastic!"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(f"Positive probability: {predictions[0][1].item():.3f}")
```
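
When `labels` is passed, the model also returns a loss. The sketch below illustrates how such a loss is commonly computed for a single example, assuming the usual convention of cross-entropy when `num_labels > 1` and squared error when `num_labels == 1` (regression). It is written in plain Python for clarity; `sequence_classification_loss` is a hypothetical name and the logits are made up.

```python
import math


def sequence_classification_loss(logits, label):
    """Loss for one example: squared error for 1 logit, cross-entropy otherwise."""
    if len(logits) == 1:
        # Regression: the single output is compared to a float target.
        return (logits[0] - label) ** 2
    # Classification: negative log-probability of the correct class,
    # computed with the log-sum-exp trick for numerical stability.
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_norm - logits[label]


# A confident, correct prediction yields a small loss ...
print(sequence_classification_loss([2.0, 0.5], label=0))
# ... while the same logits with the other label yield a larger one.
print(sequence_classification_loss([2.0, 0.5], label=1))
```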

### BertForQuestionAnswering

BERT model with a span classification head for extractive question answering.

```python { .api }
class BertForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for question answering.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        start_positions=None,
        end_positions=None
    ):
        """
        Forward pass for question answering.

        Parameters:
        - input_ids (torch.Tensor): Token IDs for question and context
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment IDs (0 for question, 1 for context)
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - start_positions (torch.Tensor): Start positions of answer spans
        - end_positions (torch.Tensor): End positions of answer spans

        Returns:
        QuestionAnsweringModelOutput: Object with loss, start_logits, end_logits
        """
```

**Usage Example:**

```python
from pytorch_transformers import BertForQuestionAnswering, BertTokenizer
import torch

# Load model and tokenizer
# Note: "bert-base-uncased" has no fine-tuned QA head, so the span head starts
# out randomly initialized; fine-tune on QA data to get meaningful answers
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Prepare question and context
question = "What is the capital of France?"
context = "France is a country in Europe. The capital of France is Paris."

# Tokenize with proper formatting
inputs = tokenizer.encode_plus(
    question,
    context,
    return_tensors="pt",
    max_length=512,
    truncation=True
)

# Get answer span predictions
with torch.no_grad():
    outputs = model(**inputs)
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

# Pick the highest-scoring start and end positions independently
# (this can yield end_idx < start_idx, which produces an empty answer)
start_idx = torch.argmax(start_scores).item()
end_idx = torch.argmax(end_scores).item()

# Extract answer
answer_tokens = inputs["input_ids"][0][start_idx:end_idx + 1]
answer = tokenizer.decode(answer_tokens)
print(f"Answer: {answer}")
```
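
Taking the argmax of the start and end scores independently, as in the example above, can produce an end position that comes before the start. A common remedy is to search for the best valid pair jointly. The following is a minimal sketch of that idea in plain Python with made-up scores. `best_span` and the `max_answer_len` limit are illustrative assumptions, not part of the library.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start + end score with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up scores where independent argmax would give start=2, end=0.
start_scores = [0.1, 0.2, 5.0, 0.3, 0.1]
end_scores = [4.0, 0.1, 0.2, 3.5, 0.2]

print(best_span(start_scores, end_scores))  # (2, 3)
```

The returned indices can be used in place of `start_idx` and `end_idx` when slicing the input IDs.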

### BertForTokenClassification

BERT model with a token classification head for token-level tasks like named entity recognition.

```python { .api }
class BertForTokenClassification(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for token classification.

        Parameters:
        - config (BertConfig): Model configuration with num_labels
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for token classification.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): Token-level labels
        """
```
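
Token-level labels usually come per word, while BERT operates on WordPiece subwords, so the labels have to be aligned before they can be passed as `labels`. One common scheme, sketched below in plain Python, gives the label to the first subword of each word and marks the remaining subwords so the loss can skip them. `align_labels` and the toy tokenization table are hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; whether the model's loss honors it is an assumption to verify for the installed version.

```python
def align_labels(words, word_labels, tokenize, ignore_index=-100):
    """Expand word-level labels to subword level (label on first subword only)."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        # First piece carries the label; continuation pieces are ignored.
        labels.extend([label] + [ignore_index] * (len(pieces) - 1))
    return tokens, labels


# Toy stand-in for a WordPiece tokenizer (hypothetical splits).
toy_pieces = {
    "Johanson": ["johan", "##son"],
    "lives": ["lives"],
    "in": ["in"],
    "Oslo": ["oslo"],
}

words = ["Johanson", "lives", "in", "Oslo"]
word_labels = [1, 0, 0, 2]  # e.g. 1 = person, 2 = location, 0 = other

tokens, labels = align_labels(words, word_labels, toy_pieces.__getitem__)
print(tokens)  # ['johan', '##son', 'lives', 'in', 'oslo']
print(labels)  # [1, -100, 0, 0, 2]
```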

### BertForMultipleChoice

BERT model for multiple choice tasks with a classification head over multiple choice options.

```python { .api }
class BertForMultipleChoice(BertPreTrainedModel):
    def __init__(self, config):
        """
        Initialize BERT for multiple choice.

        Parameters:
        - config (BertConfig): Model configuration
        """

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for multiple choice.

        Parameters:
        - input_ids (torch.Tensor): Token IDs of shape (batch_size, num_choices, seq_len)
        - attention_mask (torch.Tensor): Attention mask
        - token_type_ids (torch.Tensor): Segment token indices
        - position_ids (torch.Tensor): Position indices
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings
        - labels (torch.Tensor): Correct choice indices

        Returns:
        MultipleChoiceModelOutput: Object with loss and logits
        """
```
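
The three-dimensional input shape `(batch_size, num_choices, seq_len)` is what sets this model apart. Conceptually, each choice is encoded as its own sequence, scored with a single value, and the scores are regrouped per example so the best choice can be picked. The sketch below walks through that reshaping with nested Python lists. The helper names, token IDs and scores are all illustrative, and the per-sequence scores stand in for what the encoder and head would produce.

```python
def flatten_choices(batch):
    """(batch_size, num_choices, seq_len) -> (batch_size * num_choices, seq_len)."""
    return [choice for example in batch for choice in example]


def unflatten_scores(scores, num_choices):
    """(batch_size * num_choices,) -> (batch_size, num_choices)."""
    return [scores[i:i + num_choices] for i in range(0, len(scores), num_choices)]


# Two examples with two answer choices each (illustrative token IDs).
batch = [
    [[101, 1, 102], [101, 2, 102]],
    [[101, 3, 102], [101, 4, 102]],
]

flat = flatten_choices(batch)          # four sequences of length 3
scores = [0.2, 1.5, 2.0, -0.3]         # made-up score per flattened sequence
per_example = unflatten_scores(scores, num_choices=2)

# The predicted choice is the index of the highest score in each row.
predictions = [max(range(len(row)), key=row.__getitem__) for row in per_example]
print(predictions)  # [1, 0]
```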

### BertTokenizer

WordPiece tokenizer for BERT models with proper handling of special tokens and subword tokenization.

```python { .api }
class BertTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        do_lower_case=True,
        do_basic_tokenize=True,
        never_split=None,
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        tokenize_chinese_chars=True,
        **kwargs
    ):
        """
        Initialize BERT tokenizer.

        Parameters:
        - vocab_file (str): Path to vocabulary file
        - do_lower_case (bool): Whether to lowercase input
        - do_basic_tokenize (bool): Whether to do basic tokenization
        - never_split (List[str]): Tokens never to split
        - unk_token (str): Unknown token
        - sep_token (str): Separator token
        - pad_token (str): Padding token
        - cls_token (str): Classification token
        - mask_token (str): Mask token
        - tokenize_chinese_chars (bool): Whether to tokenize Chinese characters
        """
```
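
WordPiece splits each word into the longest vocabulary pieces it can find, scanning left to right and prefixing continuation pieces with `##`. The sketch below illustrates that greedy longest-match-first procedure on a toy vocabulary. It is a simplified stand-in written for this page, not the library's implementation, and the function name and vocabulary are made up.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No vocabulary entry matches: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


vocab = {"un", "##aff", "##able", "play", "##ing"}

print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", vocab))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", vocab))        # ['[UNK]']
```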

## Utility Functions

### load_tf_weights_in_bert

```python { .api }
def load_tf_weights_in_bert(model, tf_checkpoint_path):
    """
    Load TensorFlow BERT checkpoint weights into a PyTorch BERT model.

    Parameters:
    - model (BertModel): PyTorch BERT model
    - tf_checkpoint_path (str): Path to TensorFlow checkpoint

    Returns:
    BertModel: Model with loaded weights
    """
```

## Archive Maps

```python { .api }
BERT_PRETRAINED_MODEL_ARCHIVE_MAP: Dict[str, str]
# Maps model names to download URLs for pre-trained weights

BERT_PRETRAINED_CONFIG_ARCHIVE_MAP: Dict[str, str]
# Maps model names to download URLs for configurations
```

**Available Pre-trained Models:**

- `bert-base-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters
- `bert-large-uncased`: 24-layer, 1024-hidden, 16-heads, 340M parameters
- `bert-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters (cased)
- `bert-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters (cased)
- `bert-base-multilingual-uncased`: 12-layer, 768-hidden, 12-heads, 110M parameters (multilingual)
- `bert-base-multilingual-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters (multilingual, cased)
- `bert-base-chinese`: 12-layer, 768-hidden, 12-heads, 110M parameters (Chinese)