
# Other Transformer Models

Additional transformer architectures beyond BERT and GPT-2, including OpenAI GPT, Transformer-XL, XLNet, XLM, RoBERTa, and DistilBERT. Each architecture has specific design characteristics optimized for different NLP tasks and languages.

## Capabilities

### XLNet Models

XLNet uses permutation-based training and relative positional encodings, combining the best of autoregressive and autoencoding approaches.

#### XLNetConfig

```python { .api }
class XLNetConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=32000,
        d_model=1024,
        n_layer=24,
        n_head=16,
        d_inner=4096,
        ff_activation="gelu",
        untie_r=True,
        attn_type="bi",
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        dropout=0.1,
        mem_len=None,
        reuse_len=None,
        bi_data=False,
        clamp_len=-1,
        same_length=False,
        **kwargs
    ):
        """
        Configuration for XLNet models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - d_model (int): Hidden layer dimensionality
        - n_layer (int): Number of transformer layers
        - n_head (int): Number of attention heads
        - d_inner (int): Feed-forward layer dimensionality
        - ff_activation (str): Feed-forward activation function
        - untie_r (bool): Whether to untie relative position biases across layers
        - attn_type (str): Attention type ("bi" for bidirectional)
        - dropout (float): Dropout probability
        - mem_len (int): Memory length for recurrence
        - reuse_len (int): Reuse length for recurrence
        """
```

#### XLNetModel

```python { .api }
class XLNetModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        mems=None,
        perm_mask=None,
        target_mapping=None,
        token_type_ids=None,
        input_mask=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through XLNet model.

        Parameters:
        - input_ids (torch.Tensor): Token IDs
        - attention_mask (torch.Tensor): Attention mask
        - mems (List[torch.Tensor]): Memory from previous segments
        - perm_mask (torch.Tensor): Permutation mask for attention
        - target_mapping (torch.Tensor): Target mapping for partial prediction
        - token_type_ids (torch.Tensor): Segment token indices
        - input_mask (torch.Tensor): Input mask
        - head_mask (torch.Tensor): Head mask
        - inputs_embeds (torch.Tensor): Pre-computed embeddings

        Returns:
            XLNetModelOutput: Object with last_hidden_state and mems
        """
```
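The `perm_mask` argument encodes XLNet's permutation-language-modeling constraint: entry `[i, j] = 1` means position `i` may *not* attend to position `j`, and under a sampled factorization order a token may attend only to tokens that precede it in that order. A minimal pure-Python sketch of deriving such a mask (illustrative only, not library code):

```python
# Build a permutation mask from a factorization order.
# mask[i][j] = 1.0 means token i may NOT attend to token j;
# i may attend to j only if j comes before i in the sampled order.
def build_perm_mask(order):
    rank = {pos: r for r, pos in enumerate(order)}
    n = len(order)
    return [[0.0 if rank[j] < rank[i] else 1.0 for j in range(n)]
            for i in range(n)]

mask = build_perm_mask([2, 0, 3, 1])  # factorization order for 4 tokens
print(mask[0])  # → [1.0, 1.0, 0.0, 1.0]: token 0 sees only token 2
print(mask[2])  # → [1.0, 1.0, 1.0, 1.0]: token 2 is first, sees nothing
```

During pretraining a fresh order is sampled per sequence; at plain inference time `perm_mask` can simply be left as `None`.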

#### XLNetForSequenceClassification

```python { .api }
class XLNetForSequenceClassification(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        mems=None,
        perm_mask=None,
        target_mapping=None,
        token_type_ids=None,
        input_mask=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for XLNet sequence classification.

        Returns:
            SequenceClassifierOutput: Object with loss and logits
        """
```

#### XLNetTokenizer

```python { .api }
class XLNetTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        do_lower_case=False,
        remove_space=True,
        keep_accents=False,
        bos_token="<s>",
        eos_token="</s>",
        unk_token="<unk>",
        sep_token="<sep>",
        pad_token="<pad>",
        cls_token="<cls>",
        mask_token="<mask>",
        **kwargs
    ):
        """
        SentencePiece-based tokenizer for XLNet.
        """
```

#### SPIECE_UNDERLINE

```python { .api }
SPIECE_UNDERLINE: str = "▁"
# SentencePiece underline character used by the XLNet tokenizer
# Marks the beginning of words in subword tokenization
```
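Because `SPIECE_UNDERLINE` marks word boundaries, reassembling text from pieces is just a join-and-replace. A small sketch (the pieces below are illustrative, not actual `XLNetTokenizer` output):

```python
# Re-assemble text from SentencePiece pieces: the "▁" meta symbol
# marks where a word begins, so it maps back to a leading space.
SPIECE_UNDERLINE = "\u2581"

def pieces_to_text(pieces):
    return "".join(pieces).replace(SPIECE_UNDERLINE, " ").strip()

print(pieces_to_text(["\u2581This", "\u2581is", "\u2581great", "!"]))
# → This is great!
```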

**Usage Example:**

```python
import torch
from pytorch_transformers import XLNetForSequenceClassification, XLNetTokenizer

model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

text = "This is a great movie!"
input_ids = torch.tensor([tokenizer.encode(text)])
outputs = model(input_ids)
```

### RoBERTa Models

RoBERTa (Robustly Optimized BERT Pretraining Approach) improves upon BERT with better training procedures and hyperparameters.

#### RobertaConfig

```python { .api }
class RobertaConfig(BertConfig):
    def __init__(self, pad_token_id=1, bos_token_id=0, eos_token_id=2, **kwargs):
        """
        Configuration for RoBERTa models (extends BertConfig).

        Parameters:
        - pad_token_id (int): Padding token ID
        - bos_token_id (int): Beginning-of-sequence token ID
        - eos_token_id (int): End-of-sequence token ID
        """
```

#### RobertaModel

```python { .api }
class RobertaModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through RoBERTa model.

        Returns:
            BaseModelOutputWithPooling: Object with last_hidden_state and pooler_output
        """
```

#### RobertaForMaskedLM

```python { .api }
class RobertaForMaskedLM(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for RoBERTa masked language modeling.

        Returns:
            MaskedLMOutput: Object with loss and prediction_scores
        """
```
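For masked-LM training, `labels` only needs to carry the true token IDs at masked positions; all other positions are excluded from the loss via an ignore index. A sketch of that construction (the ignore index of `-1` and the token IDs are assumptions for illustration; check your version's loss configuration):

```python
# Build masked-LM labels so only masked positions contribute to the
# loss.  IGNORE_INDEX is an assumption here; verify against the
# library version you use.
IGNORE_INDEX = -1

def mlm_labels(original_ids, is_masked):
    # keep the true token id at masked positions, ignore the rest
    return [tok if masked else IGNORE_INDEX
            for tok, masked in zip(original_ids, is_masked)]

labels = mlm_labels([0, 713, 16, 372, 2], [False, False, True, False, False])
print(labels)  # → [-1, -1, 16, -1, -1]
```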

#### RobertaTokenizer

```python { .api }
class RobertaTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        merges_file,
        errors="replace",
        bos_token="<s>",
        eos_token="</s>",
        sep_token="</s>",
        cls_token="<s>",
        unk_token="<unk>",
        pad_token="<pad>",
        mask_token="<mask>",
        add_prefix_space=False,
        **kwargs
    ):
        """
        RoBERTa tokenizer (inherits from GPT2Tokenizer with different special tokens).
        """
```

### DistilBERT Models

DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT's language-understanding performance.

#### DistilBertConfig

```python { .api }
class DistilBertConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=30522,
        max_position_embeddings=512,
        sinusoidal_pos_embds=False,
        n_layers=6,
        n_heads=12,
        dim=768,
        hidden_dim=3072,
        dropout=0.1,
        attention_dropout=0.1,
        activation="gelu",
        initializer_range=0.02,
        **kwargs
    ):
        """
        Configuration for DistilBERT models.

        Parameters:
        - vocab_size (int): Vocabulary size
        - max_position_embeddings (int): Maximum sequence length
        - sinusoidal_pos_embds (bool): Whether to use sinusoidal position embeddings
        - n_layers (int): Number of transformer layers
        - n_heads (int): Number of attention heads
        - dim (int): Hidden layer dimensionality
        - hidden_dim (int): Feed-forward layer dimensionality
        - dropout (float): Dropout probability
        - attention_dropout (float): Attention dropout probability
        - activation (str): Activation function
        """
```

#### DistilBertModel

```python { .api }
class DistilBertModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through DistilBERT model.

        Returns:
            BaseModelOutput: Object with last_hidden_state
        """
```

#### DistilBertForSequenceClassification

```python { .api }
class DistilBertForSequenceClassification(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None
    ):
        """
        Forward pass for DistilBERT sequence classification.

        Returns:
            SequenceClassifierOutput: Object with loss and logits
        """
```

#### DistilBertTokenizer

```python { .api }
class DistilBertTokenizer(PreTrainedTokenizer):
    # Identical to BertTokenizer - uses the same WordPiece tokenization
    pass
```
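The published 66M-parameter figure for `distilbert-base-uncased` can be sanity-checked directly from the `DistilBertConfig` defaults with a back-of-the-envelope count (biases, LayerNorm weights, and the classifier head are omitted; together they contribute well under 1%):

```python
# Rough parameter count from DistilBertConfig defaults.
vocab_size, max_pos, dim, hidden_dim, n_layers = 30522, 512, 768, 3072, 6

embeddings = (vocab_size + max_pos) * dim          # token + position tables
per_layer = 4 * dim * dim + 2 * dim * hidden_dim   # Q/K/V/O projections + FFN
total = embeddings + n_layers * per_layer
print(round(total / 1e6))  # → 66 (million), matching the published size
```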

### XLM Models

XLM (Cross-lingual Language Model) for multilingual understanding and cross-lingual transfer learning.

#### XLMConfig

```python { .api }
class XLMConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=30145,
        emb_dim=2048,
        n_layers=12,
        n_heads=16,
        dropout=0.1,
        attention_dropout=0.1,
        gelu_activation=True,
        sinusoidal_embeddings=False,
        causal=False,
        asm=False,
        n_langs=1,
        use_lang_emb=True,
        max_position_embeddings=512,
        **kwargs
    ):
        """
        Configuration for XLM models.
        """
```

#### XLMModel

```python { .api }
class XLMModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        langs=None,
        token_type_ids=None,
        position_ids=None,
        lengths=None,
        cache=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through XLM model.

        Returns:
            XLMModelOutput: Object with last_hidden_state
        """
```

#### XLMTokenizer

```python { .api }
class XLMTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        merges_file,
        unk_token="<unk>",
        bos_token="<s>",
        sep_token="</s>",
        pad_token="<pad>",
        cls_token="</s>",
        mask_token="<special1>",
        **kwargs
    ):
        """
        BPE tokenizer for XLM multilingual models.
        """
```
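When `use_lang_emb=True`, XLM adds a language embedding selected by the `langs` argument: a tensor of per-token language IDs aligned with `input_ids`. A sketch of building one (the `lang2id` mapping and token IDs here are illustrative assumptions; real IDs come from the pretrained model's configuration):

```python
# Build a per-token language-id list for a monolingual English input.
lang2id = {"en": 0, "fr": 1}     # assumption: example mapping only
input_ids = [14, 447, 22, 9]     # illustrative token ids
langs = [lang2id["en"]] * len(input_ids)
print(langs)  # → [0, 0, 0, 0]: every position tagged as English
```

For code-switched input, `langs` would simply change value at the positions where the language changes.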

### Transformer-XL Models

Transformer-XL enables learning longer-term dependencies with a segment-level recurrence mechanism and relative positional encodings.

#### TransfoXLConfig

```python { .api }
class TransfoXLConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=267735,
        cutoffs=[20000, 40000, 200000],
        d_model=1024,
        d_embed=1024,
        n_head=16,
        d_head=64,
        d_inner=4096,
        div_val=4,
        pre_lnorm=False,
        n_layer=18,
        mem_len=1600,
        clamp_len=1000,
        same_length=True,
        **kwargs
    ):
        """
        Configuration for Transformer-XL models.
        """
```

#### TransfoXLModel

```python { .api }
class TransfoXLModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        mems=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through Transformer-XL model.

        Returns:
            TransfoXLModelOutput: Object with last_hidden_state and mems
        """
```
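The recurrence works by feeding the `mems` returned from one segment back into the forward pass for the next: cached hidden states are prepended as read-only context, capped at `mem_len` positions. A toy sketch of that sliding memory (strings stand in for hidden-state tensors):

```python
# Conceptual Transformer-XL recurrence: each segment's hidden states
# are cached and prepended as memory for the next segment, keeping at
# most mem_len cached positions.
def update_mems(mems, new_hidden, mem_len):
    return (mems + new_hidden)[-mem_len:]

mems = []
for segment in (["h0", "h1"], ["h2", "h3"], ["h4", "h5"]):
    context = mems + segment           # attention sees memory + segment
    mems = update_mems(mems, segment, mem_len=3)
print(mems)  # → ['h3', 'h4', 'h5']
```

In practice the same loop shape applies: pass `mems=None` on the first segment, then pass each output's `mems` into the next call.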

#### TransfoXLTokenizer

```python { .api }
class TransfoXLTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        special=None,
        min_freq=0,
        max_size=None,
        lower_case=False,
        delimiter=None,
        vocab_file=None,
        **kwargs
    ):
        """
        Word-level tokenizer for Transformer-XL.
        """
```

### OpenAI GPT Models

The original OpenAI GPT (Generative Pre-trained Transformer) model.

#### OpenAIGPTConfig

```python { .api }
class OpenAIGPTConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=40478,
        n_positions=512,
        n_ctx=512,
        n_embd=768,
        n_layer=12,
        n_head=12,
        afn="gelu",
        resid_pdrop=0.1,
        embd_pdrop=0.1,
        attn_pdrop=0.1,
        layer_norm_epsilon=1e-5,
        initializer_range=0.02,
        **kwargs
    ):
        """
        Configuration for OpenAI GPT models.
        """
```

#### OpenAIGPTModel

```python { .api }
class OpenAIGPTModel(PreTrainedModel):
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None
    ):
        """
        Forward pass through OpenAI GPT model.

        Returns:
            BaseModelOutput: Object with last_hidden_state
        """
```
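Unlike XLNet's sampled permutation masks, GPT uses a fixed causal attention pattern: position `i` attends only to positions `j <= i`, which is what makes it a left-to-right language model. A sketch of that pattern:

```python
# GPT's causal attention pattern for 4 tokens.
# causal[i][j] == 1 means position i may attend to position j.
n = 4
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
print(causal[0])  # → [1, 0, 0, 0]: first token sees only itself
print(causal[3])  # → [1, 1, 1, 1]: last token sees the full prefix
```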

#### OpenAIGPTTokenizer

```python { .api }
class OpenAIGPTTokenizer(PreTrainedTokenizer):
    def __init__(
        self,
        vocab_file,
        merges_file,
        unk_token="<unk>",
        **kwargs
    ):
        """
        BPE tokenizer for OpenAI GPT.
        """
```

## Archive Maps and Model Names

### Available Pre-trained Models

**XLNet:**
- `xlnet-base-cased`: 12-layer, 768-hidden, 12-heads, 110M parameters
- `xlnet-large-cased`: 24-layer, 1024-hidden, 16-heads, 340M parameters

**RoBERTa:**
- `roberta-base`: 12-layer, 768-hidden, 12-heads, 125M parameters
- `roberta-large`: 24-layer, 1024-hidden, 16-heads, 355M parameters

**DistilBERT:**
- `distilbert-base-uncased`: 6-layer, 768-hidden, 12-heads, 66M parameters
- `distilbert-base-cased`: 6-layer, 768-hidden, 12-heads, 65M parameters

**XLM:**
- `xlm-mlm-en-2048`: English MLM model, 2048-hidden
- `xlm-mlm-100-1280`: 100-language MLM model, 1280-hidden

**Transformer-XL:**
- `transfo-xl-wt103`: Trained on WikiText-103, 1024-hidden, 18-layer

**OpenAI GPT:**
- `openai-gpt`: 12-layer, 768-hidden, 12-heads, 117M parameters

## Usage Examples

```python
import torch

# XLNet for sequence classification
from pytorch_transformers import XLNetForSequenceClassification, XLNetTokenizer

xlnet_model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)
xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

# RoBERTa for masked language modeling
from pytorch_transformers import RobertaForMaskedLM, RobertaTokenizer

roberta_model = RobertaForMaskedLM.from_pretrained("roberta-base")
roberta_tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# DistilBERT for efficient inference
from pytorch_transformers import DistilBertForSequenceClassification, DistilBertTokenizer

distilbert_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")
distilbert_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Process text with any of the model/tokenizer pairs above (DistilBERT shown)
text = "This is an example sentence."
input_ids = torch.tensor([distilbert_tokenizer.encode(text)])
outputs = distilbert_model(input_ids)
```