Repository of pre-trained NLP Transformer models: BERT & RoBERTa, GPT & GPT-2, Transformer-XL, XLNet and XLM
Specialized optimizers and learning rate schedulers for training and fine-tuning transformer models. These tools implement best practices for training large language models: decoupled weight decay, warmup schedules, and learning rate decay patterns.
Adam optimizer with a weight decay fix, designed for transformer models. Unlike standard Adam, which folds L2 regularization into the gradient, AdamW decouples weight decay and applies it directly to the parameters.
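As a rough sketch of the difference (illustrative only, not the library's actual implementation; the helper name and the single-tensor state handling are assumptions), the decoupled update shrinks each parameter directly by lr * weight_decay after the Adam step, rather than adding weight_decay * p to the gradient:

import torch

def adamw_step_sketch(p, grad, exp_avg, exp_avg_sq,
                      lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                      weight_decay=0.01):
    # Hypothetical single-tensor sketch; the real AdamW tracks per-parameter
    # state and optionally applies bias correction.
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)               # first moment
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    p.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-lr)   # Adam update
    p.mul_(1 - lr * weight_decay)                                 # decoupled weight decay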
class AdamW:
    def __init__(
        self,
        params,
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,
        correct_bias=True
    ):
        """
        Initialize AdamW optimizer.

        Parameters:
        - params: Iterable of parameters to optimize
        - lr (float): Learning rate
        - betas (Tuple[float, float]): Coefficients for the gradient and squared gradient moving averages
        - eps (float): Term added to the denominator for numerical stability
        - weight_decay (float): Weight decay coefficient
        - correct_bias (bool): Whether to correct bias in the moment estimates
        """

    def step(self, closure=None):
        """
        Perform a single optimization step.

        Parameters:
        - closure (callable, optional): Closure that reevaluates the model and returns the loss

        Returns:
        float: Loss value if a closure is provided
        """

    def zero_grad(self):
        """
        Clear the gradients of all optimized parameters.
        """

Usage Example:
from pytorch_transformers import AdamW, BertForSequenceClassification
import torch

# Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Initialize optimizer
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
    correct_bias=False
)

# Training step
inputs = torch.randint(0, 1000, (8, 128))  # Dummy input token IDs
labels = torch.randint(0, 2, (8,))         # Dummy labels

optimizer.zero_grad()
outputs = model(inputs, labels=labels)
loss = outputs[0]  # pytorch_transformers models return tuples; the loss is the first element
loss.backward()
optimizer.step()
print(f"Loss: {loss.item():.4f}")

Various learning rate scheduling strategies commonly used in transformer training, including warmup phases and different decay patterns.
Maintains a constant learning rate throughout training.
class ConstantLRSchedule:
    def __init__(self, optimizer, last_epoch=-1):
        """
        Create a constant learning rate schedule.

        Parameters:
        - optimizer: Wrapped optimizer
        - last_epoch (int): Index of the last epoch

        The resulting scheduler is a torch.optim.lr_scheduler.LambdaLR.
        """

Linear warmup followed by constant learning rate.
class WarmupConstantSchedule:
    def __init__(self, optimizer, warmup_steps, last_epoch=-1):
        """
        Create a schedule with linear warmup followed by a constant learning rate.

        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - last_epoch (int): Index of the last epoch

        The resulting scheduler is a torch.optim.lr_scheduler.LambdaLR.
        """

Linear warmup followed by linear decay to zero.
class WarmupLinearSchedule:
    def __init__(self, optimizer, warmup_steps, t_total, last_epoch=-1):
        """
        Create a schedule with linear warmup followed by linear decay.

        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - t_total (int): Total number of training steps
        - last_epoch (int): Index of the last epoch

        The resulting scheduler is a torch.optim.lr_scheduler.LambdaLR.
        """

Linear warmup followed by cosine annealing decay.
class WarmupCosineSchedule:
    def __init__(self, optimizer, warmup_steps, t_total, cycles=0.5, last_epoch=-1):
        """
        Create a schedule with linear warmup followed by cosine annealing.

        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - t_total (int): Total number of training steps
        - cycles (float): Number of cosine cycles (0.5 for half a cosine)
        - last_epoch (int): Index of the last epoch

        The resulting scheduler is a torch.optim.lr_scheduler.LambdaLR.
        """

Linear warmup followed by cosine annealing with hard restarts.
class WarmupCosineWithHardRestartsSchedule:
    def __init__(self, optimizer, warmup_steps, t_total, cycles=1.0, last_epoch=-1):
        """
        Create a schedule with linear warmup followed by cosine annealing with hard restarts.

        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - t_total (int): Total number of training steps
        - cycles (float): Number of restart cycles
        - last_epoch (int): Index of the last epoch

        The resulting scheduler is a torch.optim.lr_scheduler.LambdaLR.
        """

Usage Examples:
from pytorch_transformers import (
    AdamW,
    BertForSequenceClassification,
    WarmupLinearSchedule,
    WarmupCosineSchedule,
    WarmupConstantSchedule
)

# Setup model and optimizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training configuration
num_epochs = 3
num_training_steps = 1000
warmup_steps = 100

# Linear schedule with warmup
linear_scheduler = WarmupLinearSchedule(
    optimizer,
    warmup_steps=warmup_steps,
    t_total=num_training_steps
)

# Cosine schedule with warmup
cosine_scheduler = WarmupCosineSchedule(
    optimizer,
    warmup_steps=warmup_steps,
    t_total=num_training_steps,
    cycles=0.5
)

# Constant schedule with warmup
constant_scheduler = WarmupConstantSchedule(
    optimizer,
    warmup_steps=warmup_steps
)
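
# The two remaining schedules follow the same pattern. A sketch (these
# objects are not used by the loop below; cycles=2.0 is an arbitrary choice):
from pytorch_transformers import (
    ConstantLRSchedule,
    WarmupCosineWithHardRestartsSchedule,
)

# Constant schedule (no warmup, no decay)
constant_lr_scheduler = ConstantLRSchedule(optimizer)

# Cosine schedule with warmup and hard restarts
restart_scheduler = WarmupCosineWithHardRestartsSchedule(
    optimizer,
    warmup_steps=warmup_steps,
    t_total=num_training_steps,
    cycles=2.0
)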
# Training loop example
for epoch in range(num_epochs):
    for step in range(num_training_steps // num_epochs):
        # Training step
        optimizer.zero_grad()
        # ... forward pass, loss calculation, backward pass ...
        optimizer.step()
        linear_scheduler.step()  # Update learning rate

        # Log current learning rate
        current_lr = optimizer.param_groups[0]['lr']
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, LR: {current_lr:.2e}")

Fine-tuning Pre-trained Models:
Warmup Steps:
# Recommended setup for BERT fine-tuning
total_steps = len(train_dataloader) * num_epochs
warmup_steps = int(0.1 * total_steps)  # warm up over the first 10% of training

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
    correct_bias=False
)
scheduler = WarmupLinearSchedule(
    optimizer,
    warmup_steps=warmup_steps,
    t_total=total_steps
)

Recommended weight decay values: 0.01 for weight matrices and 0.0 for biases and LayerNorm parameters, applied via parameter groups:
# Apply weight decay only to weights, not biases or layer norms
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)

Gradient Clipping:
import torch.nn.utils as nn_utils
# Training step with gradient clipping
optimizer.zero_grad()
loss.backward()

# Clip gradients to prevent exploding gradients
nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()

Mixed Precision Training:
from torch.cuda.amp import autocast, GradScaler
# Initialize gradient scaler for mixed precision
scaler = GradScaler()

# Training step with mixed precision
optimizer.zero_grad()
with autocast():
    outputs = model(**inputs)  # here `inputs` is a dict of tensors, e.g. from a tokenizer
    loss = outputs[0]          # pytorch_transformers models return tuples

# Scale the loss, then backward pass and optimizer step
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()

Different learning rate schedules behave differently during training (compared in the sketch after this list):
Linear Schedule: Steady decrease after warmup
Cosine Schedule: Smooth decay following cosine curve
Constant Schedule: Maintains rate after warmup
Cosine with Restarts: Periodic learning rate increases
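To see these shapes concretely, the following sketch records the learning rate produced by each schedule over 1,000 steps (assuming the schedule classes documented above; the dummy parameter, step counts, and sampled indices are arbitrary choices):

import torch
from pytorch_transformers import (
    AdamW,
    WarmupLinearSchedule,
    WarmupCosineSchedule,
    WarmupConstantSchedule,
    WarmupCosineWithHardRestartsSchedule,
)

def lr_curve(make_scheduler, steps=1000):
    # Dummy parameter so the optimizer has something to track
    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = AdamW([param], lr=2e-5)
    scheduler = make_scheduler(optimizer)
    lrs = []
    for _ in range(steps):
        lrs.append(optimizer.param_groups[0]["lr"])
        optimizer.step()   # no-op here (no gradients), but keeps the usual ordering
        scheduler.step()
    return lrs

curves = {
    "linear": lr_curve(lambda o: WarmupLinearSchedule(o, warmup_steps=100, t_total=1000)),
    "cosine": lr_curve(lambda o: WarmupCosineSchedule(o, warmup_steps=100, t_total=1000)),
    "constant": lr_curve(lambda o: WarmupConstantSchedule(o, warmup_steps=100)),
    "restarts": lr_curve(lambda o: WarmupCosineWithHardRestartsSchedule(
        o, warmup_steps=100, t_total=1000, cycles=2.0)),
}

# Sample a few points to compare the shapes
for name, lrs in curves.items():
    print(name, [f"{lrs[i]:.2e}" for i in (0, 99, 500, 999)])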
Install with Tessl CLI
npx tessl i tessl/pypi-pytorch-transformers