tessl/pypi-pytorch-transformers

Repository of pre-trained NLP Transformer models: BERT & RoBERTa, GPT & GPT-2, Transformer-XL, XLNet and XLM

docs/optimization.md

Optimization

Specialized optimizers and learning rate schedulers designed for transformer model training and fine-tuning. These optimization tools implement best practices for training large language models with proper weight decay, warmup schedules, and learning rate decay patterns.

Capabilities

AdamW Optimizer

Adam optimizer with a decoupled weight decay fix, well suited to transformer models. Unlike standard Adam with L2 regularization, AdamW applies weight decay directly to the parameters rather than folding it into the gradients, so the decay is not rescaled by Adam's adaptive moment estimates.
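The difference can be illustrated with a hypothetical scalar update (a sketch with made-up values, not the library implementation):

```python
def l2_adam_gradient(grad, param, weight_decay):
    # Adam + L2 regularization: decay is folded into the gradient, so it
    # is later rescaled by Adam's adaptive denominator.
    return grad + weight_decay * param

def adamw_param_update(param, adam_step, lr, weight_decay):
    # AdamW: decay is subtracted directly from the parameter, decoupled
    # from the adaptive gradient statistics.
    return param - adam_step - lr * weight_decay * param

# Hypothetical numbers: the decay term shrinks the parameter independently
# of the Adam step.
new_param = adamw_param_update(param=0.5, adam_step=0.02, lr=1e-3, weight_decay=0.01)
print(round(new_param, 6))
```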

class AdamW:
    def __init__(
        self,
        params,
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0.01,
        correct_bias=True
    ):
        """
        Initialize AdamW optimizer.
        
        Parameters:
        - params: Iterable of parameters to optimize
        - lr (float): Learning rate
        - betas (Tuple[float, float]): Coefficients for gradient and squared gradient moving averages
        - eps (float): Term added to denominator for numerical stability
        - weight_decay (float): Weight decay coefficient
        - correct_bias (bool): Whether to correct bias in moment estimates
        """
    
    def step(self, closure=None):
        """
        Perform a single optimization step.
        
        Parameters:
        - closure (callable, optional): Closure that reevaluates model and returns loss
        
        Returns:
        float: Loss value if closure is provided
        """
    
    def zero_grad(self):
        """
        Clear gradients of all optimized parameters.
        """

Usage Example:

from pytorch_transformers import AdamW, BertForSequenceClassification
import torch

# Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Initialize optimizer
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
    correct_bias=False
)

# Training step
inputs = torch.randint(0, 1000, (8, 128))  # Dummy input
labels = torch.randint(0, 2, (8,))         # Dummy labels

optimizer.zero_grad()
outputs = model(inputs, labels=labels)
loss = outputs[0]  # pytorch_transformers models return tuples: (loss, logits, ...)
loss.backward()
optimizer.step()

print(f"Loss: {loss.item():.4f}")

Learning Rate Schedulers

Various learning rate scheduling strategies commonly used in transformer training, including warmup phases and different decay patterns. The schedulers below subclass `torch.optim.lr_scheduler.LambdaLR` and are stepped once per optimizer step (not per epoch).

ConstantLRSchedule

Maintains a constant learning rate throughout training.

class ConstantLRSchedule(LambdaLR):
    def __init__(self, optimizer, last_epoch=-1):
        """
        Constant learning rate schedule.
        
        Parameters:
        - optimizer: Wrapped optimizer
        - last_epoch (int): Index of last epoch
        """

WarmupConstantSchedule

Linear warmup followed by constant learning rate.

class WarmupConstantSchedule(LambdaLR):
    def __init__(self, optimizer, warmup_steps, last_epoch=-1):
        """
        Schedule with linear warmup followed by a constant learning rate.
        
        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - last_epoch (int): Index of last epoch
        """

WarmupLinearSchedule

Linear warmup followed by linear decay to zero.

class WarmupLinearSchedule(LambdaLR):
    def __init__(self, optimizer, warmup_steps, t_total, last_epoch=-1):
        """
        Schedule with linear warmup followed by linear decay.
        
        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - t_total (int): Total number of training steps
        - last_epoch (int): Index of last epoch
        """

WarmupCosineSchedule

Linear warmup followed by cosine annealing decay.

class WarmupCosineSchedule(LambdaLR):
    def __init__(self, optimizer, warmup_steps, t_total, cycles=0.5, last_epoch=-1):
        """
        Schedule with linear warmup followed by cosine annealing.
        
        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - t_total (int): Total number of training steps
        - cycles (float): Number of cosine cycles (0.5 for half cosine)
        - last_epoch (int): Index of last epoch
        """

WarmupCosineWithHardRestartsSchedule

Linear warmup followed by cosine annealing with hard restarts.

class WarmupCosineWithHardRestartsSchedule(LambdaLR):
    def __init__(self, optimizer, warmup_steps, t_total, cycles=1.0, last_epoch=-1):
        """
        Schedule with linear warmup followed by cosine annealing with hard restarts.
        
        Parameters:
        - optimizer: Wrapped optimizer
        - warmup_steps (int): Number of warmup steps
        - t_total (int): Total number of training steps
        - cycles (float): Number of restart cycles
        - last_epoch (int): Index of last epoch
        """

Usage Examples:

from pytorch_transformers import (
    AdamW, 
    WarmupLinearSchedule, 
    WarmupCosineSchedule,
    WarmupConstantSchedule
)

# Setup model and optimizer
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training configuration
num_epochs = 3
num_training_steps = 1000
warmup_steps = 100

# NOTE: attach only one scheduler to a given optimizer in practice;
# the three variants below are shown side by side for illustration.

# Linear schedule with warmup
linear_scheduler = WarmupLinearSchedule(
    optimizer,
    warmup_steps=warmup_steps,
    t_total=num_training_steps
)

# Cosine schedule with warmup
cosine_scheduler = WarmupCosineSchedule(
    optimizer,
    warmup_steps=warmup_steps,
    t_total=num_training_steps,
    cycles=0.5
)

# Constant schedule with warmup
constant_scheduler = WarmupConstantSchedule(
    optimizer,
    warmup_steps=warmup_steps
)

# Training loop example
for epoch in range(num_epochs):
    for step in range(num_training_steps // num_epochs):
        # Training step
        optimizer.zero_grad()
        # ... forward pass, loss calculation, backward pass ...
        optimizer.step()
        linear_scheduler.step()  # Update learning rate
        
        # Log current learning rate
        current_lr = optimizer.param_groups[0]['lr']
        if step % 100 == 0:
            print(f"Epoch {epoch}, Step {step}, LR: {current_lr:.2e}")

Optimization Best Practices

Learning Rate Selection

Fine-tuning Pre-trained Models:

  • BERT/RoBERTa: 2e-5, 3e-5, 5e-5
  • GPT-2: 1e-4, 2e-4, 5e-4
  • Smaller models: Higher learning rates (up to 1e-3)

Warmup Steps:

  • Typically 10% of total training steps
  • For short training: 500-1000 steps
  • For long training: 5000-10000 steps

# Recommended setup for BERT fine-tuning
total_steps = len(train_dataloader) * num_epochs
warmup_steps = int(0.1 * total_steps)

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01,
    correct_bias=False
)

scheduler = WarmupLinearSchedule(
    optimizer,
    warmup_steps=warmup_steps,
    t_total=total_steps
)

Weight Decay Configuration

Recommended weight decay values:

  • Default: 0.01
  • Larger models: 0.1
  • Smaller models: 0.001

Parameter groups with different weight decay:

# Apply weight decay only to weights, not biases or layer norms
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": 0.01,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]

optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)

Gradient Clipping

Clipping the global gradient norm (a max norm of 1.0 is a common default) guards against exploding gradients during backpropagation:

import torch.nn.utils as nn_utils

# Training step with gradient clipping
optimizer.zero_grad()
loss.backward()

# Clip gradients to prevent exploding gradients
nn_utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
scheduler.step()

Mixed Precision Training

Automatic mixed precision (PyTorch 1.6+) runs the forward pass in half precision to reduce memory use and speed up training, while the gradient scaler guards against float16 underflow:

from torch.cuda.amp import autocast, GradScaler

# Initialize gradient scaler for mixed precision
scaler = GradScaler()

# Training step with mixed precision
optimizer.zero_grad()

with autocast():
    outputs = model(**inputs)
    loss = outputs[0]  # pytorch_transformers models return tuples

# Scale loss and backward pass
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()

Schedule Visualization

Different learning rate schedules behave differently during training:

Linear Schedule: Steady decrease after warmup

  • Best for: Most fine-tuning tasks
  • Characteristics: Predictable, stable convergence

Cosine Schedule: Smooth decay following cosine curve

  • Best for: Long training runs, better final performance
  • Characteristics: Slower initial decay, faster final decay

Constant Schedule: Maintains rate after warmup

  • Best for: Continued pre-training, domain adaptation
  • Characteristics: No decay, constant exploration

Cosine with Restarts: Periodic learning rate increases

  • Best for: Finding better local minima, avoiding plateaus
  • Characteristics: Multiple convergence opportunities
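The qualitative differences above can be checked numerically with simplified stand-ins for the schedule multipliers (hypothetical helpers, not the library API):

```python
import math

def linear(step, warmup=100, total=1000):
    if step < warmup:
        return step / warmup
    return max(0.0, (total - step) / (total - warmup))

def cosine(step, warmup=100, total=1000, cycles=0.5):
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / (total - warmup)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * cycles * progress)))

def constant(step, warmup=100):
    return min(1.0, step / warmup)

# Cosine stays above linear mid-training (slower initial decay) and drops
# below it near the end (faster final decay); constant never decays.
for step in (50, 100, 500, 900, 1000):
    print(f"step {step:4d}  linear {linear(step):.3f}  "
          f"cosine {cosine(step):.3f}  constant {constant(step):.3f}")
```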

Install with Tessl CLI

npx tessl i tessl/pypi-pytorch-transformers
