
tessl/pypi-accelerate

HuggingFace Accelerate is a PyTorch library that simplifies distributed and mixed-precision training by abstracting away the boilerplate code needed for multi-GPU, TPU, and mixed-precision setups.


CLI Commands

Command-line tools for configuration, launching distributed training, memory estimation, and environment management. These tools provide an easy interface for setting up and managing distributed training workflows.

Capabilities

Configuration Management

Interactive configuration setup and management commands.

accelerate config

Description: Interactive configuration wizard that guides users through setting up distributed training configuration.

Options:

  • --config_file PATH - Specify custom config file location (default: ~/.cache/huggingface/accelerate/default_config.yaml)

Usage:

# Run interactive configuration
accelerate config

# Use custom config file location  
accelerate config --config_file ./my_config.yaml

Interactive Options:

  • Compute environment (local machine, SageMaker, etc.)
  • Distributed training type (no distributed, multi-GPU, multi-node, etc.)
  • Number of processes/GPUs to use
  • Mixed precision training mode (no, fp16, bf16, fp8)
  • DeepSpeed or FSDP configuration
  • Machine rank and addresses for multi-node training

Training Launch

Launch distributed training scripts with automatic environment setup.

accelerate launch [OPTIONS] SCRIPT [SCRIPT_ARGS...]

Description: Launch training scripts with automatic distributed training setup based on configuration.

Common Options:

  • --config_file PATH - Use specific config file
  • --cpu - Force CPU usage even if a GPU is available
  • --multi_gpu - Use multi-GPU training
  • --mixed_precision {no,fp16,bf16,fp8} - Mixed precision mode
  • --num_processes NUM - Number of processes to use
  • --num_machines NUM - Number of machines for multi-node training
  • --machine_rank RANK - Rank of current machine
  • --main_process_ip IP - IP address of main process
  • --main_process_port PORT - Port for main process communication
  • --deepspeed_config_file PATH - DeepSpeed configuration file
  • --fsdp_config_file PATH - FSDP configuration file
  • --dynamo_backend BACKEND - Torch Dynamo backend for compilation

Usage Examples:

# Basic single-GPU training
accelerate launch train.py --batch_size 32

# Multi-GPU training with mixed precision
accelerate launch --mixed_precision fp16 --num_processes 4 train.py

# Multi-node training
accelerate launch \
    --num_machines 2 \
    --num_processes 8 \
    --machine_rank 0 \
    --main_process_ip 192.168.1.100 \
    --main_process_port 29500 \
    train.py

# DeepSpeed training
accelerate launch \
    --deepspeed_config_file ds_config.json \
    --num_processes 4 \
    train.py

# With Torch compilation
accelerate launch \
    --dynamo_backend inductor \
    --mixed_precision bf16 \
    train.py

Environment Information

Display environment and configuration information for debugging.

accelerate env

Description: Show detailed information about the current Accelerate installation, hardware, and configuration.

Output includes:

  • Accelerate version and installation details
  • PyTorch version and CUDA availability
  • Hardware information (GPUs, memory, etc.)
  • Current configuration settings
  • Available optional dependencies

Usage:

# Show environment information
accelerate env

Memory Estimation

Estimate memory requirements for model training and inference.

accelerate estimate-memory [OPTIONS] MODEL_NAME

Description: Estimate GPU memory requirements for training or inference with specific models.

Options:

  • --library_name {transformers,timm,diffusers} - Model library (default: transformers)
  • --dtypes DTYPES - Data types to test (comma-separated: float32,float16,bfloat16,int8,int4)
  • --num_gpus NUM - Number of GPUs available
  • --trust_remote_code - Trust remote code when loading the model
  • --access_token TOKEN - Hugging Face access token

Usage Examples:

# Estimate memory for a model
accelerate estimate-memory microsoft/DialoGPT-medium

# Test multiple data types
accelerate estimate-memory \
    --dtypes float32,float16,bfloat16 \
    --num_gpus 2 \
    microsoft/DialoGPT-large

# With custom library
accelerate estimate-memory \
    --library_name timm \
    --dtypes float16,bfloat16 \
    resnet50

Training Setup Testing

Test distributed training setup and communication.

accelerate test [OPTIONS]

Description: Test distributed training setup by running a simple training loop to verify configuration.

Options:

  • --config_file PATH - Use specific config file
  • --num_processes NUM - Override number of processes

Usage:

# Test current configuration
accelerate test

# Test with specific number of processes
accelerate test --num_processes 4

Model Weight Merging

Merge sharded model checkpoints into a single file.

accelerate merge-weights [OPTIONS] INPUT_DIR OUTPUT_DIR

Description: Merge model weights that have been sharded across multiple files back into a single checkpoint file.

Options:

  • --model_name_or_path PATH - Model name or path for configuration
  • --torch_dtype {float16,bfloat16,float32} - Target data type
  • --safe_serialization - Use safetensors format for output

Usage:

# Merge sharded weights
accelerate merge-weights ./sharded_model ./merged_model

# With specific dtype and safe serialization
accelerate merge-weights \
    --torch_dtype float16 \
    --safe_serialization \
    ./sharded_model ./merged_model

TPU Utilities

TPU-specific utilities and commands.

accelerate tpu-config

Description: Configure TPU-specific settings for training.

Usage:

# Configure TPU settings
accelerate tpu-config

Configuration Migration

Migrate configuration to newer formats.

accelerate to-fsdp2 [OPTIONS]

Description: Convert existing FSDP configuration to FSDP2 format.

Options:

  • --config_file PATH - Input config file
  • --output_file PATH - Output config file

Usage:

# Convert FSDP config to FSDP2
accelerate to-fsdp2 --config_file old_config.yaml --output_file new_config.yaml

Configuration File Format

The configuration file uses YAML format and contains distributed training settings:

# Example configuration file
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
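Because the file is plain YAML, it can also be generated programmatically, e.g. in CI. A sketch using PyYAML (the values are illustrative, mirroring the example above):

```python
# Write an Accelerate config file programmatically. Keys mirror the example
# config above; values are illustrative. Requires PyYAML.
import yaml

config = {
    "compute_environment": "LOCAL_MACHINE",
    "distributed_type": "MULTI_GPU",
    "mixed_precision": "fp16",
    "machine_rank": 0,
    "num_machines": 1,
    "num_processes": 4,
    "use_cpu": False,
}

with open("my_config.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Then: accelerate launch --config_file my_config.yaml train.py
```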

Usage Examples

Complete Setup Workflow

# 1. Configure Accelerate
accelerate config
# Follow interactive prompts to set up configuration

# 2. Test the setup
accelerate test

# 3. Check environment
accelerate env

# 4. Estimate memory for your model
accelerate estimate-memory microsoft/DialoGPT-medium

# 5. Launch training
accelerate launch train.py --learning_rate 1e-4 --batch_size 16

Multi-Node Training Setup

# On main node (machine rank 0)
accelerate launch \
    --num_machines 2 \
    --num_processes 8 \
    --machine_rank 0 \
    --main_process_ip 192.168.1.100 \
    --main_process_port 29500 \
    --mixed_precision fp16 \
    train.py

# On worker node (machine rank 1)  
accelerate launch \
    --num_machines 2 \
    --num_processes 8 \
    --machine_rank 1 \
    --main_process_ip 192.168.1.100 \
    --main_process_port 29500 \
    --mixed_precision fp16 \
    train.py

DeepSpeed Integration

# Create DeepSpeed config file (ds_config.json)
cat > ds_config.json << EOF
{
    "train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu"
        }
    },
    "fp16": {
        "enabled": true
    }
}
EOF

# Launch with DeepSpeed
accelerate launch \
    --deepspeed_config_file ds_config.json \
    --num_processes 4 \
    train.py

Memory Optimization Workflow

# 1. Estimate memory requirements  
accelerate estimate-memory \
    --dtypes float32,float16,bfloat16,int8 \
    --num_gpus 2 \
    microsoft/DialoGPT-large

# 2. Based on results, configure with appropriate settings
accelerate config
# Select mixed precision based on memory estimates

# 3. Test configuration
accelerate test

# 4. Launch optimized training
accelerate launch \
    --mixed_precision bf16 \
    --gradient_accumulation_steps 8 \
    train.py --batch_size 4

Development and Debugging

# Debug distributed setup
accelerate test --num_processes 2

# Check detailed environment info
accelerate env

# Launch with verbose output for debugging
accelerate launch --debug train.py

# Test with different configurations
accelerate launch --cpu train.py  # Force CPU
accelerate launch --mixed_precision no train.py  # Disable mixed precision

Checkpoint Management

# After training with model sharding
ls ./my_model/
# pytorch_model-00001-of-00004.bin
# pytorch_model-00002-of-00004.bin  
# pytorch_model-00003-of-00004.bin
# pytorch_model-00004-of-00004.bin
# pytorch_model.bin.index.json

# Merge sharded weights
accelerate merge-weights \
    --safe_serialization \
    ./my_model ./my_model_merged

ls ./my_model_merged/
# model.safetensors (single merged file)
