HuggingFace Accelerate is a PyTorch library that simplifies distributed and mixed-precision training by abstracting away the boilerplate code needed for multi-GPU, multi-node, and TPU setups.
Command-line tools for configuration, launching distributed training, memory estimation, and environment management. These tools provide an easy interface for setting up and managing distributed training workflows.
Interactive configuration setup and management commands.
accelerate config

Description: Interactive configuration wizard that guides users through setting up a distributed training configuration.
Options:
--config_file PATH - Specify custom config file location (default: ~/.cache/huggingface/accelerate/default_config.yaml)

Usage:
# Run interactive configuration
accelerate config
# Use custom config file location
accelerate config --config_file ./my_config.yaml

Interactive Options:
The wizard prompts for the compute environment, the distributed type (e.g. multi-GPU, multi-node, DeepSpeed, FSDP, TPU), the number of machines and processes, the GPU IDs to use, and the mixed precision mode.
Launch distributed training scripts with automatic environment setup.
accelerate launch [OPTIONS] SCRIPT [SCRIPT_ARGS...]

Description: Launch training scripts with automatic distributed training setup based on the saved configuration.
Common Options:
--config_file PATH - Use specific config file
--cpu - Force CPU usage even if a GPU is available
--multi_gpu - Use multi-GPU training
--mixed_precision {no,fp16,bf16,fp8} - Mixed precision mode
--num_processes NUM - Number of processes to use
--num_machines NUM - Number of machines for multi-node training
--machine_rank RANK - Rank of the current machine
--main_process_ip IP - IP address of the main process
--main_process_port PORT - Port for main process communication
--deepspeed_config_file PATH - DeepSpeed configuration file
--fsdp_config_file PATH - FSDP configuration file
--dynamo_backend BACKEND - Torch Dynamo backend for compilation

Usage Examples:
# Basic single-GPU training
accelerate launch train.py --batch_size 32
# Multi-GPU training with mixed precision
accelerate launch --mixed_precision fp16 --num_processes 4 train.py
# Multi-node training
accelerate launch \
--num_machines 2 \
--num_processes 8 \
--machine_rank 0 \
--main_process_ip 192.168.1.100 \
--main_process_port 29500 \
train.py
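Note that `--num_processes` is the total number of processes across all machines, not per machine: the multi-node command above launches 8 workers split over 2 nodes, i.e. 4 per node. A quick sanity check of that arithmetic (the helper function is ours, not part of Accelerate):

```python
def processes_per_machine(num_processes: int, num_machines: int) -> int:
    """--num_processes is the total world size across all machines."""
    per_machine, remainder = divmod(num_processes, num_machines)
    if remainder:
        raise ValueError("num_processes must divide evenly across machines")
    return per_machine

print(processes_per_machine(8, 2))  # 4 processes (one per GPU) on each machine
```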
# DeepSpeed training
accelerate launch \
--deepspeed_config_file ds_config.json \
--num_processes 4 \
train.py
# With Torch compilation
accelerate launch \
--dynamo_backend inductor \
--mixed_precision bf16 \
train.py

Display environment and configuration information for debugging.
accelerate env

Description: Show detailed information about the current Accelerate installation, hardware, and configuration.
Output includes: the Accelerate, PyTorch, and Python versions; platform details and GPU availability; and the contents of the current default configuration.
Usage:
# Show environment information
accelerate env

Estimate memory requirements for model training and inference.
accelerate estimate-memory [OPTIONS] MODEL_NAME

Description: Estimate GPU memory requirements for training or inference with specific models.
Options:
--library_name {transformers,timm,diffusers} - Model library (default: transformers)
--dtypes DTYPES - Data types to test (comma-separated: float32,float16,bfloat16,int8,int4)
--num_gpus NUM - Number of GPUs available
--trust_remote_code - Trust remote code in model loading
--access_token TOKEN - Hugging Face access token

Usage Examples:
# Estimate memory for a model
accelerate estimate-memory microsoft/DialoGPT-medium
# Test multiple data types
accelerate estimate-memory \
--dtypes float32,float16,bfloat16 \
--num_gpus 2 \
microsoft/DialoGPT-large
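As a rough cross-check of what `estimate-memory` reports, weight memory is approximately parameter count times bytes per element; Adam training adds gradients plus two optimizer-state buffers on top. A back-of-the-envelope sketch (the 4x training multiplier and the ~355M parameter count, roughly DialoGPT-medium scale, are our approximations, not tool output):

```python
DTYPE_BYTES = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

def rough_weight_memory_gb(num_params: int, dtype: str, training: bool = False) -> float:
    total = num_params * DTYPE_BYTES[dtype]
    if training:
        total *= 4  # weights + gradients + two Adam moment buffers (approximation)
    return total / 1024**3

# ~355M parameters held in float16 for inference
print(round(rough_weight_memory_gb(355_000_000, "float16"), 2))  # ~0.66 GB
```

Real usage is higher once activations, CUDA context, and fragmentation are counted, which is why the tool's per-dtype, per-GPU estimates are more useful than this arithmetic.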
# With custom library
accelerate estimate-memory \
--library_name timm \
--dtypes float16,bfloat16 \
resnet50

Test distributed training setup and communication.
accelerate test [OPTIONS]

Description: Test the distributed training setup by running a simple training loop to verify the configuration.
Options:
--config_file PATH - Use specific config file
--num_processes NUM - Override number of processes

Usage:
# Test current configuration
accelerate test
# Test with specific number of processes
accelerate test --num_processes 4

Merge sharded model checkpoints into a single file.
accelerate merge-weights [OPTIONS] INPUT_DIR OUTPUT_DIR

Description: Merge model weights that have been sharded across multiple files back into a single checkpoint file.
Options:
--model_name_or_path PATH - Model name or path for configuration
--torch_dtype {float16,bfloat16,float32} - Target data type
--safe_serialization - Use safetensors format for output

Usage:
# Merge sharded weights
accelerate merge-weights ./sharded_model ./merged_model
# With specific dtype and safe serialization
accelerate merge-weights \
--torch_dtype float16 \
--safe_serialization \
./sharded_model ./merged_model

TPU-specific utilities and commands.
accelerate tpu-config

Description: Configure TPU-specific settings for training.
Usage:
# Configure TPU settings
accelerate tpu-config

Migrate configuration to newer formats.
accelerate to-fsdp2 [OPTIONS]

Description: Convert an existing FSDP configuration to the FSDP2 format.
Options:
--config_file PATH - Input config file
--output_file PATH - Output config file

Usage:
# Convert FSDP config to FSDP2
accelerate to-fsdp2 --config_file old_config.yaml --output_file new_config.yaml

The configuration file uses YAML format and contains distributed training settings:
# Example configuration file
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

# 1. Configure Accelerate
accelerate config
# Follow interactive prompts to set up configuration
# 2. Test the setup
accelerate test
# 3. Check environment
accelerate env
# 4. Estimate memory for your model
accelerate estimate-memory microsoft/DialoGPT-medium
# 5. Launch training
accelerate launch train.py --learning_rate 1e-4 --batch_size 16

# On main node (machine rank 0)
accelerate launch \
--num_machines 2 \
--num_processes 8 \
--machine_rank 0 \
--main_process_ip 192.168.1.100 \
--main_process_port 29500 \
--mixed_precision fp16 \
train.py
# On worker node (machine rank 1)
accelerate launch \
--num_machines 2 \
--num_processes 8 \
--machine_rank 1 \
--main_process_ip 192.168.1.100 \
--main_process_port 29500 \
--mixed_precision fp16 \
train.py

# Create DeepSpeed config file (ds_config.json)
cat > ds_config.json << EOF
{
"train_batch_size": 16,
"gradient_accumulation_steps": 4,
"optimizer": {
"type": "Adam",
"params": {
"lr": 1e-4
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu"
}
},
"fp16": {
"enabled": true
}
}
EOF
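DeepSpeed enforces the invariant train_batch_size = micro_batch_per_gpu × gradient_accumulation_steps × world_size, so the config above (batch size 16, accumulation 4) launched with 4 processes gives each GPU a micro-batch of 1. A small checker for that relationship (the helper function is ours, not a DeepSpeed API):

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, world_size: int) -> int:
    # DeepSpeed invariant:
    # train_batch_size == micro_batch_per_gpu * grad_accum_steps * world_size
    per_gpu, remainder = divmod(train_batch_size, grad_accum_steps * world_size)
    if remainder:
        raise ValueError("train_batch_size must be divisible by grad_accum_steps * world_size")
    return per_gpu

print(micro_batch_per_gpu(16, 4, 4))  # micro-batch of 1 per GPU
```

If the division does not come out even, DeepSpeed rejects the configuration at startup, so it is worth checking before launching a long job.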
# Launch with DeepSpeed
accelerate launch \
--deepspeed_config_file ds_config.json \
--num_processes 4 \
train.py

# 1. Estimate memory requirements
accelerate estimate-memory \
--dtypes float32,float16,bfloat16,int8 \
--num_gpus 2 \
microsoft/DialoGPT-large
# 2. Based on results, configure with appropriate settings
accelerate config
# Select mixed precision based on memory estimates
# 3. Test configuration
accelerate test
# 4. Launch optimized training
accelerate launch \
--mixed_precision bf16 \
--gradient_accumulation_steps 8 \
train.py --batch_size 4

# Debug distributed setup
accelerate test --num_processes 2
# Check detailed environment info
accelerate env
# Launch with verbose output for debugging
accelerate launch --debug train.py
# Test with different configurations
accelerate launch --cpu train.py # Force CPU
accelerate launch --mixed_precision no train.py  # Disable mixed precision

# After training with model sharding
ls ./my_model/
# pytorch_model-00001-of-00004.bin
# pytorch_model-00002-of-00004.bin
# pytorch_model-00003-of-00004.bin
# pytorch_model-00004-of-00004.bin
# pytorch_model.bin.index.json
# Merge sharded weights
accelerate merge-weights \
--safe_serialization \
./my_model ./my_model_merged
ls ./my_model_merged/
# model.safetensors (single merged file)
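The shard layout above is driven by the index file: `pytorch_model.bin.index.json` maps every tensor name to the shard file that holds it, which is how a merge knows what to gather. It can be inspected with the standard library; the index contents below are a synthetic stand-in, not a real checkpoint:

```python
import json

# Synthetic stand-in for a pytorch_model.bin.index.json
index_text = """
{
  "metadata": {"total_size": 1421662208},
  "weight_map": {
    "transformer.wte.weight": "pytorch_model-00001-of-00004.bin",
    "transformer.h.0.attn.c_attn.weight": "pytorch_model-00001-of-00004.bin",
    "lm_head.weight": "pytorch_model-00004-of-00004.bin"
  }
}
"""
index = json.loads(index_text)

# Unique shard files referenced by any tensor
shards = sorted(set(index["weight_map"].values()))
print(shards)
print(f'total size: {index["metadata"]["total_size"] / 1024**2:.0f} MiB')
```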