Training MoEs at Scale with PyTorch
Abstract
The article discusses how Databricks has worked with the PyTorch team to scale the training of Mixture of Experts (MoE) models to over 3,000 GPUs using PyTorch Distributed and MegaBlocks, an efficient open-source MoE implementation in PyTorch.
Q&A
[01] What is a Mixture of Experts (MoE) model?
- A MoE model is a model architecture that uses multiple expert networks to make predictions.
- A gating network is used to route and combine the outputs of experts, ensuring each expert is trained on a different, specialized distribution of tokens.
- Compared to dense models, MoEs provide more efficient training for a given compute budget because the gating network sends each token to only a small subset of experts, so only a fraction of the model's parameters are active per token.
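As a rough illustration of the gating just described, the sketch below routes each token to its top-k experts with a learned linear router. The class name `TopKGate`, the tensor shapes, and the `k=2` default are assumptions for illustration, not details taken from the article.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal top-k gating sketch: pick k experts per token and weight their outputs."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, tokens: torch.Tensor):
        # tokens: (num_tokens, d_model)
        logits = self.router(tokens)                       # (num_tokens, num_experts)
        weights, expert_ids = torch.topk(logits, self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # per-token combine weights
        # Each token is processed only by its k selected experts; their outputs
        # are later summed using these weights.
        return weights, expert_ids
```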
[02] What is MegaBlocks?
- MegaBlocks is an efficient MoE implementation that uses block-sparse matrix multiplication to compute expert outputs in parallel despite uneven token assignment.
- MegaBlocks implements a dropless MoE that avoids dropping tokens while using GPU kernels that maintain efficient training.
- Prior to MegaBlocks, dynamic routing formulations forced a tradeoff between model quality and hardware efficiency.
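MegaBlocks achieves this with block-sparse GPU kernels; the pure-PyTorch loop below is only a conceptual sketch of the dropless idea (group tokens by their assigned expert, run each expert on however many tokens it received, and write every result back), with all names being illustrative and no claim to match the actual kernels.

```python
import torch
import torch.nn as nn

def dropless_moe(tokens: torch.Tensor, expert_ids: torch.Tensor, experts: list[nn.Module]) -> torch.Tensor:
    # tokens: (num_tokens, d_model); expert_ids: (num_tokens,) top-1 assignment.
    # experts: one module per expert, each mapping d_model -> d_model.
    out = torch.zeros_like(tokens)
    for eid, expert in enumerate(experts):
        mask = expert_ids == eid            # tokens routed to this expert
        if mask.any():
            out[mask] = expert(tokens[mask])  # no tokens dropped, however uneven the load
    # MegaBlocks replaces this per-expert loop with a single block-sparse matmul,
    # which is what keeps the dropless formulation hardware-efficient.
    return out
```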
[03] How does expert parallelism work?
- Expert parallelism is a form of model parallelism where different experts are placed on different GPUs for better performance.
- Tokens are sent to the device that contains the expert, instead of communicating expert weights across all GPUs.
- Each device can then run a few large matrix multiplications over its local experts' tokens instead of many small ones, making much better use of GPU compute.
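A common way to implement this token exchange is an all-to-all collective: each rank sends the tokens it routed to remote experts and receives the tokens assigned to its local experts. The sketch below uses `torch.distributed.all_to_all` and assumes an initialized NCCL process group with all tensors on this rank's GPU; it is a generic illustration, not the article's implementation.

```python
import torch
import torch.distributed as dist

def exchange_tokens(tokens_per_dest: list[torch.Tensor]) -> list[torch.Tensor]:
    # tokens_per_dest[r]: tokens this rank routed to the expert(s) hosted on rank r.
    # Assumes dist.init_process_group("nccl") has run and tensors are CUDA tensors.
    device = tokens_per_dest[0].device
    dtype = tokens_per_dest[0].dtype
    d_model = tokens_per_dest[0].shape[1]

    # Exchange per-destination token counts so receive buffers can be sized.
    send_counts = torch.tensor([t.shape[0] for t in tokens_per_dest], device=device)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Exchange the tokens themselves; afterwards this rank holds every token
    # assigned to its local experts and can process them in a few large matmuls.
    recv = [torch.empty(int(n), d_model, device=device, dtype=dtype) for n in recv_counts]
    dist.all_to_all(recv, list(tokens_per_dest))
    return recv
```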
[04] How do the authors leverage PyTorch's features for scaling MoE training?
- They use PyTorch's Fully Sharded Data Parallel (FSDP) to shard weights and optimizer states across GPUs, reducing memory pressure.
- They utilize Hybrid Sharded Data Parallel (HSDP) to balance memory efficiency and communication cost during large-scale distributed training.
- They combine expert parallelism and data parallelism using PyTorch's DTensor abstraction to achieve near-linear scaling across large clusters.
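As one way to picture HSDP, the sketch below wraps a model with PyTorch FSDP using the hybrid sharding strategy over a 2-D device mesh (shard within a replica group, replicate across groups). It assumes a recent PyTorch release in which FSDP accepts a `DeviceMesh`; the mesh sizes, `use_orig_params` flag, and `build_moe_model` placeholder are illustrative, not the authors' configuration.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Illustrative 2-D mesh: 8 replica groups x 16 GPUs per group = 128 GPUs total.
# Requires the process group to be initialized (e.g. via torchrun) with that world size.
mesh = init_device_mesh("cuda", (8, 16), mesh_dim_names=("replicate", "shard"))

def build_moe_model() -> nn.Module:
    # Placeholder for the actual MoE transformer.
    return nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# HYBRID_SHARD shards parameters, gradients, and optimizer state inside each
# group of 16 GPUs and replicates across the 8 groups, trading some memory
# savings for cheaper cross-group communication.
model = FSDP(
    build_moe_model(),
    device_mesh=mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    use_orig_params=True,
)
```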
[05] How do the authors ensure fault tolerance and elasticity in MoE training?
- They use PyTorch Distributed Checkpoint to save and restore model state accurately across different cluster configurations.
- They support sharded checkpoints, where each GPU saves and loads only its portion of the model, improving robustness and speed.
- They leverage the replication in HSDP to first download checkpoints on one replica and then send the necessary shards to other replicas, improving checkpoint resumption times.
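A minimal sketch of sharded save and restore with PyTorch Distributed Checkpoint (`torch.distributed.checkpoint`), assuming a recent PyTorch release; the function names and checkpoint directory argument are illustrative, and the replica-aware download optimization described above is not shown.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_sharded(model, optimizer, ckpt_dir: str) -> None:
    # Each rank writes only the shards it owns; no rank gathers the full model.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id=ckpt_dir)

def load_sharded(model, optimizer, ckpt_dir: str) -> None:
    # On resume (possibly with a different number of GPUs), each rank loads
    # only the shards required by its current sharding layout.
    model_sd, optim_sd = get_state_dict(model, optimizer)
    dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id=ckpt_dir)
    set_state_dict(model, optimizer,
                   model_state_dict=model_sd, optim_state_dict=optim_sd)
```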