Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Abstract
Load imbalance across experts in Mixture-of-Experts (MoE) models can lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but this introduces undesired gradients that can impair model performance. The paper proposes "Loss-Free Balancing", an approach that controls load balance without introducing interference gradients.
Q&A
[01] Introduction
1. What are the two main drawbacks of uncontrolled routing strategies in MoE models?
- Routing collapse, where the model consistently selects only a few experts, hindering sufficient training of the other experts
- Computation bottlenecks, which load imbalance exacerbates when experts are distributed across multiple devices
2. How do existing methods address the load imbalance issue? Existing methods commonly employ an auxiliary loss to encourage load balance, but this introduces undesired gradients that can impair model performance.
3. What is the dilemma between load balance and model performance that existing methods face?
- A small auxiliary loss coefficient can lead to poor load balance
- A large auxiliary loss coefficient can impair training and result in suboptimal performance
4. How does the proposed Loss-Free Balancing method aim to address this dilemma? Loss-Free Balancing controls the expert load balance directly, so the only gradients the model receives are those of the language modeling loss.
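For reference, the auxiliary loss behind this dilemma is usually written in roughly the following form. The notation is assumed here for illustration and does not appear in the summary above: α is the auxiliary loss coefficient, N the number of experts, K the number of experts selected per token, T the number of tokens in the batch, and s_{i,t} the gating score of expert i for token t.

```latex
\mathcal{L}_{\text{balance}} = \alpha \sum_{i=1}^{N} f_i P_i,
\qquad
f_i = \frac{N}{K T} \sum_{t=1}^{T} \mathbb{1}\left[\text{token } t \text{ selects expert } i\right],
\qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}
```

Because this term is added to the language modeling loss, its gradient reaches the router through s_{i,t}. A larger α pushes harder toward balance but also injects more of this interference gradient, which is the trade-off described above.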
[02] Auxiliary-Loss-Free Load Balancing Strategy
1. How does Loss-Free Balancing control the load balance? Loss-Free Balancing adds an expert-wise bias term to the gating scores of each expert, and dynamically updates the biases based on the recent load of each expert.
2. How does Loss-Free Balancing differ from other load balancing methods in terms of interference gradients and future token leakage?
- Loss-Free Balancing does not introduce any interference gradients, unlike the auxiliary-loss-controlled method.
- Loss-Free Balancing maintains the causal constraint of language modeling and does not suffer from future token leakage, unlike the Expert Choice method.
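The bias mechanism described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the bias is assumed to affect expert selection only (the gate weights keep the original scores), the update is assumed to be a fixed-size step in the direction of the load error, and the update rate is a toy value chosen so the effect shows up within a few hundred steps.

```python
import numpy as np

def route_with_bias(scores, bias, k):
    """Pick the top-k experts per token using biased scores.

    scores: (num_tokens, num_experts) gating scores
    bias:   (num_experts,) expert-wise bias (assumed to affect selection only)
    Returns (topk, weights): chosen expert indices and their unbiased scores.
    """
    biased = scores + bias
    topk = np.argsort(-biased, axis=1)[:, :k]           # k largest biased scores
    weights = np.take_along_axis(scores, topk, axis=1)  # original gate values
    return topk, weights

def update_bias(bias, topk, num_experts, update_rate):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(topk.ravel(), minlength=num_experts)
    error = load.mean() - load          # positive means the expert is underloaded
    return bias + update_rate * np.sign(error)

# Toy run: the router systematically favours higher-numbered experts.
rng = np.random.default_rng(0)
num_tokens, num_experts, k = 512, 8, 2
skew = np.linspace(0.0, 1.0, num_experts)
bias = np.zeros(num_experts)
for step in range(200):
    scores = rng.random((num_tokens, num_experts)) + skew
    topk, _ = route_with_bias(scores, bias, k)
    bias = update_bias(bias, topk, num_experts, update_rate=0.01)

print("final biases:", np.round(bias, 2))
print("final load:  ", np.bincount(topk.ravel(), minlength=num_experts))
```

Nothing in the update is differentiated, so no gradient flows from the balancing step into the model. The bias is also computed from loads of tokens already routed, so selection for a token never depends on tokens that come after it.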
[03] Experiments
1. What are the key findings from the main results?
- Compared to the auxiliary-loss-controlled method, Loss-Free Balancing achieves better perplexity and much better global load balance for both the 1B and 3B models.
- Loss-Free Balancing maintains a persistent advantage on load balance throughout the training process.
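Comparisons like these need a concrete measure of load balance. One common choice is the maximal violation: how far the busiest expert's load exceeds the mean load, relative to the mean. The sketch below assumes this metric for illustration; the function name and the example loads are hypothetical.

```python
import numpy as np

def max_violation(expert_load):
    """Relative overload of the busiest expert; 0.0 means perfectly balanced."""
    load = np.asarray(expert_load, dtype=float)
    mean = load.mean()
    return (load.max() - mean) / mean

print(max_violation([100, 100, 100, 100]))  # 0.0
print(max_violation([250, 50, 50, 50]))     # 1.5
```

Computed over a whole validation set this gives a global balance figure, and computed per training batch it can be tracked over the course of training.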
2. What are the key findings from the empirical studies on the bias update algorithm?
- An appropriate update rate of 0.001 achieves good training balance and validation perplexity.
- Using additive biases is more suitable than multiplicative biases.
[04] Discussion
1. How is Loss-Free Balancing compatible with expert parallelism? Loss-Free Balancing can achieve nearly optimal global load balance. As the computation batch size increases, the load balance within each computation step gets closer to the global load balance. Expert parallelism uses large computation batches, so the method is naturally compatible with it.
2. What is the issue of future token leakage in the Expert Choice method? The Expert Choice method violates the causal constraint of language modeling, as future tokens can influence the expert assignment of previous tokens. This can lead to severe information leakage, which destroys the generalization of the model.
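The leakage can be seen in a toy example. The sketch below is a simplified stand-in for Expert Choice routing, assuming each expert simply takes its highest-scoring tokens from the whole sequence up to a fixed capacity; the function name, scores, and capacity are illustrative assumptions.

```python
import numpy as np

def expert_choice(scores, capacity):
    """Each expert picks its `capacity` highest-scoring tokens in the sequence.

    scores: (num_tokens, num_experts)
    Returns a boolean (num_tokens, num_experts) assignment mask.
    """
    num_tokens, num_experts = scores.shape
    mask = np.zeros_like(scores, dtype=bool)
    for e in range(num_experts):
        chosen = np.argsort(-scores[:, e])[:capacity]
        mask[chosen, e] = True
    return mask

# One expert with room for one token; three tokens in causal order.
scores = np.array([[0.6], [0.2], [0.1]])
before = expert_choice(scores, capacity=1)

# Change only the LAST token's score.
scores_changed = scores.copy()
scores_changed[2, 0] = 0.9
after = expert_choice(scores_changed, capacity=1)

print(before[0, 0], after[0, 0])  # True False
```

The first token loses its expert solely because a later token changed, so its routing carries information about the future. Per-token top-K selection, as used with Loss-Free Balancing, does not have this dependency.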