MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Abstract
The article discusses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. It focuses on designing high-quality LLMs with fewer than a billion parameters, a practical size for mobile deployment. Contrary to the prevailing belief that data quantity and parameter count are the decisive factors, the article argues that model architecture plays a pivotal role at the sub-billion scale.
Q&A
[01] Improving Sub-billion Scale LLM Design
1. What are the key techniques explored in the article to build a strong baseline sub-billion scale LLM? The article explores four key techniques (see the sketch after this list):
- Adopting SwiGLU feed-forward network (FFN)
- Leveraging deep and thin architectures
- Revisiting embedding sharing methods
- Utilizing grouped-query attention mechanisms
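A minimal PyTorch sketch, not the released MobileLLM code, of two of these building blocks, a SwiGLU FFN and grouped-query attention, together with tied input/output embeddings; all dimensions and class names are illustrative rather than the exact MobileLLM hyper-parameters:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward network: down_proj(SiLU(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class GroupedQueryAttention(nn.Module):
    """Grouped-query attention: fewer KV heads than query heads, so groups of
    query heads share one KV head, shrinking the KV projections and KV cache."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so it serves its whole group of query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

# Input/output embedding sharing: the LM head reuses the token-embedding matrix,
# removing an entire vocab x dim weight matrix, which matters at sub-billion scale.
vocab_size, dim = 32000, 576              # illustrative values
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight             # tied weights, no extra parameters
```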
2. How does the article demonstrate the importance of model depth over width for small LLMs? The article's experimental results consistently show that deeper and thinner models outperform their shallower and wider counterparts across various tasks, including zero-shot common sense reasoning, question answering, and reading comprehension. This finding challenges the prevailing belief that model performance is primarily determined by the number of parameters, the size of the training dataset, and the number of training iterations.
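To see why depth versus width is a genuine design choice at a fixed size, here is a rough parameter-count comparison, assuming SwiGLU FFNs, grouped-query attention, and tied embeddings as above; the two configurations are illustrative, not the exact MobileLLM hyper-parameters:

```python
def transformer_params(n_layers: int, dim: int, n_heads: int,
                       n_kv_heads: int, ffn_mult: float, vocab: int) -> int:
    """Rough decoder-only parameter count with SwiGLU FFN, grouped-query
    attention, and tied input/output embeddings (layer norms ignored)."""
    head_dim = dim // n_heads
    attn = dim * n_heads * head_dim           # query projection
    attn += 2 * dim * n_kv_heads * head_dim   # key and value projections
    attn += n_heads * head_dim * dim          # output projection
    ffn = 3 * dim * int(ffn_mult * dim)       # gate, up, down projections
    return n_layers * (attn + ffn) + vocab * dim   # plus shared embedding

# Two configurations with roughly comparable budgets (illustrative numbers):
deep_thin = transformer_params(n_layers=30, dim=576, n_heads=9,
                               n_kv_heads=3, ffn_mult=2.5, vocab=32000)
shallow_wide = transformer_params(n_layers=12, dim=896, n_heads=14,
                                  n_kv_heads=2, ffn_mult=2.5, vocab=32000)
print(f"deep-thin: {deep_thin / 1e6:.1f}M, shallow-wide: {shallow_wide / 1e6:.1f}M")
```

Both configurations land near the same overall budget; the article's finding is that, at such a fixed budget, the deeper and thinner configuration is the better-performing one.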
3. What is the immediate block-wise weight-sharing approach proposed in the article, and how does it improve performance without increasing model size? The article proposes an immediate block-wise weight-sharing approach in which two adjacent transformer blocks share the same weights, so each set of block weights is computed twice in a row. Because the shared weights are reused immediately, they do not need to be moved between SRAM and DRAM a second time, which benefits overall execution speed in memory-bound auto-regressive inference; the effective depth doubles with no increase in model size and only a marginal latency overhead, as sketched below.
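A sketch of how such sharing might look in PyTorch, assuming a generic decoder block produced by a `block_factory` callable (a hypothetical placeholder, not the paper's implementation):

```python
import torch.nn as nn

class SharedBlockStack(nn.Module):
    """Immediate block-wise weight sharing: each unique block is executed twice
    back-to-back, doubling effective depth without adding parameters. Because
    the same weights are reused immediately, they stay resident in fast memory
    (SRAM/cache) instead of being re-fetched from DRAM for a second block."""
    def __init__(self, block_factory, n_unique_blocks: int, repeats: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([block_factory() for _ in range(n_unique_blocks)])
        self.repeats = repeats

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeats):   # second pass reuses resident weights
                x = block(x)
        return x
```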
[02] Main Results
1. How do the proposed MobileLLM models perform compared to previous sub-billion parameter models on zero-shot common sense reasoning tasks? The MobileLLM models significantly outperform previous state-of-the-art sub-billion parameter models on zero-shot common sense reasoning tasks. For example, MobileLLM-125M achieves a 2.7% accuracy boost over the previous 125M state-of-the-art model, and MobileLLM-350M outperforms the previous 350M state-of-the-art model by 4.3%.
2. How do the MobileLLM models perform on downstream tasks such as chat and API calling compared to other sub-billion parameter models? The MobileLLM models demonstrate significant improvements compared to previous sub-billion parameter models on chat benchmarks, such as AlpacaEval and MT-Bench. Additionally, the MobileLLM-350M model achieves comparable exact-match scores to the much larger LLaMA-v2 7B model on an API calling task, highlighting the capability of small models for common on-device use cases.
3. How does the article demonstrate the scalability of the proposed design principles to larger model sizes? The article extends the proposed design principles to larger models, including MobileLLM-600M, 1B, and 1.5B. The results show that the MobileLLM family continues to outperform previous state-of-the-art models of comparable size, even at the 1B and 1.5B scales.