Gemma 2: Improving Open Language Models at a Practical Size
Abstract
The article introduces Gemma 2, a new family of open language models ranging from 2 billion to 27 billion parameters. The key points are:
- Gemma 2 models apply technical modifications such as interleaved local-global attention and grouped-query attention, and the smaller models (2B and 9B) are trained with knowledge distillation.
- The resulting models deliver state-of-the-art performance for their size, and are competitive with models that are 2-3x larger.
- The article provides details on the model architecture, pre-training, and post-training, as well as extensive evaluations on automated benchmarks and human evaluations across various domains.
- Responsible deployment and safety considerations are also discussed, including the use of the Responsible Generative AI Toolkit.
Q&A
[01] Model Architecture
1. What are the key architectural differences between Gemma 1 and Gemma 2 models?
- Gemma 2 models alternate between local sliding-window attention and global attention in every other layer (the interleaving pattern is sketched after this list).
- RMSNorm is applied in both pre-norm and post-norm positions, i.e. the input and the output of each transformer sub-layer are normalized, to stabilize training.
- The 27B and 9B models use the Grouped-Query Attention (GQA) mechanism.
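To make the interleaving concrete, below is a minimal JAX sketch of how the attention mask could differ between local and global layers. The 4096-token window, the even/odd layer ordering, and all function names are illustrative assumptions, not the actual Gemma 2 implementation.

```python
# Minimal sketch (illustrative assumptions, not the official implementation):
# alternating causal masks for local sliding-window vs. global attention layers.
import jax.numpy as jnp

def causal_mask(seq_len: int) -> jnp.ndarray:
    """Global attention: position i may attend to all positions <= i."""
    pos = jnp.arange(seq_len)
    return pos[None, :] <= pos[:, None]

def sliding_window_mask(seq_len: int, window: int) -> jnp.ndarray:
    """Local attention: causal mask restricted to the most recent `window` tokens."""
    pos = jnp.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]
    recent = (pos[:, None] - pos[None, :]) < window
    return causal & recent

def mask_for_layer(layer_idx: int, seq_len: int, window: int = 4096) -> jnp.ndarray:
    """Alternate local and global attention layer by layer (the even/odd
    ordering here is an assumption made for illustration)."""
    if layer_idx % 2 == 0:
        return sliding_window_mask(seq_len, window)
    return causal_mask(seq_len)
```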
2. What are the parameter counts for the different Gemma 2 models?
- 2.6B model has 590M embedding parameters and 2B non-embedding parameters.
- 9B model has 918M embedding parameters and 8.3B non-embedding parameters.
- 27B model has 1.18B embedding parameters and 26B non-embedding parameters.
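As a sanity check, each model's headline size is roughly the sum of its embedding and non-embedding parameters. A quick arithmetic sketch in Python, using the values listed above (in billions):

```python
# Total parameters = embedding + non-embedding (values in billions).
models = {
    "2.6B": (0.59, 2.0),
    "9B":   (0.918, 8.3),
    "27B":  (1.18, 26.0),
}
for name, (embedding, non_embedding) in models.items():
    print(f"{name}: ~{embedding + non_embedding:.1f}B total parameters")
```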
[02] Pre-training
1. How did the Gemma 2 models leverage knowledge distillation during pre-training?
- The 2.6B and 9B models were trained using knowledge distillation, where they learned from the probability distribution of the next token predicted by a larger teacher model.
- Because the teacher's full distribution is a richer training signal than one-hot next-token targets, the smaller models could usefully be trained on a much larger quantity of tokens (over 50x the compute-optimal amount) than from-scratch training (a sketch of the distillation loss follows this list).
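Concretely, distillation replaces the usual one-hot next-token target with the teacher's full probability distribution over the vocabulary, and the student minimizes the cross-entropy against those soft targets. A minimal JAX sketch under that assumption (function and argument names are illustrative, not Gemma's training code):

```python
import jax.numpy as jnp
from jax.nn import log_softmax, softmax

def distillation_loss(student_logits: jnp.ndarray,
                      teacher_logits: jnp.ndarray) -> jnp.ndarray:
    """Cross-entropy between the teacher's next-token distribution and the
    student's predicted distribution, averaged over all positions.

    Both inputs have shape [batch, seq_len, vocab_size].
    """
    teacher_probs = softmax(teacher_logits, axis=-1)        # soft targets
    student_log_probs = log_softmax(student_logits, axis=-1)
    # Per-token cross-entropy: -sum_x P_teacher(x) * log P_student(x)
    token_loss = -jnp.sum(teacher_probs * student_log_probs, axis=-1)
    return jnp.mean(token_loss)
```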
2. What were the key details of the pre-training infrastructure?
- The 2.6B model was trained on 512 TPUv5e chips with 512-way data replication and 1-way model sharding.
- The 9B model was trained on 4096 TPUv4 chips with 1024-way data replication and 4-way model sharding.
- The 27B model was trained on 6144 TPUv5p chips with 768-way data replication and 8-way model sharding.
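The replication/sharding figures describe how the chips are arranged into a two-dimensional device mesh: batches are split along a data axis, while weights are partitioned along a model axis. A hypothetical JAX sketch of the 27B layout listed above (the axis names and partition specs are illustrative assumptions):

```python
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 6144 chips arranged as 768-way data replication x 8-way model sharding.
# (Requires a TPU slice of that size; shrink the shape to experiment locally.)
devices = mesh_utils.create_device_mesh((768, 8))
mesh = Mesh(devices, axis_names=("data", "model"))

# Batches are split along the "data" axis; a weight matrix is split along
# the "model" axis and replicated across data replicas.
batch_sharding = NamedSharding(mesh, P("data", None))
weight_sharding = NamedSharding(mesh, P(None, "model"))
```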
[03] Post-training
1. What were the key steps in the post-training process?
- Supervised fine-tuning (SFT) on a mix of synthetic and human-generated prompt-response pairs.
- Reinforcement learning from human feedback (RLHF) using a reward model trained on labeled preference data.
- Averaging the models obtained from each phase to improve overall performance.
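The model-averaging step amounts to an element-wise average of the weights of several fine-tuned checkpoints. A minimal sketch, assuming the checkpoints are JAX parameter pytrees with identical structure (names are illustrative):

```python
import jax

def average_checkpoints(checkpoints):
    """Element-wise average of parameter pytrees with the same structure,
    e.g. models produced by different tuning runs."""
    n = len(checkpoints)
    return jax.tree_util.tree_map(lambda *leaves: sum(leaves) / n, *checkpoints)

# Usage (hypothetical): merged_params = average_checkpoints([p1, p2, p3])
```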
2. How did the post-training data and formatting differ from Gemma 1?
- The post-training data was extended with a mixture of internal and external public data, including prompts from the LMSYS-chat-1M dataset.
- The formatting schema was updated to use control tokens for the start/end of user and model turns, as well as the beginning/end of the sequence.
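For illustration, a prompt rendered with turn-based control tokens looks roughly like the following Python snippet. The token names match Gemma's published chat template, but treat the exact layout as an approximation rather than a normative specification.

```python
def format_turn(role: str, text: str) -> str:
    """Wrap one dialogue turn in Gemma-style control tokens."""
    return f"<start_of_turn>{role}\n{text}<end_of_turn>\n"

# A single-turn prompt, ending with the header that cues the model's reply.
prompt = (
    "<bos>"
    + format_turn("user", "What is the capital of France?")
    + "<start_of_turn>model\n"
)
print(prompt)
```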
[04] Evaluations
1. How did the pre-trained Gemma 2 models perform compared to other open models of similar size?
- The 27B Gemma 2 model outperformed the similar-sized Qwen1.5 32B model and was competitive with the larger LLaMA-3 70B model on benchmarks like MMLU, GSM8K, and ARC-c.
- The 2.6B and 9B Gemma 2 models trained with distillation showed significant improvements over the previous Gemma 1 models of comparable size.
2. What were the key findings from the safety and responsibility evaluations?
- Gemma 2 models showed lower violation rates on safety policies compared to previous Gemini models, especially for content related to child safety.
- On offensive cybersecurity tasks, the 27B Gemma 2 model performed better than the 7B CodeGemma V1 model but was still less capable than the Gemini 1.5 Pro model.
- Gemma 2 models exhibited moderate persuasion capabilities in human studies, with no significant differences compared to Gemini models.