Gemma 2: Improving Open Language Models at a Practical Size
Abstract
The article introduces Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. The key points are:
- Gemma 2 applies several known technical modifications to the Transformer architecture, such as interleaving local and global attention layers and grouped-query attention (GQA).
- The 2B and 9B models are trained with knowledge distillation instead of plain next-token prediction, yielding better performance for their size.
- The models deliver the best performance for their size and even offer competitive alternatives to models 2-3 times larger.
- All Gemma 2 models are released to the community.
Q&A
[01] Model Architecture
1. What are the key architectural differences between Gemma 1 and Gemma 2 models?
- Gemma 2 uses deeper networks compared to Gemma 1.
- Gemma 2 alternates between local sliding window attention and global attention in every other layer.
- Gemma 2 uses logit soft-capping, both pre-norm and post-norm with RMSNorm, and Grouped-Query Attention (GQA); a short sketch of the soft-capping and attention interleaving follows this list.
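To make the soft-capping and the local/global interleaving concrete, here is a minimal PyTorch-style sketch. The cap values and the sliding-window size reflect my reading of the paper, and the even/odd layer assignment is purely an assumption for illustration, not the official implementation.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Logit soft-capping: squashes logits into (-cap, cap) via
    # cap * tanh(logits / cap), instead of a hard clip.
    return cap * torch.tanh(logits / cap)

# Cap values reported in the paper (treated here as illustrative constants).
ATTN_LOGIT_CAP = 50.0   # applied to attention logits in each layer
FINAL_LOGIT_CAP = 30.0  # applied to the final output logits

def uses_local_attention(layer_idx: int) -> bool:
    # Gemma 2 alternates local sliding-window attention (4096-token window)
    # with global attention over the full 8192-token context.
    # Assumption: even-indexed layers are local, odd-indexed are global.
    return layer_idx % 2 == 0
```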
2. What are the key parameter settings for the Gemma 2 models?
- The 2B model has 2,024,517,888 non-embedding parameters, the 9B model has 8,324,201,984, and the 27B model has 26,047,480,320.
- All models use a context length of 8192 tokens, Rotary Position Embeddings (RoPE), and the approximated GeGLU non-linearity; these shared settings are collected in the sketch below.
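The shared settings and parameter counts above can be gathered into a small config sketch. The field names below are illustrative only and are not taken from any official codebase.

```python
from dataclasses import dataclass

@dataclass
class GemmaConfig:
    # Settings shared by the 2B, 9B, and 27B models, per the summary above.
    context_length: int = 8192        # tokens
    position_embedding: str = "rope"  # Rotary Position Embeddings
    activation: str = "geglu_approx"  # approximated GeGLU non-linearity
    attention: str = "gqa"            # Grouped-Query Attention

# Non-embedding parameter counts quoted above.
NON_EMBEDDING_PARAMS = {
    "2B": 2_024_517_888,
    "9B": 8_324_201_984,
    "27B": 26_047_480_320,
}
```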
[02] Pre-training
1. What are the key differences in the pre-training data and infrastructure used for Gemma 2 compared to Gemma 1?
- Gemma 2 is trained on 13 trillion tokens for the 27B model, 8 trillion for the 9B model, and 2 trillion for the 2B model.
- The models are trained on TPUv4, TPUv5e, and TPUv5p with data replication and model sharding.
- Knowledge distillation is used to train the 2B and 9B models, while the 27B model is trained from scratch; a minimal sketch of the distillation objective follows this list.
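As a rough illustration of the distillation setup, the sketch below trains the student to match the teacher's full next-token distribution rather than the usual one-hot target. The function name and the temperature knob are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # Token-level knowledge distillation: cross-entropy between the
    # teacher's soft next-token distribution and the student's predictions,
    # replacing the one-hot next-token target.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```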
2. How does the carbon footprint of Gemma 2 pre-training compare to Gemma 1?
- The carbon emissions from pre-training the Gemma 2 models are estimated to be similar to those of Gemma 1, and Google data centers are carbon neutral.
[03] Post-training
1. What are the key differences in the post-training process for Gemma 2 compared to Gemma 1?
- Gemma 2 uses the same control tokens as Gemma 1, but a different formatting schema in which the model explicitly ends its generations with <end_of_turn> tokens (a sketch of the dialogue format follows this list).
- The post-training data mixture and hyperparameters were tuned to improve helpfulness while minimizing harms related to safety and hallucinations.
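For reference, here is a minimal sketch of the dialogue format mentioned above, using the <start_of_turn>/<end_of_turn> control tokens; the helper name and the exact role strings are illustrative assumptions.

```python
def format_gemma_chat(turns):
    # Each (role, text) turn is wrapped in control tokens; the model is
    # expected to emit <end_of_turn> to close its own reply.
    parts = []
    for role, text in turns:  # role is "user" or "model"
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model's next turn
    return "".join(parts)

prompt = format_gemma_chat([("user", "Summarize grouped-query attention.")])
```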
2. How does the performance of the Gemma 2 IT models compare to Gemma 1 and other open models?
- The Gemma 2 IT models significantly outperform the previous Gemma 1 models and are competitive with larger open models on a variety of benchmarks.
- The Gemma 2 27B IT model in particular is highly competitive on the LMSYS Chatbot Arena evaluation, performing on par with or better than some open models more than twice its size.
[04] Evaluations
1. How do the pre-trained Gemma 2 models compare to other large open models?
- The Gemma 2 27B model outperforms a similarly sized model (Qwen1.5 32B) and is competitive with a larger model (LLaMA-3 70B) on the HuggingFace evaluation suite.
- The Gemma 2 2B and 9B models show significant improvements over the previous Gemma 1 versions, benefiting from the knowledge distillation approach.
2. What are the key findings from the safety and responsibility evaluations of the Gemma 2 models?
- The Gemma 2 models show lower rates of verbatim and approximate memorization of training data compared to prior models.
- The models perform well on safety-critical evaluations like offensive cybersecurity and code vulnerability detection, though they still have limitations compared to specialized systems.
- The Gemma 2 models demonstrate strong conversational and rapport-building capabilities, but do not show significant advantages over prior models in terms of deception or persuasion.