Gemma 2: Improving Open Language Models at a Practical Size
Abstract
The article introduces Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. The key points are:
- Gemma 2 applies several known technical modifications to the Transformer architecture, such as interleaving local and global attention layers and grouped-query attention (GQA).
- The 2B and 9B models are trained with knowledge distillation instead of plain next-token prediction, yielding better performance for their size.
- The models deliver the best performance for their size and even offer competitive alternatives to models 2-3 times larger.
- All Gemma 2 models are released to the community.
Q&A
[01] Model Architecture
1. What are the key architectural differences between Gemma 1 and Gemma 2 models?
- Gemma 2 uses deeper networks compared to Gemma 1.
- Gemma 2 alternates between local sliding window attention and global attention in every other layer.
- Gemma 2 uses logit soft-capping, both pre-norm and post-norm with RMSNorm, and Grouped-Query Attention (GQA); a short sketch of the soft-capping and attention interleaving follows this list.
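To make the soft-capping and the local/global interleaving concrete, here is a minimal PyTorch-style sketch. The cap values and the sliding-window size reflect my reading of the paper, and the even/odd layer assignment is purely an assumption for illustration, not the official implementation.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Logit soft-capping: squashes logits into (-cap, cap) via
    # cap * tanh(logits / cap), instead of a hard clip.
    return cap * torch.tanh(logits / cap)

# Cap values reported in the paper (treated here as illustrative constants).
ATTN_LOGIT_CAP = 50.0   # applied to attention logits in each layer
FINAL_LOGIT_CAP = 30.0  # applied to the final output logits

def uses_local_attention(layer_idx: int) -> bool:
    # Gemma 2 alternates local sliding-window attention (4096-token window)
    # with global attention over the full 8192-token context.
    # Assumption: even-indexed layers are local, odd-indexed are global.
    return layer_idx % 2 == 0
```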
2. What are the key parameter settings for the Gemma 2 models?
- The 2B model has 2,024,517,888 non-embedding parameters, the 9B model has 8,324,201,984, and the 27B model has 26,047,480,320.
- All models use a context length of 8192 tokens, Rotary Position Embeddings (RoPE), and the approximated GeGLU non-linearity; these shared settings are collected in the sketch below.
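The shared settings and parameter counts above can be gathered into a small config sketch. The field names below are illustrative only and are not taken from any official codebase.

```python
from dataclasses import dataclass

@dataclass
class GemmaConfig:
    # Settings shared by the 2B, 9B, and 27B models, per the summary above.
    context_length: int = 8192        # tokens
    position_embedding: str = "rope"  # Rotary Position Embeddings
    activation: str = "geglu_approx"  # approximated GeGLU non-linearity
    attention: str = "gqa"            # Grouped-Query Attention

# Non-embedding parameter counts quoted above.
NON_EMBEDDING_PARAMS = {
    "2B": 2_024_517_888,
    "9B": 8_324_201_984,
    "27B": 26_047_480_320,
}
```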
[02] Pre-training
1. What are the key differences in the pre-training data and infrastructure used for Gemma 2 compared to Gemma 1?
- Gemma 2 is trained on 13 trillion tokens for the 27B model, 8 trillion for the 9B model, and 2 trillion for the 2B model.
- The models are trained on TPUv4, TPUv5e, and TPUv5p with data replication and model sharding.
- Knowledge distillation is used to train the 2B and 9B models, while the 27B model is trained from scratch; a minimal sketch of the distillation objective follows this list.
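As a rough illustration of the distillation setup, the sketch below trains the student to match the teacher's full next-token distribution rather than the usual one-hot target. The function name and the temperature knob are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    # Token-level knowledge distillation: cross-entropy between the
    # teacher's soft next-token distribution and the student's predictions,
    # replacing the one-hot next-token target.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```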
2. How does the carbon footprint of Gemma 2 pre-training compare to Gemma 1?
- The carbon emissions from pre-training the Gemma 2 models are estimated to be similar to those of Gemma 1, and Google data centers are carbon neutral.
[03] Post-training
1. What are the key differences in the post-training process for Gemma 2 compared to Gemma 1?
- Gemma 2 uses the same control tokens as Gemma 1, but a different formatting schema in which the model explicitly ends its generations with <end_of_turn> tokens (a sketch of the dialogue format follows this list).
- The post-training data mixture and hyperparameters were tuned to improve helpfulness while minimizing harms related to safety and hallucinations.
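For reference, here is a minimal sketch of the dialogue format mentioned above, using the <start_of_turn>/<end_of_turn> control tokens; the helper name and the exact role strings are illustrative assumptions.

```python
def format_gemma_chat(turns):
    # Each (role, text) turn is wrapped in control tokens; the model is
    # expected to emit <end_of_turn> to close its own reply.
    parts = []
    for role, text in turns:  # role is "user" or "model"
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model's next turn
    return "".join(parts)

prompt = format_gemma_chat([("user", "Summarize grouped-query attention.")])
```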
2. How does the performance of the Gemma 2 IT models compare to Gemma 1 and other open models?
- The Gemma 2 IT models significantly outperform the previous Gemma 1 models and are competitive with larger open models on a variety of benchmarks.
- The Gemma 2 27B IT model in particular is highly competitive on the LMSYS Chatbot Arena evaluation, performing on par with or better than some open models more than twice its size.
[04] Evaluations
1. How do the pre-trained Gemma 2 models compare to other large open models?
- The Gemma 2 27B model outperforms a similarly sized model (Qwen1.5 32B) and is competitive with a larger model (LLaMA-3 70B) on the HuggingFace evaluation suite.
- The Gemma 2 2B and 9B models show significant improvements over the previous Gemma 1 versions, benefiting from the knowledge distillation approach.
2. What are the key findings from the safety and responsibility evaluations of the Gemma 2 models?
- The Gemma 2 models show lower rates of verbatim and approximate memorization of training data compared to prior models.
- The models perform well on safety-critical evaluations like offensive cybersecurity and code vulnerability detection, though they still have limitations compared to specialized systems.
- The Gemma 2 models demonstrate strong conversational and rapport-building capabilities, but do not show significant advantages over prior models in terms of deception or persuasion.