Gemma 2: Improving Open Language Models at a Practical Size
Abstract
The article introduces Gemma 2, a new family of open language models ranging from 2 billion to 27 billion parameters. The key points are:
- Gemma 2 models apply technical modifications such as interleaved local-global attention and grouped-query attention, and the smaller models (2B and 9B) are trained with knowledge distillation.
- The resulting models deliver state-of-the-art performance for their size, and are competitive with models that are 2-3x larger.
- The article provides details on the model architecture, pre-training, and post-training, as well as extensive evaluations on automated benchmarks and human evaluations across various domains.
- Responsible deployment and safety considerations are also discussed, including the use of the Responsible Generative AI Toolkit.
Q&A
[01] Model Architecture
1. What are the key architectural differences between Gemma 1 and Gemma 2 models?
- Gemma 2 models alternate between local sliding-window attention and global attention in every other layer (the interleaving pattern is sketched after this list).
- RMSNorm is applied in both pre-norm and post-norm positions, i.e. the input and the output of each transformer sub-layer are normalized, to stabilize training.
- The 27B and 9B models use the Grouped-Query Attention (GQA) mechanism.
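To make the interleaving concrete, below is a minimal JAX sketch of how the attention mask could differ between local and global layers. The 4096-token window, the even/odd layer ordering, and all function names are illustrative assumptions, not the actual Gemma 2 implementation.

```python
# Minimal sketch (illustrative assumptions, not the official implementation):
# alternating causal masks for local sliding-window vs. global attention layers.
import jax.numpy as jnp

def causal_mask(seq_len: int) -> jnp.ndarray:
    """Global attention: position i may attend to all positions <= i."""
    pos = jnp.arange(seq_len)
    return pos[None, :] <= pos[:, None]

def sliding_window_mask(seq_len: int, window: int) -> jnp.ndarray:
    """Local attention: causal mask restricted to the most recent `window` tokens."""
    pos = jnp.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]
    recent = (pos[:, None] - pos[None, :]) < window
    return causal & recent

def mask_for_layer(layer_idx: int, seq_len: int, window: int = 4096) -> jnp.ndarray:
    """Alternate local and global attention layer by layer (the even/odd
    ordering here is an assumption made for illustration)."""
    if layer_idx % 2 == 0:
        return sliding_window_mask(seq_len, window)
    return causal_mask(seq_len)
```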
2. What are the parameter counts for the different Gemma 2 models?
- 2.6B model has 590M embedding parameters and 2B non-embedding parameters.
- 9B model has 918M embedding parameters and 8.3B non-embedding parameters.
- 27B model has 1.18B embedding parameters and 26B non-embedding parameters.
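As a sanity check, each model's headline size is roughly the sum of its embedding and non-embedding parameters. A quick arithmetic sketch in Python, using the values listed above (in billions):

```python
# Total parameters = embedding + non-embedding (values in billions).
models = {
    "2.6B": (0.59, 2.0),
    "9B":   (0.918, 8.3),
    "27B":  (1.18, 26.0),
}
for name, (embedding, non_embedding) in models.items():
    print(f"{name}: ~{embedding + non_embedding:.1f}B total parameters")
```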
[02] Pre-training
1. How did the Gemma 2 models leverage knowledge distillation during pre-training?
- The 2.6B and 9B models were trained using knowledge distillation, where they learned from the probability distribution of the next token predicted by a larger teacher model.
- Because the teacher's full distribution is a richer training signal than one-hot next-token targets, the smaller models could usefully be trained on a much larger quantity of tokens (over 50x the compute-optimal amount) than from-scratch training (a sketch of the distillation loss follows this list).
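Concretely, distillation replaces the usual one-hot next-token target with the teacher's full probability distribution over the vocabulary, and the student minimizes the cross-entropy against those soft targets. A minimal JAX sketch under that assumption (function and argument names are illustrative, not Gemma's training code):

```python
import jax.numpy as jnp
from jax.nn import log_softmax, softmax

def distillation_loss(student_logits: jnp.ndarray,
                      teacher_logits: jnp.ndarray) -> jnp.ndarray:
    """Cross-entropy between the teacher's next-token distribution and the
    student's predicted distribution, averaged over all positions.

    Both inputs have shape [batch, seq_len, vocab_size].
    """
    teacher_probs = softmax(teacher_logits, axis=-1)        # soft targets
    student_log_probs = log_softmax(student_logits, axis=-1)
    # Per-token cross-entropy: -sum_x P_teacher(x) * log P_student(x)
    token_loss = -jnp.sum(teacher_probs * student_log_probs, axis=-1)
    return jnp.mean(token_loss)
```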
2. What were the key details of the pre-training infrastructure?
- The 2.6B model was trained on 512 TPUv5e chips with 512-way data replication and 1-way model sharding.
- The 9B model was trained on 4096 TPUv4 chips with 1024-way data replication and 4-way model sharding.
- The 27B model was trained on 6144 TPUv5p chips with 768-way data replication and 8-way model sharding.
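The replication/sharding figures describe how the chips are arranged into a two-dimensional device mesh: batches are split along a data axis, while weights are partitioned along a model axis. A hypothetical JAX sketch of the 27B layout listed above (the axis names and partition specs are illustrative assumptions):

```python
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 6144 chips arranged as 768-way data replication x 8-way model sharding.
# (Requires a TPU slice of that size; shrink the shape to experiment locally.)
devices = mesh_utils.create_device_mesh((768, 8))
mesh = Mesh(devices, axis_names=("data", "model"))

# Batches are split along the "data" axis; a weight matrix is split along
# the "model" axis and replicated across data replicas.
batch_sharding = NamedSharding(mesh, P("data", None))
weight_sharding = NamedSharding(mesh, P(None, "model"))
```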
[03] Post-training
1. What were the key steps in the post-training process?
- Supervised fine-tuning (SFT) on a mix of synthetic and human-generated prompt-response pairs.
- Reinforcement learning from human feedback (RLHF) using a reward model trained on labeled preference data.
- Averaging the models obtained from each phase to improve overall performance.
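The model-averaging step amounts to an element-wise average of the weights of several fine-tuned checkpoints. A minimal sketch, assuming the checkpoints are JAX parameter pytrees with identical structure (names are illustrative):

```python
import jax

def average_checkpoints(checkpoints):
    """Element-wise average of parameter pytrees with the same structure,
    e.g. models produced by different tuning runs."""
    n = len(checkpoints)
    return jax.tree_util.tree_map(lambda *leaves: sum(leaves) / n, *checkpoints)

# Usage (hypothetical): merged_params = average_checkpoints([p1, p2, p3])
```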
2. How did the post-training data and formatting differ from Gemma 1?
- The post-training data was extended with a mixture of internal and external public data, including prompts from the LMSYS-chat-1M dataset.
- The formatting schema was updated to use control tokens for the start/end of user and model turns, as well as the beginning/end of the sequence.
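For illustration, a prompt rendered with turn-based control tokens looks roughly like the following Python snippet. The token names match Gemma's published chat template, but treat the exact layout as an approximation rather than a normative specification.

```python
def format_turn(role: str, text: str) -> str:
    """Wrap one dialogue turn in Gemma-style control tokens."""
    return f"<start_of_turn>{role}\n{text}<end_of_turn>\n"

# A single-turn prompt, ending with the header that cues the model's reply.
prompt = (
    "<bos>"
    + format_turn("user", "What is the capital of France?")
    + "<start_of_turn>model\n"
)
print(prompt)
```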
[04] Evaluations
1. How did the pre-trained Gemma 2 models perform compared to other open models of similar size?
- The 27B Gemma 2 model outperformed the similar-sized Qwen1.5 32B model and was competitive with the larger LLaMA-3 70B model on benchmarks like MMLU, GSM8K, and ARC-c.
- The 2.6B and 9B Gemma 2 models trained with distillation showed significant improvements over the previous Gemma 1 models of comparable size.
2. What were the key findings from the safety and responsibility evaluations?
- Gemma 2 models showed lower violation rates on safety policies compared to previous Gemini models, especially for content related to child safety.
- On offensive cybersecurity tasks, the 27B Gemma 2 model performed better than the 7B CodeGemma V1 model but was still less capable than the Gemini 1.5 Pro model.
- Gemma 2 models exhibited moderate persuasion capabilities in human studies, with no significant differences compared to Gemini models.