magic starSummarize by Aili

Gemma 2: Improving Open Language Models at a Practical Size

๐ŸŒˆ Abstract

The article introduces Gemma 2, a new family of open language models ranging from 2 billion to 27 billion parameters. The key points are:

  • Gemma 2 models use technical modifications such as interleaving local-global attentions and group-query attention, and train smaller models (2B and 9B) using knowledge distillation.
  • The resulting models deliver state-of-the-art performance for their size, and are competitive with models that are 2-3x larger.
  • The article provides details on the model architecture, pre-training, and post-training, as well as extensive evaluations on automated benchmarks and human evaluations across various domains.
  • Responsible deployment and safety considerations are also discussed, including the use of the Responsible Generative AI Toolkit.

๐Ÿ™‹ Q&A

[01] Model Architecture

1. What are the key architectural differences between Gemma 1 and Gemma 2 models?

  • Gemma 2 models alternate between local sliding window attention and global attention in every other layer.
  • They use post-norm and pre-norm with RMSNorm for stabilizing training.
  • The 27B and 9B models use Grouped-Query Attention (GQA) mechanism.

2. What are the parameter counts for the different Gemma 2 models?

  • 2.6B model has 590M embedding parameters and 2B non-embedding parameters.
  • 9B model has 918M embedding parameters and 8.3B non-embedding parameters.
  • 27B model has 1.18B embedding parameters and 26B non-embedding parameters.

[02] Pre-training

1. How did the Gemma 2 models leverage knowledge distillation during pre-training?

  • The 2.6B and 9B models were trained using knowledge distillation, where they learned from the probability distribution of the next token predicted by a larger teacher model.
  • This allowed the smaller models to be trained on a much larger quantity of tokens (over 50x the compute-optimal amount) compared to training from scratch.

2. What were the key details of the pre-training infrastructure?

  • The 2.6B model was trained on 512 TPUv5e chips with 512-way data replication and 1-way model sharding.
  • The 9B model was trained on 4096 TPUv4 chips with 1024-way data replication and 4-way model sharding.
  • The 27B model was trained on 6144 TPUv5p chips with 768-way data replication and 8-way model sharding.

[03] Post-training

1. What were the key steps in the post-training process?

  • Supervised fine-tuning (SFT) on a mix of synthetic and human-generated prompt-response pairs.
  • Reinforcement learning from human feedback (RLHF) using a reward model trained on labeled preference data.
  • Averaging the models obtained from each phase to improve overall performance.

2. How did the post-training data and formatting differ from Gemma 1?

  • The post-training data was extended with a mixture of internal and external public data, including prompts from the LMSYS-chat-1M dataset.
  • The formatting schema was updated to use control tokens for the start/end of user and model turns, as well as the beginning/end of the sequence.

[04] Evaluations

1. How did the pre-trained Gemma 2 models perform compared to other open models of similar size?

  • The 27B Gemma 2 model outperformed the similar-sized Qwen1.5 32B model and was competitive with the larger LLaMA-3 70B model on benchmarks like MMLU, GSM8K, and ARC-c.
  • The 2.6B and 9B Gemma 2 models trained with distillation showed significant improvements over the previous Gemma 1 models of comparable size.

2. What were the key findings from the safety and responsibility evaluations?

  • Gemma 2 models showed lower violation rates on safety policies compared to previous Gemini models, especially for content related to child safety.
  • On offensive cybersecurity tasks, the 27B Gemma 2 model performed better than the 7B CodeGemma V1 model but was still less capable than the Gemini 1.5 Pro model.
  • Gemma 2 models exhibited moderate persuasion capabilities in human studies, with no significant differences compared to Gemini models.


Shared by Daniel Chen ยท
ยฉ 2024 NewMotor Inc.