Gemma 2: Improving Open Language Models at a Practical Size

🌈 Abstract

The article introduces Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. The key points are:

  • Gemma 2 applies several known technical modifications to the Transformer architecture, such as interleaving local-global attentions and group-query attention.
  • The 2B and 9B models are trained with knowledge distillation instead of next token prediction, resulting in better performance for their size.
  • The models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times larger.
  • All Gemma 2 models are released to the community.

🙋 Q&A

[01] Model Architecture

1. What are the key architectural differences between Gemma 1 and Gemma 2 models?

  • Gemma 2 uses deeper networks compared to Gemma 1.
  • Gemma 2 alternates between local sliding-window attention and global attention in every other layer, with a 4,096-token sliding window for local layers and an 8,192-token span for global layers.
  • Gemma 2 uses logit soft-capping, post-norm and pre-norm with RMSNorm, and Grouped-Query Attention (GQA); a minimal sketch of soft-capping and the alternating attention masks follows this list.
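
To make the attention modifications concrete, here is a minimal NumPy sketch of logit soft-capping and the alternating local/global causal masks. The cap value, window size, and the even/odd layer assignment are illustrative assumptions, not the exact Gemma 2 implementation.

```python
import numpy as np

def soft_cap(logits, cap=50.0):
    # Logit soft-capping: squash values smoothly into (-cap, cap).
    return cap * np.tanh(logits / cap)

def causal_mask(seq_len, layer_idx, window=4096):
    # Alternate masks layer by layer: even layers use a local sliding
    # window, odd layers attend to the full (causal) context.
    # The even/odd assignment here is an assumption for illustration.
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    if layer_idx % 2 == 0:
        return causal & (q - k < window)   # local sliding-window layer
    return causal                          # global attention layer

# Toy usage: cap raw attention scores, then apply the layer's mask.
scores = np.random.randn(16, 16) * 30.0
mask = causal_mask(seq_len=16, layer_idx=0, window=8)
masked_scores = np.where(mask, soft_cap(scores), -np.inf)
```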

2. What are the key parameter settings for the Gemma 2 models?

  • The 2B model has 2,024,517,888 non-embedding parameters, the 9B model has 8,324,201,984, and the 27B model has 26,047,480,320.
  • All models use a context length of 8192 tokens, Rotary Position Embeddings (RoPE), and the approximated GeGLU non-linearity (a toy GeGLU sketch follows this list).
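
As a reference for the feed-forward non-linearity, below is a small NumPy sketch of a GeGLU block using the tanh-approximated GELU. The weight names and dimensions are illustrative assumptions, not the published parameterization.

```python
import numpy as np

def gelu_tanh(x):
    # Tanh approximation of GELU (the "approximated" variant).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu_ffn(x, w_gate, w_up, w_down):
    # GeGLU feed-forward block: a GELU-gated linear unit followed by a
    # down projection. Weight names are hypothetical placeholders.
    return (gelu_tanh(x @ w_gate) * (x @ w_up)) @ w_down

# Toy usage with small, illustrative dimensions.
d_model, d_ff = 8, 32
x = np.random.randn(4, d_model)
out = geglu_ffn(x,
                np.random.randn(d_model, d_ff),
                np.random.randn(d_model, d_ff),
                np.random.randn(d_ff, d_model))
```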

[02] Pre-training

1. What are the key differences in the pre-training data and infrastructure used for Gemma 2 compared to Gemma 1?

  • Gemma 2 is trained on 13 trillion tokens for the 27B model, 8 trillion for the 9B model, and 2 trillion for the 2B model.
  • The models are trained on TPUv4, TPUv5e, and TPUv5p with data replication and model sharding.
  • Knowledge distillation is used to train the 2B and 9B models, while the 27B model is trained from scratch; a sketch of the distillation objective follows this list.
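
The distillation objective replaces the one-hot next-token target with the teacher's full next-token distribution. The sketch below shows one common formulation (per-token cross-entropy against the teacher's probabilities); it is an assumption-level illustration, not the exact loss or training code used for Gemma 2.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits):
    # Cross-entropy of the student against the teacher's next-token
    # distribution, averaged over token positions. In plain next-token
    # prediction the target would instead be a one-hot vector.
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-9)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Toy usage: 4 token positions over an 11-token vocabulary.
teacher_logits = np.random.randn(4, 11)
student_logits = np.random.randn(4, 11)
loss = distillation_loss(student_logits, teacher_logits)
```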

2. How does the carbon footprint of Gemma 2 pre-training compare to Gemma 1?

  • The carbon emissions from pre-training the Gemma 2 models are estimated to be comparable to Gemma 1; in both cases Google data centers are carbon neutral, so the emissions are offset.

[03] Post-training

1. What are the key differences in the post-training process for Gemma 2 compared to Gemma 1?

  • Gemma 2 uses the same control tokens as Gemma 1 but a different formatting schema, in which the model explicitly ends its generations with an <end_of_turn> token (see the formatting sketch after this list).
  • The post-training data mixture and hyperparameters were tuned to improve helpfulness while minimizing harms related to safety and hallucinations.
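
As a rough illustration of the dialogue format, the helper below renders conversation turns with the Gemma control tokens; the function name is hypothetical, and the exact prompt template should be taken from the official model card.

```python
def format_gemma_dialogue(turns):
    # Render (role, text) turns with Gemma control tokens; each turn is
    # closed explicitly with <end_of_turn>, and the trailing open
    # <start_of_turn>model asks the model to generate its reply.
    parts = []
    for role, text in turns:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

prompt = format_gemma_dialogue([("user", "Explain grouped-query attention.")])
print(prompt)
```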

2. How does the performance of the Gemma 2 IT models compare to Gemma 1 and other open models?

  • The Gemma 2 IT models significantly outperform the previous Gemma 1 models and are competitive with larger open models on a variety of benchmarks.
  • The Gemma 2 27B IT model in particular sets a new state-of-the-art on the LMSYS Chatbot Arena evaluation.

[04] Evaluations

1. How do the pre-trained Gemma 2 models compare to other large open models?

  • The Gemma 2 27B model outperforms a similarly sized model (Qwen1.5 32B) and is competitive with a larger model (LLaMA-3 70B) on the HuggingFace evaluation suite.
  • The Gemma 2 2B and 9B models show significant improvements over the previous Gemma 1 versions, benefiting from the knowledge distillation approach.

2. What are the key findings from the safety and responsibility evaluations of the Gemma 2 models?

  • The Gemma 2 models show lower rates of verbatim and approximate memorization of training data compared to prior models.
  • The models perform well on safety-critical evaluations like offensive cybersecurity and code vulnerability detection, though they still have limitations compared to specialized systems.
  • The Gemma 2 models demonstrate strong conversational and rapport-building capabilities, but do not show significant advantages over prior models in terms of deception or persuasion.