Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models
Abstract
The article presents the Spectra LLM suite, a comprehensive study of ternary, quantized, and FP16 language models. The suite comprises 54 language models ranging from 99M to 3.9B parameters, all trained on the same 300B tokens. It includes FloatLMs, post-training quantized QuantLMs (3, 4, 6, and 8 bits), and ternary LLMs (TriLMs), an improved architecture for ternary language modeling that outperforms previously proposed ternary models. The article evaluates these models across benchmarks covering commonsense, reasoning, knowledge, and toxicity, and offers insights into their training dynamics and scaling trends.
Q&A
[01] Spectra LLM Suite
1. What is the Spectra LLM suite? The Spectra LLM suite is a comprehensive set of 54 language models spanning different parameter counts (99M to 3.9B) and bitwidths (FP16, 3-bit, 4-bit, 6-bit, 8-bit, and ternary). It includes FloatLMs, post-training quantized QuantLMs, and ternary LLMs (TriLMs).
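For intuition on what post-training quantization does to FloatLM weights, here is a minimal round-to-nearest (RTN) sketch in Python. It is a generic illustration, not the calibrated method used to produce the QuantLMs; the function name and per-tensor symmetric scaling are assumptions for the example.

```python
# Minimal round-to-nearest (RTN) post-training quantization, for intuition
# only. Spectra's QuantLMs use a calibrated PTQ method; this toy version is
# not that exact algorithm.
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor RTN: map weights to a k-bit grid and back."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed
    scale = np.abs(w).max() / qmax             # one scale for the tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized weights

w = np.random.randn(256, 256).astype(np.float32)
for bits in (8, 6, 4, 3):                      # the QuantLM bitwidths
    err = np.abs(w - quantize_rtn(w, bits)).mean()
    print(f"{bits}-bit RTN mean abs error: {err:.4f}")
```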
2. What are the key contributions of the Spectra LLM suite? The key contributions of the Spectra LLM suite are:
- It provides a comprehensive set of models across different parameter counts and bitwidths, trained on the same 300B token dataset.
- It introduces an improved ternary language modeling architecture (TriLM) that outperforms previously proposed ternary models.
- It enables comparative analysis of TriLMs, FloatLMs, and QuantLMs across various benchmarks related to commonsense, reasoning, knowledge, and toxicity.
- It releases over 500 intermediate checkpoints of the models, facilitating research on training dynamics and interpretability.
3. How does the Spectra suite enable research advancements? The Spectra suite enables advancements in several ways:
- It provides transparency into the training process by releasing intermediate checkpoints and detailed documentation.
- It facilitates research on the capacities and limitations of ternary models at various scales.
- It enhances interpretability by enabling the study of models at the connection level, in contrast to the neuron-level focus of previous work.
[02] TriLM Architecture and Training
1. What are the key architectural differences between TriLM and BitNet? The key architectural differences between TriLM and BitNet are:
- TriLM follows a pre-normalization approach, normalizing before each linear layer, unlike BitNet, which normalizes both before and after each linear layer.
- TriLM uses RMSNorm with a scale parameter, while BitNet uses parameterless RMSNorm.
- TriLM employs a SwiGLU gated MLP, while BitNet uses a standard transformer MLP. (These choices are sketched in code below.)
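A minimal PyTorch sketch of these choices follows. The absmean ternarization and the straight-through gradient trick are assumptions in the style of BitNet b1.58, not the paper's exact implementation; class names are hypothetical.

```python
# Sketch of a TriLM-style pre-norm SwiGLU block with ternary linear layers.
# Absmean ternarization + straight-through estimator are assumed details.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm with a learned scale (BitNet's variant is parameterless)."""
    def __init__(self, d: int, eps: float = 1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        return self.scale * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class TernaryLinear(nn.Module):
    """Linear layer whose weights are ternarized to {-1, 0, +1} on the fly."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * d_in ** -0.5)

    def forward(self, x):
        scale = self.weight.abs().mean().clamp(min=1e-5)   # absmean scale
        w_q = torch.clamp(torch.round(self.weight / scale), -1, 1) * scale
        # Straight-through estimator: forward uses w_q, gradients reach the
        # latent full-precision weights.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w)

class TriLMBlockMLP(nn.Module):
    """Pre-normalization (norm before the linear layers, not after) + SwiGLU."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = RMSNorm(d_model)
        self.gate = TernaryLinear(d_model, d_ff)
        self.up = TernaryLinear(d_model, d_ff)
        self.down = TernaryLinear(d_ff, d_model)

    def forward(self, x):
        h = self.norm(x)
        return x + self.down(F.silu(self.gate(h)) * self.up(h))

x = torch.randn(2, 16, 64)
print(TriLMBlockMLP(64, 256)(x).shape)   # torch.Size([2, 16, 64])
```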
2. How does the TriLM optimization schedule differ from BitNet? The TriLM optimization schedule differs from BitNet in two key ways:
- Peak learning rate reduction: At the halfway point of training, TriLM reduces the peak learning rate.
- Weight decay removal: At the two-thirds point of training, TriLM removes the weight decay (L2 regularization).
These interventions lead to a sudden drop in training loss at the halfway point and accelerated convergence in the final third of training; the schedule is sketched below.
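A toy version of the two interventions, assuming a cosine base schedule and a halving of the peak learning rate; the base schedule, the 0.5 factor, and the hyperparameter values are assumptions, not the paper's exact settings.

```python
# Halve the peak LR at the 50% mark and drop weight decay to zero at the
# two-thirds mark. Cosine decay and all numbers below are assumptions.
import math

def trilm_schedule(step: int, total: int, peak_lr: float = 1.5e-3,
                   weight_decay: float = 0.1) -> tuple[float, float]:
    frac = step / total
    if frac >= 0.5:
        peak_lr *= 0.5                                    # intervention 1
    lr = 0.5 * peak_lr * (1 + math.cos(math.pi * frac))   # cosine decay
    wd = 0.0 if frac >= 2 / 3 else weight_decay           # intervention 2
    return lr, wd

for frac in (0.25, 0.49, 0.51, 0.70, 0.95):
    lr, wd = trilm_schedule(int(frac * 1000), 1000)
    print(f"progress {frac:.2f}: lr={lr:.2e}, weight_decay={wd}")
```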
3. How do the training dynamics of TriLM compare to FloatLM? The training dynamics of TriLM and FloatLM show the following key differences:
- TriLMs exhibit a sudden drop in training loss at the halfway point when the peak learning rate is reduced.
- TriLMs demonstrate accelerated convergence in the final third of training when the weight decay is removed.
- The training loss of TriLMs improves consistently with increasing parameter count, and the gap between a TriLM and a FloatLM of similar parameter count narrows at larger scales.
[03] Evaluation and Comparative Analysis
1. How do TriLMs, FloatLMs, and QuantLMs perform on commonsense and reasoning benchmarks? On commonsense and reasoning benchmarks:
- At the 2.4B and 3.9B parameter scales, TriLMs consistently outperform QuantLMs and FloatLMs of the same size (in bits).
- TriLM 3.9B matches the performance of FloatLM 3.9B on these benchmarks despite occupying far fewer bits (see the size arithmetic sketched below).
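To make the "same size (in bits)" comparison concrete, the sketch below computes rough model footprints at different bitwidths; a ternary weight carries log2(3) ≈ 1.58 bits, and the parameter counts are the suite's headline figures used as round numbers.

```python
# Back-of-the-envelope model sizes in gigabits. Embeddings and activations
# are ignored, so these are rough figures, not the paper's reported sizes.
import math

def size_gbit(n_params: float, bits_per_param: float) -> float:
    return n_params * bits_per_param / 1e9

for label, n, bits in [("FloatLM 3.9B (FP16)", 3.9e9, 16.0),
                       ("FloatLM 830M (FP16)", 8.3e8, 16.0),
                       ("QuantLM 3.9B (4-bit)", 3.9e9, 4.0),
                       ("TriLM 3.9B (ternary)", 3.9e9, math.log2(3))]:
    print(f"{label:>22}: {size_gbit(n, bits):5.1f} Gbit")
# TriLM 3.9B (~6.2 Gbit) is closer in bits to FloatLM 830M (~13.3 Gbit)
# than to FloatLM 3.9B (~62.4 Gbit).
```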
2. How do the models perform on knowledge-based tasks? On knowledge-based tasks like SciQ and TriviaQA:
- At the billion-parameter scale, TriLMs offer better performance than QuantLMs and FloatLMs of the same size (in bits).
- TriLM 3.9B demonstrates competitive performance to FloatLM 3.9B across these knowledge-based benchmarks.
3. How do the models compare in terms of toxicity and stereotyping? Regarding toxicity and stereotyping:
- TriLM 3.9B exhibits the same level of toxicity and stereotyping as FloatLM 3.9B, which is significantly higher than that of FloatLM 830M, the model closest to it in size (in bits).
- The gap in toxicity and stereotyping between TriLMs and FloatLMs narrows as the model size increases.
4. How do the models perform on validation perplexity? On validation perplexity (the exponential of the mean per-token loss; a short sketch follows this list):
- TriLMs scale much better in terms of performance for their size (in bits) compared to FloatLMs.
- At the 3.9B scale, the gap in validation perplexity between TriLM and FloatLM starts to decrease, especially on less noisy datasets like Penn Treebank and LAMBADA.
- However, a gap in perplexity persists on web-based corpora, both in-domain and out-of-domain, at the 3.9B scale.
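As a reminder of what these numbers mean, perplexity is the exponential of the mean token-level negative log-likelihood, so the gaps above correspond to small differences in loss. The values in this minimal sketch are illustrative, not the paper's numbers.

```python
# Perplexity is exp(mean negative log-likelihood per token), so perplexity
# gaps correspond to small loss differences. Values are illustrative.
import math

def perplexity(mean_nll_nats: float) -> float:
    return math.exp(mean_nll_nats)

print(perplexity(2.30))  # ~9.97
print(perplexity(2.35))  # ~10.49: +0.05 nats/token ~= +0.5 perplexity
```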