Training LLMs over Neurally Compressed Text
Abstract
The paper explores the idea of training large language models (LLMs) over highly compressed text, with the goal of achieving greater efficiency in training and serving. The authors investigate various compression methods, focusing primarily on Arithmetic Coding (AC), and propose a novel technique called "Equal-Info Windows" to improve the learnability of the compressed text. The key findings are:
- Training LLMs directly over naively AC-compressed text is not feasible, as the compressed output is too opaque for the model to learn.
- Equal-Info Windows, which reset the AC encoder and model context at fixed bit thresholds, enable effective learning over compressed text, outperforming byte-level baselines in terms of perplexity and inference speed.
- While the best-performing Equal-Info Windows models still underperform subword tokenizers in terms of perplexity, they have the benefit of shorter sequence lengths, which can reduce latency.
- The authors provide extensive analysis on the properties that contribute to learnability, and offer suggestions for further improving high-compression tokenizers.
Q&A
[01] Motivation and Background
1. What are the main advantages of training LLMs over compressed text? The main advantages are:
- Efficiency: Compressing the text allows the model to process more text for the same computational cost, improving training and inference efficiency.
- Longer context: Compressed text allows the model to condition on longer contextual dependencies without increasing compute.
- Adaptive compute allocation: Compression can spread information more uniformly across the sequence, allowing the model to allocate compute more effectively.
2. What are the key challenges in training LLMs over compressed text? The key challenges are:
- Learnability: Strong compression can remove too much information, making the compressed output too opaque for the model to learn.
- Numerical stability: Compression methods are sensitive to the exact model probabilities used, and bit-identical probabilities can be hard to reproduce across different hardware and software stacks.
- Multi-model inference: A model trained over compressed text must be served together with its compressor, so two models (the compressor and the LLM) need to be stored and run at inference time.
[02] Methods
1. How does the Arithmetic Coding (AC) compression method work? AC uses a language model to assign a probability to each symbol of the input text, and encodes the text into a bitstream by repeatedly partitioning the interval [0, 1) according to the cumulative probabilities of the symbols and narrowing it to the slice of the symbol actually observed. The final bitstream is the binary expansion of a number that falls within the final interval.
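To make the interval-partitioning step concrete, here is a minimal exact-arithmetic sketch of AC encoding (not the paper's implementation; production coders use fixed-precision streaming arithmetic). The `probs` function is a hypothetical stand-in for the compressing language model.

```python
# A minimal, exact-arithmetic sketch of AC encoding (illustration only; real
# coders use fixed-precision streaming). `probs` is a hypothetical stand-in
# for the compressing language model.
from fractions import Fraction

def probs(prefix):
    # Toy "model": a fixed distribution over three symbols, ignoring context.
    return {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

def ac_encode(text):
    low, high = Fraction(0), Fraction(1)
    for i, sym in enumerate(text):
        dist = probs(text[:i])
        span, cum = high - low, Fraction(0)
        # Partition [low, high) by cumulative probability; keep sym's slice.
        for s in sorted(dist):
            if s == sym:
                low, high = low + span * cum, low + span * (cum + dist[s])
                break
            cum += dist[s]
    # Emit the binary expansion of a number inside the final interval.
    bits, value, step = [], Fraction(0), Fraction(1, 2)
    while value < low:
        if value + step < high:
            value += step
            bits.append("1")
        else:
            bits.append("0")
        step /= 2
    return "".join(bits)

print(ac_encode("abac"))  # likely symbols cost few bits, rare ones more
```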
2. What is the "Equal-Info Windows" compression method proposed by the authors? Equal-Info Windows is a modification to AC compression in which the text is segmented into windows that each compress to the same fixed number of bits, and the AC encoder (along with the compressing model's context) is reset at each window boundary. Because every window can be decoded independently from a known starting state, the compressed output becomes far more learnable for the LLM.
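A rough sketch of that windowing procedure, reusing the `ac_encode` helper from the sketch above; the per-window bit budget `w` and zero-padding to the window boundary are assumptions of this illustration, not necessarily the paper's exact scheme.

```python
# A rough sketch of Equal-Info Windows on top of the ac_encode sketch above.
# Assumptions: w is large enough to encode at least one symbol, and windows
# are zero-padded to exactly w bits.
def equal_info_encode(text, w=16):
    out, start = [], 0
    while start < len(text):
        end = start + 1
        # Greedily grow the window while its AC encoding still fits in w bits.
        while end < len(text) and len(ac_encode(text[start:end + 1])) <= w:
            end += 1
        out.append(ac_encode(text[start:end]).ljust(w, "0"))
        # Reset the coder here; with a real model, its context would also be
        # cleared so each window decodes from a known starting state.
        start = end
    return "".join(out)

bitstream = equal_info_encode("abacabcaab", w=8)
assert len(bitstream) % 8 == 0  # every window occupies exactly w bits
```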
3. How do the authors tokenize the compressed bitstreams for training the LLM? The authors convert the compressed bitstreams into token sequences by grouping every N bits into a single token, resulting in a vocabulary size of 2^N. They explore settings of N = 8 and N = 16, finding that larger vocabularies improve performance.
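The grouping step itself is straightforward; below is a small sketch (zero-padding the final partial token is an assumption made here for simplicity).

```python
# Group a bitstream into N-bit tokens, giving a vocabulary of size 2**N.
# Zero-padding of the final partial token is an assumption of this sketch.
def bits_to_tokens(bits, n=8):
    padded = bits.ljust(-(-len(bits) // n) * n, "0")
    return [int(padded[i:i + n], 2) for i in range(0, len(padded), n)]

print(bits_to_tokens("0100110111", n=8))  # [77, 192]; with n=16 -> [19904]
```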
[03] Results
1. What are the key findings regarding training LLMs over AC-compressed text?
- Training directly over AC-compressed text fails: the bitstream is so close to uniform and opaque that the model learns little beyond predicting the compressed tokens at chance.
- The "Equal-Info Windows" method enables effective learning over compressed text, outperforming byte-level baselines in terms of perplexity and inference speed.
- However, the best-performing Equal-Info Windows models still underperform subword tokenizers in terms of perplexity, despite their shorter sequence lengths.
2. How do the authors explain the difficulty in learning AC-compressed text? The authors hypothesize that the main challenge stems from the difficulty of learning the AC compression and decompression process itself, rather than just the modeling component. They show that even the sub-tasks of AC-compressing and AC-decompressing text are not learned well by the LLM.
3. What are the key findings regarding the "learnability" of different compression methods?
- GZip-compressed text is learnable by the LLM, but the resulting models are not competitive in either compression rate or performance.
- Equal-Info Windows compression retains more "non-uniform" information in the bitstream compared to vanilla AC, which contributes to its improved learnability.
- Larger token vocabularies (e.g., 65,536 vs. 256) also improve learnability, even when controlling for the amount of raw text seen during training.
[04] Analysis
1. How do the authors characterize the differences between the tokenization of Equal-Info Windows and SentencePiece?
- SentencePiece produces a relatively stable text-to-token mapping, where the same word typically maps to the same token sequence.
- Equal-Info Windows produces a less stable and less "semantic" tokenization, where the same word can be split into different token sequences depending on its context and where it falls within a window (a simple way to probe this is sketched below).
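One way to probe this contrast empirically is a simple stability check, sketched here (not from the paper; `tokenize` is a hypothetical stand-in for either SentencePiece or the compression-based tokenizer).

```python
# A rough heuristic (not from the paper) for tokenization stability: append
# the same word to different prefixes and check whether the tokens produced
# after each prefix agree. `tokenize` is a hypothetical callable returning a
# list of token IDs; tokens spanning the prefix/word boundary blur the check.
from collections import Counter

def stability(tokenize, word, prefixes):
    spans = [tuple(tokenize(p + word)[len(tokenize(p)):]) for p in prefixes]
    # Fraction of prefixes that agree with the most common tokenization.
    return Counter(spans).most_common(1)[0][1] / len(spans)
```

A stable tokenizer like SentencePiece should score close to 1.0 on common words, while a compression-based tokenizer like Equal-Info Windows is expected to score much lower.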
2. What insights do the authors gain about how the LLM learns to decode the Equal-Info Windows compressed text?
- The model learns to decode the window-initial tokens first, and then progressively learns to decode the later tokens within each window.
- The model struggles to maintain accurate decoding as it moves further into each window, suggesting a limit to how long it can reliably track the AC decompression process.
3. How do the authors explain the importance of the "non-uniformity" of the compressed bitstream for learnability? The authors find that compression methods whose bitstreams retain some non-uniformity (e.g., Equal-Info Windows) are more learnable by the LLM than methods whose output is nearly uniform (e.g., vanilla AC). This suggests that the LLM needs some residual, detectable patterns in the compressed data in order to model it effectively.
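One simple way to quantify this (a sketch of an entropy probe, not the paper's analysis): estimate the empirical entropy of the compressed stream's fixed-width chunks and compare it to the maximum possible for that width.

```python
# A sketch (not the paper's analysis) of measuring how uniform a bitstream
# looks: empirical entropy of its n-bit chunks, in bits per chunk.
import math
from collections import Counter

def chunk_entropy(bits, n=8):
    chunks = [bits[i:i + n] for i in range(0, len(bits) - n + 1, n)]
    counts = Counter(chunks)
    total = len(chunks)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Values near n (8 here) suggest a near-uniform stream, as with vanilla AC;
# noticeably lower values indicate residual structure a model could exploit.
```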