You Only Cache Once: Decoder-Decoder Architectures for Language Models
Abstract
The paper introduces YOCO (You Only Cache Once), a decoder-decoder architecture for large language models. YOCO consists of two components: a self-decoder that efficiently encodes global key-value (KV) caches, and a cross-decoder that reuses these shared KV caches via cross-attention. This design substantially reduces GPU memory demands while retaining global attention capability. In addition, the computation flow allows prefilling to exit early without changing the final output, which significantly speeds up the prefill stage. Experiments show that YOCO performs favorably compared to Transformers across settings that scale up model size and the number of training tokens. The paper also extends YOCO to a 1M-token context length with near-perfect needle retrieval accuracy. Profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes.
Q&A
[01] You Only Cache Once (YOCO)
1. What are the two main components of the YOCO architecture? The YOCO architecture consists of two main components (a minimal sketch follows the list):
- Self-decoder: Efficiently encodes global key-value (KV) caches
- Cross-decoder: Reuses the shared KV caches produced by the self-decoder via cross-attention
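To make the layout concrete, here is a minimal PyTorch-style sketch, not the paper's implementation: the class name, layer choices, and dimensions are illustrative stand-ins, and causal masking plus the efficient self-attention variants are omitted for brevity.

```python
import torch
import torch.nn as nn

class YOCOSketch(nn.Module):
    """Decoder-decoder sketch: the self-decoder builds one global KV cache,
    and every cross-decoder layer reuses that same cache via cross-attention."""

    def __init__(self, d_model=512, n_heads=8, n_layers=8):
        super().__init__()
        half = n_layers // 2
        # Self-decoder stack (stand-in for sliding-window attention / gated retention;
        # causal masking is omitted to keep the sketch short).
        self.self_decoder = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(half)]
        )
        # Single projection that produces the global KV cache -- cached only once.
        self.to_kv = nn.Linear(d_model, 2 * d_model)
        # Cross-decoder stack (feed-forward sublayers and residuals omitted).
        self.cross_decoder = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(half)]
        )

    def forward(self, x):
        h = x
        for layer in self.self_decoder:        # efficient self-attention half
            h = layer(h)
        k, v = self.to_kv(h).chunk(2, dim=-1)  # the one shared KV cache
        out = h
        for attn in self.cross_decoder:        # every layer attends to the same K/V
            out, _ = attn(out, k, v)
        return out

# Example: a batch with 16 token embeddings of width 512.
y = YOCOSketch()(torch.randn(1, 16, 512))
print(y.shape)  # torch.Size([1, 16, 512])
```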
2. How does YOCO reduce GPU memory demands compared to Transformers? YOCO caches the global KV pairs only once, whereas a Transformer decoder must store keys and values for every layer during inference. As a result, YOCO's cache memory shrinks roughly by a factor of the number of layers compared to a Transformer decoder; a back-of-the-envelope estimate is sketched below.
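For intuition, here is a rough per-token KV-cache estimate; the layer count, head count, head dimension, and dtype size below are assumed values for illustration, not the paper's configurations.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache stored per token: keys + values for each cached layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Illustrative ~3B configuration (assumed numbers, not from the paper).
L, H_KV, D = 32, 8, 128

transformer = kv_bytes_per_token(n_layers=L, n_kv_heads=H_KV, head_dim=D)
# YOCO caches the global KV pairs only once (one layer-equivalent of cache);
# the self-decoder's constant-size sliding-window / recurrent state is ignored here.
yoco = kv_bytes_per_token(n_layers=1, n_kv_heads=H_KV, head_dim=D)

tokens = 128_000
print(f"Transformer: {transformer * tokens / 2**30:.2f} GiB for {tokens} tokens")
print(f"YOCO:        {yoco * tokens / 2**30:.2f} GiB for {tokens} tokens")
print(f"Saving factor: {transformer / yoco:.0f}x (roughly the number of layers)")
```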
3. How does YOCO's computation flow enable faster prefilling? The computation flow of the decoder-decoder architecture allows prefilling to exit early, before entering the cross-decoder, because the cross-decoder only reuses the KV cache that the self-decoder has already produced. This dramatically speeds up the prefill stage, since only the self-decoder (roughly half of the layers) needs to run over the prompt; a sketch of such a prefill routine is shown below.
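Continuing with the hypothetical YOCOSketch module above, an early-exit prefill would run only the self-decoder half over the prompt and stop once the shared KV cache is built; this is a sketch, not the paper's prefill code.

```python
import torch

@torch.no_grad()
def prefill(model: "YOCOSketch", prompt_embeds: torch.Tensor):
    """Early-exit prefill: process the prompt with the self-decoder only and
    materialise the shared KV cache; the cross-decoder layers are skipped
    entirely until decoding starts, which is what makes prefilling cheap."""
    h = prompt_embeds
    for layer in model.self_decoder:
        h = layer(h)
    k, v = model.to_kv(h).chunk(2, dim=-1)
    return k, v  # cached once, reused by every cross-decoder layer at decode time
```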
[02] Design Choices of Self-Decoder
1. What are the two efficient self-attention methods used in the self-decoder? The two efficient self-attention methods used in the self-decoder are:
- Gated retention (gRet)
- Sliding-window attention
2. How does gated retention (gRet) achieve training parallelism, good performance, and low inference cost? Gated retention unifies the parallel, recurrent, and chunkwise recurrent computation paradigms. The training process usually uses the parallel or chunkwise recurrent paradigms, while the inference stage can employ the recurrent paradigm for constant KV memory.
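A minimal sketch of the recurrent inference paradigm mentioned above, with the gate modelled as a data-dependent scalar per step; this simplifies the paper's formulation, and the shapes and gate parameterisation are assumptions for illustration.

```python
import numpy as np

def gret_recurrent_step(S, q_t, k_t, v_t, gamma_t):
    """One recurrent step of gated retention:
        S_t = gamma_t * S_{t-1} + k_t^T v_t   (constant-size state)
        o_t = q_t @ S_t
    The state S has shape (d_k, d_v) and does not grow with sequence length,
    which is what gives constant KV memory at inference time."""
    S = gamma_t * S + np.outer(k_t, v_t)
    o_t = q_t @ S
    return S, o_t

# Illustrative usage with random inputs (d_k == d_v here for simplicity).
d_k, seq_len = 64, 8
S = np.zeros((d_k, d_k))
rng = np.random.default_rng(0)
for _ in range(seq_len):
    q_t, k_t, v_t = rng.standard_normal((3, d_k))
    gamma_t = 1 / (1 + np.exp(-rng.standard_normal()))  # stand-in for the data-dependent gate in (0, 1)
    S, o_t = gret_recurrent_step(S, q_t, k_t, v_t, gamma_t)
```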
3. How does sliding-window attention reduce the KV cache memory complexity compared to vanilla Transformer decoders? Sliding-window attention restricts the attention range to a fixed window size C, reducing the KV cache memory complexity from O(N) to O(C), where N is the sequence length; i.e., the memory usage is constant rather than increasing with sequence length. A minimal cache sketch follows.
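A tiny sketch of the fixed-size cache bookkeeping, using a deque as a ring buffer; the window size and data layout are illustrative, not the paper's implementation.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps only the most recent `window` key/value pairs, so memory is O(C)
    in the window size C instead of O(N) in the sequence length."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # Entries older than the window are evicted automatically by the deque.
        self.keys.append(k)
        self.values.append(v)

    def snapshot(self):
        return list(self.keys), list(self.values)

cache = SlidingWindowKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print(cache.snapshot())  # only the last 4 tokens' KV pairs remain
```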
[03] Experiments
1. How does the scaling performance of YOCO compare to Transformer-based models? Scaling curves from 160M to 13B parameters show that YOCO is competitive with Transformer-based models: both YOCO variants, one using gated retention and one using sliding-window attention, achieve performance comparable to a Llama-style optimized Transformer architecture.
2. How does YOCO-3B-1M perform on the Needle-In-A-Haystack and Multi-Needle Retrieval tests? YOCO-3B-1M passes the Needle-In-A-Haystack test with near-perfect accuracy, demonstrating strong long-context modeling capability. In the Multi-Needle Retrieval test, YOCO-3B-1M outperforms several well-trained 128K-length models and achieves comparable performance to the 7B-size LWM-1M-text model.
3. What are the key advantages of YOCO in terms of inference efficiency? Compared to Transformer-based models, YOCO substantially reduces the GPU memory footprint and prefill latency while improving throughput. For example, YOCO can serve 128K tokens with 1GB of GPU cache memory, whereas at the 65B model size a Transformer with grouped-query attention can only support about 1.6K tokens.