LazyLLM
Abstract
The paper introduces LazyLLM, a novel token pruning technique that improves the efficiency of large language model (LLM) inference, particularly in long-context scenarios. LazyLLM selectively computes the key-value (KV) cache only for the tokens that are important for predicting the next token, and "lazily" defers the computation of the remaining tokens to later steps when they become relevant. This significantly reduces the time-to-first-token (TTFT) during the prefilling stage of LLM inference without requiring any model fine-tuning.
Q&A
[01] Introduction
1. What are the two sequential stages of standard prompt-based LLM inference? The two sequential stages of standard prompt-based LLM inference are:
- Prefilling stage: The model computes and saves the KV cache of each token from the prompt, and predicts the first token.
- Decoding stage: The model reuses the cached KVs to decode the next token iteratively until the stop criteria are met.
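A minimal, self-contained sketch of these two stages with a toy single-head attention layer is shown below. The shapes, random weights, and helper names (`attend`, `prefill`, `decode_step`) are illustrative assumptions, not the paper's implementation.

```python
# Toy illustration of the two inference stages with a KV cache.
import numpy as np

d = 16                                    # toy hidden size
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def attend(q, K, V):
    """Single query attending over all cached keys/values."""
    scores = K @ q / np.sqrt(d)           # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                          # (d,)

def prefill(prompt_embs):
    """Prefilling stage: compute and save the KV cache for every prompt
    token, and produce the hidden state used to predict the first token."""
    K_cache = prompt_embs @ Wk            # (n, d)
    V_cache = prompt_embs @ Wv            # (n, d)
    q_last = prompt_embs[-1] @ Wq
    return attend(q_last, K_cache, V_cache), K_cache, V_cache

def decode_step(new_emb, K_cache, V_cache):
    """Decoding stage: append one token's KV to the cache and reuse it."""
    K_cache = np.vstack([K_cache, new_emb @ Wk])
    V_cache = np.vstack([V_cache, new_emb @ Wv])
    return attend(new_emb @ Wq, K_cache, V_cache), K_cache, V_cache

prompt = rng.standard_normal((8, d))      # 8 prompt "tokens"
h, K, V = prefill(prompt)                 # TTFT is dominated by this call
for _ in range(3):                        # a few decode iterations
    new_emb = rng.standard_normal(d)      # placeholder for the generated token's embedding
    h, K, V = decode_step(new_emb, K, V)
```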
2. What is the challenge with the prefilling stage for long prompts? For long prompts, the prefilling stage can be slow because the KV cache must be computed for every prompt token, and the cost of computing attention grows quadratically with the number of tokens in the prompt. This can significantly increase the time-to-first-token (TTFT); the scaling is sketched below.
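The quadratic claim follows from standard transformer arithmetic rather than a figure reported in the paper: during prefilling, every layer forms an n × n attention score matrix over the n prompt tokens, whereas each decoding step attends from a single new query to the cache.

```latex
% Per-layer attention cost for a prompt of n tokens and hidden size d (illustrative):
\underbrace{\mathcal{O}(n^2 d)}_{\text{prefill: } QK^\top \text{ over all } n \text{ prompt tokens}}
\quad\text{vs.}\quad
\underbrace{\mathcal{O}(n d)}_{\text{decode: one new query per generated token}}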
3. What is the key question that the paper aims to address? The paper asks whether all prompt tokens are essential for generating the first token, and explores ways to selectively compute the KV cache in order to reduce the TTFT.
[02] LazyLLM
1. What is the key idea behind LazyLLM? The key idea behind LazyLLM is to selectively compute the KV cache for tokens that are important for predicting the next token, and "lazily" defer the computation of remaining tokens to later steps when they become relevant.
2. How does LazyLLM perform progressive token pruning? LazyLLM uses the attention score of the prior transformer layer to measure the importance of tokens and progressively prunes tokens along the depth of the transformer. It keeps more tokens at earlier transformer layers and gradually reduces the number of tokens towards the end of the transformer.
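A hedged sketch of this layer-wise pruning follows. The keep-ratio schedule, the use of the last token's attention row as the importance score, and the helper names are assumptions chosen for illustration, not the paper's exact configuration or hyperparameters.

```python
# Illustrative layer-wise token pruning driven by attention to the last token.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def prune_by_attention(hidden, attn_to_last, keep_ratio):
    """Keep the tokens the last token attends to most strongly
    (the last token itself is always kept)."""
    n = hidden.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = set(np.argsort(attn_to_last)[-k:].tolist())
    keep.add(n - 1)                              # never prune the last token
    return hidden[np.array(sorted(keep))]        # preserve original order

rng = np.random.default_rng(0)
n, d = 32, 16
hidden = rng.standard_normal((n, d))

# Progressively tighter keep ratios at deeper layers (illustrative schedule):
for layer, keep_ratio in enumerate([1.0, 1.0, 0.7, 0.4]):
    scores = hidden @ hidden[-1] / np.sqrt(d)    # stand-in for the prior layer's attention scores
    hidden = prune_by_attention(hidden, softmax(scores), keep_ratio)
    print(f"after layer {layer}: {hidden.shape[0]} tokens kept")
```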
3. How does LazyLLM address the challenge of reviving previously pruned tokens? To address the challenge of reviving previously pruned tokens, LazyLLM introduces an additional caching mechanism called Aux Cache, which stores the hidden states of pruned tokens. This provides a computationally efficient pathway to revive pruned tokens and ensures that the worst-case runtime of LazyLLM is never slower than the baseline; a sketch of the mechanism follows.
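Below is a minimal sketch of such two-cache bookkeeping, assuming per-layer dictionaries keyed by token index. The names (`LayerCaches`, `prefill_layer`, `revive`) are hypothetical and only illustrate the revival pathway described above.

```python
# Illustrative bookkeeping: KV cache for kept tokens, Aux Cache for pruned ones.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class LayerCaches:
    kv: dict = field(default_factory=dict)   # token index -> (key, value) at this layer
    aux: dict = field(default_factory=dict)  # token index -> hidden state entering this layer

def prefill_layer(layer_id, kept, pruned, caches, Wk, Wv):
    """kept / pruned map token index -> hidden state entering this layer.
    KV is computed only for kept tokens; pruned tokens' hidden states are
    stashed in the Aux Cache so earlier layers never need to be rerun."""
    c = caches[layer_id]
    for idx, h in kept.items():
        c.kv[idx] = (h @ Wk, h @ Wv)
    c.aux.update(pruned)

def revive(layer_id, token_idx, caches, Wk, Wv):
    """Revive a previously pruned token at this layer from the Aux Cache,
    instead of recomputing it from the bottom of the network."""
    c = caches[layer_id]
    if token_idx not in c.kv:                 # only compute if not already cached
        h = c.aux[token_idx]
        c.kv[token_idx] = (h @ Wk, h @ Wv)
    return c.kv[token_idx]

rng = np.random.default_rng(0)
d = 8
Wk, Wv = rng.standard_normal((d, d)), rng.standard_normal((d, d))
caches = {0: LayerCaches()}

hidden = {i: rng.standard_normal(d) for i in range(4)}     # 4 prompt tokens
kept = {i: hidden[i] for i in (0, 3)}                      # kept at layer 0
pruned = {i: hidden[i] for i in (1, 2)}                    # deferred ("lazy") tokens
prefill_layer(0, kept, pruned, caches, Wk, Wv)

k2, v2 = revive(0, 2, caches, Wk, Wv)   # token 2 becomes relevant in a later step
```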
[03] Experiments
1. What are the key findings from the experiments on TTFT speedup vs. accuracy? The experiments show that LazyLLM can consistently achieve better TTFT speedup with negligible accuracy drop across multiple tasks, compared to baseline methods like random token dropping, static token pruning, and prompt compression.
2. How does LazyLLM impact the overall generation speed? The experiments show that LazyLLM reduces the total amount of computation by selecting a smaller subset of tokens from the prompt, which leads to additional speedup in the overall generation process across diverse tasks.
3. What is the effect of the locations of pruning layers and the number of tokens pruned? The experiments show that pruning at later transformer layers consistently performs better than pruning at earlier layers, suggesting that later layers are less sensitive to token pruning. Gradually reducing the number of tokens towards the end of the transformer helps balance speedup and accuracy.