
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
Abstract
The paper proposes TC-FPx, a GPU kernel design scheme that provides unified Tensor Core support for floating-point weights with various quantization bit-widths, such as 6-bit (FP6). This enables efficient inference of large language models (LLMs) by addressing the "memory wall" issue during LLM inference. The authors integrate TC-FPx into an existing inference system to provide new end-to-end support called FP6-LLM, which achieves a better trade-off between inference cost and model quality than existing approaches.
Q&A
[01] Motivation for FP6 Quantization
1. What are the key advantages of FP6 quantization compared to 8-bit and 4-bit quantization?
- FP6 quantization achieves lower inference cost than 8-bit quantization by significantly reducing the GPU memory required to store model weights and by accelerating inference through reduced GPU DRAM traffic (see the footprint sketch after this list).
- FP6 quantization preserves model quality better than 4-bit quantization, showing strong and consistent results across various tasks, including code generation, as well as in zero-shot perplexity evaluations.
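To make the memory saving concrete, here is a back-of-the-envelope footprint calculation. It is only a rough sketch: it counts weight storage alone and ignores the KV cache, activations, and per-channel scale overhead, and it assumes roughly 70 billion parameters (approximately matching LLaMA-70b). At 6 bits per weight the weights need about 52.5 GB, which fits in a single 80 GB GPU, whereas FP16 needs about 140 GB.

```cuda
// Host-only, back-of-the-envelope weight footprint (compiles with nvcc or any
// C++ compiler). Counts weight storage only; KV cache, activations, and
// quantization scaling factors are ignored. 70e9 parameters is an assumption.
#include <cstdio>

int main() {
    const double n_params = 70e9;
    const double gb = 1e9;
    printf("FP16: %6.1f GB\n", n_params * 16 / 8 / gb);  // ~140.0 GB
    printf("FP8 : %6.1f GB\n", n_params *  8 / 8 / gb);  // ~ 70.0 GB
    printf("FP6 : %6.1f GB\n", n_params *  6 / 8 / gb);  // ~ 52.5 GB
    printf("FP4 : %6.1f GB\n", n_params *  4 / 8 / gb);  // ~ 35.0 GB
    return 0;
}
```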
2. What are the key challenges in supporting FP6 quantization efficiently on modern GPUs?
- The irregular bit-width of FP6 weights makes memory access unfriendly to the GPU memory hierarchy.
- The high computational overhead of de-quantizing FP6 weights to FP16 at runtime can significantly slow down the overall execution.
[02] TC-FPx Kernel Design
1. Why is it essential to enable Tensor Cores when performing inference of quantized LLMs?
- Traditional SIMT (CUDA) cores are an order of magnitude slower than Tensor Cores for linear-layer execution (see the minimal Tensor Core example after this list).
- A large fraction of the SIMT cores' computational power is already consumed by de-quantizing the model weights at runtime, which further reduces the computational power available for matrix multiplication.
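The answer above hinges on the matrix multiplication itself running on Tensor Cores, which consume regular data types such as FP16. The minimal kernel below is an illustration of the standard CUDA WMMA API, not the paper's TC-FPx kernel: the fragments it loads are FP16, so FP6 weights must first be de-quantized into FP16 before reaching this point, and that de-quantization is exactly the work left to the SIMT cores.

```cuda
// Minimal Tensor Core matrix multiply via the WMMA API (illustration only).
// Launch with a single warp (32 threads); requires sm_70 or newer.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void mma_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);          // start from a zero accumulator
    wmma::load_matrix_sync(a, A, 16);        // FP16 activations, ld = 16
    wmma::load_matrix_sync(b, B, 16);        // FP16 (de-quantized) weights
    wmma::mma_sync(acc, a, b, acc);          // D = A * B + C on Tensor Cores
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```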
2. What is the key insight behind the "Ahead-of-time Bit-level Pre-packing" technique?
- By reordering and pre-packing the weights ahead of time (offline), the runtime memory access pattern becomes well-aligned with the 32-bit access granularity of the GPU memory hierarchy, eliminating the inefficiency caused by the irregular 6-bit width. A generic packing sketch is shown below.
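As an illustration of the idea (not the paper's exact layout, which additionally reorders weights to match the order in which Tensor Core fragments consume them), the host-side sketch below densely packs 6-bit codes into 32-bit words so that the GPU later issues only aligned 32-bit reads.

```cuda
// Generic host-side bit-level pre-packing sketch: 6-bit weight codes are
// concatenated into 32-bit words ahead of time, so the GPU kernel later reads
// plain, well-aligned 32-bit words. The paper's weight reordering is omitted.
#include <cstdint>
#include <vector>

std::vector<uint32_t> pack_fp6(const std::vector<uint8_t>& codes /* 6-bit values */) {
    std::vector<uint32_t> packed((codes.size() * 6 + 31) / 32, 0u);
    size_t bit_pos = 0;
    for (uint8_t c : codes) {
        const uint32_t v = c & 0x3Fu;                // keep the low 6 bits
        const size_t word = bit_pos / 32, off = bit_pos % 32;
        packed[word] |= v << off;
        if (off > 26)                                // code straddles a word boundary
            packed[word + 1] |= v >> (32 - off);
        bit_pos += 6;
    }
    return packed;
}
```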
3. How does the "SIMT-Efficient GPU Runtime" design reduce the overhead of weight de-quantization?
- It leverages optimized bit-wise SIMT-core instructions to perform the FPx-to-FP16 de-quantization efficiently.
- It exploits bit-level parallelism to de-quantize multiple weights within each 32-bit register simultaneously, further reducing the runtime overhead (a simplified single-weight sketch follows this list).
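The sketch below shows what such a bit-wise de-quantization can look like for a single weight, assuming an FP6 E3M2 layout (1 sign, 3 exponent, 2 mantissa bits). The real kernel processes several weights per 32-bit register in parallel and folds in the quantization scale; both are omitted here.

```cuda
// Simplified FP6 (E3M2) -> FP16 de-quantization for one 6-bit code, using only
// bit-wise operations plus one FP16 multiply. Assumptions: 1 sign, 3 exponent,
// 2 mantissa bits; per-channel scaling and the multi-weight path are omitted.
#include <cstdint>
#include <cuda_fp16.h>

__device__ __forceinline__ half fp6_e3m2_to_fp16(uint32_t code6) {
    const uint32_t sign = (code6 & 0x20u) << 10;  // FP6 bit 5 -> FP16 sign bit 15
    const uint32_t rest = (code6 & 0x1Fu) << 8;   // exponent+mantissa -> bits 12..8
    const half h = __ushort_as_half(static_cast<unsigned short>(sign | rest));
    // The exponent bias differs between FP6 (3) and FP16 (15); multiplying by
    // 2^(15-3) = 4096 corrects it, and also handles subnormal FP6 inputs.
    return __hmul(h, __float2half(4096.0f));
}
```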
[03] Evaluation
1. What are the key performance improvements achieved by FP6-LLM compared to the FP16 baseline?
- For the LLaMA-70b model, FP6-LLM achieves up to 2.5x higher normalized inference throughput (tokens per GPU-second) using only a single GPU, compared to the FP16 baseline using two GPUs.
- For the OPT-30b model, FP6-LLM achieves up to 4x higher normalized inference throughput compared to the FP16 baseline.
2. How does the performance of TC-FPx kernel compare to other state-of-the-art quantization approaches?
- TC-FPx outperforms the 8-bit quantization support in TensorRT-LLM by up to 2.5x and delivers performance similar to that of 4-bit quantization approaches while providing significantly better model quality.