You Need to Pay Better Attention
Abstract
The paper introduces three new attention mechanisms - Optimised Attention, Efficient Attention, and Super Attention - that match or exceed standard multi-head attention in learning capability while being markedly more efficient. The key contributions are:
- Optimised Attention reduces the attention layer size by 25% and computational cost by one matrix multiplication per head, while performing similarly to standard attention.
- Efficient Attention reduces the attention layer size by 50% and computational cost by two matrix multiplications, performing on par with standard attention while being up to twice as fast.
- Super Attention outperforms standard attention by 2-7% in accuracy on vision and language tasks, while reducing the attention layer size by 25% and computational cost by up to 45% when the context size is equal to or smaller than the model dimension.
The proposed attention mechanisms aim to improve the performance and deployability of Transformer models, especially on edge devices.
Q&A
[01] Optimised Attention
1. How does Optimised Attention reduce the computational cost and number of parameters compared to standard attention? Optimised Attention shrinks the attention layer by 25% and saves one matrix multiplication per head relative to standard attention. It does this by dropping the per-head value projection: the attention scores act on the input directly, and the output projection absorbs the role of the value transformation (a minimal sketch follows this Q&A block).
2. How does the performance of Optimised Attention compare to standard attention? Optimised Attention performs similarly to standard attention in terms of learning capabilities, as demonstrated in the experiments. The authors show that Optimised Attention can replace standard attention in models that rely on multiple attention heads without significantly affecting the model's performance.
3. How does Optimised Attention maintain the linear rank of the attention mechanism compared to standard attention? The authors show that Optimised Attention is equivalent to standard multi-head attention in terms of linear rank, meaning it preserves the same amount of linearly independent information in the attention scores.
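To make the mechanism concrete, here is a minimal PyTorch sketch of an attention layer without per-head value projections. It is our own illustration of the idea as summarised above, not the paper's reference implementation: the class name, the way the raw input is sliced across heads, and the bias-free linear layers are all assumptions on our part.

```python
import math
import torch
import torch.nn as nn


class OptimisedAttentionSketch(nn.Module):
    """Sketch: multi-head attention without per-head value projections.
    Each head attends over a slice of the raw input, and the single output
    projection absorbs the role of the missing value transformation."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        # Note: no w_v here -- that is the 25% parameter saving.
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        # Split queries, keys, and the *raw input* into heads.
        q = self.w_q(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = x.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        scores = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        # One matmul per head saved: the scores act on x directly, no value projection.
        out = (scores @ v).transpose(1, 2).reshape(b, n, d)
        return self.w_o(out)
```

With the value projection gone, the layer holds three `d_model x d_model` weight matrices instead of four (the 25% saving) and skips one matrix multiplication per head.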
[02] Efficient Attention
1. How does Efficient Attention further optimize the attention mechanism compared to Optimised Attention? Efficient Attention builds on Optimised Attention and pushes the same idea further: it reduces the attention layer size by 50% and the computational cost by two matrix multiplications relative to standard attention. It does so by replacing the per-head attention score matrices with a single score matrix, while maintaining the same linear rank as standard attention (a minimal sketch follows this Q&A block).
2. How does the performance of Efficient Attention compare to standard attention? Efficient Attention performs on par with standard attention in terms of loss and accuracy, while being up to twice as fast. The authors demonstrate that single-head Efficient Attention can match the performance of multi-head standard attention, while being significantly faster and smaller.
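Under the same caveats, a single-head Efficient Attention layer might look like the sketch below: one fused query-key kernel produces the attention scores and the raw input serves as the values, leaving only two weight matrices. Fusing W_Q and W_K into a single `w_qk` kernel is our reading of "a single matrix" above; the paper's exact factorisation may differ.

```python
import math
import torch
import torch.nn as nn


class EfficientAttentionSketch(nn.Module):
    """Sketch: single-head attention keeping only two weight matrices --
    a fused query-key kernel that produces the attention scores, and the
    output projection. The raw input serves as the values."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_qk = nn.Linear(d_model, d_model, bias=False)  # in place of separate W_Q and W_K
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A single score matrix for the whole layer, no per-head scores.
        scores = torch.softmax(self.w_qk(x) @ x.transpose(-2, -1) * self.scale, dim=-1)
        # No value projection either: two of the four standard matrices are gone.
        return self.w_o(scores @ x)
```

In this sketch, two `d_model x d_model` matrices instead of four give the 50% size reduction, and each forward pass uses four full-size matrix multiplications instead of six, accounting for the saved computation.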
[03] Super Attention
1. How does Super Attention differ from the previous attention mechanisms? Super Attention introduces an additional learnable "alignment" kernel that vertically aligns and mixes the values before the attention scores are applied. This follows the authors' third observation: inserting an extra learnable linear kernel between the inputs increases the layer's learning capability (a minimal sketch follows this Q&A block).
2. How does the performance of Super Attention compare to the other attention mechanisms? Super Attention outperforms standard attention, as well as Optimised and Efficient Attention, by a significant margin (2-7% higher accuracy) in both vision and language tasks. It achieves this while being smaller than standard attention by at least 25% and faster by up to 45% when the context size is equal to or smaller than the model dimension.
3. What are the efficiency benefits of Super Attention compared to standard attention? Super Attention requires at least two fewer matrix multiplications and has 25% fewer parameters than standard attention whenever the model dimension is greater than or equal to the context length. This makes it more efficient and better suited to deployment on edge devices.
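The following is a hypothetical sketch of Super Attention under the same assumptions as before: it keeps an Efficient-style fused score kernel and adds a learnable `context_len x context_len` alignment matrix that mixes the values across context positions before the scores are applied. The identity initialisation and the choice to build on the Efficient variant are our guesses, not details stated in the summary.

```python
import math
import torch
import torch.nn as nn


class SuperAttentionSketch(nn.Module):
    """Sketch: Efficient-style attention plus a learnable alignment kernel
    that mixes the values along the context (row) dimension before the
    attention scores are applied."""

    def __init__(self, d_model: int, context_len: int):
        super().__init__()
        self.w_qk = nn.Linear(d_model, d_model, bias=False)
        # Alignment kernel: context_len x context_len, initialised to the identity.
        self.w_a = nn.Parameter(torch.eye(context_len))
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = torch.softmax(self.w_qk(x) @ x.transpose(-2, -1) * self.scale, dim=-1)
        aligned_values = self.w_a @ x  # "vertical" mixing across context positions
        return self.w_o(scores @ aligned_values)
```

Because the extra kernel scales with the context length rather than the model dimension, its size and cost stay modest whenever the context is no longer than the model dimension, which matches the condition quoted above.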
[04] Evaluation
1. How did the authors evaluate the proposed attention mechanisms? The authors evaluated the attention mechanisms on image classification with the MNIST and CIFAR100 datasets, and on sentiment analysis with the IMDB Movie Reviews and Amazon Reviews datasets. They trained Transformer models using each attention mechanism and compared accuracy, loss, and inference speed on an edge device (a MacBook Pro with an M2 chip). (A toy comparison harness for the sketches above appears at the end of this section.)
2. What were the key findings from the evaluation? The results showed that:
- Increasing the number of attention heads in standard and Optimised Attention models led to increased training and inference times, with little to no gain in performance.
- Efficient Attention and Super Attention models performed on par or better than standard attention models, while being significantly smaller and faster (up to 50% faster on the edge device).
- Super Attention outperformed all other attention mechanisms by a substantial margin (2-7% higher accuracy) on both vision and language tasks.
3. How do the proposed attention mechanisms address the challenges of large foundation and language models? The introduced attention mechanisms, particularly Efficient Attention and Super Attention, address the challenges of large models by reducing computational costs and model sizes, making AI systems more efficient, accessible, and sustainable. The authors argue that these attention mechanisms can redefine the landscape of AI by enabling more powerful and deployable models, especially on edge devices.
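For readers who want to sanity-check the size and speed claims on their own hardware, the toy harness below compares parameter counts and CPU wall-clock time for the three sketch classes defined earlier in this article (it assumes they are in scope). It is not the paper's benchmark, which trains full Transformer classifiers on the datasets listed above; the dimensions and iteration count here are arbitrary.

```python
import time
import torch

# Hypothetical harness: assumes OptimisedAttentionSketch, EfficientAttentionSketch,
# and SuperAttentionSketch from the earlier sketches are in scope.
d_model, context_len, batch = 256, 128, 32
x = torch.randn(batch, context_len, d_model)

models = {
    "optimised (4 heads)": OptimisedAttentionSketch(d_model, n_heads=4),
    "efficient": EfficientAttentionSketch(d_model),
    "super": SuperAttentionSketch(d_model, context_len),
}

for name, model in models.items():
    n_params = sum(p.numel() for p in model.parameters())
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(100):
            model(x)
        elapsed = time.perf_counter() - start
    print(f"{name:>20}: {n_params:,} params, {elapsed:.3f}s for 100 forward passes")
```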