
LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation

🌈 Abstract

The paper proposes a novel Vision Transformer architecture called LeMeViT that addresses the spatial redundancy in images, particularly in remote sensing images, by efficiently learning sparse meta tokens to represent dense image tokens. The key components are:

  • Learnable meta tokens that are initialized from image tokens via cross-attention and then updated through information exchange with image tokens via a novel Dual Cross-Attention (DCA) module.
  • DCA replaces the original self-attention mechanism, promoting information exchange between meta tokens and image tokens in a computationally efficient manner by reducing the complexity from quadratic to linear.
  • The hierarchical LeMeViT architecture employs DCA in the early stages and standard attention in the later stages for a better trade-off between efficiency and performance.

Experiments on image classification, remote sensing scene recognition, and various dense prediction tasks show that LeMeViT achieves competitive performance while being significantly more computationally efficient than other efficient Vision Transformer models.

🙋 Q&A

[01] Learnable Meta Tokens and Dual Cross-Attention

1. What are the key components of the proposed LeMeViT architecture?

  • Learnable meta tokens that are initialized from image tokens via cross-attention and then updated through information exchange with image tokens (see the sketch after this list)
  • Dual Cross-Attention (DCA) module that promotes information exchange between meta tokens and image tokens in a computationally efficient manner by reducing the complexity from quadratic to linear
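
A minimal sketch, assuming a PyTorch-style implementation, of how meta tokens could be initialized from dense image tokens with a single cross-attention layer. The module name MetaTokenInit, the token counts, and the dimensions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MetaTokenInit(nn.Module):
    """Initialize a small set of learnable meta tokens from dense image tokens."""
    def __init__(self, dim=96, num_meta=16, num_heads=4):
        super().__init__()
        # Learnable queries that become the meta tokens after attending to the image.
        self.meta_queries = nn.Parameter(torch.randn(1, num_meta, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens):                   # image_tokens: (B, N, C)
        B = image_tokens.shape[0]
        q = self.meta_queries.expand(B, -1, -1)        # (B, M, C), with M << N
        # Meta queries attend to all image tokens (cross-attention),
        # aggregating the dense content into a sparse summary.
        meta, _ = self.cross_attn(q, image_tokens, image_tokens)
        return self.norm(meta)                         # (B, M, C)

# Usage: 16 meta tokens summarizing 56 * 56 = 3136 image tokens.
x = torch.randn(2, 56 * 56, 96)
meta_tokens = MetaTokenInit()(x)                       # -> (2, 16, 96)
```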

2. How does the DCA module work?

The DCA module replaces the original self-attention mechanism. It uses a dual-branch structure in which image tokens and meta tokens alternately serve as the query and as the key/value, significantly reducing the computational complexity compared to self-attention.
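
As a rough illustration of the dual-branch idea, the following PyTorch-style sketch swaps the roles of image tokens and meta tokens as query and key/value across two cross-attention branches. The class name DualCrossAttention and all shapes are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Dual-branch cross-attention between image tokens and meta tokens (sketch)."""
    def __init__(self, dim=96, num_heads=4):
        super().__init__()
        self.img_from_meta = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.meta_from_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens, meta_tokens):      # (B, N, C), (B, M, C)
        # Branch 1: image tokens query the meta tokens -> cost O(N * M).
        img_out, _ = self.img_from_meta(image_tokens, meta_tokens, meta_tokens)
        # Branch 2: meta tokens query the image tokens -> cost O(M * N).
        meta_out, _ = self.meta_from_img(meta_tokens, image_tokens, image_tokens)
        # With M fixed and small, both branches are linear in N,
        # versus O(N^2) for full self-attention over the image tokens.
        return image_tokens + img_out, meta_tokens + meta_out

# Usage with dense image tokens and sparse meta tokens.
img = torch.randn(2, 56 * 56, 96)
meta = torch.randn(2, 16, 96)
img, meta = DualCrossAttention()(img, meta)
```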

3. How is the LeMeViT architecture designed?

LeMeViT follows a hierarchical structure, with DCA employed in the early stages and standard attention used in the later stages to achieve a better trade-off between efficiency and performance.
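
The hierarchical layout could be expressed as a simple stage configuration like the one below; the depths, embedding dimensions, and naming are placeholders chosen for illustration, not the published LeMeViT configuration.

```python
# Illustrative stage layout: linear-cost DCA blocks handle the early,
# high-resolution stages (many image tokens), and standard multi-head
# self-attention takes over once the tokens have been downsampled.
stage_config = [
    # (stage, attention type, num blocks, embed dim) -- placeholder values
    (1, "dual_cross_attention",    2,  64),
    (2, "dual_cross_attention",    2, 128),
    (3, "standard_self_attention", 6, 256),
    (4, "standard_self_attention", 2, 512),
]
```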

[02] Experiments and Results

1. What are the key findings from the experiments on image classification?

Compared to other efficient Vision Transformer models, LeMeViT achieves the best trade-off between efficiency (throughput, parameter count, MACs) and performance (top-1 accuracy) on the ImageNet-1K benchmark.

2. How does LeMeViT perform on remote sensing tasks?

LeMeViT delivers competitive performance with significantly better computational efficiency than Swin Transformer and ViTAE on remote sensing scene recognition, object detection, semantic segmentation, and change detection tasks.

3. What insights are provided by the attention map visualizations?

The visualizations show that the learned meta tokens attend to semantically meaningful parts of both natural and remote sensing images, indicating that the meta tokens learn effective representations by aggregating important semantic regions.
