MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

🌈 Abstract

The article discusses the task interference problem in multimodal large language models (MLLMs) and proposes a Mixture of Multimodal Experts (MoME) architecture to mitigate it. The key components of MoME are:

  • Mixture of Vision Experts (MoVE): Adaptively aggregates visual features from various vision encoders using an Adaptive Deformable Transformation (ADT) module and an instance-level soft router.
  • Mixture of Language Experts (MoLE): Incorporates sparsely gated experts into the language model to achieve performance gains with minimal computational overhead.

The authors demonstrate that MoME can effectively adapt to task differences in both vision and language modalities, leading to significant performance improvements across various vision-language tasks compared to existing generalist MLLMs.

🙋 Q&A

[01] Analysis on MoVE

1. What are the key components of MoVE and how do they help mitigate task interference? The key components of MoVE are:

  • Adaptive Deformable Transformation (ADT) module: Transforms visual features from diverse vision encoders into a unified-length feature sequence, aligning the representations and reducing information loss.
  • Instance-level soft router: Generates customized fusion ratios for the visual representations from different vision encoders, allowing adaptive aggregation based on the given instructions.

These components help MoVE effectively combine the strengths of various vision encoders and adapt to the task-specific visual perception requirements, mitigating task interference.
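The following PyTorch code is a minimal sketch of this aggregation under stated assumptions, not the paper's implementation: the ADT module is approximated by a learnable-query cross-attention block (the deformable-attention details are omitted), and the module names, dimensions, and the pooled instruction embedding fed to the router are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn


class AdaptiveTransform(nn.Module):
    """Maps a variable-length visual feature sequence to a fixed-length one."""

    def __init__(self, in_dim: int, out_dim: int, num_queries: int = 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, out_dim) * 0.02)
        self.proj = nn.Linear(in_dim, out_dim)
        self.attn = nn.MultiheadAttention(out_dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len_i, in_dim), where seq_len_i differs per encoder
        kv = self.proj(feats)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)  # (batch, num_queries, out_dim)
        return out


class MoVE(nn.Module):
    """Soft mixture over unified-length features from several vision encoders."""

    def __init__(self, encoder_dims, hidden_dim: int, text_dim: int):
        super().__init__()
        self.transforms = nn.ModuleList(
            [AdaptiveTransform(d, hidden_dim) for d in encoder_dims]
        )
        # Instance-level soft router conditioned on a pooled instruction embedding.
        self.router = nn.Linear(text_dim, len(encoder_dims))

    def forward(self, encoder_feats, instruction_emb):
        # encoder_feats: list of (batch, seq_len_i, dim_i), one tensor per vision encoder
        # instruction_emb: (batch, text_dim)
        unified = torch.stack(
            [t(f) for t, f in zip(self.transforms, encoder_feats)], dim=1
        )  # (batch, n_encoders, num_queries, hidden_dim)
        weights = self.router(instruction_emb).softmax(dim=-1)  # (batch, n_encoders)
        return (weights[:, :, None, None] * unified).sum(dim=1)
```

Because the router produces soft fusion weights rather than a hard selection, every encoder still contributes to each instance; the weights simply shift toward the encoders most useful for the given instruction.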

2. How do the experiments demonstrate the effectiveness of ADT and the router in MoVE? The experiments show that:

  • Using ADT to transform the visual features leads to a 4-point average performance improvement compared to simple pooling.
  • Incorporating the instance-level soft router to adaptively aggregate the transformed visual features further boosts performance to a 69.39% average across all tasks, significantly outperforming variants that use a single vision encoder.

The visualization of the routing results also demonstrates that MoVE can adaptively modulate the features from different vision encoders to specialize in various task groups.

[02] Analysis on MoLE

1. What is the key idea behind MoLE and how does it improve upon conventional MoE methods? The key idea behind MoLE is to incorporate several parameter-efficient adapters as experts in the language model, instead of the multiple parallel feed-forward network layers used in conventional MoE methods. This lets MoLE improve multitasking ability with minimal computational overhead.
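To make the contrast with conventional MoE concrete, here is a minimal PyTorch sketch under stated assumptions, not the paper's code: the experts are LoRA-style low-rank adapters attached to a frozen linear layer, the rank, number of experts, and top-2 gating are illustrative, and the routing signal (a pooled instruction embedding, with mean-pooled hidden states as a fallback) is a hypothetical choice.

```python
import torch
import torch.nn as nn


class LoRAAdapter(nn.Module):
    """A low-rank adapter used here as one lightweight 'expert'."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(in_dim, rank, bias=False)
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a zero update to the frozen layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class MoLELinear(nn.Module):
    """A frozen base projection plus a sparsely gated mixture of adapter experts."""

    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained LLM weights stay frozen
        self.experts = nn.ModuleList(
            [LoRAAdapter(base.in_features, base.out_features, rank)
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(base.in_features, num_experts)

    def forward(self, x: torch.Tensor, routing_feat: torch.Tensor = None) -> torch.Tensor:
        # x: (batch, seq, in_dim) hidden states.
        # routing_feat: (batch, in_dim), e.g. a pooled sentence embedding of the
        # instruction (assumption); falls back to mean-pooled hidden states.
        if routing_feat is None:
            routing_feat = x.mean(dim=1)
        scores = self.router(routing_feat)
        topv, topi = scores.topk(k=2, dim=-1)  # sparse gate: keep the top-2 experts
        gate = torch.zeros_like(scores).scatter_(-1, topi, topv.softmax(dim=-1))
        # For simplicity every expert is evaluated and the gate zeroes out the rest;
        # a dispatch that skips unselected experts would save the extra compute.
        delta = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, seq, out_dim)
        delta = (gate[:, :, None, None] * delta).sum(dim=1)
        return self.base(x) + delta
```

Since only the adapters and the router are trained, the added parameters and compute per layer stay small relative to a conventional MoE built from full feed-forward experts.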

2. How do the experiments show the effectiveness of MoLE? The experiments show that:

  • All MoLE variations outperform the baseline model with a single adapter, with the sentence-embedding-based router achieving the highest average performance.
  • The visualization of the routing distributions across different datasets demonstrates that the MoLE experts exhibit clear specialization in distinct task groups, effectively mitigating task interference.

[03] Comparison with state-of-the-art MLLMs

1. How does the performance of MoME compare to other generalist and MoE-based MLLMs on popular VL tasks? The results show that MoME achieves promising outcomes on most datasets compared to other generalist and MoE-based MLLMs, especially on TextCaps, Flickr30K, and IconQA. This indicates that MoME possesses excellent multimodal understanding abilities and can effectively adapt to various vision-language tasks.

[04] Qualitative Analysis

1. How do the visualized examples demonstrate the adaptive behavior of MoVE and MoLE? The visualized examples show that MoVE can adaptively modulate the features from different vision encoders based on the task requirements. For example, it utilizes more DINO features for visual grounding tasks, and more Pix2Struct features for text-intensive document understanding tasks.

Similarly, the MoLE experts exhibit clear specialization in distinct task groups, routing the samples to the appropriate experts to mitigate task interference.
