Elysium: Exploring Object-level Perception in Videos via MLLM
Abstract
The paper introduces Elysium, an end-to-end trainable multi-modal large language model (MLLM) designed to handle object-level perception tasks in both images and videos. The key contributions are:
- Introducing two new tasks, Referring Single Object Tracking (RSOT) and Video Referring Expression Generation (Video-REG), to bridge the gap between language and tracking in videos.
- Constructing a large-scale dataset called ElysiumTrack-1M to support tasks such as SOT, RSOT, and Video-REG.
- Proposing a visual token compression network called T-Selector to balance performance against visual token consumption, enabling MLLMs to process a larger number of video frames.
- Demonstrating the effectiveness of Elysium on a range of object-level perception tasks through extensive experiments.
Q&A
[01] Introduction
1. What are the key challenges in applying multi-modal large language models (MLLMs) to video-related tasks? The key challenges are:
- Extensive pretraining on large-scale video datasets is required to equip MLLMs with the capability to perceive objects across multiple frames and understand inter-frame relationships.
- Processing a large number of frames within the context window of Large Language Models (LLMs) can impose a significant computational burden.
2. How does the paper classify video tasks based on the granularity they address? The paper classifies video tasks into three categories:
- Video-level tasks (e.g., VideoQA, Video Caption)
- Frame-level tasks (e.g., Video Grounding, Dense Video Captioning, Video Highlight Detection)
- Object-level tasks (e.g., Single Object Tracking, Multi-Object Tracking, Video Object Segmentation)
3. What are the key challenges in handling object-level tasks in videos using MLLMs? The key challenges are:
- Differentiating and locating objects in each frame at a higher granularity while ensuring temporal consistency.
- Minimizing the use of visual tokens to achieve a larger context window.
- Limited availability of large-scale training data for object-level tasks in videos.
[02] Constructing the ElysiumTrack-1M Dataset
1. What are the two new tasks introduced in the paper? The two new tasks, illustrated in the sketch after this list, are:
- Referring Single Object Tracking (RSOT): Identifying and locating a specific object throughout an entire video given a language expression.
- Video Referring Expression Generation (Video-REG): Predicting a description of an object given its coordinates in any frame of a video.
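To make the two task directions concrete, here is a hypothetical input/output pair for each. The prompt wording, the normalized [x1, y1, x2, y2] box encoding, and the frame keys are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical examples only: prompt wording and box encoding are assumptions.

rsot_example = {
    # RSOT: a language expression in, one box per frame out.
    "input": "Track the object described by: 'the brown dog chasing a ball'.",
    "output": {
        "frame_0": [0.12, 0.40, 0.35, 0.78],   # normalized [x1, y1, x2, y2]
        "frame_1": [0.15, 0.41, 0.38, 0.79],
        # ... one box for every remaining frame
    },
}

video_reg_example = {
    # Video-REG: a box in one frame in, a referring expression out.
    "input": {"frame_7": [0.15, 0.41, 0.38, 0.79]},
    "output": "the brown dog chasing a ball",
}
```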
2. How was the ElysiumTrack-1M dataset constructed? The dataset construction pipeline consists of two steps (sketched below):
- Generating raw noun-chunk-bounding-box pairs from video captions and object detection.
- Extending the raw pairs to noun-chunk-trajectory pairs using a tracking model and filtering out inaccurate tracking instances.
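A minimal sketch of that two-step pipeline, assuming the caption parser, detector, tracker, and filtering rule are supplied as callables; all function names here are hypothetical placeholders rather than the paper's actual tooling.

```python
def build_elysiumtrack_pairs(video, caption,
                             extract_noun_chunks, detect_boxes,
                             run_tracker, is_reliable):
    """Hypothetical two-step pipeline: noun-chunk/box pairs -> trajectories."""
    first_frame = video[0]

    # Step 1: pair noun chunks from the caption with detected boxes.
    pairs = []
    for chunk in extract_noun_chunks(caption):
        box = detect_boxes(first_frame, query=chunk)
        if box is not None:
            pairs.append((chunk, box))

    # Step 2: extend each box into a trajectory and filter unreliable tracks.
    dataset = []
    for chunk, box in pairs:
        trajectory = run_tracker(video, init_box=box)   # one box per frame
        if is_reliable(trajectory):                     # drop inaccurate tracking instances
            dataset.append({"expression": chunk, "trajectory": trajectory})
    return dataset
```

The `is_reliable` check stands in for the paper's filtering of inaccurate tracking instances; the concrete criterion is not reproduced here.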
3. What are the key statistics and characteristics of the ElysiumTrack-1M dataset?
- The dataset contains 1.27 million noun-chunk-trajectory pairs, which is significantly larger than existing tracking datasets.
- Each trajectory is accompanied by an expression that refers to the corresponding object.
- The dataset is split into a training set of 1.27 million videos and an evaluation set of 500 videos.
[03] Elysium
1. What are the key components of the Elysium architecture? The key components, wired together as sketched after this list, are:
- CLIP-ViT-L as the visual encoder
- Vicuna as the large language model (LLM)
- A specially designed token compressor called T-Selector to connect the visual encoder and LLM
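A rough sketch of how the three components could be wired together for a short clip. The module interfaces, and feeding the LLM through `inputs_embeds`, are assumptions about a typical MLLM setup, not the paper's exact implementation.

```python
import torch

def elysium_forward(frames, text_embeds, clip_vit, t_selector, vicuna):
    """Sketch of the per-clip data flow; module interfaces are assumptions.

    frames:      (T, 3, H, W) video frames
    text_embeds: (L, D_llm)   embedded prompt tokens
    """
    visual_tokens = []
    for frame in frames:                        # encode each frame independently
        patches = clip_vit(frame.unsqueeze(0))  # (1, N_patches, D_vit)
        compact = t_selector(patches)           # (1, K, D_llm), K << N_patches
        visual_tokens.append(compact.squeeze(0))

    # Concatenate the compressed visual tokens with the text prompt and let the
    # LLM decode boxes or referring expressions as ordinary text.
    llm_inputs = torch.cat(visual_tokens + [text_embeds], dim=0)
    return vicuna(inputs_embeds=llm_inputs.unsqueeze(0))
```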
2. How does the T-Selector work? The T-Selector aims to strike a balance between visual token count and performance (see the sketch after this list). It comprises:
- A gating operation conducted by an MLP layer and a softmax layer to determine which tokens to select
- An MLP layer that transforms the hidden dimension to match that of the LLM
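A minimal PyTorch sketch of a gating-based compressor in the spirit of those two bullets; the layer sizes, the hard top-k selection, and the default `keep_k=32` are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TSelectorSketch(nn.Module):
    """Gating-based visual token compressor (illustrative, not the paper's code)."""

    def __init__(self, vit_dim=1024, llm_dim=4096, keep_k=32):
        super().__init__()
        self.keep_k = keep_k
        # Gating: an MLP plus softmax scores one weight per visual token.
        self.gate = nn.Sequential(nn.Linear(vit_dim, vit_dim), nn.GELU(),
                                  nn.Linear(vit_dim, 1))
        # Projection: an MLP that maps ViT features to the LLM hidden size.
        self.proj = nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, tokens):                       # tokens: (B, N, vit_dim)
        scores = self.gate(tokens).squeeze(-1)       # (B, N)
        weights = scores.softmax(dim=-1)             # normalized token weights
        topk = weights.topk(self.keep_k, dim=-1)     # keep the K strongest tokens
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        selected = tokens.gather(1, idx)             # (B, K, vit_dim)
        # Reweight by the gate so selection stays differentiable w.r.t. the gate.
        selected = selected * topk.values.unsqueeze(-1)
        return self.proj(selected)                   # (B, K, llm_dim)
```

With a setting like `keep_k=32`, each frame contributes only a few dozen visual tokens to the LLM instead of the several hundred patch tokens produced by CLIP-ViT-L, which is what allows many more frames to fit in the context window.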
3. What are the training setups for Elysium? The training process involves two stages (outlined in the sketch below):
- Pretraining on large-scale image data, using a two-step process that first initializes the T-Selector and then trains Elysium end-to-end.
- Finetuning on high-quality data, a mixture of image and video datasets, with a focus on optimizing performance on object-level tasks.
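As a compact summary, the two stages could be captured in a configuration like the following; every concrete value (which modules are trained or frozen, the exact data mixture) is a placeholder rather than the paper's recipe.

```python
# Placeholder outline of the two-stage schedule; concrete settings are assumptions.
TRAINING_STAGES = [
    {
        "name": "pretrain",
        "data": "large-scale image data",
        "steps": [
            {"train": ["t_selector"], "freeze": ["clip_vit", "vicuna"]},          # step 1: initialize the compressor
            {"train": ["t_selector", "clip_vit", "vicuna"], "freeze": []},         # step 2: end-to-end training
        ],
    },
    {
        "name": "finetune",
        "data": "high-quality image + video mixture (incl. ElysiumTrack-1M)",
        "focus": ["SOT", "RSOT", "Video-REG", "other object-level tasks"],
    },
]
```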
[04] Experiments
1. How does Elysium perform on image grounding tasks? Elysium achieves state-of-the-art performance on commonly used image grounding datasets like RefCOCO, RefCOCO+, and RefCOCOg, despite employing a visual token compression method.
2. How does Elysium perform on video question-answering tasks? Elysium achieves state-of-the-art performance on multiple video question-answering datasets, including MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA, in a zero-shot setting.
3. How does Elysium perform on single object tracking tasks? Elysium demonstrates performance comparable to baseline methods on various single object tracking datasets, even in a zero-shot setting, although its performance degrades on datasets containing small objects.
4. How does Elysium perform on the new tasks introduced in the dataset, RSOT and Video-REG? Elysium outperforms the baseline method MiniGPT-v2 on both the RSOT and Video-REG tasks, showcasing its ability to capture temporal awareness and coherence.