Theia: Distilling Diverse Vision Foundation Models for Robot Learning
Abstract
The article introduces Theia, a vision foundation model that distills multiple off-the-shelf vision foundation models (VFMs) trained on diverse visual tasks to enhance downstream robot learning. Theia's rich visual representations encode diverse visual knowledge, leading to improved robot learning performance compared to individual VFMs and prior robot learning models.
Q&A
[01] Introduction
1. What is the key motivation behind the work on Theia? The key motivation is that vision-based robot policy learning requires strong and diverse visual comprehension, but existing off-the-shelf VFMs usually underperform relative to visual representation models tailored for specific robot learning tasks. The authors aim to combine multiple large VFMs into a single, smaller model for robot learning that leverages their diverse visual understanding abilities.
2. What are the key contributions of this work? The key contributions are:
- Introduction of Theia, a model that combines knowledge from multiple VFMs into a single, smaller model using knowledge distillation with low training cost.
- Extensive experiments demonstrating that Theia outperforms its teacher models and prior robot learning models using less training data and smaller model sizes.
- Identification of key factors relevant to robot learning performance, such as model size, the use of spatial tokens, and the entropy of representation norms.
[02] Related Work
1. What are the different approaches for learning visual representations for robot learning? The related work discusses several approaches for learning visual representations for robot learning, including:
- Pre-training, joint-learning, or a combination of both using trainable or frozen visual representations.
- Using off-the-shelf visual encoders as well as training visual representations from scratch.
- Exploring different training objectives and auxiliary tasks such as data augmentation, prediction tasks, contrastive learning, and self-supervised learning.
- Introducing inductive biases and constraints to handle invariance and equivariance in visual observations.
2. What are vision foundation models (VFMs) and how have they been used in prior work? Vision foundation models are models trained on large-scale data that exhibit strong task-specific performance and transferability to unseen domains and new tasks. Prior work has explored distilling individual VFMs into more compact models, as well as combining VFMs with language models to perform tasks like object detection.
[03] Method
1. What is the overall design of the Theia framework? Theia consists of a visual encoder (backbone) and a set of feature translators for distillation. The visual encoder produces a rich "Theia-representation" that is used for downstream robot learning tasks. The feature translators map the Theia-representation to the representations of the teacher VFMs to enable distillation.
2. How does Theia distill knowledge from multiple VFMs? Theia distills the knowledge of multiple VFMs (CLIP, DINOv2, ViT, SAM, Depth-Anything) into a smaller model by using a combination of cosine and smooth-L1 losses to match the outputs of the feature translators with the corresponding teacher VFM representations (a minimal sketch of this setup appears after this section).
3. What design choices were made for the Theia-representation? The authors chose to focus on spatial tokens rather than the [CLS] token, since spatially dense representations are important for diverse visual understanding. They also incorporate "register tokens", which help the model learn higher-quality representations.
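Below is a minimal PyTorch sketch of this distillation setup, assuming a ViT-style backbone whose spatial (patch) tokens feed one lightweight feature translator per teacher VFM; the class names, translator architecture, and loss weighting are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTranslator(nn.Module):
    """Maps the shared Theia-representation into one teacher VFM's feature space.
    A small MLP head is assumed here for illustration."""
    def __init__(self, dim, teacher_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, teacher_dim)
        )

    def forward(self, x):
        return self.proj(x)

class TheiaSketch(nn.Module):
    """Shared visual encoder plus one feature translator per teacher VFM."""
    def __init__(self, encoder, dim, teacher_dims):
        super().__init__()
        self.encoder = encoder  # e.g. a small ViT returning [B, 1 + N, dim] tokens
        self.translators = nn.ModuleDict(
            {name: FeatureTranslator(dim, d) for name, d in teacher_dims.items()}
        )

    def forward(self, images):
        tokens = self.encoder(images)      # [B, 1 + N, dim]: [CLS] + spatial tokens
        spatial = tokens[:, 1:, :]         # keep spatial tokens only; drop [CLS]
        preds = {name: t(spatial) for name, t in self.translators.items()}
        return spatial, preds              # Theia-representation + per-teacher predictions

def distill_loss(pred, target, alpha=1.0):
    """Combined cosine + smooth-L1 loss between translated and teacher features."""
    cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    l1 = F.smooth_l1_loss(pred, target)
    return cos + alpha * l1
```

During training, a reasonable loop would sum `distill_loss` over all teachers for each batch and update only the encoder and translators, using frozen teacher VFM outputs as targets; at robot-learning time, only the encoder's spatial tokens (the Theia-representation) are used.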
[04] Experiments
1. How does Theia perform compared to baseline models on the CortexBench simulation tasks? Theia outperforms all evaluated baseline models, including prior robot learning representations (R3M, VIP, MVP, VC-1) and agglomerative vision models (RADIO, E-RADIO). Theia models scale effectively from tiny to base sizes, with Theia-S and Theia-B being the only models to exceed a score of 80 on the MuJoCo subset of CortexBench.
2. What insights are gained from the ablation studies and analysis of visual representations? The ablation studies and analysis reveal several insights:
- Spatial tokens are more effective than the [CLS] token for Transformer-based models in robot learning.
- Model size scaling generally improves performance, with the effect more pronounced when using the [CLS] token.
- The entropy of the feature norm distribution correlates strongly with improved robot learning performance, suggesting that representations with higher feature diversity are beneficial.
3. How does Theia perform on real-world robot learning tasks compared to baselines? On the real-world robot tasks (Door Opening, Pick-and-Place, Toy-Microwave Cooking, Drawer Opening), Theia-B achieves the highest success rates on most tasks, demonstrating the effectiveness of its visual representations with both conventional and diffusion-based policy heads, and with the visual representation either frozen or fine-tuned.
[05] What Makes Visual Representations Good for Robot Learning?
1. What key factors were identified as relating to robot learning performance? The analysis identified three key factors relating to robot learning performance:
- Model size: Larger models generally perform better.
- Use of spatial tokens: Spatial tokens are more effective than [CLS] tokens.
- Entropy of representation norms: Higher entropy in feature norm distributions correlates with improved robot learning performance.
2. How was the correlation between feature norm entropy and robot learning performance observed? The authors found a strong correlation (R=0.943) between feature norm entropy and robot learning performance among regular (non-distilled) models, and a moderate correlation (R=0.638) among distilled models. They hypothesize that spatial-token representations with higher entropy (greater feature diversity) encode more information that aids policy learning.
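As a complement, here is a small PyTorch sketch of how such a feature-norm entropy could be computed from a model's spatial tokens; the histogram binning and per-image normalization are illustrative assumptions rather than the paper's exact measurement procedure.

```python
import torch

def feature_norm_entropy(spatial_tokens, num_bins=64):
    """Shannon entropy (in nats) of the distribution of per-token feature norms.

    spatial_tokens: tensor of shape [N, D] holding the spatial token
    representations of one image (or a batch flattened to [B * N, D]).
    """
    norms = spatial_tokens.norm(dim=-1)        # [N] per-token L2 norms
    hist = torch.histc(norms, bins=num_bins,
                       min=norms.min().item(), max=norms.max().item())
    p = hist / hist.sum()                      # empirical probability per bin
    p = p[p > 0]                               # drop empty bins (0 * log 0 := 0)
    return float(-(p * p.log()).sum())
```

Under this sketch, a representation whose token norms pile up in a few bins yields low entropy, while norms spread more evenly across bins yield high entropy, matching the paper's notion of feature diversity.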