Learning to Route Among Specialized Experts for Zero-Shot Generalization
Abstract
The paper proposes a method called Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE) that recycles specialized language models created through parameter-efficient fine-tuning (PEFT) to improve zero-shot generalization. PHATGOOSE learns a per-token and per-module routing strategy to adaptively choose which specialized modules to use, rather than relying on a single best expert. The authors show that PHATGOOSE outperforms prior methods for recycling experts and can even match or exceed the performance of explicit multitask training, despite not requiring simultaneous access to the datasets used to create the specialized models.
Q&A
[01] Learning to Route Among Specialized Experts for Zero-Shot Generalization
1. What is the key problem that the paper aims to address? The paper aims to address the problem of leveraging a large collection of specialized language models created through parameter-efficient fine-tuning (PEFT) to improve the zero-shot generalization capabilities of a base language model.
2. What are the key challenges the paper identifies in this problem setting? The key challenges identified are:
- Determining a way to make the independently trained specialized models function together to improve zero-shot performance
- Doing so with minimal additional compute beyond the initial PEFT training
- Making routing decisions solely based on the input query, without access to the datasets used to train the specialized models
3. How does PHATGOOSE address these challenges? PHATGOOSE addresses these challenges by:
- Introducing a computationally inexpensive step after PEFT training where a per-module gate is trained
- Using the parameters of these gates to perform adaptive per-token and per-module routing during inference
- This allows PHATGOOSE to compose knowledge from multiple specialized models without requiring access to the original training datasets
4. How does PHATGOOSE's approach differ from prior methods for recycling specialized models? Prior methods focused on choosing a single best specialized model based on properties of the input query. In contrast, PHATGOOSE learns to adaptively route to different modules for different tokens, allowing it to combine capabilities from multiple experts.
5. What are the key findings from the experiments evaluating PHATGOOSE? The key findings are:
- PHATGOOSE outperforms prior methods for recycling specialized experts on zero-shot benchmarks
- In some cases, PHATGOOSE even outperforms explicit multitask training, despite not requiring simultaneous access to the training datasets
- Qualitative analysis suggests PHATGOOSE's success comes from its ability to effectively combine knowledge from diverse specialized models, rather than just aligning with the single best expert
[02] Decentralized development of zero-shot models
1. What is the problem setting and key assumptions the paper makes? The paper considers a decentralized setting where individual contributors train specialized PEFT-based models on their own datasets, without access to each other's data. The goal is to use this collection of specialized models to improve zero-shot generalization on unseen tasks.
2. What are the key constraints and requirements the paper identifies for this problem setting? The key constraints are:
- Avoid placing additional burdens on contributors beyond training their PEFT-based model
- Do not require simultaneous access to the datasets used to train the specialized models
- Aim to improve zero-shot performance on unseen tasks, not just held-in tasks
3. How do these constraints differ from prior work on recycling specialized models? Prior work often assumed access to the training datasets of the specialized models, or focused on few-shot learning rather than zero-shot generalization. The combination of a fully decentralized setting and a zero-shot focus distinguishes this paper from that prior work.
[03] Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE)
1. How does PHATGOOSE work at a high level? PHATGOOSE works by:
- Having contributors train a task-specific gate in addition to their PEFT module
- Combining the gate parameters from all contributors to perform adaptive per-token and per-module routing during inference
2. What is the motivation behind the gate training approach? The motivation is that the gate vector for a PEFT module will learn to associate with characteristics of activations relevant to the task the module was trained on. Combining gates from multiple modules can then enable effective routing based on the relevance of each module to the input.
3. Why does PHATGOOSE train the gates separately from the PEFT modules? Training the gates separately, with the PEFT modules frozen, prevents the rest of the model from co-adapting with the gates. This ensures the gates are learning to route effectively based on the inherent properties of the PEFT modules.
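As a concrete illustration, here is a minimal PyTorch-style sketch of what this post-hoc gate-training step could look like for a single LoRA-adapted linear layer. The class name, tensor shapes, and the sigmoid form of the gate are assumptions made for this sketch rather than verbatim details from the paper; the key point it illustrates is that only the gate vector receives gradients while the base weights and LoRA factors stay frozen.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Hypothetical sketch: a frozen base linear layer plus a frozen LoRA
    update, with a single trainable gate vector added post hoc."""

    def __init__(self, base: nn.Linear, lora_A: torch.Tensor, lora_B: torch.Tensor):
        super().__init__()
        self.base = base                                          # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_A = nn.Parameter(lora_A, requires_grad=False)   # (d_in, r), frozen
        self.lora_B = nn.Parameter(lora_B, requires_grad=False)   # (r, d_out), frozen
        # The only parameter updated during the extra gate-training step.
        self.gate = nn.Parameter(torch.zeros(base.in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(x @ self.gate)             # per-token scalar gate in [0, 1]
        lora_out = (x @ self.lora_A) @ self.lora_B   # frozen PEFT update
        return self.base(x) + g.unsqueeze(-1) * lora_out
```

Because everything except `gate` is frozen, the gate vector must learn which token activations the expert is relevant to, rather than the rest of the model co-adapting around it.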
4. How does PHATGOOSE perform routing during inference? During inference, PHATGOOSE computes the affinity between each token's activation and the routing (gate) vector of each PEFT module. It then selects the top-k modules with the highest affinities and combines their outputs, weighted by those affinities, to produce the final output.
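A rough sketch of this inference-time routing for a single module position follows, assuming the per-expert gate vectors have been stacked into one matrix. The specific normalization, the value of k, and the softmax weighting over the selected affinities are assumptions of this sketch rather than confirmed details of the paper's implementation.

```python
import torch
import torch.nn.functional as F

def route_among_experts(x, gate_vectors, expert_outputs, k=2):
    """Hypothetical sketch of per-token top-k routing over recycled experts.

    x:               (n_tokens, d_in)             activations entering one module position
    gate_vectors:    (n_experts, d_in)            one trained gate vector per expert
    expert_outputs:  (n_experts, n_tokens, d_out) each expert module's output for x
    """
    # Normalize so affinities are comparable across independently trained gates.
    xn = F.normalize(x, dim=-1)
    gn = F.normalize(gate_vectors, dim=-1)

    affinity = xn @ gn.T                          # (n_tokens, n_experts)
    top_vals, top_idx = affinity.topk(k, dim=-1)  # keep the k most relevant experts per token
    weights = torch.softmax(top_vals, dim=-1)     # (n_tokens, k)

    # Gather each token's selected expert outputs and mix them by their weights.
    token_idx = torch.arange(x.shape[0]).unsqueeze(-1)    # (n_tokens, 1)
    selected = expert_outputs[top_idx, token_idx]         # (n_tokens, k, d_out)
    return (weights.unsqueeze(-1) * selected).sum(dim=1)  # (n_tokens, d_out)
```

For brevity the sketch materializes every expert's output; a real implementation would only run the k experts each token actually selects.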
[04] Experiments
1. What are the key datasets and benchmarks used in the experiments? The experiments use two sets of specialized PEFT models:
- T0 Held-In: 36 models trained on the same datasets as the T0 model
- FLAN: 166 models trained on a larger collection of datasets from the FLAN benchmark
Zero-shot generalization is evaluated on the T0 Held-Out tasks, BIG-Bench Hard, and BIG-Bench Lite.
2. What are the main baselines considered, and how does PHATGOOSE compare to them? The main baselines are:
- Retrieval: Choosing a single expert based on query embedding similarity
- Merged Experts: Averaging the parameters of all expert modules
- Multitask: Explicit multitask training, which violates the problem setting by requiring simultaneous access to all of the training datasets
PHATGOOSE significantly outperforms the retrieval and merged experts baselines, and in some cases even matches or exceeds the performance of multitask training; the two recycling baselines are sketched below.
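For concreteness, here is a minimal sketch of the two recycling baselines under simplifying assumptions: the Retrieval baseline is reduced to a cosine-similarity nearest-neighbor lookup over per-expert embeddings (how those embeddings are produced is left abstract), and Merged Experts to a uniform parameter average over experts that share the same architecture and parameter names.

```python
import torch
import torch.nn.functional as F

def retrieve_expert(query_emb: torch.Tensor, expert_embs: torch.Tensor) -> int:
    """Retrieval baseline (sketch): pick the single expert whose embedding is
    most similar to the query embedding under cosine similarity."""
    sims = F.normalize(expert_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    return int(sims.argmax())

def merge_experts(expert_state_dicts: list) -> dict:
    """Merged Experts baseline (sketch): uniformly average the parameters of
    all expert modules, producing a single merged module."""
    return {
        name: torch.stack([sd[name] for sd in expert_state_dicts]).mean(dim=0)
        for name in expert_state_dicts[0]
    }
```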
3. What insights does the qualitative analysis provide about PHATGOOSE's routing strategy? The qualitative analysis suggests that PHATGOOSE's success comes from its ability to effectively combine knowledge from diverse specialized models, rather than just aligning with the single best expert. It finds cases where PHATGOOSE outperforms the Oracle routing strategy by using a more diverse set of modules.