LoRA Learns Less and Forgets Less
Abstract
The paper compares the performance of Low-Rank Adaptation (LoRA) and full finetuning on two target domains, programming and mathematics, under both instruction finetuning and continued pretraining data regimes. The results show that in most settings, LoRA substantially underperforms full finetuning. However, LoRA exhibits stronger regularization, better maintaining the base model's performance on tasks outside the target domain. The paper also shows that full finetuning learns perturbations with a rank 10-100X greater than that of typical LoRA configurations, which may explain part of the performance gap. The paper concludes by proposing best practices for finetuning with LoRA.
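For readers less familiar with the method, here is a minimal sketch of the LoRA parametrization the paper studies: the pretrained weight matrix is frozen and only a rank-r update BA is trained, whereas full finetuning updates the weight directly. The class, dimensions, and initialization below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update.

    Sketch of the LoRA parametrization y = W x + (alpha / r) * B A x,
    with W frozen; full finetuning would instead update W itself.
    """

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # base weights stay fixed
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Trainable parameters: r * (d_in + d_out) instead of d_in * d_out.
layer = LoRALinear(nn.Linear(4096, 4096), r=16)
```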
Q&A
[01] Comparison of LoRA and Full Finetuning
1. How does the performance of LoRA compare to full finetuning on code and math tasks? The results show that for code tasks, LoRA substantially underperforms full finetuning, whereas for math tasks LoRA closes more of the gap, though it remains less sample-efficient.
2. What are the key differences in the learning-forgetting tradeoffs between LoRA and full finetuning? LoRA and full finetuning form a similar learning-forgetting tradeoff curve, with LoRA models generally learning less but forgetting less. However, the paper finds cases, especially for code tasks, where LoRA can learn comparably but forget less than full finetuning.
3. Why does LoRA underperform full finetuning? The paper shows that full finetuning finds high-rank weight perturbations, with ranks 10-100X higher than those used in typical LoRA configurations. This suggests that the low-rank assumption underlying LoRA may not hold for challenging domains like code and advanced math.
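To make the rank comparison concrete, one can take the singular values of the finetuning perturbation ΔW = W_finetuned − W_base and count how many are needed to capture most of its spectral mass. The sketch below, including the 90% energy threshold and the random stand-in matrices, is an illustrative assumption rather than the paper's exact measurement.

```python
import torch

def effective_rank(w_base: torch.Tensor, w_finetuned: torch.Tensor,
                   energy: float = 0.90) -> int:
    """Smallest rank whose top singular values capture `energy` of the
    squared spectral mass of the perturbation w_finetuned - w_base."""
    delta = (w_finetuned - w_base).float()
    s = torch.linalg.svdvals(delta)                       # descending singular values
    cumulative = torch.cumsum(s**2, dim=0) / (s**2).sum()
    return int((cumulative < energy).sum().item()) + 1

# Random matrices stand in for real base/finetuned checkpoints here.
w_base = torch.randn(1024, 1024)
w_finetuned = w_base + 0.01 * torch.randn(1024, 1024)     # dense, high-rank update
print(effective_rank(w_base, w_finetuned))
```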
[02] Regularization Properties of LoRA
1. How does LoRA's regularization compare to common techniques like weight decay and dropout? The paper shows that LoRA provides stronger regularization than weight decay and dropout: it better maintains the diversity of generated solutions, preventing the model from collapsing to a limited set of outputs (see the sketch after this list).
2. How does LoRA help maintain performance on the source domain? The paper finds that LoRA better maintains the base model's performance on tasks outside the target domain, compared to full finetuning. This suggests LoRA acts as a stronger regularizer, constraining the finetuned model's behavior to remain closer to the base model.
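For context, the sketch below contrasts how the classical regularizers named above are typically attached to a full-finetuning run (dropout inside the module, weight decay in the optimizer) with LoRA, which regularizes by freezing the base weights and training only a low-rank update. The toy module and hyperparameter values are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

d_model, r = 256, 16

# Full finetuning with classical regularizers: every weight trains, with
# dropout inside the module and weight decay applied by the optimizer.
full_ft = nn.Sequential(nn.Linear(d_model, d_model), nn.Dropout(p=0.1))
full_ft_opt = torch.optim.AdamW(full_ft.parameters(), lr=1e-5, weight_decay=0.1)

# LoRA: the base weight is frozen and only the rank-r factors train, which
# keeps the finetuned model close to the base model by construction.
base = nn.Linear(d_model, d_model)
for p in base.parameters():
    p.requires_grad_(False)
lora_A = nn.Parameter(torch.randn(r, d_model) * 0.01)
lora_B = nn.Parameter(torch.zeros(d_model, r))
lora_opt = torch.optim.AdamW([lora_A, lora_B], lr=1e-4)  # note the higher learning rate
```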
[03] Practical Considerations for LoRA
1. What are the key hyperparameters that affect LoRA's performance? The paper finds that LoRA is highly sensitive to the learning rate, with the best LoRA learning rates being an order of magnitude higher than for full finetuning. The choice of target modules also has a larger impact than the rank hyperparameter.
2. What are the recommended best practices for training with LoRA? The paper recommends using LoRA for instruction finetuning (not continued pretraining), identifying the highest stable learning rate, targeting "All" modules, and choosing a relatively low rank (e.g., 16) based on memory constraints; a configuration sketch follows below.
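As a concrete instance of these recommendations, the sketch below configures a LoRA run with the Hugging Face peft library: rank 16, adapters on all attention and MLP projection matrices (Llama-style module names, assumed here), and a learning rate roughly an order of magnitude above a typical full-finetuning value. The specific model name and numbers are illustrative, not values taken from the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Example base model; the paper's exact checkpoints may differ.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                        # relatively low rank, chosen for the memory budget
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[             # "All" modules: attention and MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Sweep to the highest learning rate that remains stable; LoRA typically
# tolerates values ~10x higher than full finetuning.
learning_rate = 2e-4
```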