
From r to Q^*: Your Language Model is Secretly a Q-Function
Abstract
Your Language Model is Secretly a Q-Function
Q&A
[01] Theoretical Insights
1. How does DPO relate to the token-level MDP for large language models?
- The paper derives DPO within the token-level MDP setting, showing that DPO implicitly learns a token-level reward function for which the language model's logits define the optimal Q-function, that is, the expected total future reward.
- The paper demonstrates that DPO is able to flexibly model any possible dense reward function within the token MDP.
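The link between logits and a soft Q-function can be checked numerically. The sketch below is illustrative only: it assumes the standard maximum-entropy relations V(s) = beta * log sum_a exp(Q(s, a) / beta) and pi(a | s) = exp((Q(s, a) - V(s)) / beta), and it assumes Q is identified with beta times the logits. The function names and the toy logits are hypothetical, not taken from the paper.

```python
import math

def soft_value(q_values, beta):
    """V(s) = beta * log sum_a exp(Q(s, a) / beta), computed stably."""
    m = max(q_values)
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in q_values))

def policy_from_q(q_values, beta):
    """pi(a | s) = exp((Q(s, a) - V(s)) / beta)."""
    v = soft_value(q_values, beta)
    return [math.exp((q - v) / beta) for q in q_values]

def softmax(logits):
    """Ordinary softmax over next-token logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

beta = 0.1
logits = [2.0, 0.5, -1.0]               # toy next-token logits (made up)
q_values = [beta * l for l in logits]   # assumption: Q = beta * logits

pi_from_q = policy_from_q(q_values, beta)
pi_softmax = softmax(logits)
```

Under these assumptions the two distributions coincide, which is the sense in which a softmax over logits can be read as the policy induced by a soft Q-function.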
2. What is the relationship between reward functions and optimal Q-functions in the token MDP?
- The paper proves that there is a bijection between reward functions and corresponding optimal Q-functions in the token MDP under mild assumptions.
- This means that an LLM is always the optimal soft Q-function for some reward function in the token MDP.
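In maximum-entropy RL notation, the correspondence can be written out as follows. This is a sketch under the assumptions that transitions are deterministic (the next state is the current state with the chosen token appended), that $\beta$ is the KL coefficient, and that $V^*$ is zero at terminal states. The symbols are the conventional ones rather than a quotation of the paper's equations.

```latex
\begin{aligned}
Q^*(s_t, a_t) &= r(s_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t) + V^*(s_{t+1}), \\
\pi^*(a_t \mid s_t) &= \exp\!\big( (Q^*(s_t, a_t) - V^*(s_t)) / \beta \big), \\
\Rightarrow\quad r(s_t, a_t) &= \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} + V^*(s_t) - V^*(s_{t+1}).
\end{aligned}
```

Summing the last line over a trajectory telescopes the value terms, leaving $V^*(s_0)$ plus the sum of per-token log-ratios.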
3. How does DPO learn the optimal advantage function in the token MDP?
- The paper derives a token-level version of DPO that aligns the implicit reward, induced by the Q-function represented by the language model, with the best estimate of the reward according to the Bradley-Terry preference model.
- This shows that DPO can learn any dense reward function in the token-level MDP by representing it as an optimal advantage function.
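As a rough illustration of how the sequence-level objective decomposes over tokens, the sketch below computes the DPO loss for a single preference pair from per-token log-probabilities. It assumes the usual Bradley-Terry form, -log sigmoid(reward_chosen - reward_rejected), with the implicit reward of a response taken as the sum of beta * log(pi / pi_ref) over its tokens. The function name and the toy numbers are hypothetical.

```python
import math

def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta):
    """DPO loss for one preference pair from per-token log-probabilities.

    logp_w / logp_l: policy log-probs of each token in the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the reference model.
    """
    # Sum of per-token implicit rewards beta * log(pi / pi_ref) for each response.
    reward_w = beta * sum(p - r for p, r in zip(logp_w, ref_logp_w))
    reward_l = beta * sum(p - r for p, r in zip(logp_l, ref_logp_l))
    margin = reward_w - reward_l
    # -log sigmoid(margin), written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: zero margin.
loss_equal = dpo_loss([-1.0, -2.0], [-1.0, -2.0], [-1.5], [-1.5], beta=0.1)
# Policy raises the chosen response and lowers the rejected one (made-up numbers).
loss_better = dpo_loss([-0.5, -1.0], [-1.0, -2.0], [-2.5], [-1.5], beta=0.1)
```

When the policy equals the reference the margin is zero and the loss is log 2; widening the margin in favor of the chosen response lowers it.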
[02] Practical Insights
1. Does DPO learn credit assignment?
- The paper provides qualitative examples showing that the DPO-trained model is able to identify the tokens corresponding to erroneous statements in summaries, indicating that it can perform credit assignment.
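One simple way to inspect credit assignment is to score each token by its implicit reward, beta * (log pi - log pi_ref), and look at the lowest-scoring tokens. The sketch below assumes per-token log-probabilities from the DPO-trained model and the reference model are already available. The helper names, the example tokens, and the numbers are all hypothetical, and this is not necessarily how the paper produced its qualitative examples.

```python
def token_rewards(tokens, logp, ref_logp, beta):
    """Pair each token with its implicit reward beta * (log pi - log pi_ref)."""
    return [(tok, beta * (p - r)) for tok, p, r in zip(tokens, logp, ref_logp)]

def lowest_reward_tokens(tokens, logp, ref_logp, beta, k=1):
    """Return the k tokens with the lowest implicit reward."""
    scored = token_rewards(tokens, logp, ref_logp, beta)
    return sorted(scored, key=lambda pair: pair[1])[:k]

tokens = ["The", "cat", "flew", "home"]   # made-up example
logp = [-1.0, -2.0, -6.0, -1.5]           # log-probs under the trained policy (made up)
ref_logp = [-1.1, -2.1, -3.0, -1.6]       # log-probs under the reference model (made up)

worst = lowest_reward_tokens(tokens, logp, ref_logp, beta=0.1, k=1)
```

In this toy case the token whose probability dropped most relative to the reference is flagged, which mirrors the idea of highlighting tokens tied to erroneous statements.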
2. How does likelihood-based search relate to DPO?
- The paper shows that likelihood search over a DPO model is analogous to searching over a reward function during decoding, as done by contemporary works.
- Empirically, the paper demonstrates that a simple beam search yields meaningful improvement over the base DPO policy.
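To make the likelihood-search idea concrete, here is a minimal beam search over cumulative log-likelihood. It is a sketch, not the paper's decoding setup: `next_logprobs` stands in for a model call, and the lookup-table "model", its tokens, and its probabilities are hypothetical, chosen so that greedy decoding and a wider beam disagree.

```python
import math

def beam_search(next_logprobs, start, beam_width, max_len, eos):
    """Keep the beam_width highest log-likelihood prefixes at each step."""
    beams = [(0.0, [start])]              # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:            # finished sequences carry over unchanged
                candidates.append((score, seq))
                continue
            for tok, lp in next_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0]

def toy_next_logprobs(seq):
    """Hypothetical next-token distribution that depends only on the last token."""
    table = {
        "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
        "a": {"x": math.log(0.55), "y": math.log(0.45)},
        "b": {"x": math.log(0.9), "y": math.log(0.1)},
        "x": {"</s>": 0.0},
        "y": {"</s>": 0.0},
    }
    return table[seq[-1]]

greedy_score, greedy_seq = beam_search(toy_next_logprobs, "<s>", 1, 3, "</s>")
beam_score, beam_seq = beam_search(toy_next_logprobs, "<s>", 2, 3, "</s>")
```

With a beam width of 1 the search commits to the locally best first token, while a width of 2 recovers a sequence with higher total likelihood.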
3. What explains the phenomenon of decreasing likelihoods during DPO training?
- The paper provides a derivation showing that, under the MaxEnt RL framing, the implicit rewards of both chosen and rejected responses should decrease on average during DPO training.
- The reason is that the KL divergence between the learned policy and the reference policy starts at zero when training begins from the reference model and is necessarily positive at convergence.
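The direction of the effect can be seen from a standard identity. The sketch below assumes responses $a$ are sampled from the reference policy (for example, when the preference data was generated by the model used as the reference), which is an assumption made here for illustration.

```latex
\mathbb{E}_{a \sim \pi_{\mathrm{ref}}(\cdot \mid s)}
\left[ \beta \log \frac{\pi_\theta(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)} \right]
= -\beta \, D_{\mathrm{KL}}\!\left( \pi_{\mathrm{ref}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \le 0
```

Equality holds only when $\pi_\theta = \pi_{\mathrm{ref}}$, so once the policy moves away from the reference the average implicit reward on reference-distributed responses becomes negative.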
Shared by Daniel Chen
© 2024 NewMotor Inc.