
From ๐‘Ÿ to ๐‘„^*: Your Language Model is Secretly a Q-Function

🌈 Abstract

Your Language Model is Secretly a Q-Function

🙋 Q&A

[01] Theoretical Insights

1. How does DPO relate to the token-level MDP for large language models?

  • The paper derives DPO in the token-level MDP setting, showing that DPO implicitly learns a token-level reward function for which the language model's logits define the optimal Q-function, i.e., the expected total future reward (see the sketch after this list).
  • The paper also demonstrates that DPO can represent any dense reward function within the token MDP.
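
The relation between logits and the optimal Q-function referred to above is the standard maximum-entropy / KL-regularized RL identity. The following is a minimal sketch of that relation, with a KL coefficient β; the notation may differ slightly from the paper's.

```latex
% MaxEnt-RL relation between the optimal policy and the optimal soft Q-function,
% with KL/entropy coefficient beta (notation may differ from the paper's):
\begin{align*}
\pi^*(a_t \mid s_t) &= \exp\!\big( (Q^*(s_t, a_t) - V^*(s_t)) / \beta \big), \\
V^*(s_t) &= \beta \log \sum_{a} \exp\!\big( Q^*(s_t, a) / \beta \big), \\
\Rightarrow\ \beta \log \pi^*(a_t \mid s_t) &= Q^*(s_t, a_t) - V^*(s_t).
\end{align*}
```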

2. What is the relationship between reward functions and optimal Q-functions in the token MDP?

  • The paper proves that there is a bijection between reward functions and corresponding optimal Q-functions in the token MDP under mild assumptions.
  • This means that an LLM is always the optimal soft Q-function for some reward function in the token MDP (one direction of this mapping is sketched below).
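
As a rough sketch of what one direction of this bijection looks like, assume the token MDP's transitions deterministically append the chosen token, s_{t+1} = (s_t, a_t), with V*(s_T) = 0 at terminal states; these assumptions on notation are ours, not a quote from the paper.

```latex
% Inverse soft Bellman direction of the bijection, under the assumptions stated above;
% the reverse direction is the usual soft Bellman backup.
\begin{align*}
r(s_t, a_t) &= Q^*(s_t, a_t) - V^*(s_{t+1}), \qquad
V^*(s) = \beta \log \sum_{a} \exp\!\big( Q^*(s, a) / \beta \big).
\end{align*}
```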

3. How does DPO learn the optimal advantage function in the token MDP?

  • The paper derives a token-level version of DPO that aligns the implicit reward, induced by the Q-function the language model represents, with the best estimate of the reward under the Bradley-Terry preference model.
  • This shows that DPO can learn any dense reward function in the token-level MDP by representing it as an optimal advantage function; a minimal loss sketch follows this list.
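
To make the token-level view concrete, here is a minimal sketch of a DPO-style loss written in terms of per-token log-probabilities: the per-token implicit rewards β(log π_θ − log π_ref) summed over a response telescope into the usual sequence-level DPO logits, which are scored with a Bradley-Terry logistic loss. The function name and the omission of padding/masking are simplifications, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss_from_token_logps(policy_logps_chosen, policy_logps_rejected,
                              ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Sketch of a DPO-style loss built from per-token log-probs.

    Each tensor has shape (batch, num_tokens) and holds per-token log-probs
    of the response tokens under the trainable policy or the frozen
    reference model. Padding/masking is omitted for brevity.
    """
    # Summed per-token implicit rewards beta * (log pi_theta - log pi_ref)
    # telescope into the sequence-level DPO logits.
    chosen = beta * (policy_logps_chosen - ref_logps_chosen).sum(dim=-1)
    rejected = beta * (policy_logps_rejected - ref_logps_rejected).sum(dim=-1)
    # Bradley-Terry: maximize the probability that the chosen response wins.
    return -F.logsigmoid(chosen - rejected).mean()
```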

[02] Practical Insights

1. Does DPO learn credit assignment?

  • The paper provides qualitative examples showing that the DPO-trained model identifies the tokens corresponding to erroneous statements in summaries, indicating that it performs credit assignment; a sketch of how such per-token scores can be computed follows.
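
One way to reproduce this kind of inspection is to score each token of a response by its implicit reward, β(log π_θ − log π_ref). The sketch below assumes HuggingFace-style causal LMs and omits prompt masking; `per_token_implicit_rewards` is a hypothetical helper, not code from the paper.

```python
import torch

@torch.no_grad()
def per_token_implicit_rewards(policy, ref, input_ids, beta=0.1):
    """Sketch: score each token by beta * (log pi_theta - log pi_ref).

    `policy` and `ref` are assumed to be HuggingFace-style causal LMs whose
    forward pass returns `.logits` of shape (batch, seq_len, vocab);
    prompt masking and padding are omitted for brevity.
    """
    def token_logps(model):
        logits = model(input_ids).logits[:, :-1]          # predict token t+1 from the prefix up to t
        logps = torch.log_softmax(logits, dim=-1)
        return logps.gather(-1, input_ids[:, 1:, None]).squeeze(-1)  # (batch, seq_len - 1)

    # Strongly negative values mark tokens the DPO policy penalizes relative to the reference.
    return beta * (token_logps(policy) - token_logps(ref))
```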

2. How does likelihood-based search relate to DPO?

  • The paper shows that likelihood search over a DPO model is analogous to searching over a reward function during decoding, as done in contemporary work.
  • Empirically, the paper demonstrates that a simple beam search yields a meaningful improvement over the base DPO policy; a decoding sketch follows this list.
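
The connection is that maximizing the policy's sequence likelihood is closely related to maximizing the summed implicit reward β Σ_t log(π_θ/π_ref). Below is a minimal decoding sketch using HuggingFace `generate` with beam search; the checkpoint name and prompt are placeholders, not artifacts from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Likelihood (beam) search over a DPO-trained policy; "my-org/dpo-policy"
# is a placeholder checkpoint name.
tokenizer = AutoTokenizer.from_pretrained("my-org/dpo-policy")
model = AutoModelForCausalLM.from_pretrained("my-org/dpo-policy")

inputs = tokenizer("Summarize the article: ...", return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```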

3. What explains the phenomenon of decreasing likelihoods during DPO training?

  • The paper provides a derivation showing that, under the MaxEnt RL framing, the implicit rewards of both chosen and rejected responses should decrease on average during DPO training.
  • This follows because the KL-divergence term between the learned policy and the reference policy necessarily increases as training converges; a one-line sketch of the argument is given below.
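
As a rough sketch of the argument, assume the training responses are (approximately) distributed according to the reference policy π_ref; their average implicit reward is then a negative KL term, so it falls as π_θ moves away from π_ref. This is a paraphrase under that assumption, not the paper's exact derivation.

```latex
% Average implicit reward of responses drawn from the reference policy:
\begin{align*}
\mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
  \left[ \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]
  = -\beta\, \mathrm{KL}\!\big( \pi_{\mathrm{ref}}(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x) \big) \le 0 .
\end{align*}
```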

