Summarize by Aili

From 𝑟 to 𝑄^*: Your Language Model is Secretly a Q-Function

🌈 Abstract

Your Language Model is Secretly a Q-Function

1. How does DPO relate to the token-level MDP for large language models?

The paper derives DPO within the token-level MDP setting and shows that DPO implicitly learns a token-level reward function, for which the language model's logits define the optimal Q-function, i.e. the expected total future reward.
It also demonstrates that DPO can represent any dense reward function within the token MDP.
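
The token-level view can be sketched in symbols as follows. The notation is assumed for illustration and follows common maximum-entropy RL conventions: s_t is the prompt plus the tokens generated so far, a_t is the next token, \beta is the KL coefficient, \pi_ref is the reference policy, and V^* is the optimal soft value function.

```latex
% Per-token identity for the optimal policy \pi^* (token transitions are deterministic)
\beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}
  = r(s_t, a_t) + V^*(s_{t+1}) - V^*(s_t),
\qquad V^*(s_T) = 0 \text{ at the terminal state.}

% Summing over a full response telescopes the value terms:
\sum_{t=0}^{T-1} r(s_t, a_t)
  = V^*(s_0) + \sum_{t=0}^{T-1} \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}.
```

Because V^*(s_0) depends only on the prompt, it cancels when two responses to the same prompt are compared, which leaves only the summed per-token log-ratios.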

2. What is the relationship between reward functions and optimal Q-functions in the token MDP?

The paper proves that there is a bijection between reward functions and corresponding optimal Q-functions in the token MDP under mild assumptions.
This means that the logits of an LLM always define the optimal soft Q-function for some reward function in the token MDP.
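
One direction of this correspondence can be checked numerically. The sketch below is a minimal illustration, not the paper's construction: it reads scaled logits as soft Q-values, computes the soft value V(s) = beta * log sum_a exp(Q(s, a) / beta), and confirms that the induced policy equals the usual softmax over the logits. The logits, `beta`, and all function names are hypothetical.

```python
import math

def soft_value(q_values, beta):
    """V(s) = beta * log sum_a exp(Q(s, a) / beta), computed stably."""
    m = max(q / beta for q in q_values)
    return beta * (m + math.log(sum(math.exp(q / beta - m) for q in q_values)))

def policy_from_q(q_values, beta):
    """pi(a | s) = exp((Q(s, a) - V(s)) / beta)."""
    v = soft_value(q_values, beta)
    return [math.exp((q - v) / beta) for q in q_values]

def softmax(logits):
    """Standard softmax, as applied to a language model's next-token logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 0.5, -1.0]              # hypothetical next-token logits at one state
beta = 0.7                             # hypothetical KL coefficient
q_values = [beta * l for l in logits]  # read the scaled logits as soft Q-values

pi_q = policy_from_q(q_values, beta)   # policy induced by the soft Q-values
pi_lm = softmax(logits)                # policy the language model samples from
```

The two distributions agree up to floating-point error, which is the sense in which the logits can be read as a soft Q-function.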

3. How does DPO learn the optimal advantage function in the token MDP?

The paper derives a token-level version of DPO that aligns the implicit reward, induced by the Q-function represented by the language model, with the best estimate of the reward according to the Bradley-Terry preference model.
This shows that DPO can learn any dense reward function in the token-level MDP by representing it as an optimal advantage function.
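
As a rough illustration of how the implicit reward enters the Bradley-Terry objective, the sketch below computes a DPO-style loss for one preference pair from per-token log-probabilities. The function names, the `beta` value, and the numbers are all hypothetical; a real implementation would take the log-probabilities from the policy and reference models.

```python
import math

def implicit_reward(policy_logps, ref_logps, beta):
    """Sum of per-token terms beta * (log pi_theta - log pi_ref) over one response."""
    return beta * sum(p - r for p, r in zip(policy_logps, ref_logps))

def dpo_loss(chosen, rejected, beta=0.1):
    """Negative log-likelihood of the preference under a Bradley-Terry model.

    chosen / rejected: (policy_logps, ref_logps) pairs of per-token log-probs.
    """
    margin = implicit_reward(*chosen, beta) - implicit_reward(*rejected, beta)
    # -log(sigmoid(margin)) written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up per-token log-probs for a two-token chosen and rejected response.
chosen = ([-1.0, -0.5], [-1.2, -0.9])
rejected = ([-2.0, -1.5], [-1.5, -1.0])
loss = dpo_loss(chosen, rejected, beta=0.1)
```

When the policy equals the reference, every log-ratio is zero and the loss reduces to log 2.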

1. Does DPO learn credit assignment?

The paper provides qualitative examples showing that the DPO-trained model is able to identify the tokens corresponding to erroneous statements in summaries, indicating that it can perform credit assignment.
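
A minimal sketch of how such token-level credit could be read off, assuming per-token log-probabilities from the DPO-trained policy and the reference model are available. The tokens, numbers, and function names below are invented purely for illustration and are not taken from the paper's examples.

```python
def token_rewards(tokens, policy_logps, ref_logps, beta=0.1):
    """Pair each token with its implicit reward beta * (log pi_theta - log pi_ref)."""
    return [(tok, beta * (p - r))
            for tok, p, r in zip(tokens, policy_logps, ref_logps)]

def most_penalized(tokens, policy_logps, ref_logps, beta=0.1, k=1):
    """Return the k tokens with the lowest implicit reward."""
    scored = token_rewards(tokens, policy_logps, ref_logps, beta)
    return sorted(scored, key=lambda item: item[1])[:k]

# Invented example: the final token stands in for an erroneous detail.
tokens = ["The", "report", "was", "published", "in", "1850"]
policy_logps = [-0.5, -1.0, -0.3, -1.1, -0.2, -4.0]
ref_logps = [-0.6, -1.1, -0.3, -1.0, -0.2, -1.5]

worst = most_penalized(tokens, policy_logps, ref_logps, beta=0.1, k=1)
```

Tokens whose probability the trained policy has pushed well below the reference receive the most negative implicit reward, which is what a per-token visualization would highlight.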

2. How does likelihood-based search relate to DPO?

The paper shows that likelihood-based search over a DPO model is analogous to searching over a reward function during decoding, as done in contemporaneous work.
Empirically, the paper demonstrates that a simple beam search yields meaningful improvement over the base DPO policy.
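
The following is a generic beam search sketch over sequence log-likelihood, not the paper's decoding setup. The toy next-token table stands in for a DPO policy and is entirely hypothetical; it is chosen so that greedy decoding (beam width 1) misses the most likely sequence.

```python
import math

def beam_search(next_logprobs, start, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width highest log-likelihood prefixes at each step."""
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            # Completed sequences leave the beam; the rest are expanded further.
            (finished if seq[-1] == eos else beams).append((score, seq))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])

def toy_next_logprobs(seq):
    """Hypothetical next-token log-probs, keyed on the last token only."""
    table = {
        "<s>": {"a": math.log(0.6), "b": math.log(0.4)},
        "a": {"x": math.log(0.5), "y": math.log(0.5)},
        "b": {"z": math.log(0.9), "w": math.log(0.1)},
    }
    return table.get(seq[-1], {"<eos>": 0.0})

greedy_score, greedy_seq = beam_search(toy_next_logprobs, "<s>", beam_width=1, max_len=4)
best_score, best_seq = beam_search(toy_next_logprobs, "<s>", beam_width=2, max_len=4)
```

With these toy numbers the greedy path commits to the locally likelier first token, while a beam of width 2 recovers a sequence with higher total likelihood.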

3. What explains the phenomenon of decreasing likelihoods during DPO training?

The paper provides a derivation showing that, under the MaxEnt RL framing, the implicit rewards of both chosen and rejected responses should decrease on average during DPO training.
This follows because the KL divergence between the learned policy and the reference policy is zero at initialization, when the two policies coincide, and necessarily positive once training has moved the policy away from the reference.
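
The core step can be sketched with a standard identity. It is written here under the assumption that responses y are distributed according to the reference policy \pi_ref (for example, when the reference model was fine-tuned on the preference data); the symbols \pi_\theta, \pi_ref, and \beta are notation assumed for illustration.

```latex
% Expected implicit reward of responses drawn from the reference policy
\mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}
  \left[ \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]
  = -\beta \, \mathbb{D}_{\mathrm{KL}}\!\left( \pi_{\mathrm{ref}}(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid x) \right)
  \le 0,
% with equality if and only if \pi_\theta = \pi_{\mathrm{ref}}.
```

At initialization the two policies coincide and the expectation is zero, so any training that changes the policy drives the average implicit reward on such responses below zero.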


Shared by Daniel Chen