Summarized by Aili

# From r to Q^*: Your Language Model is Secretly a Q-Function

## Abstract

The paper analyzes Direct Preference Optimization (DPO) in the token-level MDP for large language models. It shows that DPO implicitly learns a token-level reward function, with the language model's logits defining the optimal Q-function, and that a bijection links reward functions to optimal Q-functions in this setting. Practical consequences include token-level credit assignment, a view of likelihood-based search as reward-guided decoding, and an explanation for why response likelihoods decrease during DPO training.

## Q&A

### [01] Theoretical Insights

**1. How does DPO relate to the token-level MDP for large language models?**

- The paper derives DPO in the token-level MDP setting, showing that DPO implicitly learns a token-level reward function for which the language model's logits define the optimal Q-function, i.e. the expected total future reward.
- The paper demonstrates that DPO is able to flexibly model any possible dense reward function within the token MDP.
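The dense, token-level reward implied by this view can be read off directly from the two models' log-probabilities. A minimal sketch, assuming `beta` is the DPO temperature and the log-probability lists are hypothetical values rather than outputs of any real model:

```python
def per_token_rewards(logp_pi, logp_ref, beta=0.1):
    # Implicit per-token reward induced by DPO:
    # r_t = beta * (log pi(a_t | s_t) - log pi_ref(a_t | s_t))
    return [beta * (lp - lr) for lp, lr in zip(logp_pi, logp_ref)]

# Hypothetical per-token log-probs for a 3-token response.
logp_pi = [-0.5, -1.0, -2.0]   # trained (DPO) policy
logp_ref = [-0.7, -1.0, -1.5]  # reference policy
rewards = per_token_rewards(logp_pi, logp_ref, beta=0.1)
```

Tokens the trained policy upweights relative to the reference receive positive implicit reward, so a dense reward signal falls out token by token.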

**2. What is the relationship between reward functions and optimal Q-functions in the token MDP?**

- The paper proves that there is a bijection between reward functions and corresponding optimal Q-functions in the token MDP under mild assumptions.
- This means that an LLM is always the optimal soft Q-function for some reward function in the token MDP.
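A sketch of the bijection under the paper's maximum-entropy framing (temperature $\beta$, no discounting; taking $V^*$ to be zero at terminal states is an assumption here): the soft Bellman equations map a reward to an optimal Q-function,

```latex
V^*(\mathbf{s}_t) = \beta \log \sum_{\mathbf{a}} \exp\!\left(\frac{Q^*(\mathbf{s}_t,\mathbf{a})}{\beta}\right),
\qquad
Q^*(\mathbf{s}_t,\mathbf{a}_t) = r(\mathbf{s}_t,\mathbf{a}_t) + V^*(\mathbf{s}_{t+1}),
```

and the map inverts token-wise as $r(\mathbf{s}_t,\mathbf{a}_t) = Q^*(\mathbf{s}_t,\mathbf{a}_t) - V^*(\mathbf{s}_{t+1})$, so each reward function determines a unique optimal soft Q-function and vice versa.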

**3. How does DPO learn the optimal advantage function in the token MDP?**

- The paper derives a token-level version of DPO that aligns the implicit reward, induced by the Q-function represented by the language model, with the best estimate of the reward according to the Bradley-Terry preference model.
- This shows that DPO can learn any dense reward function in the token-level MDP by representing it as an optimal advantage function.
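Once per-token log-ratios are summed over each response, the token-level objective reduces to the familiar sequence-level Bradley-Terry loss. A minimal sketch in plain Python (the log-probability arguments are hypothetical inputs, not tied to any model API):

```python
import math

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    # Sum per-token log-ratios over the chosen (w) and rejected (l) responses,
    # then apply the Bradley-Terry negative log-likelihood: -log sigmoid(margin).
    margin = beta * ((sum(logp_w) - sum(logp_ref_w))
                     - (sum(logp_l) - sum(logp_ref_l)))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; widening the implicit-reward gap between chosen and rejected responses drives it down.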

### [02] Practical Insights

**1. Does DPO learn credit assignment?**

- The paper provides qualitative examples showing that the DPO-trained model is able to identify the tokens corresponding to erroneous statements in summaries, indicating that it can perform credit assignment.

**2. How does likelihood-based search relate to DPO?**

- The paper shows that likelihood search over a DPO model is analogous to searching over a reward function during decoding, as done by contemporary works.
- Empirically, the paper demonstrates that a simple beam search yields meaningful improvement over the base DPO policy.
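As a sketch of what likelihood search looks like, here is a generic beam search over token log-probabilities; `toy_model` is a hypothetical stand-in for a real language model's next-token distribution:

```python
import math

def beam_search(next_logprobs, start, beam_width=2, steps=3):
    # Each beam is (total log-prob, token sequence); keep the top-k at every step.
    beams = [(0.0, [start])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams[0]  # highest-likelihood sequence found

def toy_model(seq):
    # Hypothetical model: from each token, two continuations with fixed probabilities.
    last = seq[-1]
    return {last + 1: math.log(0.6), last + 2: math.log(0.4)}

best_score, best_seq = beam_search(toy_model, start=0)
```

Under the paper's framing, maximizing summed log-likelihood of the DPO policy is equivalent to searching against its implicit reward, which is why such search can improve on greedy sampling from the base DPO policy.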

**3. What explains the phenomenon of decreasing likelihoods during DPO training?**

- The paper provides a derivation showing that under the MaxEnt RL framing, the implicit rewards of both chosen and rejected responses should decrease on average during DPO training.
- This is due to the KL-divergence term between the learned policy and the reference policy necessarily increasing at convergence.
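A sketch of the argument, under the assumption that preference pairs are sampled from the reference policy $\pi_{\mathrm{ref}}$: the average implicit reward over that data is a negative multiple of a KL divergence,

```latex
\mathbb{E}_{y \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\left[ \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]
= -\beta\, D_{\mathrm{KL}}\!\left( \pi_{\mathrm{ref}}(\cdot \mid x) \,\big\|\, \pi_\theta(\cdot \mid x) \right) \le 0,
```

so as training drives $\pi_\theta$ away from $\pi_{\mathrm{ref}}$, the divergence grows and the average implicit reward (hence likelihood) of both chosen and rejected responses falls.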



© 2024 NewMotor Inc.