# In-Context Learning State Vector with Inner and Momentum Optimization

## Abstract

The paper presents a comprehensive analysis of the compressed vectors derived from transformer models that encapsulate the functionality learned through in-context learning (ICL). The authors introduce the concept of a "state vector" that represents the processing state of ICL stored in the attention activations, and propose two optimization methods, inner optimization and momentum optimization, to progressively enhance the state vector. Additionally, they introduce a divide-and-conquer approach that aggregates state vectors to handle large numbers of examples. Extensive experiments on Llama-2 and GPT-J demonstrate that the proposed methods improve ICL performance across diverse tasks.

## Q&A

### [01] In-Context Learning State Vector with Inner and Momentum Optimization

**1. What is the key motivation behind the paper?**
The paper aims to bridge the gap in understanding the operational mechanisms and optimization strategies of the compressed vectors that encapsulate the functionality learned through in-context learning (ICL) in transformer models.

**2. How do the authors view the compressed ICL vectors in relation to parameters trained via gradient descent?**
The authors draw parallels between the compressed ICL vectors and parameters trained through gradient descent, suggesting that the ICL vectors can be viewed as parameters that are gradually updated through the demonstration examples.

**3. What are the two key optimization methods proposed in the paper?**
The paper proposes two optimization methods to enhance the state vector:

- Inner optimization: Applies uniform averaging to the state vectors extracted from each separate token in the demonstration.
- Momentum optimization: Applies a momentum-based gradient optimization algorithm to the differences between adjacent state vectors, which represent the influence of each demonstration example.
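The two methods can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's exact implementation: `beta` and the SGD-with-momentum update rule are standard choices used here for concreteness, and the state vectors are represented as plain arrays.

```python
import numpy as np

def inner_optimize(token_state_vectors):
    """Inner optimization (sketch): uniformly average the state
    vectors extracted from each separate token of the demonstration."""
    return np.mean(token_state_vectors, axis=0)

def momentum_optimize(state_vectors, beta=0.9):
    """Momentum optimization (sketch): treat the difference between
    adjacent state vectors as the update contributed by one
    demonstration example, and accumulate it with momentum."""
    velocity = np.zeros_like(state_vectors[0])
    state = state_vectors[0].copy()
    for prev, curr in zip(state_vectors[:-1], state_vectors[1:]):
        delta = curr - prev                 # influence of one example
        velocity = beta * velocity + delta  # momentum accumulation
        state = state + velocity            # refined state vector
    return state
```

With `beta=0` the momentum update degenerates to summing the deltas, recovering the last state vector; larger `beta` values amplify directions in which successive examples agree.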

**4. What is the motivation behind the divide-and-conquer aggregation method?**
The divide-and-conquer aggregation method is proposed to address the challenge of handling large numbers of demonstration examples, which can exceed the context length limitations of current language models. The method divides the examples into groups, extracts a group state vector for each group, and then aggregates these group state vectors into a single comprehensive state vector.
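The grouping step can be sketched as follows. The `extract_group_state` callable stands in for the model-side extraction of a group state vector, and uniform averaging as the final merge is an assumption made here for illustration; the paper's aggregation step may differ.

```python
import numpy as np

def aggregate_state_vectors(examples, extract_group_state, group_size=4):
    """Divide-and-conquer aggregation (sketch): split the demonstration
    examples into groups that each fit within the context window,
    extract one group state vector per group, then merge the group
    vectors into a single comprehensive state vector."""
    groups = [examples[i:i + group_size]
              for i in range(0, len(examples), group_size)]
    group_vectors = [extract_group_state(g) for g in groups]
    # Merge step: uniform averaging (an assumption for this sketch).
    return np.mean(group_vectors, axis=0)
```

Because each group is processed independently, the number of usable demonstration examples is no longer bounded by the model's context length.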

### [02] Experimental Results and Analysis

**1. How do the proposed optimization methods perform compared to the baselines?**
The experimental results show that the inner optimization and momentum optimization methods significantly improve the performance of the state vector, outperforming the task vector and function vector baselines in both zero-shot and few-shot settings.

**2. What are the key findings from the layer selection analysis?**
The analysis of layer selection for state vector extraction reveals a dual-phase trend: increasing the number of layers initially improves performance, but beyond a certain point, additional layers tend to introduce noise and degrade performance.

**3. What insights are gained from the qualitative study on the state vector representations?**
The qualitative study using PCA visualization shows that state vectors corresponding to the same position in the demonstration form distinct clusters, suggesting a progressive enhancement in the model's ability to internalize task-specific information. The study also reveals a notable separation between the state vectors from the first example and the rest, indicating that the model learns effectively from even a few examples.
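A PCA projection of this kind can be computed with a short SVD-based sketch (the data shapes and the use of plain NumPy are assumptions; the paper's figures may use a different toolchain):

```python
import numpy as np

def pca_project_2d(state_vectors):
    """Project high-dimensional state vectors onto their first two
    principal components, e.g. for a 2D scatter plot of clusters."""
    X = np.asarray(state_vectors, dtype=float)
    Xc = X - X.mean(axis=0)                      # center the data
    # Rows of Vt are the principal directions, strongest first.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n, 2) coordinates
```

Plotting the resulting 2D coordinates colored by demonstration position is one way to reproduce the clustering behavior described above.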

**4. How does the proposed approach compare to the efficiency of regular ICL?**
The efficiency analysis shows that the proposed optimization methods run at roughly three times the inference speed of regular ICL while retaining 89% of its performance on Llama-2-7B and 78% on GPT-J-6B in the zero-shot setting. In the few-shot setting, the optimized state vector approaches the performance of regular ICL with minimal loss in inference speed.