State-Free Inference of State-Space Models: The Transfer Function Approach
Abstract
The paper proposes a state-free inference algorithm for state-space models (SSMs) using a rational transfer function (RTF) representation. The key contributions are:
- RTF provides a complete representation of linear time-invariant SSMs, including those with dense matrices, unlike previous diagonal or low-rank SSM representations.
- The proposed parallel inference algorithm for RTF has state-free space and time complexities, in contrast to previous scan-based or Cauchy/Vandermonde-based algorithms.
- Experiments show RTF achieves state-of-the-art performance on the Long Range Arena benchmark and improved perplexity on language modeling compared to other attention-free approaches.
Q&A
[01] State-Free Inference of State-Space Models
1. What are the key limitations of current state-space models (SSMs) that this paper aims to address?
- Many current SSM algorithms employ a modal (diagonal) representation, which can limit the model's expressive capacity.
- Scan-based parallel inference algorithms incur considerable memory costs at large state sizes.
- Algorithms like S4 and S4D that use fast Cauchy and Vandermonde matrix-vector products scale suboptimally.
2. How does the proposed rational transfer function (RTF) representation address these limitations?
- RTF encompasses the functional space of any linear time-invariant SSM, including those with dense matrices, unlike diagonal SSMs.
- The proposed parallel inference algorithm for RTF has state-free space and time complexities, avoiding the memory and computational bottlenecks of previous approaches.
- RTF solely relies on the widely optimized Fast Fourier Transform (FFT) algorithm for efficient parallel inference.
3. What are the key properties of the RTF representation?
- RTF is coordinate invariant, meaning that different state-space realizations of the same system will have the same RTF parameters.
- Any impulse response realizable by a dense SSM can equally be described by a rational transfer function with far fewer parameters (roughly 2N coefficients versus on the order of N² entries for a dense realization); both points are illustrated in the sketch below.
- The partial fraction decomposition of RTF provides insights into the expressivity of different SSM representations.
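The coordinate-invariance and parameter-count claims are easy to check numerically. Below is a minimal NumPy/SciPy sketch (not the paper's code) that converts a dense single-input single-output SSM to its rational transfer function with `scipy.signal.ss2tf` and verifies that a change of basis leaves the transfer function coefficients unchanged; the sizes and variable names are illustrative only.

```python
import numpy as np
from scipy.signal import ss2tf

rng = np.random.default_rng(0)
n = 4  # state size (kept small for illustration)

# A dense single-input single-output SSM (A, B, C, D): n^2 + 2n + 1 parameters.
A = 0.3 * rng.standard_normal((n, n))
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))
D = np.zeros((1, 1))

# Its rational transfer function H(z) = C (zI - A)^{-1} B + D: only 2n + 1 coefficients.
num, den = ss2tf(A, B, C, D)

# Coordinate invariance: a change of basis x -> T x gives the equivalent realization
# (T A T^{-1}, T B, C T^{-1}, D) with the *same* transfer function.
T = rng.standard_normal((n, n)) + 2 * np.eye(n)
num_T, den_T = ss2tf(T @ A @ np.linalg.inv(T), T @ B, C @ np.linalg.inv(T), D)

print(np.allclose(num, num_T), np.allclose(den, den_T))  # True True
```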
4. How does the proposed parallel inference algorithm for RTF work?
- The algorithm computes the truncated transfer function spectrum directly using the FFT, without the need to materialize the state.
- This yields space and time complexities of O(L) and O(L log L) in the sequence length L, independent of the state size (hence "state-free"), in contrast to the state-multiplicative or state-additive complexities of previous approaches; a minimal sketch follows below.
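A minimal NumPy sketch of the idea (an approximation, not the paper's exact algorithm): evaluate the rational transfer function at the FFT frequencies, multiply by the input spectrum, and inverse-transform. Memory and compute depend only on the sequence length L, not on the state size n; truncation is handled here by simple zero-padding, whereas the paper derives the truncated spectrum exactly.

```python
import numpy as np
from scipy.signal import lfilter

def rtf_parallel(b, a, u):
    """State-free parallel mode: O(L) memory and O(L log L) time via the FFT,
    independent of the state size n = len(a) - 1."""
    L = len(u)
    n_fft = 2 * L  # zero-pad to suppress circular wrap-around
    H = np.fft.rfft(b, n_fft) / np.fft.rfft(a, n_fft)  # transfer function spectrum
    return np.fft.irfft(H * np.fft.rfft(u, n_fft), n_fft)[:L]

rng = np.random.default_rng(0)
n, L = 8, 1024                                        # state size, sequence length
a_tail = rng.standard_normal(n)
a = np.r_[1.0, 0.9 * a_tail / np.abs(a_tail).sum()]   # monic, provably stable denominator
b = rng.standard_normal(n + 1)                        # numerator coefficients
u = rng.standard_normal(L)

y_fft = rtf_parallel(b, a, u)
y_rec = lfilter(b, a, u)  # exact sequential (recurrent) filtering, for reference
print(np.max(np.abs(y_fft - y_rec)))  # tiny, since all poles lie inside the unit disk
```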
5. What is the benefit of the fast companion recurrence for RTF?
- The companion form of the RTF state-space realization admits a fast recurrent update: because the companion matrix is a shift plus a low-rank correction, each time step costs only O(N) time and memory, constant in the sequence length (see the sketch below).
- This enables efficient autoregressive inference, which is important for applications like language modeling.
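A hedged sketch (illustrative, not the released implementation) of the companion-form recurrence written in direct form II: the state is a length-n shift register, so each step costs O(n) work no matter how many tokens have already been generated.

```python
import numpy as np

def rtf_step(state, u_k, b, a):
    """One companion-form (direct form II) step: O(n) work, n = len(a) - 1."""
    w_k = u_k - a[1:] @ state          # new internal value from the shift register
    y_k = b[0] * w_k + b[1:] @ state   # output mixes current and past internal values
    state = np.r_[w_k, state[:-1]]     # shift the register by one position
    return y_k, state

# Autoregressive-style usage: per-step cost stays constant as the sequence grows.
rng = np.random.default_rng(0)
n = 8
a_tail = rng.standard_normal(n)
a = np.r_[1.0, 0.9 * a_tail / np.abs(a_tail).sum()]
b = rng.standard_normal(n + 1)
state = np.zeros(n)
ys = []
for u_k in rng.standard_normal(32):
    y_k, state = rtf_step(state, u_k, b, a)
    ys.append(y_k)
# `ys` matches scipy.signal.lfilter(b, a, u) up to floating-point error.
```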
6. How does RTF ensure stable dynamics?
- Initializing the trainable RTF denominator coefficients to zero ("zero initialization") places all poles at the origin, as far as possible from violating the Montel stability constraint; this improves training stability compared to other initialization schemes (see the sketch below).
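A small sketch of the idea (the constant 0.9 and the variable names are illustrative): zero-initialized denominator coefficients put every pole at the origin, the point of maximal margin to the unit circle, and a coefficient-norm bound of the kind the stability constraint relies on certifies that the poles stay inside the unit disk.

```python
import numpy as np

n = 8

# Zero initialization: a(z) = z^n, so all poles sit at the origin --
# the farthest possible point from the unit circle (the stability boundary).
a_init = np.zeros(n)
print(np.abs(np.roots(np.r_[1.0, a_init])).max())  # 0.0

# A sufficient stability condition on the coefficients: if sum_j |a_j| < 1 (j >= 1),
# every root of z^n + a_1 z^{n-1} + ... + a_n lies strictly inside the unit disk.
a_trained = 0.9 * np.random.default_rng(0).dirichlet(np.ones(n))  # l1-norm = 0.9
print(np.abs(np.roots(np.r_[1.0, a_trained])).max())  # < 1
```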
[02] Experimental Results
1. How does the memory and latency profile of RTF compare to other SSMs like S5?
- The memory consumption of RTF scales linearly with sequence length, unlike S5 which exhibits state-multiplicative memory scaling.
- RTF also maintains consistent inference latency across state sizes, whereas S4 and S4D slow down at larger state sizes.
2. How does RTF perform on the Long Range Arena (LRA) benchmark compared to other SSMs and attention-based models?
- RTF achieves state-of-the-art performance on the Retrieval task and the highest average score among attention-free approaches on the LRA benchmark.
- RTF's state-free parallel inference algorithm allows it to scale to larger state sizes without impacting memory or training speed, unlike other SSMs.
3. How does RTF perform on synthetic memorization tasks compared to S4?
- At higher state sizes, RTF is able to more accurately copy and delay sequences compared to S4, which struggles on the Copying task.
- The results suggest RTF has stronger memorization capabilities than S4, especially as the state size is increased.
4. How does incorporating RTF into the Hyena language model affect its performance?
- Replacing the Hyena filters with RTF, in a model dubbed Hyena-RTF, lowers perplexity on the WikiText-103 dataset compared to the original Hyena baseline.
- This demonstrates the potential of RTF to enhance the language modeling capabilities of convolutional sequence models like Hyena.