# Efficient World Models with Context-Aware Tokenization

## Abstract

The article discusses the challenge of scaling up deep Reinforcement Learning (RL) methods and proposes a new agent called Δ-IRIS that uses a world model architecture composed of a discrete autoencoder and an autoregressive transformer. The key ideas are:

- Encoding frames by conditioning on previous frames and actions to describe the stochastic deltas between time steps, rather than encoding frames independently. This reduces the number of tokens required to represent frames.
- Interleaving continuous "I-tokens" that summarize the current state of the world with the discrete "delta-tokens" in the sequence of the autoregressive transformer. This helps the transformer better model the stochastic dynamics.

The experiments show that Δ-IRIS sets a new state-of-the-art on the Crafter benchmark, outperforming previous attention-based approaches while being an order of magnitude faster to train.

## Q&A

### [01] Disentangling deterministic and stochastic dynamics

**1. What is the key idea behind conditioning the autoencoder on previous frames and actions?**
The key idea is to encode frames by describing the delta, or change, between successive time steps, rather than encoding frames independently. This allows the autoencoder to focus on modeling the deterministic aspects of the dynamics, while the autoregressive transformer can focus on the stochastic components.

**2. Why is it important to disentangle the deterministic and stochastic aspects of world modeling?**
Disentangling the deterministic and stochastic aspects is important because the deterministic dynamics can be efficiently handled by the autoencoder, while the autoregressive transformer can focus on modeling the more complex stochastic dynamics. This allows the world model to be more computationally efficient, especially in visually challenging environments.

**3. How does Δ-IRIS's autoencoder encode frames differently compared to IRIS?**
Δ-IRIS's autoencoder encodes frames by conditioning on previous frames and actions, effectively describing the delta, or change, between successive time steps. This is in contrast to IRIS, which encodes each frame independently and therefore cannot exploit the temporal redundancy between consecutive frames.
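
The context-conditioned encoding described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual architecture: the layer sizes, the concatenation of context along the channel axis, and the `DeltaEncoder` name are all assumptions made for clarity. The point it demonstrates is that, given the previous frame and action as conditioning, only a handful of discrete tokens are needed per frame to describe the delta.

```python
import torch
import torch.nn as nn

class DeltaEncoder(nn.Module):
    """Hypothetical sketch of a context-aware tokenizer: the encoder sees
    the previous frame and action alongside the current frame, so its
    discrete output tokens only need to capture the *change* between
    time steps (layer sizes are illustrative, not from the paper)."""

    def __init__(self, frame_channels=3, action_dim=16,
                 codebook_size=512, tokens_per_frame=4, embed_dim=64):
        super().__init__()
        self.action_dim = action_dim
        self.tokens_per_frame = tokens_per_frame
        self.embed_dim = embed_dim
        # Previous frame, broadcast action, and current frame are
        # concatenated along the channel axis before encoding.
        self.conv = nn.Sequential(
            nn.Conv2d(2 * frame_channels + action_dim, 128, 4, stride=4),
            nn.ReLU(),
            nn.Conv2d(128, embed_dim, 4, stride=4),
        )
        self.proj = nn.Linear(embed_dim, tokens_per_frame * embed_dim)
        # Codebook for vector-quantizing the delta embeddings.
        self.codebook = nn.Embedding(codebook_size, embed_dim)

    def forward(self, prev_frame, action, frame):
        b, _, h, w = frame.shape
        # Broadcast the (one-hot) action over the spatial grid.
        a = action.view(b, self.action_dim, 1, 1).expand(-1, -1, h, w)
        z = self.conv(torch.cat([prev_frame, a, frame], dim=1))
        z = z.mean(dim=(2, 3))  # global spatial pooling
        z = self.proj(z).view(b, self.tokens_per_frame, self.embed_dim)
        # Nearest-neighbour codebook lookup -> discrete delta-token ids.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(b, -1, -1))
        return dists.argmin(dim=-1)  # (b, tokens_per_frame)
```

With this layout, a 64x64 frame is summarized by only `tokens_per_frame` discrete ids, far fewer than an unconditioned tokenizer would need, because the static parts of the scene are already implied by the conditioning.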

### [02] Modeling stochastic dynamics

**1. What is the challenge in predicting future "delta-tokens" given past "delta-tokens" and actions?**
The challenge is that predicting future "delta-tokens" requires reasoning over multiple time steps and integrating the complex dependence structure between the "delta-tokens". This is more difficult than simply predicting future image tokens given past image tokens and actions, as done in IRIS.

**2. How does Δ-IRIS address this challenge?**
Δ-IRIS addresses this challenge by interleaving continuous "I-tokens" that summarize the current state of the world with the discrete "delta-tokens" in the sequence of the autoregressive transformer. This allows the transformer to reason about the current state of the world without having to integrate over multiple past "delta-tokens".

**3. What is the role of the "I-tokens" in the Δ-IRIS architecture?**
The "I-tokens" provide a "soft" Markov blanket for the prediction of the next "delta-tokens", alleviating the need for the transformer to integrate over past "delta-tokens" to form a representation of the current state. This makes the task of the autoregressive transformer easier.
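
The interleaving described above can be sketched as a simple sequence-building step. This is an illustrative assumption about the layout, not the paper's exact implementation: the per-step ordering (I-token, then action, then delta-tokens) and the function name are hypothetical. What it shows is that each time step contributes one continuous state summary that the transformer can attend to directly, instead of having to reconstruct the state from all past delta-tokens.

```python
import torch

def interleave_sequence(i_tokens, action_emb, delta_emb):
    """Hypothetical sketch: build the transformer input by interleaving,
    per time step, a continuous state summary (I-token), the action
    embedding, and the K discrete delta-token embeddings.

    i_tokens:   (B, T, D)    continuous summaries of the current state
    action_emb: (B, T, D)    embedded actions
    delta_emb:  (B, T, K, D) embedded delta-tokens
    returns:    (B, T * (K + 2), D)
    """
    b, t, k, d = delta_emb.shape
    # Stack the three kinds of tokens per time step, then flatten
    # time and per-step position into one sequence axis.
    parts = torch.cat(
        [i_tokens.unsqueeze(2), action_emb.unsqueeze(2), delta_emb], dim=2
    )  # (B, T, K + 2, D)
    return parts.reshape(b, t * (k + 2), d)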

### [03] Results

**1. How does Δ-IRIS perform on the Crafter benchmark compared to the baselines?**
After 10M frames of data collection, Δ-IRIS solves on average 17 out of 22 tasks in the Crafter benchmark, setting a new state-of-the-art. It consistently achieves higher returns than DreamerV3 beyond the 3M frames mark, and outperforms IRIS while training an order of magnitude faster.

**2. What is the importance of including the "I-tokens" in the sequence of the autoregressive transformer?**
Removing the "I-tokens" from the sequence of the autoregressive transformer drastically hurts the performance of Δ-IRIS, demonstrating the importance of this design choice in enabling the model to effectively capture the stochastic dynamics.

**3. What are the remaining challenges for Δ-IRIS to fully solve the Crafter benchmark?**
The authors hypothesize that the remaining unsolved tasks, which require discovering new tools in the presence of a crafting table and furnace, pose a hard exploration problem. With too few training samples, the world model may be unable to internalize these new mechanics and reflect them during the imagination procedure. A biased data sampling procedure could be a potential solution to unlock the missing achievements.
