# Transformers Can Do Arithmetic with the Right Embeddings

## Abstract

The article discusses the poor performance of transformers on arithmetic tasks, which the authors attribute to their inability to keep track of the exact position of each digit within a long sequence of digits. To address this issue, the authors propose a novel positional embedding called "Abacus Embeddings" that encodes the position of each digit relative to the start of the number. They show that this simple modification, combined with architectural changes such as input injection and recurrent layers, can dramatically improve the performance of transformers on addition, multiplication, and sorting tasks, enabling them to generalize to problems with much longer sequences than those seen during training.

## Q&A

### [01] Abacus Embeddings

**1. What are Abacus Embeddings and how do they help transformers perform better on arithmetic tasks?**
Abacus Embeddings are a novel positional embedding proposed by the authors to address the difficulty transformers have in keeping track of the exact position of each digit within a long sequence of digits. The key idea is to encode the position of each digit relative to the start of the number, which provides an explicit signal to the model to align the digits of the same significance. The authors show that this simple modification, when combined with other architectural changes, can dramatically improve the performance of transformers on addition, multiplication, and sorting tasks, enabling them to generalize to problems with much longer sequences than those seen during training.
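
The indexing idea can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes character-level tokenization (one token per digit), and the function name `abacus_positions` and the use of index 0 for non-digit tokens are hypothetical choices. Note that equal indices correspond to equal digit significance only if all numbers are written in a consistent order, for example least-significant digit first, which is also an assumption here.

```python
DIGITS = set("0123456789")

def abacus_positions(tokens):
    """Return one position index per token (illustrative sketch).

    Digits are numbered 1, 2, 3, ... from the start of the number they
    belong to. Any non-digit token gets index 0 and resets the counter,
    so the next number starts again at 1.
    """
    positions = []
    place = 0  # position of the current digit within its number
    for tok in tokens:
        if tok in DIGITS:
            place += 1
            positions.append(place)
        else:
            place = 0
            positions.append(0)
    return positions

tokens = list("123+45=")
print(abacus_positions(tokens))  # [1, 2, 3, 0, 1, 2, 0]
```

In a model, these indices would typically select rows of an embedding table, so that the first digit of every number shares one positional vector, the second digit another, and so on.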

**2. How do Abacus Embeddings compare to other positional embedding techniques like FIRE and RoPE?**
The authors show that Abacus Embeddings complement other relative positional embedding techniques like FIRE and RoPE. While RoPE is weak for length generalization, the authors demonstrate that combining Abacus Embeddings with FIRE unlocks generalization well beyond what FIRE embeddings can achieve on their own. This suggests that Abacus Embeddings can be integrated into general-purpose models alongside other relative embeddings to maintain good performance on non-arithmetic tasks.

**3. How do the authors evaluate the performance of models with Abacus Embeddings?**
The authors evaluate models with Abacus Embeddings in three settings:

- In-distribution (ID): problems up to the maximum size seen during training.
- Out-of-distribution (OOD): problems larger than the maximum size seen during training, with both operands at most 120 digits.
- Extreme out-of-distribution (120 digit OOD): problems where both operands are more than 20 and fewer than 120 digits.

They report exact match accuracy, a strict metric that counts an example as correct only if every output digit is correct.
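
Exact match accuracy can be sketched in a few lines. The helper name `exact_match_accuracy` and the representation of answers as digit strings are assumptions made for illustration.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of examples whose predicted string equals the target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # A single wrong digit makes the whole example count as incorrect.
    correct = sum(pred == gold for pred, gold in zip(predictions, targets))
    return correct / len(targets)

preds = ["579", "1000", "42"]
golds = ["579", "1001", "42"]
print(exact_match_accuracy(preds, golds))  # 2 of the 3 examples match
```

The second prediction differs from its target in only one digit, yet it receives no partial credit, which is what makes the metric strict.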

### [02] Architectural Improvements

**1. What architectural changes do the authors explore beyond Abacus Embeddings?**
In addition to Abacus Embeddings, the authors explore two other architectural improvements:

- Input injection: Skip connections that propagate a copy of the input to each layer in the network.
- Recurrent layers: Transformer models with recurrent blocks, where the same parameters are re-used multiple times.
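
The two ideas above can be combined in a toy forward pass. This is a sketch under strong simplifications: `block` stands in for a transformer block, vectors are plain Python lists, and `looped_forward` and `toy_block` are hypothetical names rather than anything from the authors' code.

```python
def looped_forward(x, block, num_recurrences, inject_input=True):
    """Apply the same `block` several times (recurrence).

    With `inject_input=True`, a copy of the original input `x` is added
    back to the hidden state before every pass after the first
    (a simple form of input injection).
    """
    h = list(x)
    for step in range(num_recurrences):
        if inject_input and step > 0:
            h = [h_i + x_i for h_i, x_i in zip(h, x)]
        h = block(h)  # same parameters reused on every pass
    return h

def toy_block(h):
    # Stand-in for a transformer block: halves every activation.
    return [0.5 * v for v in h]

x = [1.0, 2.0]
print(looped_forward(x, toy_block, num_recurrences=3))                      # [0.875, 1.75]
print(looped_forward(x, toy_block, num_recurrences=3, inject_input=False))  # [0.125, 0.25]
```

Without injection, the signal from the input shrinks on each pass through the toy block; with injection, every pass sees the input again.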

**2. How do these architectural changes interact with Abacus Embeddings?**
The authors find that combining Abacus Embeddings with input injection and recurrent layers (looped transformers) can further improve performance, increasing out-of-distribution accuracy from 95% to 99% on addition problems. This represents an 80% reduction in error compared to using Abacus Embeddings with standard architectures alone.
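
The error-reduction figure follows from the two accuracies quoted above: the error rate falls from 5% to 1%, a drop of four fifths. A small check, using a hypothetical helper `relative_error_reduction`:

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative drop in error rate when accuracy rises from acc_before to acc_after."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    if err_before <= 0:
        raise ValueError("acc_before must be below 1.0")
    return (err_before - err_after) / err_before

print(round(relative_error_reduction(0.95, 0.99), 2))  # 0.8
```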

**3. How do the authors evaluate the impact of these architectural changes?**
The authors compare the performance of standard transformer models, standard transformer models with input injection, and looped transformer models (with recurrent layers) across different positional embedding techniques (Abacus, FIRE, and NoPE). They analyze the models' in-distribution and out-of-distribution performance on addition, as well as their performance on more complex tasks like multiplication and sorting.

### [03] Generalization Capabilities

**1. What are the key findings regarding the generalization capabilities of the proposed methods?**
The authors show that their methods, particularly the combination of Abacus Embeddings and architectural improvements like input injection and recurrent layers, can achieve dramatic length generalization on addition problems. They demonstrate that models trained on operands up to 20 digits can generalize to problems with operands up to 120 digits, representing a state-of-the-art generalization factor of 6x, compared to the previous state-of-the-art of 2.5x.

**2. How do the authors evaluate the generalization capabilities of their models?**
The authors evaluate the generalization capabilities of their models in three settings: in-distribution (ID), out-of-distribution (OOD), and extreme out-of-distribution (120 digit OOD). They report exact match accuracy in each of these settings, which provides a rigorous assessment of the models' ability to solve problems of increasing difficulty and length.

**3. Do the authors' findings extend beyond addition to other algorithmic tasks?**
Yes, the authors show that the techniques they developed for improving arithmetic reasoning, particularly Abacus Embeddings, also unlock improvements on other multi-step reasoning tasks like multiplication and sorting. They demonstrate length generalization on these more complex algorithmic problems, suggesting the broader applicability of their methods.
