This architecture comes from the following observations:
Source: https://arxiv.org/pdf/2407.04620. Forward time per token (latency) for batch size 16 as context length varies. All models are 1.3B (1.4B for Mamba).
Mamba cannot keep reducing perplexity as it conditions on more tokens (beyond 16K tokens).
Source: https://arxiv.org/pdf/2407.04620
Sun et al. (2024) propose a new kind of layer called Test-Time Training (TTT) layers, where the hidden state is itself the weights of a machine learning model and the update rule is a step of self-supervised learning.
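Concretely, in the paper's naive reconstruction variant, each incoming token triggers one gradient step of an inner model $f(\cdot; W)$:

$$
\ell(W; x_t) = \lVert f(\tilde{x}_t; W) - x_t \rVert^2, \qquad
W_t = W_{t-1} - \eta \, \nabla \ell(W_{t-1}; x_t), \qquad
z_t = f(x_t; W_t),
$$

where $\tilde{x}_t$ is a corrupted view of $x_t$ and $\eta$ is the inner learning rate (the full TTT layers replace this with learned multi-view projections).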
Source: https://arxiv.org/pdf/2407.04620
In the above picture, the input sequence $x_1, x_2, \dots, x_t$ is passed through the layer to produce the output sequence $z_1, z_2, \dots, z_t$.
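To make the recurrence concrete, here is a minimal NumPy sketch of this idea, assuming a linear inner model and the naive reconstruction loss above; the actual TTT layers use learned projections and mini-batch updates for efficiency, so treat this only as an illustration of the state/update/output structure.

```python
import numpy as np

def ttt_linear_forward(xs, eta=0.1, seed=0):
    """Toy TTT-style forward pass over a sequence (NumPy sketch).

    Hidden state: the weight matrix W of an inner linear model f(x; W) = W @ x.
    Update rule:  one SGD step per token on a self-supervised reconstruction
                  loss ||f(x_tilde_t; W) - x_t||^2, where x_tilde_t is a
                  corrupted (here: randomly projected) view of x_t.
    Output:       z_t = f(x_t; W_t), using the freshly updated weights.
    """
    rng = np.random.default_rng(seed)
    d = xs.shape[1]
    P = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))  # fixed corruption/projection
    W = np.zeros((d, d))                                 # hidden state W_0
    zs = []
    for x in xs:
        x_tilde = P @ x                                  # corrupted view of the token
        grad = 2.0 * np.outer(W @ x_tilde - x, x_tilde)  # d/dW ||W x_tilde - x||^2
        W = W - eta * grad                               # update rule: one gradient step
        zs.append(W @ x)                                 # output token z_t = f(x_t; W_t)
    return np.stack(zs), W

# toy usage: a sequence of 8 four-dimensional tokens
xs = np.random.default_rng(1).normal(size=(8, 4))
zs, W_final = ttt_linear_forward(xs)
print(zs.shape)  # (8, 4)
```

Note that the state carried across tokens is the whole matrix `W`, not an activation vector, which is what gives the layer more expressive memory than a fixed-size RNN hidden state.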