This architecture comes from the following observations:
Source: https://arxiv.org/pdf/2407.04620. Forward time per token (latency) for batch size 16 as context length varies. All models are 1.3B (1.4B for Mamba).
Mamba cannot keep reducing perplexity as it conditions on more tokens (beyond 16K tokens).
Source: https://arxiv.org/pdf/2407.04620
Sun et al. (2024) propose a new kind of layer called Test-Time Training (TTT) layers, where the hidden state is itself the weights of a machine learning model and the update rule is a step of self-supervised learning.
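Concretely, in the paper's naive reconstruction variant, each incoming token triggers one gradient step of an inner model $f(\cdot; W)$:

$$
\ell(W; x_t) = \lVert f(\tilde{x}_t; W) - x_t \rVert^2, \qquad
W_t = W_{t-1} - \eta \, \nabla \ell(W_{t-1}; x_t), \qquad
z_t = f(x_t; W_t),
$$

where $\tilde{x}_t$ is a corrupted view of $x_t$ and $\eta$ is the inner learning rate (the full TTT layers replace this with learned multi-view projections).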
Source: https://arxiv.org/pdf/2407.04620
In the above picture, the input sequence $x_1, x_2, \dots, x_t$ is passed through the layer to produce the output sequence $z_1, z_2, \dots, z_t$.
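To make the recurrence concrete, here is a minimal NumPy sketch of this idea, assuming a linear inner model and the naive reconstruction loss above; the actual TTT layers use learned projections and mini-batch updates for efficiency, so treat this only as an illustration of the state/update/output structure.

```python
import numpy as np

def ttt_linear_forward(xs, eta=0.1, seed=0):
    """Toy TTT-style forward pass over a sequence (NumPy sketch).

    Hidden state: the weight matrix W of an inner linear model f(x; W) = W @ x.
    Update rule:  one SGD step per token on a self-supervised reconstruction
                  loss ||f(x_tilde_t; W) - x_t||^2, where x_tilde_t is a
                  corrupted (here: randomly projected) view of x_t.
    Output:       z_t = f(x_t; W_t), using the freshly updated weights.
    """
    rng = np.random.default_rng(seed)
    d = xs.shape[1]
    P = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))  # fixed corruption/projection
    W = np.zeros((d, d))                                 # hidden state W_0
    zs = []
    for x in xs:
        x_tilde = P @ x                                  # corrupted view of the token
        grad = 2.0 * np.outer(W @ x_tilde - x, x_tilde)  # d/dW ||W x_tilde - x||^2
        W = W - eta * grad                               # update rule: one gradient step
        zs.append(W @ x)                                 # output token z_t = f(x_t; W_t)
    return np.stack(zs), W

# toy usage: a sequence of 8 four-dimensional tokens
xs = np.random.default_rng(1).normal(size=(8, 4))
zs, W_final = ttt_linear_forward(xs)
print(zs.shape)  # (8, 4)
```

Note that the state carried across tokens is the whole matrix `W`, not an activation vector, which is what gives the layer more expressive memory than a fixed-size RNN hidden state.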