Motivation

This architecture is motivated by the following observations:

Source: https://arxiv.org/pdf/2407.04620. Forward time per token (latency) for batch size 16 as context length varies. All models are 1.3B (1.4B for Mamba).

Sun et al. (2024) propose a new kind of layer, called Test-Time Training (TTT) layers, in which the hidden state is the set of weights of a machine learning model and the update rule is a step of self-supervised learning.

Overview of a TTT layer

Source: https://arxiv.org/pdf/2407.04620

In the figure above, the input sequence $x_1, x_2, \dots, x_t$ is passed through the layer to produce the output sequence $z_1, z_2, \dots, z_t$.
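
To make this concrete, here is a minimal sketch of what such a forward pass could look like, written in plain PyTorch. It is illustrative only: the function name `ttt_layer_forward`, the squared-error reconstruction loss, the fixed scaling used as the "corrupted view", and the learning rate `eta` are placeholder assumptions, not the paper's exact choices (the paper uses learned low-rank projections for its self-supervised task and a more efficient batched formulation). The point it shows is that the hidden state carried between tokens is a set of weights rather than a vector, and each update is literally a gradient step.

```python
import torch


def ttt_layer_forward(xs, W, eta=0.1):
    """Toy TTT-style layer forward pass (illustrative sketch).

    The hidden state is the weight matrix ``W`` of a small inner model
    f(x) = x @ W. For each token x_t we (1) take one gradient step on a
    self-supervised reconstruction loss, then (2) apply the updated inner
    model to produce the output z_t.
    """
    zs = []
    for x_t in xs:  # x_t has shape (d,)
        W = W.detach().requires_grad_(True)

        # Self-supervised objective: reconstruct x_t from a corrupted view.
        # (A simple stand-in for the paper's learned projections.)
        corrupted = 0.9 * x_t
        loss = ((corrupted @ W - x_t) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, W)

        # Update rule: one step of gradient descent on the inner weights.
        W = W - eta * grad

        # Output rule: run the token through the updated inner model.
        zs.append(x_t @ W)
    return torch.stack(zs), W


if __name__ == "__main__":
    d = 8
    xs = torch.randn(16, d)   # toy sequence of 16 tokens
    W0 = torch.eye(d)         # initial hidden state = initial inner weights
    zs, W_final = ttt_layer_forward(xs, W0)
    print(zs.shape)           # torch.Size([16, 8])
```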