This lecture presents the Mixture of Block Attention (MoBA) architecture proposed by MoonshotAI and implemented in their Kimi model.
Researchers have noticed that existing approaches to efficient attention have limitations:
- Sink or window attention → may hinder the model's generalizability.
- Linear models such as Mamba (refer to the corresponding part of the lecture).
Lu et al. (2025) proposed an architecture inspired by Mixture of Experts (MoE) to make inference and training more efficient.
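To give a flavor of the MoE-inspired idea, here is a minimal sketch in NumPy: the key/value sequence is split into blocks, a gating score routes each query to a few blocks (here, the inner product between the query and each block's mean-pooled keys, as in the MoBA paper), and attention is computed only over the selected blocks. The function name, block size, and top-k value are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moba_attention(q, K, V, block_size=4, top_k=2):
    """Single-query sketch: route q to top-k key/value blocks, attend only there.

    Gating via mean-pooled keys follows the MoBA paper; parameter values
    here are arbitrary illustrative choices."""
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    # Gating score per block: <q, mean of the block's keys>.
    scores = Kb.mean(axis=1) @ q
    chosen = np.argsort(scores)[-top_k:]
    K_sel = Kb[chosen].reshape(-1, d)
    V_sel = Vb[chosen].reshape(-1, d)
    # Standard scaled dot-product attention, restricted to the selected blocks.
    w = softmax(q @ K_sel.T / np.sqrt(d))
    return w @ V_sel

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
out = moba_attention(q, K, V)
```

With `top_k` equal to the number of blocks this reduces to full attention; smaller `top_k` trades exactness for compute, which is the efficiency gain the architecture targets.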
We recall that in standard attention (for a single head, with a single query vector):
$$ \mathrm{Attn}(q, K, V) = \mathrm{Softmax}\!\left(\frac{qK^T}{\sqrt{d}}\right)V $$
where: