This lecture presents the Mixture of Block Attention (MoBA) architecture proposed by MoonshotAI and implemented in their Kimi model.
Researchers have noticed that existing approaches to efficient attention have limitations:
- Sink or window attention → may hinder the model's generalizability.
- Linear models such as Mamba (refer to the corresponding part of the lecture).
Lu et al. (2025) proposed an architecture inspired by Mixture of Experts (MoE) to make inference and training more efficient.
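To give a flavor of the MoE-inspired idea, here is a minimal sketch in NumPy: the key/value sequence is split into blocks, a gating score routes each query to a few blocks (here, the inner product between the query and each block's mean-pooled keys, as in the MoBA paper), and attention is computed only over the selected blocks. The function name, block size, and top-k value are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moba_attention(q, K, V, block_size=4, top_k=2):
    """Single-query sketch: route q to top-k key/value blocks, attend only there.

    Gating via mean-pooled keys follows the MoBA paper; parameter values
    here are arbitrary illustrative choices."""
    n, d = K.shape
    n_blocks = n // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    # Gating score per block: <q, mean of the block's keys>.
    scores = Kb.mean(axis=1) @ q
    chosen = np.argsort(scores)[-top_k:]
    K_sel = Kb[chosen].reshape(-1, d)
    V_sel = Vb[chosen].reshape(-1, d)
    # Standard scaled dot-product attention, restricted to the selected blocks.
    w = softmax(q @ K_sel.T / np.sqrt(d))
    return w @ V_sel

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
out = moba_attention(q, K, V)
```

With `top_k` equal to the number of blocks this reduces to full attention; smaller `top_k` trades exactness for compute, which is the efficiency gain the architecture targets.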
We recall that in standard attention (for a single head, with a single query vector):
$$ \mathrm{Attn}(q, K, V) = \mathrm{Softmax}\!\left(\frac{qK^T}{\sqrt{d}}\right)V $$
where: