This lecture presents the Mixture of Block Attention (MoBA) architecture proposed by MoonshotAI and implemented in their Kimi model.

Why Mixture of Block Attention?

Researchers have noticed that the cost of standard attention grows quadratically with the context length, which makes long-context training and inference expensive, even though each query typically attends strongly to only a small fraction of positions.

Lu et al. (2025) therefore proposed an architecture inspired by Mixture of Experts (MoE) to make inference and training more efficient.

What is Mixture of Block Attention?

We recall that in standard attention (for a single head):

$$ \mathrm{Attn}(q, K, V) = \mathrm{Softmax}(qK^\top)V $$
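As a quick illustration, here is a minimal NumPy sketch of this formula; the function name, shapes, and toy inputs are illustrative assumptions, and, following the formula as written above, the usual $1/\sqrt{d}$ scaling factor is omitted:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attn(q, K, V):
    """Standard single-head attention for one query vector.

    q: query vector, shape (d,)
    K: key matrix, shape (n, d)
    V: value matrix, shape (n, d_v)
    Returns the attended output, shape (d_v,).
    """
    scores = q @ K.T           # (n,): similarity of q with every key
    weights = softmax(scores)  # (n,): attention distribution over positions
    return weights @ V         # (d_v,): weighted sum of the values

# Tiny example with random inputs
rng = np.random.default_rng(0)
n, d = 8, 4
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out = attn(q, K, V)  # shape (4,)
```

Note that the `scores = q @ K.T` line touches every key; this is the part that a block-sparse scheme such as MoBA restricts to a selected subset of key/value blocks.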

where, in the formula above: