You may have seen models like Mistral’s Mixtral 8x7B, which uses a mixture-of-experts (MoE) architecture. But what is it, and why is it used?
We note that MoE is not new: it was introduced in the 1991 paper **Adaptive Mixtures of Local Experts**.
At each layer $l$ of the network, instead of having only one network $E$ (and the output $E(x)$ for input $x$), we have a collection of $n$ expert networks $\{E_0, E_1, \dots, E_{n-1}\}$.
In the case of Mixtral, we have $n = 8$ experts, hence the name 8x7B.
For a given input $x$, the output of the expert layer is given by:
$$ \sum_{i=0}^{n-1} G(x)_i \cdot E_i(x) $$
where $G(x)_i$ is the weight that the gating (router) network $G$ assigns to expert $i$, and $E_i(x)$ is the output of expert $i$.
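To make this concrete, here is a minimal sketch of such an expert layer in PyTorch, assuming each expert is a small feed-forward network and the gate is a single linear layer followed by a softmax. The names (`MoELayer`, `d_model`, `d_hidden`) are illustrative, not taken from any particular implementation; this dense version evaluates every expert, matching the general formula above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative dense MoE layer: output = sum_i G(x)_i * E_i(x)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        # n expert networks E_0, ..., E_{n-1} (here: small feed-forward nets)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # gating network G: one logit per expert
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # G(x): one weight per expert, summing to 1
        weights = F.softmax(self.gate(x), dim=-1)                        # (..., n_experts)
        # E_i(x) for every expert
        expert_outs = torch.stack([E(x) for E in self.experts], dim=-1)  # (..., d_model, n_experts)
        # sum_i G(x)_i * E_i(x)
        return (expert_outs * weights.unsqueeze(-2)).sum(dim=-1)         # (..., d_model)
```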
In the case of Mixtral, the gating function is defined as:
$$ G(x) := \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g)) $$
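Here $\mathrm{TopK}$ keeps the $K$ largest router logits and sets the rest to $-\infty$ ($K = 2$ in Mixtral), so the softmax assigns those experts exactly zero weight and only $K$ experts actually need to be evaluated per token. A minimal sketch of this gating function, assuming `W_g` is a $(d_{\text{model}} \times n)$ weight matrix (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def top_k_gate(x: torch.Tensor, W_g: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Sketch of G(x) = Softmax(TopK(x @ W_g)): only k entries are nonzero."""
    logits = x @ W_g                              # (..., n_experts) router logits
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep the K largest logits
    # TopK: all non-selected logits are set to -inf ...
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    # ... so the softmax gives them exactly zero weight
    return F.softmax(masked, dim=-1)
```

Because all but $K$ of the weights are zero, each token is routed through only 2 of the 8 experts, which is why Mixtral's per-token compute is much smaller than its total parameter count would suggest.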