You may have seen models like Mistral’s Mixtral 8x7B, which uses a mixture-of-experts (MoE) architecture. But what is it, and why is it used?
We note that MoE is not new: it was introduced in the 1991 paper **Adaptive Mixtures of Local Experts**.
At each layer $l$ of the network, instead of having only one network $E$ (and the output $E(x)$ for input $x$), we have a collection of $n$ expert networks $\{E_0, E_1, \dots, E_{n-1}\}$.
In the case of Mixtral, we have $n = 8$ experts, hence the name 8x7B.
For a given input $x$, the output of the expert layer is given by:
$$ \sum_{i=0}^{n-1} G(x)_i \cdot E_i(x) $$
where $G(x)_i$ is the weight that the gating (router) network $G$ assigns to expert $i$, and $E_i(x)$ is the output of expert $i$.
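To make this concrete, here is a minimal sketch of such an expert layer in PyTorch, assuming each expert is a small feed-forward network and the gate is a single linear layer followed by a softmax. The names (`MoELayer`, `d_model`, `d_hidden`) are illustrative, not taken from any particular implementation; this dense version evaluates every expert, matching the general formula above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative dense MoE layer: output = sum_i G(x)_i * E_i(x)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        # n expert networks E_0, ..., E_{n-1} (here: small feed-forward nets)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # gating network G: one logit per expert
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # G(x): one weight per expert, summing to 1
        weights = F.softmax(self.gate(x), dim=-1)                        # (..., n_experts)
        # E_i(x) for every expert
        expert_outs = torch.stack([E(x) for E in self.experts], dim=-1)  # (..., d_model, n_experts)
        # sum_i G(x)_i * E_i(x)
        return (expert_outs * weights.unsqueeze(-2)).sum(dim=-1)         # (..., d_model)
```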
In the case of Mixtral, the gating function is defined as:
$$ G(x) := \mathrm{Softmax}(\mathrm{TopK}(x \cdot W_g)) $$
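Here $\mathrm{TopK}$ keeps the $K$ largest router logits and sets the rest to $-\infty$ ($K = 2$ in Mixtral), so the softmax assigns those experts exactly zero weight and only $K$ experts actually need to be evaluated per token. A minimal sketch of this gating function, assuming `W_g` is a $(d_{\text{model}} \times n)$ weight matrix (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def top_k_gate(x: torch.Tensor, W_g: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Sketch of G(x) = Softmax(TopK(x @ W_g)): only k entries are nonzero."""
    logits = x @ W_g                              # (..., n_experts) router logits
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep the K largest logits
    # TopK: all non-selected logits are set to -inf ...
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    # ... so the softmax gives them exactly zero weight
    return F.softmax(masked, dim=-1)
```

Because all but $K$ of the weights are zero, each token is routed through only 2 of the 8 experts, which is why Mixtral's per-token compute is much smaller than its total parameter count would suggest.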