The intuition

Self-attention

The advantage: parallelization

Multihead attention

Building a transformer block

Token and positional embedding

The language modelling head

The decoder-only transformer model summarized

Generation by sampling

Beam search

Conclusion