The intuition

Self-attention

The advantage: parallelization

Multi-head attention

Building a transformer block

Token and positional embedding

The language modeling head

The decoder-only transformer model summarized

Generation by sampling

Beam search

Conclusion