The intuition
Self-attention
The advantage: parallelization
Multihead attention
Building a transformer block
Token and positional embedding
The language modeling head
The decoder-only Transformer model summarized
Generation by sampling
Beam search
Conclusion