In this short lecture, we are going to discuss how pre-training is performed for language models across different architectures.

The pre-training revolution

How to pre-train a foundation model?

Word structure and subword models

Motivating model pre-training from word embeddings

Where does the training data come from?

Pre-training for three architectures

The GPT architecture

Conclusion