In this short lecture, we discuss how pre-training is performed in language models with different architectures. We will cover the following topics:
- The pre-training revolution
- How to pre-train a foundation model?
- Word structure and subword models
- Motivating model pre-training from word embeddings
- Where does the training data come from?
- Pre-training for three architectures
- The GPT architecture
- Conclusion