Now that you are familiar with how strings of text are tokenized and transformed into embeddings, and with the multi-head attention mechanism, we can delve into the GPT-2 model.

A GPT-2 model is composed of several elements: token and positional embedding layers, a stack of transformer blocks (each combining multi-head attention, layer normalization, a feed-forward network, and shortcut connections), a final layer normalization, and a linear output head.

In this practical, we explain and implement all of these elements.

Understanding layer normalization

Let us consider 2 training examples with 5 dimensions (features) each.

import torch

torch.manual_seed(123)
batch_example = torch.randn(2, 5)  # 2 examples, 5 features each
print(batch_example)

<aside> πŸ‘‰πŸ»

Expected output

tensor([[-0.1115,  0.1204, -0.3696, -0.2404, -1.1969],
        [ 0.2093, -0.9724, -0.7550,  0.3239, -0.1085]])

</aside>

  1. Pass the batch above through a linear layer with output dimension 6 followed by a ReLU activation, and print the output (a possible implementation is sketched after the expected output below).

    <aside> πŸ‘‰πŸ»

    Expected output

    tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
            [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
           grad_fn=<ReluBackward0>)

    </aside>
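
    Below is a minimal sketch of one way to solve step 1. It assumes the linear layer and the ReLU are combined with `nn.Sequential`, and it recreates `batch_example` with the same seed as above so the snippet runs on its own; the names `layer` and `out` are illustrative.

    import torch
    import torch.nn as nn

    # Recreate the batch from the cell above so this snippet is self-contained
    torch.manual_seed(123)
    batch_example = torch.randn(2, 5)

    # A linear layer mapping the 5 input features to 6 outputs, followed by ReLU
    layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
    out = layer(batch_example)
    print(out)

    Wrapping the two modules in `nn.Sequential` keeps the forward pass to a single call; applying `nn.Linear` and `nn.ReLU` as two separate steps works just as well.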