Now that you are familiar with how strings of text are tokenized and transformed into embeddings, and with the multi-head attention mechanism, we can delve into the GPT-2 model.

A GPT-2 model is composed of several elements: token and positional embedding layers, a stack of transformer blocks (each combining multi-head attention, layer normalization, a feed-forward network, and shortcut connections), a final layer normalization, and a linear output head.

In this practical, we explain and implement all of these elements.

Understanding layer normalization

Let us consider 2 training examples with 5 dimensions (features) each.

import torch

torch.manual_seed(123)
batch_example = torch.randn(2, 5)  # 2 examples, 5 features each
print(batch_example)

<aside> πŸ‘‰πŸ»

Expected output

tensor([[-0.1115,  0.1204, -0.3696, -0.2404, -1.1969],
        [ 0.2093, -0.9724, -0.7550,  0.3239, -0.1085]])

</aside>

  1. Pass the batch above through a linear layer with output dimension 6 followed by a ReLU activation, and print the output (a possible implementation is sketched after the expected output below).

    <aside> πŸ‘‰πŸ»

    Expected output

    tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
            [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
           grad_fn=<ReluBackward0>)

    </aside>
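
    Below is a minimal sketch of one way to solve step 1. It assumes the linear layer and the ReLU are combined with `nn.Sequential`, and it recreates `batch_example` with the same seed as above so the snippet runs on its own; the names `layer` and `out` are illustrative.

    import torch
    import torch.nn as nn

    # Recreate the batch from the cell above so this snippet is self-contained
    torch.manual_seed(123)
    batch_example = torch.randn(2, 5)

    # A linear layer mapping the 5 input features to 6 outputs, followed by ReLU
    layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
    out = layer(batch_example)
    print(out)

    Wrapping the two modules in `nn.Sequential` keeps the forward pass to a single call; applying `nn.Linear` and `nn.ReLU` as two separate steps works just as well.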