In this practical, we will explore how to fine-tune our GPT2 model for classification and instruction following.
Students are encouraged to use larger versions of GPT2 if their configurations allow.
We are going to train a GPT model to perform binary classification of emails (spam or not).
Download the following dataset, unzip it, and load the TSV file using pandas.
Explore the data. How many datapoints does it have? Is it balanced?
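For example, assuming the unzipped file is named `spam.tsv` (a hypothetical name; adjust the path and column names to your download) with tab-separated label and text columns, the data can be loaded and inspected as follows:

```python
import pandas as pd

# Hypothetical file and column names; adapt them to the actual download.
df = pd.read_csv("spam.tsv", sep="\t", header=None, names=["Label", "Text"])

print(f"Number of datapoints: {len(df)}")
print(df["Label"].value_counts())  # is the dataset balanced?
```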
Balance the dataset using under-sampling, i.e., keep, for each class, the same number of instances as the class with the fewest instances.
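One possible under-sampling sketch, reusing the `Label` column assumed above:

```python
def create_balanced_dataset(df):
    # Number of instances in the least represented class
    n_min = df["Label"].value_counts().min()
    # Keep n_min randomly sampled instances per class
    return (
        df.groupby("Label", group_keys=False)
          .apply(lambda g: g.sample(n=n_min, random_state=123))
          .reset_index(drop=True)
    )

balanced_df = create_balanced_dataset(df)
print(balanced_df["Label"].value_counts())
```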
Shuffle (use random_state=123 for reproducibility) and split the dataset into train, validation, and test datasets of 70%, 10%, and 20% respectively. Save those datasets into CSV files.
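A possible splitting sketch (the test split is whatever remains after the 70% and 10% slices):

```python
def random_split(df, train_frac=0.7, validation_frac=0.1):
    # Shuffle the whole dataset
    df = df.sample(frac=1, random_state=123).reset_index(drop=True)

    train_end = int(len(df) * train_frac)
    validation_end = train_end + int(len(df) * validation_frac)

    return df[:train_end], df[train_end:validation_end], df[validation_end:]

train_df, validation_df, test_df = random_split(balanced_df)

train_df.to_csv("train.csv", index=None)
validation_df.to_csv("validation.csv", index=None)
test_df.to_csv("test.csv", index=None)
```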
Create a class SpamDataset which inherits from Dataset. The dataset should be initialized with the following call SpamDataset(csv_file, max_length, tokenizer, pad_token_id) where:

- csv_file is the path to the CSV file of the dataset.
- max_length is the maximum length of the tokenized vectors (longer vectors are cropped). If this is set to None, all vectors are padded to the length of the longest vector in the dataset.
- tokenizer is the tokenizer used to encode the strings.
- pad_token_id is the token used to pad shorter sequences. We will set it to the index of the <|endoftext|> token.

Create SpamDataset objects for your train, validation, and test datasets. Set max_length to None for the train dataset, and set max_length for the validation and test datasets so that their vectors have the same size as the training vectors.
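A sketch of such a class, assuming the CSV files keep the `Label` and `Text` columns from above, that the label values are the strings "ham"/"spam" (adapt `label_map` if your file differs), and the tiktoken GPT-2 tokenizer:

```python
import pandas as pd
import tiktoken
import torch
from torch.utils.data import Dataset

class SpamDataset(Dataset):
    def __init__(self, csv_file, max_length, tokenizer, pad_token_id):
        self.data = pd.read_csv(csv_file)
        label_map = {"ham": 0, "spam": 1}  # assumed label values; adjust if needed

        # Tokenize every text
        self.encoded_texts = [tokenizer.encode(t) for t in self.data["Text"]]

        if max_length is None:
            # Use the length of the longest encoded text
            self.max_length = max(len(e) for e in self.encoded_texts)
        else:
            self.max_length = max_length
            # Crop sequences that are too long
            self.encoded_texts = [e[:max_length] for e in self.encoded_texts]

        # Pad shorter sequences with pad_token_id
        self.encoded_texts = [
            e + [pad_token_id] * (self.max_length - len(e))
            for e in self.encoded_texts
        ]
        self.labels = [label_map[label] for label in self.data["Label"]]

    def __getitem__(self, index):
        return (
            torch.tensor(self.encoded_texts[index], dtype=torch.long),
            torch.tensor(self.labels[index], dtype=torch.long),
        )

    def __len__(self):
        return len(self.data)


tokenizer = tiktoken.get_encoding("gpt2")
pad_token_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

train_dataset = SpamDataset("train.csv", None, tokenizer, pad_token_id)
val_dataset = SpamDataset("validation.csv", train_dataset.max_length, tokenizer, pad_token_id)
test_dataset = SpamDataset("test.csv", train_dataset.max_length, tokenizer, pad_token_id)
```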
Create data loaders for each dataset. Use num_workers=0 and batch_size=8 for all of them. Note that we set shuffle=True and drop_last=True only for the train data loader. Print the number of batches for each data loader.
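For instance:

```python
from torch.utils.data import DataLoader

num_workers = 0
batch_size = 8

train_loader = DataLoader(train_dataset, batch_size=batch_size,
                          shuffle=True, num_workers=num_workers, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size,
                        shuffle=False, num_workers=num_workers, drop_last=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size,
                         shuffle=False, num_workers=num_workers, drop_last=False)

print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")
```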
Reuse the load_weights_into_gpt function (see previous practical) to load the pretrained weights into a new GPT model with the following configuration:
{'vocab_size': 50257,
'context_length': 1024,
'emb_dim': 768,
'n_heads': 12,
'n_layers': 12,
'drop_rate': 0.0,
'qkv_bias': True}
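A loading sketch, restating the configuration above and assuming the GPTModel class, the GPT-2 checkpoint download helper, and load_weights_into_gpt carried over from the previous practical (those helper names are assumptions and may differ in your code):

```python
import torch

BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.0,
    "qkv_bias": True,
}

# Helpers assumed from the previous practical
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

torch.manual_seed(123)
model = GPTModel(BASE_CONFIG)
load_weights_into_gpt(model, params)
model.eval()
```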
Can the base GPT2 model perform the classification using only a prompt? Try using the following:
(
"Is the following text 'spam'? Answer with 'yes' or 'no':"
" 'You are a winner you have been specially"
" selected to receive $1000 cash or a $2000 award.'"
)
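To probe this, one can feed the prompt above through a generation loop; the sketch below assumes the text_to_token_ids, token_ids_to_text, and generate_text_simple helpers from the previous practical:

```python
text = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)

# Helpers assumed from the previous practical
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text, tokenizer),
    max_new_tokens=23,
    context_size=BASE_CONFIG["context_length"],
)
print(token_ids_to_text(token_ids, tokenizer))
```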
To perform classification, we are going to freeze parts of the model and modify its structure. Apply the following steps:
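As a sketch of what this typically involves (assuming the GPT model exposes out_head, trf_blocks, and final_norm attributes as in the previous practical): freeze all pretrained parameters, replace the output head with a two-unit linear layer, and unfreeze only the last transformer block and the final layer norm.

```python
import torch

torch.manual_seed(123)
num_classes = 2  # spam / not spam

# Freeze all pretrained parameters
for param in model.parameters():
    param.requires_grad = False

# Replace the language-modelling head with a 2-class classification head
# (attribute names assumed from the previous practical's GPTModel)
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG["emb_dim"], out_features=num_classes
)

# Unfreeze the last transformer block and the final layer norm
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True
for param in model.final_norm.parameters():
    param.requires_grad = True
```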
Try the new model on several prompts. Does it work correctly?
<aside> 👉🏻
Expected output (on "Do you have time")
Outputs:
tensor([[[-1.5854, 0.9904],
[-3.7235, 7.4548],
[-2.2661, 6.6049],
[-3.5983, 3.9902]]])
Outputs dimensions: torch.Size([1, 4, 2])
</aside>
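For reference, output of this shape can be obtained by encoding the prompt with the tokenizer and running the modified model in evaluation mode, roughly as follows:

```python
text = "Do you have time"
inputs = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # shape: (1, 4)

with torch.no_grad():
    outputs = model(inputs)  # one 2-class logit vector per input token

print("Outputs:", outputs)
print("Outputs dimensions:", outputs.shape)  # torch.Size([1, 4, 2])
```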