In this practical, we will explore how to fine-tune our GPT2 model for classification and instruction following.

Students are encouraged to use larger versions of GPT2 if their configurations allow.

Fine-tuning for text classification

We are going to train a GPT model to perform binary classification of messages (spam or not spam).

  1. Download the following dataset, unzip, and load the TSV file using pandas.

    sms_spam_collection.zip
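
As a starting point for step 1, here is a minimal sketch of unzipping an archive and reading a tab-separated file with pandas. The helper name `load_tsv_from_zip`, the column names `Label` and `Text`, and the assumption that the file has no header row are ours, not part of the dataset documentation. The name of the file inside the archive is also not given here, so check it with `archive.namelist()`.

```python
import os
import tempfile
import zipfile

import pandas as pd


def load_tsv_from_zip(zip_path, member_name, extract_dir):
    """Extract one file from the archive and load it as a two-column DataFrame."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extract(member_name, path=extract_dir)
    tsv_path = os.path.join(extract_dir, member_name)
    # Assumption: tab-separated, no header row, label first and text second.
    return pd.read_csv(tsv_path, sep="\t", header=None, names=["Label", "Text"])


# Self-contained demo on a tiny archive built on the fly (not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo.tsv", "ham\tSee you at noon\nspam\tWin cash now\n")
    df = load_tsv_from_zip(demo_zip, "demo.tsv", tmp_dir)

print(df)
```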

  2. Explore the data. How many datapoints does it have? Is it balanced?

  3. Balance the dataset using under-sampling, i.e., keep the same number of instances for each class, equal to the number of instances in the smallest class.
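
Steps 2 and 3 can be sketched as follows on toy data. The function name `undersample`, the column name `Label`, and the reuse of seed 123 for the sampling are assumptions made for illustration. The real dataset will have different class counts.

```python
import pandas as pd


def undersample(df, label_col="Label", seed=123):
    """Keep the same number of rows per class, equal to the smallest class size."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).reset_index(drop=True)


# Toy data standing in for the real dataset: 6 "ham" rows and 2 "spam" rows.
toy = pd.DataFrame({
    "Label": ["ham"] * 6 + ["spam"] * 2,
    "Text": [f"message {i}" for i in range(8)],
})

print(len(toy))                          # number of datapoints
print(toy["Label"].value_counts())       # class counts before balancing

balanced = undersample(toy)
print(balanced["Label"].value_counts())  # class counts after balancing
```

Sampling each class down to the size of the smallest one discards data from the majority class, which is the trade-off that under-sampling accepts in exchange for balanced classes.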

  4. Shuffle (use random_state=123 for reproducibility) and split the dataset into train, validation, and test sets containing 70%, 10%, and 20% of the data, respectively. Save these sets as CSV files.
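
One possible way to implement the shuffle and the 70/10/20 split is sketched below on toy data. The function name `random_split` and the CSV file name in the final comment are placeholders of ours. The test set simply receives whatever remains after the first two cuts.

```python
import pandas as pd


def random_split(df, train_frac=0.7, val_frac=0.1, seed=123):
    """Shuffle the rows, then cut them into train / validation / test parts."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    train_end = round(len(shuffled) * train_frac)
    val_end = train_end + round(len(shuffled) * val_frac)
    # The remaining rows (about 20% here) form the test set.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


# Toy data: 20 rows with alternating labels.
toy = pd.DataFrame({"Label": [0, 1] * 10, "Text": [f"msg {i}" for i in range(20)]})
train_df, val_df, test_df = random_split(toy)
print(len(train_df), len(val_df), len(test_df))

# Each part can then be saved, for example:
# train_df.to_csv("train.csv", index=False)
```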

  5. Create a class SpamDataset which inherits from Dataset. The dataset should be initialized with the call SpamDataset(csv_file, max_length, tokenizer, pad_token_id) where:
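
The sketch below shows one way such a class could look. It assumes that the CSV file has a `Text` column and an integer `Label` column (0 or 1), that `tokenizer` exposes an `encode(text)` method returning a list of token ids, and that sequences are truncated to `max_length` and then padded with `pad_token_id`. `ToyTokenizer` is a hypothetical stand-in used only to keep the example self-contained. In the practical you would pass your GPT-2 tokenizer instead, together with 50256, the id of GPT-2's `<|endoftext|>` token.

```python
import os
import tempfile

import pandas as pd
import torch
from torch.utils.data import Dataset


class SpamDataset(Dataset):
    def __init__(self, csv_file, max_length, tokenizer, pad_token_id):
        self.data = pd.read_csv(csv_file)
        # Assumption: the CSV has a "Text" column and an integer "Label" column.
        encoded = [tokenizer.encode(text) for text in self.data["Text"]]
        if max_length is None:
            # No limit given: use the longest sequence found in this file.
            self.max_length = max(len(ids) for ids in encoded)
        else:
            # Truncate sequences that exceed the requested length.
            self.max_length = max_length
            encoded = [ids[:max_length] for ids in encoded]
        # Pad every sequence on the right so that all have the same length.
        self.encoded_texts = [
            ids + [pad_token_id] * (self.max_length - len(ids)) for ids in encoded
        ]

    def __getitem__(self, index):
        tokens = torch.tensor(self.encoded_texts[index], dtype=torch.long)
        label = torch.tensor(int(self.data["Label"].iloc[index]), dtype=torch.long)
        return tokens, label

    def __len__(self):
        return len(self.data)


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one integer id per character."""

    def encode(self, text):
        return [ord(ch) % 100 for ch in text]


# Demo on a two-row CSV file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    pd.DataFrame({"Label": [0, 1], "Text": ["hi", "free cash"]}).to_csv(csv_path, index=False)
    dataset = SpamDataset(csv_path, None, ToyTokenizer(), 50256)

tokens, label = dataset[0]
print(dataset.max_length, tokens, label)
```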

  6. Create SpamDataset objects for your train, validation, and test datasets. Set max_length to None for the train dataset, and pass an explicit max_length for the validation and test datasets (for example, the length computed for the train dataset) so that all datasets produce vectors of the same size.

  7. Create data loaders for each dataset. Use num_workers=0 and batch_size=8 for all of them. Note that we set shuffle=True and drop_last=True only for the train data loader. Print the number of batches for each data loader.
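
To illustrate the effect of `drop_last`, here is a small sketch that uses a `TensorDataset` of 20 random samples as a stand-in for a SpamDataset. The toy data and variable names are ours. With a batch size of 8, the train loader drops the incomplete last batch while the validation loader keeps it.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(123)

# Toy stand-in for a SpamDataset: 20 samples of 5 token ids each, with binary labels.
toy_dataset = TensorDataset(
    torch.randint(0, 50257, (20, 5)),
    torch.randint(0, 2, (20,)),
)

train_loader = DataLoader(
    toy_dataset, batch_size=8, shuffle=True, drop_last=True, num_workers=0
)
val_loader = DataLoader(
    toy_dataset, batch_size=8, shuffle=False, drop_last=False, num_workers=0
)

print(len(train_loader))  # 2: the incomplete last batch of 4 samples is dropped
print(len(val_loader))    # 3: the incomplete last batch is kept

inputs, targets = next(iter(train_loader))
print(inputs.shape, targets.shape)
```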

  8. Reuse the load_weights_into_gpt function (see previous practical) to load the pretrained weights into a new GPT model with the following config:

    {'vocab_size': 50257,
     'context_length': 1024,
     'emb_dim': 768,
     'n_heads': 12,
     'n_layers': 12,
     'drop_rate': 0.0,
     'qkv_bias': True}
    
  9. Can the base GPT2 model perform the classification using only a prompt? Try using the following:

    (
        "Is the following text 'spam'? Answer with 'yes' or 'no':"
        " 'You are a winner you have been specially"
        " selected to receive $1000 cash or a $2000 award.'"
    )
    
  10. To perform classification, we are going to freeze parts of the model and modify the structure of our model. Apply the following steps:

  11. Try the new model on several prompts. Does it work correctly?

    <aside> 👉🏻

    Expected output (on "Do you have time")

    Outputs:
     tensor([[[-1.5854,  0.9904],
             [-3.7235,  7.4548],
             [-2.2661,  6.6049],
             [-3.5983,  3.9902]]])
    Outputs dimensions: torch.Size([1, 4, 2])
    

    </aside>
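
The expected output above has shape [1, 4, 2], i.e., two scores for each of the four input tokens. One common way to obtain such a shape, shown here only as a hedged sketch, is to freeze the existing parameters and replace the output layer with a new linear layer that has two outputs. `TinyLM` and its attribute names (`tok_emb`, `body`, `out_head`) are hypothetical stand-ins for the GPT model from the previous practical, and the sizes are deliberately tiny. The numbers it prints will not match the expected output above.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for the GPT model: embedding, one layer, output head."""

    def __init__(self, vocab_size=100, emb_dim=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.body = nn.Linear(emb_dim, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size)

    def forward(self, idx):
        hidden = torch.relu(self.body(self.tok_emb(idx)))
        return self.out_head(hidden)


torch.manual_seed(123)
model = TinyLM()

# Freeze every existing parameter so that it is not updated during training.
for param in model.parameters():
    param.requires_grad = False

# Replace the vocabulary-sized head with a 2-class head.
# A freshly created layer is trainable by default.
model.out_head = nn.Linear(16, 2)

inputs = torch.tensor([[5, 17, 42, 8]])  # one sequence of 4 token ids
with torch.no_grad():
    outputs = model(inputs)

print("Outputs dimensions:", outputs.shape)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print("Trainable parameters:", trainable)
```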