In this practical, we will explore how to fine-tune our GPT2 model for classification and instruction following.
Students are encouraged to use larger versions of GPT2 if their hardware configuration allows it.
We are going to train a GPT model to perform binary classification of email messages (spam or not).
Download the following dataset, unzip it, and load the TSV file using pandas.
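A minimal loading sketch (the file name `SMSSpamCollection.tsv` and the column names `Label` and `Text` are assumptions; adjust them to the file you actually downloaded):

```python
import pandas as pd

# Assumed file name and column names: the raw TSV file typically has no header
# row, so we provide the column names ourselves.
df = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None, names=["Label", "Text"])
print(df.head())
```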
Explore the data. How many datapoints does it have? Is it balanced?
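For instance, reusing the column names assumed above:

```python
print("Number of datapoints:", len(df))
print(df["Label"].value_counts())  # per-class counts: is the dataset balanced?
```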
Balance the dataset using under-sampling, i.e., keep, for each class, only as many instances as there are in the least represented class.
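One possible way to under-sample without assuming which class is the minority one:

```python
# Under-sample: every class keeps only as many rows as the smallest class has.
min_count = df["Label"].value_counts().min()
balanced_df = (
    df.groupby("Label", group_keys=False)
      .apply(lambda g: g.sample(n=min_count, random_state=123))
)
print(balanced_df["Label"].value_counts())
```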
Shuffle (use `random_state=123` for reproducibility) and split the dataset into train, validation, and test sets with 70%, 10%, and 20% of the data respectively. Save those datasets into CSV files.
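A possible split, done directly with pandas slicing (the CSV file names are a choice, not imposed by the practical):

```python
# Shuffle the whole balanced dataset, then take 70% / 10% / 20% slices.
balanced_df = balanced_df.sample(frac=1, random_state=123).reset_index(drop=True)

train_end = int(0.7 * len(balanced_df))
val_end = train_end + int(0.1 * len(balanced_df))

train_df = balanced_df[:train_end]
val_df = balanced_df[train_end:val_end]
test_df = balanced_df[val_end:]

train_df.to_csv("train.csv", index=False)
val_df.to_csv("validation.csv", index=False)
test_df.to_csv("test.csv", index=False)
```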
Create a class `SpamDataset` which inherits from `Dataset`. The dataset should be initialized with the following call `SpamDataset(csv_file, max_length, tokenizer, pad_token_id)` where:

- `csv_file` is the path to the CSV file of the dataset.
- `max_length` is the maximum length of the tokenized vectors (longer vectors are cropped). If this is set to `None`, then all vectors are resized to the size of the longest vector.
- `tokenizer` is the tokenizer used to encode the strings.
- `pad_token_id` is the token used to pad shorter sentences. We will set it to the index of the `<|endoftext|>` token.

Create `SpamDataset` objects for your train, validation, and test datasets (see the sketch below). Set `max_length` to `None` for the train dataset, and set `max_length` for the test and validation datasets to get vectors of the same size.
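Below is one possible sketch of the class. It assumes the `Text`/`Label` column names used above, maps the labels to integers (`spam` → 1, `ham` → 0), and uses the `tiktoken` GPT-2 tokenizer; adapt these choices to your own setup.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class SpamDataset(Dataset):
    def __init__(self, csv_file, max_length, tokenizer, pad_token_id):
        self.data = pd.read_csv(csv_file)

        # Tokenize every text once, up front.
        self.encoded_texts = [tokenizer.encode(text) for text in self.data["Text"]]

        if max_length is None:
            # Resize everything to the length of the longest sequence.
            self.max_length = max(len(enc) for enc in self.encoded_texts)
        else:
            # Crop sequences that are longer than max_length.
            self.max_length = max_length
            self.encoded_texts = [enc[:max_length] for enc in self.encoded_texts]

        # Pad shorter sequences with pad_token_id up to max_length.
        self.encoded_texts = [
            enc + [pad_token_id] * (self.max_length - len(enc))
            for enc in self.encoded_texts
        ]

        # Assumption: labels are the strings "spam" / "ham"; map them to 1 / 0.
        self.labels = [1 if label == "spam" else 0 for label in self.data["Label"]]

    def __getitem__(self, index):
        return (
            torch.tensor(self.encoded_texts[index], dtype=torch.long),
            torch.tensor(self.labels[index], dtype=torch.long),
        )

    def __len__(self):
        return len(self.labels)
```

The three dataset objects can then be created as follows; note that the validation and test sets reuse the maximum length computed on the train set so that all vectors have the same size:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
# Index of the <|endoftext|> token, used for padding (50256 for GPT-2).
pad_token_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

train_dataset = SpamDataset("train.csv", None, tokenizer, pad_token_id)
val_dataset = SpamDataset("validation.csv", train_dataset.max_length, tokenizer, pad_token_id)
test_dataset = SpamDataset("test.csv", train_dataset.max_length, tokenizer, pad_token_id)
```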
Create data loaders for each dataset. Use `num_workers=0` and `batch_size=8` for all of them. Note that we set `shuffle=True` and `drop_last=True` only for the train data loader. Print the number of batches for each data loader.
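A sketch using PyTorch's `DataLoader` (the seed is a choice that makes the shuffled training batches reproducible):

```python
import torch
from torch.utils.data import DataLoader

torch.manual_seed(123)  # makes the shuffled training batches reproducible

batch_size = 8
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True, num_workers=0)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, drop_last=False, num_workers=0)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, drop_last=False, num_workers=0)

print(f"{len(train_loader)} training batches")
print(f"{len(val_loader)} validation batches")
print(f"{len(test_loader)} test batches")
```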
Reuse the `load_weights_into_gpt` function (see the previous practical) to load the pre-trained weights into a new GPT model with the following config:
{'vocab_size': 50257,
'context_length': 1024,
'emb_dim': 768,
'n_heads': 12,
'n_layers': 12,
'drop_rate': 0.0,
'qkv_bias': True}
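One way to build and initialize the model, assuming a `GPTModel` class and the `download_and_load_gpt2` / `load_weights_into_gpt` helpers from the previous practical (the names may differ in your own code):

```python
CONFIG = {
    "vocab_size": 50257, "context_length": 1024, "emb_dim": 768,
    "n_heads": 12, "n_layers": 12, "drop_rate": 0.0, "qkv_bias": True,
}

# GPTModel, download_and_load_gpt2 and load_weights_into_gpt are assumed to be
# available from the previous practical; adapt the names to your own code.
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

model = GPTModel(CONFIG)
load_weights_into_gpt(model, params)
model.eval()
```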
Can the base GPT2 model perform the classification using only a prompt? Try using the following:
(
"Is the following text 'spam'? Answer with 'yes' or 'no':"
" 'You are a winner you have been specially"
" selected to receive $1000 cash or a $2000 award.'"
)
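You can feed this prompt to your text-generation helper from the previous practical, along these lines (`generate_text_simple`, `text_to_token_ids` and `token_ids_to_text` are assumed names). The un-tuned base model typically does not answer with a clean 'yes' or 'no', which is what motivates fine-tuning.

```python
text = (
    "Is the following text 'spam'? Answer with 'yes' or 'no':"
    " 'You are a winner you have been specially"
    " selected to receive $1000 cash or a $2000 award.'"
)

# generate_text_simple, text_to_token_ids and token_ids_to_text are assumed
# helpers from the previous practical.
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(text, tokenizer),
    max_new_tokens=23,
    context_size=CONFIG["context_length"],
)
print(token_ids_to_text(token_ids, tokenizer))
```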
To perform classification, we are going to freeze parts of the model and modify its structure. Apply the following steps:
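The exact steps are not reproduced here; as a rough sketch of one common approach (an assumption, not necessarily the steps of the practical): freeze all pre-trained parameters, replace the output head with a two-class linear layer, and optionally make the last transformer block and the final normalization layer trainable again.

```python
import torch

torch.manual_seed(123)

# 1. Freeze every parameter of the pre-trained model.
for param in model.parameters():
    param.requires_grad = False

# 2. Replace the language-modeling head with a 2-class classification head.
#    (out_head, trf_blocks and final_norm are assumed attribute names from a
#    typical GPT implementation; adapt them to your own model.)
num_classes = 2
model.out_head = torch.nn.Linear(CONFIG["emb_dim"], num_classes)

# 3. Optionally make the last transformer block and the final LayerNorm
#    trainable again, so more than just the new head can adapt during training.
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True
for param in model.final_norm.parameters():
    param.requires_grad = True
```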
Try the new model on several prompts. Does it work correctly?
<aside> 👉🏻
Expected output (on "Do you have time")
Outputs:
tensor([[[-1.5854, 0.9904],
[-3.7235, 7.4548],
[-2.2661, 6.6049],
[-3.5983, 3.9902]]])
Outputs dimensions: torch.Size([1, 4, 2])
</aside>
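For reference, an output of that shape can be obtained along these lines (the exact values depend on the random initialization of the new head); the classification decision would then be based on the logits of the last token:

```python
# "Do you have time" is 4 tokens with the GPT-2 tokenizer, hence shape (1, 4).
inputs = torch.tensor(tokenizer.encode("Do you have time")).unsqueeze(0)

with torch.no_grad():
    outputs = model(inputs)

print("Outputs:\n", outputs)
print("Outputs dimensions:", outputs.shape)  # expected: torch.Size([1, 4, 2])
```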