GPT-2

This assignment focuses on the GPT-2 pretrained language model, described in the paper Language Models are Unsupervised Multitask Learners, and asks you to compare the results of your own Transformer language model and a pretrained GPT-2 model on a language modeling task.

Getting Started

For this assignment, you will need to install the Hugging Face Transformers package (formerly known as Pytorch-Transformers). To do so, run pip install transformers. Information on the functions you will likely need for this assignment can be found in the Transformers documentation.
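As a quick check that the installation works, the snippet below (a minimal sketch) loads the small GPT-2 checkpoint, named "gpt2" on the Hugging Face model hub; the vocabulary and weights are downloaded on first use:

    # Verify that transformers is installed and can fetch GPT-2.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    print(model.config.n_layer, model.config.n_embd)  # small GPT-2: 12 layers, 768 dims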

Stencil code (PyTorch) is available in the course directory at /course/cs1460/asgn/gpt2. You can copy this folder into your current directory by running (s)cp -r /course/cs1460/asgn/gpt2 ./ (replace the last argument with a different path to copy the files to that directory instead).

The dataset for this assignment is drawn from the Penn Treebank.

Recall that in the last assignment (Transformer) you were asked to create a language model using a masked Transformer encoder/decoder. This assignment aims to compare the performance of a Transformer language model trained from scratch against that of a pretrained GPT-2 model.

If you aren't sure whether your implementation from the last assignment is correct, you can instead use the Transformer modules that come with PyTorch; a minimal sketch follows below. You can also try to train GPT-2 from scratch for extra credit.
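If you take the built-in route, a language model can be assembled from nn.TransformerEncoder with a causal attention mask. The sketch below is one possible arrangement, not a required architecture; all hyperparameters (model width, layer count, context length) are illustrative:

    import torch
    import torch.nn as nn

    class TransformerLM(nn.Module):
        """Decoder-style language model built from PyTorch's stock modules."""

        def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4, max_len=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.pos = nn.Embedding(max_len, d_model)  # learned positional embeddings
            layer = nn.TransformerEncoderLayer(d_model, nhead)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, x):
            # x: (batch, seq_len) token ids
            seq_len = x.size(1)
            positions = torch.arange(seq_len, device=x.device)
            h = self.embed(x) + self.pos(positions)
            # Causal mask: position i may only attend to positions <= i.
            mask = torch.triu(
                torch.full((seq_len, seq_len), float("-inf"), device=x.device),
                diagonal=1,
            )
            h = self.encoder(h.transpose(0, 1), mask=mask)  # (seq, batch, d_model)
            return self.out(h.transpose(0, 1))              # (batch, seq, vocab)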

Train From Scratch

In this part, you will be focusing on the performance of your transformer language model:

  1. Tokenize the train/test dataset with the GPT-2 tokenizer (see the preprocessing sketch after this list).
  2. Train the language model to set a baseline.
  3. Evaluate the perplexity of your model on the test dataset and record your results.
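For step 1, one possible preprocessing sketch is below. The file paths and window length are assumptions; point them at your own copy of the PTB splits:

    import torch
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def tokenize_file(path, seq_len=128):
        """Tokenize a PTB text file into fixed-length (input, label) windows."""
        with open(path) as f:
            ids = tokenizer.encode(f.read())  # one long stream of token ids
        windows = [ids[i:i + seq_len + 1]
                   for i in range(0, len(ids) - seq_len - 1, seq_len)]
        batch = torch.tensor(windows)       # (num_windows, seq_len + 1)
        # Labels are the inputs shifted left by one token (next-token prediction).
        return batch[:, :-1], batch[:, 1:]

    # Hypothetical paths; adjust to wherever your PTB files live.
    train_inputs, train_labels = tokenize_file("data/ptb.train.txt")
    test_inputs, test_labels = tokenize_file("data/ptb.test.txt")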

Pre-trained GPT-2 Model

In this part, you will be focusing on evaluating the performance of the pretrained GPT-2 model:

  1. Tokenize the test dataset with the GPT-2 tokenizer.
  2. Load a pretrained GPT-2 model.
  3. Evaluate the perplexity of the pretrained model on the test dataset and record your results (see the evaluation sketch after this list).

Compare the results of the two models and answer the questions in the README.
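Steps 2 and 3 can be sketched as below. This assumes a recent transformers version, where the model output exposes a .loss attribute; when labels are passed, GPT2LMHeadModel shifts them internally, so the labels can simply be the input ids:

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def gpt2_perplexity(path, seq_len=512):
        """Perplexity of pretrained GPT-2 over a text file, chunk by chunk."""
        with open(path) as f:
            ids = tokenizer.encode(f.read())
        total_loss, total_tokens = 0.0, 0
        with torch.no_grad():
            for i in range(0, len(ids) - 1, seq_len):
                chunk = torch.tensor([ids[i:i + seq_len]])
                # The returned loss is the mean cross-entropy over this chunk.
                loss = model(chunk, labels=chunk).loss
                n = chunk.size(1) - 1  # number of predicted tokens
                total_loss += loss.item() * n
                total_tokens += n
        return math.exp(total_loss / total_tokens)

    # Hypothetical path; adjust to your PTB test split.
    print(gpt2_perplexity("data/ptb.test.txt"))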

Handin & Evaluation

Run cs146_handin gpt2 to submit all contents of your current directory as a handin. This directory should contain all relevant files; for the sake of space, do not include the dataset in your handin. In addition to the Python scripts, include your README.

In your README, please note down:
  • The URL of the training and testing comet.ml record,
  • The URL of the pretrained GPT-2 model comet.ml record,
  • A brief description of your rationale behind the hyperparameters used,
  • Your perplexity scores for your model and for the pretrained GPT-2 model.

As a sanity check, your model should reach a perplexity of less than 400. Try to achieve as low a number as possible; there is no GPU time limit for this assignment. We will not grade the performance of your model, but your reasoning about the results of the comparison.

If you were unable to get good results from the models you developed, further write down in your README:
  • How have you verified that the data preprocessing (i.e. the inputs and labels) is correct? (A spot-check sketch follows this list.)
  • How have you verified that the problem does not lie in the model? (Can you argue, with references, that your model exactly follows the transformer model? What lines correspond to which part of the model?)
  • How have you verified that it isn't just a simple hyperparameter issue? (Have you tried experimenting with them? Do you have a report / table / chart demonstrating that you comprehensively searched the space?)
  • Can you document to what degree you have followed the instructions for the pretrained model, with references?
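For the first question, a quick spot check (a sketch, reusing the (input, label) windows from the tokenize_file sketch above) is to decode a pair and confirm the labels are the inputs shifted by one token:

    import torch
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def check_alignment(x, y):
        """x, y: one (input, label) window, e.g. train_inputs[0], train_labels[0]."""
        print(tokenizer.decode(x[:20].tolist()))
        print(tokenizer.decode(y[:20].tolist()))
        # The second line should read as the first shifted left by one token,
        # and the ids should match exactly after the shift.
        assert torch.equal(x[1:], y[:-1])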

Going through these steps should help you identify potential issues in your code, if any. Even if you are unable to find the issue in time, the report you create in the process will demonstrate that you have spent substantial effort on this project, and you will thus likely receive partial credit for the performance score.

In accordance with the course grading policy, your assignment should not have any identifiable information on it, including your Banner ID, name, or cslogin.