Transformers

In this assignment, you will use the transformer architecture, as described in the Attention Is All You Need paper, to build a new language model.

Getting Started

Stencil code (PyTorch) is available in the course directory at /course/cs1460/asgn/transformer. You can copy this folder into your current directory by running (s)cp -r /course/cs1460/asgn/transformer ./ (replace the last argument with a different directory to copy the files there instead).

The dataset for this assignment comes from the Penn Treebank.
Recall that the first assignment (Parse Reranker) required creating a language model using an LSTM. This assignment has the same goal (a language model), but uses a different architecture, one that does not rely on RNNs: the Transformer. Make sure to read carefully through the paper (linked above) before continuing with the assignment.

In addition, you should NOT use the transformer modules in the PyTorch library for this assignment. If you are unsure whether a specific module is allowed, please ask the TAs. Moreover, you may not simply copy the transformer code from CS1470; recall that much of that code was written by the TAs and not by you.

Self Attention

In this part, you will focus on Scaled Dot-Product Attention (described in Section 3.2.1 of the paper). In general, the steps are as follows (a rough sketch in code follows the list):

  1. Convert the input into input embeddings (resulting in a window_sz x embedding_sz matrix).
  2. Compute the query, key, and value vectors. While not explicitly described in the paper, this can be achieved by multiplying the embeddings by three separate learned weight matrices (e.g. to compute the query vectors, multiply the embeddings by a w_query matrix).
  3. Compute the attention function, using these three vectors, to get the weighted input. (Refer to equation 1 in Section 3.2.1 of the paper.)
  4. Apply softmax at the end and add two feed-forward layers.
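
To make the steps above concrete, here is a minimal, hypothetical sketch of steps 2-3 in PyTorch, with a causal mask added for language modeling. The names (SingleHeadAttention, embedding_sz, head_sz, window_sz) are placeholders rather than the stencil's interface, so adapt it to your own code:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, embedding_sz, head_sz):
        super().__init__()
        # Three separate learned projections for queries, keys, and values.
        self.w_query = nn.Linear(embedding_sz, head_sz, bias=False)
        self.w_key = nn.Linear(embedding_sz, head_sz, bias=False)
        self.w_value = nn.Linear(embedding_sz, head_sz, bias=False)

    def forward(self, x):
        # x: (window_sz, embedding_sz) -- one window of input embeddings.
        q = self.w_query(x)   # (window_sz, head_sz)
        k = self.w_key(x)     # (window_sz, head_sz)
        v = self.w_value(x)   # (window_sz, head_sz)

        # Equation 1: softmax(QK^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))

        # Causal mask: position i may not attend to positions j > i, so those
        # scores are set to -inf before the softmax.
        window_sz = x.size(0)
        mask = torch.triu(torch.ones(window_sz, window_sz,
                                     dtype=torch.bool, device=x.device),
                          diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))

        weights = F.softmax(scores, dim=-1)   # attention distribution per position
        return weights @ v                    # weighted sum of the values

Masking out illegal (future) positions before the softmax is what keeps the language model from conditioning on the tokens it is supposed to predict.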

Multi-Headed Attention

In this part, you will focus on Multi-Headed Attention (described in Section 3.2.2 of the paper).

Refer to Figure 2 of the paper. In particular, the changes are as follows (a sketch follows the list):
  1. Instead of having 3 learned weight matrices to create the query/key/value vectors, have 3h learned weight matrices to create h sets of query/key/value vectors.
  2. After computing the attention function for each of the heads, concatenate the resulting matrices together.
  3. Finally, compute a linear transformation to condense the dimensions back to the original dimensions.
In PyTorch, you can use torch.nn.ModuleList to create a list of submodules. If you simply put your modules in a plain Python list, their parameters will not be registered with the parent module, so they will not be trained as intended.
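
As one possible illustration of this wiring (building on the hypothetical SingleHeadAttention sketch above, with placeholder names such as num_heads), the heads can be stored in an nn.ModuleList and their outputs concatenated before the final linear projection:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embedding_sz, num_heads):
        super().__init__()
        assert embedding_sz % num_heads == 0
        head_sz = embedding_sz // num_heads
        # h independent heads, each with its own query/key/value weights,
        # registered via nn.ModuleList so their parameters are trained.
        self.heads = nn.ModuleList(
            [SingleHeadAttention(embedding_sz, head_sz) for _ in range(num_heads)]
        )
        # Final linear layer that condenses the concatenation back to embedding_sz.
        self.w_out = nn.Linear(embedding_sz, embedding_sz)

    def forward(self, x):
        # Run every head, concatenate along the feature dimension, then project.
        concatenated = torch.cat([head(x) for head in self.heads], dim=-1)
        return self.w_out(concatenated)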

Transformer Module

In this part, you will implement the encoder as a whole. (Note that we will not be implementing the decoder, as it is not required for a language model.) The remaining pieces to implement are as follows (a rough sketch in code follows the list):

  1. Adding residual connections to the network. These are described in Section 3.1 and Figure 1 of the paper. Specifically, you will want to add and normalize the sentence embeddings with the attention output, and then add and normalize the attention output with the feed-forward layer output. For more details on how residual connections work, refer to this paper.
  2. Adding positional encoding. Refer to Section 3.5 of the paper. For this assignment, you can follow the paper and implement a fixed positional encoding (sine and cosine functions), or simply use a learned encoding.
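
For reference, here is a rough, hypothetical sketch of both pieces: a fixed sinusoidal positional encoding and an encoder block that adds and normalizes around the attention and feed-forward sublayers. The module and argument names (EncoderBlock, hidden_sz, the attention module passed in) are assumptions, not the stencil's:

import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(window_sz, embedding_sz):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Assumes embedding_sz is even.
    position = torch.arange(window_sz).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, embedding_sz, 2).float()
                         * (-math.log(10000.0) / embedding_sz))
    pe = torch.zeros(window_sz, embedding_sz)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

class EncoderBlock(nn.Module):
    def __init__(self, attention, embedding_sz, hidden_sz):
        super().__init__()
        # `attention` is any self-attention module mapping
        # (window_sz, embedding_sz) -> (window_sz, embedding_sz),
        # e.g. the MultiHeadAttention sketched earlier.
        self.attention = attention
        self.feed_forward = nn.Sequential(
            nn.Linear(embedding_sz, hidden_sz),
            nn.ReLU(),
            nn.Linear(hidden_sz, embedding_sz),
        )
        self.norm1 = nn.LayerNorm(embedding_sz)
        self.norm2 = nn.LayerNorm(embedding_sz)

    def forward(self, x):
        # Add & normalize the embeddings with the attention output ...
        x = self.norm1(x + self.attention(x))
        # ... then add & normalize that result with the feed-forward output.
        return self.norm2(x + self.feed_forward(x))

The positional encoding is simply added to the input embeddings before the first encoder block (e.g. x = embeddings + sinusoidal_positional_encoding(window_sz, embedding_sz)).
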
Furthermore, in your README, please answer the following questions:
  • What are the benefits (and possibly disadvantages) of using a transformer network as opposed to an RNN network for language modeling?
  • What are the purposes of each of the three vectors (query, key, value) in Scaled Dot-Product Attention? (Hint: think about why they are named that way and how they are used to produce an attention-based output. Alternatively, if you are more familiar with the traditional notation for attention mechanisms, you can explain in those terms instead.)
  • What is the purpose of using multiple heads for attention, instead of just one? (Hint: the paper talks about this!)
  • What is the purpose of positional encoding, and why would a sinusoid function work for this purpose? Are there other functions that may work for this purpose as well?

Once you have developed your model, train it and report the perplexity on the test set.
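
One common way to compute test-set perplexity is to exponentiate the average per-token cross-entropy. The sketch below assumes a hypothetical model that returns per-token logits and a standard (inputs, labels) test loader, so adjust it to match your own code:

import math
import torch
import torch.nn as nn

def evaluate_perplexity(model, test_loader, device):
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    total_loss, total_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)   # (batch, window_sz, vocab_sz)
            total_loss += loss_fn(logits.reshape(-1, logits.size(-1)),
                                  labels.reshape(-1)).item()
            total_tokens += labels.numel()
    # Perplexity is exp of the average per-token cross-entropy.
    return math.exp(total_loss / total_tokens)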

Handin & Evaluation

Run cs146_handin transformer to submit all contents in your current directory as a handin. This directory should contain all relevant files, but for the sake of space, do not include the dataset in your handin. In addition to the Python scripts, include your README.

In your README, please note down:
  • A brief description talking about your rationale behind the hyperparameters used,
  • Your perplexity scores for the model.

As a sanity check, the model should reach a perplexity below 400 on the test set (though you should also be suspicious if it drops too low, e.g. below 50). Training should take at most 20 minutes in total on the GPU.

If you were unable to get good results from the models developed, further write down in your README:
  • How have you verified that the data preprocessing (i.e. the input and labels) is correct (including how you masked the data)?
  • How have you verified that the problem does not lie in the model? (Can you argue, with references, that your model exactly follows the transformer model? What lines correspond to which part of the model?)
  • How have you verified that it isn't just a simple hyperparameter issue? (Have you tried experimenting with the hyperparameters? Do you have a report / table / chart demonstrating that you have comprehensively searched the space yet were still unable to reach the target score?)
  • With the above, are you confident that you can make a case that the paper is simply not replicable? If you believe that your implementation is the issue instead (but chose to submit anyway, e.g. because you ran out of time), what do you believe you could have done or experimented with to further improve or fix your implementation?

Going through these steps should hopefully help you identify the potential issues in your code, if any. Even if you were unable to find the issue in time, the report that you have created at this point would demonstrate that you have spent substantial effort on this project, and you will thus likely receive partial credit for the performance score.

In accordance with the course grading policy, your assignment should not have any identifiable information on it, including your Banner ID, name, or cslogin.