BERT

In this assignment, you will build on the transformer architecture to implement a simplified version of BERT (Bidirectional Encoder Representations from Transformers), as described in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

Getting Started

Stencil code is available in the course directory at /course/cs1460/asgn/bert. You can copy this folder into your current directory by running (s)cp -r /course/cs1460/asgn/bert ./ (replace the last argument with a different path to copy the files elsewhere).

You will need to modify all four Python files in the stencil. The bert.py file is the main script for running your model as you work through the assignment.

The dataset for this assignment is the same as in the previous assignment (the Penn Treebank). If you are developing on a department machine, you will need to activate a virtual environment to import the deep learning libraries, by typing source /course/cs1460/public/cs146-env/bin/activate or source /course/cs146/public/cs146-conda. To deactivate the environment, type deactivate or conda deactivate, respectively.

Train BERT

Since you have already implemented your own Transformer model, you may use PyTorch's built-in transformer modules in this project. However, you may not use the Hugging Face transformers library.
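For reference, a minimal sketch of such a model built from PyTorch's nn.TransformerEncoder is shown below. All names and hyperparameter values here (hidden size, number of heads and layers, maximum length) are illustrative assumptions, not part of the stencil:

    import torch
    import torch.nn as nn

    class MiniBERT(nn.Module):
        """A small BERT-style encoder: token + position embeddings,
        a stack of encoder blocks, and a language-modeling head."""

        def __init__(self, vocab_size, d_model=256, nhead=4,
                     num_layers=4, dim_feedforward=512, max_len=128):
            super().__init__()
            self.tok_embed = nn.Embedding(vocab_size, d_model)
            self.pos_embed = nn.Embedding(max_len, d_model)
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead,
                dim_feedforward=dim_feedforward,
                dropout=0.1, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, input_ids):
            # input_ids: (batch, seq_len). No triangular mask is passed to
            # the encoder: every position may attend to every other position.
            positions = torch.arange(input_ids.size(1), device=input_ids.device)
            x = self.tok_embed(input_ids) + self.pos_embed(positions)
            hidden = self.encoder(x)        # (batch, seq_len, d_model)
            return self.lm_head(hidden)     # (batch, seq_len, vocab_size)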

In this project, you will be reimplementing the BERT paper:

  1. Create a transformer-based model with multiple stacked encoder blocks.
  2. Implement Task #1 (masked LM), as described in Section 3.3.1. Specifically, you will implement the masking differently from the Transformer assignment: instead of building a triangular matrix, you randomly select 15% of the tokens for prediction. See the paper for details on how the selected tokens are masked; one possible masking routine is sketched after this list.
  3. Once trained, save the model. (You will have to load the model in the next checkpoint.)
Note that in this assignment you are not implementing Task #2 of pre-training (next sentence prediction, Section 3.3.2). When preparing the dataset, you can either concatenate the entire corpus and split it into fixed-length windows, or pad all sequences to the same length.
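As a starting point, here is one way the 80%/10%/10% scheme from Section 3.3.1 could be implemented. The function name, the -100 ignore-index convention, and the assumption that special tokens have already been filtered out are all ours, not the stencil's:

    import torch

    def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15):
        """Select ~15% of tokens for prediction; of those, replace
        80% with [MASK], 10% with a random token, 10% unchanged."""
        labels = input_ids.clone()
        # Choose which positions the model must predict. In practice you
        # should exclude padding/special tokens from this selection.
        selected = torch.rand(input_ids.shape) < mlm_prob
        labels[~selected] = -100            # ignored by nn.CrossEntropyLoss

        corrupted = input_ids.clone()
        # 80% of the selected positions become [MASK] ...
        masked = selected & (torch.rand(input_ids.shape) < 0.8)
        corrupted[masked] = mask_id
        # ... half of the remainder (10% overall) become a random token ...
        random_pos = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
        corrupted[random_pos] = torch.randint(vocab_size, input_ids.shape)[random_pos]
        # ... and the final 10% keep their original token.
        return corrupted, labels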

Once you have developed your model, train it and report the perplexity and accuracy on the test set. Make sure you save your trained model's weights. To do this, you can run python bert.py -Tts TRAIN_FILE TEST_FILE.
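If helpful: perplexity is just the exponential of the average cross-entropy over the predicted (masked) positions, and accuracy is the fraction of those positions where the argmax prediction matches the original token. A hedged sketch, reusing the -100 convention from above (names and file paths are illustrative):

    import torch
    import torch.nn.functional as F

    def evaluate_batch(logits, labels):
        """logits: (batch, seq_len, vocab); labels: (batch, seq_len),
        with -100 at positions that were not selected for prediction."""
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               labels.reshape(-1), ignore_index=-100)
        perplexity = torch.exp(loss)
        predicted = labels != -100
        correct = (logits.argmax(dim=-1) == labels) & predicted
        accuracy = correct.sum().float() / predicted.sum().float()
        return loss, perplexity, accuracy

    # Saving and later reloading the trained weights:
    # torch.save(model.state_dict(), "bert_weights.pt")
    # model.load_state_dict(torch.load("bert_weights.pt"))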

BERT Embedding Analysis

In this part, you will focus on evaluating the model. Diverging from the paper, we will use a more straightforward approach: analyzing the embedding vectors of words with ambiguous context.

Begin by collecting every sentence in the corpus that contains one of the following words (these words appear frequently in the corpus and have multiple meanings); a simple collection helper is sketched after this list:
  • Figure (as well as figured, figures)
  • State (as well as stated, states)
  • Bank (as well as banks)
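One way to collect these sentences, assuming the corpus is a list of tokenized sentences (the function name is illustrative):

    def sentences_with(corpus, forms):
        """Return (sentence, position) pairs for every occurrence of any
        surface form, e.g. forms = {"bank", "banks"}."""
        forms = {f.lower() for f in forms}
        hits = []
        for sent in corpus:                 # sent: list of token strings
            for i, tok in enumerate(sent):
                if tok.lower() in forms:
                    hits.append((sent, i))  # keep the index so the right
        return hits                         # output vector can be read off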

For a given word and its list of sentences, feed the sentences through the BERT model and retrieve the output embedding vector of the given word in each sentence. Finally, in embedding_analysis.py, use the sentences and output embeddings to plot the embeddings in 2D space (one possible projection is sketched below). To do this, you can run python bert.py -la TRAIN_FILE TEST_FILE.
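For the 2D plot, one reasonable option (not mandated by the assignment) is a PCA or t-SNE projection via scikit-learn followed by a matplotlib scatter plot. The sketch below assumes embeddings is an (n, d_model) NumPy array holding the word's output vector from each of its n sentences:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    def plot_embeddings_2d(embeddings, word, out_path="plot.png"):
        """Project one word's contextual embeddings to 2D and scatter-plot them."""
        points = PCA(n_components=2).fit_transform(embeddings)
        plt.figure()
        plt.scatter(points[:, 0], points[:, 1], s=10)
        plt.title("Contextual embeddings of '%s'" % word)
        plt.savefig(out_path)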

You should also pick at least one additional word (together with its other forms) that has multiple meanings.

In your README, please answer the following questions:
  • What are the advantages (and possible disadvantages) of using BERT to create word representations, compared to other methods such as the embedding matrix that we have used throughout the semester?
  • What is the purpose of masking in the way described in the paper, compared to the masking done in the previous assignment? Furthermore, why do we replace selected words with the mask token only 80% of the time?
  • Suppose that you will adapt your model for the SWAG (Situations With Adversarial Generations) dataset, that is, answering a multiple-choice question given an input sentence (for more details, see Section 4.4). List the modifications necessary (in terms of model architecture, data preprocessing, and the training process) to achieve this. (Hint: begin by considering Task #2 of the original BERT model, described in Section 3.3.2.)

Handin & Evaluation

Run cs146_handin bert to submit all contents of your current directory as a handin. This directory should contain all relevant files, but for the sake of space, do not include the dataset in your handin. In addition to the Python scripts, include your README.

In your README, please note down:
  • The link to your comet.ml project page and the hashes of your two submission experiments, both the training/testing run and the analysis run. Make sure that you set the permissions of your comet.ml project to public (it should already be public if you are using your anonymous-ID workspace rather than your default workspace).
  • A brief description of the rationale behind your chosen hyperparameters.
  • A discussion of the embedding plots, for each word. Are there any discernible patterns? Does the distribution seem random?

Training and evaluating the model should take at most 20 minutes in total.

If you were unable to get good results from your models (e.g. the model loss is not decreasing), additionally write down in your README:
  • How have you verified that the data preprocessing (i.e. the input and labels) is correct?
  • How have you verified that the problem does not lie in the model? (Can you argue, with references, that your model exactly follows the paper? Which lines correspond to which part of the model?)
  • How have you verified that it isn't just a simple hyperparameter issue? (Have you tried experimenting with them? Do you have a report / table / chart demonstrating that you have comprehensively searched the space yet were still unable to reach the target score?)
  • With the above, are you confident you can make a case that the paper is simply not replicable? If you believe that your implementation is the issue instead (but chose to submit anyway, for reasons such as running out of time), what do you believe you could have done or experimented with to further improve or fix your implementation?

Going through these steps should help you identify the potential issues in your code, if any. Even if you are unable to find the issue in time, the report you create in the process will demonstrate that you have spent substantial effort on this project, and you will thus likely receive partial credit for the performance score. Please also provide us with the hashes of the experiments you describe in the README.

In accordance with the course grading policy, your assignment should not have any identifiable information on it, including your Banner ID, name, or cslogin. This means that you should use your anonymous ID to name your workspace.