Parse Reranker

For this assignment, you will be implementing the LSTM-LM model, as described in the Parsing as Language Modeling paper, to create a parse reranking model. You will first train the LSTM model as a language model, and then adapt it to be used as a reranker.

Getting Started

Stencil code (PyTorch) is available in the course directory at /course/cs1460/asgn/parse_reranker. You can copy this folder into your current directory by running (s)cp -r /course/cs1460/asgn/parse_reranker ./ (replace the last argument with a different directory to copy the files there instead).

In your stencil, you will need to modify all three Python files. The parse_reranker.py file is the main script you will run as you work through the assignment.

Here are some data files you will need:
Before beginning the assignment, make sure to read and fully understand the paper. Although the paper covers a lot, for this assignment you will only be implementing a subset of it: just the LSTM-LM model, described in Section 2.1 and Section 3. Furthermore, instead of the data described in Section 4, you will be using the smaller dataset mentioned in the above paragraph.

Environment Disclaimer

For this assignment (and all future ones), we ask that you use our conda/pip environment (PyTorch 1.4, TensorFlow 2.1). If you want to use PyTorch, you can refer to the Google Colab notebook from the PyTorch lab.

If you are developing from a department machine, you will need to use a virtual environment to access the deep learning libraries, by typing source /course/cs1460/public/cs146-env/bin/activate or source /course/cs146/public/cs146-conda. To deactivate this virtual environment, simply type deactivate or conda deactivate respectively.

If you are developing on your own machine or on GCP, please use the environment setup we provide you with. Please refer to the environment guide for further instructions.

Train LM

You should first develop your LSTM-LM model. This is the classical LSTM language model that you are familiar with. Once your model structure is set, start preparing your data for training.
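As a starting point, here is a minimal sketch of what such a model could look like in PyTorch. The names and default values (vocab_size, embedding_size, hidden_size, and so on) are illustrative assumptions, not values taken from the stencil or the paper.

    import torch
    import torch.nn as nn

    class LSTMLM(nn.Module):
        """A minimal LSTM language model: embed tokens, run an LSTM, project to vocab logits."""

        def __init__(self, vocab_size, embedding_size=256, hidden_size=256,
                     num_layers=1, dropout=0.3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_size)
            self.lstm = nn.LSTM(embedding_size, hidden_size,
                                num_layers=num_layers, batch_first=True)
            self.dropout = nn.Dropout(dropout)
            self.output = nn.Linear(hidden_size, vocab_size)

        def forward(self, inputs, hidden=None):
            # inputs: (batch, seq_len) token ids; returns per-position logits over the vocabulary
            embedded = self.dropout(self.embedding(inputs))
            outputs, hidden = self.lstm(embedded, hidden)
            return self.output(self.dropout(outputs)), hidden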

Randomly split the dataset such that 90% of the data is used for training and 10% for validation. Train your model, and print out its perplexity on the validation set.
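One possible way to compute validation perplexity is sketched below. It assumes a DataLoader yielding (inputs, targets) id tensors, a padding id to ignore, and a model whose forward returns (logits, hidden) as in the sketch above; adapt it to however your stencil organizes the data.

    import math
    import torch
    import torch.nn as nn

    def validation_perplexity(model, val_loader, pad_id, device="cpu"):
        # Sum cross-entropy over all non-padding tokens, then exponentiate the per-token average.
        criterion = nn.CrossEntropyLoss(ignore_index=pad_id, reduction="sum")
        model.eval()
        total_loss, total_tokens = 0.0, 0
        with torch.no_grad():
            for inputs, targets in val_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                logits, _ = model(inputs)
                total_loss += criterion(logits.reshape(-1, logits.size(-1)),
                                        targets.reshape(-1)).item()
                total_tokens += (targets != pad_id).sum().item()
        return math.exp(total_loss / total_tokens)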

As a note, while you are welcome to implement the LSTM-LM model as described in the paper, you are encouraged to try out different versions of the model using different hyperparameters. Some hyperparameters you can try tweaking are:

  • The number of RNN layers to use (hint: we were able to achieve good, if not better, results with just a single LSTM layer),
  • Changing the LSTM parameters or replacing it with other layers (e.g. GRU, or something even more radical?),
  • Learning rates, dropout rates, RNN size, embedding size, etc.


Once you have finished structuring your model, train it and check its perplexity. Make sure to save the model weights. To do so, you can run python parse_reranker.py -Tvs TRAIN_FILE PARSE_FILE GOLD_FILE
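In PyTorch, saving the weights after training and restoring them before reranking can be as simple as the following (the filename is an arbitrary choice, not one the stencil prescribes):

    # After training:
    torch.save(model.state_dict(), "lstm_lm.pt")

    # Later, before reranking, rebuild the model with the same hyperparameters and load:
    model.load_state_dict(torch.load("lstm_lm.pt"))
    model.eval()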

Evaluate Reranker

For this part, you will adapt the model developed in the previous part to evaluate its F1 score as a reranker.

You can obtain the dataset of different parses of the same sentence here. This dataset has a total of 213 sentences; 212 of them have 50 different candidate parse trees each (one sentence does not have 50 variations because it is very short). Here is how to interpret the data:

  • The first line (with a single number) gives the number of parse trees for this sentence (50 in most cases).
  • The following lines each contain 1) the number of correct constituent tags in that parse (compared to the gold standard), 2) the total number of constituent tags in that parse, and 3) the parse itself (a reading sketch follows this list).
  • You can find the gold parses here.
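Under the assumption that each candidate line really is "<num_correct> <num_total> <parse>" separated by whitespace, a reader for this file might look like the sketch below; double-check the delimiters against the actual file before relying on it.

    def read_candidate_parses(path):
        """Return a list of sentences, each a list of (num_correct, num_total, parse) tuples."""
        sentences = []
        with open(path) as f:
            lines = iter(f)
            for count_line in lines:
                num_parses = int(count_line.strip())  # first line of each block: number of candidates
                candidates = []
                for _ in range(num_parses):
                    fields = next(lines).strip().split(None, 2)
                    candidates.append((int(fields[0]), int(fields[1]), fields[2]))
                sentences.append(candidates)
        return sentences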

You will be using your model to calculate the probability of each candidate parse (to see how, recall parts 1.1 and 1.2 from the paper). Once you have computed the probabilities of all parse trees for a given sentence, find the parse tree that the model has deemed the "correct" one (the one with the highest probability).
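One way to do this, assuming the parse has already been linearized and converted to token ids by some encode helper of your own (hypothetical here), is to sum the log-probabilities the LM assigns to each next token:

    import torch
    import torch.nn.functional as F

    def parse_log_prob(model, token_ids, device="cpu"):
        # token_ids: the linearized parse as a list of vocabulary ids, including start/end markers.
        ids = torch.tensor(token_ids, device=device).unsqueeze(0)   # (1, seq_len)
        with torch.no_grad():
            logits, _ = model(ids[:, :-1])                          # predict each next token
            log_probs = F.log_softmax(logits, dim=-1)
            targets = ids[:, 1:].unsqueeze(-1)
            token_log_probs = log_probs.gather(-1, targets).squeeze(-1)
        return token_log_probs.sum().item()

    # For each sentence, keep the candidate with the highest score, e.g.:
    # best = max(candidates, key=lambda c: parse_log_prob(model, encode(c[2])))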

To run testing on the model, you can run python parse_reranker.py -lt TRAIN_FILE PARSE_FILE GOLD_FILE

Once you have found the model's predictions for all 213 sentences, you can compute the precision and recall. If you can 'recall' the previous lectures, precision is the "number of correct constituent tags / number of constituent tags in the predicted parse", while recall is the "number of correct constituent tags / number of constituent tags in the gold parse". Once you have computed precision and recall, compute and print out your F1 score for this model.
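Summed over all sentences (one common choice is this micro-averaged form), the computation could look roughly like the sketch below; the tuple layout is an assumption carried over from the reading sketch above, with the gold counts taken from the gold parses.

    def corpus_f1(results):
        # results: one (num_correct, num_in_predicted, num_in_gold) tuple per sentence,
        # for the candidate parse your model selected.
        correct = sum(c for c, _, _ in results)
        predicted_total = sum(p for _, p, _ in results)
        gold_total = sum(g for _, _, g in results)
        precision = correct / predicted_total
        recall = correct / gold_total
        return 2 * precision * recall / (precision + recall)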

Handin & Evaluation

Run cs146_handin parse_reranker to submit all contents in your current directory as a handin. This directory should contain all relevant files, but for the sake of space, do not include the dataset in your handin. In addition to the Python scripts, include your README.

In your README, please include a brief description of your rationale behind the hyperparameters used. You should also briefly describe how you used your model to compute the probability of each parse tree (i.e. what modifications / additional code did you have to make?). If you have tried experimenting with hyperparameters, please also note down:

  • The link to your comet.ml project page and the hashes of your two submission experiments: the training/validation experiment and the testing/reranking experiment. Make sure that you set the permissions of your comet.ml project to public (it should be public already if you are using your anonymous id workspace rather than your default workspace).
  • What hyperparameter experiments you have tried.
  • For each experiment, what the results were.
  • A brief discussion comparing and contrasting the results between the experiments, and why the results are the way they are.

The professor was able to achieve a perplexity of ~10 and an F1 score of ~85 (out of 100), so aim for those numbers. Your model should finish training within 30 minutes. However, even if you were not able to achieve good results with any of the models you developed, you can still receive substantial partial credit by providing a thorough report describing multiple experiments, the rationale behind why each was attempted, why they did or did not work, and a discussion of why some worked better than others. Please also provide us with the hashes of these experiments you describe in the README.

In accordance with the course grading policy, your assignment should not have any identifiable information on it, including your Banner ID, name, or cslogin.