Machine Translation: BPE + Multilingual

In this assignment, you will:
  • Use a joint Byte Pair Encoding (BPE), as described in the paper Neural Machine Translation of Rare Words with Subword Units, to generate an extended vocabulary list from a corpus.
  • Train and evaluate a sequence-to-sequence machine translation model that translates French sentences to English using this newly generated vocabulary list.
  • Compare the results of training using the Byte Pair Encoded dataset to training using only standard preprocessing techniques.
  • Use the sequence-to-sequence model you have implemented to create a many-to-one translation model, as described in Google's Multilingual Neural Machine Translation System paper.

Getting Started

Stencil code (PyTorch) is available in the course directory at /course/cs1460/asgn/multilingual. You can copy this folder into your current directory by running (s)cp -r /course/cs1460/asgn/multilingual ./ (replace the last argument with a different directory to copy the files there).

In your stencil, you will need to modify all three python files:
  • model.py, where you will define the sequence-to-sequence model(s) needed for this assignment
  • preprocess.py, where you will implement the Byte Pair Encoding algorithm, as well as the standard preprocessing for the dataset(s) you will be using
  • multilingual.py, where you will create your experiments

Here are the data files you will need:
All datasets are tab separated text files of the format: "English translation\tOriginal sentence".

Byte Pair Encoding

One of the benefits of Byte Pair Encoding is that you can eliminate UNKs. Due to the nature of phonetic languages, you will not get many rare tokens if you break rare words into subword units. For machine translation, you will first want to parse the dataset so that you have the inputs (French sentences) and the outputs (English sentences).

Then, follow Section 3.2 of the paper to learn a joint BPE from the corpus. While the paper describes this process in detail, in general, the steps are as follows:
  1. Create a dictionary of all words in both the French and English corpora, counting the number of times each word appears in the corpus. Record each word as a sequence of characters.
  2. Count all symbol pairs, and find the most frequent symbol pair. Concatenate these two symbols to create a new symbol.
  3. Replace all instances of this symbol pair in the word list with the new symbol.
  4. Repeat steps 2 and 3 many, many times.
Running BPE to the paper's spec with a naive implementation would take hours on this dataset, so we ask you to run only 100 iterations, which should take less than a minute on the French to English dataset. The result you generate will be used for grading, and you will be provided dataset files that were Byte Pair Encoded to the specs in the paper. Extra credit will be given for improving the efficiency of this implementation. (HINT: Can you avoid re-computing the frequency of all pairs at each iteration? This is difficult, so consider it after you have finished the whole assignment.)
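The merge loop described above can be sketched as follows. This is a minimal, unoptimized version; the function names are our own illustration, not part of the stencil:

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    # lookarounds keep the match aligned to symbol (whitespace) boundaries
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def learn_bpe(word_counts, iterations):
    # word_counts: e.g. {"low": 5, "lower": 2}; record words as spaced characters
    vocab = {" ".join(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(iterations):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

Note how step 2's re-count happens from scratch every iteration; that is exactly the cost the extra-credit hint asks you to avoid.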

An implementation detail where we deviate from the original paper: rather than adding "</w>" to the vocabulary when learning BPE, it should be added afterwards, to maintain consistent handling of end-of-word tokens. This is an improvement the authors made after the paper was published. For example, you would learn on the word "e x a m p l e" and may get "ex am ple", which would then be transformed to "ex am ple </w>". The learner should not produce merges involving "</w>", such as "ex am ple </w>" and "ex am ple s</w>".

There is some Unicode weirdness in the datasets, so you will need to normalize them. In Unicode, there are different characters that stand for different kinds of "space", and the vanilla dataset uses several of them. To address this, run unicodedata.normalize("NFKC", string) on every string you read from the files. Also, to make sure your output format matches ours, we are providing you an apply_bpe function.
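For instance, NFKC normalization collapses a non-breaking space into an ordinary one. A minimal sketch (normalize_line is our own name, not a stencil function):

```python
import unicodedata

def normalize_line(line):
    # NFKC maps compatibility characters (e.g. the non-breaking space U+00A0)
    # onto their canonical forms, so every kind of "space" becomes U+0020
    return unicodedata.normalize("NFKC", line)
```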

You can create your BPE corpus by running python ./preprocess.py VANILLA_IN_FILE ITERATIONS BPE_OUT_FILE. Make sure you include the output file in your handin for us to grade.

We have provided you a dataset on which we have already performed 100 iterations of BPE. If the output of diff YOUR_OUTPUT_FILE bpe-eng-fraS-100iters.txt isn't empty, there is an issue with your BPE implementation. Check that you followed the paper's algorithm, and make sure you don't count "</w>" when merging BPE vocabularies.

Sequence-to-sequence Model

To test the effectiveness of this new dataset, you will also want to build a sequence-to-sequence model. No formal attention mechanism is required for this assignment, although some form of positional attention may help your performance. If you have forgotten about attention, you can find a brief overview in the review slides.

Once you have implemented the model, try running it on the BPE-preprocessed French to English dataset. We have provided you with a read_from_corpus function in preprocess.py. Make sure you use our version, or your accuracy may come out about 5% lower. Split the shuffled dataset into 90% for training and 10% for validation. Shuffling is important here since the dataset is ordered by sentence length. You can do so with random_split (shuffle the data and then split it) and the shuffle parameter on DataLoader (shuffle the data at the beginning of each epoch). Once you have finished training the model, evaluate it on the validation set, and log both the final perplexity score and the final accuracy score.
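In PyTorch, random_split and DataLoader(shuffle=True) handle this for you; the split itself amounts to something like the following stdlib sketch (shuffled_split is a hypothetical helper, not part of the stencil):

```python
import random

def shuffled_split(pairs, train_frac=0.9, seed=0):
    """Shuffle (source, target) pairs, then cut into 90% train / 10% validation."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # shuffle first: corpus is sorted by length
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]
```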

The training and validation script is similar to that of parse_reranker. Run python ./multilingual.py -Ttb -f BPE_CORPUS to train and test. Note that the -l and -s options also work for this assignment, though you most likely will not need them.

You will also want to run your model on the "standard" dataset, i.e. the dataset not preprocessed with BPE, to see how it compares to the BPE-preprocessed dataset. This dataset still needs preprocessing to replace infrequent words with UNK. Fortunately, this functionality is already provided for you as preprocess_vanilla, which "unk"s the dataset (you may want to tweak the UNK threshold). Once generated, train and validate the model and log the perplexity score and the accuracy score. Run python ./multilingual.py -Tt -f VANILLA_CORPUS to train and test.
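The idea behind UNKing is simply thresholding on word frequency. A sketch of that idea (unk_corpus and its defaults are our own illustration, not the stencil's preprocess_vanilla):

```python
from collections import Counter

def unk_corpus(sentences, threshold=2, unk_token="UNK"):
    """Replace words seen fewer than `threshold` times with the UNK token."""
    counts = Counter(word for sent in sentences for word in sent.split())
    return [
        " ".join(w if counts[w] >= threshold else unk_token for w in sent.split())
        for sent in sentences
    ]
```

Raising the threshold shrinks the vocabulary but discards more lexical information, which is exactly the trade-off BPE avoids.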

Please record the hash to the BPE and vanilla translation experiments you wish us to grade on in your README file.

Note: Because you have two variants of the dataset, you will have to create two variants of the sequence-to-sequence model as well. For the model that uses the vanilla dataset, you want an embedding for French words and an output probability distribution over English words. For the model that uses the BPE dataset, you want a single embedding for all BPE subword representations and an output probability distribution over that same representation, since both French and English words were encoded through the joint BPE. You should write both versions under the same model class with a bpe toggle; PyTorch's dynamic compute graph lets you use ordinary if statements.
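One way the bpe toggle could look, as a sketch only: the layer choices, sizes, and names below are our assumptions, not the stencil's:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder LSTM with a `bpe` toggle (hypothetical sizes)."""
    def __init__(self, src_vocab, tgt_vocab, embed_size=64, hidden_size=128, bpe=False):
        super().__init__()
        self.bpe = bpe
        if bpe:
            # joint BPE vocabulary: one embedding shared by both languages,
            # and the output distribution ranges over the same symbols
            self.src_embed = self.tgt_embed = nn.Embedding(src_vocab, embed_size)
            out_size = src_vocab
        else:
            # vanilla: separate French embedding, English output distribution
            self.src_embed = nn.Embedding(src_vocab, embed_size)
            self.tgt_embed = nn.Embedding(tgt_vocab, embed_size)
            out_size = tgt_vocab
        self.encoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, out_size)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_embed(src))
        dec_out, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.out(dec_out)  # (batch, tgt_len, out_size) logits
```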

As a note, you are encouraged to try out different versions of the model. Some things you can try are:

  • The number of RNN layers to use,
  • Changing the LSTM parameters or replacing it with other layers (e.g. GRU, bidirectional, etc.),
  • Different preprocessing methods (e.g. reversing the encoder input),
  • Adding attention mechanisms,
  • Tweaking hyperparameters such as learning rates, dropout rates, RNN size, embedding size, etc.

This is a fairly computation-intensive homework, so please use the GCP GPUs for training.

Multilingual Translation

In this part of the project, you will use the same model to train a German to English Translator, a French to English Translator, and a German or French to English Translator. This corresponds to the "many to one" experiments in the Google paper.

You will be provided two datasets, and you will combine them into one dataset used to train a multilingual translator. Due to similarities in training and the limited time we have, unlike in the paper, we won't require you to train individual German to English or French to English models. While preprocessing the dataset, you may want to include the target-language token (i.e. "<2en>") at the beginning of the input sentences, as described in Section 3 of the paper, since you will need this switch in the following assignment. You should also make sure the dataset is shuffled and split exactly as you did in the BPE part.
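Prepending the target-language token is a one-line transform (add_target_token is a hypothetical name for illustration):

```python
def add_target_token(sentence, target_lang="en"):
    # "<2en>" tells the model which language to translate into
    # (the artificial token from Section 3 of the Google paper)
    return f"<2{target_lang}> {sentence}"
```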

To concatenate datasets, you can use ConcatDataset. Make sure your datasets' dimensions match before concatenating, or you will have issues with DataLoader. Run python ./multilingual.py -Ttb -f BPE_CORPUS_1 BPE_CORPUS_2 -m "TAG1" "TAG2" to train and test.
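Assuming both corpora have already been encoded and padded to the same sequence length, concatenation might look like this (the tensor shapes below are placeholders):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# placeholder tensors standing in for the two encoded corpora;
# both must share the same sequence length (here 20) for DataLoader to batch them
fr_en = TensorDataset(torch.zeros(300, 20, dtype=torch.long),
                      torch.zeros(300, 20, dtype=torch.long))
de_en = TensorDataset(torch.zeros(200, 20, dtype=torch.long),
                      torch.zeros(200, 20, dtype=torch.long))

combined = ConcatDataset([fr_en, de_en])
loader = DataLoader(combined, batch_size=32, shuffle=True)
```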

Handin & Evaluation

Each model should take at most 30 minutes to train on the GPU. Aim for more than 60% accuracy with the vanilla model, and 70% with all BPE models. Run cs146_handin multilingual to submit all contents in your current directory as a handin. This directory should contain all relevant files, but for the sake of space, do not include the datasets in your handin. In addition to the Python scripts, include your README and all generated files.

In your README, please note down:
  • A brief description talking about your rationale behind the hyperparameters used,
  • A discussion on the advantages and disadvantages of using byte-pair encoding over traditional methods.

If you have tried experimenting with hyperparameters, also note down:
  • What are the hashes to your experiments,
  • A brief discussion comparing and contrasting the results between the experiments, and why the results were the way they were.

You should target above 70% accuracy for both French to English translation with the BPE vocabulary and multilingual translation. You should also observe that the BPE model works better than the vanilla model. Even if you were not able to achieve good results with any of the models you developed, you can still receive substantial partial credit by providing a substantial report describing multiple experiments, the rationale behind why the experiments were attempted and why they did or didn't work, and a discussion of why some worked better than others.

In accordance with the course grading policy, your assignment should not have any identifiable information on it, including your Banner ID, name, or cslogin.