ZeroShot

In this assignment, you will extend your translation model into a one-to-many translation model, and finally a many-to-many translation model, as described in Google's Multilingual Neural Machine Translation System paper.

Getting Started

Stencil code (PyTorch) is available in the course directory at /course/cs1460/asgn/zeroshot. You can copy this folder into your current directory by running (s)cp -r /course/cs1460/asgn/zeroshot ./ (replace the last argument with a different directory to copy the files there).

Begin with the one-to-many model. The dataset for this section is the same as in the previous assignment:
All datasets are tab-separated text files of the format: "English sentence\tTarget translation".

You should preprocess your dataset files to prepare them for one-to-many and zero-shot translation. Make sure you shuffle them and split them 90% for training and 10% for testing, as sketched below.
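A minimal preprocessing sketch follows. It assumes hypothetical file names (eng-fra.txt, eng-deu.txt) and follows the paper's approach of prepending an artificial target-language token (e.g. "<2fr>") to each source sentence so that a single decoder can serve multiple target languages; adapt the details to however your stencil actually loads data.

    import random

    # Sketch: read a tab-separated file, prepend a target-language token to the
    # English (source) side, shuffle, and split 90/10. File names and token
    # strings are illustrative assumptions, not part of the stencil.
    def build_split(path, target_token, train_frac=0.9, seed=0):
        with open(path, encoding="utf-8") as f:
            pairs = [line.rstrip("\n").split("\t", 1) for line in f if "\t" in line]
        pairs = [(f"{target_token} {src}", tgt) for src, tgt in pairs]
        random.Random(seed).shuffle(pairs)
        cut = int(len(pairs) * train_frac)
        return pairs[:cut], pairs[cut:]

    en_fr_train, en_fr_test = build_split("eng-fra.txt", "<2fr>")
    en_de_train, en_de_test = build_split("eng-deu.txt", "<2de>")
    # For the multilingual English-to-French/German model, concatenate the two
    # training sets and shuffle them again before batching.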

Compared to the many-to-one multilingual translation project, we have added a "-z" (or "--zeroshot") flag to the script to toggle zero-shot translation. Otherwise, the command you should use is exactly the same.

One-to-Many Translation

In this step, you should train three models: an English to French model, an English to German model, and a multilingual English to French/German model. You should expect only very minor performance drops going from the one-to-one translation models to the one-to-many translation model. For each model, once you have finished training, evaluate the model on the validation set and print out both the final perplexity score and the final accuracy score (see the evaluation sketch below). You may want to read Section 4 of the paper for greater clarity on this step.
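The sketch below shows one way to compute the two validation metrics. It assumes a hypothetical model(encoder_input, decoder_input) signature returning per-token logits, batches of (encoder_input, decoder_input, decoder_target) index tensors, and a padding index of 0; substitute the stencil's actual interfaces. Perplexity is the exponential of the average per-token cross-entropy, and accuracy is token-level accuracy with padding masked out.

    import math
    import torch
    import torch.nn.functional as F

    PAD_IDX = 0  # assumed padding index; use whatever your vocabulary defines

    def evaluate(model, loader, device):
        """Return (perplexity, accuracy) over a validation loader whose batches
        are (encoder_input, decoder_input, decoder_target) index tensors."""
        model.eval()
        total_loss, total_correct, total_tokens = 0.0, 0, 0
        with torch.no_grad():
            for enc_in, dec_in, dec_tgt in loader:
                enc_in, dec_in, dec_tgt = enc_in.to(device), dec_in.to(device), dec_tgt.to(device)
                logits = model(enc_in, dec_in)          # (batch, seq_len, vocab)
                mask = dec_tgt.ne(PAD_IDX)              # ignore padding positions
                loss = F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)),
                    dec_tgt.reshape(-1),
                    ignore_index=PAD_IDX,
                    reduction="sum",
                )
                preds = logits.argmax(dim=-1)
                total_loss += loss.item()
                total_correct += (preds.eq(dec_tgt) & mask).sum().item()
                total_tokens += mask.sum().item()
        return math.exp(total_loss / total_tokens), total_correct / total_tokens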

Zero Shot Translation

As the paper states, "An interesting benefit of our approach is that it allows to perform directly implicit bridging (zero-shot translation) between a language pair for which no explicit parallel training data has been seen without any modification to the model." That is, your model should be able to predict the French translation of German sentences without ever having seen such examples!

For this part of the assignment, we are providing you with a preprocessed dataset. (The data is already BPE-tokenized, shuffled, and split into train and test files. Each line consists of an English sentence, a tab, then the corresponding non-English sentence.) You will want to use the two test files to generate a line-to-line correspondence between the French and German sentences; a sketch is given below.
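One way to build that correspondence is sketched here. The file names are hypothetical, and the sketch assumes the two test files can be paired through their shared English side; if the files are already line-aligned, zipping them by index works as well.

    def load_pairs(path):
        # Each line: "English sentence\tNon-English sentence"
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n").split("\t", 1) for line in f if "\t" in line]

    en_to_fr = {en: fr for en, fr in load_pairs("enfr_test.txt")}
    de_test = load_pairs("ende_test.txt")

    # German source sentence -> French reference, for every English sentence
    # that appears in both test files.
    fr_de_pairs = [(de, en_to_fr[en]) for en, de in de_test if en in en_to_fr]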

The paper specifies two models, but we will implement only Model 1 (train on German to English and English to French).

You will need to make sure that when the zero-shot flag is turned on, the datasets are processed correctly (see the sketch below). How you do this is up to you, and you may modify the stencil as much as you need to. As long as your scripts can run both the one-to-many and the zero-shot scenarios purely through command-line arguments, you are good to go. Make sure to provide the corresponding commands for both scenarios in your README file.
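As a sketch of that data-processing branch (reusing load_pairs from above; file names and token strings are again illustrative): when --zeroshot is set, Model 1's training corpus mixes German-to-English and English-to-French pairs, each tagged with a target-language token, which is what makes a German-to-French request possible at test time.

    import random

    # Hypothetical train-file names; each line is "English\tNon-English".
    # German-to-English examples swap the columns so German becomes the source.
    de_en = [(f"<2en> {de}", en) for en, de in load_pairs("ende_train.txt")]
    en_fr = [(f"<2fr> {en}", fr) for en, fr in load_pairs("enfr_train.txt")]

    zeroshot_train = de_en + en_fr
    random.shuffle(zeroshot_train)

    # At test time, prefix a German sentence with "<2fr>" and compare the
    # model's output against the French reference from fr_de_pairs.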

Handin & Evaluation

Run cs146_handin zeroshot to submit all contents in your current directory as a handin. This directory should contain all relevant files, but for the sake of space, do not include the datasets in your handin. In addition to the Python scripts, include your README.

In your README, please note down:
  • The Comet.ml URLs of your best experiments,
  • A brief description of your rationale behind the hyperparameters used,
  • Your accuracy and perplexity scores for the one-to-many model and the zero-shot (many-to-many) model.

As a sanity check, the one-to-many model and the one-to-one models should reach at least a 60% accuracy score, while the zero-shot model should have an accuracy score greater than 0%. The seq2seq models should take at most 30 minutes to train on the GPU.

The benchmark for this assignment is a ~70% accuracy score for the one-to-many model and one-to-one models, and a ~10% accuracy score for the zero-shot model. (Some leniency will be given if you are close to the benchmark; at that point it is likely a hyperparameter issue. If your scores are close to or lower than the sanity-check thresholds, though, it implies there is a problem.)

Regardless of the final score you are able to achieve for this model, write down in your README:
  • A discussion on the results of the one-to-many model. What are the possible reasons behind the one-to-many model doing better/worse than the one-to-one models?
  • A discussion on whether you were able to replicate the zero-shot translation as written in the paper. If you were able to, how is this possible in a seq2seq model when examples were never given?

If you were unable to get good results from the models developed, further write down in your README:
  • How have you verified that the data preprocessing (i.e. the final encoder input / decoder input / decoder output) is correct?
  • How have you verified that the problem does not lie in the model? (Can you argue, with references, that your model exactly follows the seq2seq model? Which lines correspond to which part of the model? Have you verified that the loss function (with masking) and the backpropagation step are correct?)
  • How have you verified that it isn't just a simple hyperparameter issue? (Have you tried experimenting with them? Do you have a report / table / chart demonstrating that you have comprehensively searched the space, yet still unable to reach the target score?)
  • With the above, are you confident you can make a case that the paper is simply not replicable? If you believe your implementation is the issue instead (but chose to submit anyway, for example because you ran out of time), what do you believe you could have done or experimented with to further improve or fix your implementation?

Going through these steps should hopefully help you identify the potential issues in your code, if any. Even if you were unable to find the issue in time, the report that you have created at this point would demonstrate that you have spent substantial effort on this project, and you will thus likely receive partial credit for the performance score.

In accordance with the course grading policy, your assignment should not have any identifiable information on it, including your Banner ID, name, or cslogin.