Programming with
Data Structures and Algorithms

Parallelism

In this assignment we'll explore two variations on parallelism. In the first, we'll define and use a data structure that is better suited to parallel decomposition of tasks. In the second, we'll implement and use the essence of Google's parallel computing primitive, MapReduce.

Join Lists

Lists, as you've seen them until now, permit only sequential access: you can't get to the third element until you've been through the first two, and so on. This means that when you recur, you do so on a list that is only one element smaller. If you could recur on sub-lists that were significantly smaller (by half, say), and could run each recursive call on a different processor core, you might obtain significant speedup. That requires a different data structure, which is what we're going to explore.

A JoinList$ is either

Please note that JoinList$s will only contain Number$s, but your code should not rely on this fact. Just like normal Racket lists can contain elements of any type, so can join lists.

API

By downloading join-lists-support.rkt, and then requiring it you will have access to the following functions:

Testing

When testing your solutions, you will need to compare two JoinList$s. However, we have introduced a new type and check-expect cannot check equality of JoinList$s. You can either use join-list=? to compare two JoinList$s or you can use join-list->list to convert a JoinList$ into a regular Racket list. You may also find it tedious to construct larger JoinList$s by explicitly making singletons and joining them together. Instead, you may use the list->join-list procedure to construct larger JoinList$s.

Important: While you are free to use list->join-list and join-list->list for testing and debugging, you may not use either of these procedures in your solutions in any way!

The Problems

With the API above, your task is to write the following functions for JoinList$s.

Note: It may be tempting to write j-first and j-rest, then implement the other functions with j-first and j-rest as you would for normal Racket lists. The functions you write for JoinList$s should take advantage of the unique structure of JoinList$s. It may be necessary to use j-first and j-rest in some situations, but you should use split instead where possible.

Also note: While we are not grading on the efficiency of your code per-se, if your implementation is extremely inefficient, that is probably a sign that you are not using the structure of JoinList$s properly, and you might want to rethink your implementation.

A Question About j-reduce

What should the value of

(j-reduce - (list->join-list (list 1 2 3)))
be? Hint: There is more than one possible answer. List all.

Unfortunately, having more than one answer violates the expectation that j-reduce is a function. The problem is not with j-reduce but with the use of - (minus) as an argument. What property should the first argument given to j-reduce demonstrate? Can you argue this from the description of j-reduce above?


MapReduce

For the second part of this assignment, you will have to implement and then use your own version of Google's MapReduce. One of the things you will discover is that their operator is horribly mis-named: it isn't the composition of map and reduce. We'll anyway persist with the name, though we'll spell it “map-reduce” since ours has some small differences.

The map-reduce pattern allows you to divide problems into a series of independent, equivalent tasks that can be parallelized. Specifically, all of the map tasks can be run at the same time, as can all of the reduce tasks, because the results of each task does not depend on any of the other tasks.

First, you will have to write a map-reduce framework. This framework will take a mapper and a reducer, along with input. It will apply the mapper to each input, collect the results, and pass them to the reducer. Require map-reduce-support.rkt to get access to the following data structure:

(define-struct: (a b) tv ([tag : a] [value : b]))
where tag is of type a and value is of type b.

This represents a tagged-value, the fundamental unit of data for all map-reduce programs.

(Note that we are using Typed Racket syntax for contracts on the assignment page. If you are not using Typed Racket, you should to convert these to the cs019 language signature syntax. We use the Typed Racket syntax here to demonstrate the relationships between the types consumed and produced by the various functions.)

Using this, you must define:

(: map-reduce (All (a b m n o)
                    ((Listof (tv a b))                  ;input data
                    ((tv a b) -> (Listof (tv m n))      ;mapper function
                    ((tv m (Listof n)) -> (tv m o))     ;reducer function
                    -> (Listof (tv m o))))))            ;output

Note that the mapper function produces a (Listof (tv m n)), while the reducer function takes a (tv m (Listof n)). You will have to figure out how to collect all of the ns for each m so you can pass them to a reducer task.

To help you with testing, we provide the mapper and reducer for one example: word count. For this, the input data are of type (tv String String), where the tag is the name of the file, and the value is the contents of the file. The support code also provides:

Once you're satisfied with your map-reduce, use it to solve these problems:

  1. Anagrams:

    Your input consists of a (Listof tv) in which the tag is the file name, and the value is the string contents of the file (like in the word count example). Your output should be a (Listof tv) in which each tag represents a unique anagram group, and the value is a list of unique words that appear in the input that are all anagrams of each other. We will define anagrams to be strings made up of the exact same characters (as opposed to letters). Name your functions anagram-map and anagram-reduce.

  2. Nile 2.0:

    We're going to do Nile again, this time with parallelism for free. However, if you thought we were going to let you get away with implementing it in Java, you must be in de-Nile! Instead, implement:

    (: recommend (String (Listof (tv String String)) -> (tv Integer (Listof String))))

    which takes a book title and a list of tagged files (like in Anagrams, except formatted according to the Nile spec) and produces a tv with the highest number of times the book was paired with one or more other books, and a list of those books.

    (: popular-pairs ((Listof (tv String String)) -> (tv Integer (Listof String))))

    which takes a list of tagged files and produces a tv with the highest number of times any two books were paired with each other and a list of the corresponding pairs.

    Pairs should be formatted according to the original Nile spec.

What to turn in: