In this assignment we'll explore two variations on parallelism. In the first, we'll define and use a data structure that is better suited to parallel decomposition of tasks. In the second, we'll implement and use the essence of Google's parallel computing primitive, MapReduce.
Lists, as you've seen them until now, permit only sequential access. You can't get to the third element until you've been through the first two. This means that when you recur, you do so on a list that is only one element smaller. If you could recur on sub-lists that were significantly smaller (by half, say), and could run each recursive call on a different processor core, you might obtain significant speedup. That requires a different data structure, which is what we're going to explore.
By downloading join-lists-support.rkt
, and then requiring it
you will have access to:
Join-List
, our new data-structure, defined as follows:
(define-type (Join-List 'a)
[empty-join-list]
[one (elt : 'a)]
[join-list (lst1 : (Join-List 'a)) (lst2 : (Join-List 'a)) (size : number)])
Note: just like normal Racket lists can contain elements of any type, so can join lists.
Due to the nature of the assignment, you are limited to using ONLY the following functions from the support code:
(join (lst1 : (Join-List 'a)) (lst2 : (Join-List 'a))) : (Join-List 'a)
(Join-List 'a)
s and joins them together into a
single list. If one or both of the input lists are empty-join-list
,
join
will simply return the other list unmodified.
Otherwise it will return a (Join-List 'a)
containing the first
list prepended onto the second.
(split (lst : (Join-List 'a))
(proc : ((Join-List 'a) (Join-List 'a) -> 'b))) : 'b
(Join-List 'a)
with two or more elements as the
first argument. As the second argument, it takes a handler which
consumes two (Join-List 'a)
s. First split
divides
its first argument into a prefix and suffix which are
non-empty (Join-List 'a)
s that will result in the original list
if joined together. Then split
invokes its second argument
on this prefix and suffix and returns what that invocation returns.
There is no guarantee that split
will divide the list at
the most recent join (indeed, our implementation does not).
(join-list=? (lst1 : (Join-List 'a))
(lst2 : (Join-List 'a))) : boolean
(Join-List 'a)
s are equivalent. Two
(Join-List 'a)
s are considered equivalent if they have the same
elements in the same order.
(one (elt : 'a)) : (Join-List 'a)
(Join-List 'a)
with that datum as the element.
(one-elt (Join-List : 'a)) : 'a
(Join-List 'a)
and returns the one element
in that list.
(empty-join-list) : (Join-List 'a)
Note: type-case is allowed as long as you do not access the fields of a multiple element join-list.
When testing your solutions, you will need to compare two (Join-List 'a)
s.
However, we have introduced a new type and check-expect
cannot check
equality of (Join-List 'a)
s. You can either use join-list=?
to compare two (Join-List 'a)
s or you can use join-list->list
to convert a (Join-List 'a)
into a regular Racket list. You may also find it
tedious to construct larger (Join-List 'a)
s by explicitly making singletons
and joining them together. Instead, you may use the list->join-list
procedure to construct larger (Join-List 'a)
s.
Important: While you are free to use list->join-list
and join-list->list
for testing and debugging, you may not
use either of these procedures in your solutions in any way!
With the API above, your task is to write the following functions for (Join-List 'a)
s.
Note: It may be tempting to write j-first
and j-rest
,
then implement the other functions with j-first
and j-rest
as you would for normal Racket lists. The functions you write for (Join-List 'a)
s should
take advantage of the unique structure of (Join-List 'a)
s. It may be necessary to use
j-first
and j-rest
in some situations, but you should use
split
instead where possible.
Also note: While we are not grading on the efficiency of
your code per-se, if your implementation is extremely inefficient, that is
probably a sign that you are not using the structure of (Join-List 'a)
s
properly, and you might want to rethink your implementation.
(j-max (lst : (Join-List 'a)) (proc : ('a 'a -> boolean))) : 'a
j-max
returns the maximum 'a in a non-empty list. The second argument to j-max
is a comparator on 'a
, the type of elements contained in the list, as discussed in class. If the comparator returns true, the first argument to the comparator is 'greater' than the second argument.
(j-length (lst : (Join-List 'a)) : number
j-length
returns the length of a list.
(j-first (lst : (Join-List 'a)) : 'a
j-first
returns the first element of a non-empty list.
(j-rest (lst : (Join-List 'a))) : (Join-List 'a)
j-rest
returns a list containing all elements but the first of a non-empty list.
(j-nth (lst : (Join-List 'a)) (i : number)) : 'a
j-nth
returns the nth element (using a 0 based index) of a list containing at least n elements. For example, j-nth
called with second argument 0
should return the first element of the list.
(j-map (proc : ('a -> 'b)) (lst : (Join-List 'a))) : (Join-List 'b)
j-map
applies an operator to each element of a list and returns the list of resulting values. The input operator must accept exactly one argument.
(j-filter (proc : ('a -> boolean)) (lst : (Join-List 'a)) : (Join-List 'a)
j-filter
applies an operator to each element of a list and returns the list of elements for which the operator returned true. The input operator must accept exactly one argument and return a boolean value.
(j-sort (proc : ('a 'a -> boolean))
(lst : (Join-List 'a))) : (Join-List 'a)
j-sort
sorts a list in increasing order. The second argument to j-sort
is a comparator on 'a
, the type of elements contained in the list, as discussed in class. If the comparator returns true, then the first argument to the comparator should come before the second argument in the outputted list.
(j-reduce (proc : ('a 'a -> 'a)) (lst : (Join-List 'a)) : 'a
j-reduce
distributes an operator across a non-empty list. That
is, given the list of elements e1, e2, ..., e
n
, and the operator op
, j-reduce
computes the equivalent of e1 op e2 op ... op e
n
.
For instance,
(j-reduce + (list->join-list (list 1 2 3))) => 6 (j-reduce max (list->join-list (list 3 1 4 6 2))) => 6
What should the value of
(j-reduce - (list->join-list (list 1 2 3)))be? Hint: There is more than one possible answer. List all.
Unfortunately, having more than one answer violates the expectation
that j-reduce
is a function. The problem is not with j-reduce
but
with the use of -
(minus) as an argument. What property should the
first argument given to j-reduce
demonstrate? Can you argue this from
the description of j-reduce
above?
For each j-function that you write (j-max
, j-length
...),
you also need to write a corresponding oracle. In your oracles, you are allowed
to use join-list->list
and list->join-list
. You should
consider how you can reuse code for several of your Oracle functions.
For the second part of this assignment, you will have to implement
and then use your own version of Google's
MapReduce. One of the things you will discover is that their
operator is horribly mis-named: it isn't the composition of
map
and reduce
. We'll anyway persist with
the name, though we'll spell it “map-reduce” since ours
has some small differences.
The map-reduce pattern allows you to divide problems into a series of independent, equivalent tasks that can be parallelized. Specifically, all of the map tasks can be run at the same time, as can all of the reduce tasks, because the results of each task does not depend on any of the other tasks.
First, you will have to write a map-reduce framework. This framework will take a mapper and a reducer, along with input. It will apply the mapper to each input, collect the results, and pass them to the reducer. Require map-reduce-support.rkt to get access to the following data structure:
(define-type: (tv-pair 'a 'b) [tv (tag : 'a) (value : 'b)])
This represents a tagged-value, the fundamental unit of data for all map-reduce programs.
For your framework, you'll need to define:
(map-reduce (input : (listof (tv-pair 'a 'b))) (mapper : ((tv-pair 'a 'b) -> (listof (tv-pair 'm 'n))) (reducer : ((tv-pair 'm (listof 'n)) -> (tv-pair 'm 'o))) : (listof (tv-pair 'm 'o))
Note that the mapper function produces a (listof (tv-pair 'm
'n))
, while the reducer function takes a (tv-pair 'm (listof
'n))
. You will have to figure out how to collect all of the
'n
s for each 'm
so you can pass them to a
reducer task.
Let's consider this contract in terms of a concrete example. To help you with testing,
we provide the mapper and reducer for one example: word count. In this case,
'a
is a string containing the name of a file and 'b
is a
string containing the contents of a file. 'm
is a string containing
a word, 'n
is an integer representing the frequency with which a word
occurs in a single file, and 'o
is an integer that represents the frequency
with which a word occurs in all files.
More specifically, the support code provides:
(wc-map (file : (tv-pair string string))) : (listof (tv-pair string number))
wc-reduce
, to test your framework. It takes a (tv-pair string string)
representing a file, as described above, and produces a (listof (tv-pair string number))
in which the tag represents a word, and the value (in this case 1),
represents the number of times the word appeared (which for each word is of
course 1 at first).
(wc-reduce (word-counts : (tv-pair string (listof number))) : (tv-pair string number)
(tv-pair string (listof number))
in which the tag represents a word, and the value is a list of word counts for that word (in
this case, as we know, a list of 1s, though it won't always be so simple!).
It produces a (tv-pair string number)
in which the tag is a word, and the value is
the number of times that word appeared in all of the input files.
Once you're satisfied with your map-reduce, use it to solve these problems:
Your input consists of a (listof (tv-pair string string))
in which the tag is the file name, and the value is the string contents
of the file (like in the word count example). For example, an input tv could
be (tv "words.txt" "star rats tarts")
. Your output should be a
(listof (tv-pair 'm (listof string)))
in which each tag
represents a unique anagram group, and the value is a list of unique words
that appear in the input that are all anagrams of each other.
We will define anagrams to be strings made up of the exact same characters
(as opposed to letters). Name your functions anagram-map
and anagram-reduce
.
You and your friends are creating Nile, which you hope will be the next big thing in on-line bookstores. You know that you can save money by anticipating what books people will buy; your brainstorm is that you will pass these savings on to your users by offering a discount if they buy books that Nile recommends.
To do this, you offer an incentive for people to upload their lists of recommended books. From their lists, you can establish suggested pairs. A pair of books is a suggested pair if both books appear on one person's recommendation list. Of course, some suggested pairs are far more popular than others; and for a given book, it is paired with some books much more frequently than with others.
You need to organize the list of recommended books to support two tasks:
Each tagged file contains a single input list, with single book descriptions
on each line. That is, book descriptions will be separated by "\n".
Each book has a unique and unambiguous description;
that is, if two lines in two different input lists are identical,
they refer to the same book. Otherwise they refer to different books. Input lists
will always contain at least two books, and they will never contain duplicates.
For example, an input tv could be (tv "boblist.txt" "Harry Potter\nTwilight\nAnna Karenina")
.
You will be implementing the following functions:
(recommend (title : string) (book-records : (listof (tv-pair string string))) : (tv-pair number (listof string)))
which takes a book title and a list of tagged files (like in Anagrams,
except formatted according to the Nile spec) and produces a
tv-pair
with the highest number of times the book was paired
with one or more other books, and a list of those books.
(popular-pairs (book-records : (listof (tv-pair string string))) : (tv-pair number (listof string)))
which takes a list of tagged files and produces a tv-pair
with
the highest number of times any two books were paired with each other
and a list of the corresponding pairs. The two books in a pair should be
separated by a "+", for example "book1+book2". For your output, each pair of
books should only appear once, and order is irrelevant.
You are expected to test anything appropriate with an oracle. The TA staff will be very impressed if you can come up with a generalized oracle for the map reduce framework, however this is not necessary.
How would join-list
s be useful for map-reduce? (One paragraph or less)
Give an example of another problem that you could use map-reduce to solve beyond Anagrams and Nile? Provide the contracts for your input data, mapper, and reducer.
Give a written explanation of how you designed your oracle. Be sure to include explanations for what each of your functions related to oracle testing is responsible for, and how they work together to prove (or get close to proving) your program is correct.
join-lists.rkt
, with code implementing
each of the stated functions with the given type signatures.map-reduce.rkt
, with code for your
framework and the two problems.analysis.pdf
with your answers to analysis questions
for both join-lists and map-reduce.
Your analysis must be in a pdf
file on Letter-sized paper.
Mathematical content must be formatted appropriately. Please, no Arial.