In this assigngment we'll explore two variations on parallelism. In the first, we'll define and use a data structure that is better suited to parallel decomposition of tasks. In the second, we'll implement and use the essence of Google's parallel computing primitive, MapReduce.
Lists, as you've seen them until now, permit only sequential access: you can't get to the third element until you've been through the first two, and so on. This means that when you recur, you do so on a list that is only one element smaller. If you could recur on sub-lists that were significantly smaller (by half, say), and could run each recursive call on a different processor core, you might obtain significant speedup. That requires a different data structure, which is what we're going to explore.
A join-list-of
'a
is either
(one elt)
where elt is of type 'a
, or(join lst1 lst2)
where lst1
and lst2
are both join-lists of 'a
.
By installing the join-lists-teach.ss
Teachpack,
you will have access to the following functions:
empty
empty
is used as the empty join-list.
empty?: any → boolean
consumes a datum and returns true
if the datum is the empty join-list (false
otherwise).
one: 'a → (join-list-of
'a
)
consumes a single datum and returns a one-element join-list
with that datum as the element.
one?: any → boolean
consumes a datum and returns true
if the datum is
a singleton join-list (false
otherwise).
get: (join-list-of
'a
) → 'a
consumes a one-element join-list and returns the one element in that list.
join: (join-list-of
'a
) (join-list-of 'a
) → (join-list-of 'a
)
consumes two join-lists and joins them together into a single list. If one or both
of the input lists are empty, join
will simply return the other list unmodified.
Otherwise it will return a join-list containing the first list prepended onto the second.
join?: any → boolean
consumes a datum and returns true
if the datum is a join-list
with multiple elements (false
otherwise).
split: (join-list-of 'a) ((join-list-of 'a) (join-list-of 'a) → 'b) → 'b
consumes a join-list with two or more elements as a first argument.
As a second argument, it takes a handler which consumes two join-lists.
First split
divides its first argument into a prefix and suffix which are
non-empty join-lists that will result in the original list if joined together.
Then split
invokes its second argument on this prefix and suffix and returns
what that invocation returns. There is no guarantee that split
will divide
the list at the most recent join (indeed, our implementation does not).
join-list?: any → boolean
consumes a datum and returns true
if the datum is a join-list (false
otherwise).
join-list=?: (join-list-of 'a) (join-list-of 'a) → boolean
checks whether two join-lists are equivalent. Two join-lists are considered equivalent
if they have the same elements in the same order.
When testing your solutions, you will need to compare two join-lists.
However, we have introduced a new type and check-expect
cannot check
equality of join-lists. You can either use join-list=?
to compare two join-lists or you can use join-list->list
to convert a join-list into a regular Racket list. You may also find it
tedious to construct larger join-lists by explicitly making singletons
and joining them together. Instead, you may use the list->join-list
procedure to construct larger join-lists.
Important: While you are free to use list->join-list
and join-list->list
for testing and debugging, you may not
use either of these procedures in your solutions in any way!
With the API above, your task is to write the following functions for join-lists.
Note: It may be tempting to write j-first
and j-rest
,
then implement the other functions with j-first
and j-rest
as you would for normal Racket lists. The functions you write for join-lists should
take advantage of the unique structure of join-lists. It may be necessary to use
j-first
and j-rest
in some situations, but you should use
split
instead where possible.
Also note: While we are not grading on the efficiency of your code per-se, if your implementation is extremely inefficient, that is probably a sign that you are not using the structure of join-lists properly, and you might want to rethink your implementation.
j-max: (join-list-of 'a) ('a 'a → boolean) → 'a
j-max
returns the maximum number in a non-empty list. The second argument to j-max
is a comparator on 'a
, the type of elements contained in the list, as discussed in class. If the comparator returns true, the first argument to the comparator is 'greater' than the second argument.
j-length: (join-list-of 'a) → number
j-length
returns the length of a list.
j-first: (join-list-of 'a) → 'a
j-first
returns the first element of a non-empty list.
j-rest: (join-list-of 'a) → (join-list-of 'a)
j-rest
returns a list containing all elements but the first of a non-empty list.
j-nth: (join-list-of 'a) number → 'a
j-nth
returns the nth element of a list containing at least n elements. For example, j-nth
called with second argument 1
should return the first element of the list.
j-map: ('a → 'b) (join-list-of 'a) → (join-list-of 'b)
j-map
applies an operator to each element of a list and returns the list of resulting values. The input operator must accept exactly one argument.
j-filter: ('a → boolean) (join-list-of 'a) → (join-list-of 'a)
j-filter
applies an operator to each element of a list and returns the list of elements for which the operator returned true. The input operator must accept exactly one argument and return a boolean value.
j-sort: (join-list-of 'a) ('a 'a → boolean) → (join-list-of 'a)
j-sort
sorts a list in increasing order. The second argument to j-sort
is a comparator on 'a
, the type of elements contained in the list, as discussed in class. If the comparator returns true, then the first argument to the comparator should come before the second argument in the outputted list.
j-reduce: ('a 'a → 'a) (join-list-of 'a) → 'a
j-reduce
distributes an operator across a non-empty list. That is, given the list of elements e1, e2, ..., en
, and the operator op
, j-reduce
computes the equivalent of e1 op e2 op ... op en
.
For instance,
(j-reduce + (list->join-list (list 1 2 3))) => 6 (j-reduce max (list->join-list (list 3 1 4 6 2))) => 6 (j-reduce join (list->join-list (list (list->join-list (list 1 2)) (list->join-list (list 3 4 5)) (list->join-list (list 6))))) => (list->join-list (list 1 2 3 4 5 6))
j-reduce
What should the value of
(j-reduce - (list->join-list (list 1 2 3)))be? Hint: There is more than one possible answer. List all.
Unfortunately, having more than one answer violates the expectation
that j-reduce
is a function. The problem is not with j-reduce
but
with the use of -
(minus) as an argument. What property should the
first argument given to j-reduce
demonstrate? Can you argue this from
the description of j-reduce
above?
For the second part of this assignment, you will have to implement
and then use your own version of Google's
MapReduce. One of the things you will discover is that their
operator is horribly mis-named: it isn't the composition of
map
and reduce
. We'll anyway persist with
the name, though we'll spell it “map-reduce” since ours
has some small differences.
The map-reduce pattern allows you to divide problems into a series of independent, equivalent tasks that can be parallelized. Specifically, all of the map tasks can be run at the same time, as can all of the reduce tasks, because the results of each task does not depend on any of the other tasks.
First, you will have to write a map-reduce framework. This framework will take a mapper and a reducer, along with input. It will apply the mapper to each input, collect the results, and pass them to the reducer. Require map-reduce-support.rkt to get access to the following data structure:
(define-struct: (a b) tv ([tag : a] [value : b]))
This represents a tagged-value, the fundamental unit of data for all map-reduce programs.
Using this, you must define:
(: map-reduce (All (a b m n o) ((Listof (tv a b)) ; input data ((tv a b) -> (Listof (tv m n))) ; mapper function ((tv m (Listof n)) -> (tv m o)) ; reducer function -> (Listof (tv m o))))) ; output
Note that the mapper function produces a (Listof (tv m
n))
, while the reducer function takes a (tv m (Listof
n))
. You will have to figure out how to collect all of the
n
s for each m
so you can pass them to a
reducer task.
To help you with testing, we provide the mapper and reducer for one
example, word-count. For this, the input data are of type (tv
String String)
, where the tag is the name of the file, and the
value is the contents of the file. The support code also provides:
(: wc-map ((tv String String) -> (Listof (tv String Integer)))
wc-reduce
, to test your framework. It takes a tv
representing a file, as described above, and produces a (Listof tv)
in which the tag represents a word, and the value (in this case 1), represents the number of times the word appeared (which for each word is of course 1 at first).
(: wc-reduce ((tv String (Listof Integer)) -> (tv String Integer))
tv
in which the tag represents a word, and the value is a list of word counts for that word (in this case, as we know, a list of 1s, though it won't always be so simple!). It produces a tv
in which the tag is a word, and the value is the number of times that word appeared in all of the input files.
Once you're satisfied with your map-reduce, use it to solve these problems:
Your input consists of a (Listof tv)
in which the tag is
the file name, and the value is the string contents of the file (like
in the word count example). Your output should be a (Listof
tv)
in which each tag represents a unique anagram group, and
the value is a list of unique words that appear in the input that are all
anagrams of each other. We will define anagrams to be strings made up of
the exact same characters (as opposed to letters). Name your functions
anagram-map
and anagram-reduce
.
We're going to do Nile again, this time with parallelism for free. However, if you thought we were going to let you get away with implementing it in Java, you must be in de-Nile! Instead, implement:
(: recommend (String (Listof (tv String String)) -> (tv Integer (Listof String))))
which takes a book title and a list of tagged files (like in Anagrams,
except formatted according to the Nile spec) and produces a
tv
with the highest number of times the book was paired
with one or more other books, and a list of those books.
(: popular-pairs ((Listof (tv String String)) -> (tv Integer (Listof String))))
which takes a list of tagged files and produces a tv
with
the highest number of times any two books were paired with each other
and a list of the corresponding pairs.
Pairs should be formatted according to the original Nile spec.
join_lists.rkt
, with code implementing
each of the stated functions with the given type signatures.reduce.[txt,pdf,doc]
with your answers to the questions about j-reduce
. br>
As always, these files must be in plain text, pdf, or, if you really must, Word format. If you
use Word, we will accept only Word 2003 (.doc
) files. If you are
using Word 2007, you should export to a pdf.
We will not accept Word 2007 (.docx
) files because we cannot read them.map-reduce.rkt
, with code for your framework and the two problems.