On this page:
1 Objective
2 Format
3 Theme Song
4 Problem 1
5 Problem 2
6 Problem 3
7 Starter
8 Computing and Society

TweeSearch

    1 Objective

    2 Format

    3 Theme Song

    4 Problem 1

    5 Problem 2

    6 Problem 3

    7 Starter

    8 Computing and Society

1 Objective

To give you a feel for software development efforts, you will do this project in stages. At each stage you will be given slightly different and changing requirements, and must turn in a solution that matches that stage. The link to the next stage will be provided when you complete the previous stage. There are three stages in all, ordered in increasing difficulty—although your mileage may vary, it pays to start early.

2 Format

You should treat this assignment like an exam. Therefore, you should do this work without consulting course staff except for critical issues (broken links, possible assignment typo, etc.).

You should not try to peek ahead to the next stages through any means; doing so would be a violation of the Academic Code.

Unlike in previous assignments, you will be given only one chance to submit at each stage (like a company making a code release of a new version). So please submit with care.

We will be evaluating your code in part based on how effectively it can be reused in more general contexts.

3 Theme Song

Naruto Shippuden Opening 3 | Blue Bird by Ikimono-gakari

4 Problem 1

Imagine you have a collection of past tweets, and are given a new tweet. You want to find all the old tweets that are similar to the current one.

How do we test for “similarity”? Easy: we use DocDiff! That is, we will use overlap with some specified threshold (which will be in the range \([0, 1]\)).

The assignment depends on the following data definition:

data Tweet: tweet(author :: String, content :: String) end

(which is provided for you in the support code), where both fields are expected to be non-empty strings.

In this assignment, we will:
  • consider only the alphanumeric characters (retaining both capital and lower-case letters) and spaces of each Tweet, and

  • keep the tweet intact, so when returning searched tweets, the entire text is present.

Comparison is done with the content field; the author field is ignored.

Define the function

search :: Tweet, List<Tweet>, Number -> List<Tweet>

The first argument is the new tweet, the second argument is the collection of past tweets, and the third is a threshold for overlap.

Your overlap function from DocDiff takes two lists of strings. To create the lists, split the content of a tweet into words. In the tweet, a space separates one word from the next.

The result is all past tweets that have an overlap of at least threshold (inclusive). The result should be sorted from the tweets with highest to those with lowest (up to threshold) overlap.

Here is your Template for Problem 1.

꧁༺ Stop Here! ༻꧂

Submit before continuing to read!

5 Problem 2

Previously, each old tweet was treated in isolation. In reality, many tweets are parts of threads, i.e., they have a parent tweet. Therefore, the Tweet structure now has a field, parent, of type Option<Tweet>. The none option indicates that the tweet has no parent.

You will again define search, with the same parameters. However, the way we perform the search is somewhat different, as defined below.

First, we define the notion of a tweet’s relevance. If a tweet has no parent, then its relevance is exactly its overlap. If a tweet does have a parent, then its relevance is \[(0.25 \times r_p) + (0.75 \times o_t)\] where \(r_p\) is the relevance of its parent and \(o_t\) is the overlap of the tweet itself with the new tweet.

The third argument to search is the relevance threshold (inclusive).

All tweets that satisfy the relevance threshold should be included in the result, which is again sorted in order of most to least relevant.

Finally, note that multiple tweets can have the same parent (because two tweets may have responded to the parent tweet). This should remind you of our discussion about DAGs. You need to eliminate duplicates in your response, where duplicates are defined by identical.

꧁༺ Stop Here! ༻꧂

Submit before continuing to read!

Here is your Template for Problem 2.

6 Problem 3

In the previous version, we had a somewhat awkward way of dealing with the true structure of tweets. This is because we defined tweets as ascending chains, which meant we could end up inadvertently traversing duplicates, which we had to then filter. A better structure would be to have a descending tree.

Therefore, we will change the definition of a Tweet, eliminating the parent field and instead having the field children, which is of type List<Tweet>.

You will again define search, again with the same parameters, but with two differences:
  • The given list of old tweets is guaranteed to contain only and all the roots.

  • Below, we redefine what relevance means.

A tweet’s relevance is now conditionally defined depending on whether or not it has a parent.

If the tweet does not have a parent, its relevance is \[(0.75 \times o_t) + (0.25 \times s)\] where \(o_t\) is the overlap of the tweet itself with the new tweet, and \(s\) is a size ratio (defined below).

If the tweet does have a parent, its relevance is \[(0.20 \times r_p) + (0.60 \times o_t) + (0.20 \times s)\] where \(r_p\) is the relevance of its parent, \(o_t\) is the overlap of the tweet itself with the new tweet, and \(s\) is its size ratio.

What is the size ratio? It’s the ratio of the size of the current tweet’s subtree to the total size of all the trees provided. The size of a tweet’s tree is defined as 1 (for the tweet itself) plus the sum of the sizes of each of the child tweets. (The parent and other tweets do not count.)

Here is your Template for Problem 3.

7 Starter

Template for Problem 1

Template for Problem 2

Template for Problem 3

8 Computing and Society

Read/View

Read (or watch) this.danah boyd is a Brown CS alum.

Write

Come up with a few pairs of queries that search for the same content on Youtube but, with subtle changes in wording, come up with results quite different in character. Since we can’t be sure to get the same results you do, tell us what the queries produce and why you think those outcomes are significantly different.