DocDiff
1 Introduction
2 Assignment
3 Template Files
4 Handing In
4.1 Initial Test Sweep
4.2 Implementation and Final Tests

DocDiff

1 Introduction

Consider the problem of comparing two text documents. Why might you want to do this? Perhaps you want to check for plagiarism; search for articles similar to a particular one you’re studying; or have uncovered a new manuscript and want to know whether it’s a legitimate Shakespeare or a fake. All these require being able to determine the similarity between documents. One way to model this similarity is as a distance metric, analogous to how we compute the distance between points in space.

With each document, we associate a vector. The indices of the vector are the words found in either document; the value at each index is the number of times that word occurs in the document. Because we are comparing documents, we will assume that two words are the same if they have the same characters in the same order, ignoring case. (A smarter version of this program would ignore case for some words but not for ones that might also be proper nouns.) Our distance measure is defined to be proportional to the dot product of these two document vectors:

\[\operatorname{dist}(\vec{d_1}, \vec{d_2}) \propto \vec{d_1} \cdot \vec{d_2}\]

To obtain a formula, we normalize this dot product. We choose a simple method: divide by the squared magnitude of the larger vector:

\[\operatorname{dist}(\vec{d_1}, \vec{d_2}) = \frac{\vec{d_1} \cdot \vec{d_2}}{\max(\|\vec{d_1}\|^2, \|\vec{d_2}\|^2)}\]

This means that every document has a distance of 1 from itself, and any two documents with no words in common have a distance of 0 from each other.
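For instance, take two tiny hypothetical documents consisting of the words (a, b, a) and (a, c). Over the word order (a, b, c), their vectors are \(\vec{d_1} = (2, 1, 0)\) and \(\vec{d_2} = (1, 0, 1)\), so

\[\operatorname{dist}(\vec{d_1}, \vec{d_2}) = \frac{(2)(1) + (1)(0) + (0)(1)}{\max(2^2 + 1^2 + 0^2,\; 1^2 + 0^2 + 1^2)} = \frac{2}{\max(5, 2)} = \frac{2}{5}\]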

2 Assignment

Define a function

distance(doc1 :: List<String>, doc2 :: List<String>) -> Number

that computes the distance between two non-empty documents. Each document is a list of strings, with each string representing a word. Here’s an example of a document:

[list: "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
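Concretely, a correct distance should satisfy tests like the following (a sketch in Pyret's check syntax; the single-letter words are our own illustrative inputs, not part of the assignment):

```pyret
check:
  # a document compared with itself has distance 1
  distance([list: "a", "b", "a"], [list: "a", "b", "a"]) is 1
  # documents with no words in common have distance 0
  distance([list: "a"], [list: "b"]) is 0
  # word comparison ignores case
  distance([list: "The", "fox"], [list: "the", "fox"]) is 1
  # vectors (2, 1, 0) and (1, 0, 1): dot product 2, larger squared magnitude 5
  distance([list: "a", "b", "a"], [list: "a", "c"]) is 2/5
end
```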

3 Template Files

Initial Tests

Implementation Code

Final Tests

Implementation-dependent testing should be in the implementation file. The final tests file should contain your tests for distance.

4 Handing In

4.1 Initial Test Sweep

As with previous assignments, you will submit an initial test sweep. This is due at 11:59 PM on Thursday, October 30.

https://www.captain-teach.org/brown-cs019/assignments/

4.2 Implementation and Final Tests

To submit your implementation, return to the Captain Teach assignments page:

https://www.captain-teach.org/brown-cs019/assignments/

and click “Next Step” again. This time, save and then upload a zip file of both docdiff-tests.arr and docdiff-code.arr. You can include as many tests as you want (beyond 10) for this final submission, and you can include tests you saw while reviewing (but copy with care! Please attribute the tests you copied to peer review, even though you won’t be able to name the author).

After you submit your implementation, you’ll have one further step to complete. We want you to answer a quick two-question survey about what (if any) impact the peer review process had on your final submission. The interface will look similar to the review interface, and once you submit your answers to the two questions there, you’re done. Again, this feedback will have no effect on your grade—we’re interested in hearing about your experience with the review process.