DocDiff
1 Introduction
[list: "The", "quick", "brown", "fox", "jumps"]
In order to compute the similarity between two documents, we associate each document with a mathematical vector, which here we will represent using a list of numbers. The indices of the vector correspond to words that are found in either document. The value at each index is how many times the corresponding word occurs in the document.
We assume that two words are the same if they have the same characters in the same order, ignoring upper- and lower-case (Pyret has functions to upper- or lower-case a string, and for sorting; you can look up these functions in the string and list libraries.)A smarter version of this program would ignore case for some words but not for ones that might also be proper nouns. Our distance measure is defined to be proportional to the dot product of these two document vectors:
\[distance(\vec{d_1}, \vec{d_2}) \propto \vec{d_1} \cdot \vec{d_2}\]
To obtain a formula, we normalize this dot-product. We will choose a simple method which is to divide by the squared magnitude of the larger vector:
\[distance(\vec{d_1}, \vec{d_2}) = \frac{\vec{d_1} \cdot \vec{d_2}}{max(\|\vec{d_1}\|^2,\|\vec{d_2}\|^2)}\]
where the magnitude of a vector \(\vec{x}\), denoted as \(\|\vec{x}\|\), is given by \(sqrt(\vec{x} \cdot \vec{x})\). Observe that this means every document will have a distance of 1 from itself, and any two documents that have no words in common will have distances of 0 from each other.
2 Assignment
fun distance(doc1, doc): ... end |
Note that we are not asking you to consider efficiency of implementation.
3 Background
You may find this chapter useful in learning to program with lists in Pyret.
4 Code Template
Please use this.
5 Handing In
Your file must be named docdiff.arr.