Exercise Sheet 3 17. Mai 2014 Exercise 3-1: Min-hash Exercise 3-2

Large Scale Information Processing
SoSe 2014
.
TU Darmstadt
Prof. Dr. U. Brefeld
E. Tzouridis
Exercise Sheet 3
17. Mai 2014
...................................................................................................... due date: 22. May 2014
Exercise 3-1: Min-hash
i) Compute the Jaccard similarities of each pair of the following three sets: {1, 2, 3, 4},
{2, 3, 5, 7}, and {2, 4, 6}.
ii) Compute the approximate Jaccard similarity by using min-hash. Use the following
hash functions: h1 (i) = 2i + 3 mod 7, h2 (i) = 3i mod 7, h3 (i) = 4i + 1 mod 7.
iii) Prove that if the Jaccard similarity of two sets is 0, then the approximated similarity
computed by the min-hash is correct.
Exercise 3-2: Distance measures
i) Edit distance
This distance makes sense when points are strings. The distance between two strings
x = x1 x2 ...xn and y = y1 y2 ...ym is the smallest number of insertions and deletions of
single characters that will convert x to y
Prove that edit distance is a distance measure.
ii) Cosine distance and approximation
Given the following “random” vectors (normals to “random” hyperplanes):
v1 = [+1, +1, +1, −1]
v2 = [+1, +1, −1, +1]
v3 = [+1, −1, +1, +1]
v4 = [−1, +1, +1, +1]
Use the random hyperplane algorithm to compute the approximated cosine distance
between the following vectors:
a = [1, 2, 1, 4]
b = [3, 2, −1, 3]
c = [1, −2, 2, 1]
1
Compare it with the true distance. In cases of 0s, assign a negative sign
Exercise 3-3: LSH with AND/OR construction
What is the effect on probability of starting with the family of minhash functions and
applying
i) A 2-way AND construction followed by a 3-way OR construction.
ii) A 3-way OR construction followed by a 2-way AND construction.
iii) A 2-way AND construction followed by a 2-way OR construction, followed by a 2-way
AND construction.
Exercise 2-4: MapReduce
i) Minhash
Assume that your documents are preprocessed and every document is represented as a
file where every line has the id of a shingle. How would you implement minhash for constructing the signature matrix in a mapreduce way? Write the appropriate pseudocode.
i) LSH
Assume that you have the signature matrix in the following form:
Filename_1 mhash_1_1 mhash_1_2 mhash_1_3 ... mhash_1_N
Filename_2 mhash_2_1 mhash_2_2 mhash_2_3 ... mhash_2_N
...
Filename_M mhash_M_1 mhash_M_2 mhash_M_3 ... mhash_M_N
How would you implement LSH in mapreduce for duplicate detection? Write the pseudocode.
2