B490 Mining the Big Data
Qin Zhang
§1 Finding Similar Items
Motivations
Finding similar documents/webpages/images
(Approximate) mirror sites. Application: Google doesn't want to show both in its search results.
Plagiarism, large quotations. Application: I am sure you know.
Similar topics/articles from various places. Application: cluster articles by "same story".
Google Images
Social network analysis
Finding Netflix users with similar tastes in movies, e.g., for personalized recommendation systems.
How can we compare similarity?
Convert the data (homework, webpages, images) into an object in an abstract space where we know how to measure distance, and how to do so efficiently.
What abstract space?
For example, the Euclidean space over R^d.
Or the L1 space, the L∞ space, ...
Three essential techniques
• k-grams: convert documents to sets.
• Min-hashing: convert large sets to short signatures, while preserving similarity.
• Locality-sensitive hashing (LSH): map the sets to some space so that similar sets will be close in distance.
The usual way of finding similar items:
Docs --(k-gram)--> a bag of strings of length k from the doc --(Min-hashing)--> a "signature" representing the sets --(LSH)--> "points" in some space
Jaccard Distance and Min-hashing
Convert documents to sets
k-gram: a sequence of k characters that appears in the document.
E.g., for doc = abcab and k = 2, the set of all 2-grams is {ab, bc, ca}.
Question: What is a good k?
Question: Shall we count replicas?
Question: How do we deal with whitespace and stop words (e.g., a, you, for, the, to, and, that)?
Question: Capitalization? Punctuation?
Question: Characters vs. words?
Question: What about different languages?
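The conversion above can be sketched in a few lines of Python (here we keep a set, i.e., replicas are not counted):

```python
def kgrams(doc: str, k: int) -> set:
    """Return the set of all k-character substrings (k-grams) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: doc = abcab, k = 2
print(kgrams("abcab", 2))  # -> {'ab', 'bc', 'ca'} (in some order)
```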
Jaccard similarity
Example: intersection size 3, union size 8, so the Jaccard similarity is 3/8.
Formally, the Jaccard similarity of two sets C1, C2 is
JS(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.
It can be turned into a distance function by defining dJS(C1, C2) = 1 − JS(C1, C2). In fact, any similarity measure can be converted to a distance function.
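A direct translation into Python (the convention JS(∅, ∅) = 1 for two empty sets is our assumption, not stated on the slide):

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """JS(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|."""
    if not c1 and not c2:
        return 1.0  # assumed convention for two empty sets
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1: set, c2: set) -> float:
    """d_JS = 1 - JS, turning the similarity into a distance."""
    return 1.0 - jaccard_similarity(c1, c2)

# Matching the slide's example: intersection size 3, union size 8
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(a, b))  # -> 0.375, i.e., 3/8
```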
Jaccard similarity with clusters
Consider two sets A = {0, 1, 2, 5, 6}, B = {0, 2, 3, 5, 7, 9}. What is the Jaccard similarity of A and B?
With clusters: we may have some items that basically represent the same thing, so we group objects into clusters. E.g.,
C1 = {0, 1, 2}, C2 = {3, 4}, C3 = {5, 6}, C4 = {7, 8, 9}.
For instance, C1 may represent action movies, C2 comedies, C3 documentaries, and C4 horror movies.
Now we can represent Aclu = {C1, C3}, Bclu = {C1, C2, C3, C4}, and
JSclu(A, B) = JS(Aclu, Bclu) = |{C1, C3}| / |{C1, C2, C3, C4}| = 0.5
Min-hashing
Given sets S1 = {1, 2, 5}, S2 = {3}, S3 = {2, 3, 4, 6}, S4 = {1, 4, 6}, we can represent them as a matrix:

Item  S1 S2 S3 S4
 1     1  0  0  1
 2     1  0  1  0
 3     0  1  1  0
 4     0  0  1  1
 5     1  0  0  0
 6     0  0  1  1

Step 1: Randomly permute the rows (here to the order 2, 5, 6, 1, 4, 3):

Item  S1 S2 S3 S4
 2     1  0  1  0
 5     1  0  0  0
 6     0  0  1  1
 1     1  0  0  1
 4     0  0  1  1
 3     0  1  1  0

Step 2: Record the row of the first 1 in each column:
m(S1) = 2, m(S2) = 3, m(S3) = 2, m(S4) = 6

Step 3: Estimate the Jaccard similarity JS(Si, Sj) as
X = 1 if m(Si) = m(Sj), 0 otherwise.
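The three steps above can be replayed in Python with the slide's sets and fixed permutation:

```python
S1, S2, S3, S4 = {1, 2, 5}, {3}, {2, 3, 4, 6}, {1, 4, 6}
perm = [2, 5, 6, 1, 4, 3]  # the permuted row order from Step 1

def minhash(s, perm):
    """Step 2: the first item (in permuted order) that belongs to s."""
    for item in perm:
        if item in s:
            return item
    raise ValueError("empty set")

sigs = [minhash(s, perm) for s in (S1, S2, S3, S4)]
print(sigs)  # -> [2, 3, 2, 6], i.e. m(S1)=2, m(S2)=3, m(S3)=2, m(S4)=6

# Step 3: X = 1 if m(Si) = m(Sj), else 0; for the pair (S1, S3), X = 1
X = int(sigs[0] == sigs[2])
```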
Min-hashing (cont.)
Lemma: Pr[m(Si) = m(Sj)] = E[X] = JS(Si, Sj). (proof on the board)

To approximate the Jaccard similarity within ε error with probability at least 1 − δ, repeat k = 1/ε² · log(1/δ) times and then take the average. (Chernoff bound on board)
Min-hashing (cont.)
How to implement Min-hashing efficiently?

Make one pass over the data.
Maintain k random hash functions {h1, h2, ..., hk} with each hj : [N] → [N] chosen at random (more on the next slide).
Maintain k counters {c1, ..., ck}, with each cj (1 ≤ j ≤ k) initialized to ∞.

Algorithm Min-hashing
  For each i ∈ S do
    for j = 1 to k do
      if hj(i) < cj then cj = hj(i)
  Output mj(S) = cj for each j ∈ [k]

Estimate: JSk(Si, Si′) = (1/k) Σ_{j=1}^{k} 1(mj(Si) = mj(Si′))

Collisions? Use the birthday paradox: they are unlikely if N is large enough.
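A sketch of this one-pass algorithm in Python, using hash functions of the form (a·i + b) mod p introduced on the next slide; the prime p = 2³¹ − 1 and the choice k = 200 are illustrative assumptions:

```python
import random

N = 2**31 - 1  # an illustrative prime bound on the hash range

def make_hash(p=N):
    """One random hash function h(i) = (a*i + b) mod p (see next slide)."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda i: (a * i + b) % p

def minhash_signature(s, hashes):
    """One pass over the set: per hash function, keep the minimum value seen."""
    counters = [float("inf")] * len(hashes)  # c_j initialized to infinity
    for i in s:
        for j, h in enumerate(hashes):
            counters[j] = min(counters[j], h(i))
    return counters

def estimate_js(sig1, sig2):
    """JS_k(Si, Si') = (1/k) * sum_j 1(m_j(Si) = m_j(Si'))."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

random.seed(0)
hashes = [make_hash() for _ in range(200)]  # k = 200 hash functions
sig_a = minhash_signature({1, 2, 5}, hashes)
sig_b = minhash_signature({2, 3, 4, 6}, hashes)
print(estimate_js(sig_a, sig_b))  # true JS is |{2}| / |{1,...,6}| = 1/6
```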
Random hashing functions
A random hash function h ∈ H maps from a set A to a set B so that, conditioned on the random choice of h ∈ H:
1. The location h(x) ∈ B is equally likely for every x ∈ A.
2. (Stronger) For any two x ≠ x′ ∈ A, h(x) and h(x′) are independent over the random choice of h ∈ H.

Even stronger: two → any disjoint subset. Good, but hard to achieve in theory.

Some practical hash functions h : Σ^k → [m]:
• Modular hashing: h(x) = x mod m
• Multiplicative hashing: h_a(x) = ⌊m · frac(x · a)⌋
• Using primes: let p ∈ [m, 2m] be a prime and a, b ∈_R [p]; then h(x) = ((ax + b) mod p) mod m
• Tabulation hashing: h(x) = H1[x1] ⊕ ... ⊕ Hk[xk], where ⊕ is bitwise exclusive-or and each Hi : Σ → [m] is a random mapping table (stored in a fast cache)
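The prime-based construction can be sketched as follows; we use the Mersenne prime 2³¹ − 1 instead of a prime in [m, 2m], an assumption made for convenience (any sufficiently large prime works for this illustration):

```python
import random

def make_prime_hash(m: int, p: int = 2**31 - 1):
    """h(x) = ((a*x + b) mod p) mod m, with p prime and a, b random in [p]."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

random.seed(42)
h = make_prime_hash(m=10)
print([h(x) for x in range(20)])  # 20 hash values, each in [0, 10)
```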
Finding similar items
Given a set of n = 1,000,000 items, we ask two questions:

Which items are similar?
We don't want to check all n² pairs.

Given a query item, which others are similar to that item?
We don't want to check all n items.

What if the set is n points in the plane R²?
Choice 1: hierarchical indexing trees (e.g., range tree, kd-tree, B-tree).
Problem: they do not scale to high dimensions.
Choice 2: lay down a (random) grid. This is similar to the locality-sensitive hashing that we will explore.
Locality Sensitive Hashing
Locality Sensitive Hashing (LSH)
Definition: h is (ℓ, u, pℓ, pu)-sensitive with respect to a distance function d if
• Pr[h(a) = h(b)] > pℓ if d(a, b) < ℓ.
• Pr[h(a) = h(b)] < pu if d(a, b) > u.
For this definition to make sense, we need pu < pℓ for u > ℓ.

Ideally, we want pℓ − pu to be large even when u − ℓ is small. Then we can repeat and further amplify this effect (next slide).

Define d(a, b) = 1 − JS(a, b). The family of min-hashing functions is (d1, d2, 1 − d1, 1 − d2)-sensitive for any 0 ≤ d1 < d2 ≤ 1.

Not good enough: the gap pℓ − pu = (1 − d1) − (1 − d2) = d2 − d1. Can we improve it?
Partition into bands
Docs  D1 D2 D3 D4
h1     1  2  4  0
h2     2  0  1  3
h3     5  3  3  0
h4     4  0  1  1
...
ht     1  2  3  0

⇒ split the t rows into b bands of r = t/b rows per band
Candidate pairs are those that hash to the same bucket in at least one band.
Tune b and r to catch most similar pairs, but few dissimilar pairs.
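A minimal banding sketch in Python; the signatures and the choice b = 2, r = 2 are made-up illustrations:

```python
from collections import defaultdict

def candidate_pairs(signatures, b, r):
    """LSH by banding: split each length-t signature into b bands of r rows;
    docs whose band (as a tuple) matches in at least one band are candidates."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # band value -> doc ids hashing there
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

# Hypothetical t = 4 signatures, b = 2 bands, r = 2 rows per band:
sigs = {"D1": [1, 2, 5, 4], "D2": [1, 2, 3, 0],
        "D3": [4, 1, 5, 4], "D4": [0, 3, 0, 1]}
print(sorted(candidate_pairs(sigs, b=2, r=2)))  # -> [('D1', 'D2'), ('D1', 'D3')]
```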
A bite of theory for LSH (using min-hashing)
s = JS(D1, D2) is the probability that D1 and D2 have a hash collision.
s^r = probability that all hashes collide in one band.
(1 − s^r) = probability that not all collide in one band.
(1 − s^r)^b = probability that in no band do all hashes collide.
f(s) = 1 − (1 − s^r)^b = probability that all hashes collide in at least one band.
A bite of theory for LSH (cont.)

E.g., if we choose t = 15, r = 3, and b = 5, then
f(0.1) = 0.005, f(0.2) = 0.04, f(0.3) = 0.13, f(0.4) = 0.28, f(0.5) = 0.48,
f(0.6) = 0.70, f(0.7) = 0.88, f(0.8) = 0.97, f(0.9) = 0.998.
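The S-curve values above can be reproduced directly from the formula:

```python
def f(s, r=3, b=5):
    """f(s) = 1 - (1 - s^r)^b: probability of colliding in at least one band."""
    return 1 - (1 - s ** r) ** b

# Reproduce the table for t = 15, r = 3, b = 5:
for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"f({s}) = {f(s):.3f}")
```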
A bite of theory for LSH (cont.)
Choice of r and b. Usually there is a budget of t hash functions one is willing to use. Given (a budget of) t hash functions, how does one divvy them up between r and b?

The threshold τ where f has the steepest slope is about τ ≈ (1/b)^{1/r}. So given a similarity s that we want to use as a cut-off, we can solve for r = t/b in s = (1/b)^{1/r} to yield r ≈ log_s(1/t).

If there is no budget on r and b, the S-curve gets sharper as they increase.
LSH using Euclidean distance
Besides the min-hashing "distance function", let's use the Euclidean distance function for LSH:
dE(u, v) = ‖u − v‖₂ = ( Σ_{i=1}^{d} (vi − ui)² )^{1/2}

Hashing function h:
1. First take a random unit vector u ∈ R^d. A unit vector u satisfies ‖u‖₂ = 1, that is, dE(u, 0) = 1. We will see how to generate a random unit vector later.
2. Project a, b ∈ P ⊂ R^d onto u: a_u = ⟨a, u⟩ = Σ_{i=1}^{d} ai · ui. This is contractive: |a_u − b_u| ≤ ‖a − b‖₂.
3. Create bins of size γ on u (that is, on R¹), better with a random shift z ∈ [0, γ). h(a) = index of the bin a falls into.
Property of LSH using Euclidean distance
This h is (γ/2, 2γ, 1/2, 1/3)-sensitive:
• If ‖a − b‖₂ < γ/2, then Pr[h(a) = h(b)] > 1/2.
• If ‖a − b‖₂ > 2γ, then Pr[h(a) = h(b)] < 1/3.
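A sketch of this hash family in Python; normalizing a vector of Gaussian components is one standard way to draw a random unit vector, and the points and γ = 1 are illustrative:

```python
import math
import random

def random_unit_vector(d):
    """Normalize a vector of Gaussian components: a uniform random direction."""
    g = [random.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def make_euclidean_lsh(d, gamma):
    """h(a) = index of the width-gamma bin that <a, u> + z falls into,
    for a random unit vector u and a random shift z in [0, gamma)."""
    u = random_unit_vector(d)
    z = random.uniform(0, gamma)
    def h(a):
        proj = sum(ai * ui for ai, ui in zip(a, u))  # contractive projection
        return math.floor((proj + z) / gamma)
    return h

random.seed(1)
h = make_euclidean_lsh(d=2, gamma=1.0)
a, b = (0.0, 0.0), (0.05, 0.05)
# |proj(a) - proj(b)| <= ||a - b||_2 < gamma, so their bins differ by at most 1
print(h(a), h(b))
```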
Entropy LSH
Problem with the basic LSH: it needs many hash functions h1, h2, ..., ht.
• They consume a lot of space.
• They are hard to implement in the distributed computation model.

Entropy LSH solves the first issue.
– When querying h(q), additionally query several "offsets" q + δi (1 ≤ i ≤ L), chosen randomly from the surface of a ball B(q, r) for some carefully chosen radius r.
– Intuition: the data points close to the query point q are highly likely to hash either to the same value as h(q) or to a value very close to it.
Far fewer hash functions are needed.
Distances
Distance functions
A distance d : X × X → R+ is a metric if:
(M1) (non-negativity) d(a, b) ≥ 0
(M2) (identity) d(a, b) = 0 if and only if a = b
(M3) (symmetry) d(a, b) = d(b, a)
(M4) (triangle inequality) d(a, b) ≤ d(a, c) + d(c, b)

A distance that satisfies (M1), (M3), and (M4) is called a pseudometric.
A distance that satisfies (M1), (M2), and (M4) is called a quasimetric.
Distance functions (cont.)
Lp distances on two vectors a = (a1, ..., ad), b = (b1, ..., bd):
dp(a, b) = ‖a − b‖p = ( Σ_{i=1}^{d} |ai − bi|^p )^{1/p}

L2: Euclidean distance.
L0: d0(a, b) = d − Σ_{i=1}^{d} 1(ai = bi), the number of coordinates where a and b differ. If ai, bi ∈ {0, 1}, this is called the Hamming distance.
L1: also known as the Manhattan distance. d1(a, b) = Σ_{i=1}^{d} |ai − bi|.
L∞: d∞(a, b) = max_{1≤i≤d} |ai − bi|
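These distances translate directly into Python (a self-contained sketch):

```python
def lp_distance(a, b, p):
    """dp(a, b) = (sum_i |ai - bi|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def l0_distance(a, b):
    """Number of coordinates where a and b differ (Hamming for 0/1 vectors)."""
    return sum(x != y for x, y in zip(a, b))

def linf_distance(a, b):
    """d_inf(a, b) = max_i |ai - bi|."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0, 3, 1), (4, 0, 1)
print(lp_distance(a, b, 1))  # -> 7.0 (Manhattan)
print(lp_distance(a, b, 2))  # -> 5.0 (Euclidean: sqrt(16 + 9))
print(l0_distance(a, b))     # -> 2
print(linf_distance(a, b))   # -> 4
```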
Distance functions (cont.)

Jaccard distance:
dJ(A, B) = 1 − JS(A, B) = 1 − |A ∩ B| / |A ∪ B|.

Proof (on board)
Distance functions (cont.)
Cosine distance:
dcos(a, b) = 1 − ⟨a, b⟩ / (‖a‖₂ ‖b‖₂) = 1 − ( Σ_{i=1}^{d} ai·bi ) / (‖a‖₂ ‖b‖₂)

A pseudometric.

We can also develop an LSH function h for dcos(·, ·), as follows. Choose a random vector v ∈ R^d. Then let
h_v(a) = +1 if ⟨v, a⟩ > 0, and −1 otherwise.

It is (γ, φ, (π − γ)/π, φ/π)-sensitive for any γ < φ ∈ [0, π].
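Both the distance and the random-hyperplane hash can be sketched as follows (the example vectors are illustrative):

```python
import math
import random

def cosine_distance(a, b):
    """d_cos(a, b) = 1 - <a, b> / (||a||_2 * ||b||_2)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def make_hyperplane_hash(d):
    """h_v(a) = +1 if <v, a> > 0 else -1, for a random vector v."""
    v = [random.gauss(0, 1) for _ in range(d)]
    return lambda a: 1 if sum(x * y for x, y in zip(v, a)) > 0 else -1

print(cosine_distance((1, 0), (0, 1)))  # -> 1.0 (orthogonal vectors)
random.seed(0)
h = make_hyperplane_hash(d=2)
print(h((1, 0)), h((2, 0)))  # same direction -> same sign
```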
Distance functions (cont.)
Edit distance. Consider two strings a, b ∈ Σ*, and define
ded(a, b) = # operations to make a = b,
where an operation can insert a letter, delete a letter, or substitute a letter.

Application: Google's auto-correct.
It is expensive to compute.
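The standard dynamic-programming computation (Wagner-Fischer) takes O(|a|·|b|) time, which is why it is considered expensive; a sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: min # of insertions, deletions, substitutions."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from a
                           dp[i][j - 1] + 1,         # insert into a
                           dp[i - 1][j - 1] + cost)  # substitute (or keep)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3
```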
Distance functions (cont.)

Graph distance. Given a graph G = (V, E), where V = {v1, ..., vn} and E = {e1, ..., em}, the graph distance dG between any pair vi, vj ∈ V is defined as the length of the shortest path between vi and vj.
Distance functions (cont.)
Kullback-Leibler (KL) divergence. Given two discrete distributions P = {p1, ..., pd} and Q = {q1, ..., qd},
dKL(P, Q) = Σ_{i=1}^{d} pi ln(pi/qi).

A distance that is NOT a metric.
It can be written as H(P, Q) − H(P).
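A direct implementation, which also demonstrates the asymmetry that disqualifies dKL as a metric (the example distributions are made up):

```python
import math

def kl_divergence(p, q):
    """d_KL(P, Q) = sum_i p_i * ln(p_i / q_i); requires q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl_divergence(P, P))  # -> 0.0
print(kl_divergence(P, Q), kl_divergence(Q, P))  # not symmetric
```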
Distance functions (cont.)
Given two multisets A, B in the grid [Δ]² with |A| = |B| = N, the Earth-Mover Distance (EMD) is defined as the minimum cost of a perfect matching between points in A and B, that is,
EMD(A, B) = min_{π:A→B} Σ_{a∈A} ‖a − π(a)‖₁.

Good for comparing images.
Thank you!
Some slides are based on the MMDS book, http://infolab.stanford.edu/~ullman/mmds.html,
and Jeff Phillips' lecture notes, http://www.cs.utah.edu/~jeffp/teaching/cs5955.html