B490 Mining the Big Data
Qin Zhang
§1 Finding Similar Items
Motivations
Finding similar documents/webpages/images
(Approximate) mirror sites. Application: Google doesn't want to show both in its search results.
Plagiarism, large quotations. Application: I am sure you know.
Similar topics/articles from various places. Application: cluster articles by "same story".
Google Images
Social network analysis
Finding Netflix users with similar tastes in movies, e.g., for personalized recommendation systems.
How can we compare similarity?
Convert the data (homework, webpages, images) into an object in an abstract space where we know how to measure distance, and how to do so efficiently.
What abstract space?
For example, the Euclidean space over R^d.
Or the L1 space, the L∞ space, ...
Three essential techniques
• k-grams: convert documents to sets.
• Min-hashing: convert large sets to short signatures, while preserving similarity.
• Locality-sensitive hashing (LSH): map the sets to some space so that similar sets will be close in distance.
The usual way of finding similar items:
Docs --(k-gram)--> a bag of strings of length k from the doc --(Min-hashing)--> a "signature" representing the sets --(LSH)--> "points" in some space
Jaccard Distance and Min-hashing
Convert documents to sets
k-gram: a sequence of k characters that appears in the document.
E.g., for doc = abcab and k = 2, the set of all 2-grams is {ab, bc, ca}.
Question: What is a good k?
Question: Shall we count replicas?
Question: How do we deal with whitespace and stop words (e.g., a, you, for, the, to, and, that)?
Question: Capitalization? Punctuation?
Question: Characters vs. words?
Question: What about different languages?
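The conversion above can be sketched in a few lines of Python (here we keep a set, i.e., replicas are not counted):

```python
def kgrams(doc: str, k: int) -> set:
    """Return the set of all k-character substrings (k-grams) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: doc = abcab, k = 2
print(kgrams("abcab", 2))  # -> {'ab', 'bc', 'ca'} (in some order)
```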
Jaccard similarity
Example: intersection size 3, union size 8, so the Jaccard similarity is 3/8.
Formally, the Jaccard similarity of two sets C1, C2 is
JS(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.
It can be turned into a distance function by defining dJS(C1, C2) = 1 − JS(C1, C2). In fact, any similarity measure can be converted to a distance function.
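A direct translation into Python (the convention JS(∅, ∅) = 1 for two empty sets is our assumption, not stated on the slide):

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """JS(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|."""
    if not c1 and not c2:
        return 1.0  # assumed convention for two empty sets
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1: set, c2: set) -> float:
    """d_JS = 1 - JS, turning the similarity into a distance."""
    return 1.0 - jaccard_similarity(c1, c2)

# Matching the slide's example: intersection size 3, union size 8
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(a, b))  # -> 0.375, i.e., 3/8
```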
Jaccard similarity with clusters
Consider two sets A = {0, 1, 2, 5, 6}, B = {0, 2, 3, 5, 7, 9}. What is the Jaccard similarity of A and B?
With clusters: we may have some items that basically represent the same thing, so we group objects into clusters. E.g.,
C1 = {0, 1, 2}, C2 = {3, 4}, C3 = {5, 6}, C4 = {7, 8, 9}.
For instance, C1 may represent action movies, C2 comedies, C3 documentaries, and C4 horror movies.
Now we can represent Aclu = {C1, C3}, Bclu = {C1, C2, C3, C4}, and
JSclu(A, B) = JS(Aclu, Bclu) = |{C1, C3}| / |{C1, C2, C3, C4}| = 0.5
Min-hashing
Given sets S1 = {1, 2, 5}, S2 = {3}, S3 = {2, 3, 4, 6}, S4 = {1, 4, 6}, we can represent them as a matrix:

Item  S1 S2 S3 S4
 1     1  0  0  1
 2     1  0  1  0
 3     0  1  1  0
 4     0  0  1  1
 5     1  0  0  0
 6     0  0  1  1

Step 1: Randomly permute the rows (here to the order 2, 5, 6, 1, 4, 3):

Item  S1 S2 S3 S4
 2     1  0  1  0
 5     1  0  0  0
 6     0  0  1  1
 1     1  0  0  1
 4     0  0  1  1
 3     0  1  1  0

Step 2: Record the row of the first 1 in each column:
m(S1) = 2, m(S2) = 3, m(S3) = 2, m(S4) = 6

Step 3: Estimate the Jaccard similarity JS(Si, Sj) as
X = 1 if m(Si) = m(Sj), 0 otherwise.
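The three steps above can be replayed in Python with the slide's sets and fixed permutation:

```python
S1, S2, S3, S4 = {1, 2, 5}, {3}, {2, 3, 4, 6}, {1, 4, 6}
perm = [2, 5, 6, 1, 4, 3]  # the permuted row order from Step 1

def minhash(s, perm):
    """Step 2: the first item (in permuted order) that belongs to s."""
    for item in perm:
        if item in s:
            return item
    raise ValueError("empty set")

sigs = [minhash(s, perm) for s in (S1, S2, S3, S4)]
print(sigs)  # -> [2, 3, 2, 6], i.e. m(S1)=2, m(S2)=3, m(S3)=2, m(S4)=6

# Step 3: X = 1 if m(Si) = m(Sj), else 0; for the pair (S1, S3), X = 1
X = int(sigs[0] == sigs[2])
```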
Min-hashing (cont.)
Lemma: Pr[m(Si) = m(Sj)] = E[X] = JS(Si, Sj). (proof on the board)

To approximate the Jaccard similarity within ε error with probability at least 1 − δ, repeat k = 1/ε² · log(1/δ) times and then take the average. (Chernoff bound on board)
Min-hashing (cont.)
How to implement Min-hashing efficiently?

Make one pass over the data.
Maintain k random hash functions {h1, h2, ..., hk} with each hj : [N] → [N] chosen at random (more on the next slide).
Maintain k counters {c1, ..., ck}, with each cj (1 ≤ j ≤ k) initialized to ∞.

Algorithm Min-hashing
  For each i ∈ S do
    for j = 1 to k do
      if hj(i) < cj then cj = hj(i)
  Output mj(S) = cj for each j ∈ [k]

Estimate: JSk(Si, Si′) = (1/k) Σ_{j=1}^{k} 1(mj(Si) = mj(Si′))

Collisions? Use the birthday paradox: they are unlikely if N is large enough.
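A sketch of this one-pass algorithm in Python, using hash functions of the form (a·i + b) mod p introduced on the next slide; the prime p = 2³¹ − 1 and the choice k = 200 are illustrative assumptions:

```python
import random

N = 2**31 - 1  # an illustrative prime bound on the hash range

def make_hash(p=N):
    """One random hash function h(i) = (a*i + b) mod p (see next slide)."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda i: (a * i + b) % p

def minhash_signature(s, hashes):
    """One pass over the set: per hash function, keep the minimum value seen."""
    counters = [float("inf")] * len(hashes)  # c_j initialized to infinity
    for i in s:
        for j, h in enumerate(hashes):
            counters[j] = min(counters[j], h(i))
    return counters

def estimate_js(sig1, sig2):
    """JS_k(Si, Si') = (1/k) * sum_j 1(m_j(Si) = m_j(Si'))."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

random.seed(0)
hashes = [make_hash() for _ in range(200)]  # k = 200 hash functions
sig_a = minhash_signature({1, 2, 5}, hashes)
sig_b = minhash_signature({2, 3, 4, 6}, hashes)
print(estimate_js(sig_a, sig_b))  # true JS is |{2}| / |{1,...,6}| = 1/6
```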
Random hashing functions
A random hash function h ∈ H maps from a set A to a set B so that, conditioned on the random choice of h ∈ H:
1. The location h(x) ∈ B is equally likely for every x ∈ A.
2. (Stronger) For any two x ≠ x′ ∈ A, h(x) and h(x′) are independent over the random choice of h ∈ H.

Even stronger: two → any disjoint subset. Good, but hard to achieve in theory.

Some practical hash functions h : Σ^k → [m]:
• Modular hashing: h(x) = x mod m
• Multiplicative hashing: h_a(x) = ⌊m · frac(x · a)⌋
• Using primes: let p ∈ [m, 2m] be a prime and a, b ∈_R [p]; then h(x) = ((ax + b) mod p) mod m
• Tabulation hashing: h(x) = H1[x1] ⊕ ... ⊕ Hk[xk], where ⊕ is bitwise exclusive-or and each Hi : Σ → [m] is a random mapping table (stored in a fast cache)
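The prime-based construction can be sketched as follows; we use the Mersenne prime 2³¹ − 1 instead of a prime in [m, 2m], an assumption made for convenience (any sufficiently large prime works for this illustration):

```python
import random

def make_prime_hash(m: int, p: int = 2**31 - 1):
    """h(x) = ((a*x + b) mod p) mod m, with p prime and a, b random in [p]."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

random.seed(42)
h = make_prime_hash(m=10)
print([h(x) for x in range(20)])  # 20 hash values, each in [0, 10)
```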
Finding similar items
Given a set of n = 1,000,000 items, we ask two questions:

Which items are similar?
We don't want to check all n² pairs.

Given a query item, which others are similar to that item?
We don't want to check all n items.

What if the set is n points in the plane R²?
Choice 1: hierarchical indexing trees (e.g., range tree, kd-tree, B-tree).
Problem: they do not scale to high dimensions.
Choice 2: lay down a (random) grid. This is similar to the locality-sensitive hashing that we will explore.
Locality Sensitive Hashing
Locality Sensitive Hashing (LSH)
Definition: h is (ℓ, u, pℓ, pu)-sensitive with respect to a distance function d if
• Pr[h(a) = h(b)] > pℓ if d(a, b) < ℓ.
• Pr[h(a) = h(b)] < pu if d(a, b) > u.
For this definition to make sense, we need pu < pℓ for u > ℓ.

Ideally, we want pℓ − pu to be large even when u − ℓ is small. Then we can repeat and further amplify this effect (next slide).

Define d(a, b) = 1 − JS(a, b). The family of min-hashing functions is (d1, d2, 1 − d1, 1 − d2)-sensitive for any 0 ≤ d1 < d2 ≤ 1.

Not good enough: the gap pℓ − pu = (1 − d1) − (1 − d2) = d2 − d1. Can we improve it?
Partition into bands
Docs  D1 D2 D3 D4
h1     1  2  4  0
h2     2  0  1  3
h3     5  3  3  0
h4     4  0  1  1
...
ht     1  2  3  0

⇒ split the t rows into b bands of r = t/b rows per band
Candidate pairs are those that hash to the same bucket in at least one band.
Tune b and r to catch most similar pairs, but few dissimilar pairs.
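A minimal banding sketch in Python; the signatures and the choice b = 2, r = 2 are made-up illustrations:

```python
from collections import defaultdict

def candidate_pairs(signatures, b, r):
    """LSH by banding: split each length-t signature into b bands of r rows;
    docs whose band (as a tuple) matches in at least one band are candidates."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # band value -> doc ids hashing there
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

# Hypothetical t = 4 signatures, b = 2 bands, r = 2 rows per band:
sigs = {"D1": [1, 2, 5, 4], "D2": [1, 2, 3, 0],
        "D3": [4, 1, 5, 4], "D4": [0, 3, 0, 1]}
print(sorted(candidate_pairs(sigs, b=2, r=2)))  # -> [('D1', 'D2'), ('D1', 'D3')]
```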
A bite of theory for LSH (using min-hashing)
s = JS(D1, D2) is the probability that D1 and D2 have a hash collision.
s^r = probability that all hashes collide in one band.
(1 − s^r) = probability that not all collide in one band.
(1 − s^r)^b = probability that in no band do all hashes collide.
f(s) = 1 − (1 − s^r)^b = probability that all hashes collide in at least one band.
A bite of theory for LSH (cont.)

E.g., if we choose t = 15, r = 3, and b = 5, then
f(0.1) = 0.005, f(0.2) = 0.04, f(0.3) = 0.13, f(0.4) = 0.28, f(0.5) = 0.48,
f(0.6) = 0.70, f(0.7) = 0.88, f(0.8) = 0.97, f(0.9) = 0.998.
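The S-curve values above can be reproduced directly from the formula:

```python
def f(s, r=3, b=5):
    """f(s) = 1 - (1 - s^r)^b: probability of colliding in at least one band."""
    return 1 - (1 - s ** r) ** b

# Reproduce the table for t = 15, r = 3, b = 5:
for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"f({s}) = {f(s):.3f}")
```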
A bite of theory for LSH (cont.)
Choice of r and b. Usually there is a budget of t hash functions one is willing to use. Given (a budget of) t hash functions, how does one divvy them up between r and b?

The threshold τ where f has the steepest slope is about τ ≈ (1/b)^{1/r}. So given a similarity s that we want to use as a cut-off, we can solve for r = t/b in s = (1/b)^{1/r} to yield r ≈ log_s(1/t).

If there is no budget on r and b, the S-curve gets sharper as they increase.
LSH using Euclidean distance
Besides the min-hashing "distance function", let's use the Euclidean distance function for LSH:
dE(u, v) = ‖u − v‖₂ = ( Σ_{i=1}^{d} (vi − ui)² )^{1/2}

Hashing function h:
1. First take a random unit vector u ∈ R^d. A unit vector u satisfies ‖u‖₂ = 1, that is, dE(u, 0) = 1. We will see how to generate a random unit vector later.
2. Project a, b ∈ P ⊂ R^d onto u: a_u = ⟨a, u⟩ = Σ_{i=1}^{d} ai · ui. This is contractive: |a_u − b_u| ≤ ‖a − b‖₂.
3. Create bins of size γ on u (that is, on R¹), better with a random shift z ∈ [0, γ). h(a) = index of the bin a falls into.
Property of LSH using Euclidean distance
This h is (γ/2, 2γ, 1/2, 1/3)-sensitive:
• If ‖a − b‖₂ < γ/2, then Pr[h(a) = h(b)] > 1/2.
• If ‖a − b‖₂ > 2γ, then Pr[h(a) = h(b)] < 1/3.
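A sketch of this hash family in Python; normalizing a vector of Gaussian components is one standard way to draw a random unit vector, and the points and γ = 1 are illustrative:

```python
import math
import random

def random_unit_vector(d):
    """Normalize a vector of Gaussian components: a uniform random direction."""
    g = [random.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def make_euclidean_lsh(d, gamma):
    """h(a) = index of the width-gamma bin that <a, u> + z falls into,
    for a random unit vector u and a random shift z in [0, gamma)."""
    u = random_unit_vector(d)
    z = random.uniform(0, gamma)
    def h(a):
        proj = sum(ai * ui for ai, ui in zip(a, u))  # contractive projection
        return math.floor((proj + z) / gamma)
    return h

random.seed(1)
h = make_euclidean_lsh(d=2, gamma=1.0)
a, b = (0.0, 0.0), (0.05, 0.05)
# |proj(a) - proj(b)| <= ||a - b||_2 < gamma, so their bins differ by at most 1
print(h(a), h(b))
```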
Entropy LSH
Problem with the basic LSH: it needs many hash functions h1, h2, ..., ht.
• They consume a lot of space.
• They are hard to implement in the distributed computation model.

Entropy LSH solves the first issue.
– When querying h(q), additionally query several "offsets" q + δi (1 ≤ i ≤ L), chosen randomly from the surface of a ball B(q, r) for some carefully chosen radius r.
– Intuition: the data points close to the query point q are highly likely to hash either to the same value as h(q) or to a value very close to it.
Far fewer hash functions are needed.
Distances
Distance functions
A distance d : X × X → R+ is a metric if:
(M1) (non-negativity) d(a, b) ≥ 0
(M2) (identity) d(a, b) = 0 if and only if a = b
(M3) (symmetry) d(a, b) = d(b, a)
(M4) (triangle inequality) d(a, b) ≤ d(a, c) + d(c, b)

A distance that satisfies (M1), (M3), and (M4) is called a pseudometric.
A distance that satisfies (M1), (M2), and (M4) is called a quasimetric.
Distance functions (cont.)
Lp distances on two vectors a = (a1, ..., ad), b = (b1, ..., bd):
dp(a, b) = ‖a − b‖p = ( Σ_{i=1}^{d} |ai − bi|^p )^{1/p}

L2: Euclidean distance.
L0: d0(a, b) = d − Σ_{i=1}^{d} 1(ai = bi), the number of coordinates where a and b differ. If ai, bi ∈ {0, 1}, this is called the Hamming distance.
L1: also known as the Manhattan distance. d1(a, b) = Σ_{i=1}^{d} |ai − bi|.
L∞: d∞(a, b) = max_{1≤i≤d} |ai − bi|
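These distances translate directly into Python (a self-contained sketch):

```python
def lp_distance(a, b, p):
    """dp(a, b) = (sum_i |ai - bi|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def l0_distance(a, b):
    """Number of coordinates where a and b differ (Hamming for 0/1 vectors)."""
    return sum(x != y for x, y in zip(a, b))

def linf_distance(a, b):
    """d_inf(a, b) = max_i |ai - bi|."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0, 3, 1), (4, 0, 1)
print(lp_distance(a, b, 1))  # -> 7.0 (Manhattan)
print(lp_distance(a, b, 2))  # -> 5.0 (Euclidean: sqrt(16 + 9))
print(l0_distance(a, b))     # -> 2
print(linf_distance(a, b))   # -> 4
```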
Distance functions (cont.)

Jaccard distance:
dJ(A, B) = 1 − JS(A, B) = 1 − |A ∩ B| / |A ∪ B|.

Proof (on board)
Distance functions (cont.)
Cosine distance:
dcos(a, b) = 1 − ⟨a, b⟩ / (‖a‖₂ ‖b‖₂) = 1 − ( Σ_{i=1}^{d} ai·bi ) / (‖a‖₂ ‖b‖₂)

A pseudometric.

We can also develop an LSH function h for dcos(·, ·), as follows. Choose a random vector v ∈ R^d. Then let
h_v(a) = +1 if ⟨v, a⟩ > 0, and −1 otherwise.

It is (γ, φ, (π − γ)/π, φ/π)-sensitive for any γ < φ ∈ [0, π].
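Both the distance and the random-hyperplane hash can be sketched as follows (the example vectors are illustrative):

```python
import math
import random

def cosine_distance(a, b):
    """d_cos(a, b) = 1 - <a, b> / (||a||_2 * ||b||_2)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def make_hyperplane_hash(d):
    """h_v(a) = +1 if <v, a> > 0 else -1, for a random vector v."""
    v = [random.gauss(0, 1) for _ in range(d)]
    return lambda a: 1 if sum(x * y for x, y in zip(v, a)) > 0 else -1

print(cosine_distance((1, 0), (0, 1)))  # -> 1.0 (orthogonal vectors)
random.seed(0)
h = make_hyperplane_hash(d=2)
print(h((1, 0)), h((2, 0)))  # same direction -> same sign
```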
Distance functions (cont.)
Edit distance. Consider two strings a, b ∈ Σ*, and define
ded(a, b) = # operations to make a = b,
where an operation can insert a letter, delete a letter, or substitute a letter.

Application: Google's auto-correct.
It is expensive to compute.
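The standard dynamic-programming computation (Wagner-Fischer) takes O(|a|·|b|) time, which is why it is considered expensive; a sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: min # of insertions, deletions, substitutions."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from a
                           dp[i][j - 1] + 1,         # insert into a
                           dp[i - 1][j - 1] + cost)  # substitute (or keep)
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3
```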
Distance functions (cont.)

Graph distance. Given a graph G = (V, E), where V = {v1, ..., vn} and E = {e1, ..., em}, the graph distance dG between any pair vi, vj ∈ V is defined as the length of the shortest path between vi and vj.
Distance functions (cont.)
Kullback-Leibler (KL) divergence. Given two discrete distributions P = {p1, ..., pd} and Q = {q1, ..., qd},
dKL(P, Q) = Σ_{i=1}^{d} pi ln(pi/qi).

A distance that is NOT a metric.
It can be written as H(P, Q) − H(P).
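A direct implementation, which also demonstrates the asymmetry that disqualifies dKL as a metric (the example distributions are made up):

```python
import math

def kl_divergence(p, q):
    """d_KL(P, Q) = sum_i p_i * ln(p_i / q_i); requires q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl_divergence(P, P))  # -> 0.0
print(kl_divergence(P, Q), kl_divergence(Q, P))  # not symmetric
```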
Distance functions (cont.)
Given two multisets A, B in the grid [Δ]² with |A| = |B| = N, the Earth-Mover Distance (EMD) is defined as the minimum cost of a perfect matching between points in A and B, that is,
EMD(A, B) = min_{π:A→B} Σ_{a∈A} ‖a − π(a)‖₁.

Good for comparing images.
Thank you!
Some slides are based on the MMDS book, http://infolab.stanford.edu/~ullman/mmds.html,
and Jeff Phillips' lecture notes, http://www.cs.utah.edu/~jeffp/teaching/cs5955.html