COMP4121 Advanced Algorithms - UNSW Engineering


Transcript of COMP4121 Advanced Algorithms - UNSW Engineering

Page 1: COMP4121 Advanced Algorithms - UNSW Engineering

THE UNIVERSITY OF NEW SOUTH WALES

COMP4121 Advanced Algorithms

Aleks Ignjatović

School of Computer Science and Engineering, University of New South Wales

More randomised algorithms: Gaussian Annulus, Random Projection and Johnson-Lindenstrauss Lemmas

Details to be found on pages 12-27 of the Blum, Hopcroft and Kannan textbook


Page 2: COMP4121 Advanced Algorithms - UNSW Engineering

Generating random points in d-dimensional spaces $\mathbb{R}^d$

Let us consider a Gaussian random variable $X$ with zero mean ($E[X] = \mu = 0$) and variance $V[X] = v = 1/(2\pi)$; its density is given by

$$f_X(x) = \frac{1}{\sqrt{2\pi v}}\, e^{-\frac{x^2}{2v}} = e^{-\pi x^2}$$

Assume that we use such $X$ to generate independently the coordinates $(x_1, \ldots, x_d)$ of a random vector $\vec{x}$ from $\mathbb{R}^d$.

Since $E[X] = 0$ we have $E[X^2] = E[(X - E[X])^2] = V[X] = 1/(2\pi)$.

Thus, also

$$E\left[\frac{X_1^2 + \ldots + X_d^2}{d}\right] = \frac{d\, V[X]}{d} = V[X].$$

We denote by $|x|$ the norm (here just the length) of a vector $x$,

$$|x| = \sqrt{x_1^2 + \ldots + x_d^2}.$$

We now have $E\left[\frac{|x|^2}{d}\right] = V[X] = 1/(2\pi)$.

Thus the expected value of the square of the length of such a random vector is $E[|x|^2] = d/(2\pi)$.

So, on average, $|x| \approx \frac{\sqrt{d}}{\sqrt{2\pi}} = \Theta(\sqrt{d})$.
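This concentration is easy to verify numerically. A minimal sketch (assuming numpy; the dimension and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                       # dimension; illustrative choice
v = 1 / (2 * np.pi)              # variance of each coordinate, as on the slide

# Draw 1000 random vectors with i.i.d. N(0, v) coordinates.
x = rng.normal(0.0, np.sqrt(v), size=(1000, d))
lengths = np.linalg.norm(x, axis=1)

print("mean |x|^2:", (lengths**2).mean())    # close to d/(2*pi)
print("d/(2*pi)  :", d / (2 * np.pi))
print("std of |x|:", lengths.std())          # tiny compared to the mean of |x|
```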


Page 3: COMP4121 Advanced Algorithms - UNSW Engineering

Generating random points in d-dimensional spaces $\mathbb{R}^d$

Also, by the Law of Large Numbers,

$$P\left(\left|\frac{x_1^2 + \ldots + x_d^2}{d} - V[X]\right| > \varepsilon\right) \le \frac{V[X^2]}{d\,\varepsilon^2}$$

Thus,

$$P\left(\left|\frac{|x|^2}{d} - \frac{1}{2\pi}\right| > \varepsilon\right) \le \frac{V[X^2]}{d\,\varepsilon^2}, \quad \text{i.e.,} \quad P\left(\left|\frac{|x|^2}{d} - \frac{1}{2\pi}\right| \le \varepsilon\right) \ge 1 - \frac{V[X^2]}{d\,\varepsilon^2}$$

This implies that, if $d$ is large, then with high probability

$$\frac{|x|^2}{d} \approx \frac{1}{2\pi} \;\Rightarrow\; |x| \approx \sqrt{\frac{d}{2\pi}} = \Theta(\sqrt{d}),$$

i.e., the point thus generated is at a distance $\Theta(\sqrt{d})$ from the origin.

If we choose 2 points independently, then

$$E[\langle x, y\rangle] = E[x_1 y_1 + \ldots + x_d y_d] = d\, E[XY] = d\, E[X]\, E[Y] = 0$$

This means that the expected value of the scalar product of any two vectors with independently randomly chosen coordinates is zero. So, intuitively, vectors with randomly chosen coordinates have approximately the same length $\Theta(\sqrt{d})$ and any two such random vectors are likely to be almost orthogonal!
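Near-orthogonality can likewise be observed directly (again a sketch with illustrative sizes; the coordinate variance does not matter for the angle):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10_000
x, y = rng.normal(size=(2, d))   # two independent random vectors

cos_angle = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print("cosine of angle between x and y:", cos_angle)  # ~ 1/sqrt(d), nearly 0
```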

Page 4: COMP4121 Advanced Algorithms - UNSW Engineering

High dimensional spaces are quite counter-intuitive!

The above should hint that high dimensional spaces are quite counter-intuitive.

We have to be careful not to transfer our 3D intuition to high dimensional spaces.

We now show two strange facts about high dimensional balls:

1. Most of the volume of a high dimensional ball is near any of its equators. By this we mean that if we cut a high dimensional ball with any two parallel hyper-planes symmetric with respect to the center and close to the center, then most of the volume of the ball is between these two hyper-planes.

2. Most of the volume of a high dimensional ball is near its surface. By this we mean that if we consider a ball of radius $r$ and a slightly smaller ball of radius $r(1-\varepsilon)$, then most of the volume of the bigger ball is in the annulus outside the smaller ball.

These facts are not just curiosities, but have importance for algorithms we will study. We now make these facts more rigorous.

Page 5: COMP4121 Advanced Algorithms - UNSW Engineering

Generating random points in d-dimensional spaces $\mathbb{R}^d$

We first note that the volume of a sphere of radius $r$ is proportional to $r^d$, in the sense that, if we denote by $V(d)$ the volume of the d-dimensional ball of radius 1, then the volume of a d-dimensional ball of radius $r$ is equal to $r^d V(d)$. (This is clear for the d-dimensional cube, and it follows for any other solid by the exhaustion principle: we approximate the solid by a union of small disjoint hypercubes.)

This implies that the volume of a d-dimensional ball can be represented as the following integral:

$$V(d) = \int_{-1}^{1} \left(\sqrt{1 - x_1^2}\right)^{d-1} V(d-1)\, dx_1$$

[Figure: the unit ball sliced at coordinate $x_1$; the cross-section is a $(d-1)$-dimensional ball of radius $\sqrt{1 - x_1^2}$, so it has volume $\left(\sqrt{1 - x_1^2}\right)^{d-1} V(d-1)$.]
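This recursion is easy to check numerically against the standard closed form $V(d) = \pi^{d/2}/\Gamma(d/2+1)$ (which is not derived on these slides). A sketch, assuming scipy:

```python
import math
from scipy.integrate import quad

def V(d):
    """Volume of the unit ball in R^d via the slide's recursion."""
    if d == 1:
        return 2.0  # the interval [-1, 1]
    integral, _ = quad(lambda x1: (1 - x1**2) ** ((d - 1) / 2), -1, 1)
    return integral * V(d - 1)

for d in (2, 3, 10):
    closed_form = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    print(d, V(d), closed_form)  # the two columns agree
```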


Page 6: COMP4121 Advanced Algorithms - UNSW Engineering

Generating random points in d-dimensional spaces $\mathbb{R}^d$

[Figure: the slice of the unit ball lying above the hyperplane $x_1 = \frac{c}{\sqrt{d-1}}$; the cross-section at height $x_1$ has radius $\sqrt{1 - x_1^2}$.]

Let $c$ be a constant. Consider the portion $S$ of a unit ball consisting of all points such that their $x_1$ coordinate satisfies $x_1 \ge \frac{c}{\sqrt{d-1}}$.

Let $A$ be the volume of such a slice $S$. Then

$$A = \int_{\frac{c}{\sqrt{d-1}}}^{1} \left(\sqrt{1 - x_1^2}\right)^{d-1} V(d-1)\, dx_1$$

(Recall that our goal is to show that two randomly and independently generated vectors are almost certainly almost orthogonal; to achieve this we need to perform some mumbo-jumbo whose purpose will be clear a bit later.)


Page 7: COMP4121 Advanced Algorithms - UNSW Engineering

Generating random points in d-dimensional spaces $\mathbb{R}^d$

We need the inequality $1 - x \le e^{-x}$ for all $0 \le x \le 1$.

For a proof, see the refresher notes on probability and statistics, available at the course web site.

Using such an inequality, from what we have,

$$A = \int_{\frac{c}{\sqrt{d-1}}}^{1} \left(\sqrt{1 - x_1^2}\right)^{d-1} V(d-1)\, dx_1$$

we obtain

$$A < \int_{\frac{c}{\sqrt{d-1}}}^{1} \left(e^{-x_1^2}\right)^{\frac{d-1}{2}} V(d-1)\, dx_1$$

Over the slice $S$ we have $x_1 \ge \frac{c}{\sqrt{d-1}}$, which implies $\frac{x_1\sqrt{d-1}}{c} \ge 1$, and this implies

$$A < \int_{\frac{c}{\sqrt{d-1}}}^{1} \frac{x_1\sqrt{d-1}}{c}\, e^{-x_1^2\,\frac{d-1}{2}}\, V(d-1)\, dx_1$$

The purpose of adding this term is to allow substitution of the variable $x_1$ with the variable $u = x_1^2$; then $du = 2x_1\, dx_1$.


Page 8: COMP4121 Advanced Algorithms - UNSW Engineering

Generating random points in d-dimensional spaces $\mathbb{R}^d$

Thus, with $u = x_1^2$ we have $du = 2x_1\, dx_1$ and so, from

$$A < \int_{\frac{c}{\sqrt{d-1}}}^{1} \frac{x_1\sqrt{d-1}}{c}\, e^{-x_1^2\,\frac{d-1}{2}}\, V(d-1)\, dx_1$$

we obtain

$$A < \int_{\frac{c^2}{d-1}}^{1} \frac{\sqrt{d-1}}{c}\, e^{-u\,\frac{d-1}{2}}\, V(d-1)\, \frac{1}{2}\, du < \frac{\sqrt{d-1}}{2c}\, V(d-1) \int_{\frac{c^2}{d-1}}^{\infty} e^{-u\,\frac{d-1}{2}}\, du$$

$$= -\frac{\sqrt{d-1}}{2c}\, V(d-1)\, \frac{1}{\frac{d-1}{2}}\, e^{-u\,\frac{d-1}{2}}\, \Big|_{u=\frac{c^2}{d-1}}^{u=\infty} = \frac{\sqrt{d-1}}{2c}\, V(d-1)\, \frac{1}{\frac{d-1}{2}}\, e^{-\frac{d-1}{2}\cdot\frac{c^2}{d-1}}$$

$$= \frac{1}{c\sqrt{d-1}}\, V(d-1)\, e^{-\frac{c^2}{2}}$$
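The final bound can be sanity-checked numerically against the exact integral; a sketch assuming scipy (the common factor $V(d-1)$ cancels, so it is omitted on both sides):

```python
import numpy as np
from scipy.integrate import quad

d, c = 50, 2.0
lo = c / np.sqrt(d - 1)

# Exact slice volume A and the derived bound, both without the factor V(d-1).
A, _ = quad(lambda x1: (1 - x1**2) ** ((d - 1) / 2), lo, 1)
bound = np.exp(-c**2 / 2) / (c * np.sqrt(d - 1))

print(A, bound, A < bound)   # the derived bound indeed dominates the integral
```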


Page 9: COMP4121 Advanced Algorithms - UNSW Engineering

So we have obtained that the volume $A$ of the slice $S$ of the unit ball which lies above the hyperplane $x_1 = \frac{c}{\sqrt{d-1}}$ satisfies

$$A < V(d-1)\, \frac{e^{-\frac{c^2}{2}}}{c\sqrt{d-1}}$$

We now want to show that such a volume is small compared to half the volume of the entire ball. However, we do not want to use $V(d)$; we want an expression involving $V(d-1)$ which can cancel $V(d-1)$ in the above bound.

Note that the volume of the whole hemisphere $H$ is larger than the volume of a cylinder below the intersection of the ball and the plane $x_1 = \frac{1}{\sqrt{d-1}}$, whose volume $C$ is the product of the $(d-1)$-dimensional "surface area" of its base, which is equal to $V(d-1)\left(\sqrt{1 - \frac{1}{d-1}}\right)^{d-1}$, and its height, which is $1/\sqrt{d-1}$.

Thus, the volume $C$ of such a cylinder is

$$C = V(d-1)\left(\sqrt{1 - \frac{1}{d-1}}\right)^{d-1} \frac{1}{\sqrt{d-1}} = V(d-1)\left(1 - \frac{1}{d-1}\right)^{\frac{d-1}{2}} \frac{1}{\sqrt{d-1}}$$

[Figure: the cylinder of height $1/\sqrt{d-1}$ and base radius $\sqrt{1 - 1/(d-1)}$ inscribed in the hemisphere.]


Page 10: COMP4121 Advanced Algorithms - UNSW Engineering

We now need yet another inequality: for all $\alpha \ge 1$ and all $0 < x < 1$,

$$(1 - x)^{\alpha} \ge 1 - \alpha x$$

(This inequality can be easily proved by considering $\beta(x) = (1-x)^{\alpha} - 1 + \alpha x$; again, $\beta(0) = 0$ and $\beta'(x) = -\alpha(1-x)^{\alpha-1} + \alpha = \alpha\left(1 - (1-x)^{\alpha-1}\right) \ge 0$, which implies that $\beta(x)$ is non-decreasing and thus non-negative.)

Applying the above inequality with $\alpha = (d-1)/2$ and $x = 1/(d-1)$ we obtain

$$C = V(d-1)\left(1 - \frac{1}{d-1}\right)^{\frac{d-1}{2}} \frac{1}{\sqrt{d-1}} \ge V(d-1)\left(1 - \frac{1}{d-1}\cdot\frac{d-1}{2}\right)\frac{1}{\sqrt{d-1}} \ge V(d-1)\, \frac{1}{2\sqrt{d-1}}$$


Page 11: COMP4121 Advanced Algorithms - UNSW Engineering

So we got that the volume $A$ of the slice of the unit ball which is above the plane $x_1 = \frac{c}{\sqrt{d-1}}$ and the volume $H$ of the whole hemisphere satisfy

$$A < V(d-1)\, \frac{e^{-\frac{c^2}{2}}}{c\sqrt{d-1}}, \qquad H \ge C \ge V(d-1)\, \frac{1}{2\sqrt{d-1}}$$

Thus, the ratio of the volume of the slice of the sphere and the whole hemisphere satisfies

$$\frac{A}{H} < \frac{V(d-1)\, \frac{e^{-c^2/2}}{c\sqrt{d-1}}}{V(d-1)\, \frac{1}{2\sqrt{d-1}}} = \frac{2}{c}\, e^{-\frac{c^2}{2}}$$

Corollary: If we pick uniformly at random a point from the unit ball, the probability that its coordinate $x_1$ satisfies $|x_1| > \frac{c}{\sqrt{d-1}}$ is less than $\frac{2}{c}\, e^{-\frac{c^2}{2}}$.

This also means that most of the volume of the unit ball is between the hyperplanes $x_1 = -\frac{c}{\sqrt{d-1}}$ and $x_1 = \frac{c}{\sqrt{d-1}}$, i.e., near the equator.

Also, the annulus consisting of all $x$ such that $r \le |x| \le 1$ for $r < 1$ has volume $V(d) - r^d V(d) = V(d)(1 - r^d)$, so if $d$ is large almost all of the volume of the unit ball is near its surface.


Page 12: COMP4121 Advanced Algorithms - UNSW Engineering

If we randomly and independently choose 2 points $x, x'$ from the unit ball, we can rotate the ball so that $x$ is in the direction of the $x_1$ coordinate.

Thus, the above two facts guarantee that with high probability $|x| \approx 1$ and $|x'| \approx 1$, as well as that the projection of $x'$ onto $x$ has, with probability $\ge 1 - \frac{2}{c}\, e^{-\frac{c^2}{2}}$, a value smaller than $\frac{c}{\sqrt{d-1}}$, i.e., that $x$ and $x'$ are almost orthogonal.

The next theorem generalises this to $n$ points.

It shows that if we draw randomly a relatively small number $n$ of vectors compared to the dimension $d$, we are almost guaranteed that they will be of approximately the same length and that they form an almost orthogonal basis.

They will span a space onto which we will project vectors from $\mathbb{R}^d$ into a space of dimension $n$.

The purpose is to reduce the dimensionality of the space.

Theorem: Assume we draw independently and uniformly $n$ points $x_1, \ldots, x_n$ from the unit ball. Then with probability $1 - \frac{3}{2}\cdot\frac{1}{n}$, for all $1 \le i, j \le n$, $i \ne j$:

1. $1 \ge |x_i| > 1 - \frac{2\ln n}{d}$;

2. $|\langle x_i, x_j\rangle| \le \frac{\sqrt{6\ln n}}{\sqrt{d-1}}$.


Page 13: COMP4121 Advanced Algorithms - UNSW Engineering

Proof: The probability that $|x_i| < 1 - \varepsilon$ is

$$\frac{V(d)(1-\varepsilon)^d}{V(d)} = (1-\varepsilon)^d < e^{-\varepsilon d}$$

Thus, taking $\varepsilon = \frac{2\ln n}{d}$, we obtain that for each $i$

$$P\left(|x_i| < 1 - \frac{2\ln n}{d}\right) < e^{-\frac{2\ln n}{d}\cdot d} = e^{-2\ln n} = \frac{1}{n^2}$$

Thus, the probability that $|x_i| < 1 - \frac{2\ln n}{d}$ for at least one $i$ is less than $n\cdot\frac{1}{n^2} = \frac{1}{n}$. This reasoning is called the Union bound.

Union bound: if you have $n$ sets $A_1, \ldots, A_n$ (not necessarily disjoint) such that $P(x \in A_i) = p_i$, then $P(x \in \cup_{i=1}^{n} A_i) \le \sum_{i=1}^{n} p_i$.


Page 14: COMP4121 Advanced Algorithms - UNSW Engineering

To prove the second claim we rotate the sphere so that $x_j$ is the polar axis.

We now note that, since $|x_i|, |x_j| \le 1$,

$$|\langle x_i, x_j\rangle| = |x_i|\,|x_j|\,|\cos\angle(x_i, x_j)| \le |x_i|\,|\cos\angle(x_i, x_j)|$$

Thus, $|\langle x_i, x_j\rangle| \ge \frac{\sqrt{6\ln n}}{\sqrt{d-1}}$ implies that the length of the orthogonal projection of $x_i$ onto $x_j$, i.e., $|x_i|\,|\cos\angle(x_i, x_j)|$, is larger than $\frac{\sqrt{6\ln n}}{\sqrt{d-1}}$.

But we have seen in the previous corollary that if we pick a point $x_i$ uniformly at random from the unit ball, the probability that its coordinate along the polar axis (here, the direction of $x_j$) exceeds $\frac{c}{\sqrt{d-1}}$ in absolute value is less than $\frac{2}{c}\, e^{-\frac{c^2}{2}}$.

Letting $c = \sqrt{6\ln n}$, we obtain that for every pair $i, j$

$$P\left(|\langle x_i, x_j\rangle| \ge \frac{\sqrt{6\ln n}}{\sqrt{d-1}}\right) \le \frac{2}{\sqrt{6\ln n}}\, e^{-\frac{6\ln n}{2}} = \frac{2}{\sqrt{6\ln n}}\, e^{-3\ln n}$$


Page 15: COMP4121 Advanced Algorithms - UNSW Engineering

Simplifying the expression $\frac{2}{\sqrt{6\ln n}}\, e^{-3\ln n}$ we obtain

$$P\left(|\langle x_i, x_j\rangle| \ge \frac{\sqrt{6\ln n}}{\sqrt{d-1}}\right) \le \frac{2}{\sqrt{6\ln n}}\cdot\frac{1}{n^3} < \frac{1}{n^3}.$$

There are $\binom{n}{2}$ pairs, so again by the union bound, the probability that for at least one pair $x_i, x_j$ the scalar product is large, i.e., $|\langle x_i, x_j\rangle| \ge \frac{\sqrt{6\ln n}}{\sqrt{d-1}}$, is at most $\binom{n}{2}\frac{1}{n^3} = \frac{n(n-1)}{2}\cdot\frac{1}{n^3} < \frac{1}{2n}$.

Thus, all inequalities of both kinds hold with a probability

$$\left(1 - \frac{1}{n}\right)\left(1 - \frac{1}{2n}\right) = 1 - \frac{1}{n} - \frac{1}{2n} + \frac{1}{2n^2} > 1 - \frac{3}{2}\cdot\frac{1}{n}.$$

Thus, to summarise, we have proved that, if we draw independently and uniformly $n$ points $x_1, \ldots, x_n$ from the unit ball, then with probability $1 - \frac{3}{2}\cdot\frac{1}{n}$, for all $1 \le i, j \le n$, $i \ne j$:

1. $|x_i| > 1 - \frac{2\ln n}{d}$;

2. $|\langle x_i, x_j\rangle| \le \frac{\sqrt{6\ln n}}{\sqrt{d-1}}$.

But how do we draw random points from a unit ball?

It turns out that it is easier to draw points using a spherical Gaussian, which is what we do next.
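A numerical illustration of the theorem just proved (a sketch; the ball sampling uses the same standard Gaussian-direction trick as above, and $n$, $d$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 100_000

g = rng.normal(size=(n, d))
pts = (g / np.linalg.norm(g, axis=1, keepdims=True)
       * rng.uniform(size=(n, 1)) ** (1 / d))

norms = np.linalg.norm(pts, axis=1)
dots = pts @ pts.T
np.fill_diagonal(dots, 0.0)        # ignore the <x_i, x_i> entries

print("min |x_i|             :", norms.min())
print("1 - 2 ln(n)/d         :", 1 - 2 * np.log(n) / d)
print("max |<x_i, x_j>|      :", np.abs(dots).max())
print("sqrt(6 ln n)/sqrt(d-1):", np.sqrt(6 * np.log(n)) / np.sqrt(d - 1))
```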


Page 16: COMP4121 Advanced Algorithms - UNSW Engineering

Gaussian Annulus Theorem

We need the following lemma, which we do not prove; its proof is based on tail inequalities which can be found in the BHK book. Recall that for $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ we let $|x| = \sqrt{x_1^2 + \ldots + x_d^2}$.

Lemma: Let $X$ be a d-dimensional spherical Gaussian with zero mean and unit variance in each direction, i.e., with the probability density

$$f_X(x) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x_i^2}{2}} = \frac{1}{(2\pi)^{\frac{d}{2}}}\, e^{-\frac{|x|^2}{2}}$$

If we draw from such a distribution a random vector $x$, then the probability that $x$ belongs to the annulus (the hollow ball of outer radius $\sqrt{d}+\beta$ with a shell of thickness $2\beta$)

$$\{x : \sqrt{d} - \beta \le |x| \le \sqrt{d} + \beta\} = \left\{x : \left||x| - \sqrt{d}\right| \le \beta\right\}$$

is at least $1 - 3e^{-\frac{\beta^2}{96}}$.

(For convenience of notation we denote $1/96$ by $\gamma$, so the probability that $x$ belongs to the annulus is at least $1 - 3e^{-\gamma\beta^2}$.)
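A quick empirical look at this concentration (a sketch with arbitrary $d$ and $\beta$; the constant $96$ makes the stated bound quite loose compared to what one actually observes):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, beta = 1000, 10_000, 20.0

norms = np.linalg.norm(rng.normal(size=(n, d)), axis=1)
inside = np.abs(norms - np.sqrt(d)) <= beta

print("empirical P(annulus):", inside.mean())               # essentially 1
print("lemma's lower bound :", 1 - 3 * np.exp(-beta**2 / 96))
```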


Page 17: COMP4121 Advanced Algorithms - UNSW Engineering

Let us choose $k$ random vectors $u_1, \ldots, u_k \in \mathbb{R}^d$ by sampling randomly a spherical Gaussian with unit variance (i.e., we generate $k$ vectors, each of whose $d$ coordinates is obtained by sampling randomly and independently the one-dimensional Gaussian).

For any vector $v$ define $f(v)$, called the random projection, by

$$f(v) = (\langle v, u_1\rangle, \ldots, \langle v, u_k\rangle)$$

Theorem (Random Projection Lemma):

$$P\left(\left||f(v)| - \sqrt{k}\,|v|\right| \ge \varepsilon\sqrt{k}\,|v|\right) \le 3e^{-\gamma k\varepsilon^2}$$

Proof: Since $f(v)$ is a linear operator in $v$, we can divide both sides of the inner inequality by $|v|$. Thus, we can assume that $|v| = 1$, and we have to show that for $|v| = 1$

$$P\left(\left||f(v)| - \sqrt{k}\right| \ge \varepsilon\sqrt{k}\right) \le 3e^{-\gamma k\varepsilon^2}$$

Note that $\langle u_i, v\rangle = \sum_{j=1}^{d} u_{i,j} v_j$; since the $u_{i,j}$ are Gaussians, so is their linear combination; thus, $\langle u_i, v\rangle$ is also Gaussian. Since the $u_{i,j}$ are of unit variance and independent, we obtain

$$V[\langle u_i, v\rangle] = V\left[\sum_{j=1}^{d} u_{i,j} v_j\right] = \sum_{j=1}^{d} V[u_{i,j}]\, v_j^2 = \sum_{j=1}^{d} v_j^2 = |v|^2 = 1.$$

Thus, $\langle u_i, v\rangle$ is also Gaussian of unit variance; since these coordinates are independent across $i$, the vector $f(v)$ is a $k$-dimensional spherical Gaussian. It follows that the theorem is a direct consequence of the Gaussian Annulus Theorem with $\beta = \varepsilon\sqrt{k}$.


Page 18: COMP4121 Advanced Algorithms - UNSW Engineering

Let us define $f^*(v) = f(v)/\sqrt{k}$;

if we divide by $\sqrt{k}$ both sides of the inner inequality in

$$P\left(\left||f(v)| - \sqrt{k}\,|v|\right| \ge \varepsilon\sqrt{k}\,|v|\right) \le 3e^{-\gamma k\varepsilon^2}$$

we obtain the following consequence:

Corollary:

$$P\left(\left||f^*(v)| - |v|\right| \ge \varepsilon\,|v|\right) \le 3e^{-\gamma k\varepsilon^2}$$

Thus, the modified projection mapping $f^*(v)$ "almost preserves" the length of vectors.

This corollary is needed to prove the following important theorem with lots of practical applications.
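In code, the scaled projection $f^*$ is a single matrix-vector product. A minimal sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 10_000, 1000

U = rng.normal(size=(k, d))      # rows are the random vectors u_1, ..., u_k
v = rng.normal(size=d)           # an arbitrary test vector

f_star = (U @ v) / np.sqrt(k)    # f*(v) = f(v)/sqrt(k)

print("|v|    :", np.linalg.norm(v))
print("|f*(v)|:", np.linalg.norm(f_star))   # relative error on the order 1/sqrt(k)
```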


Page 19: COMP4121 Advanced Algorithms - UNSW Engineering

Johnson-Lindenstrauss Lemma

Theorem (Johnson-Lindenstrauss Lemma): For any $\varepsilon$, $0 < \varepsilon < 1$, and any integer $n$, assume that $k$ satisfies $k \ge \frac{3}{\gamma\varepsilon^2}\ln n$. Then for any set of $n$ points given by vectors $v_1, \ldots, v_n$ in $\mathbb{R}^d$, with probability at least $1 - \frac{3}{2n}$, the random projection $f^*: \mathbb{R}^d \to \mathbb{R}^k$ has the property that for ALL pairs of points $v_i, v_j$,

$$\left||f^*(v_i - v_j)| - |v_i - v_j|\right| \le \varepsilon\,|v_i - v_j|$$

Thus, $f^*(v)$ "almost" preserves distances between points given by vectors $v_i$, despite the reduction of dimensionality from $d \gg k$ to only $k$.

Proof: By applying the Random Projection Lemma, we get that for any pair $v_i, v_j$,

$$P\left(\left||f^*(v_i - v_j)| - |v_i - v_j|\right| \ge \varepsilon\,|v_i - v_j|\right) \le 3e^{-\gamma k\varepsilon^2} \le 3e^{-\gamma\,\frac{3}{\gamma\varepsilon^2}\ln n\,\varepsilon^2} = 3e^{-3\ln n} = \frac{3}{n^3}$$

Since there are $\binom{n}{2} < \frac{n^2}{2}$ pairs, the probability that at least one of the above inequalities fails is less than $\frac{n^2}{2}\cdot\frac{3}{n^3} = \frac{3}{2n}$. (Union bound again.)

Note that $k$ is a linear function of $\ln n$, so it grows slowly, allowing us to handle efficiently lots of points $v_i$.
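A sketch of the lemma in action, assuming numpy and scipy (sizes are illustrative; with these values the observed distortion is comfortably below the guaranteed $\varepsilon$):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(6)
n, d, k = 100, 5000, 2000

V = rng.normal(size=(n, d))      # n arbitrary points in R^d
U = rng.normal(size=(k, d))      # the random projection matrix
P = (V @ U.T) / np.sqrt(k)       # f*(v_i) for every point, row by row

orig = pdist(V)                  # all pairwise distances in R^d
proj = pdist(P)                  # all pairwise distances in R^k

print("max relative distortion:", (np.abs(proj - orig) / orig).max())
```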


Page 20: COMP4121 Advanced Algorithms - UNSW Engineering

Johnson-Lindenstrauss Lemma

The main application of the Johnson-Lindenstrauss Lemma is nearest neighbour search in spaces of extremely high dimension, such as collections of documents, each represented by the vector of the relative frequencies of each word from a dictionary (of key words, for example).

Given a database of a large number of papers and a new paper $p$, find a paper $p_i$ from the database which is the "most similar" to $p$, in the sense that the same keywords appear with approximately the same frequencies.

The dictionary of all keywords is very large, in the thousands or even tens of thousands, so nearest neighbour search would be extremely slow.

Even if your database had 100,000 entries, $\ln(100{,}000) < 12$.


Page 21: COMP4121 Advanced Algorithms - UNSW Engineering

Applications of the Johnson-Lindenstrauss Lemma

Solution: choose a large enough $k$, i.e., $k \ge \frac{288}{\varepsilon^2}\ln(|DB| + |Q|)$, where $|DB|$ is the size of the database and $|Q|$ the expected number of queries to be made.

Recall that $\gamma = 1/96$, so $3/\gamma = 288$. However, with a more careful analysis, the requirement $k \ge \frac{288}{\varepsilon^2}\ln n$ can be reduced to $k \ge \frac{24}{3\varepsilon^2 - 2\varepsilon^3}\ln n$.

A good dictionary contains more than 100,000 words.

So if you want to compare English documents according to the relative frequency of words, each document has to be represented as a vector in 100,000-dimensional space!

Assume that your database has 100,000 documents stored; you want to find the most similar documents in your database for a batch of 10,000 new documents. How many random vectors should you choose to project onto?

If we set $\varepsilon = 0.1$ we obtain

$$k \ge \frac{24}{3\varepsilon^2 - 2\varepsilon^3}\ln(|DB| + |Q|) = \frac{24}{3\times 0.01 - 2\times 0.001}\,\ln(110{,}000) \approx 9950,$$

so more than a ten-fold reduction of dimensionality.
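The same arithmetic as a tiny function (the name jl_dimension is mine, nothing canonical):

```python
import math

def jl_dimension(eps: float, n_points: int) -> int:
    """Smallest k with k >= 24/(3*eps^2 - 2*eps^3) * ln(n_points)."""
    return math.ceil(24 / (3 * eps**2 - 2 * eps**3) * math.log(n_points))

print(jl_dimension(0.1, 110_000))   # 9951, matching the slide's estimate
```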


Page 22: COMP4121 Advanced Algorithms - UNSW Engineering

Applications of the Johnson-Lindenstrauss Lemma

So we choose $k$ random vectors, by choosing each coordinate of every vector using a unit variance Gaussian.

Pre-process your database by replacing each vector $x_j \in DB$ ($x_j \in \mathbb{R}^d$) with its scaled projection $f^*(x_j) = f(x_j)/\sqrt{k}$ to the $k \ll d$ dimensional subspace spanned by these $k$ random vectors.

As each query $y$ arrives, also replace $y \in \mathbb{R}^d$ with its projection $f^*(y) = f(y)/\sqrt{k}$.

Do the search for the nearest neighbour of $f(y)/\sqrt{k}$ in the projected $k$-dimensional space instead of the search for the nearest neighbour of $y$ in the space of dimension $d \gg k$.
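Put together, the whole pipeline is only a few lines. A minimal sketch with a brute-force linear scan for the nearest neighbour (all names and sizes are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(7)
d, k, n_docs = 10_000, 500, 1000

U = rng.normal(size=(k, d))            # fixed once, reused for DB and queries

def project(x):
    """The scaled random projection f*(x) = f(x)/sqrt(k)."""
    return (U @ x) / np.sqrt(k)

db = rng.random(size=(n_docs, d))      # stand-in for word-frequency vectors
db_proj = (db @ U.T) / np.sqrt(k)      # pre-processing: project the whole DB

def nearest(y):
    """Index of the approximate nearest neighbour of query y."""
    dists = np.linalg.norm(db_proj - project(y), axis=1)
    return int(np.argmin(dists))

print(nearest(db[42]))                 # prints 42: a document is its own NN
```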
