Dimensionality and dimensionality reduction
Nuno Vasconcelos, ECE Department, UCSD
Note: this course requires a project
• it is your responsibility to define it (although we can talk)
• ideally, something connected to your research
• if you are too far from this, consider P/F
next landmark is a project proposal
• two pages, describing the main idea
• due next Thursday
• not cast in stone, but a bad idea to delay
Plan for today
high dimensional spaces are STRANGE!!!
introduction to dimensionality reduction
principal component analysis (PCA)
High dimensional spaces are strange!
first thing to know:
“never trust your intuition in high dimensions!”
more often than not you will be wrong!
there are many examples of this
we will do a couple
The hyper-sphere
consider the sphere of radius r in a space of dimension d
Homework: show that its volume is

V_d(r) = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} \, r^d

where Γ(n) is the Gamma function
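A quick way to get a feel for this formula is to evaluate it numerically. Below is a minimal Python sketch using scipy's Gamma function; the helper name `sphere_volume` is just illustrative.

```python
import numpy as np
from scipy.special import gamma

def sphere_volume(r, d):
    """Volume of a d-dimensional sphere of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d

# familiar low-dimensional cases, then a high-dimensional one
print(sphere_volume(1.0, 2))   # pi       ~ 3.1416 (area of the unit disk)
print(sphere_volume(1.0, 3))   # 4/3 pi   ~ 4.1888 (volume of the unit ball)
print(sphere_volume(1.0, 20))  # ~ 0.026: already tiny in 20 dimensions
```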
Hyper-cube vs hyper-sphere
next consider the hyper-cube [-a,a]^d and the inscribed hyper-sphere, i.e.
[figure: hyper-sphere of radius a inscribed in the hyper-cube [-a,a]^d]
Q: what does your intuition tell you about the relative sizes of these two objects?
1. volume of sphere ≈ volume of cube
2. volume of sphere >> volume of cube
3. volume of sphere << volume of cube
Answer
we can just compute this ratio

f_d = \frac{V_{sphere}(a)}{V_{cube}(a)} = \frac{\pi^{d/2} a^d / \Gamma(d/2+1)}{(2a)^d} = \frac{\pi^{d/2}}{2^d \, \Gamma(d/2+1)}

a sequence that does not depend on a, just on the dimension d!
d     1     2      3      4      5      6     7
f_d   1   .785   .524   .308   .164   .08   .037

it goes to zero, and goes to zero fast!
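The f_d row above can be reproduced numerically; a small Python sketch dividing the sphere volume by the cube volume (2a)^d, with the radius a cancelling as claimed:

```python
import numpy as np
from scipy.special import gamma

def sphere_to_cube_ratio(d):
    """f_d = volume of the inscribed sphere / volume of the hyper-cube [-a,a]^d (independent of a)."""
    return np.pi ** (d / 2) / (2 ** d * gamma(d / 2 + 1))

for d in range(1, 8):
    print(d, round(sphere_to_cube_ratio(d), 3))
# 1 1.0, 2 0.785, 3 0.524, 4 0.308, 5 0.164, 6 0.081, 7 0.037
```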
Hyper-cube vs hyper-sphere
this means that:
“as the dimension of the space increases, the volume of the sphere is much smaller (infinitesimal) than that of the cube!”
how does this go against intuition?
it is actually not very surprising; we can see it even in low dimensions
1. d = 1: the volumes are the same
2. d = 2: the volume of the sphere is already smaller
[figure: for d = 2, the disk of radius a inscribed in the square [-a,a]^2 leaves the corners uncovered]
Hyper-cube vs hyper-sphere
as the dimension increases, the volume of the shaded corners becomes larger
[figure: hyper-cube [-a,a]^d with inscribed sphere; the shaded corners lie outside the sphere]
in high dimensions the picture you should have in mind is
all the volume of the cube is in these spikes!
Believe it or not
we can check mathematically: consider the cube diagonal d = (a, ..., a) and the axis-aligned vector p = (a, 0, ..., 0)
[figure: hyper-cube with the diagonal vector d and the axis vector p]
note that

\cos(d, p) = \frac{d^T p}{\|d\|\,\|p\|} = \frac{a^2}{a \cdot a\sqrt{d}} = \frac{1}{\sqrt{d}} \to 0, \qquad \frac{\|d\|}{\|p\|} = \sqrt{d} \to \infty

i.e. d becomes orthogonal to p as the dimension increases, and infinitely larger!!!
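A quick numerical check of this, reading d as the cube diagonal and p as an axis-aligned vector (a Python sketch; the choice a = 1 is arbitrary):

```python
import numpy as np

a = 1.0
for dim in (2, 3, 10, 100, 10000):
    d_vec = np.full(dim, a)                  # cube diagonal (a, a, ..., a)
    p_vec = np.zeros(dim); p_vec[0] = a      # axis-aligned vector (a, 0, ..., 0)
    cos = d_vec @ p_vec / (np.linalg.norm(d_vec) * np.linalg.norm(p_vec))
    ratio = np.linalg.norm(d_vec) / np.linalg.norm(p_vec)
    # cos = 1/sqrt(dim) -> 0 (orthogonal), ratio = sqrt(dim) -> infinity
    print(dim, round(cos, 4), round(ratio, 2))
```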
But there is more
consider the crust of the unit sphere, of thickness ε
[figure: sphere S1 of radius 1 and inner sphere S2 of radius 1-ε; the crust is the region between them]
we can compute the ratio of their volumes

\frac{V(S_2)}{V(S_1)} = (1 - \varepsilon)^d \to 0 \quad \text{as } d \to \infty

no matter how small ε is, this ratio goes to zero as the dimension increases
i.e. “all the volume is in the crust!”
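A quick numerical check of the crust claim (Python sketch; ε = 0.01 is an arbitrary, very thin crust):

```python
import numpy as np

eps = 0.01
for d in (1, 10, 100, 1000, 10000):
    inner_fraction = (1 - eps) ** d            # V(S2)/V(S1): volume fraction of the inner sphere
    print(d, round(1 - inner_fraction, 4))     # volume fraction in the crust
# the crust fraction approaches 1: essentially all the volume is in the crust
```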
High-dimensional Gaussian
Homework: show that if

x \sim N(\mu, \sigma^2 I), \quad x \in \mathbb{R}^n

and one considers the hyper-sphere where the probability density drops to 1% of its peak value

\|x - \mu\|^2 = 2\sigma^2 \ln(100)

the probability mass outside this sphere is

P_n = \Pr\left[\chi^2(n) > 2\ln(100)\right]

where χ²(n) is a chi-squared random variable with n degrees of freedom
High-dimensional Gaussian
if you evaluate this, you'll find out that

n       1     2    3    4    5    6    10   15    20
1-P_n   .998  .99  .97  .94  .89  .83  .48  .134  .02
as the dimension increases, all probability mass is on the tails
the point of maximum density is still the mean
really strange: in high dimensions the Gaussian is a very heavy-tailed distribution
take-home message:
• “in high dimensions never trust your intuition!”
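The 1-P_n row can be reproduced with the chi-squared CDF; a small Python sketch using scipy.stats.chi2:

```python
import numpy as np
from scipy.stats import chi2

# radius at which the Gaussian density falls to 1% of its peak: r^2 = 2 ln(100)
r2 = 2 * np.log(100)
for n in (1, 2, 3, 4, 5, 6, 10, 15, 20):
    inside = chi2.cdf(r2, df=n)     # probability mass inside the 1%-density sphere, i.e. 1 - P_n
    print(n, round(inside, 3))
# 0.998, 0.99, 0.97, ... , 0.02 -- matching the table above
```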
Q: how does all this affect decision rules?
The curse of dimensionality
typical observation in Bayes decision theory:
• error increases when number of features is large
highly unintuitive since, theoretically:
• if I have a problem in n-D, I can always generate a problem in (n+1)-D with smaller probability of error
e.g. two uniform classes in 1D
[figure: two uniform classes A and B on the x axis]
can be transformed into a 2D problem with the same error: just add a non-informative variable y
Curse of dimensionality
[figure: two 2D versions of the problem, plotted against x and y]
but it is also possible to reduce the error by adding a second variable which is informative
on the left, there is no decision boundary that will achieve zero error
on the right, the decision boundary shown has zero error
Curse of dimensionality
in fact, it is impossible to do worse in 2D than in 1D
if we move the classes along the lines shown in green the error can only go down, since there will be less overlap
Curse of dimensionality
so why do we observe this curse of dimensionality?
the problem is the quality of the density estimates
the best way to see this is to think of a histogram
• suppose you have 100 points and you need at least 10 bins per axis in order to get a reasonable quantization
for uniform data you get, on average,
dimension 1 2 3
points/bin 10 1 0.1
decent in 1D, bad in 2D, terrible in 3D (9 out of each 10 bins empty)
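A small simulation of this histogram argument (Python sketch; the bin-counting below is just for illustration): 100 uniform points, 10 bins per axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, bins_per_axis = 100, 10
for d in (1, 2, 3):
    x = rng.uniform(size=(n_points, d))
    # map each point to its bin index and count distinct occupied bins
    idx = np.floor(x * bins_per_axis).astype(int)
    occupied = len({tuple(row) for row in idx})
    total = bins_per_axis ** d
    print(d, n_points / total, f"{total - occupied} of {total} bins empty")
# average points per bin: 10, 1, 0.1 -- and in 3D the vast majority of bins are empty
```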
Dimensionality reduction
what do we do about this? we avoid unnecessary dimensions
unnecessary can be measured in two ways:
1. features are not discriminant
2. features are not independent
non-discriminant means that they do not separate the classes well
[figure: a discriminant feature vs. a non-discriminant feature]
Dimensionality reduction
dependent features, even if very discriminant, are not needed - one is enough!
e.g. a data-mining company studying consumer credit card ratings
X = {salary, mortgage, car loan, # of kids, profession, ...}
the first three features tend to be highly correlated:
• “the more you make, the higher the mortgage, the more expensive the car you drive”
• from one of these variables I can predict the others very well
including features 2 and 3 does not increase the discrimination, but increases the dimension and leads to poor density estimates
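To make the correlation concrete, here is a small Python sketch on synthetic data mimicking this credit-card example; the variables, coefficients, and noise levels are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
salary = rng.normal(60, 15, n)                   # hypothetical salaries (k$)
mortgage = 4 * salary + rng.normal(0, 10, n)     # strongly tied to salary
car_loan = 0.5 * salary + rng.normal(0, 5, n)    # also tied to salary
kids = rng.integers(0, 4, n)                     # unrelated to income

X = np.column_stack([salary, mortgage, car_loan, kids])
print(np.round(np.corrcoef(X, rowvar=False), 2))
# the salary / mortgage / car-loan block is highly correlated; 'kids' is not
```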
Dimensionality reduction
Q: how do we detect the presence of these correlations?
A: the data “lives” in a low-dimensional subspace (up to some amount of noise), e.g.
[figure: scatter plot of car loan vs. salary; the points cluster around a line, and projecting onto the 1D subspace y = a x gives the new feature y]
in the example above we have a 3D hyper-plane in 5D
if we can find this hyper-plane we can
• project the data onto it
• get rid of half of the dimensions without introducing significant error
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.
[figure: a 1D subspace in 2D and a 2D subspace in 3D]
• this means that if we fit a Gaussian to the data the equiprobability contours are going to be highly skewed ellipsoids
Gaussian review
the equiprobability contours of a Gaussian are the points x such that

(x - \mu)^T \Sigma^{-1} (x - \mu) = K

let's consider the change of variable z = x - µ, which only moves the origin by µ. The equation

z^T \Sigma^{-1} z = K

is the equation of an ellipse. this is easy to see when Σ is diagonal:

\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2) \quad\Rightarrow\quad z^T \Sigma^{-1} z = \sum_i \frac{z_i^2}{\sigma_i^2}
Gaussian review
this is the equation of an ellipse with principal lengths σ_i
• e.g. when d = 2,

\frac{z_1^2}{\sigma_1^2} + \frac{z_2^2}{\sigma_2^2} = 1

is the ellipse with half-axes σ_1 along z_1 and σ_2 along z_2
[figure: the ellipse in the (z_1, z_2) plane, with axes σ_1 and σ_2]
introduce the transformation y = Φ z
Gaussian review
introduce the transformation y = Φ z
then y has covariance

\Sigma_y = E[yy^T] = \Phi \, E[zz^T] \, \Phi^T = \Phi \Sigma_z \Phi^T

if Φ is orthonormal this is just a rotation and we have
[figure: y = Φ z rotates the ellipse with axes σ_1, σ_2 from the (z_1, z_2) plane into the (y_1, y_2) plane, where its principal directions are φ_1 and φ_2]
we obtain a rotated ellipse with principal components φ_1 and φ_2, which are the columns of Φ
note that

\Sigma_y = \Phi \, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2) \, \Phi^T

is the eigen-decomposition of Σ_y
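A quick numerical check of this (Python sketch; the 30-degree rotation and Λ = diag(9, 1) are arbitrary choices): y = Φ z has covariance Φ Λ Φ^T, and eigendecomposing it recovers the σ_i² and the columns of Φ.

```python
import numpy as np

rng = np.random.default_rng(0)
Lambda = np.diag([9.0, 1.0])                        # sigma1^2 = 9, sigma2^2 = 1
theta = np.pi / 6
Phi = np.array([[np.cos(theta), -np.sin(theta)],    # orthonormal: rotation by 30 degrees
                [np.sin(theta),  np.cos(theta)]])

z = rng.multivariate_normal(np.zeros(2), Lambda, size=100000)
y = z @ Phi.T                                       # y = Phi z, sample by sample
Sigma_y = np.cov(y, rowvar=False)

print(np.round(Sigma_y, 2))                         # close to Phi Lambda Phi^T
evals, evecs = np.linalg.eigh(Sigma_y)
print(np.round(evals, 2))                           # ~ (1, 9): the diagonal of Lambda
print(np.round(evecs, 2))                           # ~ columns of Phi (up to sign and order)
```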
Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ
[figure: ellipse in the (y_1, y_2) plane with principal components φ_1, φ_2 and principal lengths λ_1, λ_2]
by computing the eigenvalues we know if the data is flat
λ_1 >> λ_2: flat          λ_1 = λ_2: not flat
[figure: an elongated ellipse (λ_1 >> λ_2) vs. a circular one (λ_1 = λ_2)]
Principal component analysis (learning)
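A minimal sketch of how the PCA basis can be learned from data, assuming the standard recipe implied by the previous slides (center the data, estimate the sample covariance, keep the eigenvectors with the largest eigenvalues); the helper names `pca_fit` and `pca_project` are just illustrative.

```python
import numpy as np

def pca_fit(X, k):
    """Learn a k-dimensional PCA basis from data X (n_samples x n_features)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / (len(X) - 1)          # sample covariance
    evals, evecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    order = np.argsort(evals)[::-1]           # sort by decreasing eigenvalue
    return mu, evals[order][:k], evecs[:, order][:, :k]

def pca_project(X, mu, basis):
    """Project data onto the learned principal components."""
    return (X - mu) @ basis

# usage on synthetic "flat" data: 3D points that live near a 1D subspace
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1)) @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(500, 3))
mu, evals, basis = pca_fit(X, k=1)
print(np.round(evals, 3))                     # one dominant eigenvalue: the data is flat
print(pca_project(X, mu, basis).shape)        # (500, 1): 3D data reduced to 1D
```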
Principal component analysis
Principal components
what are they? in some cases it is possible to see
example: eigenfaces
• face recognition problem: can you identify who is the person in this picture?
• training:
  • assemble examples from people's faces
  • compute the PCA basis
  • project each image into PCA space
• recognition:
  • project the image to classify into PCA space
  • find the closest vector in that space
  • label the image with the identity of this nearest neighbor
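Reading the bullets above as a pipeline, here is a minimal Python sketch; the function names, the choice of k = 20 components, and the tiny synthetic "faces" are all illustrative assumptions, and faces are assumed to arrive as flattened pixel vectors. The PCA basis is obtained with an SVD of the centered data, which is equivalent to eigendecomposing the sample covariance.

```python
import numpy as np

def train_eigenfaces(faces, labels, k=20):
    """faces: (n_images, n_pixels) array of flattened training face images."""
    mu = faces.mean(axis=0)
    # right singular vectors of the centered data = PCA basis ("eigenfaces")
    _, _, Vt = np.linalg.svd(faces - mu, full_matrices=False)
    basis = Vt[:k].T                       # (n_pixels, k) leading principal components
    coords = (faces - mu) @ basis          # each training face in PCA space
    return mu, basis, coords, np.asarray(labels)

def recognize(face, mu, basis, coords, labels):
    """Label a new face with the identity of its nearest neighbor in PCA space."""
    q = (face - mu) @ basis
    nearest = np.argmin(np.linalg.norm(coords - q, axis=1))
    return labels[nearest]

# tiny synthetic check: 6 random "faces" of 64 pixels, 3 identities
rng = np.random.default_rng(0)
faces = rng.normal(size=(6, 64))
model = train_eigenfaces(faces, ["ann", "ann", "bob", "bob", "eve", "eve"], k=3)
print(recognize(faces[2] + 0.01 * rng.normal(size=64), *model))   # -> "bob"
```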
Principal components
face examples
[figure: example face images from the training set]
Principal components
principal components (eigenfaces)
• high-energy ones tend to have low-frequency content
• capture average face, illumination, etc.
• at the intermediate level we have face detail
• low-energy ones tend to be high-frequency noise