Dimensionality and dimensionality reduction
Nuno Vasconcelos, ECE Department, UCSD
Note: this course requires a project
• it is your responsibility to define it (although we can talk)
• ideally, something connected to your research
• if you are too far from this, consider P/F
next landmark is a project proposal
• two pages, describing the main idea
• due next Thursday
• not cast in stone, but a bad idea to delay
Plan for today
high dimensional spaces are STRANGE!!!
introduction to dimensionality reduction
principal component analysis (PCA)
High dimensional spaces are strange!
first thing to know:
“never trust your intuition in high dimensions!”
more often than not you will be wrong!
there are many examples of this
we will do a couple
The hyper-sphere
consider the sphere of radius r in a space of dimension d
Homework: show that its volume is

V_d(r) = \frac{\pi^{d/2}}{\Gamma\left(\frac{d}{2}+1\right)} \, r^d

where Γ(n) is the Gamma function
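A quick way to get a feel for this formula is to evaluate it numerically. Below is a minimal Python sketch using scipy's Gamma function; the helper name `sphere_volume` is just illustrative.

```python
import numpy as np
from scipy.special import gamma

def sphere_volume(r, d):
    """Volume of a d-dimensional sphere of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d

# familiar low-dimensional cases, then a high-dimensional one
print(sphere_volume(1.0, 2))   # pi       ~ 3.1416 (area of the unit disk)
print(sphere_volume(1.0, 3))   # 4/3 pi   ~ 4.1888 (volume of the unit ball)
print(sphere_volume(1.0, 20))  # ~ 0.026: already tiny in 20 dimensions
```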
Hyper-cube vs hyper-sphere
next consider the hyper-cube [-a,a]^d and the inscribed hyper-sphere, i.e.
[figure: hyper-sphere of radius a inscribed in the hyper-cube [-a,a]^d]
Q: what does your intuition tell you about the relative sizes of these two objects?
1. volume of sphere ≈ volume of cube
2. volume of sphere >> volume of cube
3. volume of sphere << volume of cube
Answer
we can just compute this ratio

f_d = \frac{V_{sphere}(a)}{V_{cube}(a)} = \frac{\pi^{d/2} a^d / \Gamma(d/2+1)}{(2a)^d} = \frac{\pi^{d/2}}{2^d \, \Gamma(d/2+1)}

a sequence that does not depend on a, just on the dimension d!
d     1     2      3      4      5      6     7
f_d   1   .785   .524   .308   .164   .08   .037

it goes to zero, and goes to zero fast!
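The f_d row above can be reproduced numerically; a small Python sketch dividing the sphere volume by the cube volume (2a)^d, with the radius a cancelling as claimed:

```python
import numpy as np
from scipy.special import gamma

def sphere_to_cube_ratio(d):
    """f_d = volume of the inscribed sphere / volume of the hyper-cube [-a,a]^d (independent of a)."""
    return np.pi ** (d / 2) / (2 ** d * gamma(d / 2 + 1))

for d in range(1, 8):
    print(d, round(sphere_to_cube_ratio(d), 3))
# 1 1.0, 2 0.785, 3 0.524, 4 0.308, 5 0.164, 6 0.081, 7 0.037
```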
Hyper-cube vs hyper-sphere
this means that:
“as the dimension of the space increases, the volume of the sphere is much smaller (infinitesimal) than that of the cube!”
how does this go against intuition?
it is actually not very surprising; we can see it even in low dimensions
1. d = 1: the volumes are the same
2. d = 2: the volume of the sphere is already smaller
[figure: for d = 2, the disk of radius a inscribed in the square [-a,a]^2 leaves the corners uncovered]
Hyper-cube vs hyper-sphere
as the dimension increases, the volume of the shaded corners becomes larger
[figure: hyper-cube [-a,a]^d with inscribed sphere; the shaded corners lie outside the sphere]
in high dimensions the picture you should have in mind is
all the volume of the cube is in these spikes!
Believe it or not
we can check mathematically: consider the cube diagonal d = (a, ..., a) and the axis-aligned vector p = (a, 0, ..., 0)
[figure: hyper-cube with the diagonal vector d and the axis vector p]
note that

\cos(d, p) = \frac{d^T p}{\|d\|\,\|p\|} = \frac{a^2}{a \cdot a\sqrt{d}} = \frac{1}{\sqrt{d}} \to 0, \qquad \frac{\|d\|}{\|p\|} = \sqrt{d} \to \infty

i.e. d becomes orthogonal to p as the dimension increases, and infinitely larger!!!
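A quick numerical check of this, reading d as the cube diagonal and p as an axis-aligned vector (a Python sketch; the choice a = 1 is arbitrary):

```python
import numpy as np

a = 1.0
for dim in (2, 3, 10, 100, 10000):
    d_vec = np.full(dim, a)                  # cube diagonal (a, a, ..., a)
    p_vec = np.zeros(dim); p_vec[0] = a      # axis-aligned vector (a, 0, ..., 0)
    cos = d_vec @ p_vec / (np.linalg.norm(d_vec) * np.linalg.norm(p_vec))
    ratio = np.linalg.norm(d_vec) / np.linalg.norm(p_vec)
    # cos = 1/sqrt(dim) -> 0 (orthogonal), ratio = sqrt(dim) -> infinity
    print(dim, round(cos, 4), round(ratio, 2))
```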
But there is more
consider the crust of the unit sphere, of thickness ε
[figure: sphere S1 of radius 1 and inner sphere S2 of radius 1-ε; the crust is the region between them]
we can compute the ratio of their volumes

\frac{V(S_2)}{V(S_1)} = (1 - \varepsilon)^d \to 0 \quad \text{as } d \to \infty

no matter how small ε is, this ratio goes to zero as the dimension increases
i.e. “all the volume is in the crust!”
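A quick numerical check of the crust claim (Python sketch; ε = 0.01 is an arbitrary, very thin crust):

```python
import numpy as np

eps = 0.01
for d in (1, 10, 100, 1000, 10000):
    inner_fraction = (1 - eps) ** d            # V(S2)/V(S1): volume fraction of the inner sphere
    print(d, round(1 - inner_fraction, 4))     # volume fraction in the crust
# the crust fraction approaches 1: essentially all the volume is in the crust
```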
High-dimensional Gaussian
Homework: show that if

x \sim N(\mu, \sigma^2 I), \quad x \in \mathbb{R}^n

and one considers the hyper-sphere where the probability density drops to 1% of its peak value

\|x - \mu\|^2 = 2\sigma^2 \ln(100)

the probability mass outside this sphere is

P_n = \Pr\left[\chi^2(n) > 2\ln(100)\right]

where χ²(n) is a chi-squared random variable with n degrees of freedom
High-dimensional Gaussian
if you evaluate this, you'll find out that

n       1     2    3    4    5    6    10   15    20
1-P_n   .998  .99  .97  .94  .89  .83  .48  .134  .02
as the dimension increases, all probability mass is on the tails
the point of maximum density is still the mean
really strange: in high dimensions the Gaussian is a very heavy-tailed distribution
take-home message:
• “in high dimensions never trust your intuition!”
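The 1-P_n row can be reproduced with the chi-squared CDF; a small Python sketch using scipy.stats.chi2:

```python
import numpy as np
from scipy.stats import chi2

# radius at which the Gaussian density falls to 1% of its peak: r^2 = 2 ln(100)
r2 = 2 * np.log(100)
for n in (1, 2, 3, 4, 5, 6, 10, 15, 20):
    inside = chi2.cdf(r2, df=n)     # probability mass inside the 1%-density sphere, i.e. 1 - P_n
    print(n, round(inside, 3))
# 0.998, 0.99, 0.97, ... , 0.02 -- matching the table above
```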
Q: how does all this affect decision rules?
The curse of dimensionality
typical observation in Bayes decision theory:
• error increases when number of features is large
highly unintuitive since, theoretically:
• if I have a problem in n-D, I can always generate a problem in (n+1)-D with smaller probability of error
e.g. two uniform classes in 1D
[figure: two uniform classes A and B on the x axis]
can be transformed into a 2D problem with the same error: just add a non-informative variable y
Curse of dimensionality
[figure: two 2D versions of the problem, plotted against x and y]
but it is also possible to reduce the error by adding a second variable which is informative
on the left, there is no decision boundary that will achieve zero error
on the right, the decision boundary shown has zero error
Curse of dimensionality
in fact, it is impossible to do worse in 2D than in 1D
if we move the classes along the lines shown in green the error can only go down, since there will be less overlap
Curse of dimensionality
so why do we observe this curse of dimensionality?
the problem is the quality of the density estimates
the best way to see this is to think of a histogram
• suppose you have 100 points and you need at least 10 bins per axis in order to get a reasonable quantization
for uniform data you get, on average,
dimension 1 2 3
points/bin 10 1 0.1
decent in 1D, bad in 2D, terrible in 3D (9 out of each 10 bins empty)
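A small simulation of this histogram argument (Python sketch; the bin-counting below is just for illustration): 100 uniform points, 10 bins per axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, bins_per_axis = 100, 10
for d in (1, 2, 3):
    x = rng.uniform(size=(n_points, d))
    # map each point to its bin index and count distinct occupied bins
    idx = np.floor(x * bins_per_axis).astype(int)
    occupied = len({tuple(row) for row in idx})
    total = bins_per_axis ** d
    print(d, n_points / total, f"{total - occupied} of {total} bins empty")
# average points per bin: 10, 1, 0.1 -- and in 3D the vast majority of bins are empty
```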
Dimensionality reduction
what do we do about this? we avoid unnecessary dimensions
unnecessary can be measured in two ways:
1. features are not discriminant
2. features are not independent
non-discriminant means that they do not separate the classes well
[figure: a discriminant feature vs. a non-discriminant feature]
Dimensionality reduction
dependent features, even if very discriminant, are not needed - one is enough!
e.g. a data-mining company studying consumer credit card ratings
X = {salary, mortgage, car loan, # of kids, profession, ...}
the first three features tend to be highly correlated:
• “the more you make, the higher the mortgage, the more expensive the car you drive”
• from one of these variables I can predict the others very well
including features 2 and 3 does not increase the discrimination, but increases the dimension and leads to poor density estimates
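To make the correlation concrete, here is a small Python sketch on synthetic data mimicking this credit-card example; the variables, coefficients, and noise levels are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
salary = rng.normal(60, 15, n)                   # hypothetical salaries (k$)
mortgage = 4 * salary + rng.normal(0, 10, n)     # strongly tied to salary
car_loan = 0.5 * salary + rng.normal(0, 5, n)    # also tied to salary
kids = rng.integers(0, 4, n)                     # unrelated to income

X = np.column_stack([salary, mortgage, car_loan, kids])
print(np.round(np.corrcoef(X, rowvar=False), 2))
# the salary / mortgage / car-loan block is highly correlated; 'kids' is not
```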
Dimensionality reduction
Q: how do we detect the presence of these correlations?
A: the data “lives” in a low-dimensional subspace (up to some amount of noise), e.g.
[figure: scatter plot of car loan vs. salary; the points cluster around a line, and projecting onto the 1D subspace y = a x gives the new feature y]
in the example above we have a 3D hyper-plane in 5D
if we can find this hyper-plane we can
• project the data onto it
• get rid of half of the dimensions without introducing significant error
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g.
[figure: a 1D subspace in 2D and a 2D subspace in 3D]
• this means that if we fit a Gaussian to the data the equiprobability contours are going to be highly skewed ellipsoids
Gaussian review
the equiprobability contours of a Gaussian are the points x such that

(x - \mu)^T \Sigma^{-1} (x - \mu) = K

let's consider the change of variable z = x - µ, which only moves the origin by µ. The equation

z^T \Sigma^{-1} z = K

is the equation of an ellipse. this is easy to see when Σ is diagonal:

\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2) \quad\Rightarrow\quad z^T \Sigma^{-1} z = \sum_i \frac{z_i^2}{\sigma_i^2}
Gaussian review
this is the equation of an ellipse with principal lengths σ_i
• e.g. when d = 2,

\frac{z_1^2}{\sigma_1^2} + \frac{z_2^2}{\sigma_2^2} = 1

is the ellipse with half-axes σ_1 along z_1 and σ_2 along z_2
[figure: the ellipse in the (z_1, z_2) plane, with axes σ_1 and σ_2]
introduce the transformation y = Φ z
Gaussian review
introduce the transformation y = Φ z
then y has covariance

\Sigma_y = E[yy^T] = \Phi \, E[zz^T] \, \Phi^T = \Phi \Sigma_z \Phi^T

if Φ is orthonormal this is just a rotation and we have
[figure: y = Φ z rotates the ellipse with axes σ_1, σ_2 from the (z_1, z_2) plane into the (y_1, y_2) plane, where its principal directions are φ_1 and φ_2]
we obtain a rotated ellipse with principal components φ_1 and φ_2, which are the columns of Φ
note that

\Sigma_y = \Phi \, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2) \, \Phi^T

is the eigen-decomposition of Σ_y
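A quick numerical check of this (Python sketch; the 30-degree rotation and Λ = diag(9, 1) are arbitrary choices): y = Φ z has covariance Φ Λ Φ^T, and eigendecomposing it recovers the σ_i² and the columns of Φ.

```python
import numpy as np

rng = np.random.default_rng(0)
Lambda = np.diag([9.0, 1.0])                        # sigma1^2 = 9, sigma2^2 = 1
theta = np.pi / 6
Phi = np.array([[np.cos(theta), -np.sin(theta)],    # orthonormal: rotation by 30 degrees
                [np.sin(theta),  np.cos(theta)]])

z = rng.multivariate_normal(np.zeros(2), Lambda, size=100000)
y = z @ Phi.T                                       # y = Phi z, sample by sample
Sigma_y = np.cov(y, rowvar=False)

print(np.round(Sigma_y, 2))                         # close to Phi Lambda Phi^T
evals, evecs = np.linalg.eigh(Sigma_y)
print(np.round(evals, 2))                           # ~ (1, 9): the diagonal of Lambda
print(np.round(evecs, 2))                           # ~ columns of Phi (up to sign and order)
```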
Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ
[figure: ellipse in the (y_1, y_2) plane with principal components φ_1, φ_2 and principal lengths λ_1, λ_2]
by computing the eigenvalues we know if the data is flat
λ_1 >> λ_2: flat          λ_1 = λ_2: not flat
[figure: an elongated ellipse (λ_1 >> λ_2) vs. a circular one (λ_1 = λ_2)]
Principal component analysis (learning)
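A minimal sketch of how the PCA basis can be learned from data, assuming the standard recipe implied by the previous slides (center the data, estimate the sample covariance, keep the eigenvectors with the largest eigenvalues); the helper names `pca_fit` and `pca_project` are just illustrative.

```python
import numpy as np

def pca_fit(X, k):
    """Learn a k-dimensional PCA basis from data X (n_samples x n_features)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / (len(X) - 1)          # sample covariance
    evals, evecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    order = np.argsort(evals)[::-1]           # sort by decreasing eigenvalue
    return mu, evals[order][:k], evecs[:, order][:, :k]

def pca_project(X, mu, basis):
    """Project data onto the learned principal components."""
    return (X - mu) @ basis

# usage on synthetic "flat" data: 3D points that live near a 1D subspace
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1)) @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(500, 3))
mu, evals, basis = pca_fit(X, k=1)
print(np.round(evals, 3))                     # one dominant eigenvalue: the data is flat
print(pca_project(X, mu, basis).shape)        # (500, 1): 3D data reduced to 1D
```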
Principal component analysis
Principal components
what are they? in some cases it is possible to see
example: eigenfaces
• face recognition problem: can you identify who is the person in this picture?
• training:
  • assemble examples from people's faces
  • compute the PCA basis
  • project each image into PCA space
• recognition:
  • project the image to classify into PCA space
  • find the closest vector in that space
  • label the image with the identity of this nearest neighbor
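Reading the bullets above as a pipeline, here is a minimal Python sketch; the function names, the choice of k = 20 components, and the tiny synthetic "faces" are all illustrative assumptions, and faces are assumed to arrive as flattened pixel vectors. The PCA basis is obtained with an SVD of the centered data, which is equivalent to eigendecomposing the sample covariance.

```python
import numpy as np

def train_eigenfaces(faces, labels, k=20):
    """faces: (n_images, n_pixels) array of flattened training face images."""
    mu = faces.mean(axis=0)
    # right singular vectors of the centered data = PCA basis ("eigenfaces")
    _, _, Vt = np.linalg.svd(faces - mu, full_matrices=False)
    basis = Vt[:k].T                       # (n_pixels, k) leading principal components
    coords = (faces - mu) @ basis          # each training face in PCA space
    return mu, basis, coords, np.asarray(labels)

def recognize(face, mu, basis, coords, labels):
    """Label a new face with the identity of its nearest neighbor in PCA space."""
    q = (face - mu) @ basis
    nearest = np.argmin(np.linalg.norm(coords - q, axis=1))
    return labels[nearest]

# tiny synthetic check: 6 random "faces" of 64 pixels, 3 identities
rng = np.random.default_rng(0)
faces = rng.normal(size=(6, 64))
model = train_eigenfaces(faces, ["ann", "ann", "bob", "bob", "eve", "eve"], k=3)
print(recognize(faces[2] + 0.01 * rng.normal(size=64), *model))   # -> "bob"
```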
Principal components
face examples
[figure: example face images from the training set]
Principal components
principal components (eigenfaces)
• high-energy ones tend to have low-frequency content
• capture average face, illumination, etc.
• at the intermediate level we have face detail
• low-energy ones tend to be high-frequency noise