Curse of Dimensionality - George Mason University
Curse of Dimensionality
Real-world applications deal with spaces of high dimensionality.
High dimensionality poses serious challenges for the design of pattern recognition techniques.
Oil flow data in two dimensions:
Red = homogeneous
Green = annular
Blue = laminar
Curse of Dimensionality
A simple classification approach
Curse of Dimensionality
Going higher in dimensionality…
The number of regions of a regular grid grows exponentially with the dimensionality of the space
[Figure: DMAX/DMIN, the ratio of the maximum to the minimum distance]
Consider $N$ points drawn from a uniform distribution in the unit cube $[-0.5,\ 0.5]^q$.
Curse of Dimensionality
For $K = 1$ (nearest-neighbor classification), the distance $d(q,N)$ from a query point to its nearest neighbor among the $N$ samples satisfies
$$d(q,N) = O\!\left(N^{-1/q}\right)$$
so to keep the nearest-neighbor distance fixed, the number of samples $N$ must grow exponentially with the dimensionality $q$.
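This is easy to check empirically. The sketch below is an added illustration, not part of the original slides (function name and parameter values are arbitrary): it samples N points uniformly from the unit cube and reports the average distance from the center of the cube to its nearest neighbor as q grows.

```python
import numpy as np

def mean_nn_distance(q, N=1000, trials=50, rng=None):
    """Average distance from the origin to the nearest of N uniform points in [-0.5, 0.5]^q."""
    rng = rng or np.random.default_rng(0)
    dists = []
    for _ in range(trials):
        pts = rng.uniform(-0.5, 0.5, size=(N, q))
        dists.append(np.linalg.norm(pts, axis=1).min())
    return np.mean(dists)

for q in (1, 2, 5, 10, 20, 50):
    print(q, round(mean_nn_distance(q), 3))
# The nearest-neighbor distance grows quickly with q,
# consistent with d(q, N) = O(N^(-1/q)) for fixed N.
```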
Curse of Dimensionality
Microarray Data Analysis
Document Classification
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel
[Figure: the document's words collected into a bag: brain, visual, eye, cell, cerebral cortex, perception, sensory, retinal, Wiesel, optical nerve]
Bag-of-words representation of a document
$w_1\ w_2\ \dots\ w_q$
Term Frequency
[Figure: each document in the collection is mapped to a term-frequency vector over the words of a common dictionary]
What’s the size of the dictionary?
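As a rough illustration of the idea (added here, not from the slides; the tokenization is deliberately crude and the helper names are made up), a document can be mapped to a term-frequency vector over a dictionary like this:

```python
from collections import Counter

docs = [
    "Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones.",
    "Our perception of the world around us is based essentially on the messages that reach the brain from our eyes.",
]

# Build the dictionary: one coordinate per distinct word across the collection.
tokenized = [d.lower().replace(",", " ").replace(".", " ").split() for d in docs]
dictionary = sorted(set(w for doc in tokenized for w in doc))

def term_frequency(tokens, dictionary):
    """Entry j counts how often dictionary word j occurs in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in dictionary]

vectors = [term_frequency(doc, dictionary) for doc in tokenized]
print(len(dictionary))   # the dimensionality of the representation = size of the dictionary
```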
Curse of Dimensionality
$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^i$$

$2 \times 2$ covariance matrix:

$$E\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]
= E\left[\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}
\begin{pmatrix} x_1-\mu_1, & x_2-\mu_2 \end{pmatrix}\right]
= E\begin{pmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) \\ (x_1-\mu_1)(x_2-\mu_2) & (x_2-\mu_2)^2 \end{pmatrix}$$

$$= \frac{1}{N}\sum_{i=1}^{N}\begin{pmatrix} (x_1^i-\mu_1)^2 & (x_1^i-\mu_1)(x_2^i-\mu_2) \\ (x_1^i-\mu_1)(x_2^i-\mu_2) & (x_2^i-\mu_2)^2 \end{pmatrix}
= \begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)^2 & \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)(x_2^i-\mu_2) \\[4pt] \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)(x_2^i-\mu_2) & \frac{1}{N}\sum_{i=1}^{N}(x_2^i-\mu_2)^2 \end{pmatrix}$$
Example $2 \times 2$ covariance matrices with different degrees of correlation between the two coordinates:

$$\begin{pmatrix} 0.99 & -0.5 \\ -0.5 & 1.06 \end{pmatrix} \quad
\begin{pmatrix} 1.04 & -1.05 \\ -1.05 & 1.15 \end{pmatrix} \quad
\begin{pmatrix} 0.93 & 0.01 \\ 0.01 & 1.05 \end{pmatrix} \quad
\begin{pmatrix} 0.97 & 0.49 \\ 0.49 & 1.04 \end{pmatrix} \quad
\begin{pmatrix} 0.94 & 0.93 \\ 0.93 & 1.03 \end{pmatrix}$$
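A minimal NumPy sketch of the sample covariance computation above (added illustration; the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated coordinates, N samples (rows are observations).
N = 1000
x1 = rng.normal(0.0, 1.0, N)
x2 = 0.9 * x1 + 0.4 * rng.normal(0.0, 1.0, N)
X = np.column_stack([x1, x2])

mu = X.mean(axis=0)                      # sample mean (mu_1, mu_2)
cov = (X - mu).T @ (X - mu) / N          # (1/N) sum_i (x^i - mu)(x^i - mu)^T
print(np.round(cov, 2))                  # agrees with np.cov(X.T, bias=True)
```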
Dimensionality Reduction
• Many dimensions are often interdependent (correlated).
We can:
• Reduce the dimensionality of the problem;
• Transform interdependent coordinates into significant and independent ones (see the sketch below).
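Principal component analysis (PCA) is one standard technique for this; the slides do not prescribe a particular method, so the following is only a sketch under that assumption (the data are synthetic):

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal directions of their covariance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                           # columns = principal directions
    return Xc @ W                                   # new, decorrelated coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated 5-D data
Z = pca(X, n_components=2)
print(np.round(np.cov(Z.T), 2))                     # off-diagonal entries are ~0
```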
Decision Theory
• Decision theory, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty
• Training data: inputs $\mathbf{x}$ together with their class labels
• Inference step: determine the joint probability distribution $p(\mathbf{x}, C_k)$
• Decision step: make the optimal class assignment for a new input
Decision Theory
Classification example: medical diagnosis problem
• Input $\mathbf{x}$: set of pixel intensities in an image
• Two classes:
  – absence of cancer
  – presence of cancer
• Inference step: estimate the posterior probabilities $p(C_k \mid \mathbf{x})$
• Decision step: given $\mathbf{x}$, predict a class so that a measure of error is minimized according to the estimated probabilities
Decision Theory
How do probabilities play a role in decision making?
• Decision step: given $\mathbf{x}$, predict a class $C_k$
Thus, we are interested in the posterior probabilities $p(C_k \mid \mathbf{x})$.
Intuitively: we want to minimize the chance of assigning $\mathbf{x}$ to the wrong class. Thus, choose the class that gives the higher posterior probability $p(C_k \mid \mathbf{x})$.
Minimizing the misclassification rate
• Goal: minimize the number of misclassifications
We need to find a rule that assigns each input vector $\mathbf{x}$ to one of the possible classes $C_k$.
Such a rule divides the input space into regions $\mathcal{R}_k$ so that all points in $\mathcal{R}_k$ are assigned to class $C_k$.
Boundaries between regions are called decision boundaries.
Minimizing the misclassification rate
• Goal: minimize the number of misclassifications. For two classes:
$$p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, C_2) + p(\mathbf{x} \in \mathcal{R}_2, C_1)
= \int_{\mathcal{R}_1} p(\mathbf{x}, C_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_1)\, d\mathbf{x}$$
• Assign $\mathbf{x}$ to the class that gives the smaller value of the integrand:
  – Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
  – Choose $C_2$ otherwise
Since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$ and $p(\mathbf{x})$ is common to both terms, this is equivalent to:
  – Choose $C_1$ if $p(C_1 \mid \mathbf{x}) > p(C_2 \mid \mathbf{x})$
  – Choose $C_2$ otherwise
Minimizing the misclassification rate
Optimal decision boundary: the set of points $\mathbf{x}$ where $p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x})$
General case of K classes:
Choose the class $C_k$ that gives the largest posterior probability $p(C_k \mid \mathbf{x})$
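As an added illustration (the posterior values below are made up), the resulting decision rule is simply an argmax over the posteriors:

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for 4 inputs and K = 3 classes (each row sums to 1).
posteriors = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.55, 0.35],
    [0.30, 0.30, 0.40],
    [0.05, 0.90, 0.05],
])

# Minimizing the misclassification rate: choose the class with the largest posterior.
decisions = posteriors.argmax(axis=1)
print(decisions)   # [0 1 2 1]
```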
Minimizing the expected loss
Loss matrix for the medical diagnosis example: rows correspond to the true class (cancer, normal) and columns to the assigned class (cancer, normal); each entry specifies the cost of that kind of decision.
The optimal solution is the one that minimizes the expected loss.
Minimizing the expected loss
$$E[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}$$
Each $\mathbf{x}$ is assigned to the class $j$ that minimizes $\sum_k L_{kj}\, p(C_k \mid \mathbf{x})$.
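An added sketch of decision making with a loss matrix; the loss values and posteriors are hypothetical, chosen only so that missing a cancer is far more costly than a false alarm:

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: row k = true class, column j = assigned class.
# Classes: 0 = cancer, 1 = normal. Missing a cancer is penalized heavily.
L = np.array([[0.0, 100.0],
              [1.0,   0.0]])

posteriors = np.array([[0.300, 0.700],    # p(cancer | x), p(normal | x) for two inputs
                       [0.005, 0.995]])

# Expected loss of assigning x to class j: sum_k L[k, j] * p(C_k | x)
expected_loss = posteriors @ L
decisions = expected_loss.argmin(axis=1)
print(decisions)   # [0 1]: even a 30% cancer probability leads to a "cancer" decision here
```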
The Reject Option: avoid making a decision on inputs for which the largest posterior probability $p(C_k \mid \mathbf{x})$ is not sufficiently high, leaving those cases to a different procedure (e.g., a human expert).
Inference and Decision
Goal: determine the posterior probabilities $p(C_k \mid \mathbf{x})$

Generative Methods
Model the class-conditional densities $p(\mathbf{x} \mid C_k)$ for each class $C_k$, together with the prior class probabilities $p(C_k)$, and use Bayes' theorem:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$$
where
$$p(\mathbf{x}) = \sum_k p(\mathbf{x} \mid C_k)\, p(C_k)$$
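A minimal generative sketch (added here; the 1-D Gaussian class-conditionals and priors are made-up illustrative choices, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used as a class-conditional model p(x | C_k)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical class-conditional parameters and priors (illustrative values only).
means, stds = [-1.0, 2.0], [1.0, 1.5]
priors = np.array([0.6, 0.4])

def posterior(x):
    """p(C_k | x) = p(x | C_k) p(C_k) / p(x), with p(x) = sum_k p(x | C_k) p(C_k)."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(means, stds)])
    joint = likelihoods * priors
    return joint / joint.sum()

post = posterior(0.0)
print(np.round(post, 3), post.argmax())   # posteriors sum to 1; argmax is the decision
```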
Discriminative Methods
Model the posterior probabilities $p(C_k \mid \mathbf{x})$ directly.

Discriminant Functions
Learn a function $f(\mathbf{x})$ that maps each input $\mathbf{x}$ directly onto a class label.
Linear Models for Classification
Classification: given an input vector $\mathbf{x}$, assign it to one of $K$ classes $C_k$, where $k = 1, \dots, K$.
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Linear models: the decision surfaces are linear functions of the input vector $\mathbf{x}$. They are defined by $(D-1)$-dimensional hyperplanes within the $D$-dimensional input space.
Linear Models for Classification
For regression: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$
For classification, we want to predict class labels, or more generally class posterior probabilities.
We transform the linear function of $\mathbf{w}$ using a nonlinear function $f(\cdot)$, so that
$$y(\mathbf{x}) = f\!\left(\mathbf{w}^T\mathbf{x} + w_0\right)$$
These are called generalized linear models.
Linear Discriminant Functions
Two classes:
$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$
If $y(\mathbf{x}) \geq 0$, assign $\mathbf{x}$ to $C_1$; otherwise assign $\mathbf{x}$ to $C_2$.
Decision boundary: $y(\mathbf{x}) = 0$
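A tiny sketch of this rule (the weight vector and bias below are arbitrary example values, not learned):

```python
import numpy as np

w = np.array([1.0, -2.0])   # example weight vector (not learned)
w0 = 0.5                    # example bias

def classify(x):
    """Assign x to C1 if y(x) = w^T x + w0 >= 0, otherwise to C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

print(classify(np.array([2.0, 0.5])))   # y = 1.5  -> C1
print(classify(np.array([0.0, 1.0])))   # y = -1.5 -> C2
```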
Linear Discriminant Functions
Geometrical properties:
Decision boundary: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = 0$
Let $\mathbf{x}_1, \mathbf{x}_2$ be two points which lie on the decision boundary. Then:
$$y(\mathbf{x}_1) = \mathbf{w}^T\mathbf{x}_1 + w_0 = 0, \qquad y(\mathbf{x}_2) = \mathbf{w}^T\mathbf{x}_2 + w_0 = 0
\;\;\Rightarrow\;\; \mathbf{w}^T(\mathbf{x}_1 - \mathbf{x}_2) = 0$$
Hence $\mathbf{w}$ represents the orthogonal direction to the decision boundary.
Geometrical properties (cont'd)
Let $\mathbf{w}^* = \mathbf{w}/\|\mathbf{w}\|$ be the unit vector in the direction of $\mathbf{w}$, and let $\mathbf{x}_0$ be any point on the decision surface, so that $y(\mathbf{x}_0) = \mathbf{w}^T\mathbf{x}_0 + w_0 = 0$.
The projection of $\mathbf{x}_0$ onto the direction of $\mathbf{w}$ is
$$\mathbf{w}^{*T}\mathbf{x}_0 = \frac{\mathbf{w}^T\mathbf{x}_0}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}$$
Thus $-w_0/\|\mathbf{w}\|$ is the signed orthogonal distance of the origin from the decision surface.
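A small numeric check of this property (example values only; w and w0 are arbitrary):

```python
import numpy as np

w = np.array([3.0, 4.0])    # example weight vector, ||w|| = 5
w0 = -10.0                  # example bias; surface is w^T x + w0 = 0

# A point on the decision surface, taken along the direction of w:
x0 = -w0 * w / (w @ w)      # satisfies w^T x0 + w0 = 0
print(w @ x0 + w0)          # 0.0 (x0 lies on the surface)

# Projection of x0 onto the unit vector w/||w||: equals -w0/||w||,
# the signed orthogonal distance of the origin from the decision surface.
print((w @ x0) / np.linalg.norm(w), -w0 / np.linalg.norm(w))   # 2.0  2.0
```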
Linear Discriminant Functions
Multiple classes
one-versus-the-rest: $K-1$ classifiers, each of which solves a two-class problem of separating points of class $C_k$ from points not in that class.
Linear Discriminant Functions Multiple classes
one-versus-one: K(K-1)/2 binary discriminant functions, one for every possible pair of classes.
Linear Discriminant Functions
Multiple classes
Solution: consider a single K-class discriminant comprising K linear functions of the form
$$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$$
Assign a point $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})\ \ \forall j \neq k$.
The decision boundary between class $C_k$ and class $C_j$ is given by
$$y_k(\mathbf{x}) = y_j(\mathbf{x}) \;\Rightarrow\; (\mathbf{w}_k - \mathbf{w}_j)^T\mathbf{x} + (w_{k0} - w_{j0}) = 0$$
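An added sketch of the K-class rule with arbitrary, unlearned example parameters: compute all $y_k(\mathbf{x})$ and take the argmax.

```python
import numpy as np

# Example parameters for K = 3 classes (rows of W are the w_k; values are arbitrary).
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])

def classify(x):
    """Assign x to the class k with the largest y_k(x) = w_k^T x + w_k0."""
    y = W @ x + w0
    return int(np.argmax(y))

print(classify(np.array([2.0, 0.0])))    # y = [2.0, -1.5, -0.5] -> class 0
print(classify(np.array([-1.0, 2.0])))   # y = [-1.0, 3.5, -2.5] -> class 1
```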
Linear Discriminant Functions
Two approaches:
Fisher’s linear discriminant
Perceptron algorithm
Fisher's Linear Discriminant
One way to view a linear classification model is in terms of dimensionality reduction.
Two-class case:
Suppose we project $\mathbf{x}$ onto one dimension: $y = \mathbf{w}^T\mathbf{x}$
Set a threshold $t$: if $y \geq t$, assign $\mathbf{x}$ to $C_1$; otherwise assign $\mathbf{x}$ to $C_2$.
Fisher’s Linear Discriminant
• Find an orientation along which the projected samples are well separated;
• This is exactly the goal of linear discriminant analysis (LDA);
• In other words: we are after the linear projection that best separates the data, i.e. best discriminates data of different classes.
LDA
• Data: $\{(\mathbf{x}_i, C_i)\}_{i=1}^{N}$ with $\mathbf{x}_i \in \mathbb{R}^q$ and $C_i \in \{C_1, C_2\}$
• $N_1$ samples of class $C_1$
• $N_2$ samples of class $C_2$
• Consider $\mathbf{w} \in \mathbb{R}^q$ with $\|\mathbf{w}\| = 1$
• Then $\mathbf{w}^T\mathbf{x}$ is the projection of $\mathbf{x}$ along the direction of $\mathbf{w}$
• We want the projections $\mathbf{w}^T\mathbf{x}$ with $\mathbf{x} \in C_1$ to be well separated from the projections $\mathbf{w}^T\mathbf{x}$ with $\mathbf{x} \in C_2$
LDA
• A measure of the separation between the projected points is the difference of the sample means. For class $C_i$:
$$\mathbf{m}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{x}, \qquad
m_i = \frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{m}_i$$
$$m_1 - m_2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)$$
LDA
• To obtain good separation of the projected data we really want the difference between the projected means to be large relative to some measure of the standard deviation within each class $C_i$:
$$s_i^2 = \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - m_i\right)^2$$
The total within-class scatter of the projected data is $s_1^2 + s_2^2$, and we seek
$$\arg\max_{\mathbf{w}}\ \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
LDA
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
To obtain $J(\mathbf{w})$ as an explicit function of $\mathbf{w}$ we define the following matrices:
$$S_i = \sum_{\mathbf{x}\in C_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^T, \qquad S_W = S_1 + S_2$$
Then:
$$s_i^2 = \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - m_i\right)^2
= \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{m}_i\right)^2
= \mathbf{w}^T\left[\sum_{\mathbf{x}\in C_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^T\right]\mathbf{w}
= \mathbf{w}^T S_i\,\mathbf{w}$$
LDA
So $s_1^2 = \mathbf{w}^T S_1\mathbf{w}$ and $s_2^2 = \mathbf{w}^T S_2\mathbf{w}$. Thus:
$$s_1^2 + s_2^2 = \mathbf{w}^T S_1\mathbf{w} + \mathbf{w}^T S_2\mathbf{w} = \mathbf{w}^T(S_1 + S_2)\mathbf{w} = \mathbf{w}^T S_W\mathbf{w}$$
Similarly:
$$(m_1 - m_2)^2 = \left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)^2
= \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}
= \mathbf{w}^T S_B\,\mathbf{w}$$
where $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$.
LDA
We have obtained:
$$s_1^2 + s_2^2 = \mathbf{w}^T S_W\mathbf{w}, \qquad (m_1 - m_2)^2 = \mathbf{w}^T S_B\mathbf{w}$$
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T S_B\mathbf{w}}{\mathbf{w}^T S_W\mathbf{w}}$$
$$\arg\max_{\mathbf{w}}\ \frac{\mathbf{w}^T S_B\mathbf{w}}{\mathbf{w}^T S_W\mathbf{w}}$$
LDA
$J(\mathbf{w})$ is maximized when
$$\left(\mathbf{w}^T S_B\mathbf{w}\right) S_W\mathbf{w} = \left(\mathbf{w}^T S_W\mathbf{w}\right) S_B\mathbf{w}$$
We observe that $S_B\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\left((\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}\right)$ is always in the direction of $(\mathbf{m}_1 - \mathbf{m}_2)$. Since only the direction of $\mathbf{w}$ matters, we can drop the scalar factors and obtain
$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
LDA
[Figure: projection onto the line joining the class means vs. the projection given by the LDA solution]
LDA
Solution of LDA:
$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
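A minimal NumPy sketch of this closed-form solution on synthetic two-class data (added illustration; only the direction of w matters):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class C1 samples
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))   # class C2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                   # class means
S1 = (X1 - m1).T @ (X1 - m1)                                # within-class scatter of C1
S2 = (X2 - m2).T @ (X2 - m2)                                # within-class scatter of C2
SW = S1 + S2

w = np.linalg.solve(SW, m1 - m2)                            # w = SW^{-1}(m1 - m2)
w /= np.linalg.norm(w)                                      # normalize (direction is what matters)

# The projected class means are well separated along w:
print(w @ m1, w @ m2)
```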