Curse of Dimensionality - George Mason University
Curse of Dimensionality
Real-world applications deal with spaces of high dimensionality.
High dimensionality poses serious challenges for the design of pattern recognition techniques.
Oil flow data in two dimensions:
Red = homogeneous
Green = annular
Blue = laminar
Curse of Dimensionality
A simple classification approach
Curse of Dimensionality
Going higher in dimensionality…
The number of regions of a regular grid grows exponentially with the dimensionality of the space
[Figure: DMAX/DMIN, the ratio of the maximum to the minimum distance]
Consider $N$ points drawn from a uniform distribution in the unit cube $[-0.5,\ 0.5]^q$.
Curse of Dimensionality
For $K = 1$ (nearest-neighbor classification), the distance $d(q,N)$ from a query point to its nearest neighbor among the $N$ samples satisfies
$$d(q,N) = O\!\left(N^{-1/q}\right)$$
so to keep the nearest-neighbor distance fixed, the number of samples $N$ must grow exponentially with the dimensionality $q$.
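This is easy to check empirically. The sketch below is an added illustration, not part of the original slides (function name and parameter values are arbitrary): it samples N points uniformly from the unit cube and reports the average distance from the center of the cube to its nearest neighbor as q grows.

```python
import numpy as np

def mean_nn_distance(q, N=1000, trials=50, rng=None):
    """Average distance from the origin to the nearest of N uniform points in [-0.5, 0.5]^q."""
    rng = rng or np.random.default_rng(0)
    dists = []
    for _ in range(trials):
        pts = rng.uniform(-0.5, 0.5, size=(N, q))
        dists.append(np.linalg.norm(pts, axis=1).min())
    return np.mean(dists)

for q in (1, 2, 5, 10, 20, 50):
    print(q, round(mean_nn_distance(q), 3))
# The nearest-neighbor distance grows quickly with q,
# consistent with d(q, N) = O(N^(-1/q)) for fixed N.
```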
Curse of Dimensionality
Microarray Data Analysis
Document Classification
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel
[Figure: the document's words collected into a bag: brain, visual, eye, cell, cerebral cortex, perception, sensory, retinal, Wiesel, optical nerve]
Bag-of-words representation of a document
$w_1\ w_2\ \dots\ w_q$
Term Frequency
[Figure: each document in the collection is mapped to a term-frequency vector over the words of a common dictionary]
What’s the size of the dictionary?
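As a rough illustration of the idea (added here, not from the slides; the tokenization is deliberately crude and the helper names are made up), a document can be mapped to a term-frequency vector over a dictionary like this:

```python
from collections import Counter

docs = [
    "Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones.",
    "Our perception of the world around us is based essentially on the messages that reach the brain from our eyes.",
]

# Build the dictionary: one coordinate per distinct word across the collection.
tokenized = [d.lower().replace(",", " ").replace(".", " ").split() for d in docs]
dictionary = sorted(set(w for doc in tokenized for w in doc))

def term_frequency(tokens, dictionary):
    """Entry j counts how often dictionary word j occurs in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in dictionary]

vectors = [term_frequency(doc, dictionary) for doc in tokenized]
print(len(dictionary))   # the dimensionality of the representation = size of the dictionary
```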
Curse of Dimensionality
$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}^i$$

$2 \times 2$ covariance matrix:

$$E\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]
= E\left[\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{pmatrix}
\begin{pmatrix} x_1-\mu_1, & x_2-\mu_2 \end{pmatrix}\right]
= E\begin{pmatrix} (x_1-\mu_1)^2 & (x_1-\mu_1)(x_2-\mu_2) \\ (x_1-\mu_1)(x_2-\mu_2) & (x_2-\mu_2)^2 \end{pmatrix}$$

$$= \frac{1}{N}\sum_{i=1}^{N}\begin{pmatrix} (x_1^i-\mu_1)^2 & (x_1^i-\mu_1)(x_2^i-\mu_2) \\ (x_1^i-\mu_1)(x_2^i-\mu_2) & (x_2^i-\mu_2)^2 \end{pmatrix}
= \begin{pmatrix} \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)^2 & \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)(x_2^i-\mu_2) \\[4pt] \frac{1}{N}\sum_{i=1}^{N}(x_1^i-\mu_1)(x_2^i-\mu_2) & \frac{1}{N}\sum_{i=1}^{N}(x_2^i-\mu_2)^2 \end{pmatrix}$$
Example $2 \times 2$ covariance matrices with different degrees of correlation between the two coordinates:

$$\begin{pmatrix} 0.99 & -0.5 \\ -0.5 & 1.06 \end{pmatrix} \quad
\begin{pmatrix} 1.04 & -1.05 \\ -1.05 & 1.15 \end{pmatrix} \quad
\begin{pmatrix} 0.93 & 0.01 \\ 0.01 & 1.05 \end{pmatrix} \quad
\begin{pmatrix} 0.97 & 0.49 \\ 0.49 & 1.04 \end{pmatrix} \quad
\begin{pmatrix} 0.94 & 0.93 \\ 0.93 & 1.03 \end{pmatrix}$$
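A minimal NumPy sketch of the sample covariance computation above (added illustration; the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated coordinates, N samples (rows are observations).
N = 1000
x1 = rng.normal(0.0, 1.0, N)
x2 = 0.9 * x1 + 0.4 * rng.normal(0.0, 1.0, N)
X = np.column_stack([x1, x2])

mu = X.mean(axis=0)                      # sample mean (mu_1, mu_2)
cov = (X - mu).T @ (X - mu) / N          # (1/N) sum_i (x^i - mu)(x^i - mu)^T
print(np.round(cov, 2))                  # agrees with np.cov(X.T, bias=True)
```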
Dimensionality Reduction
• Many dimensions are often interdependent (correlated).
We can:
• Reduce the dimensionality of the problem;
• Transform interdependent coordinates into significant and independent ones (see the sketch below).
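Principal component analysis (PCA) is one standard technique for this; the slides do not prescribe a particular method, so the following is only a sketch under that assumption (the data are synthetic):

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal directions of their covariance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                           # columns = principal directions
    return Xc @ W                                   # new, decorrelated coordinates

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated 5-D data
Z = pca(X, n_components=2)
print(np.round(np.cov(Z.T), 2))                     # off-diagonal entries are ~0
```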
Decision Theory
• Decision theory, when combined with probability theory, allows us to make optimal decisions in situations involving uncertainty
• Training data: inputs $\mathbf{x}$ together with their class labels
• Inference step: determine the joint probability distribution $p(\mathbf{x}, C_k)$
• Decision step: make the optimal class assignment for a new input
Decision Theory
Classification example: medical diagnosis problem
• Input $\mathbf{x}$: set of pixel intensities in an image
• Two classes:
  – absence of cancer
  – presence of cancer
• Inference step: estimate the posterior probabilities $p(C_k \mid \mathbf{x})$
• Decision step: given $\mathbf{x}$, predict a class so that a measure of error is minimized according to the estimated probabilities
Decision Theory
How do probabilities play a role in decision making?
• Decision step: given $\mathbf{x}$, predict a class $C_k$
Thus, we are interested in the posterior probabilities $p(C_k \mid \mathbf{x})$.
Intuitively: we want to minimize the chance of assigning $\mathbf{x}$ to the wrong class. Thus, choose the class that gives the higher posterior probability $p(C_k \mid \mathbf{x})$.
Minimizing the misclassification rate
• Goal: minimize the number of misclassifications
We need to find a rule that assigns each input vector $\mathbf{x}$ to one of the possible classes $C_k$.
Such a rule divides the input space into regions $\mathcal{R}_k$ so that all points in $\mathcal{R}_k$ are assigned to class $C_k$.
Boundaries between regions are called decision boundaries.
Minimizing the misclassification rate
• Goal: minimize the number of misclassifications. For two classes:
$$p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, C_2) + p(\mathbf{x} \in \mathcal{R}_2, C_1)
= \int_{\mathcal{R}_1} p(\mathbf{x}, C_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_1)\, d\mathbf{x}$$
• Assign $\mathbf{x}$ to the class that gives the smaller value of the integrand:
  – Choose $C_1$ if $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$
  – Choose $C_2$ otherwise
Since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$ and $p(\mathbf{x})$ is common to both terms, this is equivalent to:
  – Choose $C_1$ if $p(C_1 \mid \mathbf{x}) > p(C_2 \mid \mathbf{x})$
  – Choose $C_2$ otherwise
Minimizing the misclassification rate
Optimal decision boundary: the set of points $\mathbf{x}$ where $p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x})$
General case of K classes:
Choose the class $C_k$ that gives the largest posterior probability $p(C_k \mid \mathbf{x})$
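As an added illustration (the posterior values below are made up), the resulting decision rule is simply an argmax over the posteriors:

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for 4 inputs and K = 3 classes (each row sums to 1).
posteriors = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.55, 0.35],
    [0.30, 0.30, 0.40],
    [0.05, 0.90, 0.05],
])

# Minimizing the misclassification rate: choose the class with the largest posterior.
decisions = posteriors.argmax(axis=1)
print(decisions)   # [0 1 2 1]
```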
Minimizing the expected loss
Loss matrix for the medical diagnosis example: rows correspond to the true class (cancer, normal) and columns to the assigned class (cancer, normal); each entry specifies the cost of that kind of decision.
The optimal solution is the one that minimizes the expected loss.
Minimizing the expected loss
$$E[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}$$
Each $\mathbf{x}$ is assigned to the class $j$ that minimizes $\sum_k L_{kj}\, p(C_k \mid \mathbf{x})$.
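An added sketch of decision making with a loss matrix; the loss values and posteriors are hypothetical, chosen only so that missing a cancer is far more costly than a false alarm:

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: row k = true class, column j = assigned class.
# Classes: 0 = cancer, 1 = normal. Missing a cancer is penalized heavily.
L = np.array([[0.0, 100.0],
              [1.0,   0.0]])

posteriors = np.array([[0.300, 0.700],    # p(cancer | x), p(normal | x) for two inputs
                       [0.005, 0.995]])

# Expected loss of assigning x to class j: sum_k L[k, j] * p(C_k | x)
expected_loss = posteriors @ L
decisions = expected_loss.argmin(axis=1)
print(decisions)   # [0 1]: even a 30% cancer probability leads to a "cancer" decision here
```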
The Reject Option: avoid making a decision on inputs for which the largest posterior probability $p(C_k \mid \mathbf{x})$ is not sufficiently high, leaving those cases to a different procedure (e.g., a human expert).
Inference and Decision
Goal: determine the posterior probabilities $p(C_k \mid \mathbf{x})$

Generative Methods
Model the class-conditional densities $p(\mathbf{x} \mid C_k)$ for each class $C_k$, together with the prior class probabilities $p(C_k)$, and use Bayes' theorem:
$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$$
where
$$p(\mathbf{x}) = \sum_k p(\mathbf{x} \mid C_k)\, p(C_k)$$
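A minimal generative sketch (added here; the 1-D Gaussian class-conditionals and priors are made-up illustrative choices, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used as a class-conditional model p(x | C_k)."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical class-conditional parameters and priors (illustrative values only).
means, stds = [-1.0, 2.0], [1.0, 1.5]
priors = np.array([0.6, 0.4])

def posterior(x):
    """p(C_k | x) = p(x | C_k) p(C_k) / p(x), with p(x) = sum_k p(x | C_k) p(C_k)."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(means, stds)])
    joint = likelihoods * priors
    return joint / joint.sum()

post = posterior(0.0)
print(np.round(post, 3), post.argmax())   # posteriors sum to 1; argmax is the decision
```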
Discriminative Methods
Model the posterior probabilities $p(C_k \mid \mathbf{x})$ directly.

Discriminant Functions
Learn a function $f(\mathbf{x})$ that maps each input $\mathbf{x}$ directly onto a class label.
Linear Models for Classification
Classification: given an input vector $\mathbf{x}$, assign it to one of $K$ classes $C_k$, where $k = 1, \dots, K$.
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Linear models: the decision surfaces are linear functions of the input vector $\mathbf{x}$. They are defined by $(D-1)$-dimensional hyperplanes within the $D$-dimensional input space.
Linear Models for Classification
For regression: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$
For classification, we want to predict class labels, or more generally class posterior probabilities.
We transform the linear function of $\mathbf{w}$ using a nonlinear function $f(\cdot)$, so that
$$y(\mathbf{x}) = f\!\left(\mathbf{w}^T\mathbf{x} + w_0\right)$$
These are called generalized linear models.
Linear Discriminant Functions
Two classes:
$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$
If $y(\mathbf{x}) \geq 0$, assign $\mathbf{x}$ to $C_1$; otherwise assign $\mathbf{x}$ to $C_2$.
Decision boundary: $y(\mathbf{x}) = 0$
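A tiny sketch of this rule (the weight vector and bias below are arbitrary example values, not learned):

```python
import numpy as np

w = np.array([1.0, -2.0])   # example weight vector (not learned)
w0 = 0.5                    # example bias

def classify(x):
    """Assign x to C1 if y(x) = w^T x + w0 >= 0, otherwise to C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

print(classify(np.array([2.0, 0.5])))   # y = 1.5  -> C1
print(classify(np.array([0.0, 1.0])))   # y = -1.5 -> C2
```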
Linear Discriminant Functions
Geometrical properties:
Decision boundary: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0 = 0$
Let $\mathbf{x}_1, \mathbf{x}_2$ be two points which lie on the decision boundary. Then:
$$y(\mathbf{x}_1) = \mathbf{w}^T\mathbf{x}_1 + w_0 = 0, \qquad y(\mathbf{x}_2) = \mathbf{w}^T\mathbf{x}_2 + w_0 = 0
\;\;\Rightarrow\;\; \mathbf{w}^T(\mathbf{x}_1 - \mathbf{x}_2) = 0$$
Hence $\mathbf{w}$ represents the orthogonal direction to the decision boundary.
Geometrical properties (cont'd)
Let $\mathbf{w}^* = \mathbf{w}/\|\mathbf{w}\|$ be the unit vector in the direction of $\mathbf{w}$, and let $\mathbf{x}_0$ be any point on the decision surface, so that $y(\mathbf{x}_0) = \mathbf{w}^T\mathbf{x}_0 + w_0 = 0$.
The projection of $\mathbf{x}_0$ onto the direction of $\mathbf{w}$ is
$$\mathbf{w}^{*T}\mathbf{x}_0 = \frac{\mathbf{w}^T\mathbf{x}_0}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}$$
Thus $-w_0/\|\mathbf{w}\|$ is the signed orthogonal distance of the origin from the decision surface.
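A small numeric check of this property (example values only; w and w0 are arbitrary):

```python
import numpy as np

w = np.array([3.0, 4.0])    # example weight vector, ||w|| = 5
w0 = -10.0                  # example bias; surface is w^T x + w0 = 0

# A point on the decision surface, taken along the direction of w:
x0 = -w0 * w / (w @ w)      # satisfies w^T x0 + w0 = 0
print(w @ x0 + w0)          # 0.0 (x0 lies on the surface)

# Projection of x0 onto the unit vector w/||w||: equals -w0/||w||,
# the signed orthogonal distance of the origin from the decision surface.
print((w @ x0) / np.linalg.norm(w), -w0 / np.linalg.norm(w))   # 2.0  2.0
```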
Linear Discriminant Functions
Multiple classes
one-versus-the-rest: $K-1$ classifiers, each of which solves a two-class problem of separating points of class $C_k$ from points not in that class.
Linear Discriminant Functions Multiple classes
one-versus-one: K(K-1)/2 binary discriminant functions, one for every possible pair of classes.
Linear Discriminant Functions
Multiple classes
Solution: consider a single K-class discriminant comprising K linear functions of the form
$$y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$$
Assign a point $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})\ \ \forall j \neq k$.
The decision boundary between class $C_k$ and class $C_j$ is given by
$$y_k(\mathbf{x}) = y_j(\mathbf{x}) \;\Rightarrow\; (\mathbf{w}_k - \mathbf{w}_j)^T\mathbf{x} + (w_{k0} - w_{j0}) = 0$$
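An added sketch of the K-class rule with arbitrary, unlearned example parameters: compute all $y_k(\mathbf{x})$ and take the argmax.

```python
import numpy as np

# Example parameters for K = 3 classes (rows of W are the w_k; values are arbitrary).
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])

def classify(x):
    """Assign x to the class k with the largest y_k(x) = w_k^T x + w_k0."""
    y = W @ x + w0
    return int(np.argmax(y))

print(classify(np.array([2.0, 0.0])))    # y = [2.0, -1.5, -0.5] -> class 0
print(classify(np.array([-1.0, 2.0])))   # y = [-1.0, 3.5, -2.5] -> class 1
```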
Linear Discriminant Functions
Two approaches:
Fisher’s linear discriminant
Perceptron algorithm
Fisher's Linear Discriminant
One way to view a linear classification model is in terms of dimensionality reduction.
Two-class case:
Suppose we project $\mathbf{x}$ onto one dimension: $y = \mathbf{w}^T\mathbf{x}$
Set a threshold $t$: if $y \geq t$, assign $\mathbf{x}$ to $C_1$; otherwise assign $\mathbf{x}$ to $C_2$.
Fisher’s Linear Discriminant
• Find an orientation along which the projected samples are well separated;
• This is exactly the goal of linear discriminant analysis (LDA);
• In other words: we are after the linear projection that best separates the data, i.e. best discriminates data of different classes.
LDA
• Data: $\{(\mathbf{x}_i, C_i)\}_{i=1}^{N}$ with $\mathbf{x}_i \in \mathbb{R}^q$ and $C_i \in \{C_1, C_2\}$
• $N_1$ samples of class $C_1$
• $N_2$ samples of class $C_2$
• Consider $\mathbf{w} \in \mathbb{R}^q$ with $\|\mathbf{w}\| = 1$
• Then $\mathbf{w}^T\mathbf{x}$ is the projection of $\mathbf{x}$ along the direction of $\mathbf{w}$
• We want the projections $\mathbf{w}^T\mathbf{x}$ with $\mathbf{x} \in C_1$ to be well separated from the projections $\mathbf{w}^T\mathbf{x}$ with $\mathbf{x} \in C_2$
LDA
• A measure of the separation between the projected points is the difference of the sample means. For class $C_i$:
$$\mathbf{m}_i = \frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{x}, \qquad
m_i = \frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{w}^T\mathbf{x} = \mathbf{w}^T\mathbf{m}_i$$
$$m_1 - m_2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)$$
LDA
• To obtain good separation of the projected data we really want the difference between the projected means to be large relative to some measure of the standard deviation within each class $C_i$:
$$s_i^2 = \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - m_i\right)^2$$
The total within-class scatter of the projected data is $s_1^2 + s_2^2$, and we seek
$$\arg\max_{\mathbf{w}}\ \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
LDA
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
To obtain $J(\mathbf{w})$ as an explicit function of $\mathbf{w}$ we define the following matrices:
$$S_i = \sum_{\mathbf{x}\in C_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^T, \qquad S_W = S_1 + S_2$$
Then:
$$s_i^2 = \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - m_i\right)^2
= \sum_{\mathbf{x}\in C_i}\left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{m}_i\right)^2
= \mathbf{w}^T\left[\sum_{\mathbf{x}\in C_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^T\right]\mathbf{w}
= \mathbf{w}^T S_i\,\mathbf{w}$$
LDA
So $s_1^2 = \mathbf{w}^T S_1\mathbf{w}$ and $s_2^2 = \mathbf{w}^T S_2\mathbf{w}$. Thus:
$$s_1^2 + s_2^2 = \mathbf{w}^T S_1\mathbf{w} + \mathbf{w}^T S_2\mathbf{w} = \mathbf{w}^T(S_1 + S_2)\mathbf{w} = \mathbf{w}^T S_W\mathbf{w}$$
Similarly:
$$(m_1 - m_2)^2 = \left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)^2
= \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}
= \mathbf{w}^T S_B\,\mathbf{w}$$
where $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$.
LDA
We have obtained:
$$s_1^2 + s_2^2 = \mathbf{w}^T S_W\mathbf{w}, \qquad (m_1 - m_2)^2 = \mathbf{w}^T S_B\mathbf{w}$$
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} = \frac{\mathbf{w}^T S_B\mathbf{w}}{\mathbf{w}^T S_W\mathbf{w}}$$
$$\arg\max_{\mathbf{w}}\ \frac{\mathbf{w}^T S_B\mathbf{w}}{\mathbf{w}^T S_W\mathbf{w}}$$
LDA
$J(\mathbf{w})$ is maximized when
$$\left(\mathbf{w}^T S_B\mathbf{w}\right) S_W\mathbf{w} = \left(\mathbf{w}^T S_W\mathbf{w}\right) S_B\mathbf{w}$$
We observe that $S_B\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)\left((\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}\right)$ is always in the direction of $(\mathbf{m}_1 - \mathbf{m}_2)$. Since only the direction of $\mathbf{w}$ matters, we can drop the scalar factors and obtain
$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
LDA
[Figure: projection onto the line joining the class means vs. the projection given by the LDA solution]
LDA
Solution of LDA:
$$\mathbf{w} = S_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
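A minimal NumPy sketch of this closed-form solution on synthetic two-class data (added illustration; only the direction of w matters):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class C1 samples
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))   # class C2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                   # class means
S1 = (X1 - m1).T @ (X1 - m1)                                # within-class scatter of C1
S2 = (X2 - m2).T @ (X2 - m2)                                # within-class scatter of C2
SW = S1 + S2

w = np.linalg.solve(SW, m1 - m2)                            # w = SW^{-1}(m1 - m2)
w /= np.linalg.norm(w)                                      # normalize (direction is what matters)

# The projected class means are well separated along w:
print(w @ m1, w @ m2)
```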