
INTRODUCTION TO

Machine Learning

ETHEM ALPAYDIN © The MIT Press, 2004

Edited for CS 536 Fall 2005 – Rutgers University
Ahmed Elgammal
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for

CHAPTER 6:

Dimensionality Reduction


Why Reduce Dimensionality?

1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions


Feature Selection vs Extraction

Feature selection: Choosing k < d important features, ignoring the remaining d – k.
Subset selection algorithms.

Feature extraction: Project the original x_i, i = 1,...,d dimensions to new k < d dimensions, z_j, j = 1,...,k.
Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA).


Subset Selection

There are 2^d subsets of d features.

Forward search: Add the best feature at each step (the code sketch below illustrates the loop).
The set of features F is initially Ø. At each iteration, find the best new feature
j = argmin_i E(F ∪ x_i), and add x_j to F if E(F ∪ x_j) < E(F).
Hill-climbing O(d^2) algorithm.

Backward search: Start with all features and remove one at a time, if possible.

Floating search: add k, remove l.
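As a companion to the forward-search description above, here is a minimal Python sketch. It assumes a caller-supplied error(subset) callable (a hypothetical validation-error function, not something defined in the slides):

```python
# Greedy forward selection. `error` is any callable returning the validation
# error E(F) for a set of feature indices F; `d` is the number of features.
def forward_search(d, error):
    F = set()                                 # selected features, initially empty
    current_error = error(F)
    while len(F) < d:
        candidates = [i for i in range(d) if i not in F]
        best_j = min(candidates, key=lambda i: error(F | {i}))  # j = argmin_i E(F ∪ x_i)
        best_error = error(F | {best_j})
        if best_error >= current_error:       # add x_j only if E(F ∪ x_j) < E(F)
            break
        F.add(best_j)
        current_error = best_error
    return F
```

In the worst case this evaluates E on O(d^2) subsets, which is the hill-climbing complexity quoted on the slide.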


Principal Components Analysis (PCA)

Find a low-dimensional space such that when x is projected there, information loss is minimized.
The projection of x on the direction of w is z = w^T x.
Find w such that Var(z) is maximized:

Var(z) = Var(w^T x) = E[(w^T x – w^T µ)^2]
       = E[(w^T x – w^T µ)(w^T x – w^T µ)]
       = E[w^T (x – µ)(x – µ)^T w]
       = w^T E[(x – µ)(x – µ)^T] w = w^T Σ w

where Var(x) = E[(x – µ)(x – µ)^T] = Σ


Maximize Var(z) subject to ||w|| = 1:

max_{w_1}  w_1^T Σ w_1 – α(w_1^T w_1 – 1)

Σ w_1 = α w_1, that is, w_1 is an eigenvector of Σ. Choose the one with the largest eigenvalue for Var(z) to be maximum.

Second principal component: maximize Var(z_2), subject to ||w_2|| = 1 and w_2 orthogonal to w_1:

max_{w_2}  w_2^T Σ w_2 – α(w_2^T w_2 – 1) – β(w_2^T w_1 – 0)

Σ w_2 = α w_2, that is, w_2 is another eigenvector of Σ, and so on.


What PCA does

z = W^T(x – m)

where the columns of W are the eigenvectors of Σ and m is the sample mean.
PCA centers the data at the origin and rotates the axes.
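A minimal NumPy sketch of this projection, assuming the usual sample-covariance estimate of Σ (function and variable names are illustrative, not from the book):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the sample covariance.
    X is an N x d data matrix; returns Z with rows z = W^T (x - m)."""
    m = X.mean(axis=0)                       # sample mean
    S = np.cov(X, rowvar=False)              # d x d covariance estimate of Σ
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in descending order
    W = eigvecs[:, order[:k]]                # top-k eigenvectors as columns of W
    Z = (X - m) @ W                          # center at the origin, then rotate
    return Z, W, m

# toy usage: project 5-dimensional points onto the first 2 principal components
X = np.random.randn(200, 5) @ np.random.randn(5, 5)
Z, W, m = pca(X, k=2)
```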


How to choose k?

Proportion of Variance (PoV) explained:

PoV = (λ_1 + λ_2 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_k + ... + λ_d)

when the λ_i are sorted in descending order. Typically, stop at PoV > 0.9.
A scree graph plots PoV vs. k; stop at the "elbow".
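A short sketch of this rule in NumPy (the 0.9 threshold follows the slide; the helper name is mine):

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest k whose proportion of variance explained reaches the threshold."""
    lam = np.sort(eigvals)[::-1]          # λ_i in descending order
    pov = np.cumsum(lam) / lam.sum()      # PoV for k = 1, ..., d
    k = int(np.argmax(pov >= threshold)) + 1
    return k, pov

# example: eigenvalues such as those returned by the pca() sketch above
k, pov = choose_k(np.array([4.0, 2.5, 1.0, 0.3, 0.2]))
print(k, pov)
```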


Principal Component Analysis (PCA)

Given a set of points {x_1, x_2, ..., x_N}, x_i ∈ R^d, we are looking for a linear projection: a linear combination of orthogonal basis vectors,

x ≈ A · c,   with c ∈ R^m, m << d

i.e., x ≈ c_1 a_1 + c_2 a_2 + c_3 a_3 + ... + c_m a_m, where the a_j are the columns of A.

What is the projection that minimizes the reconstruction error

E = Σ_i ||x_i – A c_i|| ?


Principal Component Analysis (PCA)

Given a set of points {x_1, x_2, ..., x_N}, x_i ∈ R^d:

1. Center the points: compute µ = (1/N) Σ_i x_i and form P = [x_1 – µ, x_2 – µ, ..., x_N – µ]
2. Compute the covariance matrix Q = P P^T
3. Compute the eigenvectors of Q: Q e_k = λ_k e_k

The eigenvectors are the orthogonal basis we are looking for.


Singular Value Decomposition

SVD: If A is a real m by n matrix, then there exist orthogonal matrices U (m × m) and V (n × n) such that

U^t A V = Σ = diag(σ_1, σ_2, ..., σ_p),   p = min{m, n}

U^t A V = Σ   ⇔   A = U Σ V^t,   i.e., A_{m×n} = U_{m×m} Σ_{m×n} V^t_{n×n}

Singular values: the non-negative square roots of the eigenvalues of A^t A, denoted σ_i, i = 1,...,n.
A^t A is symmetric ⇒ eigenvalues and singular values are real.
Singular values are arranged in decreasing order.

A^t A = (U Σ V^t)^t (U Σ V^t) = V Σ^t U^t U Σ V^t = V (Σ^t Σ) V^t = V Σ^2 V^t

so A^t A V = V Σ^2, i.e., the columns of V are eigenvectors of A^t A: A^t A v = λ v with λ = σ^2.
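A quick NumPy check of these identities (np.linalg.svd returns V^t directly; the matrix here is random, purely for illustration):

```python
import numpy as np

A = np.random.randn(6, 4)                        # a real m x n matrix (m=6, n=4)
U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U Σ V^t, s holds σ_1 ≥ σ_2 ≥ ...

Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
print(np.allclose(A, U @ Sigma @ Vt))            # True: A = U Σ V^t

# singular values are the square roots of the eigenvalues of A^t A
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eig_AtA))                # True: λ_i = σ_i^2
```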


SVD for PCA

SVD can be used to efficiently compute the image basis.
The columns of U are the eigenvectors (the basis):

P P^t = (U Σ V^t)(U Σ V^t)^t = U Σ V^t V Σ^t U^t = U (Σ Σ^t) U^t = U Σ^2 U^t

so P P^t U = U Σ^2, i.e., P P^t u = λ u.

Most important thing to notice: distance in the eigenspace is an approximation to the correlation in the original space,

||x_i – x_j|| ≈ ||c_i – c_j||
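A small sketch of PCA via SVD of the centered data matrix, including a check of the distance approximation above (data and names are illustrative):

```python
import numpy as np

X = np.random.randn(100, 10) @ np.random.randn(10, 10)  # 100 points in R^10
mu = X.mean(axis=0)
P = (X - mu).T                          # d x N matrix whose columns are x_i - µ
U, s, Vt = np.linalg.svd(P, full_matrices=False)

m = 3                                   # keep the first m basis vectors
A = U[:, :m]                            # leading eigenvectors of P P^t
C = A.T @ P                             # coefficients c_i of each centered point

# distance in the eigenspace approximates distance in the original space
i, j = 0, 1
print(np.linalg.norm(P[:, i] - P[:, j]), np.linalg.norm(C[:, i] - C[:, j]))
```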


PCA

Most important thing to notice: distance in the eigenspace is an approximation to the correlation in the original space,

||x_i – x_j|| ≈ ||c_i – c_j||

x ≈ A c,   c ≈ A^T x,   with x ∈ R^d, c ∈ R^m, m << d


Eigenface

Use PCA and subspace projection to perform face recognition: describe a face as a linear combination of face basis images.
Matthew Turk and Alex Pentland, "Eigenfaces for Recognition", 1991.


Face Recognition - Eigenface

MIT Media Lab – Face Recognition demo page: http://vismod.media.mit.edu/vismod/demos/facerec/


Factor Analysis

Find a small number of factors z which, when combined, generate x:

x_i – µ_i = v_i1 z_1 + v_i2 z_2 + ... + v_ik z_k + ε_i

where z_j, j = 1,...,k are the latent factors, with E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0 for i ≠ j;
ε_i are the noise sources, with Var(ε_i) = ψ_i, Cov(ε_i, ε_j) = 0 for i ≠ j, and Cov(ε_i, z_j) = 0;
and v_ij are the factor loadings.
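To make the generative direction concrete, here is an illustrative NumPy sample from this model; the dimensions and parameter values are made up, and it only demonstrates that Cov(x) = V V^T + diag(ψ) under the stated assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 2, 20000
V = rng.normal(size=(d, k))              # factor loadings v_ij
psi = rng.uniform(0.1, 0.5, size=d)      # noise variances ψ_i
mu = rng.normal(size=d)

Z = rng.normal(size=(N, k))              # latent factors: zero mean, unit variance, uncorrelated
eps = rng.normal(scale=np.sqrt(psi), size=(N, d))   # ε_i with variance ψ_i
X = mu + Z @ V.T + eps                   # x - µ = V z + ε

print(np.round(V @ V.T + np.diag(psi), 2))          # implied covariance of x
print(np.round(np.cov(X, rowvar=False), 2))         # sample covariance (close for large N)
```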


PCA vs FA

PCA goes from x to z:   z = W^T(x – µ)
FA goes from z to x:    x – µ = V z + ε


Factor Analysis

In FA, factors z_j are stretched, rotated and translated to generate x.


Multidimensional Scaling

Given pairwise distances between N points, d_ij, i, j = 1,...,N, place the points on a low-dimensional map such that the distances are preserved.

z = g(x | θ). Find θ that minimizes the Sammon stress:

E(θ | X) = Σ_{r,s} ( ||z^r – z^s|| – ||x^r – x^s|| )^2 / ||x^r – x^s||^2
         = Σ_{r,s} ( ||g(x^r | θ) – g(x^s | θ)|| – ||x^r – x^s|| )^2 / ||x^r – x^s||^2
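A direct, unoptimized translation of the stress above into Python; an actual MDS routine would minimize this over the parameters θ of g, and the points are assumed distinct so no denominator is zero:

```python
import numpy as np

def sammon_stress(X, Z):
    """Sammon stress between original points X (N x d) and mapped points Z (N x k)."""
    N = X.shape[0]
    E = 0.0
    for r in range(N):
        for s in range(N):
            if r == s:
                continue
            d_x = np.linalg.norm(X[r] - X[s])   # ||x^r - x^s||
            d_z = np.linalg.norm(Z[r] - Z[s])   # ||z^r - z^s||
            E += (d_z - d_x) ** 2 / d_x ** 2
    return E

# example: a random 2-D map of 10 points in R^5 (a real MDS would optimize Z)
X = np.random.randn(10, 5)
Z = np.random.randn(10, 2)
print(sammon_stress(X, Z))
```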


Map of Europe by MDS

Map from CIA – The World Factbook: http://www.cia.gov/


Linear Discriminant Analysis

Find a low-dimensional space such that when x is projected, classes are well-separated.
Find w that maximizes

J(w) = (m_1 – m_2)^2 / (s_1^2 + s_2^2)

where, with r^t = 1 if x^t belongs to class 1 and r^t = 0 otherwise,

m_1 = Σ_t w^T x^t r^t / Σ_t r^t,   s_1^2 = Σ_t (w^T x^t – m_1)^2 r^t

and m_2, s_2^2 are defined analogously using 1 – r^t.


Between-class scatter (m_1, m_2 are the class mean vectors, so the projected class means are w^T m_1 and w^T m_2):

(w^T m_1 – w^T m_2)^2 = w^T (m_1 – m_2)(m_1 – m_2)^T w
                      = w^T S_B w,   where S_B = (m_1 – m_2)(m_1 – m_2)^T

Within-class scatter:

s_1^2 = Σ_t (w^T x^t – w^T m_1)^2 r^t = Σ_t w^T (x^t – m_1)(x^t – m_1)^T w r^t = w^T S_1 w,
        where S_1 = Σ_t r^t (x^t – m_1)(x^t – m_1)^T

s_1^2 + s_2^2 = w^T S_W w,   where S_W = S_1 + S_2


Fisher's Linear Discriminant

Find w that maximizes

J(w) = (w^T S_B w) / (w^T S_W w) = |w^T (m_1 – m_2)|^2 / (w^T S_W w)

LDA solution:

w = c · S_W^{-1} (m_1 – m_2)

Parametric solution:

w = Σ^{-1} (µ_1 – µ_2)   when p(x | C_i) ~ N(µ_i, Σ)
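A compact NumPy version of the two-class LDA solution w ∝ S_W^{-1}(m_1 – m_2); the Gaussian toy data is only for illustration:

```python
import numpy as np

def fisher_lda(X1, X2):
    """Two-class Fisher LDA direction w = S_W^{-1} (m_1 - m_2), normalized to unit length."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)            # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)            # within-class scatter of class 2
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)        # S_W^{-1} (m_1 - m_2), up to the scale c
    return w / np.linalg.norm(w)

# toy example: two Gaussian classes in R^2
rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(100, 2))
w = fisher_lda(X1, X2)
print((X1 @ w).mean(), (X2 @ w).mean())     # projected class means are well separated
```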


K>2 Classes

Within-class scatter:

S_W = Σ_{i=1}^K S_i,   where S_i = Σ_t r_i^t (x^t – m_i)(x^t – m_i)^T

Between-class scatter:

S_B = Σ_{i=1}^K N_i (m_i – m)(m_i – m)^T,   where m = (1/K) Σ_{i=1}^K m_i

Find W that maximizes

J(W) = |W^T S_B W| / |W^T S_W W|

The solution is given by the largest eigenvectors of S_W^{-1} S_B; the maximum rank is K – 1.
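A sketch of the multi-class case in NumPy, following the scatter-matrix definitions above (for real use, a library routine such as scikit-learn's LDA would be preferable):

```python
import numpy as np

def lda_projection(X, y, k):
    """Project X onto the top-k eigenvectors of S_W^{-1} S_B (k <= K-1)."""
    classes = np.unique(y)
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    m = class_means.mean(axis=0)                     # m = (1/K) Σ_i m_i
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c, mc in zip(classes, class_means):
        Xc = X[y == c]
        Sw += (Xc - mc).T @ (Xc - mc)                # within-class scatter
        diff = (mc - m)[:, None]
        Sb += len(Xc) * (diff @ diff.T)              # between-class scatter, weight N_i
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real                   # rank(S_B) is at most K-1
    return X @ W

# example: 3 classes in R^4 projected to 2 dimensions
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in ([0]*4, [3]*4, [0, 3, 0, 3])])
y = np.repeat([0, 1, 2], 50)
Z = lda_projection(X, y, k=2)
```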


Separating Style and Content

Objective: Decomposing two factors using linear methods

Content: which character. Style: which font.

"Bilinear models": J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.

Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.


Bilinear Models

Symmetric bilinear model:

y^{sc} = Σ_{i,j} w_ij a_i^s b_j^c

Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.
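An illustrative NumPy evaluation of the symmetric model; in the Tenenbaum-Freeman paper the interaction terms w_ij are vector-valued (w_ijk), so each style/content pair yields a K-dimensional observation, and all sizes here are made up:

```python
import numpy as np

I, J, K = 4, 3, 8                   # style dim, content dim, observation dim (illustrative)
rng = np.random.default_rng(2)
W = rng.normal(size=(I, J, K))      # interaction terms w_ijk
a_s = rng.normal(size=I)            # style vector a^s (e.g., a font)
b_c = rng.normal(size=J)            # content vector b^c (e.g., a character)

# y^{sc} = Σ_{i,j} w_ij a_i^s b_j^c, summed over i and j for each output component k
y_sc = np.einsum('i,j,ijk->k', a_s, b_c, W)
print(y_sc.shape)                   # (8,)
```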


Bilinear Models

Asymmetric bilinear model: use style-dependent basis vectors,

y^{sc} = A^s b^c

Examples: head pose as the style factor with person as content, or person as the style factor with pose as content.

Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.


Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.