
INTRODUCTION TO

Machine Learning

ETHEM ALPAYDIN © The MIT Press, 2004

Edited for CS 536 Fall 2005 – Rutgers University
Ahmed Elgammal
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml

Lecture Slides for

CHAPTER 6:

Dimensionality Reduction


Why Reduce Dimensionality?

1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions


Feature Selection vs Extraction

Feature selection: Choosing k < d important features, ignoring the remaining d – k.
Subset selection algorithms.

Feature extraction: Project the original x_i, i = 1,...,d dimensions to new k < d dimensions, z_j, j = 1,...,k.
Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA).


Subset Selection

There are 2^d subsets of d features.

Forward search: Add the best feature at each step (the code sketch below illustrates the loop).
The set of features F is initially Ø. At each iteration, find the best new feature
j = argmin_i E(F ∪ x_i), and add x_j to F if E(F ∪ x_j) < E(F).
Hill-climbing O(d^2) algorithm.

Backward search: Start with all features and remove one at a time, if possible.

Floating search: add k, remove l.
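As a companion to the forward-search description above, here is a minimal Python sketch. It assumes a caller-supplied error(subset) callable (a hypothetical validation-error function, not something defined in the slides):

```python
# Greedy forward selection. `error` is any callable returning the validation
# error E(F) for a set of feature indices F; `d` is the number of features.
def forward_search(d, error):
    F = set()                                 # selected features, initially empty
    current_error = error(F)
    while len(F) < d:
        candidates = [i for i in range(d) if i not in F]
        best_j = min(candidates, key=lambda i: error(F | {i}))  # j = argmin_i E(F ∪ x_i)
        best_error = error(F | {best_j})
        if best_error >= current_error:       # add x_j only if E(F ∪ x_j) < E(F)
            break
        F.add(best_j)
        current_error = best_error
    return F
```

In the worst case this evaluates E on O(d^2) subsets, which is the hill-climbing complexity quoted on the slide.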


Principal Components Analysis (PCA)

Find a low-dimensional space such that when x is projected there, information loss is minimized.
The projection of x on the direction of w is z = w^T x.
Find w such that Var(z) is maximized:

Var(z) = Var(w^T x) = E[(w^T x – w^T µ)^2]
       = E[(w^T x – w^T µ)(w^T x – w^T µ)]
       = E[w^T (x – µ)(x – µ)^T w]
       = w^T E[(x – µ)(x – µ)^T] w = w^T Σ w

where Var(x) = E[(x – µ)(x – µ)^T] = Σ


Maximize Var(z) subject to ||w|| = 1:

max_{w_1}  w_1^T Σ w_1 – α(w_1^T w_1 – 1)

Σ w_1 = α w_1, that is, w_1 is an eigenvector of Σ. Choose the one with the largest eigenvalue for Var(z) to be maximum.

Second principal component: maximize Var(z_2), subject to ||w_2|| = 1 and w_2 orthogonal to w_1:

max_{w_2}  w_2^T Σ w_2 – α(w_2^T w_2 – 1) – β(w_2^T w_1 – 0)

Σ w_2 = α w_2, that is, w_2 is another eigenvector of Σ, and so on.


What PCA does

z = W^T(x – m)

where the columns of W are the eigenvectors of Σ and m is the sample mean.
PCA centers the data at the origin and rotates the axes.
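A minimal NumPy sketch of this projection, assuming the usual sample-covariance estimate of Σ (function and variable names are illustrative, not from the book):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the sample covariance.
    X is an N x d data matrix; returns Z with rows z = W^T (x - m)."""
    m = X.mean(axis=0)                       # sample mean
    S = np.cov(X, rowvar=False)              # d x d covariance estimate of Σ
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues in descending order
    W = eigvecs[:, order[:k]]                # top-k eigenvectors as columns of W
    Z = (X - m) @ W                          # center at the origin, then rotate
    return Z, W, m

# toy usage: project 5-dimensional points onto the first 2 principal components
X = np.random.randn(200, 5) @ np.random.randn(5, 5)
Z, W, m = pca(X, k=2)
```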


How to choose k?

Proportion of Variance (PoV) explained:

PoV = (λ_1 + λ_2 + ... + λ_k) / (λ_1 + λ_2 + ... + λ_k + ... + λ_d)

when the λ_i are sorted in descending order. Typically, stop at PoV > 0.9.
A scree graph plots PoV vs. k; stop at the "elbow".
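A short sketch of this rule in NumPy (the 0.9 threshold follows the slide; the helper name is mine):

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest k whose proportion of variance explained reaches the threshold."""
    lam = np.sort(eigvals)[::-1]          # λ_i in descending order
    pov = np.cumsum(lam) / lam.sum()      # PoV for k = 1, ..., d
    k = int(np.argmax(pov >= threshold)) + 1
    return k, pov

# example: eigenvalues such as those returned by the pca() sketch above
k, pov = choose_k(np.array([4.0, 2.5, 1.0, 0.3, 0.2]))
print(k, pov)
```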


Principal Component Analysis (PCA)

Given a set of points {x_1, x_2, ..., x_N}, x_i ∈ R^d, we are looking for a linear projection: a linear combination of orthogonal basis vectors,

x ≈ A · c,   with c ∈ R^m, m << d

i.e., x ≈ c_1 a_1 + c_2 a_2 + c_3 a_3 + ... + c_m a_m, where the a_j are the columns of A.

What is the projection that minimizes the reconstruction error

E = Σ_i ||x_i – A c_i|| ?


Principal Component Analysis (PCA)

Given a set of points {x_1, x_2, ..., x_N}, x_i ∈ R^d:

1. Center the points: compute µ = (1/N) Σ_i x_i and form P = [x_1 – µ, x_2 – µ, ..., x_N – µ]
2. Compute the covariance matrix Q = P P^T
3. Compute the eigenvectors of Q: Q e_k = λ_k e_k

The eigenvectors are the orthogonal basis we are looking for.


Singular Value Decomposition

SVD: If A is a real m by n matrix, then there exist orthogonal matrices U (m × m) and V (n × n) such that

U^t A V = Σ = diag(σ_1, σ_2, ..., σ_p),   p = min{m, n}

U^t A V = Σ   ⇔   A = U Σ V^t,   i.e., A_{m×n} = U_{m×m} Σ_{m×n} V^t_{n×n}

Singular values: the non-negative square roots of the eigenvalues of A^t A, denoted σ_i, i = 1,...,n.
A^t A is symmetric ⇒ eigenvalues and singular values are real.
Singular values are arranged in decreasing order.

A^t A = (U Σ V^t)^t (U Σ V^t) = V Σ^t U^t U Σ V^t = V (Σ^t Σ) V^t = V Σ^2 V^t

so A^t A V = V Σ^2, i.e., the columns of V are eigenvectors of A^t A: A^t A v = λ v with λ = σ^2.
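A quick NumPy check of these identities (np.linalg.svd returns V^t directly; the matrix here is random, purely for illustration):

```python
import numpy as np

A = np.random.randn(6, 4)                        # a real m x n matrix (m=6, n=4)
U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U Σ V^t, s holds σ_1 ≥ σ_2 ≥ ...

Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
print(np.allclose(A, U @ Sigma @ Vt))            # True: A = U Σ V^t

# singular values are the square roots of the eigenvalues of A^t A
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s**2, eig_AtA))                # True: λ_i = σ_i^2
```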


SVD for PCA

SVD can be used to efficiently compute the image basis.
The columns of U are the eigenvectors (the basis):

P P^t = (U Σ V^t)(U Σ V^t)^t = U Σ V^t V Σ^t U^t = U (Σ Σ^t) U^t = U Σ^2 U^t

so P P^t U = U Σ^2, i.e., P P^t u = λ u.

Most important thing to notice: distance in the eigenspace is an approximation to the correlation in the original space,

||x_i – x_j|| ≈ ||c_i – c_j||
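A small sketch of PCA via SVD of the centered data matrix, including a check of the distance approximation above (data and names are illustrative):

```python
import numpy as np

X = np.random.randn(100, 10) @ np.random.randn(10, 10)  # 100 points in R^10
mu = X.mean(axis=0)
P = (X - mu).T                          # d x N matrix whose columns are x_i - µ
U, s, Vt = np.linalg.svd(P, full_matrices=False)

m = 3                                   # keep the first m basis vectors
A = U[:, :m]                            # leading eigenvectors of P P^t
C = A.T @ P                             # coefficients c_i of each centered point

# distance in the eigenspace approximates distance in the original space
i, j = 0, 1
print(np.linalg.norm(P[:, i] - P[:, j]), np.linalg.norm(C[:, i] - C[:, j]))
```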


PCA

Most important thing to notice: distance in the eigenspace is an approximation to the correlation in the original space,

||x_i – x_j|| ≈ ||c_i – c_j||

x ≈ A c,   c ≈ A^T x,   with x ∈ R^d, c ∈ R^m, m << d


Eigenface

Use PCA and subspace projection to perform face recognition: describe a face as a linear combination of face basis images.
Matthew Turk and Alex Pentland, "Eigenfaces for Recognition", 1991.


Face Recognition - Eigenface

MIT Media Lab – Face Recognition demo page: http://vismod.media.mit.edu/vismod/demos/facerec/


Factor Analysis

Find a small number of factors z which, when combined, generate x:

x_i – µ_i = v_i1 z_1 + v_i2 z_2 + ... + v_ik z_k + ε_i

where z_j, j = 1,...,k are the latent factors, with E[z_j] = 0, Var(z_j) = 1, Cov(z_i, z_j) = 0 for i ≠ j;
ε_i are the noise sources, with Var(ε_i) = ψ_i, Cov(ε_i, ε_j) = 0 for i ≠ j, and Cov(ε_i, z_j) = 0;
and v_ij are the factor loadings.
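To make the generative direction concrete, here is an illustrative NumPy sample from this model; the dimensions and parameter values are made up, and it only demonstrates that Cov(x) = V V^T + diag(ψ) under the stated assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 2, 20000
V = rng.normal(size=(d, k))              # factor loadings v_ij
psi = rng.uniform(0.1, 0.5, size=d)      # noise variances ψ_i
mu = rng.normal(size=d)

Z = rng.normal(size=(N, k))              # latent factors: zero mean, unit variance, uncorrelated
eps = rng.normal(scale=np.sqrt(psi), size=(N, d))   # ε_i with variance ψ_i
X = mu + Z @ V.T + eps                   # x - µ = V z + ε

print(np.round(V @ V.T + np.diag(psi), 2))          # implied covariance of x
print(np.round(np.cov(X, rowvar=False), 2))         # sample covariance (close for large N)
```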


PCA vs FA

PCA goes from x to z:   z = W^T(x – µ)
FA goes from z to x:    x – µ = V z + ε


Factor Analysis

In FA, factors z_j are stretched, rotated and translated to generate x.


Multidimensional Scaling

Given pairwise distances between N points, d_ij, i, j = 1,...,N, place the points on a low-dimensional map such that the distances are preserved.

z = g(x | θ). Find θ that minimizes the Sammon stress:

E(θ | X) = Σ_{r,s} ( ||z^r – z^s|| – ||x^r – x^s|| )^2 / ||x^r – x^s||^2
         = Σ_{r,s} ( ||g(x^r | θ) – g(x^s | θ)|| – ||x^r – x^s|| )^2 / ||x^r – x^s||^2
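A direct, unoptimized translation of the stress above into Python; an actual MDS routine would minimize this over the parameters θ of g, and the points are assumed distinct so no denominator is zero:

```python
import numpy as np

def sammon_stress(X, Z):
    """Sammon stress between original points X (N x d) and mapped points Z (N x k)."""
    N = X.shape[0]
    E = 0.0
    for r in range(N):
        for s in range(N):
            if r == s:
                continue
            d_x = np.linalg.norm(X[r] - X[s])   # ||x^r - x^s||
            d_z = np.linalg.norm(Z[r] - Z[s])   # ||z^r - z^s||
            E += (d_z - d_x) ** 2 / d_x ** 2
    return E

# example: a random 2-D map of 10 points in R^5 (a real MDS would optimize Z)
X = np.random.randn(10, 5)
Z = np.random.randn(10, 2)
print(sammon_stress(X, Z))
```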


Map of Europe by MDS

Map from CIA – The World Factbook: http://www.cia.gov/


Linear Discriminant Analysis

Find a low-dimensional space such that when x is projected, classes are well-separated.
Find w that maximizes

J(w) = (m_1 – m_2)^2 / (s_1^2 + s_2^2)

where, with r^t = 1 if x^t belongs to class 1 and r^t = 0 otherwise,

m_1 = Σ_t w^T x^t r^t / Σ_t r^t,   s_1^2 = Σ_t (w^T x^t – m_1)^2 r^t

and m_2, s_2^2 are defined analogously using 1 – r^t.


Between-class scatter (m_1, m_2 are the class mean vectors, so the projected class means are w^T m_1 and w^T m_2):

(w^T m_1 – w^T m_2)^2 = w^T (m_1 – m_2)(m_1 – m_2)^T w
                      = w^T S_B w,   where S_B = (m_1 – m_2)(m_1 – m_2)^T

Within-class scatter:

s_1^2 = Σ_t (w^T x^t – w^T m_1)^2 r^t = Σ_t w^T (x^t – m_1)(x^t – m_1)^T w r^t = w^T S_1 w,
        where S_1 = Σ_t r^t (x^t – m_1)(x^t – m_1)^T

s_1^2 + s_2^2 = w^T S_W w,   where S_W = S_1 + S_2


Fisher's Linear Discriminant

Find w that maximizes

J(w) = (w^T S_B w) / (w^T S_W w) = |w^T (m_1 – m_2)|^2 / (w^T S_W w)

LDA solution:

w = c · S_W^{-1} (m_1 – m_2)

Parametric solution:

w = Σ^{-1} (µ_1 – µ_2)   when p(x | C_i) ~ N(µ_i, Σ)
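A compact NumPy version of the two-class LDA solution w ∝ S_W^{-1}(m_1 – m_2); the Gaussian toy data is only for illustration:

```python
import numpy as np

def fisher_lda(X1, X2):
    """Two-class Fisher LDA direction w = S_W^{-1} (m_1 - m_2), normalized to unit length."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)            # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)            # within-class scatter of class 2
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)        # S_W^{-1} (m_1 - m_2), up to the scale c
    return w / np.linalg.norm(w)

# toy example: two Gaussian classes in R^2
rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X2 = rng.normal([3.0, 1.0], 1.0, size=(100, 2))
w = fisher_lda(X1, X2)
print((X1 @ w).mean(), (X2 @ w).mean())     # projected class means are well separated
```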


K>2 Classes

Within-class scatter:

S_W = Σ_{i=1}^K S_i,   where S_i = Σ_t r_i^t (x^t – m_i)(x^t – m_i)^T

Between-class scatter:

S_B = Σ_{i=1}^K N_i (m_i – m)(m_i – m)^T,   where m = (1/K) Σ_{i=1}^K m_i

Find W that maximizes

J(W) = |W^T S_B W| / |W^T S_W W|

The solution is given by the largest eigenvectors of S_W^{-1} S_B; the maximum rank is K – 1.
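A sketch of the multi-class case in NumPy, following the scatter-matrix definitions above (for real use, a library routine such as scikit-learn's LDA would be preferable):

```python
import numpy as np

def lda_projection(X, y, k):
    """Project X onto the top-k eigenvectors of S_W^{-1} S_B (k <= K-1)."""
    classes = np.unique(y)
    class_means = np.array([X[y == c].mean(axis=0) for c in classes])
    m = class_means.mean(axis=0)                     # m = (1/K) Σ_i m_i
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c, mc in zip(classes, class_means):
        Xc = X[y == c]
        Sw += (Xc - mc).T @ (Xc - mc)                # within-class scatter
        diff = (mc - m)[:, None]
        Sb += len(Xc) * (diff @ diff.T)              # between-class scatter, weight N_i
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real                   # rank(S_B) is at most K-1
    return X @ W

# example: 3 classes in R^4 projected to 2 dimensions
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in ([0]*4, [3]*4, [0, 3, 0, 3])])
y = np.repeat([0, 1, 2], 50)
Z = lda_projection(X, y, k=2)
```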


Separating Style and Content

Objective: Decomposing two factors using linear methods

Content: which character. Style: which font.

"Bilinear models": J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.

Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.


Bilinear Models

Symmetric bilinear model:

y^{sc} = Σ_{i,j} w_ij a_i^s b_j^c

Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.
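An illustrative NumPy evaluation of the symmetric model; in the Tenenbaum-Freeman paper the interaction terms w_ij are vector-valued (w_ijk), so each style/content pair yields a K-dimensional observation, and all sizes here are made up:

```python
import numpy as np

I, J, K = 4, 3, 8                   # style dim, content dim, observation dim (illustrative)
rng = np.random.default_rng(2)
W = rng.normal(size=(I, J, K))      # interaction terms w_ijk
a_s = rng.normal(size=I)            # style vector a^s (e.g., a font)
b_c = rng.normal(size=J)            # content vector b^c (e.g., a character)

# y^{sc} = Σ_{i,j} w_ij a_i^s b_j^c, summed over i and j for each output component k
y_sc = np.einsum('i,j,ijk->k', a_s, b_c, W)
print(y_sc.shape)                   # (8,)
```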


Bilinear Models

Asymmetric bilinear model: use style-dependent basis vectors,

y^{sc} = A^s b^c

Examples: head pose as the style factor with person as content, or person as the style factor with pose as content.

Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.


Figures from J. Tenenbaum and W. Freeman, "Separating Style and Content with Bilinear Models", Neural Computation, 2000.