Lecture Slides for Introduction to Machine Learning 2e: Dimensionality Reduction
ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Why Reduce Dimensionality?
Reduces time complexity: less computation
Reduces space complexity: fewer parameters
Saves the cost of observing the feature
Simpler models are more robust on small datasets
More interpretable; simpler explanation
Data visualization (structure, groups, outliers, etc.) when plotted in 2 or 3 dimensions
Feature Selection vs Extraction
Feature selection: Choose k < d important features, ignoring the remaining d − k.
Subset selection algorithms
Feature extraction: Project the original xi, i = 1,...,d dimensions to new k < d dimensions, zj, j = 1,...,k.
Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)
Subset Selection
There are 2^d subsets of d features.
Forward search: Add the best feature at each step.
Set of features F initially Ø. At each iteration, find the best new feature
j = argmin_i E(F ∪ xi)
Add xj to F if E(F ∪ xj) < E(F).
Hill-climbing O(d²) algorithm.
Backward search: Start with all features and remove one at a time, if possible.
Floating search (add k, remove l)
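A minimal sketch of forward search in numpy, assuming a user-supplied evaluate(X_subset, y) callback (a hypothetical name, not from the slides) that trains a model on the given feature subset and returns its validation error E:

import numpy as np

def forward_selection(X, y, evaluate):
    # Greedy forward search: F starts empty; at each step add the feature
    # x_j with j = argmin_i E(F u x_i), but only if it lowers E(F).
    # `evaluate` is an assumed callback returning a validation error.
    d = X.shape[1]
    F = []                      # indices of selected features
    best_err = np.inf           # E(F) for the empty set
    while len(F) < d:
        candidates = [i for i in range(d) if i not in F]
        errors = [evaluate(X[:, F + [i]], y) for i in candidates]
        best = int(np.argmin(errors))
        if errors[best] < best_err:     # add x_j only if E(F u x_j) < E(F)
            F.append(candidates[best])
            best_err = errors[best]
        else:
            break                       # no single feature improves E; stop
    return F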
Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected there, information loss is minimized.
The projection of x on the direction of w is: z = wTx
Find w such that Var(z) is maximized
Var(z) = Var(wTx) = E[(wTx – wTμ)²]
       = E[(wTx – wTμ)(wTx – wTμ)]
       = E[wT(x – μ)(x – μ)Tw]
       = wT E[(x – μ)(x – μ)T] w = wT Σ w
where Var(x) = E[(x – μ)(x – μ)T] = Σ
Maximize Var(z) subject to ||w|| = 1:
max_w1  w1T Σ w1 − α(w1T w1 − 1)
Setting the derivative to zero gives Σ w1 = α w1, that is, w1 is an eigenvector of Σ.
Choose the eigenvector with the largest eigenvalue for Var(z) to be maximum.
Second principal component: max Var(z2), s.t. ||w2|| = 1 and orthogonal to w1:
max_w2  w2T Σ w2 − α(w2T w2 − 1) − β(w2T w1 − 0)
Σ w2 = α w2, that is, w2 is another eigenvector of Σ, and so on.
What PCA does
z = WT(x – m)
where the columns of W are the eigenvectors of Σ, and m is the sample mean.
Centers the data at the origin and rotates the axes.
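A minimal numpy sketch of the above, assuming the rows of X are the samples: center the data, eigendecompose the sample covariance Σ, and project onto the top k eigenvectors.

import numpy as np

def pca(X, k):
    m = X.mean(axis=0)                    # sample mean m
    Xc = X - m                            # center the data at the origin
    S = np.cov(Xc, rowvar=False)          # estimate of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in descending order
    W = eigvecs[:, order[:k]]             # columns of W = top-k eigenvectors
    Z = Xc @ W                            # z = W^T (x - m) for every sample
    return Z, W, eigvals[order]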
How to choose k?
Proportion of Variance (PoV) explained:
PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λk + ... + λd)
when the λi are sorted in descending order.
Typically, stop at PoV > 0.9.
Scree graph plots PoV vs k; stop at the "elbow".
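Continuing the pca sketch above (the variable names are from that sketch, not the slides), the sorted eigenvalues give PoV directly, and one can pick the smallest k with PoV > 0.9:

import numpy as np

# eigvals: all d eigenvalues of the covariance matrix, sorted descending
_, _, eigvals = pca(X, k=X.shape[1])
pov = np.cumsum(eigvals) / np.sum(eigvals)   # PoV for k = 1, ..., d
k = int(np.argmax(pov > 0.9)) + 1            # smallest k with PoV > 0.9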
Factor Analysis
Find a small number of factors z which, when combined, generate x:
xi – µi = vi1 z1 + vi2 z2 + ... + vik zk + εi
where zj, j = 1,...,k are the latent factors with
E[zj] = 0, Var(zj) = 1, Cov(zi, zj) = 0, i ≠ j,
εi are the noise sources with
Var(εi) = ψi, Cov(εi, εj) = 0, i ≠ j, Cov(εi, zj) = 0,
and vij are the factor loadings.
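A small generative sketch of this model; the loadings V, the noise variances ψ, and the data sizes below are made up for illustration. It draws factors and noise as specified and checks that the sample covariance of x approaches the implied structure V Vᵀ + diag(ψ):

import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 2, 100000                        # illustrative sizes (made up)
V = rng.normal(size=(d, k))                   # factor loadings v_ij
mu = rng.normal(size=d)                       # means mu_i
psi = rng.uniform(0.1, 0.5, size=d)           # noise variances psi_i

z = rng.normal(size=(N, k))                   # E[z_j]=0, Var(z_j)=1, uncorrelated
eps = rng.normal(size=(N, d)) * np.sqrt(psi)  # Var(eps_i)=psi_i, independent
X = mu + z @ V.T + eps                        # x_i - mu_i = sum_j v_ij z_j + eps_i

# the model implies Cov(x) = V V^T + diag(psi); the gap shrinks as N grows
print(np.abs(np.cov(X, rowvar=False) - (V @ V.T + np.diag(psi))).max())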
PCA vs FA
PCA, from x to z: z = WT(x – µ)
FA, from z to x: x – µ = Vz + ε
Factor Analysis
In FA, factors zj are stretched, rotated, and translated to generate x.
Multidimensional Scaling
Given pairwise distances between N points, dij, i, j = 1,...,N, place them on a low-dimensional map such that the distances are preserved.
z = g(x | θ). Find θ that minimizes the Sammon stress:
E(θ | X) = Σr,s ( ||zr − zs|| − ||xr − xs|| )² / ||xr − xs||²
         = Σr,s ( ||g(xr | θ) − g(xs | θ)|| − ||xr − xs|| )² / ||xr − xs||²
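A direct sketch of the stress above, given inputs X (N×d) and candidate map coordinates Z (N×k); an actual MDS procedure would minimize this over θ (e.g. by gradient descent), which is omitted here:

import numpy as np

def sammon_stress(Z, X):
    # sum over pairs (r, s) of (||zr - zs|| - ||xr - xs||)^2 / ||xr - xs||^2
    N = len(X)
    E = 0.0
    for r in range(N):
        for s in range(r + 1, N):
            dx = np.linalg.norm(X[r] - X[s])   # original distance
            dz = np.linalg.norm(Z[r] - Z[s])   # distance on the map
            if dx > 0:
                E += (dz - dx) ** 2 / dx ** 2
    return E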
Map of Europe by MDS
Map from CIA – The World Factbook: http://www.cia.gov/
Linear Discriminant Analysis
Find a low-dimensional space such that when x is projected, classes are well-separated.
Find w that maximizes
J(w) = (m1 − m2)² / (s1² + s2²)
where, with rt = 1 if xt ∈ C1 and 0 otherwise,
m1 = Σt wT xt rt / Σt rt,   s1² = Σt (wT xt − m1)² rt
and m2, s2² are defined similarly for C2.
Between-class scatter:
(m1 − m2)² = (wT m1 − wT m2)²
           = wT (m1 − m2)(m1 − m2)T w
           = wT SB w,   where SB = (m1 − m2)(m1 − m2)T
Within-class scatter:
s1² = Σt (wT xt − m1)² rt
    = Σt wT (xt − m1)(xt − m1)T w rt = wT S1 w,   where S1 = Σt rt (xt − m1)(xt − m1)T
s1² + s2² = wT SW w,   where SW = S1 + S2
Fisher’s Linear Discriminant
Find w that maximizes
J(w) = wT SB w / wT SW w = |wT (m1 − m2)|² / wT SW w
LDA solution:
w = c · SW^-1 (m1 − m2)
Parametric solution, when p(x | Ci) ~ N(μi, Σ):
w = Σ^-1 (μ1 − μ2)
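A minimal sketch of the two-class LDA solution, assuming X1 and X2 hold the samples of C1 and C2 as rows:

import numpy as np

def fisher_direction(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)   # class means
    S1 = (X1 - m1).T @ (X1 - m1)                # scatter of C1
    S2 = (X2 - m2).T @ (X2 - m2)                # scatter of C2
    SW = S1 + S2                                # within-class scatter
    w = np.linalg.solve(SW, m1 - m2)            # w = c * SW^-1 (m1 - m2), c absorbed
    return w / np.linalg.norm(w)                # normalize; only the direction matters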
K > 2 Classes
Within-class scatter:
SW = Σi=1..K Si,   where Si = Σt rit (xt − mi)(xt − mi)T
Between-class scatter:
SB = Σi=1..K Ni (mi − m)(mi − m)T,   where m = (1/K) Σi=1..K mi
Find W that maximizes
J(W) = |WT SB W| / |WT SW W|
The solution is given by the largest eigenvectors of SW^-1 SB; SB has a maximum rank of K − 1.
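A sketch for K classes, assuming integer labels y in {0,...,K−1}; it builds SW and SB as above and keeps the leading eigenvectors of SW^-1 SB (at most K−1 of them are useful):

import numpy as np

def lda(X, y, k):
    classes = np.unique(y)
    d = X.shape[1]
    means = [X[y == c].mean(axis=0) for c in classes]   # class means m_i
    m = np.mean(means, axis=0)                          # m = (1/K) sum_i m_i
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c, mi in zip(classes, means):
        Xi = X[y == c]
        SW += (Xi - mi).T @ (Xi - mi)                   # within-class scatter
        SB += len(Xi) * np.outer(mi - m, mi - m)        # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))
    order = np.argsort(eigvals.real)[::-1]
    W = np.real(eigvecs[:, order[:k]])                  # k <= K-1 directions
    return X @ W, W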
Isomap
Geodesic distance is the distance along the manifold that the data lies on, as opposed to the Euclidean distance in the input space.
Isomap
Instances r and s are connected in the graph if ||xr − xs|| < ε or if xs is one of the k nearest neighbors of xr.
The edge length is ||xr − xs||.
For two nodes r and s not connected directly, the distance is the length of the shortest path between them.
Once the N×N distance matrix is thus formed, use MDS to find a lower-dimensional mapping.
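A compact sketch of this procedure using the k-nearest-neighbor variant of the graph; scipy's shortest_path supplies the geodesic distances, and a classical-MDS step (double centering plus eigendecomposition) stands in for the Sammon-stress minimization for simplicity:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, k=2):
    D = squareform(pdist(X))                       # Euclidean distances ||xr - xs||
    N = len(D)
    G = np.full((N, N), np.inf)                    # inf marks "not connected"
    for r in range(N):
        nn = np.argsort(D[r])[1:n_neighbors + 1]   # k nearest neighbors of xr
        G[r, nn] = D[r, nn]                        # edge length is ||xr - xs||
        G[nn, r] = D[r, nn]                        # keep the graph symmetric
    geo = shortest_path(G)                         # geodesic = shortest-path length
    # assumes the neighborhood graph is connected (no inf left in geo)
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ (geo ** 2) @ J                  # double centering
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))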
[Figure: Optdigits after Isomap (with neighborhood graph), digits plotted in two dimensions.]
Matlab source from http://web.mit.edu/cocosci/isomap/isomap.html
Locally Linear Embedding
1. Given xr, find its neighbors xs(r).
2. Find the reconstruction weights Wrs that minimize
   E(W | X) = Σr || xr − Σs Wrs xs(r) ||²
3. Find the new coordinates zr that minimize
   E(z | W) = Σr || zr − Σs Wrs zs(r) ||²
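A minimal sketch of these three steps; the small regularization of the local Gram matrix is a common numerical fix rather than part of the slides, and the final eigenvector step is the standard solution of the constrained minimization in step 3:

import numpy as np

def lle(X, n_neighbors=10, k=2, reg=1e-3):
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    W = np.zeros((N, N))
    for r in range(N):
        nn = np.argsort(D[r])[1:n_neighbors + 1]     # step 1: neighbors x_s(r)
        G = (X[nn] - X[r]) @ (X[nn] - X[r]).T        # local Gram matrix
        G += reg * np.trace(G) * np.eye(len(nn))     # regularize (numerical fix)
        w = np.linalg.solve(G, np.ones(len(nn)))     # step 2: reconstruction weights
        W[r, nn] = w / w.sum()                       # weights constrained to sum to 1
    # step 3: new coordinates z_r from the bottom eigenvectors of (I - W)^T (I - W)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:k + 1]                       # skip the constant eigenvector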
LLE on Optdigits
[Figure: Optdigits after LLE, digits plotted in two dimensions.]
Matlab source from http://www.cs.toronto.edu/~roweis/lle/code.html