INTRODUCTION TO MACHINE LEARNING, 3RD EDITION
ETHEM ALPAYDIN © The MIT Press, 2014
Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
Lecture Slides for
CHAPTER 5: MULTIVARIATE METHODS
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples

$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & \ddots & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$
Multivariate Parameters
Mean: $E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$

Covariance: $\sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j)$

Correlation: $\mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

Covariance Matrix:

$$\boldsymbol{\Sigma} \equiv \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T\right] = E[\mathbf{X}\mathbf{X}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$
Parameter Estimation from data sample X
Sample mean $\mathbf{m}$: $\quad m_i = \dfrac{1}{N}\sum_{t=1}^{N} x_i^t, \quad i = 1, \ldots, d$

Covariance matrix $\mathbf{S}$: $\quad s_{ij} = \dfrac{1}{N}\sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)$

Correlation matrix $\mathbf{R}$: $\quad r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
Estimation of Missing Values
What to do if certain instances have missing attribute values?
Ignore those instances: not a good idea if the sample is small
Use 'missing' as an attribute: may give information
Imputation: fill in the missing value (a brief sketch follows this list)
  Mean imputation: use the most likely value (e.g., the mean)
  Imputation by regression: predict the missing value from the other attributes
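A minimal sketch of mean imputation with NumPy, assuming the data are held in an N×d array X with NaN marking missing entries (the array name and the NaN convention are assumptions for illustration); regression imputation would instead fit a regressor on the complete rows and predict the missing column:

    import numpy as np

    def mean_impute(X):
        """Replace each missing entry (NaN) with the mean of its attribute,
        computed over the observed values only."""
        X = X.copy()
        col_means = np.nanmean(X, axis=0)        # per-attribute means, ignoring NaNs
        rows, cols = np.where(np.isnan(X))       # positions of missing entries
        X[rows, cols] = col_means[cols]          # fill each with its column mean
        return X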
Multivariate Normal Distribution
$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$$
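To make the formula concrete, a direct NumPy evaluation of this density (an illustrative sketch; for large d or a near-singular Σ a Cholesky-based computation would be preferable):

    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        """Density of N_d(mu, Sigma) at x, following the formula above."""
        d = len(mu)
        diff = x - mu
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)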
Multivariate Normal Distribution
Mahalanobis distance: (x – μ)T ∑–1 (x – μ) measures the distance from x to μ in terms of ∑ (normalizes for difference in variances and correlations)
Bivariate case (d = 2):

$$\boldsymbol{\Sigma} = \begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$

where $z_i = (x_i - \mu_i)/\sigma_i$ (z-normalization).
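A tiny sketch of the Mahalanobis distance itself (function and argument names are illustrative):

    import numpy as np

    def mahalanobis(x, mu, Sigma):
        """sqrt((x - mu)^T Sigma^{-1} (x - mu)): distance from x to mu in terms of Sigma."""
        diff = x - mu
        return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))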
Bivariate Normal: isoprobability contour plots [i.e., (x – μ)T ∑–1 (x – μ) = c2]
When the covariance is 0, the ellipsoid axes are parallel to the coordinate axes.
Independent Inputs: Naive Bayes

If the $x_i$ are independent, the off-diagonal entries of $\boldsymbol{\Sigma}$ are 0, and the Mahalanobis distance reduces to a weighted (by $1/\sigma_i$) Euclidean distance:

$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i} \exp\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2\right]$$

If the variances are also equal, it reduces to the Euclidean distance.

Note: the use of the term "Naïve Bayes" in this chapter is somewhat imprecise; Naïve Bayes assumes independence in the probability sense, not in the linear algebra sense.
Parametric Classification
If $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right]$$

Discriminant functions:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \log P(C_i)$$
Estimation of Parameters from data sample X
Given a labeled sample, let $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise. Then

$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad \mathbf{m}_i = \frac{\sum_t r_i^t\,\mathbf{x}^t}{\sum_t r_i^t}, \qquad \mathbf{S}_i = \frac{\sum_t r_i^t\,(\mathbf{x}^t-\mathbf{m}_i)(\mathbf{x}^t-\mathbf{m}_i)^T}{\sum_t r_i^t}$$

Substituting these estimates into the discriminant (and dropping the constant term):

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}_i^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$
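A compact NumPy sketch of these estimators, assuming X is an N×d array and y an array of class labels (the function name and the dictionary layout are illustrative assumptions):

    import numpy as np

    def estimate_class_params(X, y):
        """Per-class priors P(C_i), means m_i, and covariance matrices S_i."""
        priors, means, covs = {}, {}, {}
        for c in np.unique(y):
            Xc = X[y == c]
            priors[c] = len(Xc) / len(X)            # hat P(C_i)
            means[c] = Xc.mean(axis=0)              # m_i
            diff = Xc - means[c]
            covs[c] = diff.T @ diff / len(Xc)       # S_i (divide by N_i, as on the slide)
        return priors, means, covs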
Assuming a different Si for each Ci
Quadratic discriminant: expanding the formula on the previous slide,

$$g_i(\mathbf{x}) = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$

which has the form of a quadratic function of $\mathbf{x}$.
See figure on next slide.
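A short sketch of evaluating this quadratic discriminant for one class, reusing the illustrative estimates from estimate_class_params above; a point is assigned to the class with the largest g_i(x):

    import numpy as np

    def quadratic_discriminant(x, m_i, S_i, prior_i):
        """g_i(x) = x^T W_i x + w_i^T x + w_i0, with W_i, w_i, w_i0 as defined above."""
        S_inv = np.linalg.inv(S_i)
        W_i = -0.5 * S_inv
        w_i = S_inv @ m_i
        w_i0 = (-0.5 * m_i @ S_inv @ m_i
                - 0.5 * np.log(np.linalg.det(S_i))
                + np.log(prior_i))
        return x @ W_i @ x + w_i @ x + w_i0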
[Figure: likelihoods, posterior for C1, and the discriminant P(C1 | x) = 0.5]
Assuming Common Covariance Matrix S
If the classes share a common covariance matrix, $\mathbf{S}_i = \mathbf{S}$, the shared sample covariance is the prior-weighted average

$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$

The discriminant reduces to

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T\mathbf{S}^{-1}(\mathbf{x}-\mathbf{m}_i) + \log\hat{P}(C_i)$$

which is a linear discriminant:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}, \qquad \text{where} \quad \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \quad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}^{-1}\mathbf{m}_i + \log\hat{P}(C_i)$$
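The shared-covariance case only needs the pooled S and the per-class linear coefficients; a brief sketch, again reusing the illustrative dictionaries from estimate_class_params:

    import numpy as np

    def linear_discriminant_params(priors, means, covs):
        """Pooled S = sum_i P(C_i) S_i and per-class (w_i, w_i0); g_i(x) = w_i @ x + w_i0."""
        S = sum(priors[c] * covs[c] for c in means)
        S_inv = np.linalg.inv(S)
        return {c: (S_inv @ means[c],
                    -0.5 * means[c] @ S_inv @ means[c] + np.log(priors[c]))
                for c in means}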
Common Covariance Matrix S
Arbitrary covariances but shared by classes
Assuming Common Covariance Matrix S is Diagonal
When the $x_j$, $j = 1, \ldots, d$, are independent, $\boldsymbol{\Sigma}$ is diagonal and

$$p(\mathbf{x} \mid C_i) = \prod_j p(x_j \mid C_i) \quad \text{(Naive Bayes' assumption)}$$

The discriminant becomes

$$g_i(\mathbf{x}^t) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log\hat{P}(C_i)$$

Classify based on the weighted Euclidean distance (in $s_j$ units) to the nearest mean.
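A one-function sketch of this weighted-distance discriminant (s is assumed to hold the shared per-attribute standard deviations; names are illustrative):

    import numpy as np

    def diagonal_discriminant(x, m_i, s, prior_i):
        """g_i(x) = -0.5 * sum_j ((x_j - m_ij) / s_j)^2 + log P(C_i)."""
        return -0.5 * np.sum(((x - m_i) / s) ** 2) + np.log(prior_i)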
Assuming Common Covariance Matrix S is Diagonal
Variances may be different. Covariances are 0, so the ellipsoid axes are parallel to the coordinate axes.
Assuming Common Covariance Matrix S is Diagonal and variances are equal
$$g_i(\mathbf{x}^t) = -\frac{\lVert\mathbf{x}^t - \mathbf{m}_i\rVert^2}{2s^2} + \log\hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}\left(x_j^t - m_{ij}\right)^2 + \log\hat{P}(C_i)$$

Nearest mean classifier: classify based on the Euclidean distance to the nearest mean.
Each mean can be considered a prototype or template, and this is template matching.
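A minimal sketch of the nearest-mean rule, dropping the log-prior term (i.e., assuming equal priors; names are illustrative):

    import numpy as np

    def nearest_mean_predict(X, means):
        """Assign each row of X to the class whose mean is closest in Euclidean distance."""
        classes = list(means)
        M = np.stack([means[c] for c in classes])              # one mean per row
        dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
        return np.array([classes[k] for k in dists.argmin(axis=1)])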
Assuming Common Covariance Matrix S is Diagonal and variances are equal
Covariances are 0, so the ellipsoid axes are parallel to the coordinate axes. Variances are the same, so the ellipsoids become circles.
The classifier looks for the nearest mean.
Model Selection
Assumption                    Covariance matrix         No. of parameters
Shared, Hyperspheric          Si = S = s²I              1
Shared, Axis-aligned          Si = S, with sij = 0      d
Shared, Hyperellipsoidal      Si = S                    d(d+1)/2
Different, Hyperellipsoidal   Si                        K · d(d+1)/2
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Different cases of covariance matrices fitted to the same data lead to different decision boundaries.
Discrete Features
Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$

If the $x_j$ are independent (Naive Bayes'):

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}\,(1-p_{ij})^{1-x_j}$$

The discriminant is linear:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = \sum_j \left[x_j \log p_{ij} + (1-x_j)\log(1-p_{ij})\right] + \log P(C_i)$$

Estimated parameters:

$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$
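A brief sketch of these Bernoulli Naive Bayes estimates and the resulting discriminant (in practice one would smooth the estimates slightly to avoid log 0; names are illustrative):

    import numpy as np

    def bernoulli_nb_fit(X, y):
        """p_ij = p(x_j = 1 | C_i) and class priors, from a binary N x d matrix X."""
        classes = np.unique(y)
        p = {c: X[y == c].mean(axis=0) for c in classes}       # fraction of 1s per feature
        priors = {c: np.mean(y == c) for c in classes}
        return p, priors

    def bernoulli_nb_discriminant(x, p_i, prior_i):
        """g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i)."""
        return np.sum(x * np.log(p_i) + (1 - x) * np.log(1 - p_i)) + np.log(prior_i)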
Discrete Features
Multinomial (1-of-$n_j$) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$

$$p_{ijk} \equiv p(z_{jk} = 1 \mid C_i) = p(x_j = v_k \mid C_i)$$

where $z_{jk} = 1$ if $x_j = v_k$ and 0 otherwise. If the $x_j$ are independent:

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d}\prod_{k=1}^{n_j} p_{ijk}^{\,z_{jk}}$$

$$g_i(\mathbf{x}) = \sum_j\sum_k z_{jk}\log p_{ijk} + \log P(C_i)$$

Estimated parameters:

$$\hat{p}_{ijk} = \frac{\sum_t z_{jk}^t\,r_i^t}{\sum_t r_i^t}$$
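For one categorical attribute, the estimate of p_ijk is simply a per-class relative frequency; a small sketch (names are illustrative):

    import numpy as np

    def categorical_probs(xj, y):
        """p_ijk = p(x_j = v_k | C_i): per-class relative frequency of each value of x_j."""
        return {c: {v: np.mean(xj[y == c] == v) for v in np.unique(xj)}
                for c in np.unique(y)}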
Multivariate Regression

Multivariate linear model:

$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t + \varepsilon$$

Minimize the sum of squared errors:

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - w_0 - w_1 x_1^t - \cdots - w_d x_d^t\right)^2$$

Multivariate polynomial model: define new higher-order variables

$$z_1 = x_1, \quad z_2 = x_2, \quad z_3 = x_1^2, \quad z_4 = x_2^2, \quad z_5 = x_1 x_2$$

and use the linear model in this new z space (basis functions, kernel trick: Chapter 13).
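A minimal least-squares sketch for the linear model (for the polynomial model one would simply append the higher-order z columns to the design matrix; names are illustrative):

    import numpy as np

    def fit_linear_regression(X, r):
        """Least-squares fit of r ≈ w0 + w1 x1 + ... + wd xd."""
        A = np.column_stack([np.ones(len(X)), X])      # prepend a column of 1s for w0
        w, *_ = np.linalg.lstsq(A, r, rcond=None)
        return w                                        # w[0] = w0, w[1:] = w1..wd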