
Lecture Slides for

INTRODUCTION TO MACHINE LEARNING, 3RD EDITION

ETHEM ALPAYDIN
© The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e

Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI

CHAPTER 5:

MULTIVARIATE METHODS

3

Multivariate Data

$$\mathbf{X} = \begin{bmatrix} X_1^1 & X_2^1 & \cdots & X_d^1 \\ X_1^2 & X_2^2 & \cdots & X_d^2 \\ \vdots & \vdots & & \vdots \\ X_1^N & X_2^N & \cdots & X_d^N \end{bmatrix}$$

Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples

4

Multivariate Parameters

Mean: $E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$

Covariance: $\sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j) = E\!\left[(X_i - \mu_i)(X_j - \mu_j)\right]$

Correlation: $\mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

Covariance matrix:
$$\boldsymbol{\Sigma} \equiv \mathrm{Cov}(\mathbf{X}) = E\!\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$

5

Parameter Estimation from data sample X

Sample mean $\mathbf{m}$: $\quad m_i = \dfrac{1}{N}\sum_{t=1}^{N} x_i^t, \quad i = 1, \ldots, d$

Covariance matrix $\mathbf{S}$: $\quad s_{ij} = \dfrac{1}{N}\sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)$

Correlation matrix $\mathbf{R}$: $\quad r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
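As an illustration (not from the slides), here is a minimal NumPy sketch that computes these three estimates from an N x d data matrix; the function name and array shapes are my own choices:

```python
import numpy as np

def estimate_parameters(X):
    """Sample mean, covariance S (divides by N, as on the slide), and correlation R
    for an N x d data matrix X."""
    N, d = X.shape
    m = X.mean(axis=0)                # sample mean, shape (d,)
    Xc = X - m                        # centered data
    S = Xc.T @ Xc / N                 # covariance matrix, shape (d, d)
    s = np.sqrt(np.diag(S))           # per-feature standard deviations
    R = S / np.outer(s, s)            # correlation r_ij = s_ij / (s_i * s_j)
    return m, S, R

# Example: 100 instances with 3 features
X = np.random.randn(100, 3)
m, S, R = estimate_parameters(X)
```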

6

Estimation of Missing Values

What to do if certain instances have missing attribute values?

Ignore those instances: not a good idea if the sample is small.

Use 'missing' as an attribute value: it may give information.

Imputation: fill in the missing value (a small sketch follows below).
Mean imputation: use the most likely value (e.g., the mean).
Imputation by regression: predict the missing value from the other attributes.
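A minimal sketch of mean imputation, assuming missing entries are coded as NaN in a NumPy array (the helper name is illustrative, not from the book):

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)     # per-column means, ignoring NaNs
    missing = np.isnan(X)                 # boolean mask of missing entries
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    return X

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
print(mean_impute(X))   # NaNs replaced by the column means (2.0 and 3.0)
```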

7

Multivariate Normal Distribution

$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$$

8

Multivariate Normal Distribution

Mahalanobis distance: (x – μ)T ∑–1 (x – μ) measures the distance from x to μ in terms of ∑ (normalizes for difference in variances and correlations)

Bivariate: d = 2

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$

where $z_i = \dfrac{x_i - \mu_i}{\sigma_i}$ (z-normalization)
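To make the Mahalanobis distance concrete, here is a small NumPy sketch (my own illustration, not part of the slides):

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))   # solve() avoids forming the inverse

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.0, 1.0])
print(mahalanobis_sq(x, mu, Sigma))
```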

9

Bivariate Normal: isoprobability contour plots [i.e., (x – μ)T Σ–1 (x – μ) = c²]

when covariance is 0, ellipsoid axes are parallel to coordinate axes

10

11

Independent Inputs: Naive Bayes

If $x_i$ are independent, the off-diagonal entries of $\boldsymbol{\Sigma}$ are 0, and the Mahalanobis distance reduces to a weighted (by $1/\sigma_i$) Euclidean distance:

$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i) = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i} \exp\!\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$$

If the variances are also equal, this reduces to the Euclidean distance.

Note: the use of the term "Naïve Bayes" in this chapter is somewhat wrong, since Naïve Bayes assumes independence in the probability sense, not in the linear algebra sense.

12

Parametric Classification

If $p(\mathbf{x} \mid C_i) \sim \mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}|\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

Discriminant functions:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \log P(C_i)$$

13

Estimation of Parameters from data sample X

Given a sample $\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}$, where $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise:

$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$$

$$\mathbf{m}_i = \frac{\sum_t r_i^t \mathbf{x}^t}{\sum_t r_i^t}$$

$$\mathbf{S}_i = \frac{\sum_t r_i^t (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1}(\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$
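A minimal sketch of these per-class estimates and the quadratic discriminant, assuming X is an N x d NumPy array and y a vector of class labels (names and shapes are my own):

```python
import numpy as np

def fit_class(X, y, c):
    """Prior, mean, and covariance estimates for class c (the label indicator plays the role of r_i^t)."""
    Xc = X[y == c]
    prior = len(Xc) / len(X)
    m = Xc.mean(axis=0)
    S = (Xc - m).T @ (Xc - m) / len(Xc)
    return prior, m, S

def g_quadratic(x, prior, m, S):
    """g_i(x) = -1/2 log|S_i| - 1/2 (x - m_i)^T S_i^{-1} (x - m_i) + log P(C_i)."""
    diff = x - m
    return (-0.5 * np.log(np.linalg.det(S))
            - 0.5 * diff @ np.linalg.solve(S, diff)
            + np.log(prior))
```

A new instance x would then be assigned to the class with the largest g_i(x).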

14

Assuming a different Si for each Ci

Quadratic discriminant. Expanding the formula on the previous slide:

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{x} - 2\,\mathbf{x}^T\mathbf{S}_i^{-1}\mathbf{m}_i + \mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i\right) + \log\hat{P}(C_i) = \mathbf{x}^T\mathbf{W}_i\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$

This has the form of a quadratic function of x.

See figure on next slide.

15

[Figure: likelihoods p(x|Ci), posterior for C1, and the discriminant P(C1|x) = 0.5]

16

Assuming Common Covariance Matrix S

Shared common sample covariance S:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$

The discriminant reduces to

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}^{-1}(\mathbf{x} - \mathbf{m}_i) + \log \hat{P}(C_i)$$

which is a linear discriminant (the quadratic term $-\frac{1}{2}\mathbf{x}^T\mathbf{S}^{-1}\mathbf{x}$ is common to all classes and can be dropped):

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \text{where } \mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \quad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}^{-1}\mathbf{m}_i + \log \hat{P}(C_i)$$
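For the shared-covariance case, here is a short sketch of the precomputed linear parameters w_i and w_i0 (again an illustration, not the book's code; the pooled S and per-class means are assumed to be estimated as on the earlier slides):

```python
import numpy as np

def linear_discriminant_params(m_i, S_shared, prior_i):
    """w_i = S^{-1} m_i,  w_i0 = -1/2 m_i^T S^{-1} m_i + log P(C_i)."""
    w = np.linalg.solve(S_shared, m_i)
    w0 = -0.5 * m_i @ w + np.log(prior_i)
    return w, w0

def g_linear(x, w, w0):
    """Linear discriminant g_i(x) = w_i^T x + w_i0."""
    return w @ x + w0
```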

17

Common Covariance Matrix S

Arbitrary covariances but shared by classes

18

Assuming Common Covariance Matrix S is Diagonal

When $x_j$, $j = 1, \ldots, d$, are independent, $\boldsymbol{\Sigma}$ is diagonal:

$$p(\mathbf{x} \mid C_i) = \prod_j p(x_j \mid C_i) \quad \text{(Naive Bayes' assumption)}$$

Classify based on weighted Euclidean distance (in $s_j$ units) to the nearest mean:

$$g_i(\mathbf{x}) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log \hat{P}(C_i)$$

19

Assuming Common Covariance Matrix S is Diagonal

Variances may be different.

Covariances are 0, so ellipsoid axes are parallel to coordinate axes

20

Assuming Common Covariance Matrix S is Diagonal and variances are equal

$$g_i(\mathbf{x}) = -\frac{\lVert\mathbf{x} - \mathbf{m}_i\rVert^2}{2s^2} + \log \hat{P}(C_i) = -\frac{1}{2s^2}\sum_{j=1}^{d}\left(x_j^t - m_{ij}\right)^2 + \log \hat{P}(C_i)$$

Nearest mean classifier: Classify based on Euclidean distance to the nearest mean

Each mean can be considered a prototype or template and this is template matching
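A minimal nearest-mean (template matching) sketch, assuming equal priors and already-estimated class means (array names are illustrative, not from the slides):

```python
import numpy as np

def nearest_mean_classify(x, means):
    """Assign x to the class whose mean is closest in Euclidean distance.
    means: K x d array, one row per class mean (equal priors assumed)."""
    dists = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(dists))

means = np.array([[0.0, 0.0], [3.0, 3.0]])
print(nearest_mean_classify(np.array([2.5, 2.0]), means))   # -> 1 (closer to [3, 3])
```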

21

Assuming Common Covariance Matrix S is Diagonal and variances are equal


Covariances are 0, so ellipsoid axes are parallel to coordinate axes. Variances are the same, so ellipsoids become circles.

Classifier looks for nearest mean

22

Model Selection

Assumption                     Covariance matrix         No. of parameters
Shared, Hyperspheric           S_i = S = s²I             1
Shared, Axis-aligned           S_i = S, with s_ij = 0    d
Shared, Hyperellipsoidal       S_i = S                   d(d+1)/2
Different, Hyperellipsoidal    S_i                       K·d(d+1)/2

As we increase complexity (less restricted S), bias decreases and variance increases

Assume simple models (allow some bias) to control variance (regularization)

23

Different cases of covariance matrices fitted to the same data lead to different decision boundaries.

24

Discrete Features

Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$

If $x_j$ are independent (Naive Bayes'):

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}(1 - p_{ij})^{1 - x_j}$$

the discriminant is linear:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = \sum_j \left[x_j \log p_{ij} + (1 - x_j)\log(1 - p_{ij})\right] + \log P(C_i)$$

Estimated parameters:

$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$
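A short sketch of the Bernoulli (binary-feature) Naive Bayes estimates and discriminant described above; the clipping constant eps is a practical addition of mine to avoid log 0, not something on the slides:

```python
import numpy as np

def fit_bernoulli(X, y, c, eps=1e-9):
    """p_hat_ij = (sum_t x_j^t r_i^t) / (sum_t r_i^t) for class c."""
    Xc = X[y == c]                      # rows belonging to class c (r_i^t = 1)
    p = Xc.mean(axis=0)
    return np.clip(p, eps, 1 - eps)     # clip so the log-based discriminant stays finite

def g_bernoulli(x, p, prior):
    """g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i)."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) + np.log(prior)
```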

25

Discrete Features

Multinomial (1-of-$n_j$) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$

$$p_{ijk} \equiv p(z_{jk} = 1 \mid C_i) = p(x_j = v_k \mid C_i)$$

where $z_{jk} = 1$ if $x_j = v_k$ and 0 otherwise.

If $x_j$ are independent:

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d}\prod_{k=1}^{n_j} p_{ijk}^{z_{jk}}$$

$$g_i(\mathbf{x}) = \sum_j \sum_k z_{jk}\log p_{ijk} + \log P(C_i)$$

Estimated parameters:

$$\hat{p}_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$$

26

Multivariate Regression

$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon$$

Multivariate linear model:

$$g(\mathbf{x}^t) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t$$

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \left[r^t - \left(w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t\right)\right]^2$$

Multivariate polynomial model: define new higher-order variables

$$z_1 = x_1,\quad z_2 = x_2,\quad z_3 = x_1^2,\quad z_4 = x_2^2,\quad z_5 = x_1 x_2$$

and use the linear model in this new z-space (basis functions, kernel trick: Chapter 13).
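As a concrete illustration of fitting the multivariate linear model by minimizing E(w | X), here is a sketch with made-up data (not from the slides):

```python
import numpy as np

def fit_linear(X, r):
    """Least-squares fit of r^t ≈ w_0 + w_1 x_1^t + ... + w_d x_d^t.
    X: N x d input matrix, r: length-N target vector. Returns (w_0, w)."""
    N = len(X)
    A = np.hstack([np.ones((N, 1)), X])          # prepend a column of ones for the bias w_0
    w_full, *_ = np.linalg.lstsq(A, r, rcond=None)
    return w_full[0], w_full[1:]

# Example: 2-D inputs with a known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
r = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
w0, w = fit_linear(X, r)   # recovers approximately w0 = 1.0, w = [2.0, -0.5]
```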