
EE E6820: Speech & Audio Processing & Recognition

Lecture 3: Machine learning, classification, and generative models

Michael Mandel <[email protected]>

Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820

February 7, 2008

1 Classification

2 Generative models

3 Gaussian models

4 Hidden Markov models


Classification and generative models

[Figure: spectrogram (f/Hz vs. time/s) mapped to formant features (F1/Hz, F2/Hz) for the vowels 'ay' and 'ao']

Classification
- discriminative models
- discrete, categorical random variable of interest
- fixed set of categories

Generative models
- descriptive models
- continuous or discrete random variable(s) of interest
- can estimate parameters
- Bayes' rule makes them useful for classification


Building a classifier

Define classes/attributes
- could state explicit rules
- better to define through 'training' examples

Define feature space

Define decision algorithm
- set parameters from examples

Measure performance
- calculate (weighted) error rate

[Figure: Pols vowel formants, F2/Hz vs. F1/Hz: "u" (x) vs. "o" (o)]


Classification system parts

Processing pipeline:

Sensor → signal → Pre-processing / segmentation → segment → Feature extraction → feature vector → Classification → class → Post-processing

- Pre-processing / segmentation: STFT, locate vowels
- Feature extraction: formant extraction
- Post-processing: context constraints, costs/risk


Feature extraction

Right features are critical
- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications

Theoretically equivalent features may act very differently in a particular classifier
- representations make important aspects explicit
- remove irrelevant information

Feature design incorporates 'domain knowledge'
- although more data ⇒ less need for 'cleverness'

Smaller 'feature space' (fewer dimensions)
→ simpler models (fewer parameters)
→ less training data needed
→ faster training

[inverting MFCCs]


Optimal classification

Minimize the probability of error with the Bayes optimal decision:

θ̂ = argmax_θi p(θi | x)

p(error) = ∫ p(error | x) p(x) dx
         = Σ_i ∫_Λi (1 − p(θi | x)) p(x) dx

- where Λi is the region of x where θi is chosen
- . . . but p(θi | x) is largest in that region
- so p(error) is minimized


Sources of error

[Figure: class-conditional densities p(x|ω1)·Pr(ω1) and p(x|ω2)·Pr(ω2) over x, with decision regions X̂ω1 and X̂ω2; Pr(ω1 | X̂ω2) = Pr(err | X̂ω2) and Pr(ω2 | X̂ω1) = Pr(err | X̂ω1)]

Suboptimal threshold / regions (bias error)
- use a Bayes classifier

Incorrect distributions (model error)
- better distribution models / more training data

Misleading features ('Bayes error')
- irreducible for a given feature set
- regardless of classification scheme


Two roads to classification

Optimal classifier is

θ̂ = argmax_θi p(θi | x)

but we don't know p(θi | x)

Can model the distribution directly
- e.g. nearest neighbor, SVM, AdaBoost, neural net
- maps from inputs x to outputs θi
- a discriminative model

Often easier to model the data likelihood p(x | θi)
- use Bayes' rule to convert to p(θi | x)
- a generative (descriptive) model


Nearest neighbor classification

[Figure: Pols vowel formants, F2/Hz vs. F1/Hz: "u" (x), "o" (o), "a" (+), with a new observation x marked *]

Find closest match (nearest neighbor)

Naïve implementation takes O(N) time for N training points

As N → ∞, the error rate is at most twice the Bayes error rate

With K summarized classes, takes O(K) time

Locality sensitive hashing gives approximate nearest neighbors in O(d·n^(1/c²)) time (Andoni and Indyk, 2006)
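As an illustration, here is a minimal numpy sketch of brute-force 1-nearest-neighbor classification; the formant-like training values are made up for this example, not taken from the Pols data.

```python
import numpy as np

def nearest_neighbor_classify(X_train, y_train, x_new):
    """Brute-force 1-NN: O(N) distance computations per query."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    return y_train[np.argmin(dists)]                 # label of the closest point

# Toy formant-like data (F1, F2 in Hz); values are illustrative only
X_train = np.array([[300.0, 700.0], [320.0, 750.0],   # "u"
                    [450.0, 900.0], [480.0, 950.0]])  # "o"
y_train = np.array(["u", "u", "o", "o"])

print(nearest_neighbor_classify(X_train, y_train, np.array([310.0, 720.0])))  # -> "u"
```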


Support vector machines

"Large margin" linear classifier for separable data
- regularization of the margin avoids over-fitting
- can be adapted to non-separable data (C parameter)
- made nonlinear using kernels k(x1, x2) = Φ(x1) · Φ(x2)

[Figure: maximum-margin hyperplane (w·x) + b = 0 with margins (w·x) + b = ±1 separating points labeled yi = +1 and yi = −1]

Depends only on training points near the decision boundary, the support vectors

Unique, optimal solution for given Φ and C
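For reference, a minimal sketch of training such a classifier with scikit-learn's SVC (an outside library, not part of the lecture); the toy data and the RBF kernel choice are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[300.0, 700.0], [320.0, 750.0],   # class 0 ("u")
              [450.0, 900.0], [480.0, 950.0]])  # class 1 ("o")
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf", C=1.0)   # C trades margin width against training errors
clf.fit(X, y)
print(clf.support_)              # indices of the support vectors
print(clf.predict([[400.0, 850.0]]))
```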


Generative models

Describe the data using structured probabilistic models

Observations are random variables whose distribution depends on model parameters

Source distributions p(x | θi)
- reflect variability in features
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)

[Figure: class-conditional densities p(x|ωi) over x for classes ω1, ω2, ω3, ω4]


Generative models (2)

Three things to do with generative models

Evaluate the probability of an observation, possibly under multiple parameter settings

p(x), p(x | θ1), p(x | θ2), . . .

Estimate model parameters from observed data

θ̂ = argmin_θ C(θ*, θ | x)

Run the model forward to generate new data

x ∼ p(x | θ)


Random variables review

Random variables have joint distributions, p(x, y)

Marginal distribution of y

p(y) = ∫ p(x, y) dx

Knowing one value in a joint distribution constrains the remainder

Conditional distribution of x given y

p(x | y) ≡ p(x, y) / p(y) = p(x, y) / ∫ p(x, y) dx

[Figure: joint density p(x, y) with means µx, µy, spreads σx, σy, covariance σxy; marginals p(x) and p(y); and the conditional slice p(x | y = Y)]


Bayes’ rule

p(x | y) p(y) = p(x, y) = p(y | x) p(x)

∴ p(y | x) = p(x | y) p(y) / p(x)

⇒ can reverse conditioning given priors/marginals

terms can be discrete or continuous

generalizes to more variables

p(x , y , z) = p(x | y , z)p(y , z) = p(x | y , z)p(y | z)p(z)

allows conversion between joint, marginals, conditionals


Bayes’ rule for generative models

Run generative models backwards to compare them

Likelihood, prior, and evidence give the posterior:

p(θ | x) = p(x | θ) · p(θ) / ∫ p(x | θ) p(θ) dθ

- p(θ | x): posterior probability
- p(x | θ): likelihood
- p(θ): prior probability
- ∫ p(x | θ) p(θ) dθ: evidence = p(x)

Posterior is the classification we're looking for
- combination of prior belief in each class
- with likelihood under our model
- normalized by evidence (so the posteriors sum to 1)

Objection: priors are often unknown

. . . but omitting them amounts to assuming they are all equal
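A small numerical sketch of this rule for a two-class problem, assuming made-up 1-D Gaussian likelihoods and equal priors:

```python
from math import sqrt, pi, exp

def gauss_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

priors = {"u": 0.5, "o": 0.5}                      # p(theta_i), assumed equal
params = {"u": (350.0, 40.0), "o": (500.0, 50.0)}  # (mu, sigma) per class, illustrative

x = 420.0                                          # observed feature (e.g. F1 in Hz)
likelihoods = {c: gauss_pdf(x, *params[c]) for c in priors}
evidence = sum(likelihoods[c] * priors[c] for c in priors)        # p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)                                  # p(theta_i | x); sums to 1
```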


Computing probabilities and estimating parameters

Want the probability of the observation under a model, p(x)
- regardless of parameter settings

Full Bayesian integral

p(x) = ∫ p(x | θ) p(θ) dθ

Difficult to compute in general; approximate as p(x | θ̂)
- Maximum likelihood (ML)

  θ̂ = argmax_θ p(x | θ)

- Maximum a posteriori (MAP): ML + prior

  θ̂ = argmax_θ p(θ | x) = argmax_θ p(x | θ) p(θ)
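As a concrete instance, a sketch of ML estimation for a 1-D Gaussian on synthetic data (for this model the ML estimates are simply the sample mean and variance):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=500.0, scale=50.0, size=200)   # synthetic observations

mu_ml = x.mean()                      # argmax_theta p(x | theta) for the mean
var_ml = ((x - mu_ml) ** 2).mean()    # ML variance (divides by N, not N-1)
print(mu_ml, np.sqrt(var_ml))         # close to the true 500, 50
```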


Model checking

After estimating parameters, run the model forward

Check that
- the model is rich enough to capture variation in the data
- parameters are estimated correctly
- there aren't any bugs in your code

Generate data from the model and compare it to observations

x ∼ p(x | θ̂)

- are they similar under some statistic T(x) : R^d → R ?
- can you find the real data set in a group of synthetic data sets?

Then go back and update your model accordingly

Gelman et al. (2003, ch. 6)
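A minimal sketch of such a check, assuming a Gaussian fit to deliberately non-Gaussian synthetic data and using the sample maximum as the statistic T(x):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=100.0, size=300)   # "real" data (deliberately non-Gaussian)

mu, sigma = x.mean(), x.std()                     # ML Gaussian fit
T_real = x.max()                                  # statistic T(x): R^d -> R
T_synth = [rng.normal(mu, sigma, size=x.size).max() for _ in range(1000)]
print("real max:", T_real,
      "fraction of synthetic maxima >= real:",
      np.mean(np.array(T_synth) >= T_real))       # near 0 or 1 suggests model misfit
```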


Gaussian models

Easiest way to model distributions is via a parametric model
- assume a known form, estimate a few parameters

Gaussian model is simple and useful. In 1D

p(x | θi) = 1 / (σi √(2π)) · exp[ −(1/2) ((x − µi) / σi)² ]

Parameters: mean µi and variance σi² → fit to data

[Figure: 1-D Gaussian p(x|ωi) centered at µi with width σi; the peak value PM falls to PM·e^(−1/2) at µi ± σi]


Gaussians in d dimensions

p(x | θi) = 1 / ((2π)^(d/2) |Σi|^(1/2)) · exp[ −(1/2) (x − µi)ᵀ Σi⁻¹ (x − µi) ]

Described by a d-dimensional mean µi and a d × d covariance matrix Σi

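A short numpy sketch of evaluating this log-density (the mean and covariance values are illustrative only):

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log of the d-dimensional Gaussian density at x."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)           # log |Sigma|, numerically stable
    maha = diff @ np.linalg.solve(Sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mu = np.array([350.0, 800.0])                      # e.g. mean (F1, F2), made up
Sigma = np.array([[2500.0,  800.0],
                  [ 800.0, 4000.0]])               # covariance, made up
print(gaussian_logpdf(np.array([380.0, 850.0]), mu, Sigma))
```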


Gaussian mixture models

Single Gaussians cannot model
- distributions with multiple modes
- distributions with nonlinear correlations

What about a weighted sum?

p(x) ≈ Σ_k ck p(x | θk)

- where {ck} is a set of weights and {p(x | θk)} is a set of Gaussian components
- can fit anything given enough components

Interpretation: each observation is generated by one of the Gaussians, chosen with probability ck = p(θk)
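A sketch of this generative interpretation for a 1-D mixture with assumed weights and component parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
c = np.array([0.3, 0.7])        # mixture weights c_k = p(theta_k), illustrative
mu = np.array([0.0, 5.0])       # component means
sigma = np.array([1.0, 0.5])    # component standard deviations

k = rng.choice(len(c), size=1000, p=c)     # which Gaussian generated each point
x = rng.normal(mu[k], sigma[k])            # sample from the chosen component
print(x[:5], np.bincount(k) / len(k))      # weights are recoverable from the counts
```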


Gaussian mixtures (2)

e.g. nonlinear correlation

[Figure: original data with a nonlinear correlation, the fitted Gaussian components, and the resulting mixture surface]

Problem: finding ck and θk parameters

easy if we knew which θk generated each x


Expectation-maximization (EM)

General procedure for estimating model parameters when some are unknown

e.g. which GMM component generated a point

Iteratively update the model parameters θ to maximize Q, the expected log-probability of the observed data x and hidden data z:

Q(θ, θᵗ) = ∫_z p(z | x, θᵗ) log p(z, x | θ) dz

- E step: calculate p(z | x, θᵗ) using θᵗ
- M step: find θ that maximizes Q using p(z | x, θᵗ)
- can prove p(x | θ) is non-decreasing
- hence a maximum likelihood model
- local optimum only (depends on initialization)


Fitting GMMs with EM

Want to find
- parameters of the Gaussians θk = {µk, Σk}
- weights/priors on the Gaussians ck = p(θk)

. . . that maximize the likelihood of the training data x

If we could assign each x to a particular θk, estimation would be direct

Hence treat the mixture indices, z, as hidden
- form Q = E[log p(x, z | θ)]
- differentiate wrt model parameters
→ equations for µk, Σk, ck that maximize Q


GMM EM update equations

Parameters that maximize Q, where νnk ≡ p(zk | xn, θᵗ):

µk = Σ_n νnk xn / Σ_n νnk

Σk = Σ_n νnk (xn − µk)(xn − µk)ᵀ / Σ_n νnk

ck = (1/N) Σ_n νnk

Each involves νnk, the 'fuzzy membership' of xn in Gaussian k

Each updated parameter is just a sample average, weighted by fuzzy membership
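A compact numpy sketch implementing these E- and M-step updates for a 1-D, two-component GMM on synthetic data (not the lecture's vowel data); here "var" plays the role of Σk in one dimension:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(5.0, 0.7, 700)])
N, K = len(x), 2

mu = np.array([-1.0, 1.0])     # arbitrary but distinct initialization
var = np.array([1.0, 1.0])
c = np.array([0.5, 0.5])

for _ in range(50):
    # E step: responsibilities nu[n, k] = p(z_k | x_n, theta_t)
    joint = c * gauss_pdf(x[:, None], mu, var)       # shape (N, K)
    nu = joint / joint.sum(axis=1, keepdims=True)

    # M step: fuzzy-membership-weighted sample averages
    Nk = nu.sum(axis=0)
    mu = (nu * x[:, None]).sum(axis=0) / Nk
    var = (nu * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    c = Nk / N

print(mu, np.sqrt(var), c)     # should roughly recover (0, 5), (1, 0.7), (0.3, 0.7)
```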


GMM examples

Vowel data fit with different mixture counts

[Figure: Pols vowel formant data (F1 200–1200 Hz, F2 600–1600 Hz) fit with 1–4 Gaussians: 1 Gauss log p(x) = −1911, 2 Gauss −1864, 3 Gauss −1849, 4 Gauss −1840]

[Example. . . ]


Markov models

A (first order) Markov model is a finite-state system whose behavior depends only on the current state

“The future is independent of the past, conditioned on thepresent”

e.g. generative Markov model

States {S, A, B, C, E} with transition probabilities p(qn+1 | qn) (rows: qn, columns: qn+1):

      S    A    B    C    E
  S   0    1    0    0    0
  A   0   .8   .1   .1    0
  B   0   .1   .8   .1    0
  C   0   .1   .1   .7   .1
  E   0    0    0    0    1

An example generated state sequence:

S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
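A sketch of running this generative Markov model forward, using the transition matrix above (E treated as absorbing):

```python
import numpy as np

states = ["S", "A", "B", "C", "E"]
P = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],    # from S
              [0.0, 0.8, 0.1, 0.1, 0.0],    # from A
              [0.0, 0.1, 0.8, 0.1, 0.0],    # from B
              [0.0, 0.1, 0.1, 0.7, 0.1],    # from C
              [0.0, 0.0, 0.0, 0.0, 1.0]])   # from E (absorbing)

rng = np.random.default_rng(0)
q = [0]                                      # start in S
while states[q[-1]] != "E":
    q.append(rng.choice(5, p=P[q[-1]]))      # next state depends only on the current one
print(" ".join(states[i] for i in q))
```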


Hidden Markov models

Markov model where the state sequence Q = {qn} is not directly observable ('hidden')

But observations X do depend on Q
- xn is a random variable that depends only on the current state: p(xn | qn)

[Figure: emission distributions p(x|q) for q = A, B, C; the hidden state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC; and the resulting observation sequence xn over time steps n]

can still tell something about state sequence. . .


(Generative) Markov models

HMM is specified by parameters Θ:

[Figure: a left-to-right HMM for the word 'k a t' (with end state •): a 4×4 transition matrix aij built from self-loop and advance probabilities (0.9 / 0.1), per-state emission distributions p(x|q), and initial state probabilities]

- states q^i
- transition probabilities aij
- emission distributions bi(x)
- (+ initial state probabilities πi)

aij ≡ p(q^j_n | q^i_{n−1})    bi(x) ≡ p(x | q^i)    πi ≡ p(q^i_1)


Markov models for sequence recognition

Independence of observations
- observation xn depends only on the current state qn

p(X | Q) = p(x1, x2, . . . , xN | q1, q2, . . . , qN)
         = p(x1 | q1) p(x2 | q2) · · · p(xN | qN)
         = Π_{n=1..N} p(xn | qn) = Π_{n=1..N} b_qn(xn)

Markov transitions
- the transition to the next state qn+1 depends only on qn

p(Q | M) = p(q1, q2, . . . , qN | M)
         = p(qN | qN−1, . . . , q1) p(qN−1 | qN−2, . . . , q1) · · · p(q2 | q1) p(q1)
         = p(qN | qN−1) p(qN−1 | qN−2) · · · p(q2 | q1) p(q1)
         = p(q1) Π_{n=2..N} p(qn | qn−1) = π_q1 Π_{n=2..N} a_{qn−1 qn}


Model-fit calculations

From ‘state-based modeling’:

p(X | Θj) = Σ_{all Q} p(X | Q, Θj) p(Q | Θj)

For HMMs

p(X | Q) = Π_{n=1..N} b_qn(xn)

p(Q | M) = π_q1 Π_{n=2..N} a_{qn−1 qn}

Hence, solve for Θ̂ = argmax_Θj p(Θj | X)
- using Bayes' rule to convert from p(X | Θj)

Sum over all Q???


Summing over all paths

Model M1: states S, A, B, E with transition probabilities p(qn | qn−1) (rows: from, columns: to):

      S    A    B    E
  S   •   0.9  0.1   •
  A   •   0.7  0.2  0.1
  B   •    •   0.8  0.2
  E   •    •    •    1

Observation likelihoods p(x | q) for the observations x1, x2, x3:

        x1   x2   x3
  q=A   2.5  0.2  0.1
  q=B   0.1  2.2  2.3

All possible 3-emission paths Q from S to E, with p(Q | M) = Π_n p(qn | qn−1) and p(X | Q, M) = Π_n p(xn | qn):

  q0 q1 q2 q3 q4    p(Q | M)                      p(X | Q, M)               p(X, Q | M)
  S  A  A  A  E     .9 × .7 × .7 × .1 = 0.0441    2.5 × 0.2 × 0.1 = 0.05    0.0022
  S  A  A  B  E     .9 × .7 × .2 × .2 = 0.0252    2.5 × 0.2 × 2.3 = 1.15    0.0290
  S  A  B  B  E     .9 × .2 × .8 × .2 = 0.0288    2.5 × 2.2 × 2.3 = 12.65   0.3643
  S  B  B  B  E     .1 × .8 × .8 × .2 = 0.0128    0.1 × 2.2 × 2.3 = 0.506   0.0065
                    Σ = 0.1109                                              Σ = p(X | M) = 0.4020

[Figure: trellis of states S, A, B, E over time steps n = 0 . . . 4 showing these paths]
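A sketch that enumerates the paths of model M1 above and reproduces p(X | M) ≈ 0.402 by brute force (paths containing disallowed transitions contribute zero):

```python
import itertools
import numpy as np

trans = {("S", "A"): 0.9, ("S", "B"): 0.1,
         ("A", "A"): 0.7, ("A", "B"): 0.2, ("A", "E"): 0.1,
         ("B", "B"): 0.8, ("B", "E"): 0.2}
emit = {"A": [2.5, 0.2, 0.1],     # p(x1|A), p(x2|A), p(x3|A)
        "B": [0.1, 2.2, 2.3]}     # p(x1|B), p(x2|B), p(x3|B)

total = 0.0
for q in itertools.product("AB", repeat=3):              # 3 emitting states
    path = ("S",) + q + ("E",)
    p_Q = np.prod([trans.get((a, b), 0.0) for a, b in zip(path[:-1], path[1:])])
    p_X_given_Q = np.prod([emit[s][n] for n, s in enumerate(q)])
    total += p_Q * p_X_given_Q                           # p(X, Q | M)
print(total)   # ~0.402
```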


The ‘forward recursion’

Dynamic-programming-like technique to sum over all Q

Define αn(i) as the probability of getting to state q^i at time step n (by any path):

αn(i) = p(x1, x2, . . . , xn, qn = q^i) ≡ p(X_1^n, q^i_n)

αn+1(j) can be calculated recursively:

αn+1(j) = [ Σ_{i=1..S} αn(i) · aij ] · bj(xn+1)

[Figure: trellis of model states q^i over time steps n illustrating the recursion]


Forward recursion (2)

Initialize α1(i) = πi bi(x1)

Then the total probability is p(X_1^N | Θ) = Σ_{i=1..S} αN(i)

→ Practical way to solve for p(X | Θj) and hence select the most probable model (recognition)

[Figure: observations X scored as p(X | M1)·p(M1) and p(X | M2)·p(M2); choose the best]
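A minimal numpy forward recursion for the M1 example, folding the non-emitting start S into the initial probabilities and treating the transition into E as a final exit step; it matches the path-enumeration total of about 0.402:

```python
import numpy as np

pi = np.array([0.9, 0.1])            # p(S -> A), p(S -> B)
a = np.array([[0.7, 0.2],            # transitions among emitting states A, B
              [0.0, 0.8]])
a_exit = np.array([0.1, 0.2])        # p(A -> E), p(B -> E)
b = np.array([[2.5, 0.2, 0.1],       # p(x_n | A)
              [0.1, 2.2, 2.3]])      # p(x_n | B)

alpha = pi * b[:, 0]                 # alpha_1(i) = pi_i b_i(x_1)
for n in range(1, b.shape[1]):
    alpha = (alpha @ a) * b[:, n]    # alpha_{n+1}(j) = [sum_i alpha_n(i) a_ij] b_j(x_{n+1})
print(alpha @ a_exit)                # ~0.402, matching the path enumeration
```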


Optimal path

May be interested in the actual qn assignments
- which state was 'active' at each time frame

e.g. phone labeling (for training?)

Total probability is over all paths

. . . but can also solve for single best path, “Viterbi” state sequence

Probability along the best path to state q^j at time n + 1:

α̂n+1(j) = [ max_i {α̂n(i) aij} ] · bj(xn+1)

- backtrack from the final state to get the best path
- final probability is a product only (no sum)
→ log-domain calculation is just summation

Best path often dominates the total probability

p(X | Θ) ≈ p(X, Q̂ | Θ)
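A sketch of the Viterbi recursion with backtracking for the same M1 example (kept in the probability domain for clarity; a real implementation would work in the log domain):

```python
import numpy as np

pi = np.array([0.9, 0.1])
a = np.array([[0.7, 0.2],
              [0.0, 0.8]])
a_exit = np.array([0.1, 0.2])
b = np.array([[2.5, 0.2, 0.1],
              [0.1, 2.2, 2.3]])
states = ["A", "B"]
N = b.shape[1]

delta = pi * b[:, 0]                          # best-path score ending in each state
back = np.zeros((N, 2), dtype=int)            # backpointers
for n in range(1, N):
    scores = delta[:, None] * a               # scores[i, j] = delta_n(i) * a_ij
    back[n] = scores.argmax(axis=0)
    delta = scores.max(axis=0) * b[:, n]

# include the exit transition to E, then backtrack
q = [int(np.argmax(delta * a_exit))]
for n in range(N - 1, 0, -1):
    q.append(back[n][q[-1]])
best = [states[i] for i in reversed(q)]
print(best, (delta * a_exit).max())           # ['A', 'B', 'B'], ~0.364 (the S A B B E path)
```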


Interpreting the Viterbi path

Viterbi path assigns each xn to a state qi

- performing classification based on bi(x)
- . . . at the same time applying transition constraints aij

[Figure: observations xn over time steps n with inferred Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]

Can be used for segmentation
- train an HMM with 'garbage' and 'target' states
- decode on new data to find 'targets' and their boundaries

Can be used for (heuristic) training
- e.g. forced alignment to bootstrap a speech recognizer
- e.g. train classifiers based on labels . . .


Aside: Training and test data

A rich model can learn every training example (overtraining)

But the goal is to classify new, unseen data
- sometimes use a 'cross validation' set to decide when to stop training

For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly . . . )


Aside (2): Model complexity

More training data allows the use of larger models

More model parameters create a better fit to the training data
- more Gaussian mixture components
- more HMM states

For a fixed training set size, there will be some optimal model size that avoids overtraining


Summary

Classification is making discrete (hard) decisions

Basis is comparison with known examples
- explicitly or via a model

Classification models
- discriminative models, like SVMs, neural nets, boosters, directly learn posteriors p(θi | x)
- generative models, like Gaussians, GMMs, HMMs, model likelihoods p(x | θ)
- Bayes' rule lets us use generative models for classification

EM allows parameter estimation even with some data missing

Parting thought

Is it wise to use generative models for discrimination or vice versa?


References

Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. IEEE Symposium on Foundations of Computer Science, pages 459–468, Washington, DC, USA, 2006. IEEE Computer Society.

Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, second edition, July 2003. ISBN 158488388X.

Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

Jeff A. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, 1997.

Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, June 1998.
