Transcript of: Expectation Maximization, Introduction to Artificial Intelligence, COS302, Michael L. Littman, Fall 2001.

Page 1:

Expectation Maximization

Introduction to Artificial Intelligence

COS302

Michael L. Littman

Fall 2001

Page 2:

Administration

Exams halfway graded. They assure me they will be working over Thanksgiving break.

Project groups.

Next week, synonyms via web.

Week after, synonyms via wordnet. (See web site.)

Page 3:

Plan

Connection between learning from data and finding a maximum likelihood (ML) model

ML from complete data

EM: ML with missing data

EM for HMMs

QED, PDQ, MOUSE

Page 4:

Learning from Data

We want to learn a model with a set of parameter values M.

We are given a set of data D.

An approach: argmax_M Pr(M|D)

This is the maximum likelihood model (ML).

How relate to Pr(D|M)?

Page 5:

Super Simple Example

Coin I and Coin II. (Weighted.)

Pick a coin at random (uniform).

Flip it 4 times.

Repeat.

What are the parameters of the model?

Page 6:

Data

Coin I    Coin II
HHHT      TTTH
HTHH      THTT
HTTH      TTHT
THHH      HTHT
HHHH      HTTT

Page 7:

Probability of D Given M

p: Probability of H from Coin I

q: Probability of H from Coin II

Let's say h heads and t tails for Coin I, h' and t' for Coin II.

Pr(D|M) = p^h (1-p)^t q^h' (1-q)^t'

How maximize this quantity?
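As a rough illustration, this quantity can be evaluated directly from the head/tail counts. A minimal Python sketch (not from the slides), with the counts tallied from the Data slide (h = 15, t = 5 for Coin I; h' = 6, t' = 14 for Coin II) and p, q set to 3/4 and 3/10, the values that turn out to maximize it on the next slide:

# Evaluate Pr(D|M) = p^h (1-p)^t q^h' (1-q)^t'.
# Counts tallied from the Data slide: Coin I has 15 heads, 5 tails;
# Coin II has 6 heads, 14 tails.
def likelihood(p, q, h, t, h2, t2):
    return p**h * (1 - p)**t * q**h2 * (1 - q)**t2

print(likelihood(3/4, 3/10, 15, 5, 6, 14))
# Any other choice of p, q gives a smaller value, which is what the next slide shows.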

Page 8:

Maximizing p

D_p(p^h (1-p)^t q^h' (1-q)^t') = 0

(The q^h' (1-q)^t' factor is constant in p, so it drops out.)

D_p(p^h) (1-p)^t + p^h D_p((1-p)^t) = 0

h p^(h-1) (1-p)^t = p^h t (1-p)^(t-1)

h (1-p) = p t

h = p t + h p

h/(t+h) = p

Duh…
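A minimal sketch of this closed-form estimate, with the labeled sequences copied from the Data slide; counting heads and tails and dividing reproduces p = 3/4 and q = 3/10:

# ML estimates from fully labeled flips: p = h/(t+h), and likewise for q.
coin1 = ["HHHT", "HTHH", "HTTH", "THHH", "HHHH"]   # Coin I column of the Data slide
coin2 = ["TTTH", "THTT", "TTHT", "HTHT", "HTTT"]   # Coin II column

def ml_estimate(seqs):
    h = sum(s.count("H") for s in seqs)
    t = sum(s.count("T") for s in seqs)
    return h / (t + h)

print(ml_estimate(coin1), ml_estimate(coin2))   # 0.75 0.3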

Page 9:

Missing Data

HHHT    HTTH
TTTH    HTHH
THTT    HTTT
TTHT    HHHH
THHH    HTHT

Page 10:

Oh Boy, Now What!

If we knew the labels (which flips from which coin), we could find ML values for p and q.

What could we use to label?

p and q!

Page 11:

Computing Labels

p = 3/4, q = 3/10

Pr(Coin I | HHTH)
= Pr(HHTH | Coin I) Pr(Coin I) / c
= (3/4)^3 (1/4) (1/2) / c = .052734375/c

Pr(Coin II | HHTH)
= Pr(HHTH | Coin II) Pr(Coin II) / c
= (3/10)^3 (7/10) (1/2) / c = .00945/c
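A minimal sketch of this computation; the normalizer c is just the sum of the two unnormalized terms, and the 1/2 factor is the uniform prior over coins:

# Posterior over coins for one 4-flip sequence, with p = 3/4, q = 3/10 as on the slide.
p, q = 3/4, 3/10

def label(seq):
    h, t = seq.count("H"), seq.count("T")
    a = p**h * (1 - p)**t * 0.5      # Pr(seq | Coin I) Pr(Coin I)
    b = q**h * (1 - q)**t * 0.5      # Pr(seq | Coin II) Pr(Coin II)
    c = a + b                        # normalizer
    return a / c, b / c

print(label("HHTH"))                 # about (.85, .15)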

Page 12:

Expected Labels

        I    II            I    II
HHHT   .85  .15    HTTH   .44  .56
TTTH   .10  .90    HTHH   .85  .15
THTT   .10  .90    HTTT   .10  .90
TTHT   .10  .90    HHHH   .98  .02
THHH   .85  .15    HTHT   .44  .56

Page 13:

Wait, I Have an Idea

Pick some model M0

Expectation
• Compute expected labels via Mi

Maximization
• Compute ML model Mi+1

Repeat
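A minimal sketch of this loop for the two-coin problem, with the sequences from the Missing Data slide. Here a model Mi is just the pair (p, q): the expectation step computes soft labels exactly as on the Computing Labels slide, and the maximization step re-estimates p and q from the resulting fractional head/tail counts. The starting values below are arbitrary.

# EM for two weighted coins with unlabeled 4-flip sequences (Missing Data slide).
seqs = ["HHHT", "TTTH", "THTT", "TTHT", "THHH",
        "HTTH", "HTHH", "HTTT", "HHHH", "HTHT"]

def em(p, q, iters=25):
    for _ in range(iters):
        # Expectation: w = Pr(Coin I | sequence) under the current (p, q).
        h1 = t1 = h2 = t2 = 0.0
        for s in seqs:
            h, t = s.count("H"), s.count("T")
            a = p**h * (1 - p)**t * 0.5
            b = q**h * (1 - q)**t * 0.5
            w = a / (a + b)
            h1 += w * h                # fractional counts for Coin I
            t1 += w * t
            h2 += (1 - w) * h          # fractional counts for Coin II
            t2 += (1 - w) * t
        # Maximization: closed-form ML estimates from the fractional counts.
        p, q = h1 / (h1 + t1), h2 / (h2 + t2)
    return p, q

print(em(0.6, 0.5))   # compare with the fixed points listed on the Coin Example slide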

Page 14:

Could This Work?

Expectation-Maximization (EM)

Pr(D|Mi) will not decrease.

Sound familiar? Type of search.

Page 15:

Coin Example

Compute expected labels.

Compute counts of heads and tails (fractions).

Divide to get new probabilities.

p=.63  q=.42    Pr(D|M) = 9.95 x 10^-13
p=.42  q=.63    Pr(D|M) = 9.95 x 10^-13
p=.52  q=.52    Pr(D|M) = 9.56 x 10^-13

Page 16:

More General EM

Need to be able to compute probabilities: generative model

Need to tabulate counts to estimate ML model

Let's think this through with HMMs.

Page 17:

Recall HMM Model

N states, M observations

π(s): prob. starting state is s

p(s,s'): prob. of s to s' transition

b(s, k): probability of obs k from s

k_0 k_1 … k_l: observation sequence

argmax_{π,p,b} Pr(π, p, b | k_0 k_1 … k_l)

Page 18:

ML in HMM

How estimate π, p, b?

What's the missing information?

k_0 k_1 … k_l
s_0 s_1 … s_l

Page 19:

Pr(s_t = s | N N N)

[Slide diagram: a two-state HMM with states UP and DOWN; observation probabilities R: 0.7, N: 0.3 and R: 0.0, N: 1.0 for the two states; transition probabilities 0.1, 0.2, 0.8, and 0.9; observation sequence N N N.]

Page 20:

Forward Procedure

α(s,t): probability of seeing the first t observations and ending up in state s: Pr(k_0 … k_t, s_t = s)

α(s,0) = π(s) b(k_0, s)

α(s,t) = sum_{s'} b(k_t, s) p(s', s) α(s', t-1)
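A minimal sketch of the forward procedure on a small hypothetical two-state HMM. The numbers below are loosely based on those in the diagram slide, but their exact assignment to states and edges is a guess; π, p, and b follow the notation of the Recall HMM Model slide, with b[s][k] the probability of observation k from state s.

# Forward procedure: alpha[t][s] = Pr(k_0 ... k_t, s_t = s).
pi = [0.5, 0.5]                      # pi(s): prob. the starting state is s
p  = [[0.9, 0.1],                    # p(s, s'): prob. of an s -> s' transition
      [0.2, 0.8]]
b  = [[0.7, 0.3],                    # b(s, k): prob. of observation k from state s
      [0.0, 1.0]]

def forward(obs):
    n = len(pi)
    alpha = [[pi[s] * b[s][obs[0]] for s in range(n)]]            # alpha(s, 0)
    for t in range(1, len(obs)):
        alpha.append([b[s][obs[t]] *
                      sum(p[s2][s] * alpha[t - 1][s2] for s2 in range(n))
                      for s in range(n)])                          # alpha(s, t)
    return alpha

print(forward([1, 1, 1]))   # e.g. an observation sequence N N N, with N coded as 1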

Page 21:

Backward Procedure

β(s,t): probability of seeing the observations after time t given that we are in state s at time t: Pr(k_{t+1} … k_l | s_t = s)

β(s,l) = 1

β(s,t) = sum_{s'} p(s,s') b(k_{t+1}, s') β(s', t+1)
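A matching sketch of the backward procedure, using the same hypothetical two-state parameters as the forward sketch above (repeated here so the snippet runs on its own):

# Backward procedure: beta[t][s] = Pr(k_{t+1} ... k_l | s_t = s), with beta(s, l) = 1.
p = [[0.9, 0.1], [0.2, 0.8]]         # p(s, s'): s -> s' transition prob. (as above)
b = [[0.7, 0.3], [0.0, 1.0]]         # b(s, k): prob. of observation k from state s

def backward(obs):
    n, l = len(p), len(obs) - 1
    beta = [[1.0] * n]                                             # beta(s, l)
    for t in range(l - 1, -1, -1):
        beta.insert(0, [sum(p[s][s2] * b[s2][obs[t + 1]] * beta[0][s2]
                            for s2 in range(n))
                        for s in range(n)])                        # beta(s, t)
    return beta

print(backward([1, 1, 1]))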

Page 22:

Combining α and β

Want to know Pr(s_t = s | k_0 … k_l)

= Pr(k_0 … k_l, s_t = s) / c

= Pr(k_0 … k_t k_{t+1} … k_l, s_t = s) / c

= Pr(k_0 … k_t, s_t = s) Pr(k_{t+1} … k_l | k_0 … k_t, s_t = s) / c

= Pr(k_0 … k_t, s_t = s) Pr(k_{t+1} … k_l | s_t = s) / c

= α(s,t) β(s,t) / c
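Putting the two together gives the state posterior at each time step. A minimal sketch that assumes the forward and backward functions (and the hypothetical parameters) from the two sketches above are already defined in the same session:

# Pr(s_t = s | k_0 ... k_l) = alpha(s, t) * beta(s, t) / c, normalized over states.
def state_posteriors(obs):
    alpha, beta = forward(obs), backward(obs)
    n = len(alpha[0])
    posteriors = []
    for t in range(len(obs)):
        unnorm = [alpha[t][s] * beta[t][s] for s in range(n)]
        c = sum(unnorm)                          # the constant c on this slide
        posteriors.append([u / c for u in unnorm])
    return posteriors

print(state_posteriors([1, 1, 1]))   # one distribution over the two states per time step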

Page 23:

EM For HMM

Expectation: Forward-backward (Baum-Welch)

Maximization: Use counts to reestimate parameters

Repeat.

Gets stuck, but still works well.

Page 24:

What to Learn

Maximum Likelihood (counts)

Expectation (expected counts)

EM

Forward-backward for HMMs

Page 25:

Homework 8 (due 11/28)

1. Write a program that decides if a pair of words are synonyms using the web. I'll send you the list, you send me the answers.

2. Recall the naïve Bayes model in which a class is chosen at random, then features are generated from the class. Consider a simple example with 2 classes with 3 binary features. Let's use EM to learn a naïve Bayes

Page 26:

(continued)

model. (a) What are the parameters of the model? (b) Imagine we are given data consisting of the two feature values for each sample from the model. We are not given the class label. Describe an "expectation" procedure to compute class labels for the data given a model. (c) How do you use this procedure to learn a maximum likelihood model for the data?

Page 27:

Homework 9 (due 12/5)

1. Write a program that decides if a pair of words are synonyms using wordnet. I'll send you the list, you send me the answers.

2. more soon