Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)


Transcript of Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Page 1: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Bayesian approaches to knowledge representation and reasoning

Part 1 (Chapter 13)

Page 2: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Bayesianism vs. Frequentism

• Classical probability: Frequentists

– Probability of a particular event is defined relative to its frequency in a sample space of events.

– E.g., probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses.

Page 3: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Bayesian probability:

– Combine measure of “prior” belief you have in a proposition with your subsequent observations of events.

• Example: Bayesian can assign probability to statement “The first e-mail message ever written was not spam” but frequentist cannot.

Page 4: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Bayesian Knowledge Representation and Reasoning

• Question: Given the data D and our prior beliefs, what is the probability that h is the correct hypothesis? (spam example)

Page 5: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Bayesian terminology (example -- spam recognition)

– Random variable X: returns one of a set of values

{x1, x2, ...,xm},

or a continuous value in interval [a,b] with probability distribution D(X).

– Data D: {v1, v2, v3, ...} Set of observed values of random variables X1, X2, X3, ...

Page 6: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

– Hypothesis h: Function taking instance j and returning classification of j (e.g., “spam” or “not spam”).

– Space of hypotheses H: Set of all possible hypotheses

Page 7: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

– Prior probability of h:

• P(h): Probability that hypothesis h is true given our prior knowledge

• If no prior knowledge, all h ∈ H are equally probable

– Posterior probability of h:

• P(h|D): Probability that hypothesis h is true, given the data D.

– Likelihood of D:

• P(D|h): Probability that we will see data D, given hypothesis h is true.

Page 8: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Recall definition of conditional probability:

P(X | Y) = P(X ∧ Y) / P(Y)

[Venn diagram: event space containing overlapping sets X and Y]

Event space = all e-mail messages

X = all spam messages

Y = all messages containing word “v1agra”

Page 9: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Bayes Rule:

P(X | Y) = P(Y | X) P(X) / P(Y)

Proof:

P(X | Y) P(Y) = P(X ∧ Y) = P(Y | X) P(X)

[Venn diagram: event space containing overlapping sets X and Y]

Page 10: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example: Using Bayes Rule

Hypotheses:

h = “message m is spam”

¬h = “message m is not spam”

Data:

+ = message m contains “viagra”

– = message m does not contain “viagra”

Prior probability:

P(h) = 0.1    P(¬h) = 0.9

Likelihood:

P(+ | h) = 0.6    P(– | h) = 0.4

P(+ | ¬h) = 0.03    P(– | ¬h) = 0.97

Page 11: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

P(+) = P(+ | h) P(h) + P(+ | ¬h) P(¬h)

= 0.6 * 0.1 + 0.03 * 0.9 = 0.087 ≈ 0.09

P(–) = 1 – P(+) = 0.913

P(h | +) = P(+ | h) P(h) / P(+)

= 0.6 * 0.1 / 0.087 ≈ 0.69

How would we learn these prior probabilities and likelihoods from past examples of spam and not spam?
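Before moving on, here is a minimal Python sketch (not part of the original slides) that reproduces the Bayes-rule arithmetic above with the stated priors and likelihoods.

```python
# Minimal sketch (not from the slides): Bayes rule for the spam example.
p_spam = 0.1                  # P(h)
p_not_spam = 0.9              # P(not h)
p_plus_given_spam = 0.6       # P(+ | h)
p_plus_given_not_spam = 0.03  # P(+ | not h)

# Total probability that a message contains "viagra"
p_plus = p_plus_given_spam * p_spam + p_plus_given_not_spam * p_not_spam

# Posterior probability of spam given that the word appears
p_spam_given_plus = p_plus_given_spam * p_spam / p_plus

print(f"P(+)     = {p_plus:.3f}")              # 0.087
print(f"P(h | +) = {p_spam_given_plus:.2f}")   # about 0.69
```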

Page 12: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Full joint probability distribution (CORRECTED)

             viagra    ¬viagra
  Spam        .06        .04
  ¬Spam       .027       .873

Notation: P(h, D) ≡ P(h ∧ D)

P(h ∧ +) = P(h | +) P(+)

P(h ∧ –) = P(h | –) P(–)

etc.

Page 13: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Now suppose there is a second feature examined: does message contain the word “offer”?

[2 × 4 joint table: rows spam / ¬spam; columns viagra ∧ offer, viagra ∧ ¬offer, ¬viagra ∧ offer, ¬viagra ∧ ¬offer]

The full joint distribution scales exponentially with the number of features.

P(m=spam, viagra, offer)

Page 14: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Bayes optimal classifier for spam:

h_MAP = argmax_{h ∈ {spam, ¬spam}} P(h | f1, f2, ..., fn)

where fi is a feature (here, could be a “keyword”)

• In general, intractable.

Page 15: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Classification using “naive Bayes”

• Assumes that all features are independent of one another.

• How do we learn the naive Bayes model from data?

• How do we apply the naive Bayes model to a new instance?

P(h | f1, f2, ..., fn) ∝ P(h) ∏_{i=1}^{n} P(fi | h)

Page 16: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example: Training and Using Naive Bayes for Classification

• Features:

– CAPS: Boolean (longest contiguous string of capitalized letters in message is longer than 3)

– URL: Boolean (0 if no URL in message, 1 if at least one URL in message)

– $: Boolean (0 if $ does not appear at least once in message; 1 otherwise)

Page 17: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Training data:

M1: “DON’T MISS THIS AMAZING OFFER $$$!” spam

M2: “Dear mm, for more $$, check this out: http://www.spam.com” spam

M3: “I plan to offer two sections of CS 250 next year” not spam

M4: “Hi Mom, I am a bit short on $$ right now, can you send some ASAP? Love, me” not spam

Page 18: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Training a Naive Bayes Classifier

• Two hypotheses: spam or not spam

• Estimate: P(spam) = .5    P(¬spam) = .5

P(CAPS | spam) = .5     P(¬CAPS | spam) = .5

P(URL | spam) = .5      P(¬URL | spam) = .5

P($ | spam) = .75       P(¬$ | spam) = .25

P(CAPS | ¬spam) = .5    P(¬CAPS | ¬spam) = .5

P(URL | ¬spam) = .25    P(¬URL | ¬spam) = .75

P($ | ¬spam) = .5       P(¬$ | ¬spam) = .5

P(h, f1, f2, ..., fn) = P(h) ∏_{i=1}^{n} P(fi | h)

Page 19: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• m-estimate of probability (to fix cases where one of the terms in the product is 0):

P(fi | h) = (nc + m·p) / (n + m)

where n is the number of training examples for which h is true,

nc is the number of training examples for which h is true and fi is true,

p is the prior estimate for P(fi | h),

and m is a constant that determines how heavily to weight p.

For Boolean features, we typically set p = 1/2 and m = 2.

Page 20: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Now classify new message:

M5: “This is a ONE-TIME-ONLY offer that will get you BIG $$$, just click on http://www.spammers.com”
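Here is a minimal Python sketch (not from the slides) of the training and classification steps, using the m-estimate (m = 2, p = 1/2) from the previous slide. The Boolean feature vectors below are my own reading of messages M1-M4 and of the new message.

```python
# Minimal sketch (not from the slides): naive Bayes with m-estimates
# (m = 2, p = 0.5) trained on the four example messages, then used to
# classify the new message.  Feature vectors are my own reading of M1-M5.
training = [
    ({"CAPS": 1, "URL": 0, "$": 1}, "spam"),      # M1
    ({"CAPS": 0, "URL": 1, "$": 1}, "spam"),      # M2
    ({"CAPS": 0, "URL": 0, "$": 0}, "not spam"),  # M3
    ({"CAPS": 1, "URL": 0, "$": 1}, "not spam"),  # M4 ("ASAP" counts as CAPS)
]
features = ["CAPS", "URL", "$"]
m, p = 2, 0.5   # m-estimate parameters for Boolean features

def train(data):
    classes = {label for _, label in data}
    prior, likelihood = {}, {}
    for c in classes:
        examples = [f for f, label in data if label == c]
        n = len(examples)
        prior[c] = n / len(data)
        likelihood[c] = {feat: (sum(e[feat] for e in examples) + m * p) / (n + m)
                         for feat in features}
    return prior, likelihood

def classify(x, prior, likelihood):
    scores = {}
    for c in prior:
        s = prior[c]
        for feat in features:
            p_true = likelihood[c][feat]
            s *= p_true if x[feat] else (1 - p_true)
        scores[c] = s
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

prior, likelihood = train(training)
new_msg = {"CAPS": 1, "URL": 1, "$": 1}   # the ONE-TIME-ONLY / http / $$$ message
print(classify(new_msg, prior, likelihood))   # spam wins, roughly 0.75 vs 0.25
```

The learned likelihoods match the table on the training slide (e.g., P($ | spam) = .75, P(URL | ¬spam) = .25).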

Page 21: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Information Retrieval

• Most important concepts:

– Defining features of a document

– Indexing documents according to features

– Retrieving documents in response to a query

– Ordering retrieved documents by relevance

• Early search engines:

– Features: List of all terms (keywords) in document (minus “a”, “the”, etc.)

– Indexing: by keyword

– Retrieval: by keyword match with query

– Ordering: by number of keywords matched

• Problems with this approach

Page 22: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Naive Bayesian Document Retrieval

• Let D be a document (“bag of words”), Q be a query (“bag of words”), and r be the event that D is relevant to Q.

• In document retrieval, we want to compute: P(r | D, Q)

• Or, the “odds ratio”:

P(r | D, Q) / P(¬r | D, Q)

• In the book, they show (via a lot of algebra) that

P(r | D, Q) / P(¬r | D, Q) = [P(Q | D, r) P(r | D)] / [P(Q | D, ¬r) P(¬r | D)]

where P(r | D) is the intrinsic “relevance” of the document.

• Chain rule: P(A, B) = P(A | B) P(B)

Page 23: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)


Page 24: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Naive Bayesian Document retrieval

• Where Qj is the jth keyword in the query.

• The probability of a query given a relevant document D is estimated as the product of the probabilities of each keyword in the query, given the relevant document.

• How to learn these probabilities?

P(Q | D, r) = ∏_j P(Qj | D, r)
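As an illustration only, here is a small Python sketch (not from the slides) that scores documents by the product above. Estimating P(Qj | D, r) as a smoothed keyword frequency in D is my own simplifying assumption, not necessarily how the book learns these probabilities.

```python
# Minimal sketch (not from the slides): rank documents by prod_j P(Qj | D, r),
# with P(Qj | D, r) approximated as a smoothed keyword frequency in D.
from collections import Counter

def keyword_prob(word, doc_words, k=0.5):
    counts = Counter(doc_words)
    return (counts[word] + k) / (len(doc_words) + k * len(counts))

def query_likelihood(query, doc):
    doc_words = doc.lower().split()
    score = 1.0
    for q in query.lower().split():
        score *= keyword_prob(q, doc_words)
    return score

docs = ["the cat sat on the mat", "bayesian networks for document retrieval"]
query = "document retrieval"
ranked = sorted(docs, key=lambda d: query_likelihood(query, d), reverse=True)
print(ranked[0])   # "bayesian networks for document retrieval"
```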

Page 25: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Evaluating Information Retrieval Systems

• Precision and Recall

• Example: Out of corpus of 100 documents, query has following results:

• Precision: Fraction of relevant documents in results set = 30/40 = .75 “How precise is results set?”

• Recall: Fraction of relevant documents in whole corpus that are in results set = 30/50 = .60 “How many relevant documents were recalled?”

                 In results set    Not in results set
Relevant               30                  20
Not relevant           10                  40
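A tiny Python sketch (not from the slides) of the two calculations above:

```python
# Minimal sketch (not from the slides): precision and recall for the counts above.
relevant_retrieved = 30
retrieved = 30 + 10          # everything in the results set
relevant_total = 30 + 20     # all relevant documents in the corpus

precision = relevant_retrieved / retrieved       # 0.75
recall = relevant_retrieved / relevant_total     # 0.60
print(precision, recall)
```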

Page 26: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Tradeoff between recall and precision:

If we want to ensure that recall is high, just return a lot of documents. Then precision may be low. If we return 100% of the documents but only 50% are relevant, then recall is 1 but precision is 0.5.

If we want a good chance that precision is high, just return the single document judged most relevant (“I’m feeling lucky” in Google). Then precision will (likely) be 1.0, but recall will be low.

When do you want high precision? When do you want high recall?

Page 27: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Bayesian approaches to knowledge representation and reasoning

Part 2 (Chapter 14, sections 1-4)

Page 28: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Recall Naive Bayes method:

P(h | f1, ..., fn) ∝ P(h) ∏_{i=1}^{n} P(fi | h)

• This can also be written in terms of “cause” and “effect”:

P(Cause | effect1, ..., effectn) ∝ P(Cause) ∏_{i=1}^{n} P(effecti | Cause)

Page 29: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

[Figure: two graphical models over the same variables]

Naive Bayes: Spam (the cause) with arrows to its effects v1agra, stock, offer.

Bayesian network: Spam with arrows to v1agra and stock; stock with an arrow to offer.

Page 30: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Each node has a “conditional probability table” that gives its dependencies on its parents.

Spam → v1agra, Spam → stock, stock → offer

P(Spam) = 0.1

Spam    P(v1agra)
  t        0.6
  f        0.03

Spam    P(stock)
  t        0.2
  f        0.3

stock   P(offer)
  t        0.6
  f        0.1

Page 31: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Semantics of Bayesian networks

• If network is correct, can calculate full joint probability distribution from network.

P(X1 = x1 ∧ ... ∧ Xn = xn) ≡ P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | parents(Xi))

where parents(Xi) denotes specific values of the parents of Xi.

Full joint distribution over Spam, stock, offer:

                     stock                  ¬stock
               offer     ¬offer       offer      ¬offer
  Spam         .012       .008         .008        .072
  ¬Spam        .162       .108         .063        .567

Sum of all boxes is 1.
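A minimal Python sketch (not from the slides) that rebuilds the joint table above from the conditional probability tables on the previous slide (v1agra is omitted, as in the table):

```python
# Minimal sketch (not from the slides): full joint over (Spam, stock, offer)
# computed as the product of each variable's CPT entry given its parents.
from itertools import product

P_spam = 0.1
P_stock_given_spam = {True: 0.2, False: 0.3}     # P(stock = true | Spam)
P_offer_given_stock = {True: 0.6, False: 0.1}    # P(offer = true | stock)

def bern(p_true, value):
    """Probability of a Boolean value, given P(value = true)."""
    return p_true if value else 1 - p_true

joint = {}
for spam, stock, offer in product([True, False], repeat=3):
    joint[(spam, stock, offer)] = (
        bern(P_spam, spam)
        * bern(P_stock_given_spam[spam], stock)
        * bern(P_offer_given_stock[stock], offer)
    )

print(joint[(True, True, True)])   # 0.012, matching the table
print(sum(joint.values()))         # 1.0
```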

Page 32: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example from textbook

• I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?

• Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

• Network topology reflects "causal" knowledge:

– A burglar can set the alarm off

– An earthquake can set the alarm off

– The alarm can cause Mary to call

– The alarm can cause John to call

Page 33: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example continued

Page 34: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Complexity of Bayesian Networks

• For n random Boolean variables:

– Full joint probability distribution: 2^n entries

– Bayesian network with at most k parents per node:

• Each conditional probability table: at most 2^k entries

• Entire network: at most n · 2^k entries
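To make the gap concrete, here is a tiny Python sketch (not from the slides); n = 30 and k = 5 are arbitrary illustrative values.

```python
# Minimal sketch (not from the slides): table sizes for n = 30 Boolean variables
# when every node has at most k = 5 parents.
n, k = 30, 5
print(2 ** n)        # 1073741824 entries in the full joint distribution
print(n * 2 ** k)    # 960 entries across the network's conditional probability tables
```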

Page 35: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Exact inference in Bayesian networks

Query:

What is P(Burglary | JohnCalls=true ^ MaryCalls = true)?

Notation: Capital letters are distributions; lower case letters are values or variables, depending on context.

We have:

P(B | j, m) = P(B, j, m) / P(j, m)

            = α P(B, j, m)

            = α Σ_{e ∈ {true,false}} Σ_{a ∈ {true,false}} P(B, e, a, j, m)

Page 36: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Let’s calculate this for b = “Burglary = true”:

P(b | j, m) = α Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)

Worst-case complexity: O(n · 2^n), where n is the number of Boolean variables.

We can simplify by moving terms outside the sums:

P(b | j, m) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)

Page 37: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

A. Onisko et al., A Bayesian network model for diagnosis of liver disorders

Page 38: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Can speed up further via “variable elimination”.

However, bottom line on exact inference:

In general, it’s intractable. (Exponential in n.)

Solution:

Approximate inference, by sampling.

Page 39: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Bayesian approaches to knowledge representation and reasoning

Part 3 (Chapter 14, section 5)

Page 40: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

What are the advantages of Bayesian networks?

• Intuitive, concise representation of joint probability distribution (i.e., conditional dependencies) of a set of random variables.

• Represents “beliefs and knowledge” about a particular class of situations.

• Efficient (?) (approximate) inference algorithms

• Efficient, effective learning algorithms

Page 41: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Review of exact inference in Bayesian networks

General question: What is P(x|e)?

Example Question: What is P(c| r,w)?

Page 42: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

General question: What is P(x|e)?

P(x | e) = P(x, e) / P(e)                              (definition of conditional probability)

         = α P(x, e)                                   (α is a normalization factor)

         = α Σ_y P(x, e, y)                            (where Y are the non-evidence variables)

         = α Σ_y ∏_{z ∈ {x, e, y}} P(z | parents(Z))   (semantics of Bayesian networks)

Page 43: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Event space

Page 44: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Event space

Cloudy

Page 45: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Event space

Cloudy

Rain

Page 46: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Event space

Cloudy

Sprinkler

Rain

Page 47: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Event space

Cloudy

Sprinkler

Wet Grass

Rain

Page 48: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Event space

Cloudy

Sprinkler

Wet Grass

Rain

Page 49: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

P(c | r, w) = α Σ_{s ∈ {true,false}} P(c, r, w, s)

            = α Σ_s ∏_{z ∈ {c, r, w, s}} P(z | parents(Z))

            = α Σ_s P(c) P(r | c) P(s | c) P(w | s, r)

            = α (0.5 * 0.8 * 0.1 * 0.99 + 0.5 * 0.8 * 0.9 * 0.9)

            = α (0.3636)

Similarly, P(¬c | r, w) = α (0.0945)

P(C | r, w) = α ⟨0.3636, 0.0945⟩ = ⟨0.3636 / (0.3636 + 0.0945), 0.0945 / (0.3636 + 0.0945)⟩ ≈ ⟨0.794, 0.206⟩

[Figure: event space with regions Cloudy, Sprinkler, Rain, Wet Grass]
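A minimal Python sketch (not from the slides) that reproduces this enumeration; the CPT values are the ones appearing in the arithmetic above.

```python
# Minimal sketch (not from the slides): P(Cloudy | rain, wet grass) by summing
# out Sprinkler, using the CPT values read off this slide's arithmetic.
P_c = 0.5                                   # P(Cloudy = true)
P_r_given_c = {True: 0.8, False: 0.2}       # P(Rain = true | Cloudy)
P_s_given_c = {True: 0.1, False: 0.5}       # P(Sprinkler = true | Cloudy)
P_w_given_s_r = {(True, True): 0.99,        # P(WetGrass = true | Sprinkler, Rain = true)
                 (False, True): 0.90}

def unnormalized(c):
    """(1/alpha) * P(c, Rain = true, WetGrass = true), summing over Sprinkler."""
    p_c = P_c if c else 1 - P_c
    total = 0.0
    for s in (True, False):
        p_s = P_s_given_c[c] if s else 1 - P_s_given_c[c]
        total += p_s * P_w_given_s_r[(s, True)]
    return p_c * P_r_given_c[c] * total

scores = {c: unnormalized(c) for c in (True, False)}   # {True: 0.3636, False: 0.0945}
alpha = 1 / sum(scores.values())
print({c: round(alpha * v, 3) for c, v in scores.items()})   # {True: 0.794, False: 0.206}
```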

Page 50: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Draw the expression tree for

P(c) P(r | c) Σ_s P(s | c) P(w | s, r)

• Worst-case complexity is exponential in n (number of nodes)

• Problem is having to enumerate all possibilities for many variables.

Page 51: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Issues in Bayesian Networks

• Building / learning network topology

• Assigning / learning conditional probability tables

• Approximate inference via sampling

Page 52: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Real-World Example 1: The Lumière Project at Microsoft Research

• Bayesian network approach to answering user queries about Microsoft Office.

• “At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently.”

• “As an example, users working with the Excel spreadsheet might have required assistance with formatting “a graph”. Unfortunately, Excel has no knowledge about the common term, “graph,” and only considered in its keyword indexing the term “chart”.

Page 53: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 54: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Networks were developed by experts from user modeling studies.

Page 55: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 56: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Offspring of project was Office Assistant in Office 97.

Page 57: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Real-World Example 2: Diagnosing liver disorders with Bayesian networks

• Variables: “disorder class” (16 possibilities) plus 93 features from an existing database of patient records.

• Data: 600 patient records, which used those features

• Network structure: designed by “domain experts” (30 hours)

Page 58: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

A. Onisko et al., A Bayesian network model for diagnosis of liver disorders

Page 59: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Prior and conditional probability distributions were learned from data in liver-disorders database.

• Problem: Data doesn’t give enough samples for good conditional probability estimates.

• For combinations of parent values that are not adequately sampled, assume uniform distribution over those values.

Page 60: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Results

“number of observations” = number of evidence variables in query

window = n means that classification is counted as correct if it is in the n most probable diagnoses given by the network for the given evidence values.

Page 61: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Approximate inference in Bayesian networks

Instead of enumerating all possibilities, sample to estimate probabilities.

X1 X2 X3 Xn

...

Page 62: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Direct Sampling

• Suppose we have no evidence, but we want to determine P(c,s,r,w) for all c,s,r,w.

• Direct sampling:

– Sample each variable in topological order, conditioned on values of parents.

– I.e., always sample from P(Xi | parents(Xi))

Page 63: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example

1. Sample from P(Cloudy). Suppose it returns true.

2. Sample from P(Sprinkler | Cloudy = true). Suppose it returns false.

3. Sample from P(Rain | Cloudy = true). Suppose it returns true.

4. Sample from P(WetGrass | Sprinkler = false, Rain = true). Suppose it returns true.

Here is the sampled event: [true, false, true, true]
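A minimal Python sketch (not from the slides) of direct sampling on this network. The CPT values follow the textbook sprinkler network used in these slides; the P(WetGrass | ...) entries for Rain = false are assumed from the textbook, since they do not appear in this transcript.

```python
# Minimal sketch (not from the slides): direct (prior) sampling for the
# sprinkler network, sampling each variable in topological order given its parents.
import random

def bern(p):
    return random.random() < p

def prior_sample():
    cloudy = bern(0.5)
    sprinkler = bern(0.1 if cloudy else 0.5)
    rain = bern(0.8 if cloudy else 0.2)
    wet = bern({(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.0}[(sprinkler, rain)])
    return cloudy, sprinkler, rain, wet

# Estimate P(c, s, r, w) by counting sample frequencies, as on the next slide.
N = 100_000
counts = {}
for _ in range(N):
    event = prior_sample()
    counts[event] = counts.get(event, 0) + 1

# Estimate of P(cloudy, no sprinkler, rain, wet); true value is 0.5*0.9*0.8*0.9 = 0.324
print(counts[(True, False, True, True)] / N)
```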

Page 64: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Suppose there are N total samples, and let NS (x1, ..., xn) be the observed frequency of the specific event x1, ..., xn.

• Suppose N samples, n nodes. Complexity O(Nn).

• Problem 1: Need lots of samples to get good probability estimates.

• Problem 2: Many samples are not realistic; low likelihood.

lim_{N→∞} N_S(x1, ..., xn) / N = P(x1, ..., xn)

so for large N we use the estimate:

N_S(x1, ..., xn) / N ≈ P(x1, ..., xn)

Page 65: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Likelihood weighting

• Now suppose we have evidence e. Thus values for the evidence variables E are fixed.

• We want to estimate P(X | e)

• Need to sample X and Y, where Y is the set of non-evidence variables.

• Each event sampled is weighted by the likelihood that that event accords to the evidence.

• I.e., events in which the actual evidence appears unlikely should be given less weight.

Page 66: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example:

Estimate P(Rain | Sprinkler = true, WetGrass = true).

WeightedSample algorithm:

1. Set weight w = 1.0

2. Sample from Cloudy. Suppose it returns true.

3. Sprinkler is an evidence variable with value true. Update likelihood weighting:

Low likelihood for sprinkler if cloudy is true, so this sample gets lower weight.

w ← w × P(Sprinkler = true | Cloudy = true) = 1.0 × 0.1 = 0.1

Page 67: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

4. Sample from P(Rain | Cloudy = true). Suppose this returns true.

5. WetGrass is an evidence variable with value true. Update likelihood weighting:

6. Return event [true, true, true, true] with weight 0.099.

Weight is low because cloudy = true, so sprinkler is unlikely to be true.

w ← w × P(WetGrass = true | Sprinkler = true, Rain = true) = 0.1 × 0.99 = 0.099
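A minimal Python sketch (not from the slides) of likelihood weighting for this query, with the same assumed CPT values as before (the WetGrass row for Rain = false comes from the textbook, not this transcript):

```python
# Minimal sketch (not from the slides): likelihood weighting for
# P(Rain | Sprinkler = true, WetGrass = true) on the sprinkler network.
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}   # P(WetGrass = true | Sprinkler, Rain)

def weighted_sample():
    w = 1.0
    cloudy = random.random() < 0.5
    # Sprinkler is evidence fixed to true: multiply in its likelihood.
    w *= 0.1 if cloudy else 0.5
    rain = random.random() < (0.8 if cloudy else 0.2)
    # WetGrass is evidence fixed to true: multiply in its likelihood.
    w *= P_W[(True, rain)]
    return rain, w

def estimate(n=100_000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        rain, w = weighted_sample()
        totals[rain] += w
    z = totals[True] + totals[False]
    return {r: t / z for r, t in totals.items()}

print(estimate())   # roughly {True: 0.32, False: 0.68}
```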

Page 68: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 69: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Problem with likelihood weighting

• As number of evidence variables increases, performance degrades. This is because most samples will have very low weights, so weighted estimate will be dominated by fraction of samples that accord more than an infinitesimal likelihood to the evidence.

Page 70: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Markov Chain Monte Carlo Sampling

• One of most common methods used in real applications.

• Uses idea of “Markov blanket” of a variable Xi:

– parents, children, children’s parents

• Recall that, by construction of a Bayesian network, a node is conditionally independent of its non-descendants, given its parents.

Page 71: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Proposition: A node Xi is conditionally independent of all other nodes in the network, given its Markov blanket.

– Example.

– Need to show that Xi is conditionally independent of nodes outside its Markov blanket.

– Need to show that Xi can be conditionally dependent on children’s parents.

Page 72: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example:

[Figure: network with edges A → B, C → B, B → E, E → F]

The proposition says: B is conditionally independent of F given A, C, E.

This can only be true if

P(B | A,C,E,F) = P(B | A, C, E)

Page 73: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Prove: P(B | A, C, E, F) = P(B | A, C, E)

We know, by definition of conditional probability:

P(B | A, C, E, F) = P(A, B, C, E, F) / P(A, C, E, F)

From the tree we have:

(1) P(A, B, C, E, F) = P(A) P(C) P(B | A, C) P(E | B) P(F | E)

(2) P(A, C, E, F) = Σ_b P(A) P(C) P(b | A, C) P(E | b) P(F | E)
                  = P(A) P(C) P(F | E) Σ_b P(b | A, C) P(E | b)

Page 74: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Thus:

P(B | A, C, E, F) = [P(A) P(C) P(B | A, C) P(E | B) P(F | E)] / [P(A) P(C) P(F | E) Σ_b P(b | A, C) P(E | b)]

                  = [P(B | A, C) P(E | B)] / [Σ_b P(b | A, C) P(E | b)]

Page 75: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Now compute P(B | A, C, E):

P(B | A, C, E) = P(A, B, C, E) / P(A, C, E)

(1) P(A, B, C, E) = P(A) P(C) P(B | A, C) P(E | B)

(2) P(A, C, E) = Σ_b P(A) P(C) P(b | A, C) P(E | b)

Thus:

P(B | A, C, E) = [P(B | A, C) P(E | B)] / [Σ_b P(b | A, C) P(E | b)] = P(B | A, C, E, F)

Q.E.D.

Page 76: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Markov Chain Monte Carlo Sampling

• Start with random sample from variables: (x1, ..., xn). This is the current “state” of the algorithm.

• Next state: Randomly sample value for one non-evidence variable Xi , conditioned on current values in “Markov Blanket” of Xi.

Page 77: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example

• Query: What is P(Rain | Sprinkler = true, WetGrass = true)?

• MCMC: – Random sample, with evidence variables fixed:

[true, true, false, true]

– Repeat:

1. Sample Cloudy, given current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose result is false. New state:

[false, true, false, true]

2. Sample Rain, given current values of its Markov blanket:

Cloudy = false, Sprinkler = true, WetGrass = true. Suppose

result is true. New state: [false, true, true, true].
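A minimal Python sketch (not from the slides) of this Gibbs-style MCMC loop for P(Rain | Sprinkler = true, WetGrass = true), again with the textbook sprinkler CPT values assumed:

```python
# Minimal sketch (not from the slides): Gibbs sampling for
# P(Rain | Sprinkler = true, WetGrass = true).  Each step resamples one
# non-evidence variable conditioned on the current values of its Markov blanket.
import random

P_S = {True: 0.1, False: 0.5}                      # P(Sprinkler = true | Cloudy)
P_R = {True: 0.8, False: 0.2}                      # P(Rain = true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}   # P(WetGrass = true | Sprinkler, Rain)

def bern_from_odds(p_true, p_false):
    return random.random() < p_true / (p_true + p_false)

def gibbs(n=100_000):
    cloudy, rain = True, False       # arbitrary initial state; evidence s = w = true
    counts = {True: 0, False: 0}
    for _ in range(n):
        # Sample Cloudy given its Markov blanket {Sprinkler, Rain}.
        p_t = 0.5 * P_S[True] * (P_R[True] if rain else 1 - P_R[True])
        p_f = 0.5 * P_S[False] * (P_R[False] if rain else 1 - P_R[False])
        cloudy = bern_from_odds(p_t, p_f)
        # Sample Rain given its Markov blanket {Cloudy, Sprinkler, WetGrass}.
        p_t = P_R[cloudy] * P_W[(True, True)]
        p_f = (1 - P_R[cloudy]) * P_W[(True, False)]
        rain = bern_from_odds(p_t, p_f)
        counts[rain] += 1
    return {r: c / n for r, c in counts.items()}

print(gibbs())   # roughly {True: 0.32, False: 0.68}
```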

Page 78: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Each sample contributes to estimate for query

P(Rain | Sprinkler = true, WetGrass = true)

• Suppose we perform 100 such samples, 20 with Rain = true and 80 with Rain = false.

• Then answer to the query is

Normalize(⟨20, 80⟩) = ⟨0.20, 0.80⟩

• Claim: “The sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability, given the evidence.”

• Proof of claim is on pp. 517-518.

Page 79: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 80: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Claim (again)

• Claim: MCMC settles into behavior in which each state is sampled exactly according to its posterior probability, given the evidence.

• That is: for all variables Xi, the probability of the value xi of Xi appearing in a sample is equal to P(xi | e).

Page 81: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Proof of Claim (outline)

First, give example of Markov chain.

Now:

Let x be a state, with x = (x1, ..., xn).

Let q(x → x′) be the transition probability from state x to state x′.

Let π_t(x) be the probability that the system will be in state x after t time steps, starting from state x0.

Let π_{t+1}(x′) be the probability that the system will be in state x′ after t + 1 time steps, starting from state x0.

Page 82: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

We have:

π_{t+1}(x′) = Σ_x π_t(x) q(x → x′)

Page 83: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Definition:

π is called the Markov process’s stationary distribution if π_t = π_{t+1} for all x.

Result from Markov chain theory: Given q, there is exactly one such stationary distribution (assuming q is “ergodic”).

Defining equation for the stationary distribution:

(1)   π(x′) = Σ_x π(x) q(x → x′)   for all x′

Page 84: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

One way to satisfy equation (1) is:

π(x) q(x → x′) = π(x′) q(x′ → x)   for all x, x′

This is called the property of detailed balance.

Detailed balance implies stationarity:

Σ_x π(x) q(x → x′) = Σ_x π(x′) q(x′ → x) = π(x′) Σ_x q(x′ → x) = π(x′)

Page 85: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Proof of claim:

Show that the transition probability q(x → x′) defined by MCMC sampling satisfies the detailed balance equation, with a stationary distribution equal to P(x | e).

Let Xi be the variable to be sampled. Let e be the values of the evidence variables and let Y be the other non-evidence variables.

Current state: x = (xi, y), with the evidence variables fixed at e. A proposed next state has the form x′ = (xi′, y).

We have, by definition of the MCMC sampling algorithm:

q(x → x′) = q((xi, y) → (xi′, y)) = P(xi′ | y, e)

q(x′ → x) = q((xi′, y) → (xi, y)) = P(xi | y, e)

Page 86: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Now, show this transition probability produces detailed balance.

We want to show:

If π(x) q(x → x′) = π(x′) q(x′ → x) holds with π(x) = P(x | e), then P(x | e) is the stationary distribution.

Page 87: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

π(x) q(x → x′) = P(x | e) q(x → x′)

Since q(x → x′) = P(xi′ | y, e), we have:

π(x) q(x → x′) = P(xi, y | e) P(xi′ | y, e)

               = P(xi | y, e) P(y | e) P(xi′ | y, e)     (chain rule on the first term)

               = P(xi | y, e) P(xi′, y | e)              (backwards chain rule on terms 2 and 3)

               = q(x′ → x) P(x′ | e)

               = π(x′) q(x′ → x)

Q.E.D.

Page 88: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Speech Recognition (Section 15.6)

• Task: Identify sequence of words uttered by speaker, given acoustic signal.

• Uncertainty introduced by noise, speaker error, variation in pronunciation, homonyms, etc.

• Thus speech recognition is viewed as problem of probabilistic inference.

Page 89: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Speech Recognition

• So far, we’ve looked at probabilistic reasoning in static environments.

• Speech: Time sequence of “static environments”.

– Let X be the “state variables” (i.e., set of non-evidence variables) describing the environment (e.g., Words said during time step t)

– Let E be the set of evidence variables (e.g., S = features of acoustic signal).

Page 90: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

– The values of E and the joint probability distribution over X change over time:

t1: X1, e1

t2: X2 , e2

etc.

Page 91: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• At each t, we want to compute P(Words | S).

• We know from Bayes rule:

• P(S | Words), for all words, is a previously learned “acoustic model”.

– E.g. For each word, probability distribution over phones, and for each phone, probability distribution over acoustic signals (which can vary in pitch, speed, volume).

• P(Words), for all words, is the “language model”, which specifies prior probability of each utterance.

– E.g. “bigram model”: probability of each word following each other word.

P(Words | S) = α P(S | Words) P(Words)
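For illustration, here is a tiny Python sketch (not from the slides) of a bigram language model P(Words); the toy corpus and add-k smoothing are my own assumptions.

```python
# Minimal sketch (not from the slides): a bigram language model estimated from
# a toy corpus, giving P(w_i | w_{i-1}) and P(w1, ..., wn).
from collections import Counter

corpus = "can i have something to drink can i have water".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)

def p_next(word, prev, k=1.0):
    # Add-k smoothed estimate of P(word | prev).
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

def p_words(words):
    # P(w1, ..., wn) ~ P(w1) * prod_i P(w_i | w_{i-1})  (bigram model)
    p = unigrams[words[0]] / len(corpus)
    for prev, w in zip(words, words[1:]):
        p *= p_next(w, prev)
    return p

print(p_words("can i have something to drink".split()))
```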

Page 92: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

• Speech recognition typically makes three assumptions:

1. Process underlying change is itself “stationary”

i.e., state transition probabilities don’t change

2. Current state X depends on only a finite history of previous states (“Markov assumption”).

– Markov process of order n: Current state depends only on n previous states.

3. Values et of evidence variables depend only on current state Xt. (“Sensor model”)

Page 93: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 94: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 95: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 96: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)
Page 97: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)

Example: “I’m firsty, um, can I have something to dwink?”

Page 98: Bayesian approaches to knowledge representation and reasoning Part 1 (Chapter 13)