
Uncertainty Reasoning in Knowledge Systems

The Bayesian approach

Herman Bruyninckx
Department of Mechanical Engineering

K.U.Leuven, Belgium

Autumn 2007

H. Bruyninckx, The Bayesian approach 1

Further reading

++ Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006. (Permanently available in CBA library: 1 681.3*I5 2006.)

+ Jensen, Finn V. and Nielsen, Thomas D., Bayesian Networks and Decision Graphs, 2001. (Available in CBA library: 1 519.2 2001.)

+ David MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press. (Available in CBA library: 1 519.72 2003. Available electronically at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html.)

H. Bruyninckx, The Bayesian approach 2

Further reading (2)

- Sebastian Thrun, Wolfram Burgard and Dieter Fox, Probabilistic Robotics, MIT Press, 2005. (Available in CBA library: 1 681.3* I29 2005.)

- Wikipedia (http://en.wikipedia.org) contains many good articles on Bayesian concepts and algorithms, and a (still rather poor) overview: http://en.wikipedia.org/wiki/Bayesian_probability.

- Kevin P. Murphy, A Brief Introduction to Graphical Models and Bayesian Networks, 1998. Available electronically at http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html.

H. Bruyninckx, The Bayesian approach 3


Further reading (3)

- On-line facsimile of Bayes' paper: http://www.stat.ucla.edu/history/essay.pdf.

- Stephen E. Fienberg, When Did Bayesian Inference Become "Bayesian"?, Bayesian Analysis, 1(1):1-40, 2006. http://lib.stat.cmu.edu/~fienberg/fienberg-BA-06-Bayesian.pdf

H. Bruyninckx, The Bayesian approach 4

What is URKS?
Knowledge System (intelligent system; reasoning system):

- A computer program that can answer questions about the information in a real-world system (engineered, social).
- It uses relationships between variables that represent the real-world system in a mathematical way.
- Learning of the variables as well as of the relationships is an important (but not necessary) part of a knowledge system.
- Examples: weather forecasting; traffic speed detector; medical diagnosis system; legal court reasoning system; robot moving around in an unknown environment; GPS navigation system; ...
- Counter-examples: lecture room automation system.

H. Bruyninckx, The Bayesian approach 5

Uncertainty (information, probability, possibility, plausibility, chance, ...):

- There can be uncertainty about: the values of the variables; the structure of the relationships; the most appropriate relationship in a set of possible relationships; etc.
- Examples: predicted weather; estimated speed of a car; identification of the car that drove too fast; probability of being guilty; correct physical theory of the origin of the universe; position of a car on the map; probability that a patient will die from cancer within six months; ...
- Counter-examples: value of the 1000th digit of π; number of millimeters in one meter; my name is "Herman"; ...

H. Bruyninckx, The Bayesian approach 5


Reasoning:

- The system executes the "most appropriate" algorithm to find the answer to a question I pose.
- The questions can be of a very variable nature: What happens when I give this input? How can the system incorporate this new information? What would be the optimal structure of the system for these information needs? ...
- Learning is one important example of "reasoning".
- Decision making is another important example.
- How to reason about concepts like: opinion; act of God; personal bias; prior knowledge; ...

H. Bruyninckx, The Bayesian approach 5

Overview of the Bayesian approach

1. Modelling - How is the information represented?
   - Structure/Relationships.
   - Values.
2. Reasoning/Information processing - What questions can be asked?
   - learning (both values and structure), data association, pattern recognition, classification, clustering, model building (= learning structure), ...
3. Decision making - What do you do with the answer? (Beyond the scope of this course.)

H. Bruyninckx, The Bayesian approach 6

1. Modelling

1.1 Mathematical representation of the problem domain = parameters ("random variables") + relationships ("graphical model," "(Bayesian) network").

H. Bruyninckx, The Bayesian approach 7


1.2 Information

- Probability Density Function (PDF) over the parameter space: a function instead of a single number.
- Subjective representation of the human's knowledge about, or interpretation of, the world.

[Figure: surface plot of a two-dimensional Gaussian PDF over position, captioned "Where am I?"]

H. Bruyninckx, The Bayesian approach 8

Joint PDF:

- captures the uncertain/stochastic dependence between parameters.
- Graphical model = "bookkeeping" structure for the computations of information processing = factorization of the joint PDF:

p(a, b, c, x, ..., α, β, η, θ) = p(x, y, z) p(c|x, y, z) p(a, b) ...

H. Bruyninckx, The Bayesian approach 9

2. Information processing

2.1 Inference (deduction, abduction).
2.2 Learning:
   - parameters in a given model.
   - structure of a new model.
2.3 Estimation (data reduction).

Major computational algorithms:

- Bayes' rule: incorporate new information.
- Marginalization: reduce existing information.
- Belief propagation in a network: transport information.

H. Bruyninckx, The Bayesian approach 10


3. Decision making

3.1 Utility functions:
   - Give a "value" to the information in variables...
   - ...to translate information into action.
   - Do not influence the information content in the network. (At least not directly.)
   - Arbitrary!
3.2 Attached "somewhere" to the graphical model.
3.3 The computation of decision making need not be synchronised with the computation of information processing.

This course: only estimation and model selection.

H. Bruyninckx, The Bayesian approach 11

Graphical models

H. Bruyninckx, The Bayesian approach 12

Graphical Models

- Best reference: Bishop 2006.
- Graphical representation of the structure of relationships:
   - a node contains variables.
   - an arc (edge, link, arrow, ...) contains the mathematical representation of a relationship.
- The model describes (a simplification of) the world in a coordinate-free way. The same model can have various mathematical representations as well as various algorithmic implementations.
- Not limited to the Bayesian paradigm.

H. Bruyninckx, The Bayesian approach 13


Joint PDF

- Bayesian paradigm: each node = a joint PDF.
- The graphical model encodes conditional independence: X and Y are conditionally independent if knowing Y doesn't give any information about X. This gives (computational) structure to a graphical model: it is useless to do computations that involve X if one wants to know things about Y, so one can "cut off" parts of the model.
- If X and Y are conditionally independent given Z, then Y gives us no new information about X once we know Z.

H. Bruyninckx, The Bayesian approach 14

Directed arc — “Arrow”

- Explicit relationship z = f(x, y).
- Arrow:
   - indicates which direction is "easy" to calculate.
   - is not necessarily physical causality.
- Factorization of the joint PDF:

p(x, y, z) = p(z|x, y) p(x, y).

H. Bruyninckx, The Bayesian approach 15

Undirected arc

- Implicit relationship f(x, y, z) = 0.
- Factorization: it is not possible to write p(x, y, z) as a factorized function of p(x, y) and p(z).
- Computationally more costly than an explicit relationship.

H. Bruyninckx, The Bayesian approach 16


Hidden (“latent”) variable

Sometimes one uses a graphical indication of which parameters are "measured" and which parameters are "hidden":

H. Bruyninckx, The Bayesian approach 17

Network with cycle

H. Bruyninckx, The Bayesian approach 18

Graphical models & Bayes paradigm

- Undirected arcs: "Markov Random Field"
- Directed arcs: "Bayesian/Belief Network"
- Both: "Chain Graph"
- Wider applicability than only the Bayesian paradigm!
- Difference with neural networks: the interpretation of nodes and edges in the context of the model.
- Most efficient: Directed Acyclic Graph (DAG). (Serialization of computations!)
- Can be extended with decision-making nodes.

H. Bruyninckx, The Bayesian approach 19


Directed Acyclic Graph (DAG)

p(A, B, ..., G) = p(A) p(B) p(C|A, B) p(E|C) p(D|C) p(F|E) p(G|D, E, F).

H. Bruyninckx, The Bayesian approach 20

Example: burglar alarm

Bayesian knowledge system = graphical model + probability tables/functions over the random variables in the nodes.

H. Bruyninckx, The Bayesian approach 21

Markov Random Field

Undirected arcs.

Two (sets of) nodes A and B are made conditionally independent by a third (set) C if all paths between nodes in A and B are separated by a node in C.

In some cases, extra hidden variables can transform an MRF into a Bayesian network.

H. Bruyninckx, The Bayesian approach 22


Bayes Net: Serial connection - Causal chain

A and C are independent, given the information in B: p(C|B, A) = p(C|B).

The information in B "blocks" the transfer of information between A and C.

A and C are d-separated by B. ("d" stands for "directional".)

Example: knowing that there is an alarm, one knows the chance that Mary calls, independently of whether there has been a burglary or not.

H. Bruyninckx, The Bayesian approach 23

Bayes Net: Diverging connections - Common cause

B, C, ..., X are independent, given the information in A: p(B|A, C, ..., X) = p(B|A).

Example: knowing that there is an alarm, one knows the probabilities that John and Mary call, without them having to know about each other.

Terminology: fan-out = # of A's outgoing arcs.

H. Bruyninckx, The Bayesian approach 24

Bayes Net: Converging connections - Common effect - Explaining away

B, C, ..., X are dependent, only via the information in A: p(A, B, C) = p(B) p(C) p(A|B, C).

The information in B, C, ..., X becomes correlated by the information in their common child parameter A.

Example: if one observes an alarm, then information about whether a burglary has occurred gives information about whether an earthquake has occurred.

Terminology: fan-in = # of A's incoming arcs.

H. Bruyninckx, The Bayesian approach 25


Bayesian Probability Theory

Nice properties:

- Unique: a single answer.
- Consistent: independent of the route to the answer.
- Plausible: an extension of 0/1 logic.

What's in a name?:

- Information = belief = evidence = possibility = probability = ...
- Inference = information processing = belief propagation = ...

H. Bruyninckx, The Bayesian approach 26

Arbitrary choices in modelling

1. Variables. (You must choose them!)
2. Relationships. (You must choose them!)
3. Information = PDF. (You must choose its form!)

H. Bruyninckx, The Bayesian approach 27

What is a PDF?
= a function p(x) over the space of the variable x
+ a measure "dx" that describes the "density" of the domain around the value x.

- Average/expected value weighted by the density:

  ∫_D p(x) dx

- Integral over the whole variable space = 1:
   - the value "1" is an arbitrary choice!
   - only the relative value is important.
- Example of a non-uniform density: longitude + latitude.

H. Bruyninckx, The Bayesian approach 28


(Figure source: Stefan Kuhn, Wikipedia.org.)

Mercator projection: the surface “density” is notconstant over the whole map.

H. Bruyninckx, The Bayesian approach 29

Example PDF: Gaussian

Gaussian (or Normal) PDF, with mean µ = 0 and standard deviation σ = 5, 10, 20, 30.

H. Bruyninckx, The Bayesian approach 30

Example PDF: Multidimensional Gaussian

N(x; µ, P) = 1 / (√((2π)^n) ‖P‖^(1/2)) · exp{ −(1/2) (x − µ)^T P^(−1) (x − µ) }

[Figure: surface plot of a two-dimensional Gaussian PDF.]

H. Bruyninckx, The Bayesian approach 31


Mean µ—Covariance P

µ = ∫ x p(x) dx,    P = ∫ (x − µ)(x − µ)^T p(x) dx

[Figure: contours of a two-dimensional Gaussian, with principal standard deviations σ1 and σ2.]

H. Bruyninckx, The Bayesian approach 32

Gaussian PDFs (2)

Advantages:

- Only two parameters needed (per dimension of the domain).
- Information processing: (often) analytically possible.

Disadvantages:

- mono-modal; uni-variate; only one "peak".
- extends to infinity.

Efficient extensions:

- Sum of Gaussians.
- Exponential PDFs: α h(x) exp{β g(x)}.

H. Bruyninckx, The Bayesian approach 33

Joint PDF

The multi-dimensional, single-valued information function p(X, Y, Z, ...) that describes the "statistical" dependencies between the variables X, Y, Z, ... in a model M.

⇒ p(X, Y, Z, ...|M)

Model M = representation of the relationships between the variables X, Y, Z, ...

H. Bruyninckx, The Bayesian approach 34


Relationships = dependent variables

Two variables X and Y are (statistically) dependent if a change in the value of X is correlated with a change in the value of Y.

Correlation: one can derive (part of) the change in information about X from a change in information about Y.

There is not necessarily a physical causal connection between X and Y.

H. Bruyninckx, The Bayesian approach 35

Dynamic relationships: "state-space" description

dx/dt = f(x, θ, u)
y = g(x, θ, u)

- x: domain values.
- θ: model parameters (PDF, relationships).
- u: inputs.
- y: output values.

Time is an explicit parameter, and the "horizon" of the network moves with time.

H. Bruyninckx, The Bayesian approach 36

Dynamic inference example: Mobile robot in a corridor

H. Bruyninckx, The Bayesian approach 37


Bayesian mathematics
1. Sum rule

p(x + y|H) = p(x|H) + p(y|H) − p(xy|H).

Notation: "+," "or," "∨".

Full version: p(X = x ∨ Y = y|H).

In words: the value of the information one has about the random variable X having the value x or the random variable Y having the value y, given the context H, is the sum of the information about the individual random variables, if both events are mutually exclusive in the given context.

H. Bruyninckx, The Bayesian approach 38

2. Product rule

p(xy|H) = p(x|y, H) p(y|H).

Notation: "multiplication," "and," "∧".

Full version: p(X = x ∧ Y = y|H).

In words: the value of the information one has about the random variable X having the value x and the random variable Y having the value y is the product of the information one has about X given the value of Y, multiplied by the information one has about Y, all in the context H.

H. Bruyninckx, The Bayesian approach 39

2b. Chain rule

Apply product rule multiple times:

p(xyz|H) = p(x|y, z, H) p(y|z, H) p(z|H).

H. Bruyninckx, The Bayesian approach 40


3. Marginalization

p(x|H) = ∫ p(x, y, z, ...|H) dy dz ...

(∫ → Σ for discrete PDFs.)

In words: this finds the information on x, taking into account the information about all other variables y, z, ... with which x is related.

In other words: it "averages out" the influence of all other related variables.

H. Bruyninckx, The Bayesian approach 41

4. Bayes' rule (Inference - "Inverse" probability)

p(xy|H) = p(yx|H)

⇓ (product rule on both sides)

p(x|y, H) p(y|H) = p(y|x, H) p(x|H)

⇓

p(x|y, H) = p(y|x, H) p(x|H) / p(y|H)

H. Bruyninckx, The Bayesian approach 42

Bayes’ rule interpretation

p(Model params|Data, H) = [ p(Data|Model params, H) / p(Data|H) ] × p(Model params|H).

"Posterior = (Conditional data likelihood / Data likelihood) × Prior,"
where the first factor is usually just called the "Likelihood."

The Data are given, so all factors are functions of the model parameters; p(Data|H) is (often) a mere "normalization factor."

H. Bruyninckx, The Bayesian approach 43


Bayes’ rule with Gaussians

H. Bruyninckx, The Bayesian approach 44
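The slide above only shows this fusion graphically. As a minimal numerical sketch (not part of the original slides), the code below applies Bayes' rule to a 1-D Gaussian prior and a 1-D Gaussian measurement; the particular numbers are illustrative assumptions.

```python
# Minimal sketch: Bayes' rule with 1-D Gaussians (illustrative numbers).
# The product of a Gaussian prior and a Gaussian likelihood is again Gaussian.

def fuse_gaussians(mu0, var0, z, var_z):
    """Posterior mean and variance after one Gaussian measurement z of x."""
    k = var0 / (var0 + var_z)        # relative weight of the measurement
    mu_post = mu0 + k * (z - mu0)    # precision-weighted mean
    var_post = (1 - k) * var0        # the posterior variance shrinks
    return mu_post, var_post

mu, var = fuse_gaussians(mu0=0.0, var0=4.0, z=1.0, var_z=1.0)
print(mu, var)   # 0.8 0.8: the posterior is narrower than either factor
```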

Important properties

- A PDF on the Model parameters goes "in" ⇒ a PDF on the Model parameters comes "out."
- Integration of information is multiplicative.
- Computationally intensive for general PDFs.
- Easy for discrete PDFs and Gaussians. (And some other families of continuous PDFs.)
- p(Data|Model params): requires a known table or mathematical function Data = f(Model params) to predict the Data from the Model.
- The likelihood is not a PDF.

H. Bruyninckx, The Bayesian approach 45

Example with discrete PDF (Inverse probability example)

Data:

- In MAI, 31% of the students are non-European.
- At URKS, 50% of the students are non-European.
- At MAI, 48% of the students study URKS.

Question:

- What is the probability that a non-European student at MAI studies URKS?

H. Bruyninckx, The Bayesian approach 46


Bayes rule with discrete PDF (2)

Model:

- A corresponds to "studying URKS in MAI."
- B corresponds to "being non-European in MAI."

Inference:

p(A) = 0.48, p(B) = 0.31, p(B|A) = 0.50

⇒ p(A|B) = p(B|A) p(A) / p(B) = 0.77.

H. Bruyninckx, The Bayesian approach 47
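A one-line numerical check of this inference, as a hedged Python sketch (the probabilities are the ones quoted on the slide):

```python
# Bayes' rule with the discrete numbers from the slide:
# A = "studies URKS in MAI", B = "is non-European in MAI".
p_A, p_B, p_B_given_A = 0.48, 0.31, 0.50
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 2))   # 0.77
```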

Intuitive interpretation of PDFs - The prosecutor's fallacy

The following links contain examples of court reasoning that lead to very wrong conclusions, by misunderstanding the differences between p(x|y) and p(y|x), and between likelihood and PDF:

- http://en.wikipedia.org/wiki/Prosecutor's_fallacy
- http://en.wikipedia.org/wiki/Conditional_probability#The_conditional_probability_fallacy

H. Bruyninckx, The Bayesian approach 48

Inference in graphical models

“Infer hidden X given observed Y!”

Terminology:

- Prediction, deduction, forward reasoning: reason from cause to symptom.
- Diagnosis, abduction, backward reasoning: reason from symptom to cause.

Bayes' rule p(X|Y) = p(Y|X) p(X) / p(Y) "inverts" causality.

H. Bruyninckx, The Bayesian approach 49


Example: burglar alarm (1)
"What is the probability p(J, M, A, ¬B, ¬E) that there is no burglary, nor an earthquake, but that the alarm went off and both John and Mary called?"

p(J, M, A, B, E)
= p(J, M|A, B, E) p(A, B, E)                (product rule)
= p(J|A) p(M|A) p(A|B, E) p(B, E)           (product rule; independence via A)
= p(J|A) p(M|A) p(A|B, E) p(B|E) p(E)       (product rule)
= p(J|A) p(M|A) p(A|B, E) p(B) p(E).        (independence of B and E)

H. Bruyninckx, The Bayesian approach 50

⇒ p(J = T, M = T, A = T, B = F, E = F)
= p(J = T|A = T) p(M = T|A = T) p(A = T|B = F, E = F) p(B = F) p(E = F)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
= 0.00062811

No Bayes' rule is required, because everything is deduction ("forward reasoning").

H. Bruyninckx, The Bayesian approach 51
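A minimal sketch of this forward ("deductive") computation, using only the conditional-probability entries quoted on the slides; the variable names are ad hoc and the full tables of the alarm network are not reproduced here.

```python
# Forward evaluation of p(J, M, A, not-B, not-E) for the burglar-alarm network.
p_J_given_A     = 0.90   # John calls when the alarm goes off
p_M_given_A     = 0.70   # Mary calls when the alarm goes off
p_A_given_nB_nE = 0.001  # alarm without burglary or earthquake
p_not_B         = 0.999
p_not_E         = 0.998

p = p_J_given_A * p_M_given_A * p_A_given_nB_nE * p_not_B * p_not_E
print(p)   # 0.00062811...
```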

Example: burglar alarm (2)

“What is the probability that there is a burglary,given that John calls?”

Bayes' rule: p(B|J) = p(J|B) p(B) / p(J).

1. p(B) = 0.001

2. p(J) = Σ_{A,B,E} p(J, A, B, E)                  (marginalisation)
        = Σ_{A,B,E} p(J|A, B, E) p(A, B, E)        (product rule)
        = Σ_{A,B,E} p(J|A) p(A|B, E) p(B) p(E).    (id.)

H. Bruyninckx, The Bayesian approach 52

Page 19: The Bayesian approach - KU Leuvenpeople.cs.kuleuven.be › ~danny.deschreye › urks-slides2007-1x3.pdf · The Bayesian approach Herman Bruyninckx Department of Mechanical Engineering

This sum

p(J) = Σ_{A,B,E} p(J|A) p(A|B, E) p(B) p(E)

has only non-negligible terms when B and E are False:

p(J) ≈ (0.90 × 0.001 + 0.05 × 0.999) × 0.999 × 0.998 = 0.0507,
with 0.999 = p(B = F) and 0.998 = p(E = F).

H. Bruyninckx, The Bayesian approach 53

3. p(J|B) = Σ_{A,E} p(J, A, E|B)                       (marginalisation)
          = Σ_{A,E} p(J|A, B, E) p(A|B, E) p(E|B)       (product rule)
          = Σ_{A,E} p(J|A) p(A|B, E) p(E)               (independence)
          = 0.90 (0.95 × 0.002 + 0.94 × 0.998) + 0.05 (0.05 × 0.002 + 0.06 × 0.998)
          = 0.849.

⇒ p(B|J) = 0.849 × 0.001 / 0.0507 ≈ 0.0167.

H. Bruyninckx, The Bayesian approach 54

Example: DAG-with-“loop”

p(W, J, B, M) = p(W) p(J|W) p(B|W) p(M|J, B).

H. Bruyninckx, The Bayesian approach 55


Inference example

"What are the probabilities that Mary missed her appointment because there was a traffic jam (p(M = T|J = T))? Or because she has been out too late (p(M = T|B = T))?"

1. p(M = T|J = T) = p(M = T, J = T) / p(J = T)
                  = p(Mt, Jt) / p(Jt)    (notation)
                  = t1 / t2              (notation)

H. Bruyninckx, The Bayesian approach 56

2. t1 = Σ_{W,B} p(W, Jt, B, Mt)
      = Σ_{W,B} p(Mt|Jt, B) p(Jt|W) p(B|W) p(W)
      = p(Mt|Jt, Bt) p(Jt|Wt) p(Bt|Wt) p(Wt)   [0.8 × 0.2 × 0.8 × 0.28]
      + p(Mt|Jt, Bt) p(Jt|Wf) p(Bt|Wf) p(Wf)   [0.8 × 0.9 × 0.1 × 0.72]
      + p(Mt|Jt, Bf) p(Jt|Wt) p(Bf|Wt) p(Wt)   [0.6 × 0.2 × 0.2 × 0.28]
      + p(Mt|Jt, Bf) p(Jt|Wf) p(Bf|Wf) p(Wf)   [0.6 × 0.9 × 0.9 × 0.72]
      = 0.444

H. Bruyninckx, The Bayesian approach 57

3. t2 = Σ_{W,B,M} p(W, Jt, B, M)
      = Σ_{W,B,M} p(M|Jt, B) p(Jt|W) p(B|W) p(W)
      = ...   (completed numerically in the sketch below)

H. Bruyninckx, The Bayesian approach 58
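A small enumeration sketch that reproduces t1 and finishes the t2 sum left open above. The conditional-probability values are the ones shown on slide 57; the table p(M|J = f, B) is not needed (and not given), because the query fixes J = t.

```python
# Enumeration over p(W, J, B, M) = p(W) p(J|W) p(B|W) p(M|J, B), with J fixed to True.
p_W      = {True: 0.28, False: 0.72}
p_Jt_W   = {True: 0.2,  False: 0.9}     # p(J = t | W)
p_Bt_W   = {True: 0.8,  False: 0.1}     # p(B = t | W)
p_Mt_JtB = {True: 0.8,  False: 0.6}     # p(M = t | J = t, B)

def joint_Jt(w, b, m):
    """p(W=w, J=t, B=b, M=m), using only the table entries quoted on the slide."""
    pb = p_Bt_W[w] if b else 1 - p_Bt_W[w]
    pm = p_Mt_JtB[b] if m else 1 - p_Mt_JtB[b]
    return p_W[w] * p_Jt_W[w] * pb * pm

vals = (True, False)
t1 = sum(joint_Jt(w, b, True) for w in vals for b in vals)
t2 = sum(joint_Jt(w, b, m) for w in vals for b in vals for m in vals)
print(round(t1, 3), round(t2, 3), round(t1 / t2, 3))   # 0.444 0.704 0.631
```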


Bayesian mathematics: Jaynes' axiomatic foundation

Axioms for plausible Bayesian inference

I. Degrees of plausibility are represented by real numbers.
II. Qualitative correspondence with common sense.
III. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
IV. Always take into account all of the evidence one has.
V. Always represent equivalent states of knowledge by equivalent plausibility assignments.

H. Bruyninckx, The Bayesian approach 59

Critique on Bayesian axioms

- Assigning equivalent probabilities to equivalent states seems to assume that the modeller has "absolute knowledge," since one often doesn't know that states are equivalent.
- Many people state that representing information by one single real number is not always possible or desirable.
- Intuitive interpretation(?).

H. Bruyninckx, The Bayesian approach 60

Bayesian mathematics: Cox's derivation

Plausible functional relationships:

p(A, B) = f{ p(A), p(B|A) }

⇒ f{ p(A), p(B|A) } = f{ p(A) }^m · f{ p(B|A) }^m

(Arbitrary choices: m = 1, f(u) = u.)

Negation: there exists a function g such that

g( g( p(A) ) ) = p(A)

g( p(A OR B) ) = g( p(A) ) AND g( p(B) )

H. Bruyninckx, The Bayesian approach 61


Information measures

- "How much information do I have?"
- Absolute measure = reduce the PDF to one scalar.
- Relative measure = what is the information difference between two PDFs? (one scalar)
- Global measure = measures the whole PDF.
- The choice of information measure is arbitrary, but:
   - Entropy is a natural measure...
   - ...which can be derived from first principles.

H. Bruyninckx, The Bayesian approach 62

Entropy example

[Figure: four discrete PDFs and their entropies: H = 1.6957, H = 1.8866, H = 1.6094, and H = 1.7480.]

H. Bruyninckx, The Bayesian approach 63
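As a minimal sketch (not from the slides), the Shannon entropy of a discrete PDF can be computed directly; the first distribution below is an assumed example that reproduces the first H value in the figure, and the uniform case gives ln 5.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_i p_i ln(p_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(round(entropy([0.3, 0.2, 0.2, 0.1, 0.1, 0.1]), 4))   # 1.6957
print(round(entropy([0.2] * 5), 4))                        # 1.6094 = ln(5)
```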

Logarithm as information measure: Good's axioms (1966)

I. I(M : E AND F | C) = f{ I(M : E | C), I(M : F | E AND C) }
II. I(M : E AND M | C) = I(M | C)
III. I(M : E | C) is a strictly increasing function of its arguments
IV. I(M1 AND M2 : M1 | C) = I(M1 : M1 | C) if M1 and M2 are mutually irrelevant pieces of information
V. I(M1 AND M2 | M1 AND C) = I(M2 | C)

⇓ ... straightforward but tedious ... ⇓

The logarithm of a PDF is an additive measure of the information contained in the PDF.

H. Bruyninckx, The Bayesian approach 64


Example of a logarithmic measure: Shannon entropy

Interpretation:

- Physical: measures "uncertainty," "disorder," "chaos."
- Bayesian: measures our (lack of) information, not the physical system's lack of order.
- Composition of information = multiplicative. Composition of measures = additive (via the logarithm).

First case: discrete PDF

- {x1, ..., xn}, with PDF p(x) = {p1, ..., pn}.
- Problem: define a measure H(p) of the PDF p(x).

H. Bruyninckx, The Bayesian approach 65

Entropy information measure: Axiomatic approach

Discrete distributions:
I. H is a continuous function of p.
II. If all n probabilities pi are equal (and hence equal to 1/n, if we choose them to sum to 1), the entropy H(1/n, ..., 1/n) is a monotonically increasing function of n.
III. H is an invariant, i.e., the uncertainty should not depend on how one orders or groups the elements xi.

H. Bruyninckx, The Bayesian approach 66

Invariance of entropy: Mathematical discussion

Invariance w.r.t. grouping of variables, expressed mathematically as:

H(p1, ..., pn) = H(w1, w2, ...)
               + w1 H(p1/w1, ..., pk/w1)
               + w2 H(p(k+1)/w2, ..., p(k+m)/w2)
               + ...,

where w1 is the probability of the set {x1, ..., xk}, w2 is the probability of the set {x(k+1), ..., x(k+m)}, ...

H. Bruyninckx, The Bayesian approach 67


[Figure: the probabilities p1, ..., pn grouped into sets with probabilities w1, w2, w3.]

H. Bruyninckx, The Bayesian approach 68

Invariance of entropy: Example

p1 = 1/2, p2 = 1/3, p3 = 1/6

Choose a grouping into sets: w1 = p1 = 1/2, w2 = p2 + p3 = 1/3 + 1/6 = 1/2

Then: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)

H. Bruyninckx, The Bayesian approach 69
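A quick numerical check of this grouping invariance, as a sketch (the numbers are the ones on the slide):

```python
import math

def H(*p):
    """Shannon entropy of a discrete PDF given as positional probabilities."""
    return -sum(pi * math.log(pi) for pi in p)

lhs = H(1/2, 1/3, 1/6)
rhs = H(1/2, 1/2) + 0.5 * H(2/3, 1/3)
print(round(lhs, 6), round(rhs, 6))   # both sides give the same value
```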

Continuity & monotonicity of entropy: Mathematical discussion

- Continuous → rational numbers: pi = ni / Σ_{j=1}^{n} nj
- Uniform P = (1/N, ..., 1/N) → H(p) = H(P)   (group into n1, n2, etc.)
- ⇒ H(n) + H(m) = H(mn)
- ⇒ H(n) = K ln(n)

H. Bruyninckx, The Bayesian approach 70


Entropy formulas

H(p1, ..., pn) = −K Σ_i pi ln(pi).

- ln(pi) < 0, because (ni/Σ ni) < 1.
- lim_{pi→0} pi ln(pi) = 0.
- H increases with uncertainty. (Uncertainty ≈ lack of a clear peak.)
- There is no absolute zero for entropy.
- Be careful when comparing the entropies of two PDFs: they must be defined over the same domain.

H. Bruyninckx, The Bayesian approach 71

Entropy for continuous PDF

- H(1/N, ..., 1/N) is not well defined for N → ∞.
- PDF1 = p(x) dx, PDF2 = p(y) dy: x and y span the same parameter space!
- Take ni/Σ ni in an interval: density.
- Ratios of densities make sense (locally):

  (ni / Σ ni) / (mj / Σ mj) → dx / dy

H. Bruyninckx, The Bayesian approach 72

Mutual information - Relative entropy
Kullback-Leibler divergence

H(p, q) = −∫ p(x) ln( p(x) / q(x) ) dx.

- Asymmetry between p(x) and q(x).
- Asymmetry ⇒ not a "distance" function! (A distance is independent of direction.)
- Corresponds to intuition when going between two states of information.

H. Bruyninckx, The Bayesian approach 73
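A small sketch of the asymmetry, for two assumed discrete PDFs. Note that the code computes the conventional non-negative divergence KL(p||q) = Σ p ln(p/q); the slide's H(p, q) is its negative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p||q) for discrete PDFs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(round(kl(p, q), 4), round(kl(q, p), 4))   # two different values: not a distance
```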


Local information measure - Fisher information

(No mathematical details!)
- Logarithm-based measures are global!
- Local measure = rate of change of information in a given direction v of local change in the PDF:

  H(PDF : PDF + εv) = (1/2) Σ_{i,j} g_ij(PDF) v_i v_j + O(ε²).

  Sensitivity of the variance to small changes in the mean.
- Fisher information H is a matrix, not a scalar ⇒ scalar measures = f(H).

H. Bruyninckx, The Bayesian approach 74

Information measures: Example of Gaussian PDFs

N(µ, P) = 1 / (√((2π)^n) ‖P‖^(1/2)) · exp{ −(1/2) (x − µ)^T P^(−1) (x − µ) }.

- Quadratic measure for the "error" x − µ:

  (1/2) (x − µ)^T P^(−1) (x − µ)

  (= square of the "Mahalanobis distance")
- The inverse covariance matrix P^(−1) is the Fisher information.
- It is additive after taking the logarithm of the PDF!

H. Bruyninckx, The Bayesian approach 75
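A minimal numerical sketch of these quantities for an assumed 2-D Gaussian: the quadratic term is half the squared Mahalanobis distance, and the log-density is additive when independent PDFs are multiplied.

```python
import numpy as np

def gaussian_logpdf(x, mu, P):
    """Log of N(mu, P); uses P^{-1} (the information matrix) in the quadratic term."""
    d = x - mu
    maha2 = d @ np.linalg.inv(P) @ d          # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(P)
    return -0.5 * (maha2 + logdet + len(mu) * np.log(2 * np.pi))

mu = np.array([0.0, 0.0])
P = np.array([[4.0, 0.0],
              [0.0, 1.0]])
x = np.array([2.0, 1.0])
print(gaussian_logpdf(x, mu, P))
```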

Measures inspired by P^(−1)

P^(−1) is a matrix measure ⇒ derived scalar measures:

- Trace.
- Determinant.
- Ratio of singular values.
- ...

−− No unique choice.
++ Computationally efficient.

Note: the inverse P^(−1) of the Covariance Matrix is called the Information Matrix.

H. Bruyninckx, The Bayesian approach 76


Ignorance—No information

- Bayesian approach: a state of "no information" does not exist. Jaynes: "Merely knowing the physical meaning of our parameters in a model already constitutes highly relevant prior information which our intuition is able to use at once."
- Approaches:
   - "Uniform" prior. (Note: impossible on infinite domains.)
   - Requirement of invariance under transformations ⇒ Maximum Entropy (MaxEnt) PDF.
   - Explicit "I don't know" hypothesis.

H. Bruyninckx, The Bayesian approach 77

Bayes rule is information-preserving

- The same Data explained by models M1 and M2:

  p(M1|Data) = [ p(Data|M1) / p(Data) ] p(M1),
  p(M2|Data) = [ p(Data|M2) / p(Data) ] p(M2),

  ⇒ log [ p(M1|Data) / p(M2|Data) ] = log [ p(Data|M1) / p(Data|M2) ] + log [ p(M1) / p(M2) ].

⇒ Bayes' rule does not add or remove (logarithmic) information.

H. Bruyninckx, The Bayesian approach 78

Goodness-of-fit vs Model complexity

Goodness-of-fit/Likelihood: how well can a given Model explain the Data?

The complexity of a model is higher if:

- it has a higher-dimensional state space.
- it has more parameters in its PDF representation.

⇒ the same Data is less likely in a more complex model.

How to trade off complexity vs explanatory power of two models?
⇒ Occam's razor, Bayes' factor, Occam's factor, AIC, BIC, MDL, ...

H. Bruyninckx, The Bayesian approach 79


Occam’s razor

Pluralitas non est ponenda sine necessitate.

(Given two equally predictive models, choose thesimpler one.)

Example: series −1, 3, 7, 11; what’s next?

Model 1: 15, 19, ...: f(x) = x + 4.

Model 2: −19.9, 1043.8, ...:

f(x) = −x³/11 + 9x²/11 + 23/11

f(−1) = 3, f(3) = 7, f(7) = 11, f(11) = −19.9, ...

Model 1: 1 integer; Model 2: 3 rational numbers.

H. Bruyninckx, The Bayesian approach 80
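A quick check of the slide's "Model 2" polynomial, as a sketch: it reproduces the given series exactly but continues it very differently from f(x) = x + 4.

```python
# "Model 2" from the slide: a cubic that also fits -1, 3, 7, 11.
def f2(x):
    return -x**3 / 11 + 9 * x**2 / 11 + 23 / 11

print([round(f2(x), 1) for x in [-1, 3, 7, 11]])   # [3.0, 7.0, 11.0, -19.9]
```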

Bayesian model selection ("Hypothesis test")

Explicative capacity of two models M1 and M2 for the same Data:

p(M1|Data) / p(M2|Data) = [ p(Data|M1) / p(Data|M2) ] × [ p(M1) / p(M2) ]
                        =        Bayes' factor        ×     prior odds.

(A single scalar when the Data is given!)
The Bayes' Factor penalizes complex models automatically in the trade-off between predictive power and complexity (see later: Occam's factor).

H. Bruyninckx, The Bayesian approach 81

Bayes’ Factor

"Average out" the influence of all parameters θi in model Mi:

p(Data|M1) / p(Data|M2) = [ ∫ p(Data|θ1, M1) p(θ1|M1) dθ1 ] / [ ∫ p(Data|θ2, M2) p(θ2|M2) dθ2 ],

where in each integral p(Data|θ, M) is the "data fit" and p(θ|M) dθ is the "Occam's factor."

The Bayes' factor is a ratio of two scalars, so comparison is possible.

Occam's Factor: the complexity part of the Bayes' Factor.

p(θ1|M1): after inference ("training") with the Data.

H. Bruyninckx, The Bayesian approach 82


Occam’s factor

= the complexity measure in the Bayes' factor:

p(Data|M) = ∫ p(Data|θ, M) p(θ|M) dθ,

where p(Data|θ, M) is the "data fit" and p(θ|M) dθ is the "Occam's factor."

The larger the θ parameter space (i.e., the more complex the model), the smaller the region p(θ|M) dθ where the largest probability mass is found. (In case the model has "converged.")

Occam's factor includes the density dθ!

H. Bruyninckx, The Bayesian approach 83

Bayes’ factor implementations

Akaike Information Criterion (AIC):

−2 ln p(Data|θ_ml, M) + 2k,   (k = # parameters).

Bayesian Information Criterion (BIC):

−2 ln p(Data|θ_ml, M) + k ln(N),   (N = # data samples).

Minimum Description Length (MDL):

−ln p(Data|θ_ml, M) + (k/2) ln(N).

(Beware: these measures do not take the density into account!)

H. Bruyninckx, The Bayesian approach 84
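A minimal sketch of these criteria as plain functions of a maximized log-likelihood; the numbers in the example call are illustrative assumptions.

```python
import math

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + k * math.log(n)

def mdl(loglik, k, n):
    return -loglik + 0.5 * k * math.log(n)

# Assumed example: a 3-parameter model, 100 data samples, log-likelihood -120.
print(aic(-120.0, 3), bic(-120.0, 3, 100), mdl(-120.0, 3, 100))
```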

Examples up to now: simple!

- Static network: all PDFs & nodes are fixed.
- PDFs are binary: true, false.
- Fully observed: all PDF tables are given.
- Only estimation, no learning.

But the real world is more complex:

- Dynamic networks: PDFs change over time, because new "data" comes in from the world.
- PDFs: continuous functions.
- Partially observed: some PDFs are unknown.
- Learning and estimation.

H. Bruyninckx, The Bayesian approach 85


What makes uncertainty reasoning "Bayesian"?

- Use explicit models.
- Uncertainty = PDF.
- Allow a priori knowledge.
- Inference = calculate the marginals of unobserved variables, conditional on the observed variables.
- Use Bayes' rule instead of only the likelihood.
- Estimate a PDF instead of "point estimates". (E.g., Maximum Likelihood, Maximum A Posteriori.)

H. Bruyninckx, The Bayesian approach 86

Maximum Likelihood Estimator - Maximum A Posteriori

- Likelihood (from Bayes' rule):

  L(Model params) = p(Data|Model params).

- Maximum Likelihood Estimator (MLE):

  arg max_{Model params} L(Model params).

- Maximum A Posteriori (MAP):

  arg max_{Model params} p(Model params|Data).

H. Bruyninckx, The Bayesian approach 87

MLE, MAP (2)

+ Often easy to calculate.

− Only point estimate.

− Doesn’t take density dx into account.

− Not very robust against small changes in PDF.

H. Bruyninckx, The Bayesian approach 88


Bayesian Learning

Things to learn:

- Structure of the graphical model.
- Parameters of the PDFs in the model.

Four categories, from simple to complex:

1. known model + full observations. (Done already! Only queries.)
2. known model + partial observations. (Inference of unobserved parameters.)

3. unknown model + full observations.

4. unknown model + partial observations.

H. Bruyninckx, The Bayesian approach 89

Bayesian Estimation

Treat parameters as additional unobserved variables, compute a full posterior distribution over all nodes conditional on the observed data, then integrate out the parameters.

Point estimate x = X from the PDF p(x, y, z, ...):

0. Learn p(x, y, z, ...).
1. Marginalize out all other variables:

   p(x) = ∫ p(x, y, z, ...) dy dz ...

2. (If a point estimate is needed:) Weighted average over x (the posterior mean):

   X = ∫ x p(x) dx.

H. Bruyninckx, The Bayesian approach 90

Bayesian model learning (Revisited)

Model learning = estimation in a (much) larger space:

- New model = combination of primitive models.
- Combinations are represented by parameters.
- The combination parameters are learned.

Model learning is an order of magnitude more complex than parameter estimation!

H. Bruyninckx, The Bayesian approach 91


Approximations

- Modelling:
   - Choice of model parameters.
   - Choice of graphical model structure (= stochastic dependencies).
   - Choice of PDF family.
- Assumptions:
   - Independence in joint PDFs.
   - Independence over time steps: Markov property. (x(k + 1) = f(x(k)).)
   - Choice of priors.

Always requires application-dependent insights!

H. Bruyninckx, The Bayesian approach 92

Computational simplifications

- Integral needed during marginalization:
   - Replace the integral by a finite sum.
   - (Sample-based) Monte Carlo integration.
- Use ML instead of the full PDF.
- Use explicit input/output relationships (y = f(x)) instead of implicit ones (h(x, y) = 0).
- Linearize non-linear relationships.
- Adapt the scheduling of the inference computations.
- Replace some steps by analytic approximations.
- ...

Always requires application-dependent insights!

H. Bruyninckx, The Bayesian approach 93

Overview of “Bayesian” algorithms

0. Answering queries in a Bayes net.
   - The network is given.
   - All PDFs are known.
   - Already explained earlier in the course.

1. General inference in graphical models = update the PDFs when new data comes in:
   - Bayesian network: Message passing
   - Markov Random Field: Junction tree

H. Bruyninckx, The Bayesian approach 94


Overview “Bayesian” algorithms (2)

2. Recursive estimation = inference + estimation in a given model when new data comes in at regular intervals:
   - Kalman Filter.
   - Particle Filter.
   - Hidden Markov Model (Baum-Welch, Viterbi).

3. Model approximation = construct the model that fits the data "best":
   - Variational Bayes.
   - Expectation-Maximization (EM).

H. Bruyninckx, The Bayesian approach 95

Inference on a Bayes Network: Message passing - Belief propagation

- Propagation of new information from one node to all others via message passing in the forward and backward direction of the directed graph.
- Messages are stored locally (at the nodes) in order not to repeat some calculations.
- More efficient algorithms for graphs with more structure (tree, DAG, ...) are limit cases of this message passing.
- No details given in this course...
- Further reading: Bishop 2006 (Chapter 8).

H. Bruyninckx, The Bayesian approach 96

Markov blankets = all nodes influencing a given node

MRF: all neighbours.

Bayes network: parents, children, and children's parents.

H. Bruyninckx, The Bayesian approach 97


H. Bruyninckx, The Bayesian approach 98

Inference on a general graph - Junction Tree

- Transform the MRF graphical model into a tree by taking nodes together.
- The transformed model is not unique! (Different arbitrary choices are possible.)
- Too complex for this introductory course...
- Approximation via Loopy Belief Propagation = apply Belief Propagation although the graph is not a DAG.

H. Bruyninckx, The Bayesian approach 99

Dynamic/Recursive network: 1st-order Markov

"Arrows":

X_{k+1} = f(X_k, U_{k+1})
Y_k = g(X_k, U_k)

"1st-order Markov" = the time influence spans only one step.

H. Bruyninckx, The Bayesian approach 100


Dynamic/Recursive network: 2nd-order Markov

H. Bruyninckx, The Bayesian approach 101

Simplest dynamic network: Kalman Filter

Required inference = given Y and U, update X.

Assumptions made:

- 1st-order Markov.
- PDF representation: Gaussian distributions!
- Functional relationships f(·), g(·): linear!

⇒ an analytical solution is possible for Bayes' rule ("learning") and marginalization ("estimation")!

H. Bruyninckx, The Bayesian approach 102

Kalman Filters (cont’d)

Typical applications: tracking (= adapting to small deviations from previous values).

Simplifications to reduce computations (a minimal numeric sketch follows below):

- State-space model: x_{k+1} = F x_k + q_k.
- Gaussian "uncertainty" on x_k: covariance P_k.
- Gaussian "process noise" q_k: covariance Q_k.
- Data: z_k = H x_k + r_k.
- Gaussian "measurement uncertainty" r_k on z_k: covariance R_k.

H. Bruyninckx, The Bayesian approach 103
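A minimal 1-D sketch of one predict/correct cycle under these assumptions (linear model, Gaussian noise); all numbers are illustrative, not taken from the slides.

```python
# One-dimensional Kalman filter: prediction with x_{k+1} = F x_k + process noise (cov Q),
# correction with z_k = H x_k + measurement noise (cov R).
def kf_step(x, P, z, F=1.0, Q=0.01, H=1.0, R=0.25):
    x_pred = F * x
    P_pred = F * P * F + Q
    K = P_pred * H / (H * P_pred * H + R)   # Kalman gain
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1 - K * H) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0                             # assumed initial estimate and variance
for z in [0.9, 1.1, 1.05]:                  # assumed measurements
    x, P = kf_step(x, P, z)
print(x, P)                                 # estimate near 1.0, with shrinking variance
```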


H. Bruyninckx, The Bayesian approach 104

Kalman Filters - Further reading

To derive the Kalman Filter from Bayes' rule:
- Ho, Yu-Chi and Lee, R. C. K., A Bayesian Approach to Problems in Stochastic Estimation and Control, IEEE Transactions on Automatic Control, 1964.

To derive the Kalman Filter as an optimal information processor:
- Zellner, Arnold, Optimal Information Processing and Bayes's Theorem, The American Statistician, 42:278-284, 1988.

To derive the Kalman Filter as a recursive least-squares filter:
- Kalman, R. E. and Bucy, R. S., New results in linear filtering and prediction theory, Transactions of the ASME, Journal of Basic Engineering, 83:95-108, 1961.

H. Bruyninckx, The Bayesian approach 105

Sample-based inference (Monte Carlo methods)

Terminology: Sequential Monte Carlo, bootstrap filter, condensation algorithm, particle filter, ...

Approximated PDF = samples with a weight:

p(x_k|z_k) → {x_k^i, w_k^i}_{i=1}^{Ns},   Σ_i w_k^i = 1.

Operations on PDFs (marginalization, Bayes' rule, ...) → operations on samples. For example:

φ̄ = ∫ φ(x) p(x) dx ≈ (1/N) Σ_{i=1}^{N} φ(x^i) = Σ_{i=1}^{N} w^i φ(x^i).

H. Bruyninckx, The Bayesian approach 106


Importance of sample-based methods

- Can approximate any PDF...
- ...including PDFs with mixed discrete-continuous parameters.
- Sequential method: the Markov property leads to reduced computational complexity.
- Used where Kalman Filters are too "unimodal."
- Accuracy doesn't depend on the dimension of the space!

H. Bruyninckx, The Bayesian approach 107

How to generate samples?

- Inversion sampling.
- Proposal-based sampling:
   - Importance sampling.
   - Rejection sampling.
- Markov Chain Monte Carlo (MCMC) sampling:
   - Metropolis-Hastings sampling.
   - Gibbs sampling.

All methods use Random Number Generators.
All methods assume that evaluating p(x) is cheap.

H. Bruyninckx, The Bayesian approach 108

Inversion sampling via the CDF

- (Only) sampling from a uniform PDF is simple.
- Searching in the Cumulative Distribution Function can be costly.
- Only some families of PDFs allow simple inversion sampling.

H. Bruyninckx, The Bayesian approach 109
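A minimal sketch of inversion sampling for one of the families that do allow it (the exponential PDF, an assumed example): uniform samples are pushed through the inverse CDF.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=10_000)
x = -np.log(1 - u) / lam        # inverse of the exponential CDF F(x) = 1 - exp(-lam * x)
print(x.mean())                 # close to the true mean 1/lam = 0.5
```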


Importance sampling

[Figure: a Beta target PDF and a Gaussian proposal PDF, with samples drawn from each.]

- Sample from a simpler proposal: q(x).
- Adjust the weight of each sample according to p(x)/q(x).
- q(x) should be "similar" to p(x).

H. Bruyninckx, The Bayesian approach 110
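A minimal sketch of the weighting step, with an assumed Gaussian target and a broader Gaussian proposal (the slide's own figure uses a Beta target, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_p, sigma_p = 1.0, 0.5    # target p(x)
mu_q, sigma_q = 0.0, 2.0    # broad proposal q(x), "similar enough" to p(x)

xs = rng.normal(mu_q, sigma_q, size=20_000)
w = gauss_pdf(xs, mu_p, sigma_p) / gauss_pdf(xs, mu_q, sigma_q)
w /= w.sum()                # self-normalized weights, sum_i w_i = 1
print((w * xs).sum())       # estimate of E_p[x], close to the target mean 1.0
```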

Rejection sampling

[Figure: rejection sampling of a Student-t target PDF under a scaled Gaussian proposal.]

- Sample from a simpler proposal q(x).
- q(x) should be "similar" to p(x).
- Draw a height uniformly under c q(x), with c > 1 chosen so that c q(x) covers p(x).
- Accept the sample if the height is smaller than p(x).

H. Bruyninckx, The Bayesian approach 111
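A minimal sketch of the accept/reject step; the target here is an assumed bimodal mixture (not the Student-t of the figure), chosen so that c·q(x) really does cover p(x).

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p(x):                                   # target: mixture of two Gaussians
    return 0.5 * gauss(x, -1.5, 0.5) + 0.5 * gauss(x, 1.5, 0.5)

def q(x):                                   # proposal: one broad Gaussian
    return gauss(x, 0.0, 2.0)

c = 3.0                                     # chosen so that c * q(x) >= p(x) everywhere
xs = rng.normal(0.0, 2.0, size=50_000)
u = rng.uniform(size=xs.size)
accepted = xs[u < p(xs) / (c * q(xs))]
print(accepted.size / xs.size)              # acceptance rate, roughly 1/c
```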

MCMC sampling (Markov Chain Monte Carlo)

- Finding an appropriate "proposal density" q(x) is difficult! Especially in higher dimensions.
- The PDF is found as the limit of a Markov Chain.
- It can take a long time before the chain has "converged."
- Two major approaches:
   - Metropolis-Hastings sampling.
   - Gibbs sampling.

H. Bruyninckx, The Bayesian approach 112


Example: Visual feature tracking

Movie

- Low-level visual features are tracked.
- "Statistical weight" is built up over time.
- Some features become permanent.

http://www.doc.ic.ac.uk/~ajd/

H. Bruyninckx, The Bayesian approach 113

Example: robot localization with a laser distance scanner

Movie

- Uniform initial sample set.
- Some rooms are almost identical → multiple "peaks."

http://www.cs.washington.edu/ai/Mobile_Robotics/mcl/

H. Bruyninckx, The Bayesian approach 114

Hidden Markov model
The system behaves as a State Machine:

(Source: Wikipedia)

- x: states.
- y: possible observations.
- a: state transition probabilities.
- b: observation probabilities.

This is not a graphical model!

Example: speech recognition.

H. Bruyninckx, The Bayesian approach 115


Graphical model of HMM

“Easy” calculation direction:

p(Y) = Σ_X p(Y|X) p(X).

Even this explodes for real systems. . .

H. Bruyninckx, The Bayesian approach 116

Fastest Bayesian Matching - Viterbi Algorithm

(Closely related to the forward algorithm, but with a maximization instead of a sum.)

- Given the HMM model parameters (transition and output probabilities), find the most likely sequence of (hidden) states (e.g., spoken words) which could have generated a given output sequence (e.g., a measured acoustic signal).
- Maximum Likelihood, not the full PDF:

  argmax p(hidden state sequence|observed sequence).

H. Bruyninckx, The Bayesian approach 117

HMM properties that avoid exhaustive search:

- observed and hidden events: in aligned sequences. (= form of graphical model)
- an observed event corresponds to exactly one hidden event. (= the state dynamics is a Finite State Machine)
- computing the most likely hidden sequence up to time t only depends on (i) the observed event at t, and (ii) the most likely sequence up to time t − 1. (= dynamic programming)

H. Bruyninckx, The Bayesian approach 118
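A minimal Viterbi sketch (max-product dynamic programming) for a small discrete HMM; the two-state model and its tables are illustrative assumptions, not part of the slides.

```python
import numpy as np

states = ["Rain", "Sun"]
pi = np.array([0.6, 0.4])                   # initial state probabilities
A = np.array([[0.7, 0.3],                   # A[i, j] = p(next state j | state i)
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],              # B[i, k] = p(observation k | state i)
              [0.6, 0.3, 0.1]])
obs = [0, 1, 2, 2]                          # observed symbol indices

n, T = len(states), len(obs)
delta = np.zeros((T, n))                    # best path probability ending in each state
psi = np.zeros((T, n), dtype=int)           # back-pointers

delta[0] = pi * B[:, obs[0]]
for t in range(1, T):
    for j in range(n):
        trans = delta[t - 1] * A[:, j]
        psi[t, j] = trans.argmax()
        delta[t, j] = trans.max() * B[j, obs[t]]

path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):               # trace the back-pointers
    path.insert(0, int(psi[t, path[0]]))
print([states[s] for s in path])            # most likely hidden state sequence
```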


Fastest Bayesian Learning - Baum-Welch

(Also known as: the forward-backward algorithm.)

Learning of the probabilities a_ij and b_ij in an HMM:

- Maximum Likelihood, not the full PDF.
- Makes forward steps starting from assumed ML parameters.
- Compares them to the measured outputs.
- Makes backward steps to adapt the ML parameters.
- Until convergence, to a local optimum!

H. Bruyninckx, The Bayesian approach 119

Approximate learning algorithm - Variational Bayes

- The ever-returning problem: estimate p(z|x), where x are the observed variables ("data") and z are the hidden variables.
- Approximate the "true" p(z|x) by a member q(z|θ) of a θ-parameterized family of PDFs. For example: a sum of Gaussians, with the weights as parameters.
- The approximation is guided by minimization of the Kullback-Leibler divergence between the true p(z|x) and the approximation q(z|θ).

H. Bruyninckx, The Bayesian approach 120

Approximate learning algorithm—Expectation-Maximization—

- EM is an "abstract algorithm", not really executable in itself.
- Gives a Maximum Likelihood estimate of θ in q(z|θ).
- Works in two steps:
   - Expectation step.
   - Maximization step.
- Can be proven to converge...
- ...but not necessarily to the global optimum.

H. Bruyninckx, The Bayesian approach 121


Outline of EM
The data x are given; θ and z are alternately updated.

Expectation step k:
- Assume θ has the given value θ(k).
- Then p(z|x, θ(k)) results from inference.
- The "total" log-likelihood function log p(x, z|θ) is a function of z and θ.
- Marginalizing z away gives a function of θ:

  Q(θ) = ∫ p(z|x, θ(k)) log p(x, z|θ) dz.

Maximization step k: find the maximum of Q(θ):

  θ(k + 1) = arg max_θ Q(θ).

H. Bruyninckx, The Bayesian approach 122
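A minimal concrete instance of these two steps, as a sketch for a two-component 1-D Gaussian mixture (the mixture model and its numbers are assumptions): z is the hidden component label and θ the mixture parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: responsibilities p(z = j | x_i, theta(k))
    r = np.stack([w[j] * gauss(x, mu[j], sigma[j]) for j in range(2)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: theta(k+1) maximizes the expected complete-data log-likelihood Q(theta)
    Nj = r.sum(axis=0)
    w = Nj / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nj
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)

print(w.round(2), mu.round(2), sigma.round(2))   # roughly (0.3, 0.7), (-2, 3), (1, 1)
```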

Alternative explanation of EM

System spaces:

- Total space S: all possible PDFs over the state space, modelled by parameters φ.
- Model space M: all PDFs that are in the θ family, i.e., a subspace of S, represented by parameters φ(θ).
- Data space D: all PDFs p(x, z) that are possible given the observed + hidden parameters.

D need not be part of M!

H. Bruyninckx, The Bayesian approach 123

EM = Iterative Maximum Likelihood steps:

- Inference with the last P → Q ∈ D.
- Find the "closest" representative in M = "project" the PDF Q onto M = P ∈ M : min_P KL(Q, P). (KL = Kullback-Leibler divergence)

H. Bruyninckx, The Bayesian approach 124


Graphical model learning

Unknown model + full or partial observations

Much harder than parameter learning!

Generic approach to model learning: find the "best" parameters in combinations of more primitive sub-models.

Major problem: creating the sub-models + data association of the (hidden) data to the sub-models.

Major algorithm: EM, with more complex θ’s:

1. discrete selection of sub-models.

2. continuous information in each selection.

H. Bruyninckx, The Bayesian approach 125

H. Bruyninckx, The Bayesian approach 126