
Uncertainty Reasoning in Knowledge Systems

The Bayesian approach

Herman Bruyninckx
Department of Mechanical Engineering

K.U.Leuven, Belgium

Autumn 2007

H. Bruyninckx, The Bayesian approach 1

Further reading

++ Christopher M. Bishop, Pattern Recognition and Machine Learning, 2006. (Permanently available in CBA library: 1 681.3*I5 2006.)

+ Jensen, Finn V. and Nielsen, Thomas D., Bayesian Networks and Decision Graphs, 2001. (Available in CBA library: 1 519.2 2001.)

+ David MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press. (Available in CBA library: 1 519.72 2003. Available electronically at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html.)

H. Bruyninckx, The Bayesian approach 2

Further reading (2)

- Sebastian Thrun, Wolfram Burgard and Dieter Fox, Probabilistic Robotics, MIT Press, 2005. (Available in CBA library: 1 681.3* I29 2005.)

- Wikipedia (http://en.wikipedia.org) contains many good articles on Bayesian concepts and algorithms, and a (still rather poor) overview: http://en.wikipedia.org/wiki/Bayesian_probability.

- Kevin P. Murphy, A Brief Introduction to Graphical Models and Bayesian Networks, 1998. Available electronically at http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html.

H. Bruyninckx, The Bayesian approach 3


Further reading (3)

- On-line facsimile of Bayes' paper: http://www.stat.ucla.edu/history/essay.pdf.

- Stephen E. Fienberg, When Did Bayesian Inference Become "Bayesian"?, Bayesian Analysis, 1(1):1-40, 2006. http://lib.stat.cmu.edu/~fienberg/fienberg-BA-06-Bayesian.pdf

H. Bruyninckx, The Bayesian approach 4

What is URKS?
Knowledge System (intelligent system; reasoning system):

- A computer program that can answer questions about the information in a real-world system (engineered, social).
- It uses relationships between variables that represent the real-world system in a mathematical way.
- Learning of the variables as well as of the relationships is an important (but not necessary) part of a knowledge system.
- Examples: weather forecasting; traffic speed detector; medical diagnosis system; legal court reasoning system; robot moving around in an unknown environment; GPS navigation system; ...
- Counter-examples: lecture room automation system.

H. Bruyninckx, The Bayesian approach 5

Uncertainty (information, probability, possibility, plausibility, chance, ...):

- There can be uncertainty about: the values of the variables; the structure of the relationships; the most appropriate relationship in a set of possible relationships; etc.
- Examples: predicted weather; estimated speed of a car; identification of the car that drove too fast; probability of being guilty; correct physical theory of the origin of the universe; position of a car on the map; probability that a patient will die from cancer within six months; ...
- Counter-examples: value of the 1000th digit of π; number of millimeters in one meter; my name is "Herman"; ...

H. Bruyninckx, The Bayesian approach 5


Reasoning:

- The system executes the "most appropriate" algorithm to find the answer to a question I pose.
- The questions can be of a very variable nature: What happens when I give this input? How can the system incorporate this new information? What would be the optimal structure of the system for these information needs? ...
- Learning is one important example of "reasoning".
- Decision making is another important example.
- How to reason about concepts like: opinion; act of God; personal bias; prior knowledge; ...

H. Bruyninckx, The Bayesian approach 5

Overview of the Bayesian approach

1. Modelling - How is the information represented?
   - Structure/Relationships.
   - Values.
2. Reasoning/Information processing - What questions can be asked?
   - learning (both values and structure), data association, pattern recognition, classification, clustering, model building (= learning structure), ...
3. Decision making - What do you do with the answer? (Beyond the scope of this course.)

H. Bruyninckx, The Bayesian approach 6

1. Modelling

1.1 Mathematical representation of the problem domain = parameters ("random variables") + relationships ("graphical model," "(Bayesian) network").

H. Bruyninckx, The Bayesian approach 7


1.2 Information

- Probability Density Function (PDF) over the parameter space: a function instead of a single number.
- Subjective representation of the human's knowledge about, or interpretation of, the world.

[Figure: surface plot of a two-dimensional Gaussian PDF over position, captioned "Where am I?"]

H. Bruyninckx, The Bayesian approach 8

Joint PDF:

- captures the uncertain/stochastic dependence between parameters.
- Graphical model = "bookkeeping" structure for the computations of information processing = factorization of the joint PDF:

p(a, b, c, x, ..., α, β, η, θ) = p(x, y, z) p(c|x, y, z) p(a, b) ...

H. Bruyninckx, The Bayesian approach 9

2. Information processing

2.1 Inference (deduction, abduction).
2.2 Learning:
   - parameters in a given model.
   - structure of a new model.
2.3 Estimation (data reduction).

Major computational algorithms:

- Bayes' rule: incorporate new information.
- Marginalization: reduce existing information.
- Belief propagation in a network: transport information.

H. Bruyninckx, The Bayesian approach 10


3. Decision making

3.1 Utility functions:
   - Give a "value" to the information in variables...
   - ...to translate information into action.
   - Do not influence the information content in the network. (At least not directly.)
   - Arbitrary!
3.2 Attached "somewhere" to the graphical model.
3.3 The computation of decision making need not be synchronised with the computation of information processing.

This course: only estimation and model selection.

H. Bruyninckx, The Bayesian approach 11

Graphical models

H. Bruyninckx, The Bayesian approach 12

Graphical Models

- Best reference: Bishop 2006.
- Graphical representation of the structure of relationships:
   - a node contains variables.
   - an arc (edge, link, arrow, ...) contains the mathematical representation of a relationship.
- The model describes (a simplification of) the world in a coordinate-free way. The same model can have various mathematical representations as well as various algorithmic implementations.
- Not limited to the Bayesian paradigm.

H. Bruyninckx, The Bayesian approach 13


Joint PDF

- Bayesian paradigm: each node = a joint PDF.
- The graphical model encodes conditional independence: X and Y are conditionally independent if knowing Y doesn't give any information about X. This gives (computational) structure to a graphical model: it is useless to do computations that involve X if one wants to know things about Y, so one can "cut off" parts of the model.
- If X and Y are conditionally independent given Z, then Y gives us no new information about X once we know Z.

H. Bruyninckx, The Bayesian approach 14

Directed arc — “Arrow”

- Explicit relationship z = f(x, y).
- Arrow:
   - indicates which direction is "easy" to calculate.
   - is not necessarily physical causality.
- Factorization of the joint PDF:

p(x, y, z) = p(z|x, y) p(x, y).

H. Bruyninckx, The Bayesian approach 15

Undirected arc

- Implicit relationship f(x, y, z) = 0.
- Factorization: it is not possible to write p(x, y, z) as a factorized function of p(x, y) and p(z).
- Computationally more costly than an explicit relationship.

H. Bruyninckx, The Bayesian approach 16


Hidden (“latent”) variable

Sometimes one uses a graphical indication of which parameters are "measured" and which parameters are "hidden":

H. Bruyninckx, The Bayesian approach 17

Network with cycle

H. Bruyninckx, The Bayesian approach 18

Graphical models & Bayes paradigm

- Undirected arcs: "Markov Random Field"
- Directed arcs: "Bayesian/Belief Network"
- Both: "Chain Graph"
- Wider applicability than only the Bayesian paradigm!
- Difference with neural networks: the interpretation of nodes and edges in the context of the model.
- Most efficient: Directed Acyclic Graph (DAG). (Serialization of computations!)
- Can be extended with decision-making nodes.

H. Bruyninckx, The Bayesian approach 19


Directed Acyclic Graph (DAG)

p(A, B, ..., G) = p(A) p(B) p(C|A, B) p(E|C) p(D|C) p(F|E) p(G|D, E, F).

H. Bruyninckx, The Bayesian approach 20

Example: burglar alarm

Bayesian knowledge system = graphical model + probability tables/functions over the random variables in the nodes.

H. Bruyninckx, The Bayesian approach 21

Markov Random Field

Undirected arcs.

Two (sets of) nodes A and B are made conditionally independent by a third (set) C if all paths between nodes in A and B are separated by a node in C.

In some cases, extra hidden variables can transform an MRF into a Bayesian network.

H. Bruyninckx, The Bayesian approach 22


Bayes Net: Serial connection - Causal chain

A and C are independent, given the information in B: p(C|B, A) = p(C|B).

The information in B "blocks" the transfer of information between A and C.

A and C are d-separated by B. ("d" stands for "directional".)

Example: knowing that there is an alarm, one knows the chance that Mary calls, independently of whether there has been a burglary or not.

H. Bruyninckx, The Bayesian approach 23

Bayes Net: Diverging connections - Common cause

B, C, ..., X are independent, given the information in A: p(B|A, C, ..., X) = p(B|A).

Example: knowing that there is an alarm, one knows the probabilities that John and Mary call, without them having to know about each other.

Terminology: fan-out = # of A's outgoing arcs.

H. Bruyninckx, The Bayesian approach 24

Bayes Net: Converging connections - Common effect - Explaining away

B, C, ..., X are dependent, only via the information in A: p(A, B, C) = p(B) p(C) p(A|B, C).

The information in B, C, ..., X becomes correlated by the information in their common child parameter A.

Example: if one observes an alarm, then information about whether a burglary has occurred gives information about whether an earthquake has occurred.

Terminology: fan-in = # of A's incoming arcs.

H. Bruyninckx, The Bayesian approach 25


Bayesian Probability Theory

Nice properties:

- Unique: a single answer.
- Consistent: independent of the route to the answer.
- Plausible: an extension of 0/1 logic.

What's in a name?:

- Information = belief = evidence = possibility = probability = ...
- Inference = information processing = belief propagation = ...

H. Bruyninckx, The Bayesian approach 26

Arbitrary choices in modelling

1. Variables. (You must choose them!)
2. Relationships. (You must choose them!)
3. Information = PDF. (You must choose its form!)

H. Bruyninckx, The Bayesian approach 27

What is a PDF?
= a function p(x) over the space of the variable x
+ a measure "dx" that describes the "density" of the domain around the value x.

- Average/expected value weighted by the density:

  ∫_D p(x) dx

- Integral over the whole variable space = 1:
   - the value "1" is an arbitrary choice!
   - only the relative value is important.
- Example of a non-uniform density: longitude + latitude.

H. Bruyninckx, The Bayesian approach 28


(Figure source: Stefan Kuhn, Wikipedia.org.)

Mercator projection: the surface “density” is notconstant over the whole map.

H. Bruyninckx, The Bayesian approach 29

Example PDF: Gaussian

Gaussian (or Normal) PDF, with mean µ = 0 and standard deviation σ = 5, 10, 20, 30.

H. Bruyninckx, The Bayesian approach 30

Example PDF: Multidimensional Gaussian

N(x; µ, P) = 1 / (√((2π)^n) ‖P‖^(1/2)) · exp{ −(1/2) (x − µ)^T P^(−1) (x − µ) }

[Figure: surface plot of a two-dimensional Gaussian PDF.]

H. Bruyninckx, The Bayesian approach 31


Mean µ—Covariance P

µ = ∫ x p(x) dx,    P = ∫ (x − µ)(x − µ)^T p(x) dx

[Figure: contours of a two-dimensional Gaussian, with principal standard deviations σ1 and σ2.]

H. Bruyninckx, The Bayesian approach 32

Gaussian PDFs (2)

Advantages:

- Only two parameters needed (per dimension of the domain).
- Information processing: (often) analytically possible.

Disadvantages:

- mono-modal; uni-variate; only one "peak".
- extends to infinity.

Efficient extensions:

- Sum of Gaussians.
- Exponential PDFs: α h(x) exp{β g(x)}.

H. Bruyninckx, The Bayesian approach 33

Joint PDF

The multi-dimensional, single-valued information function p(X, Y, Z, ...) that describes the "statistical" dependencies between the variables X, Y, Z, ... in a model M.

⇒ p(X, Y, Z, ...|M)

Model M = representation of the relationships between the variables X, Y, Z, ...

H. Bruyninckx, The Bayesian approach 34


Relationships = dependent variables

Two variables X and Y are (statistically) dependent if a change in the value of X is correlated with a change in the value of Y.

Correlation: one can derive (part of) the change in information about X from a change in information about Y.

There is not necessarily a physical causal connection between X and Y.

H. Bruyninckx, The Bayesian approach 35

Dynamic relationships: "state-space" description

dx/dt = f(x, θ, u)
y = g(x, θ, u)

- x: domain values.
- θ: model parameters (PDF, relationships).
- u: inputs.
- y: output values.

Time is an explicit parameter, and the "horizon" of the network moves with time.

H. Bruyninckx, The Bayesian approach 36

Dynamic inference example: Mobile robot in a corridor

H. Bruyninckx, The Bayesian approach 37


Bayesian mathematics
1. Sum rule

p(x + y|H) = p(x|H) + p(y|H) − p(xy|H).

Notation: "+," "or," "∨".

Full version: p(X = x ∨ Y = y|H).

In words: the value of the information one has about the random variable X having the value x or the random variable Y having the value y, given the context H, is the sum of the information about the individual random variables, if both events are mutually exclusive in the given context.

H. Bruyninckx, The Bayesian approach 38

2. Product rule

p(xy|H) = p(x|y, H) p(y|H).

Notation: "multiplication," "and," "∧".

Full version: p(X = x ∧ Y = y|H).

In words: the value of the information one has about the random variable X having the value x and the random variable Y having the value y is the product of the information one has about X given the value of Y, multiplied by the information one has about Y, all in the context H.

H. Bruyninckx, The Bayesian approach 39

2b. Chain rule

Apply product rule multiple times:

p(xyz|H) = p(x|y, z, H) p(y|z, H) p(z|H).

H. Bruyninckx, The Bayesian approach 40


3. Marginalization

p(x|H) = ∫ p(x, y, z, ...|H) dy dz ...

(∫ → Σ for discrete PDFs.)

In words: this finds the information on x, taking into account the information about all other variables y, z, ... with which x is related.

In other words: it "averages out" the influence of all other related variables.

H. Bruyninckx, The Bayesian approach 41

4. Bayes' rule (Inference - "Inverse" probability)

p(xy|H) = p(yx|H)

⇓ (product rule on both sides)

p(x|y, H) p(y|H) = p(y|x, H) p(x|H)

⇓

p(x|y, H) = p(y|x, H) p(x|H) / p(y|H)

H. Bruyninckx, The Bayesian approach 42

Bayes’ rule interpretation

p(Model params|Data, H) = [ p(Data|Model params, H) / p(Data|H) ] × p(Model params|H).

"Posterior = (Conditional data likelihood / Data likelihood) × Prior,"
where the first factor is usually just called the "Likelihood."

The Data are given, so all factors are functions of the model parameters; p(Data|H) is (often) a mere "normalization factor."

H. Bruyninckx, The Bayesian approach 43


Bayes’ rule with Gaussians

H. Bruyninckx, The Bayesian approach 44
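The slide above only shows this fusion graphically. As a minimal numerical sketch (not part of the original slides), the code below applies Bayes' rule to a 1-D Gaussian prior and a 1-D Gaussian measurement; the particular numbers are illustrative assumptions.

```python
# Minimal sketch: Bayes' rule with 1-D Gaussians (illustrative numbers).
# The product of a Gaussian prior and a Gaussian likelihood is again Gaussian.

def fuse_gaussians(mu0, var0, z, var_z):
    """Posterior mean and variance after one Gaussian measurement z of x."""
    k = var0 / (var0 + var_z)        # relative weight of the measurement
    mu_post = mu0 + k * (z - mu0)    # precision-weighted mean
    var_post = (1 - k) * var0        # the posterior variance shrinks
    return mu_post, var_post

mu, var = fuse_gaussians(mu0=0.0, var0=4.0, z=1.0, var_z=1.0)
print(mu, var)   # 0.8 0.8: the posterior is narrower than either factor
```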

Important properties

- A PDF on the Model parameters goes "in" ⇒ a PDF on the Model parameters comes "out."
- Integration of information is multiplicative.
- Computationally intensive for general PDFs.
- Easy for discrete PDFs and Gaussians. (And some other families of continuous PDFs.)
- p(Data|Model params): requires a known table or mathematical function Data = f(Model params) to predict the Data from the Model.
- The likelihood is not a PDF.

H. Bruyninckx, The Bayesian approach 45

Example with discrete PDF (Inverse probability example)

Data:

- In MAI, 31% of the students are non-European.
- At URKS, 50% of the students are non-European.
- At MAI, 48% of the students study URKS.

Question:

- What is the probability that a non-European student at MAI studies URKS?

H. Bruyninckx, The Bayesian approach 46


Bayes rule with discrete PDF (2)

Model:

- A corresponds to "studying URKS in MAI."
- B corresponds to "being non-European in MAI."

Inference:

p(A) = 0.48, p(B) = 0.31, p(B|A) = 0.50

⇒ p(A|B) = p(B|A) p(A) / p(B) = 0.77.

H. Bruyninckx, The Bayesian approach 47
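A one-line numerical check of this inference, as a hedged Python sketch (the probabilities are the ones quoted on the slide):

```python
# Bayes' rule with the discrete numbers from the slide:
# A = "studies URKS in MAI", B = "is non-European in MAI".
p_A, p_B, p_B_given_A = 0.48, 0.31, 0.50
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 2))   # 0.77
```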

Intuitive interpretation of PDFs - The prosecutor's fallacy

The following links contain examples of court reasoning that lead to very wrong conclusions, by misunderstanding the differences between p(x|y) and p(y|x), and between likelihood and PDF:

- http://en.wikipedia.org/wiki/Prosecutor's_fallacy
- http://en.wikipedia.org/wiki/Conditional_probability#The_conditional_probability_fallacy

H. Bruyninckx, The Bayesian approach 48

Inference in graphical models

“Infer hidden X given observed Y!”

Terminology:

- Prediction, deduction, forward reasoning: reason from cause to symptom.
- Diagnosis, abduction, backward reasoning: reason from symptom to cause.

Bayes' rule p(X|Y) = p(Y|X) p(X) / p(Y) "inverts" causality.

H. Bruyninckx, The Bayesian approach 49


Example: burglar alarm (1)
"What is the probability p(J, M, A, ¬B, ¬E) that there is no burglary, nor an earthquake, but that the alarm went off and both John and Mary called?"

p(J, M, A, B, E)
= p(J, M|A, B, E) p(A, B, E)                (product rule)
= p(J|A) p(M|A) p(A|B, E) p(B, E)           (product rule; independence via A)
= p(J|A) p(M|A) p(A|B, E) p(B|E) p(E)       (product rule)
= p(J|A) p(M|A) p(A|B, E) p(B) p(E).        (independence of B and E)

H. Bruyninckx, The Bayesian approach 50

⇒ p(J = T, M = T, A = T, B = F, E = F)
= p(J = T|A = T) p(M = T|A = T) p(A = T|B = F, E = F) p(B = F) p(E = F)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998
= 0.00062811

No Bayes' rule is required, because everything is deduction ("forward reasoning").

H. Bruyninckx, The Bayesian approach 51
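A minimal sketch of this forward ("deductive") computation, using only the conditional-probability entries quoted on the slides; the variable names are ad hoc and the full tables of the alarm network are not reproduced here.

```python
# Forward evaluation of p(J, M, A, not-B, not-E) for the burglar-alarm network.
p_J_given_A     = 0.90   # John calls when the alarm goes off
p_M_given_A     = 0.70   # Mary calls when the alarm goes off
p_A_given_nB_nE = 0.001  # alarm without burglary or earthquake
p_not_B         = 0.999
p_not_E         = 0.998

p = p_J_given_A * p_M_given_A * p_A_given_nB_nE * p_not_B * p_not_E
print(p)   # 0.00062811...
```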

Example: burglar alarm (2)

“What is the probability that there is a burglary,given that John calls?”

Bayes' rule: p(B|J) = p(J|B) p(B) / p(J).

1. p(B) = 0.001

2. p(J) = Σ_{A,B,E} p(J, A, B, E)                  (marginalisation)
        = Σ_{A,B,E} p(J|A, B, E) p(A, B, E)        (product rule)
        = Σ_{A,B,E} p(J|A) p(A|B, E) p(B) p(E).    (id.)

H. Bruyninckx, The Bayesian approach 52

Page 19: The Bayesian approach - KU Leuvenpeople.cs.kuleuven.be › ~danny.deschreye › urks-slides2007-1x3.pdf · The Bayesian approach Herman Bruyninckx Department of Mechanical Engineering

This sum

p(J) = Σ_{A,B,E} p(J|A) p(A|B, E) p(B) p(E)

has only non-negligible terms when B and E are False:

p(J) ≈ (0.90 × 0.001 + 0.05 × 0.999) × 0.999 × 0.998 = 0.0507,
with 0.999 = p(B = F) and 0.998 = p(E = F).

H. Bruyninckx, The Bayesian approach 53

3. p(J|B) = Σ_{A,E} p(J, A, E|B)                       (marginalisation)
          = Σ_{A,E} p(J|A, B, E) p(A|B, E) p(E|B)       (product rule)
          = Σ_{A,E} p(J|A) p(A|B, E) p(E)               (independence)
          = 0.90 (0.95 × 0.002 + 0.94 × 0.998) + 0.05 (0.05 × 0.002 + 0.06 × 0.998)
          = 0.849.

⇒ p(B|J) = 0.849 × 0.001 / 0.0507 ≈ 0.0167.

H. Bruyninckx, The Bayesian approach 54

Example: DAG-with-“loop”

p(W, J, B, M) = p(W) p(J|W) p(B|W) p(M|J, B).

H. Bruyninckx, The Bayesian approach 55


Inference example

"What are the probabilities that Mary missed her appointment because there was a traffic jam (p(M = T|J = T))? Or because she has been out too late (p(M = T|B = T))?"

1. p(M = T|J = T) = p(M = T, J = T) / p(J = T)
                  = p(Mt, Jt) / p(Jt)    (notation)
                  = t1 / t2              (notation)

H. Bruyninckx, The Bayesian approach 56

2. t1 = Σ_{W,B} p(W, Jt, B, Mt)
      = Σ_{W,B} p(Mt|Jt, B) p(Jt|W) p(B|W) p(W)
      = p(Mt|Jt, Bt) p(Jt|Wt) p(Bt|Wt) p(Wt)   [0.8 × 0.2 × 0.8 × 0.28]
      + p(Mt|Jt, Bt) p(Jt|Wf) p(Bt|Wf) p(Wf)   [0.8 × 0.9 × 0.1 × 0.72]
      + p(Mt|Jt, Bf) p(Jt|Wt) p(Bf|Wt) p(Wt)   [0.6 × 0.2 × 0.2 × 0.28]
      + p(Mt|Jt, Bf) p(Jt|Wf) p(Bf|Wf) p(Wf)   [0.6 × 0.9 × 0.9 × 0.72]
      = 0.444

H. Bruyninckx, The Bayesian approach 57

3. t2 = Σ_{W,B,M} p(W, Jt, B, M)
      = Σ_{W,B,M} p(M|Jt, B) p(Jt|W) p(B|W) p(W)
      = ...   (completed numerically in the sketch below)

H. Bruyninckx, The Bayesian approach 58
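A small enumeration sketch that reproduces t1 and finishes the t2 sum left open above. The conditional-probability values are the ones shown on slide 57; the table p(M|J = f, B) is not needed (and not given), because the query fixes J = t.

```python
# Enumeration over p(W, J, B, M) = p(W) p(J|W) p(B|W) p(M|J, B), with J fixed to True.
p_W      = {True: 0.28, False: 0.72}
p_Jt_W   = {True: 0.2,  False: 0.9}     # p(J = t | W)
p_Bt_W   = {True: 0.8,  False: 0.1}     # p(B = t | W)
p_Mt_JtB = {True: 0.8,  False: 0.6}     # p(M = t | J = t, B)

def joint_Jt(w, b, m):
    """p(W=w, J=t, B=b, M=m), using only the table entries quoted on the slide."""
    pb = p_Bt_W[w] if b else 1 - p_Bt_W[w]
    pm = p_Mt_JtB[b] if m else 1 - p_Mt_JtB[b]
    return p_W[w] * p_Jt_W[w] * pb * pm

vals = (True, False)
t1 = sum(joint_Jt(w, b, True) for w in vals for b in vals)
t2 = sum(joint_Jt(w, b, m) for w in vals for b in vals for m in vals)
print(round(t1, 3), round(t2, 3), round(t1 / t2, 3))   # 0.444 0.704 0.631
```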


Bayesian mathematics: Jaynes' axiomatic foundation

Axioms for plausible Bayesian inference

I. Degrees of plausibility are represented by real numbers.
II. Qualitative correspondence with common sense.
III. If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
IV. Always take into account all of the evidence one has.
V. Always represent equivalent states of knowledge by equivalent plausibility assignments.

H. Bruyninckx, The Bayesian approach 59

Critique on Bayesian axioms

- Assigning equivalent probabilities to equivalent states seems to assume that the modeller has "absolute knowledge," since one often doesn't know that states are equivalent.
- Many people state that representing information by one single real number is not always possible or desirable.
- Intuitive interpretation(?).

H. Bruyninckx, The Bayesian approach 60

Bayesian mathematics: Cox's derivation

Plausible functional relationships:

p(A, B) = f{ p(A), p(B|A) }

⇒ f{ p(A), p(B|A) } = f{ p(A) }^m · f{ p(B|A) }^m

(Arbitrary choices: m = 1, f(u) = u.)

Negation: there exists a function g such that

g( g( p(A) ) ) = p(A)

g( p(A OR B) ) = g( p(A) ) AND g( p(B) )

H. Bruyninckx, The Bayesian approach 61


Information measures

- "How much information do I have?"
- Absolute measure = reduce the PDF to one scalar.
- Relative measure = what is the information difference between two PDFs? (one scalar)
- Global measure = measures the whole PDF.
- The choice of information measure is arbitrary, but:
   - Entropy is a natural measure...
   - ...which can be derived from first principles.

H. Bruyninckx, The Bayesian approach 62

Entropy example

[Figure: four discrete PDFs and their entropies: H = 1.6957, H = 1.8866, H = 1.6094, and H = 1.7480.]

H. Bruyninckx, The Bayesian approach 63
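As a minimal sketch (not from the slides), the Shannon entropy of a discrete PDF can be computed directly; the first distribution below is an assumed example that reproduces the first H value in the figure, and the uniform case gives ln 5.

```python
import math

def entropy(p):
    """Shannon entropy H = -sum_i p_i ln(p_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(round(entropy([0.3, 0.2, 0.2, 0.1, 0.1, 0.1]), 4))   # 1.6957
print(round(entropy([0.2] * 5), 4))                        # 1.6094 = ln(5)
```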

Logarithm as information measure: Good's axioms (1966)

I. I(M : E AND F | C) = f{ I(M : E | C), I(M : F | E AND C) }
II. I(M : E AND M | C) = I(M | C)
III. I(M : E | C) is a strictly increasing function of its arguments
IV. I(M1 AND M2 : M1 | C) = I(M1 : M1 | C) if M1 and M2 are mutually irrelevant pieces of information
V. I(M1 AND M2 | M1 AND C) = I(M2 | C)

⇓ ... straightforward but tedious ... ⇓

The logarithm of a PDF is an additive measure of the information contained in the PDF.

H. Bruyninckx, The Bayesian approach 64


Example of a logarithmic measure: Shannon entropy

Interpretation:

- Physical: measures "uncertainty," "disorder," "chaos."
- Bayesian: measures our (lack of) information, not the physical system's lack of order.
- Composition of information = multiplicative. Composition of measures = additive (via the logarithm).

First case: discrete PDF

- {x1, ..., xn}, with PDF p(x) = {p1, ..., pn}.
- Problem: define a measure H(p) of the PDF p(x).

H. Bruyninckx, The Bayesian approach 65

Entropy information measure: Axiomatic approach

Discrete distributions:
I. H is a continuous function of p.
II. If all n probabilities pi are equal (and hence equal to 1/n, if we choose them to sum to 1), the entropy H(1/n, ..., 1/n) is a monotonically increasing function of n.
III. H is an invariant, i.e., the uncertainty should not depend on how one orders or groups the elements xi.

H. Bruyninckx, The Bayesian approach 66

Invariance of entropy: Mathematical discussion

Invariance w.r.t. grouping of variables, expressed mathematically as:

H(p1, ..., pn) = H(w1, w2, ...)
               + w1 H(p1/w1, ..., pk/w1)
               + w2 H(p(k+1)/w2, ..., p(k+m)/w2)
               + ...,

where w1 is the probability of the set {x1, ..., xk}, w2 is the probability of the set {x(k+1), ..., x(k+m)}, ...

H. Bruyninckx, The Bayesian approach 67


[Figure: the probabilities p1, ..., pn grouped into sets with probabilities w1, w2, w3.]

H. Bruyninckx, The Bayesian approach 68

Invariance of entropy: Example

p1 = 1/2, p2 = 1/3, p3 = 1/6

Choose a grouping into sets: w1 = p1 = 1/2, w2 = p2 + p3 = 1/3 + 1/6 = 1/2

Then: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)

H. Bruyninckx, The Bayesian approach 69
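A quick numerical check of this grouping invariance, as a sketch (the numbers are the ones on the slide):

```python
import math

def H(*p):
    """Shannon entropy of a discrete PDF given as positional probabilities."""
    return -sum(pi * math.log(pi) for pi in p)

lhs = H(1/2, 1/3, 1/6)
rhs = H(1/2, 1/2) + 0.5 * H(2/3, 1/3)
print(round(lhs, 6), round(rhs, 6))   # both sides give the same value
```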

Continuity & monotonicity of entropy: Mathematical discussion

- Continuous → rational numbers: pi = ni / Σ_{j=1}^{n} nj
- Uniform P = (1/N, ..., 1/N) → H(p) = H(P)   (group into n1, n2, etc.)
- ⇒ H(n) + H(m) = H(mn)
- ⇒ H(n) = K ln(n)

H. Bruyninckx, The Bayesian approach 70


Entropy formulas

H(p1, ..., pn) = −K Σ_i pi ln(pi).

- ln(pi) < 0, because (ni/Σ ni) < 1.
- lim_{pi→0} pi ln(pi) = 0.
- H increases with uncertainty. (Uncertainty ≈ lack of a clear peak.)
- There is no absolute zero for entropy.
- Be careful when comparing the entropies of two PDFs: they must be defined over the same domain.

H. Bruyninckx, The Bayesian approach 71

Entropy for continuous PDF

- H(1/N, ..., 1/N) is not well defined for N → ∞.
- PDF1 = p(x) dx, PDF2 = p(y) dy: x and y span the same parameter space!
- Take ni/Σ ni in an interval: density.
- Ratios of densities make sense (locally):

  (ni / Σ ni) / (mj / Σ mj) → dx / dy

H. Bruyninckx, The Bayesian approach 72

Mutual information - Relative entropy
Kullback-Leibler divergence

H(p, q) = −∫ p(x) ln( p(x) / q(x) ) dx.

- Asymmetry between p(x) and q(x).
- Asymmetry ⇒ not a "distance" function! (A distance is independent of direction.)
- Corresponds to intuition when going between two states of information.

H. Bruyninckx, The Bayesian approach 73
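A small sketch of the asymmetry, for two assumed discrete PDFs. Note that the code computes the conventional non-negative divergence KL(p||q) = Σ p ln(p/q); the slide's H(p, q) is its negative.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p||q) for discrete PDFs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(round(kl(p, q), 4), round(kl(q, p), 4))   # two different values: not a distance
```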


Local information measure - Fisher information

(No mathematical details!)
- Logarithm-based measures are global!
- Local measure = rate of change of information in a given direction v of local change in the PDF:

  H(PDF : PDF + εv) = (1/2) Σ_{i,j} g_ij(PDF) v_i v_j + O(ε²).

  Sensitivity of the variance to small changes in the mean.
- Fisher information H is a matrix, not a scalar ⇒ scalar measures = f(H).

H. Bruyninckx, The Bayesian approach 74

Information measures: Example of Gaussian PDFs

N(µ, P) = 1 / (√((2π)^n) ‖P‖^(1/2)) · exp{ −(1/2) (x − µ)^T P^(−1) (x − µ) }.

- Quadratic measure for the "error" x − µ:

  (1/2) (x − µ)^T P^(−1) (x − µ)

  (= square of the "Mahalanobis distance")
- The inverse covariance matrix P^(−1) is the Fisher information.
- It is additive after taking the logarithm of the PDF!

H. Bruyninckx, The Bayesian approach 75
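A minimal numerical sketch of these quantities for an assumed 2-D Gaussian: the quadratic term is half the squared Mahalanobis distance, and the log-density is additive when independent PDFs are multiplied.

```python
import numpy as np

def gaussian_logpdf(x, mu, P):
    """Log of N(mu, P); uses P^{-1} (the information matrix) in the quadratic term."""
    d = x - mu
    maha2 = d @ np.linalg.inv(P) @ d          # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(P)
    return -0.5 * (maha2 + logdet + len(mu) * np.log(2 * np.pi))

mu = np.array([0.0, 0.0])
P = np.array([[4.0, 0.0],
              [0.0, 1.0]])
x = np.array([2.0, 1.0])
print(gaussian_logpdf(x, mu, P))
```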

Measures inspired by P^(−1)

P^(−1) is a matrix measure ⇒ derived scalar measures:

- Trace.
- Determinant.
- Ratio of singular values.
- ...

−− No unique choice.
++ Computationally efficient.

Note: the inverse P^(−1) of the Covariance Matrix is called the Information Matrix.

H. Bruyninckx, The Bayesian approach 76


Ignorance—No information

- Bayesian approach: a state of "no information" does not exist. Jaynes: "Merely knowing the physical meaning of our parameters in a model already constitutes highly relevant prior information which our intuition is able to use at once."
- Approaches:
   - "Uniform" prior. (Note: impossible on infinite domains.)
   - Requirement of invariance under transformations ⇒ Maximum Entropy (MaxEnt) PDF.
   - Explicit "I don't know" hypothesis.

H. Bruyninckx, The Bayesian approach 77

Bayes rule is information-preserving

- The same Data explained by models M1 and M2:

  p(M1|Data) = [ p(Data|M1) / p(Data) ] p(M1),
  p(M2|Data) = [ p(Data|M2) / p(Data) ] p(M2),

  ⇒ log [ p(M1|Data) / p(M2|Data) ] = log [ p(Data|M1) / p(Data|M2) ] + log [ p(M1) / p(M2) ].

⇒ Bayes' rule does not add or remove (logarithmic) information.

H. Bruyninckx, The Bayesian approach 78

Goodness-of-fit vs Model complexity

Goodness-of-fit/Likelihood: how well can a given Model explain the Data?

The complexity of a model is higher if:

- it has a higher-dimensional state space.
- it has more parameters in its PDF representation.

⇒ the same Data is less likely in a more complex model.

How to trade off complexity vs explanatory power of two models?
⇒ Occam's razor, Bayes' factor, Occam's factor, AIC, BIC, MDL, ...

H. Bruyninckx, The Bayesian approach 79


Occam’s razor

Pluralitas non est ponenda sine necessitate.

(Given two equally predictive models, choose thesimpler one.)

Example: series −1, 3, 7, 11; what’s next?

Model 1: 15, 19, ...: f(x) = x + 4.

Model 2: −19.9, 1043.8, ...:

f(x) = −x³/11 + 9x²/11 + 23/11

f(−1) = 3, f(3) = 7, f(7) = 11, f(11) = −19.9, ...

Model 1: 1 integer; Model 2: 3 rational numbers.

H. Bruyninckx, The Bayesian approach 80
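A quick check of the slide's "Model 2" polynomial, as a sketch: it reproduces the given series exactly but continues it very differently from f(x) = x + 4.

```python
# "Model 2" from the slide: a cubic that also fits -1, 3, 7, 11.
def f2(x):
    return -x**3 / 11 + 9 * x**2 / 11 + 23 / 11

print([round(f2(x), 1) for x in [-1, 3, 7, 11]])   # [3.0, 7.0, 11.0, -19.9]
```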

Bayesian model selection ("Hypothesis test")

Explicative capacity of two models M1 and M2 for the same Data:

p(M1|Data) / p(M2|Data) = [ p(Data|M1) / p(Data|M2) ] × [ p(M1) / p(M2) ]
                        =        Bayes' factor        ×     prior odds.

(A single scalar when the Data is given!)
The Bayes' Factor penalizes complex models automatically in the trade-off between predictive power and complexity (see later: Occam's factor).

H. Bruyninckx, The Bayesian approach 81

Bayes’ Factor

"Average out" the influence of all parameters θi in model Mi:

p(Data|M1) / p(Data|M2) = [ ∫ p(Data|θ1, M1) p(θ1|M1) dθ1 ] / [ ∫ p(Data|θ2, M2) p(θ2|M2) dθ2 ],

where in each integral p(Data|θ, M) is the "data fit" and p(θ|M) dθ is the "Occam's factor."

The Bayes' factor is a ratio of two scalars, so comparison is possible.

Occam's Factor: the complexity part of the Bayes' Factor.

p(θ1|M1): after inference ("training") with the Data.

H. Bruyninckx, The Bayesian approach 82


Occam’s factor

= the complexity measure in the Bayes' factor:

p(Data|M) = ∫ p(Data|θ, M) p(θ|M) dθ,

where p(Data|θ, M) is the "data fit" and p(θ|M) dθ is the "Occam's factor."

The larger the θ parameter space (i.e., the more complex the model), the smaller the region p(θ|M) dθ where the largest probability mass is found. (In case the model has "converged.")

Occam's factor includes the density dθ!

H. Bruyninckx, The Bayesian approach 83

Bayes’ factor implementations

Akaike Information Criterion (AIC):

−2 ln p(Data|θ_ml, M) + 2k,   (k = # parameters).

Bayesian Information Criterion (BIC):

−2 ln p(Data|θ_ml, M) + k ln(N),   (N = # data samples).

Minimum Description Length (MDL):

−ln p(Data|θ_ml, M) + (k/2) ln(N).

(Beware: these measures do not take the density into account!)

H. Bruyninckx, The Bayesian approach 84
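A minimal sketch of these criteria as plain functions of a maximized log-likelihood; the numbers in the example call are illustrative assumptions.

```python
import math

def aic(loglik, k):
    return -2 * loglik + 2 * k

def bic(loglik, k, n):
    return -2 * loglik + k * math.log(n)

def mdl(loglik, k, n):
    return -loglik + 0.5 * k * math.log(n)

# Assumed example: a 3-parameter model, 100 data samples, log-likelihood -120.
print(aic(-120.0, 3), bic(-120.0, 3, 100), mdl(-120.0, 3, 100))
```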

Examples up to now: simple!

- Static network: all PDFs & nodes are fixed.
- PDFs are binary: true, false.
- Fully observed: all PDF tables are given.
- Only estimation, no learning.

But the real world is more complex:

- Dynamic networks: PDFs change over time, because new "data" comes in from the world.
- PDFs: continuous functions.
- Partially observed: some PDFs are unknown.
- Learning and estimation.

H. Bruyninckx, The Bayesian approach 85


What makes uncertainty reasoning "Bayesian"?

- Use explicit models.
- Uncertainty = PDF.
- Allow a priori knowledge.
- Inference = calculate the marginals of unobserved variables, conditional on the observed variables.
- Use Bayes' rule instead of only the likelihood.
- Estimate a PDF instead of "point estimates". (E.g., Maximum Likelihood, Maximum A Posteriori.)

H. Bruyninckx, The Bayesian approach 86

Maximum Likelihood Estimator - Maximum A Posteriori

- Likelihood (from Bayes' rule):

  L(Model params) = p(Data|Model params).

- Maximum Likelihood Estimator (MLE):

  arg max_{Model params} L(Model params).

- Maximum A Posteriori (MAP):

  arg max_{Model params} p(Model params|Data).

H. Bruyninckx, The Bayesian approach 87

MLE, MAP (2)

+ Often easy to calculate.

− Only point estimate.

− Doesn’t take density dx into account.

− Not very robust against small changes in PDF.

H. Bruyninckx, The Bayesian approach 88


Bayesian Learning

Things to learn:

- Structure of the graphical model.
- Parameters of the PDFs in the model.

Four categories, from simple to complex:

1. known model + full observations. (Done already! Only queries.)
2. known model + partial observations. (Inference of unobserved parameters.)

3. unknown model + full observations.

4. unknown model + partial observations.

H. Bruyninckx, The Bayesian approach 89

Bayesian Estimation

Treat parameters as additional unobserved variables, compute a full posterior distribution over all nodes conditional on the observed data, then integrate out the parameters.

Point estimate x = X from the PDF p(x, y, z, ...):

0. Learn p(x, y, z, ...).
1. Marginalize out all other variables:

   p(x) = ∫ p(x, y, z, ...) dy dz ...

2. (If a point estimate is needed:) Weighted average over x (the posterior mean):

   X = ∫ x p(x) dx.

H. Bruyninckx, The Bayesian approach 90

Bayesian model learning (Revisited)

Model learning = estimation in a (much) larger space:

- New model = combination of primitive models.
- Combinations are represented by parameters.
- The combination parameters are learned.

Model learning is an order of magnitude more complex than parameter estimation!

H. Bruyninckx, The Bayesian approach 91


Approximations

- Modelling:
   - Choice of model parameters.
   - Choice of graphical model structure (= stochastic dependencies).
   - Choice of PDF family.
- Assumptions:
   - Independence in joint PDFs.
   - Independence over time steps: Markov property. (x(k + 1) = f(x(k)).)
   - Choice of priors.

Always requires application-dependent insights!

H. Bruyninckx, The Bayesian approach 92

Computational simplifications

- Integral needed during marginalization:
   - Replace the integral by a finite sum.
   - (Sample-based) Monte Carlo integration.
- Use ML instead of the full PDF.
- Use explicit input/output relationships (y = f(x)) instead of implicit ones (h(x, y) = 0).
- Linearize non-linear relationships.
- Adapt the scheduling of the inference computations.
- Replace some steps by analytic approximations.
- ...

Always requires application-dependent insights!

H. Bruyninckx, The Bayesian approach 93

Overview of “Bayesian” algorithms

0. Answering queries in a Bayes net.
   - The network is given.
   - All PDFs are known.
   - Already explained earlier in the course.

1. General inference in graphical models = update the PDFs when new data comes in:
   - Bayesian network: Message passing
   - Markov Random Field: Junction tree

H. Bruyninckx, The Bayesian approach 94


Overview “Bayesian” algorithms (2)

2. Recursive estimation = inference + estimation in a given model when new data comes in at regular intervals:
   - Kalman Filter.
   - Particle Filter.
   - Hidden Markov Model (Baum-Welch, Viterbi).

3. Model approximation = construct the model that fits the data "best":
   - Variational Bayes.
   - Expectation-Maximization (EM).

H. Bruyninckx, The Bayesian approach 95

Inference on a Bayes Network: Message passing - Belief propagation

- Propagation of new information from one node to all others via message passing in the forward and backward direction of the directed graph.
- Messages are stored locally (at the nodes) in order not to repeat some calculations.
- More efficient algorithms for graphs with more structure (tree, DAG, ...) are limit cases of this message passing.
- No details given in this course...
- Further reading: Bishop 2006 (Chapter 8).

H. Bruyninckx, The Bayesian approach 96

Markov blankets = all nodes influencing a given node

MRF: all neighbours.

Bayes network: parents, children, and children's parents.

H. Bruyninckx, The Bayesian approach 97


H. Bruyninckx, The Bayesian approach 98

Inference on a general graph - Junction Tree

- Transform the MRF graphical model into a tree by taking nodes together.
- The transformed model is not unique! (Different arbitrary choices are possible.)
- Too complex for this introductory course...
- Approximation via Loopy Belief Propagation = apply Belief Propagation although the graph is not a DAG.

H. Bruyninckx, The Bayesian approach 99

Dynamic/Recursive network: 1st-order Markov

"Arrows":

X_{k+1} = f(X_k, U_{k+1})
Y_k = g(X_k, U_k)

"1st-order Markov" = the time influence spans only one step.

H. Bruyninckx, The Bayesian approach 100


Dynamic/Recursive network: 2nd-order Markov

H. Bruyninckx, The Bayesian approach 101

Simplest dynamic network: Kalman Filter

Required inference = given Y and U, update X.

Assumptions made:

- 1st-order Markov.
- PDF representation: Gaussian distributions!
- Functional relationships f(·), g(·): linear!

⇒ an analytical solution is possible for Bayes' rule ("learning") and marginalization ("estimation")!

H. Bruyninckx, The Bayesian approach 102

Kalman Filters (cont’d)

Typical applications: tracking (= adapting to small deviations from previous values).

Simplifications to reduce computations (a minimal numeric sketch follows below):

- State-space model: x_{k+1} = F x_k + q_k.
- Gaussian "uncertainty" on x_k: covariance P_k.
- Gaussian "process noise" q_k: covariance Q_k.
- Data: z_k = H x_k + r_k.
- Gaussian "measurement uncertainty" r_k on z_k: covariance R_k.

H. Bruyninckx, The Bayesian approach 103
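A minimal 1-D sketch of one predict/correct cycle under these assumptions (linear model, Gaussian noise); all numbers are illustrative, not taken from the slides.

```python
# One-dimensional Kalman filter: prediction with x_{k+1} = F x_k + process noise (cov Q),
# correction with z_k = H x_k + measurement noise (cov R).
def kf_step(x, P, z, F=1.0, Q=0.01, H=1.0, R=0.25):
    x_pred = F * x
    P_pred = F * P * F + Q
    K = P_pred * H / (H * P_pred * H + R)   # Kalman gain
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1 - K * H) * P_pred
    return x_new, P_new

x, P = 0.0, 1.0                             # assumed initial estimate and variance
for z in [0.9, 1.1, 1.05]:                  # assumed measurements
    x, P = kf_step(x, P, z)
print(x, P)                                 # estimate near 1.0, with shrinking variance
```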


H. Bruyninckx, The Bayesian approach 104

Kalman Filters - Further reading

To derive the Kalman Filter from Bayes' rule:
- Ho, Yu-Chi and Lee, R. C. K., A Bayesian Approach to Problems in Stochastic Estimation and Control, IEEE Transactions on Automatic Control, 1964.

To derive the Kalman Filter as an optimal information processor:
- Zellner, Arnold, Optimal Information Processing and Bayes's Theorem, The American Statistician, 42:278-284, 1988.

To derive the Kalman Filter as a recursive least-squares filter:
- Kalman, R. E. and Bucy, R. S., New results in linear filtering and prediction theory, Transactions of the ASME, Journal of Basic Engineering, 83:95-108, 1961.

H. Bruyninckx, The Bayesian approach 105

Sample-based inference (Monte Carlo methods)

Terminology: Sequential Monte Carlo, bootstrap filter, condensation algorithm, particle filter, ...

Approximated PDF = samples with a weight:

p(x_k|z_k) → {x_k^i, w_k^i}_{i=1}^{Ns},   Σ_i w_k^i = 1.

Operations on PDFs (marginalization, Bayes' rule, ...) → operations on samples. For example:

φ̄ = ∫ φ(x) p(x) dx ≈ (1/N) Σ_{i=1}^{N} φ(x^i) = Σ_{i=1}^{N} w^i φ(x^i).

H. Bruyninckx, The Bayesian approach 106


Importance of sample-based methods

- Can approximate any PDF...
- ...including PDFs with mixed discrete-continuous parameters.
- Sequential method: the Markov property leads to reduced computational complexity.
- Used where Kalman Filters are too "unimodal."
- Accuracy doesn't depend on the dimension of the space!

H. Bruyninckx, The Bayesian approach 107

How to generate samples?

- Inversion sampling.
- Proposal-based sampling:
   - Importance sampling.
   - Rejection sampling.
- Markov Chain Monte Carlo (MCMC) sampling:
   - Metropolis-Hastings sampling.
   - Gibbs sampling.

All methods use Random Number Generators.
All methods assume that evaluating p(x) is cheap.

H. Bruyninckx, The Bayesian approach 108

Inversion sampling via the CDF

- (Only) sampling from a uniform PDF is simple.
- Searching in the Cumulative Distribution Function can be costly.
- Only some families of PDFs allow simple inversion sampling.

H. Bruyninckx, The Bayesian approach 109
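A minimal sketch of inversion sampling for one of the families that do allow it (the exponential PDF, an assumed example): uniform samples are pushed through the inverse CDF.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=10_000)
x = -np.log(1 - u) / lam        # inverse of the exponential CDF F(x) = 1 - exp(-lam * x)
print(x.mean())                 # close to the true mean 1/lam = 0.5
```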


Importance sampling

[Figure: a Beta target PDF and a Gaussian proposal PDF, with samples drawn from each.]

- Sample from a simpler proposal: q(x).
- Adjust the weight of each sample according to p(x)/q(x).
- q(x) should be "similar" to p(x).

H. Bruyninckx, The Bayesian approach 110
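A minimal sketch of the weighting step, with an assumed Gaussian target and a broader Gaussian proposal (the slide's own figure uses a Beta target, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

mu_p, sigma_p = 1.0, 0.5    # target p(x)
mu_q, sigma_q = 0.0, 2.0    # broad proposal q(x), "similar enough" to p(x)

xs = rng.normal(mu_q, sigma_q, size=20_000)
w = gauss_pdf(xs, mu_p, sigma_p) / gauss_pdf(xs, mu_q, sigma_q)
w /= w.sum()                # self-normalized weights, sum_i w_i = 1
print((w * xs).sum())       # estimate of E_p[x], close to the target mean 1.0
```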

Rejection sampling

[Figure: rejection sampling of a Student-t target PDF under a scaled Gaussian proposal.]

- Sample from a simpler proposal q(x).
- q(x) should be "similar" to p(x).
- Draw a height uniformly under c q(x), with c > 1 chosen so that c q(x) covers p(x).
- Accept the sample if the height is smaller than p(x).

H. Bruyninckx, The Bayesian approach 111
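A minimal sketch of the accept/reject step; the target here is an assumed bimodal mixture (not the Student-t of the figure), chosen so that c·q(x) really does cover p(x).

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p(x):                                   # target: mixture of two Gaussians
    return 0.5 * gauss(x, -1.5, 0.5) + 0.5 * gauss(x, 1.5, 0.5)

def q(x):                                   # proposal: one broad Gaussian
    return gauss(x, 0.0, 2.0)

c = 3.0                                     # chosen so that c * q(x) >= p(x) everywhere
xs = rng.normal(0.0, 2.0, size=50_000)
u = rng.uniform(size=xs.size)
accepted = xs[u < p(xs) / (c * q(xs))]
print(accepted.size / xs.size)              # acceptance rate, roughly 1/c
```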

MCMC sampling (Markov Chain Monte Carlo)

- Finding an appropriate "proposal density" q(x) is difficult! Especially in higher dimensions.
- The PDF is found as the limit of a Markov Chain.
- It can take a long time before the chain has "converged."
- Two major approaches:
   - Metropolis-Hastings sampling.
   - Gibbs sampling.

H. Bruyninckx, The Bayesian approach 112


Example: Visual feature tracking

Movie

- Low-level visual features are tracked.
- "Statistical weight" is built up over time.
- Some features become permanent.

http://www.doc.ic.ac.uk/~ajd/

H. Bruyninckx, The Bayesian approach 113

Example: robot localization with a laser distance scanner

Movie

- Uniform initial sample set.
- Some rooms are almost identical → multiple "peaks."

http://www.cs.washington.edu/ai/Mobile_Robotics/mcl/

H. Bruyninckx, The Bayesian approach 114

Hidden Markov model
The system behaves as a State Machine:

(Source: Wikipedia)

- x: states.
- y: possible observations.
- a: state transition probabilities.
- b: observation probabilities.

This is not a graphical model!

Example: speech recognition.

H. Bruyninckx, The Bayesian approach 115


Graphical model of HMM

“Easy” calculation direction:

p(Y) = Σ_X p(Y|X) p(X).

Even this explodes for real systems. . .

H. Bruyninckx, The Bayesian approach 116

Fastest Bayesian Matching - Viterbi Algorithm

(Closely related to the forward algorithm, but with a maximization instead of a sum.)

- Given the HMM model parameters (transition and output probabilities), find the most likely sequence of (hidden) states (e.g., spoken words) which could have generated a given output sequence (e.g., a measured acoustic signal).
- Maximum Likelihood, not the full PDF:

  argmax p(hidden state sequence|observed sequence).

H. Bruyninckx, The Bayesian approach 117

HMM properties that avoid exhaustive search:

- observed and hidden events: in aligned sequences. (= form of graphical model)
- an observed event corresponds to exactly one hidden event. (= the state dynamics is a Finite State Machine)
- computing the most likely hidden sequence up to time t only depends on (i) the observed event at t, and (ii) the most likely sequence up to time t − 1. (= dynamic programming)

H. Bruyninckx, The Bayesian approach 118
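A minimal Viterbi sketch (max-product dynamic programming) for a small discrete HMM; the two-state model and its tables are illustrative assumptions, not part of the slides.

```python
import numpy as np

states = ["Rain", "Sun"]
pi = np.array([0.6, 0.4])                   # initial state probabilities
A = np.array([[0.7, 0.3],                   # A[i, j] = p(next state j | state i)
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],              # B[i, k] = p(observation k | state i)
              [0.6, 0.3, 0.1]])
obs = [0, 1, 2, 2]                          # observed symbol indices

n, T = len(states), len(obs)
delta = np.zeros((T, n))                    # best path probability ending in each state
psi = np.zeros((T, n), dtype=int)           # back-pointers

delta[0] = pi * B[:, obs[0]]
for t in range(1, T):
    for j in range(n):
        trans = delta[t - 1] * A[:, j]
        psi[t, j] = trans.argmax()
        delta[t, j] = trans.max() * B[j, obs[t]]

path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):               # trace the back-pointers
    path.insert(0, int(psi[t, path[0]]))
print([states[s] for s in path])            # most likely hidden state sequence
```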


Fastest Bayesian Learning - Baum-Welch

(Also known as: the forward-backward algorithm.)

Learning of the probabilities a_ij and b_ij in an HMM:

- Maximum Likelihood, not the full PDF.
- Makes forward steps starting from assumed ML parameters.
- Compares them to the measured outputs.
- Makes backward steps to adapt the ML parameters.
- Until convergence, to a local optimum!

H. Bruyninckx, The Bayesian approach 119

Approximate learning algorithm - Variational Bayes

- The ever-returning problem: estimate p(z|x), where x are the observed variables ("data") and z are the hidden variables.
- Approximate the "true" p(z|x) by a member q(z|θ) of a θ-parameterized family of PDFs. For example: a sum of Gaussians, with the weights as parameters.
- The approximation is guided by minimization of the Kullback-Leibler divergence between the true p(z|x) and the approximation q(z|θ).

H. Bruyninckx, The Bayesian approach 120

Approximate learning algorithm—Expectation-Maximization—

- EM is an "abstract algorithm", not really executable in itself.
- Gives a Maximum Likelihood estimate of θ in q(z|θ).
- Works in two steps:
   - Expectation step.
   - Maximization step.
- Can be proven to converge...
- ...but not necessarily to the global optimum.

H. Bruyninckx, The Bayesian approach 121


Outline of EM
The data x are given; θ and z are alternately updated.

Expectation step k:
- Assume θ has the given value θ(k).
- Then p(z|x, θ(k)) results from inference.
- The "total" log-likelihood function log p(x, z|θ) is a function of z and θ.
- Marginalizing z away gives a function of θ:

  Q(θ) = ∫ p(z|x, θ(k)) log p(x, z|θ) dz.

Maximization step k: find the maximum of Q(θ):

  θ(k + 1) = arg max_θ Q(θ).

H. Bruyninckx, The Bayesian approach 122
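A minimal concrete instance of these two steps, as a sketch for a two-component 1-D Gaussian mixture (the mixture model and its numbers are assumptions): z is the hidden component label and θ the mixture parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: responsibilities p(z = j | x_i, theta(k))
    r = np.stack([w[j] * gauss(x, mu[j], sigma[j]) for j in range(2)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: theta(k+1) maximizes the expected complete-data log-likelihood Q(theta)
    Nj = r.sum(axis=0)
    w = Nj / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nj
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)

print(w.round(2), mu.round(2), sigma.round(2))   # roughly (0.3, 0.7), (-2, 3), (1, 1)
```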

Alternative explanation of EM

System spaces:

- Total space S: all possible PDFs over the state space, modelled by parameters φ.
- Model space M: all PDFs that are in the θ family, i.e., a subspace of S, represented by parameters φ(θ).
- Data space D: all PDFs p(x, z) that are possible given the observed + hidden parameters.

D need not be part of M!

H. Bruyninckx, The Bayesian approach 123

EM = Iterative Maximum Likelihood steps:

- Inference with the last P → Q ∈ D.
- Find the "closest" representative in M = "project" the PDF Q onto M = P ∈ M : min_P KL(Q, P). (KL = Kullback-Leibler divergence)

H. Bruyninckx, The Bayesian approach 124


Graphical model learning

Unknown model + full or partial observations

Much harder than parameter learning!

Generic approach to model learning: find the "best" parameters in combinations of more primitive sub-models.

Major problem: creating the sub-models + data association of the (hidden) data to the sub-models.

Major algorithm: EM, with more complex θ’s:

1. discrete selection of sub-models.

2. continuous information in each selection.

H. Bruyninckx, The Bayesian approach 125

H. Bruyninckx, The Bayesian approach 126