Page 1:

A Tutorial on Dynamic Bayesian Networks

Kevin P. Murphy

MIT AI lab

12 November 2002

Page 2:

Modelling sequential data

• Sequential data is everywhere, e.g.,
  – Sequence data (offline): Biosequence analysis, text processing, ...
  – Temporal data (online): Speech recognition, visual tracking, financial forecasting, ...
• Problems: classification, segmentation, state estimation, fault diagnosis, prediction, ...
• Solution: build/learn generative models, then compute P(quantity of interest | evidence) using Bayes rule.

Page 3:

Outline of talk

• Representation

– What are DBNs, and what can we use them for?

• Inference

– How do we compute P (Xt|y1:t) and related quantities?

• Learning

– How do we estimate parameters and model structure?

Page 4:

Representation

• Hidden Markov Models (HMMs).

• Dynamic Bayesian Networks (DBNs).

• Modelling HMM variants as DBNs.

• State space models (SSMs).

• Modelling SSMs and variants as DBNs.

Page 5:

Hidden Markov Models (HMMs)

• An HMM is a stochastic finite automaton, where each state generates (emits) an observation.
• Let Xt ∈ {1, . . . , K} represent the hidden state at time t, and Yt represent the observation.
• e.g., X = phones, Y = acoustic feature vector.
• Transition model: A(i, j) ≜ P(Xt = j | Xt−1 = i).
• Observation model: B(i, k) ≜ P(Yt = k | Xt = i).
• Initial state distribution: π(i) ≜ P(X0 = i).
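To make the notation concrete, here is a minimal Python/NumPy sketch of these three parameter sets, plus ancestral sampling from the model. It is illustrative only; the particular numbers and the sample_hmm helper are not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition model: A[i, j] = P(X_t = j | X_{t-1} = i); rows sum to 1.
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
# Observation model: B[i, k] = P(Y_t = k | X_t = i).
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
# Initial state distribution: pi[i] = P(X_0 = i).
pi = np.array([1.0, 0.0, 0.0])

def sample_hmm(T):
    """Ancestral sampling of (x_1:T, y_1:T) from the HMM."""
    xs, ys = [], []
    x = rng.choice(len(pi), p=pi)
    for _ in range(T):
        xs.append(x)
        ys.append(rng.choice(B.shape[1], p=B[x]))
        x = rng.choice(len(A), p=A[x])
    return np.array(xs), np.array(ys)

xs, ys = sample_hmm(10)
```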

Page 6:

HMM state transition diagram

• Nodes represent states.

• There is an arrow from i to j iff A(i, j) > 0.

[Figure: state transition diagram with nodes 1, 2, 3; arcs are labelled with the probabilities 1.0, 0.9, 0.8, 0.2, 0.1.]

Page 7:

The 3 main tasks for HMMs

• Computing likelihood: P(y1:t) = Σ_i P(Xt = i, y1:t).
• Viterbi decoding (most likely explanation): argmax_{x1:t} P(x1:t | y1:t).
• Learning: θ̂ML = argmax_θ P(y1:T | θ), where θ = (A, B, π).
  – Learning can be done with Baum-Welch (EM).
  – Learning uses inference as a subroutine.
  – Inference (forwards-backwards) takes O(TK^2) time, where K is the number of states and T is sequence length.
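The likelihood computation above is just the forward pass of forwards-backwards. A minimal sketch, reusing the A, B, pi arrays from the earlier sketch; the per-step rescaling is a standard trick, not something spelled out on the slide.

```python
import numpy as np

def forward_loglik(A, B, pi, ys):
    """Compute log P(y_1:T) with the forward algorithm in O(T K^2) time.

    alpha is kept proportional to P(X_t = i, y_1:t); we renormalise at each
    step and accumulate the log of the normalisers to avoid underflow.
    """
    alpha = pi * B[:, ys[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for y in ys[1:]:
        alpha = (alpha @ A) * B[:, y]
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik
```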

Page 8:

The problem with HMMs

• Suppose we want to track the state (e.g., the position) of D objects in an image sequence.
• Let each object be in K possible states.
• Then Xt = (Xt(1), . . . , Xt(D)) can have K^D possible values.
⇒ Inference takes O(T(K^D)^2) time and O(TK^D) space.
⇒ P(Xt|Xt−1) needs O(K^{2D}) parameters to specify.

Page 9:

DBNs vs HMMs

• An HMM represents the state of the world using a single discrete random variable, Xt ∈ {1, . . . , K}.
• A DBN represents the state of the world using a set of random variables, Xt(1), . . . , Xt(D) (factored/distributed representation).
• A DBN represents P(Xt|Xt−1) in a compact way using a parameterized graph.
⇒ A DBN may have exponentially fewer parameters than its corresponding HMM.
⇒ Inference in a DBN may be exponentially faster than in the corresponding HMM.

Page 10:

DBNs are a kind of graphical model

• In a graphical model, nodes represent random variables, and (lack of) arcs represents conditional independencies.
• Directed graphical models = Bayes nets = belief nets.
• DBNs are Bayes nets for dynamic processes.
• Informally, an arc from Xi to Xj means Xi “causes” Xj. (Graph must be acyclic!)

Page 11:

HMM represented as a DBN

[Figure: DBN for an HMM: a hidden chain X1 → X2 → X3 → X4, each Xt with an observed child Yt.]

• This graph encodes the assumptions Yt ⊥ Yt′ | Xt and Xt+1 ⊥ Xt−1 | Xt (Markov).
• Shaded nodes are observed, unshaded are hidden.
• Structure and parameters repeat over time.

Page 12:

HMM variants represented as DBNs

[Figure: DBNs for four HMM variants: plain HMM, mixture-of-Gaussians (MixGauss) HMM, input-output HMM (IO-HMM), and autoregressive HMM (AR-HMM).]

⇒ The same code can do inference and learning in all of these models.

Page 13:

Factorial HMMs

[Figure: a factorial HMM with several independent hidden chains Xt(1), . . . , Xt(D), all feeding a common observation Yt, shown over two time slices.]

Page 14:

Factorial HMMs vs HMMs

• Let us compare a factorial HMM with D chains, each with K values, to its equivalent HMM.
• Num. parameters to specify P(Xt|Xt−1):
  – HMM: O(K^{2D}).
  – DBN: O(DK^2).
• Computational complexity of exact inference:
  – HMM: O(TK^{2D}).
  – DBN: O(TDK^{D+1}).
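To see the gap concretely, here is a tiny illustrative calculation of these counts for D = 10 binary chains (the numbers are made up for the example, not from the slides):

```python
# Illustrative counts for a factorial HMM with D chains of K states each,
# observed for T time steps.
D, K, T = 10, 2, 100

hmm_params = K ** (2 * D)       # flat HMM transition matrix: 1,048,576 entries
dbn_params = D * K ** 2         # factorial DBN transition CPDs: 40 entries

hmm_ops = T * K ** (2 * D)      # exact inference on the flat HMM
dbn_ops = T * D * K ** (D + 1)  # exact inference exploiting the DBN structure
```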

Page 15:

Coupled HMMs

Page 16:

Semi-Markov HMMs

[Figure: a semi-Markov HMM in which states q1, q2, q3 each emit a segment of the observation sequence y1, . . . , y8.]

• Each state emits a sequence.
• Explicit-duration HMM: P(Yt−l+1:t | Qt, Lt = l) = Π_{i=1}^{l} P(Yt−l+i | Qt).
• Segment HMM: P(Yt−l+1:t | Qt, Lt = l) is modelled by an HMM or SSM.
• Multigram: P(Yt−l+1:t | Qt, Lt = l) is a deterministic string, and segments are independent.

Page 17:

Explicit duration HMMs

[Figure: DBN for an explicit-duration HMM with per-slice nodes Qt (state), Ft (segment-finished flag), Lt (duration counter), and Yt (observation), for t = 1, 2, 3.]

Page 18:

Segment HMMs

[Figure: DBN for a segment HMM: as in the explicit-duration HMM there are nodes Qt, Ft, Lt, plus a second-level state Qt(2) that generates the observation Yt, for t = 1, 2, 3.]

Page 19:

Hierarchical HMMs

• Each state can emit an HMM, which can generate sequences.
• Duration of segments implicitly defined by when sub-HMM enters finish state.

[Figure: a hierarchical HMM unrolled as a tree: top-level states invoke sub-HMMs whose states emit the observations Y1, . . . , Y7.]

Page 20:

Hierarchical HMMs

[Figure: DBN for a 3-level hierarchical HMM with state nodes Qt(1), Qt(2), Qt(3), finish flags Ft(2), Ft(3), and observations Yt, for t = 1, 2, 3.]

Page 21:

State Space Models (SSMs)

• Also known as linear dynamical system, dynamic linear model, Kalman filter model, etc.
• Xt ∈ R^D, Yt ∈ R^M, and

  P(Xt|Xt−1) = N(Xt; AXt−1, Q)
  P(Yt|Xt) = N(Yt; BXt, R)

• The Kalman filter can compute P(Xt|y1:t) in O(min{M^3, D^2}) operations per time step.
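For reference, a single Kalman filter step for this model is a few lines of NumPy. This is a generic textbook predict/update sketch, not code from the talk; A, B, Q, R are the matrices defined above, and the function name is just for illustration.

```python
import numpy as np

def kalman_step(mu, Sigma, y, A, B, Q, R):
    """Advance N(mu, Sigma) = P(X_{t-1} | y_1:t-1) to P(X_t | y_1:t)."""
    # Predict: push the belief through the linear-Gaussian dynamics.
    mu_pred = A @ mu
    Sigma_pred = A @ Sigma @ A.T + Q
    # Update: condition on the new observation y_t.
    S = B @ Sigma_pred @ B.T + R                  # innovation covariance
    K_gain = Sigma_pred @ B.T @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K_gain @ (y - B @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K_gain @ B) @ Sigma_pred
    return mu_new, Sigma_new
```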

Page 22:

Factored linear-Gaussian models produce sparse matrices

• Directed arc from Xt−1(i) to Xt(j) iff A(i, j) > 0.
• Undirected arc between Xt(i) and Xt(j) iff Σ^{−1}(i, j) > 0.
• e.g., consider a 2-chain factorial SSM with P(Xt(i) | Xt−1(i)) = N(Xt(i); Ai Xt−1(i), Qi), so that

  P(Xt(1), Xt(2) | Xt−1(1), Xt−1(2)) = N( [Xt(1); Xt(2)]; [A1 0; 0 A2] [Xt−1(1); Xt−1(2)], Σ ), where Σ^{−1} = [Q1^{−1} 0; 0 Q2^{−1}].

Page 23:

Factored discrete-state models do NOT produce sparse transition matrices

e.g., consider a 2-chain factorial HMM:

P(Xt(1), Xt(2) | Xt−1(1), Xt−1(2)) = P(Xt(1) | Xt−1(1)) P(Xt(2) | Xt−1(2))

[Figure: the two 2×2 per-chain transition matrices over {0, 1} combine into a 4×4 joint transition matrix over {00, 01, 10, 11} that is, in general, fully dense.]

Page 24:

Problems with SSMs

• Non-linearity

• Non-Gaussianity

• Multi-modality

Page 25:

Switching SSMs

[Figure: DBN for a switching SSM: a discrete switch chain Zt, a continuous state chain Xt, and observations Yt, for t = 1, 2, 3.]

P(Xt | Xt−1, Zt = j) = N(Xt; Aj Xt−1, Qj)
P(Yt | Xt, Zt = j) = N(Yt; Bj Xt, Rj)
P(Zt = j | Zt−1 = i) = M(i, j)

• Useful for modelling multiple (linear) regimes/modes, fault diagnosis, data association ambiguity, etc.
• Unfortunately the number of modes in the posterior grows exponentially, i.e., exact inference takes O(K^t) time.

Page 26:

Kinds of inference for DBNs

[Figure: timeline diagram of the main kinds of inference:]
• Filtering: P(X(t) | y(1:t)).
• Prediction: P(X(t+H) | y(1:t)), for a horizon H.
• Fixed-lag smoothing: P(X(t−L) | y(1:t)), for a lag L.
• Fixed-interval smoothing (offline): P(X(t) | y(1:T)).

Page 27:

Complexity of inference in factorial HMMs

[Figure: the factorial HMM DBN from page 13.]

• Xt(1), . . . , Xt(D) become correlated due to “explaining away”.
• Hence the belief state P(Xt|y1:t) has size O(K^D).

Page 28:

Complexity of inference in coupled HMMs

• Even with local connectivity, everything becomes correlated due to shared common influences in the past. c.f., MRF.

Page 29:

Approximate filtering

• Many possible representations for belief state, αt ≜ P(Xt|y1:t):

• Discrete distribution (histogram)

• Gaussian

• Mixture of Gaussians

• Set of samples (particles)

Page 30:

Belief state = discrete distribution

• Discrete distribution is non-parametric (flexible), but intractable.

• Only consider k most probable values — Beam search.

• Approximate joint as product of factors (ADF/BK approximation):

  αt ≈ α̃t = Π_{i=1}^{C} P(Xt(i) | y1:t)

Page 31:

Assumed Density Filtering (ADF)

[Figure: ADF alternates exact update (U) steps, which produce α̂t, α̂t+1, with projection (P) steps that map them back to approximations α̃t−1, α̃t, α̃t+1 in the assumed family.]

Page 32:

Belief state = Gaussian distribution

• Kalman filter — exact for SSM.

• Extended Kalman filter — linearize dynamics.

• Unscented Kalman filter — pipe mean ± sigma points through nonlinearity, and fit Gaussian.
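A sketch of the unscented transform that the UKF is built on: deterministic sigma points are propagated through the nonlinearity f and a Gaussian is refit by weighted moments. The scaling constants below are common defaults, not values from the talk.

```python
import numpy as np

def unscented_transform(mu, Sigma, f, alpha=1e-3, beta=2.0, kappa=0.0):
    """Propagate N(mu, Sigma) through a nonlinearity f and refit a Gaussian."""
    n = len(mu)
    lam = alpha ** 2 * (n + kappa) - n
    # 2n+1 sigma points: the mean, plus symmetric offsets along the
    # columns of the Cholesky factor of (n + lam) * Sigma.
    L = np.linalg.cholesky((n + lam) * Sigma)
    pts = np.vstack([mu[None, :], mu + L.T, mu - L.T])
    wm = np.full(2 * n + 1, 0.5 / (n + lam))   # weights for the mean
    wc = wm.copy()                             # weights for the covariance
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1.0 - alpha ** 2 + beta)
    Y = np.array([f(p) for p in pts])          # transformed sigma points
    mu_y = wm @ Y
    diff = Y - mu_y
    Sigma_y = (wc[:, None] * diff).T @ diff
    return mu_y, Sigma_y
```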

Page 33:

Unscented transform

[Figure: comparison of three ways to propagate a Gaussian through a nonlinearity: actual (sampling), linearized (EKF), and the unscented transform (UT). The panels show the sigma points, the transformed sigma points, the true mean and covariance, the weighted-sample mean and covariance, and the UT mean and covariance.]

Page 34:

Belief state = mixture of Gaussians

• Hard in general.

• For switching SSMs, can apply ADF: collapse a mixture of K Gaussians to the best single Gaussian by moment matching (GPB/IMM algorithm).

[Figure: the GPB2 scheme for K = 2: each previous component b_{t−1}^i is passed through each mode-conditioned Kalman filter F_j, giving K^2 components b_t^{i,j}, which are then merged (M) back into K components b_t^j.]
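The merge (M) boxes in the figure are moment matching: a mixture of Gaussians is replaced by the single Gaussian with the same overall mean and covariance. A minimal sketch (function and argument names are illustrative):

```python
import numpy as np

def collapse(weights, means, covs):
    """Moment-match a mixture sum_k w_k N(m_k, P_k) to a single Gaussian."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mu = sum(wk * mk for wk, mk in zip(w, means))
    # Law of total variance: within-component spread plus between-component spread.
    Sigma = sum(wk * (Pk + np.outer(mk - mu, mk - mu))
                for wk, mk, Pk in zip(w, means, covs))
    return mu, Sigma
```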

Page 35:

Belief state = set of samples

Particle filtering, sequential Monte Carlo, condensation, SISR, survival of the fittest, etc.

[Figure: one cycle of the particle filter: unweighted samples approximating P(x(t−1) | y(1:t−1)) are pushed through the proposal P(x(t) | x(t−1)) to give an unweighted prediction, weighted by P(y(t) | x(t)) to give a weighted posterior, and resampled to give unweighted samples approximating P(x(t) | y(1:t)).]
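One cycle of the figure, written as a bootstrap particle filter step. Here trans_sample and obs_lik stand in for P(x(t) | x(t−1)) and P(y(t) | x(t)); they, and the function itself, are illustrative rather than code from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def pf_step(particles, y, trans_sample, obs_lik):
    """One bootstrap-filter cycle: propose, weight, resample.

    particles:    unweighted samples approximating P(x(t-1) | y(1:t-1)).
    trans_sample: function x -> sample from P(x(t) | x(t-1) = x)  (proposal).
    obs_lik:      function (x, y) -> P(y(t) = y | x(t) = x)       (weighting).
    """
    proposed = [trans_sample(x) for x in particles]         # unweighted prediction
    weights = np.array([obs_lik(x, y) for x in proposed])   # weighted posterior
    weights = weights / weights.sum()
    idx = rng.choice(len(proposed), size=len(proposed), p=weights)
    return [proposed[i] for i in idx]                       # unweighted posterior
```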

Page 36:

Rao-Blackwellised particle filtering (RBPF)

• Particle filtering in high dimensional spaces has high variance.

• Suppose we partition Xt = (Ut, Vt).
• If V1:t can be integrated out analytically, conditional on U1:t and Y1:t, we only need to sample U1:t.
• Integrating out V1:t reduces the size of the state space, and provably reduces the number of particles needed to achieve a given variance.

Page 37:

RBPF for switching SSMs

[Figure: the switching SSM DBN from page 25.]

• Given Z1:t, we can use a Kalman filter to compute P (Xt|y1:t, z1:t).

• Each particle represents (w, z1:t, E[Xt|y1:t, z1:t],Var[Xt|y1:t, z1:t]).

• c.f., stochastic bank of Kalman filters.
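The per-particle representation in the second bullet maps directly onto a small record; a sketch with illustrative field names:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RBParticle:
    """One Rao-Blackwellised particle for the switching SSM."""
    w: float            # importance weight
    z: list             # sampled switch trajectory z_1:t
    mu: np.ndarray      # E[X_t | y_1:t, z_1:t]   (Kalman filter mean)
    Sigma: np.ndarray   # Var[X_t | y_1:t, z_1:t] (Kalman filter covariance)
```

Each step would sample z_t, run a Kalman filter update with the mode-specific (A_z, B_z, Q_z, R_z), and reweight by the predictive likelihood of y_t.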

Page 38:

Approximate smoothing (offline)

• Two-filter smoothing

• Loopy belief propagation

• Variational methods

• Gibbs sampling

• Can combine exact and approximate methods

• Used as a subroutine for learning

Page 39:

Learning (frequentist)

• Parameter learning:

  θ̂MAP = argmax_θ log P(θ | D, M) = argmax_θ [log P(D | θ, M) + log P(θ | M)]

  where log P(D | θ, M) = Σ_d log P(X_d | θ, M).

• Structure learning:

  M̂MAP = argmax_M log P(M | D) = argmax_M [log P(D | M) + log P(M)]

  where log P(D | M) = log ∫ P(D | θ, M) P(θ | M) dθ.

Page 40:

Parameter learning: full observability

• If every node is observed in every case, the likelihood decomposes into a sum of terms, one per node:

  log P(D | θ, M) = Σ_d log P(X_d | θ, M)
                  = Σ_d log Π_i P(X_{d,i} | π_{d,i}, θ_i, M)
                  = Σ_i Σ_d log P(X_{d,i} | π_{d,i}, θ_i, M)

  where π_{d,i} are the values of the parents of node i in case d, and θ_i are the parameters associated with CPD i.
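For tabular CPDs, each of these per-node terms is maximized by normalized counts. A sketch under the assumption that each case is a dict of observed node values and that parents/card describe the (hypothetical) model structure; none of these names come from the talk.

```python
import numpy as np
from collections import defaultdict
from functools import partial

def mle_cpts(cases, parents, card):
    """ML estimates of tabular CPDs from fully observed cases.

    cases:   list of dicts {node: value}, one per case d.
    parents: dict node -> list of that node's parents.
    card:    dict node -> number of values the node can take.
    """
    counts = {i: defaultdict(partial(np.zeros, card[i])) for i in card}
    for case in cases:
        for i in card:
            pa = tuple(case[p] for p in parents[i])   # parent configuration pi_{d,i}
            counts[i][pa][case[i]] += 1.0
    # Normalise each row to get P(X_i = k | parents = pa).
    return {i: {pa: c / c.sum() for pa, c in counts[i].items()}
            for i in counts}
```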

Page 41:

Parameter learning: partial observability

• If some nodes are sometimes hidden, the likelihood does not decompose:

  log P(D | θ, M) = Σ_d log Σ_h P(H = h, V = v_d | θ, M)

• In this case, can use gradient descent or EM to find a local maximum.
• EM iteratively maximizes the expected complete-data log-likelihood, which does decompose into a sum of local terms.
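The resulting procedure is just a loop around inference; a generic skeleton, where e_step and m_step are placeholders for model-specific routines (e.g., forwards-backwards plus count re-estimation for an HMM):

```python
def em(data, theta, e_step, m_step, n_iters=50):
    """Generic EM skeleton for partially observed data.

    e_step(data, theta) runs inference and returns the expected sufficient
    statistics plus the current log-likelihood; m_step(ess) re-estimates
    the parameters from those expected statistics.
    """
    for _ in range(n_iters):
        ess, loglik = e_step(data, theta)   # inference is the subroutine here
        theta = m_step(ess)
    return theta
```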

Page 42:

Structure learning (model selection)

• How many nodes?

• Which arcs?

• How many values (states) per node?

• How many levels in the hierarchical HMM?

• Which parameter tying pattern?

• Structural zeros:

– In a (generalized) linear model, zeros correspond to absent directed arcs (feature selection).

– In an HMM, zeros correspond to impossible transitions.

Page 43:

Structure learning (model selection)

• Basic approach: search and score.

• Scoring function is the marginal likelihood, or an approximation such as penalized likelihood or cross-validated likelihood:

  log P(D | M) = log ∫ P(D | θ, M) P(θ | M) dθ ≈ log P(D | θ̂ML, M) − (dim(M)/2) log |D|   (BIC)

• Search algorithms: bottom up, top down, middle out.

• Initialization very important.

• Avoiding local minima very important.
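The BIC score above is cheap to evaluate once the ML parameters have been fit; a one-line sketch (argument names are illustrative):

```python
import numpy as np

def bic_score(loglik_ml, num_free_params, num_cases):
    """BIC approximation to log P(D | M): ML fit minus a complexity penalty."""
    return loglik_ml - 0.5 * num_free_params * np.log(num_cases)
```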

Page 44:

Summary

• Representation

– What are DBNs, and what can we use them for?

• Inference

– How do we compute P (Xt|y1:t) and related quantities?

• Learning

– How do we estimate parameters and model structure?

Page 45:

Open problems

• Representing richer models, e.g., relational models, SCFGs.

• Efficient inference in large discrete models.

• Inference in models with non-linear, non-Gaussian CPDs.

• Online inference in models with variable-sized state-spaces, e.g., tracking objects and their relations.
• Parameter learning for undirected and chain graph models.
• Structure learning. Discriminative learning. Bayesian learning. Online learning. Active learning. etc.

Page 46:

The end
