A Tutorial on Dynamic Bayesian Networks

46
A Tutorial on Dynamic Bayesian Networks Kevin P. Murphy MIT AI lab 12 November 2002

Transcript of A Tutorial on Dynamic Bayesian Networks

Page 1: A Tutorial on Dynamic Bayesian Networks

A Tutorial on Dynamic Bayesian Networks

Kevin P. Murphy

MIT AI lab

12 November 2002

Page 2: A Tutorial on Dynamic Bayesian Networks

Modelling sequential data

• Sequential data is everywhere, e.g.,

– Sequence data (offline): Biosequence analysis,

text processing, ...

– Temporal data (online): Speech recognition, visual

tracking, financial forecasting, ...

• Problems: classification, segmentation, state estimation,

fault diagnosis, prediction, ...

• Solution: build/learn generative models, then compute

P (quantity of interest|evidence) using Bayes rule.

1

Page 3: A Tutorial on Dynamic Bayesian Networks

Outline of talk

• Representation

– What are DBNs, and what can we use them for?

• Inference

– How do we compute P (Xt|y1:t) and related quantities?

• Learning

– How do we estimate parameters and model structure?

2

Page 4: A Tutorial on Dynamic Bayesian Networks

Representation

• Hidden Markov Models (HMMs).

• Dynamic Bayesian Networks (DBNs).

• Modelling HMM variants as DBNs.

• State space models (SSMs).

• Modelling SSMs and variants as DBNs.

3

Page 5: A Tutorial on Dynamic Bayesian Networks

Hidden Markov Models (HMMs)

• An HMM is a stochastic finite automaton, where each state

generates (emits) an observation.

• Let Xt ∈ {1, . . . , K} represent the hidden state at time t, and

Yt represent the observation.

• e.g., X = phones, Y = acoustic feature vector.

• Transition model: A(i, j)4= P (Xt = j|Xt−1 = i).

• Observation model: B(i, k)4= P (Yt = k|Xt = i).

• Initial state distribution: π(i)4= P (X0 = i).

4

Page 6: A Tutorial on Dynamic Bayesian Networks

HMM state transition diagram

• Nodes represent states.

• There is an arrow from i to j iff A(i, j) > 0.

1 2

3

0.91.0

0.80.2

0.1

5

Page 7: A Tutorial on Dynamic Bayesian Networks

The 3 main tasks for HMMs

• Computing likelihood: P (y1:t) =∑

i P (Xt = i, y1:t)

• Viterbi decoding (most likely explanation): argmaxx1:t P (x1:t|y1:t)

• Learning: θ̂ML = argmaxθ P (y1:T |θ), where θ = (A, B, π).

– Learning can be done with Baum-Welch (EM).

– Learning uses inference as a subroutine.

– Inference (forwards-backwards) takes O(TK2) time, where

K is the number of states and T is sequence length.

6

Page 8: A Tutorial on Dynamic Bayesian Networks

The problem with HMMs

• Suppose we want to track the state (e.g., the position) of D

objects in an image sequence.

• Let each object be in K possible states.

• Then Xt = (X(1)t , . . . , X

(D)t ) can have KD possible values.

⇒ Inference takes O(T (KD)2

)time and O

(TKD

)space.

⇒ P (Xt|Xt−1) needs O(K2D) parameters to specify.

7

Page 9: A Tutorial on Dynamic Bayesian Networks

DBNs vs HMMs

• An HMM represents the state of the world using a single

discrete random variable, Xt ∈ {1, . . . , K}.

• A DBN represents the state of the world using a set of ran-

dom variables, X(1)t , . . . , X

(D)t (factored/ distributed

representation).

• A DBN represents P (Xt|Xt−1) in a compact way using a

parameterized graph.

⇒ A DBN may have exponentially fewer parameters than its

corresponding HMM.

⇒ Inference in a DBN may be exponentially faster than in the

corresponding HMM.

8

Page 10: A Tutorial on Dynamic Bayesian Networks

DBNs are a kind of graphical model

• In a graphical model, nodes represent random variables, and

(lack of) arcs represents conditional independencies.

• Directed graphical models = Bayes nets = belief nets.

• DBNs are Bayes nets for dynamic processes.

• Informally, an arc from Xi to Xj means Xi “causes” Xj.

(Graph must be acyclic!)

9

Page 11: A Tutorial on Dynamic Bayesian Networks

HMM represented as a DBN

. . .X X X

Y Y Y

1 2 3 4

1 3 4

X

2Y

• This graph encodes the assumptions

Yt ⊥ Yt′|Xt and Xt+1 ⊥ Xt−1|Xt (Markov)

• Shaded nodes are observed, unshaded are hidden.

• Structure and parameters repeat over time.

10

Page 12: A Tutorial on Dynamic Bayesian Networks

HMM variants represented as DBNs

HMM MixGauss HMMIO−HMM

AR−HMM

⇒ The same code can do inference and learning in all of these

models.

11

Page 13: A Tutorial on Dynamic Bayesian Networks

Factorial HMMs

X11

X2

X3

X12

2

21

1X2

X3

Y1

Y2

12

Page 14: A Tutorial on Dynamic Bayesian Networks

Factorial HMMs vs HMMs

• Let us compare a factorial HMM with D chains, each with

K values, to its equivalent HMM.

• Num. parameters to specify P (Xt|Xt−1):

– HMM: O(K2D).

– DBN: O(DK2).

• Computational complexity of exact inference:

– HMM: O(TK2D).

– DBN: O(TDKD+1).

13

Page 15: A Tutorial on Dynamic Bayesian Networks

Coupled HMMs

14

Page 16: A Tutorial on Dynamic Bayesian Networks

Semi-Markov HMMs

q1

y1 y2 y3 y4 y5 y6 y7 y8

q2 q3

• Each state emits a sequence.

• Explicit-duration HMM:

P (Yt−l+1:l|Qt, Lt = l) =∏l

i=1 P (Yi|Qt)

• Segment HMM: P (Yt−l+1:l|Qt, Lt = l) modelled by an HMM

or SSM.

• Multigram: P (Yt−l+1:l|Qt, Lt = l) is deterministic string, and

segments are independent.

15

Page 17: A Tutorial on Dynamic Bayesian Networks

Explicit duration HMMs

Q1 Q2 Q3

F1 F2 F3

L1 L2 L3

Y1 Y2 Y3

16

Page 18: A Tutorial on Dynamic Bayesian Networks

Segment HMMs

Q1 Q2 Q3

F1 F2 F3

L1 L2 L3

Q21 Q2

2 Q23

Y1 Y2 Y3

17

Page 19: A Tutorial on Dynamic Bayesian Networks

Hierarchical HMMs

• Each state can emit an HMM, which can generate sequences.

• Duration of segments implicitly defined by when sub-HMM

enters finish state.

Y Y6 7

q11

q21

q1

Y1

q2

2

q22

q1

Y3

q32

Y4

q33

Y5

q12

33 3

Y

q2 q243

18

Page 20: A Tutorial on Dynamic Bayesian Networks

Hierarchical HMMs

Q11 Q1

2 Q13

F21 F2

2 F23

Q21 Q2

2 Q23

F31 F3

2 F33

Q31 Q3

2 Q33

Y1 Y2 Y3

19

Page 21: A Tutorial on Dynamic Bayesian Networks

State Space Models (SSMs)

• Also known as linear dynamical system, dynamic linear model,

Kalman filter model, etc.

• Xt ∈ RD, Yt ∈ RM and

P (Xt|Xt−1) = N (Xt;AXt−1, Q)

P (Yt|Xt) = N (Yt;BXt, R)

• The Kalman filter can compute P (Xt|y1:t) in O(min{M3, D2})

operations per time step.

20

Page 22: A Tutorial on Dynamic Bayesian Networks

Factored linear-Gaussian models produce sparse matrices

• Directed arc from Xt−1(i) to Xt(j) iff A(i, j) > 0.

• Undirected between Xt(i) and Xt(j) iff Σ−1(i, j) > 0.

• e.g., consider a 2-chain factorial SSM

with P (Xit |X

it−1) = N (Xi

t;AiXt−1, Qi)

P (X1t , X1

t |X1t−1, X1

t−1) = N

((X1

tX2

t

);

(A1 0

0 A2

)(X1

t−1X2

t−1

),

(Q−1

1 0

0 Q−12

))

21

Page 23: A Tutorial on Dynamic Bayesian Networks

Factored discrete-state models do NOT produce sparse

transition matrices

e.g., consider a 2-chain factorial HMM

P (X1t , X1

t |X1t−1, X1

t−1) = P (X1t |X

1t−1)P (X2

t |X2t−1)

01

0 1

01

0 1

00011011

00 01 10 11

22

Page 24: A Tutorial on Dynamic Bayesian Networks

Problems with SSMs

• Non-linearity

• Non-Gaussianity

• Multi-modality

23

Page 25: A Tutorial on Dynamic Bayesian Networks

Switching SSMs

Y Y Y1 2 3

X X X

Z Z Z

1 2 3

1 2 3

3

P (Xt|Xt−1, Zt = j) = N (Xt;AjXt−1, Qj)

P (Yt|Xt, Zt = j) = N (Yt;BjXt, Rj)

P (Zt = j|Zt−1 = i) = M(i, j)

• Useful for modelling multiple (linear) regimes/modes, fault

diagnosis, data association ambiguity, etc.

• Unfortunately number of modes in posterior grows exponen-

tially, i.e., exact inference takes O(Kt) time.

24

Page 26: A Tutorial on Dynamic Bayesian Networks

Kinds of inference for DBNs

t

t

t

Tt

P(X(t) | y(1:t))

P(X(t) | y(1:T))

P(X(t-L) | y(1:t))

L

H

P(X(t+H) | y(1:t))

filtering

prediction

fixed-lag

smoothing(offline)

smoothing

fixed interval

25

Page 27: A Tutorial on Dynamic Bayesian Networks

Complexity of inference in factorial HMMs

X11

X2

X3

X12

2

21

1X2

X3

Y1

Y2

• X(1)t , . . . , X

(D)t become corrrelated due to “explaining away”.

• Hence belief state P (Xt|y1:t) has size O(KD).

26

Page 28: A Tutorial on Dynamic Bayesian Networks

Complexity of inference in coupled HMMs

• Even with local connectivity, everything becomes correlated

due to shared common influences in the past. c.f., MRF.

27

Page 29: A Tutorial on Dynamic Bayesian Networks

Approximate filtering

• Many possible representations for belief state, αt4= P (Xt|y1:t):

• Discrete distribution (histogram)

• Gaussian

• Mixture of Gaussians

• Set of samples (particles)

28

Page 30: A Tutorial on Dynamic Bayesian Networks

Belief state = discrete distribution

• Discrete distribution is non-parametric (flexible), but intractable.

• Only consider k most probable values — Beam search.

• Approximate joint as product of factors

(ADF/BK approximation)

αt ≈ α̃t =C∏

i=1

P (Xit |y1:t)

29

Page 31: A Tutorial on Dynamic Bayesian Networks

Assumed Density Filtering (ADF)

α̂t α̂t+1 exact

α̃t−1 α̃t α̃t+1 approx

U P U P

30

Page 32: A Tutorial on Dynamic Bayesian Networks

Belief state = Gaussian distribution

• Kalman filter — exact for SSM.

• Extended Kalman filter — linearize dynamics.

• Unscented Kalman filter — pipe mean ± sigma points through

nonlinearity, and fit Gaussian.

31

Page 33: A Tutorial on Dynamic Bayesian Networks

Unscented transformActual (sampling) Linearized (EKF) UT

� � � ��� � � � � � � � �

���� � � �� �� ��� � �

� � � �! "

#�$ % & #

sigma points

true mean

UT mean

and covarianceweighted sample mean

mean

UT covariance

covariance

true covariance

transformedsigma points

32

Page 34: A Tutorial on Dynamic Bayesian Networks

Belief state = mixture of Gaussians

• Hard in general.

• For switching SSMs, can apply ADF: collapse mixture of

K Gaussians to best single Gaussian by moment matching

(GPB/IMM algorithm).

b1t−1

b2t−1

����

@@@R

����

@@@R

F1

F2

F1

F2

-

-

-

-

b1,1t

b1,2t

b2,1t

b2,2t

@@@R

BBBBBBBN

��������

����

M

M

-

-

b1t

b2t

33

Page 35: A Tutorial on Dynamic Bayesian Networks

Belief state = set of samples

Particle filtering, sequential Monte Carlo, condensation,

SISR, survival of the fittest, etc.P(x(t-1) | y(1:t-1))

P(x(t) | y(1:t))

P(x(t) | y(1:t))

P(x(t) | y(1:t-1))

unweightedposterior

unweightedprediction

weighted prior

weightedposterior

P(x(t)|x(t-1))

P(y(t) | x(t))

resample

weighting

proposal

34

Page 36: A Tutorial on Dynamic Bayesian Networks

Rao-Blackwellised particle filtering (RBPF)

• Particle filtering in high dimensional spaces has high variance.

• Suppose we partition Xt = (Ut, Vt).

• If V1:t can be integrated out analytically, conditional on U1:t

and Y1:t, we only need to sample U1:t.

• Integrating out V1:t reduces the size of the state space, and

provably reduces the number of particles needed to achieve

a given variance.

35

Page 37: A Tutorial on Dynamic Bayesian Networks

RBPF for switching SSMs

Y Y Y1 2 3

X X X

Z Z Z

1 2 3

1 2 3

3

• Given Z1:t, we can use a Kalman filter to compute P (Xt|y1:t, z1:t).

• Each particle represents (w, z1:t, E[Xt|y1:t, z1:t],Var[Xt|y1:t, z1:t]).

• c.f., stochastic bank of Kalman filters.

36

Page 38: A Tutorial on Dynamic Bayesian Networks

Approximate smoothing (offline)

• Two-filter smoothing

• Loopy belief propagation

• Variational methods

• Gibbs sampling

• Can combine exact and approximate methods

• Used as a subroutine for learning

37

Page 39: A Tutorial on Dynamic Bayesian Networks

Learning (frequentist)

• Parameter learning

θ̂MAP = argmaxθ

logP (θ|D, M) = argmaxθ

log(D|θ, M)+logP (θ|M)

where

logP (D|θ, M) =∑

d

logP (Xd|θ, M)

• Structure learning

M̂MAP = argmaxM

logP (M |D) = argmaxM

logP (D|M)+logP (M)

where

logP (D|M) = log

∫P (D|θ, M)P (θ|M)P (M)dθ

38

Page 40: A Tutorial on Dynamic Bayesian Networks

Parameter learning: full observability

• If every node is observed in every case, the likelihood decom-

poses into a sum of terms, one per node:

logP (D|θ, M) =∑

d

logP (Xd|θ, M)

=∑

d

log∏

i

P (Xd,i|πd,i, θi, M)

=∑

i

d

logP (Xd,i|πd,i, θi, M)

where πd,i are the values of the parents of node i in case d,

and θi are the parameters associated with CPD i.

39

Page 41: A Tutorial on Dynamic Bayesian Networks

Parameter learning: partial observability

• If some nodes are sometimes hidden, the likelihood does not

decompose.

logP (D|θ, M) =∑

d

log∑

h

P (H = h, V = vd|θ, M)

• In this case, can use gradient descent or EM to find local

maximum.

• EM iteratively maximizes the expected complete-data log-

likelihood, which does decompose into a sum of local terms.

40

Page 42: A Tutorial on Dynamic Bayesian Networks

Structure learning (model selection)

• How many nodes?

• Which arcs?

• How many values (states) per node?

• How many levels in the hierarchical HMM?

• Which parameter tying pattern?

• Structural zeros:

– In a (generalized) linear model, zeros correspond to absent

directed arcs (feature selection).

– In an HMM, zeros correspond to impossible transitions.

41

Page 43: A Tutorial on Dynamic Bayesian Networks

Structure learning (model selection)

• Basic approach: search and score.

• Scoring function is marginal likelihood, or an approximation

such as penalized likelihood or cross-validated likelihood

logP (D|M) = log

∫P (D|θ, M)P (θ|M)P (M)dθ

BIC≈ logP (D|θ̂ML, M) −

dim(M)

2log |D|

• Search algorithms: bottom up, top down, middle out.

• Initialization very important.

• Avoiding local minima very important.

42

Page 44: A Tutorial on Dynamic Bayesian Networks

Summary

• Representation

– What are DBNs, and what can we use them for?

• Inference

– How do we compute P (Xt|y1:t) and related quantities?

• Learning

– How do we estimate parameters and model structure?

43

Page 45: A Tutorial on Dynamic Bayesian Networks

Open problems

• Representing richer models, e.g., relational models, SCFGs.

• Efficient inference in large discrete models.

• Inference in models with non-linear, non-Gaussian CPDs.

• Online inference in models with variable-sized state-spaces,

e.g., tracking objects and their relations.

• Parameter learning for undirected and chain graph models.

• Structure learning. Discriminative learning.

Bayesian learning. Online learning. Active learning. etc.

44

Page 46: A Tutorial on Dynamic Bayesian Networks

The end

45