
Transcript of EM Algorithm

Page 1: EM Algorithm

EM Algorithm

Jur van den Berg

Page 2: EM Algorithm

Kalman Filtering vs. Smoothing

• Dynamics and observation model:

$$x_{t+1} = A x_t + w_t, \qquad w_t \sim N(0, Q)$$
$$y_t = C x_t + v_t, \qquad v_t \sim N(0, R)$$

• Kalman filter: compute $X_t \mid Y_0 = y_0, \ldots, Y_t = y_t$; real-time, given the data so far

• Kalman smoother: compute $X_t \mid Y_0 = y_0, \ldots, Y_T = y_T$ for $0 \le t \le T$; post-processing, given all the data
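As a concrete reference for the rest of the deck, here is a minimal numpy sketch that simulates this model; the dimensions and parameter values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D state / 1-D observation model (values are assumptions).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])      # dynamics matrix
C = np.array([[1.0, 0.0]])      # observation matrix
Q = 0.01 * np.eye(2)            # process noise covariance
R = 0.10 * np.eye(1)            # observation noise covariance

T = 100
x = np.zeros((T + 1, 2))        # states x_0 ... x_T
y = np.zeros((T + 1, 1))        # observations y_0 ... y_T
for t in range(T + 1):
    if t > 0:
        # x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)
        x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(2), Q)
    # y_t = C x_t + v_t,  v_t ~ N(0, R)
    y[t] = C @ x[t] + rng.multivariate_normal(np.zeros(1), R)
```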

Page 3: EM Algorithm

EM Algorithm

• Kalman smoother: compute the distributions $X_0, \ldots, X_T$ given parameters $A, C, Q, R$ and data $y_0, \ldots, y_T$

• EM algorithm: simultaneously optimize $X_0, \ldots, X_T$ and $A, C, Q, R$ given data $y_0, \ldots, y_T$


Page 4: EM Algorithm

Probability vs. Likelihood

• Probability: predict unknown outcomes based on known parameters: $p(x \mid \theta)$

• Likelihood: estimate unknown parameters based on known outcomes: $L(\theta \mid x) = p(x \mid \theta)$

• Coin-flip example: $\theta$ is the probability of "heads" (the parameter); $x = \text{HHHTTH}$ is the outcome

Page 5: EM Algorithm

Likelihood for Coin-flip Example

• Probability of outcome given parameter: $p(x = \text{HHHTTH} \mid \theta = 0.5) = 0.5^6 \approx 0.016$

• Likelihood of parameter given outcome: $L(\theta = 0.5 \mid x = \text{HHHTTH}) = p(x \mid \theta) \approx 0.016$

• The likelihood is maximal when $\theta = 4/6 = 0.6666\ldots$, the fraction of heads in the outcome

• The likelihood function is not a probability density
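A quick numpy check of both claims; the grid resolution is an arbitrary choice.

```python
import numpy as np

heads, tails = 4, 2                       # outcome x = HHHTTH
theta = np.linspace(0.001, 0.999, 999)
L = theta**heads * (1 - theta)**tails     # L(theta | x) = p(x | theta)

print(0.5**6)                  # 0.015625, i.e. p(x | theta = 0.5) ~ 0.016
print(theta[np.argmax(L)])     # ~ 0.667 = 4/6, the maximum-likelihood estimate
```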

Page 6: EM Algorithm

Likelihood for Continuous Distributions

• Six samples $\{-3, -2, -1, 1, 2, 3\}$ believed to be drawn from some Gaussian $N(0, \sigma^2)$

• Likelihood of $\sigma$:
$$L(\sigma \mid x = \{-3, -2, -1, 1, 2, 3\}) = p(x_1 = -3 \mid \sigma) \cdot p(x_2 = -2 \mid \sigma) \cdots p(x_6 = 3 \mid \sigma)$$

• Maximum likelihood:
$$\sigma = \sqrt{\tfrac{1}{6}\left((-3)^2 + (-2)^2 + (-1)^2 + 1^2 + 2^2 + 3^2\right)} = \sqrt{28/6} \approx 2.16$$
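A short sketch confirming the closed-form answer numerically; the grid bounds are arbitrary.

```python
import numpy as np

samples = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])

def log_likelihood(sigma):
    # log L(sigma | samples) = sum_i log N(x_i; 0, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - samples**2 / (2 * sigma**2))

sigmas = np.linspace(0.5, 5.0, 4501)
print(sigmas[np.argmax([log_likelihood(s) for s in sigmas])])  # ~ 2.16
print(np.sqrt(np.mean(samples**2)))    # closed form sqrt(28/6) ~ 2.16
```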

Page 7: EM Algorithm

Likelihood for Stochastic Model

• Dynamics and observation model:
$$x_{t+1} = A x_t + w_t, \quad w_t \sim N(0, Q), \qquad y_t = C x_t + v_t, \quad v_t \sim N(0, R)$$

• Suppose $x_t$ and $y_t$ are given for $0 \le t \le T$; what is the likelihood of $A$, $C$, $Q$ and $R$?

• Likelihood:
$$L(A, C, Q, R \mid x, y) = p(x, y \mid A, C, Q, R) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}) \cdot \prod_{t=0}^{T} p(y_t \mid x_t)$$

• Compute the log-likelihood: $\log p(x, y \mid A, C, Q, R)$
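A direct transcription of this factorization as a numpy sketch; it assumes trajectory arrays x (shape $(T+1) \times n$) and y (shape $(T+1) \times m$), such as those produced by the simulation sketch on page 2.

```python
import numpy as np

def gauss_logpdf(z, mean, cov):
    # log N(z; mean, cov) for a single vector z
    k = z.shape[0]
    d = z - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def log_likelihood(A, C, Q, R, x, y):
    # log p(x, y | A,C,Q,R) = sum_{t=1}^T log p(x_t | x_{t-1})
    #                       + sum_{t=0}^T log p(y_t | x_t)
    ll = sum(gauss_logpdf(x[t], A @ x[t - 1], Q) for t in range(1, len(x)))
    ll += sum(gauss_logpdf(y[t], C @ x[t], R) for t in range(len(y)))
    return ll
```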

Page 8: EM Algorithm

Log-likelihood

• The multivariate normal distribution $N(\mu, \Sigma)$ has pdf:
$$p(x) = (2\pi)^{-k/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

• From the model: $x_{t+1} \sim N(A x_t, Q)$ and $y_t \sim N(C x_t, R)$, so

$$\log p(x, y \mid A, C, Q, R) = \sum_{t=1}^{T} \log p(x_t \mid x_{t-1}) + \sum_{t=0}^{T} \log p(y_t \mid x_t)$$
$$= -\frac{T}{2}\log|Q| - \frac{1}{2}\sum_{t=1}^{T}(x_t - A x_{t-1})^T Q^{-1} (x_t - A x_{t-1}) - \frac{T+1}{2}\log|R| - \frac{1}{2}\sum_{t=0}^{T}(y_t - C x_t)^T R^{-1} (y_t - C x_t) + \text{const}$$

Page 9: EM Algorithm

Log-likelihood #2

• A scalar equals its trace: $a = \mathrm{Tr}(a)$ if $a$ is scalar
• Bring the summation inward:

$$\log p(x, y \mid A, C, Q, R) = -\frac{T}{2}\log|Q| - \frac{1}{2}\sum_{t=1}^{T}\mathrm{Tr}\left(Q^{-1}(x_t - A x_{t-1})(x_t - A x_{t-1})^T\right)$$
$$- \frac{T+1}{2}\log|R| - \frac{1}{2}\sum_{t=0}^{T}\mathrm{Tr}\left(R^{-1}(y_t - C x_t)(y_t - C x_t)^T\right) + \text{const}$$

Page 10: EM Algorithm

Log-likelihood #3

• $\mathrm{Tr}(AB) = \mathrm{Tr}(BA)$
• $\mathrm{Tr}(A) + \mathrm{Tr}(B) = \mathrm{Tr}(A + B)$

$$= -\frac{T}{2}\log|Q| - \frac{1}{2}\mathrm{Tr}\left(Q^{-1}\sum_{t=1}^{T}(x_t - A x_{t-1})(x_t - A x_{t-1})^T\right)$$
$$- \frac{T+1}{2}\log|R| - \frac{1}{2}\mathrm{Tr}\left(R^{-1}\sum_{t=0}^{T}(y_t - C x_t)(y_t - C x_t)^T\right) + \text{const}$$

Page 11: EM Algorithm

Log-likelihood #4

• Expand the products:

$$l(A, C, Q, R \mid x, y) = -\frac{T}{2}\log|Q| - \frac{1}{2}\mathrm{Tr}\left(Q^{-1}\sum_{t=1}^{T}\left(x_t x_t^T - A x_{t-1} x_t^T - x_t x_{t-1}^T A^T + A x_{t-1} x_{t-1}^T A^T\right)\right)$$
$$- \frac{T+1}{2}\log|R| - \frac{1}{2}\mathrm{Tr}\left(R^{-1}\sum_{t=0}^{T}\left(y_t y_t^T - C x_t y_t^T - y_t x_t^T C^T + C x_t x_t^T C^T\right)\right) + \text{const}$$

Page 12: EM Algorithm

Maximize likelihood

• log is a monotone function: $\max \log f(x) \Leftrightarrow \max f(x)$

• Maximize $l(A, C, Q, R \mid x, y)$ in turn for $A$, $C$, $Q$ and $R$:

$$\frac{\partial l(A, C, Q, R \mid x, y)}{\partial A} = 0 \quad \text{(solve for } A\text{)} \qquad \frac{\partial l(A, C, Q, R \mid x, y)}{\partial C} = 0 \quad \text{(solve for } C\text{)}$$
$$\frac{\partial l(A, C, Q, R \mid x, y)}{\partial Q} = 0 \quad \text{(solve for } Q\text{)} \qquad \frac{\partial l(A, C, Q, R \mid x, y)}{\partial R} = 0 \quad \text{(solve for } R\text{)}$$

Page 13: EM Algorithm

Matrix Derivatives

• Defined for scalar functions $f : \mathbb{R}^{n \times m} \to \mathbb{R}$; the derivative $\partial f / \partial A$ is the $n \times m$ matrix with entries $\partial f / \partial A_{ij}$

• Key identities:
$$\frac{\partial}{\partial A}\mathrm{Tr}(AB) = B^T \qquad \frac{\partial}{\partial A}\mathrm{Tr}(BA^T) = B \qquad \frac{\partial}{\partial A}\mathrm{Tr}(ABA^T) = AB^T + AB$$
$$\frac{\partial}{\partial A}\log|A| = A^{-T} \qquad \frac{\partial}{\partial A}\left(x^T A x\right) = x x^T$$
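These identities are easy to sanity-check numerically; here is a small finite-difference sketch (shapes and tolerances are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

def num_grad(f, A, eps=1e-6):
    # entrywise central-difference gradient of a scalar function of a matrix
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

# d Tr(AB)/dA = B^T
print(np.allclose(num_grad(lambda M: np.trace(M @ B), A), B.T, atol=1e-5))
# d Tr(A B A^T)/dA = A B^T + A B
print(np.allclose(num_grad(lambda M: np.trace(M @ B @ M.T), A),
                  A @ B.T + A @ B, atol=1e-5))
# d log|det A|/dA = A^{-T}
print(np.allclose(num_grad(lambda M: np.linalg.slogdet(M)[1], A),
                  np.linalg.inv(A).T, atol=1e-5))
```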

Page 14: EM Algorithm

Optimizing A

• Derivative:
$$\frac{\partial l(A, C, Q, R \mid x, y)}{\partial A} = -\frac{1}{2} Q^{-1} \sum_{t=1}^{T}\left(-2\, x_t x_{t-1}^T + 2\, A x_{t-1} x_{t-1}^T\right)$$

• Maximizer (setting the derivative to zero):
$$A = \left(\sum_{t=1}^{T} x_t x_{t-1}^T\right)\left(\sum_{t=1}^{T} x_{t-1} x_{t-1}^T\right)^{-1}$$
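A direct transcription of the maximizer, assuming x is the array of states $x_0, \ldots, x_T$ (in the full EM algorithm these outer products are replaced by the expectations of page 20).

```python
import numpy as np

def update_A(x):
    # A = (sum_{t=1}^T x_t x_{t-1}^T) (sum_{t=1}^T x_{t-1} x_{t-1}^T)^{-1}
    S1 = sum(np.outer(x[t], x[t - 1]) for t in range(1, len(x)))
    S2 = sum(np.outer(x[t - 1], x[t - 1]) for t in range(1, len(x)))
    return S1 @ np.linalg.inv(S2)
```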

Page 15: EM Algorithm

Optimizing C

• Derivative:
$$\frac{\partial l(A, C, Q, R \mid x, y)}{\partial C} = -\frac{1}{2} R^{-1} \sum_{t=0}^{T}\left(-2\, y_t x_t^T + 2\, C x_t x_t^T\right)$$

• Maximizer:
$$C = \left(\sum_{t=0}^{T} y_t x_t^T\right)\left(\sum_{t=0}^{T} x_t x_t^T\right)^{-1}$$
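The same transcription for $C$, under the same assumptions.

```python
import numpy as np

def update_C(x, y):
    # C = (sum_{t=0}^T y_t x_t^T) (sum_{t=0}^T x_t x_t^T)^{-1}
    S1 = sum(np.outer(y[t], x[t]) for t in range(len(y)))
    S2 = sum(np.outer(x[t], x[t]) for t in range(len(x)))
    return S1 @ np.linalg.inv(S2)
```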

Page 16: EM Algorithm

Optimizing Q

• Derivative with respect to the inverse:
$$\frac{\partial l(A, C, Q, R \mid x, y)}{\partial Q^{-1}} = \frac{T}{2} Q^T - \frac{1}{2}\sum_{t=1}^{T}\left(x_t x_t^T - A x_{t-1} x_t^T - x_t x_{t-1}^T A^T + A x_{t-1} x_{t-1}^T A^T\right)$$

• Maximizer:
$$Q = \frac{1}{T}\sum_{t=1}^{T}\left(x_t x_t^T - A x_{t-1} x_t^T - x_t x_{t-1}^T A^T + A x_{t-1} x_{t-1}^T A^T\right)$$

Page 17: EM Algorithm

Optimizing R

• Derivative with respect to the inverse:
$$\frac{\partial l(A, C, Q, R \mid x, y)}{\partial R^{-1}} = \frac{T+1}{2} R^T - \frac{1}{2}\sum_{t=0}^{T}\left(y_t y_t^T - C x_t y_t^T - y_t x_t^T C^T + C x_t x_t^T C^T\right)$$

• Maximizer:
$$R = \frac{1}{T+1}\sum_{t=0}^{T}\left(y_t y_t^T - C x_t y_t^T - y_t x_t^T C^T + C x_t x_t^T C^T\right)$$
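A sketch of both maximizers; each uses the residual form, which expands to exactly the four-term sums above.

```python
import numpy as np

def update_Q(x, A):
    # Q = (1/T) sum_{t=1}^T (x_t - A x_{t-1})(x_t - A x_{t-1})^T
    T = len(x) - 1
    S = sum(np.outer(x[t] - A @ x[t - 1], x[t] - A @ x[t - 1])
            for t in range(1, len(x)))
    return S / T

def update_R(x, y, C):
    # R = (1/(T+1)) sum_{t=0}^T (y_t - C x_t)(y_t - C x_t)^T
    S = sum(np.outer(y[t] - C @ x[t], y[t] - C @ x[t]) for t in range(len(y)))
    return S / len(y)
```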

Page 18: EM Algorithm

EM-algorithm

• Start with initial guesses of $A$, $C$, $Q$, $R$
• Kalman smoother (E-step): compute the distributions $X_0, \ldots, X_T$ given data $y_0, \ldots, y_T$ and $A$, $C$, $Q$, $R$
• Update parameters (M-step): update $A$, $C$, $Q$, $R$ such that the expected log-likelihood is maximized
• Repeat until convergence (to a local optimum); a loop sketch follows at the end of this page

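One possible shape for the overall loop, assuming the E-step and M-step routines developed on the next two pages; the function names and signatures here are mine, not the slides'.

```python
def em(y, A, C, Q, R, x0, P0, n_iters=50):
    # EM loop sketch: kalman_smoother is sketched on the next page;
    # m_step is a hypothetical routine combining the expected statistics
    # of page 20 with the maximizers of pages 14-17.
    for _ in range(n_iters):
        xs, Ps, L = kalman_smoother(y, A, C, Q, R, x0, P0)   # E-step
        A, C, Q, R = m_step(y, xs, Ps, L)                    # M-step
    return A, C, Q, R
```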

Page 19: EM Algorithm

Kalman Smoother

• for (t = 0; t < T; ++t) // Kalman filter (forward pass)
$$\hat{x}_{t+1|t} = A \hat{x}_{t|t}$$
$$P_{t+1|t} = A P_{t|t} A^T + Q$$
$$K_{t+1} = P_{t+1|t} C^T \left(C P_{t+1|t} C^T + R\right)^{-1}$$
$$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}\left(y_{t+1} - C \hat{x}_{t+1|t}\right)$$
$$P_{t+1|t+1} = (I - K_{t+1} C)\, P_{t+1|t}$$

• for (t = T - 1; t ≥ 0; --t) // Backward pass
$$L_t = P_{t|t} A^T P_{t+1|t}^{-1}$$
$$\hat{x}_{t|T} = \hat{x}_{t|t} + L_t\left(\hat{x}_{t+1|T} - \hat{x}_{t+1|t}\right)$$
$$P_{t|T} = P_{t|t} + L_t\left(P_{t+1|T} - P_{t+1|t}\right) L_t^T$$
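A numpy sketch of these recursions; the initial estimate $\hat{x}_{0|0} = x_0$, $P_{0|0} = P_0$ is assumed given, which the slide does not specify.

```python
import numpy as np

def kalman_smoother(y, A, C, Q, R, x0, P0):
    # E-step: forward Kalman filter followed by the backward (RTS) pass.
    T = len(y) - 1
    n = A.shape[0]
    xf = np.zeros((T + 1, n)); Pf = np.zeros((T + 1, n, n))  # x̂_{t|t}, P_{t|t}
    xp = np.zeros((T + 1, n)); Pp = np.zeros((T + 1, n, n))  # x̂_{t+1|t}, P_{t+1|t}
    xf[0], Pf[0] = x0, P0
    for t in range(T):                                   # forward pass
        xp[t + 1] = A @ xf[t]                            # x̂_{t+1|t} = A x̂_{t|t}
        Pp[t + 1] = A @ Pf[t] @ A.T + Q                  # P_{t+1|t} = A P_{t|t} Aᵀ + Q
        K = Pp[t + 1] @ C.T @ np.linalg.inv(C @ Pp[t + 1] @ C.T + R)
        xf[t + 1] = xp[t + 1] + K @ (y[t + 1] - C @ xp[t + 1])
        Pf[t + 1] = (np.eye(n) - K @ C) @ Pp[t + 1]
    xs = xf.copy(); Ps = Pf.copy()                       # x̂_{t|T}, P_{t|T}
    L = np.zeros((T, n, n))
    for t in range(T - 1, -1, -1):                       # backward pass
        L[t] = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])    # L_t = P_{t|t} Aᵀ P_{t+1|t}⁻¹
        xs[t] = xf[t] + L[t] @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + L[t] @ (Ps[t + 1] - Pp[t + 1]) @ L[t].T
    return xs, Ps, L
```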

Page 20: EM Algorithm

Update Parameters

• The likelihood is in terms of $x$, but only the distributions $X_t$ are available

• The likelihood function $l(A, C, Q, R \mid x, y)$ (expanded form above) is linear in $x_t x_t^T$, $x_t$ and $x_t x_{t-1}^T$

• Expected likelihood: replace those terms by their expectations under the smoothed distributions:
$$E(x_t \mid y) = \hat{x}_{t|T}$$
$$E(x_t x_t^T \mid y) = P_{t|T} + \hat{x}_{t|T} \hat{x}_{t|T}^T$$
$$E(x_t x_{t-1}^T \mid y) = P_{t|T} L_{t-1}^T + \hat{x}_{t|T} \hat{x}_{t-1|T}^T$$

• Use the maximizers with these expectations to update $A$, $C$, $Q$ and $R$ (see the sketch below)
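A small helper collecting these expectations from the xs, Ps, L arrays returned by the kalman_smoother sketch above; the cross term is the standard RTS-smoother result, which I am assuming matches the slide's formula.

```python
import numpy as np

def expected_statistics(xs, Ps, L):
    # Smoothed expectations that replace x_t, x_t x_t^T, x_t x_{t-1}^T.
    T = len(xs) - 1
    Ex = xs                                                       # E(x_t | y)
    Exx = [Ps[t] + np.outer(xs[t], xs[t]) for t in range(T + 1)]  # E(x_t x_t^T | y)
    # Cross term: E(x_t x_{t-1}^T | y) = P_{t|T} L_{t-1}^T + x̂_{t|T} x̂_{t-1|T}^T
    Exx1 = [Ps[t] @ L[t - 1].T + np.outer(xs[t], xs[t - 1])
            for t in range(1, T + 1)]
    return Ex, Exx, Exx1
```

The maximizers from pages 14 through 17 are then applied with these expectations substituted for the corresponding outer products, for example $A = \big(\sum_{t=1}^{T} E(x_t x_{t-1}^T \mid y)\big)\big(\sum_{t=1}^{T} E(x_{t-1} x_{t-1}^T \mid y)\big)^{-1}$.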

Page 21: EM Algorithm

Convergence

• Convergence is guaranteed, but only to a local optimum
• The procedure is similar to coordinate ascent

Page 22: EM Algorithm

Conclusion

• EM-algorithm to simultaneously optimize state estimates and model parameters

• Given "training data", the EM algorithm can be used (off-line) to learn the model for subsequent use in (real-time) Kalman filters

Page 23: EM Algorithm

Next time

• Learning from demonstrations
• Dynamic Time Warping