
Forward–backward algorithm

The forward–backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions o1:t := o1, ..., ot, i.e. it computes, for all hidden state variables Xk ∈ {X1, ..., Xt}, the distribution P(Xk | o1:t). This inference task is usually called smoothing. The algorithm makes use of the principle of dynamic programming to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name forward–backward algorithm.

The term forward–backward algorithm is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer to only one specific instance of this class.

1 Overview

In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all k ∈ {1, ..., t}, the probability of ending up in any particular state given the first k observations in the sequence, i.e. P(Xk | o1:k). In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point k, i.e. P(ok+1:t | Xk). These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:

P(Xk | o1:t) = P(Xk | o1:k, ok+1:t) ∝ P(ok+1:t | Xk) P(Xk | o1:k)

The last step follows from an application of Bayes' rule and the conditional independence of ok+1:t and o1:k given Xk.

As outlined above, the algorithm involves three steps:

1. computing forward probabilities

2. computing backward probabilities

3. computing smoothed values.

The forward and backward steps may also be called “forward message pass” and “backward message pass”; these terms are due to the message passing used in general belief propagation approaches. At each single observation in the sequence, probabilities to be used for calculations at the next observation are computed. The smoothing step can be calculated simultaneously during the backward pass. This step allows the algorithm to take into account any past observations of output for computing more accurate results.

The forward–backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely sequence of states (see Viterbi algorithm).

2 Forward probabilities

The following description will use matrices of probability values rather than probability distributions, although in general the forward–backward algorithm can be applied to continuous as well as discrete probability models.

We transform the probability distributions related to a given hidden Markov model into matrix notation as follows. The transition probabilities P(Xt | Xt−1) of a given random variable Xt representing all possible states in the hidden Markov model will be represented by the matrix T, where the row index, i, represents the start state and the column index, j, represents the target state. The example below represents a system where the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:

T = [ 0.7 0.3
      0.3 0.7 ]

In a typical Markov model we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:

B = [ 0.9 0.1
      0.2 0.8 ]

provides the probabilities for observing events given a particular state. In the above example, event 1 will be observed 90% of the time if we are in state 1, while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2, and event 2 has an 80% chance of occurring. Given a state vector ( π ), the probability of observing event j is then:

P(O = j) = Σi πi bi,j
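In the matrix conventions above, this event probability is a single vector–matrix product. A minimal sketch (NumPy, with the example's numbers; the variable names are illustrative):

```python
import numpy as np

# State vector pi and event matrix B from the example above.
pi = np.array([0.5, 0.5])        # pi[i] = P(state i)
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # B[i, j] = P(event j | state i)

# P(O = j) = sum_i pi_i * b_ij, computed for every event j at once.
event_probs = pi @ B
print(event_probs)               # [0.55 0.45]
```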

This can be represented in matrix form by multiplying the state vector ( π ) by an observation matrix ( Oj = diag(b*,oj) ) containing only diagonal entries. Each entry is the probability of the observed event given each state. Continuing the above example, an observation of event 1 would be:

O1 = [ 0.9 0.0
       0.0 0.2 ]

This allows us to calculate the probabilities associated with transitioning to a new state and observing the given event as:

f0:1 = π T O1

The probability vector that results contains entries indicating the probability of transitioning to each state and observing the given event. This process can be carried forward with additional observations using:

f0:t = f0:t−1 T Ot

This value is the forward probability vector. The i-th entry of this vector provides:

f0:t(i) = P(o1, o2, ..., ot, Xt = xi | π)
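The forward recursion above can be sketched in a few lines; a hypothetical NumPy version using the transition and event matrices from the running example (unnormalized, so the entries shrink as t grows):

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])                  # transition matrix
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])                  # event matrix
pi = np.array([0.5, 0.5])                   # initial state vector

def forward(observations):
    """Unnormalized forward vectors: out[t][i] = P(o_1..o_t, X_t = x_i | pi)."""
    f = pi
    out = []
    for o in observations:                  # o is the observed event index
        f = f @ T @ np.diag(B[:, o])        # f_{0:t} = f_{0:t-1} T O_t
        out.append(f)
    return out

fs = forward([0, 0])                        # two observations of event 1
```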

Typically, we will normalize the probability vector at each step so that its entries sum to 1. A scaling factor is thus introduced at each step such that:

f̂0:t = ct^(−1) f̂0:t−1 T Ot

where f̂0:t−1 represents the scaled vector from the previous step and ct represents the scaling factor that causes the resulting vector's entries to sum to 1. The product of the scaling factors is the total probability for observing the given events irrespective of the final states:

P(o1, o2, ..., ot | π) = ∏s=1..t cs

This allows us to interpret the scaled probability vector as:

f̂0:t(i) = f0:t(i) / ∏s=1..t cs = P(o1, o2, ..., ot, Xt = xi | π) / P(o1, o2, ..., ot | π) = P(Xt = xi | o1, o2, ..., ot, π)

We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time t, and that the scaled probability vector provides us with the probability of being in each state at this time.
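A sketch of the scaled recursion, with illustrative names: each ct is the sum that renormalizes the vector, and their product recovers P(o1, ..., ot | π):

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition matrix from the example
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # event matrix from the example
pi = np.array([0.5, 0.5])                # initial state vector

def scaled_forward(observations):
    """Scaled forward vectors f_hat and scaling factors c_t."""
    f = pi
    f_hats, cs = [], []
    for o in observations:
        f = f @ T @ np.diag(B[:, o])     # one unscaled step: f_{0:t-1} T O_t
        c = f.sum()                      # scaling factor c_t
        f = f / c                        # entries now sum to 1
        f_hats.append(f)
        cs.append(c)
    return f_hats, cs

f_hats, cs = scaled_forward([0, 0, 1, 0, 0])   # an example observation sequence
p_obs = np.prod(cs)                            # P(o_1, ..., o_t | pi)
```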

3 Backward probabilities

A similar procedure can be constructed to find backward probabilities. These intend to provide the probabilities:

bt:T(i) = P(ot+1, ot+2, ..., oT | Xt = xi)

That is, we now want to assume that we start in a particular state ( Xt = xi ), and we are now interested in the probability of observing all future events from this state. Since the initial state is assumed as given (i.e. the prior probability of this state = 100%), we begin with:

bT:T = [1 1 1 ...]^T

Notice that we are now using a column vector while the forward probabilities used row vectors. We can then work backwards using:

bt−1:T = T Ot bt:T

While we could normalize this vector as well so that its entries sum to one, this is not usually done. Noting that each entry contains the probability of the future event sequence given a particular initial state, normalizing this vector would be equivalent to applying Bayes' theorem to find the likelihood of each initial state given the future events (assuming uniform priors for the final state vector). However, it is more common to scale this vector using the same ct constants used in the forward probability calculations. bT:T is not scaled, but subsequent operations use:

b̂t−1:T = ct^(−1) T Ot b̂t:T

where b̂t:T represents the scaled vector from the previous step. The result is that the scaled probability vector is related to the backward probabilities by:

b̂t:T(i) = bt:T(i) / ∏s=t+1..T cs
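The backward recursion can be sketched the same way; an illustrative NumPy version (unscaled, starting from a vector of ones):

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition matrix from the example
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # event matrix from the example

def backward(observations):
    """Unscaled backward vectors: out[t][i] = P(o_{t+1}..o_T | X_t = x_i)."""
    b = np.ones(T.shape[0])              # b_{T:T} = [1 1]^T
    out = [b]
    for o in reversed(observations):
        b = T @ np.diag(B[:, o]) @ b     # b_{t-1:T} = T O_t b_{t:T}
        out.insert(0, b)
    return out

bs = backward([0, 0, 1, 0, 0])           # bs[t] is b_{t:T} for t = 0..5
```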


This is useful because it allows us to find the total probability of being in each state at a given time, t, by multiplying these values:

γt(i) = P(Xt = xi | o1, o2, ..., oT, π) = P(o1, o2, ..., oT, Xt = xi | π) / P(o1, o2, ..., oT | π) = f0:t(i) · bt:T(i) / ∏s=1..T cs = f̂0:t(i) · b̂t:T(i)

To understand this, we note that f0:t(i) · bt:T(i) provides the probability for observing the given events in a way that passes through state xi at time t. This probability includes the forward probabilities covering all events up to time t as well as the backward probabilities which include all future events. This is the numerator we are looking for in our equation, and we divide by the total probability of the observation sequence to normalize this value and extract only the probability that Xt = xi. These values are sometimes called the “smoothed values” as they combine the forward and backward probabilities to compute a final probability.

The values γt(i) thus provide the probability of being in each state at time t. As such, they are useful for determining the most probable state at any time. It should be noted, however, that the term “most probable state” is somewhat ambiguous. While the most probable state is the most likely to be correct at a given point, the sequence of individually probable states is not likely to be the most probable sequence. This is because the probabilities for each point are calculated independently of each other. They do not take into account the transition probabilities between states, and it is thus possible to get states at two moments (t and t+1) that are both most probable at those time points but which have very little probability of occurring together, i.e. P(Xt = xi, Xt+1 = xj) ≠ P(Xt = xi) P(Xt+1 = xj). The most probable sequence of states that produced an observation sequence can be found using the Viterbi algorithm.
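The smoothed values can be computed directly by combining the two passes; a sketch (assumed two-state setup as above) that forms γt by elementwise multiplication and renormalization, so the scaling constants cancel:

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # transition matrix from the example
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # event matrix from the example
pi = np.array([0.5, 0.5])                # initial state vector

def smooth(observations):
    """gammas[t][i] = P(X_{t+1} = x_i | o_1..o_T, pi) for t = 0..T-1."""
    # forward pass (unnormalized)
    fwd, f = [], pi
    for o in observations:
        f = f @ T @ np.diag(B[:, o])
        fwd.append(f)
    # backward pass (unnormalized), aligned so bwd[t] pairs with fwd[t]
    bwd, b = [], np.ones(len(pi))
    for o in reversed(observations):
        bwd.insert(0, b)
        b = T @ np.diag(B[:, o]) @ b
    # combine and renormalize at each step
    return [f_t * b_t / (f_t * b_t).sum() for f_t, b_t in zip(fwd, bwd)]

gammas = smooth([0, 0, 1, 0, 0])         # an example observation sequence
```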

4 Example

This example takes as its basis the umbrella world in Russell & Norvig 2010, Chapter 15, p. 566, in which we would like to infer the weather given observation of a man either carrying or not carrying an umbrella. We assume two possible states for the weather: state 1 = rain, state 2 = no rain. We assume that the weather has a 70% chance of staying the same each day and a 30% chance of changing. The transition probabilities are then:

T = [ 0.7 0.3
      0.3 0.7 ]

We also assume each state can generate one of two events: event 1 = umbrella, event 2 = no umbrella. The conditional probabilities for these occurring in each state are given by the probability matrix:

B = [ 0.9 0.1
      0.2 0.8 ]

We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella}, which we will represent in our calculations as:

O1 = O2 = O4 = O5 = [ 0.9 0.0
                      0.0 0.2 ]

O3 = [ 0.1 0.0
       0.0 0.8 ]

Note that O3 differs from the others because of the “no umbrella” observation.

In computing the forward probabilities we begin with:

f0:0 = ( 0.5 0.5 )

which is our prior state vector indicating that we don't know which state the weather is in before our observations. While a state vector should be given as a row vector, we will use the transpose of the matrix so that the calculations below are easier to read. Our calculations are then written in the form:

(f̂0:t)^T = ct^(−1) Ot T^T (f̂0:t−1)^T

instead of:

f̂0:t = ct^(−1) f̂0:t−1 T Ot

Notice that the transformation matrix is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:

(f̂0:1)^T = c1^(−1) O1 T^T ( 0.5000 0.5000 )^T = c1^(−1) ( 0.4500 0.1000 )^T = ( 0.8182 0.1818 )^T

(f̂0:2)^T = c2^(−1) O2 T^T ( 0.8182 0.1818 )^T = c2^(−1) ( 0.5645 0.0745 )^T = ( 0.8834 0.1166 )^T

(f̂0:3)^T = c3^(−1) O3 T^T ( 0.8834 0.1166 )^T = c3^(−1) ( 0.0653 0.2772 )^T = ( 0.1907 0.8093 )^T

(f̂0:4)^T = c4^(−1) O4 T^T ( 0.1907 0.8093 )^T = c4^(−1) ( 0.3386 0.1247 )^T = ( 0.7308 0.2692 )^T

(f̂0:5)^T = c5^(−1) O5 T^T ( 0.7308 0.2692 )^T = c5^(−1) ( 0.5331 0.0815 )^T = ( 0.8673 0.1327 )^T

For the backward probabilities we start with:

b5:5 = ( 1.0 1.0 )^T


We are then able to compute (using the observations in reverse order and normalizing with different constants):

b̂4:5 = α T O5 ( 1.0000 1.0000 )^T = α ( 0.6900 0.4100 )^T = ( 0.6273 0.3727 )^T

b̂3:5 = α T O4 ( 0.6273 0.3727 )^T = α ( 0.4175 0.2215 )^T = ( 0.6533 0.3467 )^T

b̂2:5 = α T O3 ( 0.6533 0.3467 )^T = α ( 0.1289 0.2138 )^T = ( 0.3763 0.6237 )^T

b̂1:5 = α T O2 ( 0.3763 0.6237 )^T = α ( 0.2745 0.1889 )^T = ( 0.5923 0.4077 )^T

b̂0:5 = α T O1 ( 0.5923 0.4077 )^T = α ( 0.3976 0.2170 )^T = ( 0.6469 0.3531 )^T

Finally, we will compute the smoothed probability values. These results also must be scaled so that their entries sum to 1, because we did not scale the backward probabilities with the ct's found earlier. The backward probability vectors above thus actually represent the likelihood of each state at time t given the future observations. Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time.

(γ0)^T = α ( 0.5000 0.5000 )^T ∘ ( 0.6469 0.3531 )^T = α ( 0.3235 0.1765 )^T = ( 0.6469 0.3531 )^T

(γ1)^T = α ( 0.8182 0.1818 )^T ∘ ( 0.5923 0.4077 )^T = α ( 0.4846 0.0741 )^T = ( 0.8673 0.1327 )^T

(γ2)^T = α ( 0.8834 0.1166 )^T ∘ ( 0.3763 0.6237 )^T = α ( 0.3324 0.0728 )^T = ( 0.8204 0.1796 )^T

(γ3)^T = α ( 0.1907 0.8093 )^T ∘ ( 0.6533 0.3467 )^T = α ( 0.1246 0.2806 )^T = ( 0.3075 0.6925 )^T

(γ4)^T = α ( 0.7308 0.2692 )^T ∘ ( 0.6273 0.3727 )^T = α ( 0.4584 0.1003 )^T = ( 0.8204 0.1796 )^T

(γ5)^T = α ( 0.8673 0.1327 )^T ∘ ( 1.0000 1.0000 )^T = α ( 0.8673 0.1327 )^T = ( 0.8673 0.1327 )^T

Notice that the value of γ0 is equal to b̂0:5 and that γ5 is equal to f̂0:5. This follows naturally because both f̂0:5 and b̂0:5 begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, γ0 will only be equal to b̂0:5 when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case, b̂0:5 needs to be combined with the initial state vector to find the most likely initial state. We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state. Similarly, the backward probabilities can be combined with the initial state vector to provide the most probable initial state given the observations. The forward and backward probabilities need only be combined to infer the most probable states between the initial and final points.

The calculations above reveal that the most probable weather state on every day except for the third one was “rain.” They tell us more than this, however, as they now provide a way to quantify the probabilities of each state at different times. Perhaps most importantly, our value at γ5 quantifies our knowledge of the state vector at the end of the observation sequence. We can then use this to predict the probability of the various weather states tomorrow as well as the probability of observing an umbrella.
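That prediction step can be sketched numerically (illustrative NumPy names; γ5 taken from the table above): propagating γ5 through the transition matrix gives tomorrow's weather distribution, and multiplying by the umbrella column of B gives the chance of seeing an umbrella:

```python
import numpy as np

T = np.array([[0.7, 0.3], [0.3, 0.7]])   # weather transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # event matrix; column 0 = umbrella

gamma_5 = np.array([0.8673, 0.1327])     # smoothed state estimate for day 5

tomorrow = gamma_5 @ T                   # predicted weather for day 6
p_umbrella = tomorrow @ B[:, 0]          # P(umbrella observed on day 6)
# tomorrow is about [0.647 0.353]; p_umbrella is about 0.653
```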

5 Performance

The brute-force procedure for the solution of this problem is the generation of all possible N^T state sequences and calculating the joint probability of each state sequence with the observed series of events. This approach has time complexity O(T · N^T), where T is the length of sequences and N is the number of symbols in the state alphabet. This is intractable for realistic problems, as the number of possible hidden node sequences typically is extremely high. However, the forward–backward algorithm has time complexity O(N^2 T).

An enhancement to the general forward–backward algorithm, called the Island algorithm, trades smaller memory usage for longer running time, taking O(N^2 T log T) time and O(N^2 log T) memory. On a computer with an unlimited number of processors, this can be reduced to O(N^2 T) total time, while still taking only O(N^2 log T) memory.[1]

In addition, algorithms have been developed to compute f0:t+1 efficiently through online smoothing, such as the fixed-lag smoothing (FLS) algorithm (Russell & Norvig 2010, Figure 15.6, p. 580).

6 Pseudocode

ForwardBackward(guessState, sequenceIndex):
    if sequenceIndex is past the end of the sequence:
        return 1
    if (guessState, sequenceIndex) has been seen before:
        return saved result
    result = 0
    for each neighboring state n:
        result = result + (transition probability from guessState to n
                           given observation element at sequenceIndex)
                        * ForwardBackward(n, sequenceIndex + 1)
    save result for (guessState, sequenceIndex)
    return result

7 Python example

Given an HMM (just like in the Viterbi algorithm) represented in the Python programming language:

states = ('Healthy', 'Fever')
end_state = 'E'
observations = ('normal', 'cold', 'dizzy')

start_probability = {'Healthy': 0.6, 'Fever': 0.4}

transition_probability = {
    'Healthy': {'Healthy': 0.69, 'Fever': 0.3, 'E': 0.01},
    'Fever':   {'Healthy': 0.4, 'Fever': 0.59, 'E': 0.01},
}

emission_probability = {
    'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    'Fever':   {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
}

We can write the implementation like this:

def fwd_bkw(x, states, a_0, a, e, end_st):
    L = len(x)

    # forward part of the algorithm
    fwd = []
    f_prev = {}
    for i, x_i in enumerate(x):
        f_curr = {}
        for st in states:
            if i == 0:
                # base case for the forward part
                prev_f_sum = a_0[st]
            else:
                prev_f_sum = sum(f_prev[k] * a[k][st] for k in states)
            f_curr[st] = e[st][x_i] * prev_f_sum
        fwd.append(f_curr)
        f_prev = f_curr
    p_fwd = sum(f_curr[k] * a[k][end_st] for k in states)

    # backward part of the algorithm
    bkw = []
    b_prev = {}
    for i, x_i_plus in enumerate(reversed(x[1:] + (None,))):
        b_curr = {}
        for st in states:
            if i == 0:
                # base case for the backward part
                b_curr[st] = a[st][end_st]
            else:
                b_curr[st] = sum(a[st][l] * e[l][x_i_plus] * b_prev[l] for l in states)
        bkw.insert(0, b_curr)
        b_prev = b_curr
    p_bkw = sum(a_0[l] * e[l][x[0]] * b_curr[l] for l in states)

    # merging the two parts
    posterior = []
    for i in range(L):
        posterior.append({st: fwd[i][st] * bkw[i][st] / p_fwd for st in states})

    assert abs(p_fwd - p_bkw) < 1e-12  # both passes compute P(x)
    return fwd, bkw, posterior

The function fwd_bkw takes the following arguments: x is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; a_0 is the start probability; a are the transition probabilities; and e are the emission probabilities.

For simplicity of code, we assume that the observation sequence x is non-empty and that a[i][j] and e[i][j] are defined for all states i, j.

In the running example, the forward-backward algorithm is used as follows:

def example():
    return fwd_bkw(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability,
                   end_state)

for line in example():
    print(' '.join(map(str, line)))

8 See also

• Baum–Welch algorithm

• Viterbi algorithm

• BCJR algorithm

9 References

[1] J. Binder, K. Murphy and S. Russell. Space-Efficient Inference in Dynamic Probabilistic Networks. Int'l Joint Conf. on Artificial Intelligence, 1997.

• Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77 (2), p. 257–286, February 1989. doi:10.1109/5.18626

• Lawrence R. Rabiner, B. H. Juang (January 1986). “An introduction to hidden Markov models”. IEEE ASSP Magazine: 4–15.

• Eugene Charniak (1993). Statistical Language Learning. Cambridge, Massachusetts: MIT Press. ISBN 978-0-262-53141-2.

• Stuart Russell and Peter Norvig (2010). Artificial Intelligence: A Modern Approach, 3rd Edition. Upper Saddle River, New Jersey: Pearson Education/Prentice-Hall. ISBN 978-0-13-604259-4.

10 External links

• An interactive spreadsheet for teaching the forward–backward algorithm (spreadsheet and article with step-by-step walk-through)

• Tutorial of hidden Markov models including the forward–backward algorithm

• Collection of AI algorithms implemented in Java (including HMM and the forward–backward algorithm)
