HMM Applications - Paul Murphy

7/28/2019 HMM Applications - Paul Murphy

AN INVESTIGATION OF THE HIDDEN MARKOV MODEL AND SELECTED APPLICATIONS

    Paul Murphy

    Digital Signal Processing Principles, University of Strathclyde

University of Strathclyde, G1 1XQ, Glasgow, Scotland
phone: +(44) 7894 911746, email: [email protected]

    ABSTRACT

A Hidden Markov Model (HMM) is a specific example of a stochastic system in which the state of the system depends only on the immediately preceding state, while the observable output depends only on the current state. First proposed in the mid-60s, the model has in recent years been employed with great success in a broad range of applications, including biometrics, DNA sequencing and speech/handwriting recognition. This paper will introduce the theoretical definition of the HMM by example, along with mathematical descriptions, with a view to highlighting the problems inferred by the models and, through their solutions, examining how the model can be successfully applied to temporal pattern recognition.

1. INTRODUCTION

The problem of characterizing real-world signals in terms of statistical signal models is one of great interest to numerous fields of science and computing. It allows signals to be processed to achieve a desired output, such as noise removal or distortion reversal, as well as offering a method of describing the source without the necessity of having the source available. This type of characterization can be, and has been, approached from several points of view, with varying degrees of success depending on the application. One such approach is through the use of deterministic models [1]. By these methods a signal can be defined by known parameters (e.g., frequency, amplitude, sine/sum of exponentials, etc.). A second

method involves trying to extract the statistical model of the signal and define the stochastic process which describes it, based on the assumption that the process can be estimated with good precision. In terms of speech recognition technology, for example, both methods (and more) have been successfully applied, but for general Temporal Pattern Recognition (TPR), a term which incorporates not only speech but also handwriting and gesture recognition, probabilistic methods are ubiquitous. In this paper stochastic methods, more

specifically HMMs, will be the focus. The principles of stochastic models are best communicated through example, and the HMM is particularly well covered in this respect. For this reason section 2 will deal with two classic examples: the toss of one or several coins, which is a staple example of probability theory, and the classic ball-in-urn example after Baum et al., who were very much the forerunners in research on HMMs. From these examples a mathematical model, which can readily be applied in software (indeed Matlab includes a number of HMM functions in the statistics toolbox), will be developed. The culmination of this will be an application example to show how HMMs have been, and are being, applied to real-world technologies, and a summary of proposed future activities.

2. HIDDEN MARKOV MODEL

A Markov model is one which possesses the Markov property. This is the definition that the next state of the model can be predicted based on the present state with no less accuracy than if all previous states were known. This property, named for the Russian mathematician Andrey Markov, had been known since the early part of the 20th century, but it was in the 1960s that Leonard E. Baum and his colleagues at Princeton published a series of papers detailing the use of Markov chains to solve probabilistic functions. These chains are a Markov model in which the state of the system is directly observable. It is important to first understand these models before extending the theory to HMMs.

2.1 Markov Chains

Consider a coin toss experiment consisting of two biased coins. One coin will be tossed first, whether at the request of the observer (dependent) or by some probability (autonomous). The coin will display either heads (H) or tails (T) according to a known weighting. Next, the following coin to be tossed will be selected according to a second probability weighting: either the first coin again (S1) or the other (S2). An example of the observation sequence (O) and state

series (S) might be:

S = S1 S1 S2 S2 S1 S1 ... S2
O = H  T  T  H  H  T  ... H

As discussed, the state transition is dependent only upon the current state and the transition probabilities. Equally, the observation in the current state depends entirely upon that coin's bias. We can show these properties diagrammatically as in figure 1: S1 and S2 represent the coins and a_ij is the probability of changing coin. Beneath these are the probability functions for the output in each state. The parameters of the model must obey the standard stochastic constraints (1a, b).

Figure 1 - Two-coin probability model

a_ij ≥ 0 for all i, j (1a)

Σ_j a_ij = 1 for all i (1b)

    From this it is possible to describe the model fully using matrix

    notation. Let A be the transition probability matrix:

  • 7/28/2019 HMM Applications - Paul Murphy

    2/4

A = [a_ij], 1 ≤ i, j ≤ N (2)

where q_t is the state at time t:

a_ij = P(q_{t+1} = S_j | q_t = S_i) (3)

(Note: for equation (3) and similar, read "the probability that the next state will be state j given that the current state is state i".) Now let B equal the output probability for each state (the emission matrix):

B = [b_j(k)], b_j(k) = P(v_k at t | q_t = S_j) (4)

where v_k is the observable output. From this, the calculation of the example output sequence probability is trivial, being the product of the transition and emission coefficients for each consecutive state and output.
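This product can be written down directly in software. The sketch below computes it for the two-coin chain; the bias and transition values are invented for illustration, since the paper does not give numbers.

```python
import numpy as np

# Two-coin Markov chain. All probability values here are illustrative
# assumptions, not values from the paper.
A  = np.array([[0.7, 0.3],     # a_ij = P(next coin = j | current coin = i)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],     # b_j(k) = P(output k | coin j); columns: H, T
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])      # probability of which coin is tossed first

def chain_prob(S, O):
    """P(S, O): the product of the initial, transition and emission
    probabilities along an *observable* state sequence."""
    p = pi[S[0]] * B[S[0], O[0]]
    for prev, cur, out in zip(S, S[1:], O[1:]):
        p *= A[prev, cur] * B[cur, out]
    return p

S = [0, 0, 1, 1, 0]            # coin indices: S1 = 0, S2 = 1
O = [0, 1, 1, 0, 0]            # outputs: heads = 0, tails = 1
print(chain_prob(S, O))        # probability of this particular run
```

Because the states are observed, no inference is needed here; this directness is exactly what is lost once the model becomes hidden.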

The coin example gives a reasonable insight into how discrete Markov models can be defined and solved for given conditions. This predictability is lost somewhat in many real systems, in that often the state of the system is not directly observable (for example, if the coins were tossed behind a curtain and only the results called over). Given that a number of outputs are observable in any state, and those outputs are not exclusive to that state, it cannot be said with absolute certainty how the observation sequence was generated. A model of this type, in which the states are not observable, is called a Hidden Markov Model.

2.2 Hidden Markov Models

The previous example considered a 2-state (N=2), 2-emission (M=2) model. For an appreciation of the scaling effect on the models, now consider the classic ball-in-urn model presented by Jack Ferguson, much referenced in relation to HMMs and depicted in figure 2.

There are N urns in a room. Each urn contains an assortment of balls of M different colours. A genie enters the room. By some pseudo-random process the genie selects an urn and takes a ball from it, placing it on a conveyor belt to be taken past the view of the observer, who takes note of it. The ball is then returned to the urn and the genie selects a new urn (possibly the same urn) according to the random selection process given by the current urn. The ball selection process is then repeated. The sequence noted by the observer is the observation sequence. We would like to model this system as an HMM. In order to do this we must know the parameters. It should be noted that even if all of the parameters of the model are known, the Markov model is still hidden.

Figure 2 - Generalising model parameters using the ball-in-urn example

2.2.1 Number of States - N

Denoted by N, this is the number of possible states the model can be in at any time, t. Individual states are denoted S = [S1, S2, S3, ..., SN]; the state at time t is denoted q_t.

2.2.2 Discrete Alphabet - M

This is the number of possible values associated with an observation, e.g., for the coin toss, 2 (heads or tails). The individual characters are denoted V = [v1, v2, v3, ..., vM];

2.2.3 State Transition Probability - A

The probability associated with a change from one state to another, as defined by equations 2 and 3. In the generalised example it is assumed that the model may reach any other state from the current state, thus a_ij > 0. It should be noted that this need not always be the case.

2.2.4 Symbol Emission Probability - B

The probability of any character v_k being output in any state, as defined in equation 4.

2.2.5 Initial State Probability - π

As previously alluded to, for an autonomous model the initial state is stochastically defined. This parameter, π, is a vector relevant only to S_i at q_1.

Thus, given appropriate values for A, B, N, M and π, it is possible to generate the sequence

O = [O1, O2, ..., OT]

where each O is some value from the alphabet V and T is the number of observations in the sequence. The process for generating the observation sequence is given by figure 3 and can readily be implemented in software for large values of M, N and T.

    Figure 3- Sequence generation algorithm
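The generation procedure of figure 3 can be sketched in a few lines. The urn counts and probabilities below (N=3, M=2) are invented for illustration; the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ball-in-urn parameters for N = 3 urns and M = 2 colours.
# All values are assumptions, not taken from the paper.
A  = np.array([[0.6, 0.3, 0.1],     # urn-to-urn transition probabilities
               [0.2, 0.5, 0.3],
               [0.3, 0.3, 0.4]])
B  = np.array([[0.8, 0.2],          # colour mix within each urn
               [0.5, 0.5],
               [0.1, 0.9]])
pi = np.array([0.5, 0.3, 0.2])      # initial urn choice

def generate(T):
    """Figure 3's procedure: choose q1 from pi, then repeatedly emit an
    observation from row B[q] and move on using transition row A[q]."""
    q = rng.choice(len(pi), p=pi)
    O = []
    for _ in range(T):
        O.append(int(rng.choice(B.shape[1], p=B[q])))
        q = rng.choice(len(pi), p=A[q])
    return O

print(generate(10))                 # a list of 10 colour indices
```

The same loop works unchanged for any N, M and T, which is the scaling property noted above.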

It can be shown from this that the full description of the ball-in-urn example can be expanded to the full parameter set (N, M, A, B, π), which for any system can be written in shorthand as:

λ = (A, B, π)

From this, a state sequence can be obtained:

Q = q1 q2 ... qT


The probability of the above state sequence coinciding with the observation sequence is the product of the corresponding equations:

P(O, Q | λ) = π_q1 b_q1(O1) a_q1q2 b_q2(O2) ... a_q(T-1)qT b_qT(OT) (5)

In order to compute the probability of a particular observation sequence, it would be necessary to repeat equation 5 for all possible state sequences and sum the results:

P(O | λ) = Σ_{all Q} P(O, Q | λ) (6)
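Equations 5 and 6 can be implemented literally by enumerating every state sequence; the sketch below does so for a two-state, two-symbol model with invented parameter values, and makes the N^T blow-up discussed next easy to see.

```python
import itertools
import numpy as np

# Illustrative two-state, two-symbol HMM (values invented, not from the paper).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def joint_prob(O, Q):
    """Equation 5: P(O, Q | lambda) for one particular state sequence Q."""
    p = pi[Q[0]] * B[Q[0], O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
    return p

def brute_force_prob(O):
    """Equation 6: sum equation 5 over every possible state sequence.
    The number of terms is N**T, which is what makes this unfeasible."""
    N = len(pi)
    return sum(joint_prob(O, Q)
               for Q in itertools.product(range(N), repeat=len(O)))

print(brute_force_prob([0, 1, 1]))
```

A useful sanity check is that summing this probability over every possible observation sequence of a given length yields exactly 1.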

2.3 HMM Problems

It is evident from the theory presented in sections 2.1 and 2.2 that the use of HMMs for practical purposes presents a number of problems. These were set out by Jack Ferguson in a series of lectures at Bell Laboratories.

Problem 1: For a given observation sequence O and a model λ, how can the probability of the sequence be efficiently computed?

Problem 2: For the sequence O and model λ, which state sequence best explains the observation?

Problem 3: How can the parameters of λ be adjusted to maximize P(O|λ)?

    The solutions to these problems provide the tools necessary to

    implement HMMs effectively in real applications.

Figure 4 - Forward variable diagram

Solution 1: From the derivation culminating in equation 6 it is evident that this type of solution scales poorly with T and N, in that the full solution requires 2T·N^T multiplications, given that there are N possible states for each of the T observations. For this reason the solution to equation 6 becomes unfeasible for even small values of N and T (say the ball-in-urn example with N=5: taking 50 observations would require 2×50×5^50 computations, ~9×10^36). Fortunately a more efficient computational method exists: the Forward-Backward method, or Baum-Welch method. In order to prevent the paper becoming overly mathematical in content, the method is described here referring to principles already discussed, with the proofs (which are readily available elsewhere) omitted.

The Markov property defines that the state of the model depends only upon the preceding state. From this principle it follows that the probability of all sequences can be expressed in terms of the partially observed sequence up to the current time and the current state. This forward variable is calculated from the transition probabilities from all states to arrive at the next state, as shown in figure 4 (for the initial state it is calculated from π), and the output probability. The forward variable α_{t+1}(j) is the sum of all transitions to state j multiplied by the output probability for O_{t+1}. From the calculated αs, the probability P(O|λ) is the sum of all forward variables at the terminal time (T).

By this reasoning, we need perform N multiplications for each state of interest, of which there are N, and T multiplications corresponding to each observation. Thus by this method on the order of N²T calculations are necessary, as opposed to 2T·N^T (using the previous numbers, 5²×50 ~1250 versus 9×10^36). This result can be used for predicting state sequences, but can also be applied to scoring how well an HMM matches a given observation set, thereby choosing the best model for an application.
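The forward pass described above reduces to a short recursion over α. The sketch below uses the same invented two-state model as earlier examples; the parameter values are assumptions, not from the paper.

```python
import numpy as np

# Illustrative two-state, two-symbol model (values invented).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def forward_prob(O):
    """P(O | lambda) via the forward pass: alpha holds, for each state j,
    the probability of the partial observation so far ending in state j."""
    alpha = pi * B[:, O[0]]             # initialisation from pi
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction: sum over all predecessors
    return alpha.sum()                  # termination: sum over final states

print(forward_prob([0, 1]))             # ~N^2 * T work instead of 2T * N^T
```

Each induction step is one N×N matrix-vector product, which is where the N²T cost comes from.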

Solution 2: This problem deals with trying to estimate the state sequence which produced a given sequence O. The solution to this problem, also known as the Decoding Problem, is useful in learning about the structure of the model, for example to find the state sequence which best describes a speech pattern for speech recognition. There are several options available for determining the best match, and so it is important to know the purpose of the system and its qualities in order to generate the best definition. A simplistic but effective method would involve selecting the states most likely to generate each observation. This can be extended to include part of the forward-backward method described above, such that the likelihood accounts for previous states when determining the sequence. Using this method produces the maximum number of correct states for a general model such as that described by figure 4. However, as previously mentioned, the special case where all states may transition to all other states, known as an ergodic model (figure 5a), is not always true. For this reason it is possible for the above method to deliver a state sequence which is not possible under the constraints of the transition matrix. For a better estimate of the state sequence, the Baum-Welch method or a refined forward-backward method called the Viterbi algorithm can be used. Similar to the Baum-Welch method, the Viterbi algorithm uses only the maximum forward variable, as opposed to summing them for each transition, to select the single best path through the observation sequence.

Figure 5 - a) A 4-state ergodic HMM; b) A 4-state left-right (Bakis) HMM
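The Viterbi recursion is the forward pass with the sum replaced by a max, plus back-pointers. A minimal sketch, again on the invented two-coin model (parameter values are assumptions, not from the paper):

```python
import numpy as np

# Illustrative two-coin HMM (invented values, as before).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],          # coin 1 favours heads, coin 2 tails
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])

def viterbi(O):
    """Single best state path: the forward recursion with max in place of
    sum, plus back-pointers to recover the winning sequence."""
    delta = pi * B[:, O[0]]
    back = []
    for o in O[1:]:
        trans = delta[:, None] * A          # delta_i * a_ij for every pair
        back.append(trans.argmax(axis=0))   # best predecessor of each state
        delta = trans.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]            # best terminal state...
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))     # ...then walk the pointers back
    return path[::-1]

print(viterbi([0, 0, 1, 1]))
```

With these biases, a run of heads followed by a run of tails decodes to coin 1 followed by coin 2, and because each step follows a real transition the path always respects the transition matrix.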

Solution 3: Known as the Training Problem, this solution involves finding methods by which to maximize a model's likelihood of generating a given observation sequence by adjusting the transition/emission matrices. If it were possible to design the model with absolute confidence, the system would not be stochastic and thus of no value for statistics. However, to take the example of speech recognition, a given speech sample may be broadly similar to an expected string (observation sequence), and thus the ability of the model to return the closest approximation is more desirable than matching the string precisely (i.e. eliminating "um"s, "eh"s and unexpected pauses).

As with problem 2, there are a number of training methods available depending on the application, but they all involve some form of expectation-maximization (EM) algorithm. As these algorithms are application specific, they will be discussed in the following section on applications rather than being analysed exhaustively here.

From this overview of HMMs it should be clear that the statistical value of the models is significant. Having identified and solved the fundamental problems associated with a doubly stochastic process with an underlying stochastic process which is not observable, the merits of the models from an analytical point of view are evident.

3. PART OF SPEECH TAGGING (POST)

A computer-based field which has enjoyed great success in employing HMMs is that of Part of Speech Tagging (POS tagging or POST). POST is the process of assigning words within a corpus (body of text) a tag based on the respective part of speech (noun, verb, adverb, etc.). It finds use in a number of specific applications, such as spelling and grammar checkers, and provides a good example of HMMs in context.

When approaching part of speech tagging, there are two main methods which have been employed: rule-based and probability-based. While rule-based methods are extremely accurate for words in the lookup tables, probability models allow for greater scope in terms of categorizing words by their context, even if the word itself has not been seen in training. The difficulty in this application comes from the fact that many words can be assigned a different POS depending on context (see figure 6 for POS abbreviations), e.g.:

    I/PN can/MD can/VB a/AT can/NN

The POS can be resolved in several ways, two common methods being:

1. Some POS sequences are more common than others: AT JJ NN is more common than AT JJ VBP ("a new book").
2. Some forms of a word are more common than others: "flour" is more often a noun than a verb.

In the case of POST, the observation sequence O is the word sequence and the states S we wish to know are the tag types. Such a model would be extensively trained using the Viterbi algorithm. In tests, HMM POSTs have been shown to exhibit 96-98% accuracy for trained words and as high as 86-87% accuracy for unknown words (i.e. contextual classification).
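Applying the Viterbi decoder to the "I can can a can" example above only needs tags as states and words as emissions. The toy model below is a sketch: the tag set is taken from the example, but every probability value is invented for illustration, not a trained value.

```python
import numpy as np

# Toy POST model: tags as hidden states, words as emissions.
# All probabilities are invented for illustration, not trained values.
tags  = ["PN", "MD", "VB", "AT", "NN"]
words = ["I", "can", "a"]

pi = np.array([0.6, 0.05, 0.05, 0.25, 0.05])   # sentences often start PN/AT
A  = np.array([                                # P(next tag | current tag)
    [0.05, 0.60, 0.25, 0.05, 0.05],   # PN -> mostly MD/VB ("I can ...")
    [0.05, 0.05, 0.80, 0.05, 0.05],   # MD -> VB ("can can ...")
    [0.05, 0.05, 0.05, 0.70, 0.15],   # VB -> AT ("can a ...")
    [0.02, 0.02, 0.02, 0.02, 0.92],   # AT -> NN ("a can")
    [0.20, 0.20, 0.20, 0.20, 0.20],   # NN -> uniform
])
B  = np.array([                                # P(word | tag)
    [0.90, 0.05, 0.05],               # PN emits "I"
    [0.05, 0.90, 0.05],               # MD emits "can"
    [0.10, 0.80, 0.10],               # VB emits "can"
    [0.05, 0.05, 0.90],               # AT emits "a"
    [0.10, 0.80, 0.10],               # NN emits "can"
])

def viterbi_tags(obs):
    """Standard Viterbi decode; returns the most likely tag sequence."""
    delta = pi * B[:, obs[0]]
    back = []
    for o in obs[1:]:
        trans = delta[:, None] * A
        back.append(trans.argmax(axis=0))
        delta = trans.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return [tags[i] for i in reversed(path)]

sentence = ["I", "can", "can", "a", "can"]
obs = [words.index(w) for w in sentence]
print(list(zip(sentence, viterbi_tags(obs))))
```

Even with hand-set probabilities, the transition structure is enough to separate the three occurrences of "can" into modal, verb and noun readings.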

As with many applications, there exist special cases where no formal definition can be applied, and so no matter how good the model is, the correct result cannot be stated with certainty ("the Duchess was entertaining last night" is equally valid whether "entertaining" is a verb or an adjective). The principles applied in this application are similar to those by which HMMs are applied to speech processing and gesture recognition, for example, and these and many more areas have used such models to great success.

4. CONCLUSION

As a probability concept, HMMs are an interesting case. We have shown how analysis and derivations of simple Markov models can be readily extended to more advanced cases. Having looked at some problems associated with the models and their respective solutions, it has been shown that HMMs can be, and have been, applied to a wide range of real-world applications.

    REFERENCES

[1] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, pp. 257-286, Feb. 1989.

[2] X. Zhu, "Hidden Markov Model," Wisconsin Journal, vol. 1, Jan. 2007.

[4] L.E. Baum, T. Petrie, G. Soules, and N. Weiss, "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains," Ann. Math. Stat., vol. 41, no. 1, pp. 164-171, 1970.

[5] A.W. Drake, "Discrete-state Markov processes," Chapter 5 in Fundamentals of Applied Probability Theory. New York, NY: McGraw-Hill, 1967.

    SELECTED PART OF SPEECH TAGS

Figure 6 - Selected part-of-speech tags for POST