Corpora and Statistical Methods Lecture...


Albert Gatt

Corpora and Statistical Methods

Lecture 8

Markov and Hidden Markov Models: Conceptual

Introduction

Part 2

In this lecture

We focus on (Hidden) Markov Models

conceptual intro to Markov Models

relevance to NLP

Hidden Markov Models

algorithms

Acknowledgement

Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass

Talking about the weather

Suppose we want to predict tomorrow’s weather. The possible predictions are: sunny, foggy, rainy.

We might decide to predict tomorrow’s outcome based on earlier weather: if it’s been sunny all week, it’s likelier to be sunny tomorrow than if it had been rainy all week.

How far back do we want to go to predict tomorrow’s weather?

Statistical weather model Notation:

S: the state space, a set of possible values for the weather: {sunny, foggy, rainy}

(each state is identifiable by an integer i)

X: a sequence of random variables, each taking a value from S

these model weather over a sequence of days

t is an integer standing for time

(X1, X2, X3, ... XT) models the value of a series of random variables

each takes a value from S with a certain probability P(X=si)

the entire sequence tells us the weather over T days

Statistical weather model

If we want to predict the weather for day t+1, our model might

look like this:

E.g. P(weather tomorrow = sunny), conditional on the weather

in the past t days.

Problem: the larger t gets, the more calculations we have to

make.

P(Xt+1 = sk | X1, ..., Xt)

Markov Properties I: Limited horizon

The probability that we’re in state si at time t+1 only

depends on where we were at time t:

Given this assumption, the probability of any sequence is

just:

P(Xt+1 = si | X1, ..., Xt) = P(Xt+1 = si | Xt)

P(X1, ..., XT) = Π (i = 1 to T-1) P(Xi+1 | Xi)

Markov Properties II: Time invariance

The probability of being in state si given the previous state does

not change over time:

P(Xt+1 = si | Xt) = P(X2 = si | X1)

Concrete instantiation

                   Day t+1
Day t     sunny   rainy   foggy
sunny     0.8     0.05    0.15
rainy     0.2     0.6     0.2
foggy     0.2     0.3     0.5

This is essentially a transition matrix, which gives us probabilities of going from one state to the other.

We can denote state transition probabilities as aij (prob. of going from state i to state j)
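As a concrete sketch (Python, names purely illustrative), the transition matrix above can be stored as a nested dict; the limited-horizon assumption then makes the probability of any weather sequence, given its first state, a simple product of one-step transitions:

```python
# Transition matrix from the table above, as a nested dict:
# A[i][j] = P(weather tomorrow = j | weather today = i)
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5},
}

def sequence_probability(states):
    """Probability of a weather sequence given its first state, under the
    limited-horizon assumption: a product of one-step transitions."""
    p = 1.0
    for today, tomorrow in zip(states, states[1:]):
        p *= A[today][tomorrow]
    return p

# Given sunny today: P(sunny tomorrow, rainy the day after) = 0.8 * 0.05
print(sequence_probability(["sunny", "sunny", "rainy"]))
```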

Graphical view

Components of the model:

1. states (s)

2. transitions

3. transition probabilities

4. initial probability distribution

for states

Essentially, a non-deterministic finite

state automaton.

Example continued

If the weather today (Xt) is sunny, what’s the probability that

tomorrow (Xt+1) is sunny and the day after (Xt+2) is rainy?

P(Xt+1 = s, Xt+2 = r | Xt = s)
= P(Xt+1 = s | Xt = s) · P(Xt+2 = r | Xt+1 = s, Xt = s)
= P(Xt+1 = s | Xt = s) · P(Xt+2 = r | Xt+1 = s)    (Markov assumption)
= 0.8 · 0.05
= 0.04

Formal definition

A Markov Model is a triple (S, Π, A) where:

S is the set of states

Π are the probabilities of being initially in some state

A are the transition probabilities

Hidden Markov Models

A slight variation on the example You’re locked in a room with no windows

You can’t observe the weather directly

You only observe whether the guy who brings you food is carrying an umbrella or not

Need a model telling you the probability of seeing the umbrella, given the weather

This is the distinction between observations and their underlying emitting state.

Define:

Ot as an observation at time t

K = {+umbrella, -umbrella} as the possible outputs

We’re interested in P(Ot=k|Xt=si) i.e. p. of a given observation at t given that the underlying weather state at t is si

Symbol emission probabilities

weather     P(umbrella | weather)
sunny       0.1
rainy       0.8
foggy       0.3

This is the hidden model, telling us the probability that Ot = k given that Xt = si

We assume that each underlying state Xt = si emits an observation with a given probability.

Using the hidden model

Model gives:P(Ot=k|Xt=si)

Then, by Bayes’ Rule we can compute: P(Xt=si|Ot=k)

Generalises easily to an entire sequence

P(Xt = si | Ot = k) = P(Ot = k | Xt = si) · P(Xt = si) / P(Ot = k)
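This inversion can be sketched in a few lines of Python. Note that Bayes’ Rule needs a prior P(Xt = si), which the slides do not give, so a uniform prior over the three weather states is assumed here purely for illustration:

```python
# Emission probabilities from the slide: P(+umbrella | weather)
B = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# The slides give no prior over today's weather, so a uniform prior
# is assumed here purely for illustration.
prior = {"sunny": 1 / 3, "rainy": 1 / 3, "foggy": 1 / 3}

def posterior(state, umbrella=True):
    """P(Xt = state | Ot) by Bayes' Rule; the normaliser P(Ot) is
    obtained by marginalising over all weather states."""
    def likelihood(s):
        return B[s] if umbrella else 1 - B[s]
    p_obs = sum(likelihood(s) * prior[s] for s in prior)
    return likelihood(state) * prior[state] / p_obs

print(posterior("rainy"))  # rainy is the most probable state given an umbrella
```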

HMM in graphics

Circles indicate states

Arrows indicate probabilistic dependencies between states

HMM in graphics

Green nodes are hidden states

Each hidden state depends only on the previous state (Markov assumption)

Why HMMs?

HMMs are a way of thinking of underlying events

probabilistically generating surface events.

Example: Parts of speech

a POS is a class or set of words

we can think of language as an underlying Markov Chain of

parts of speech from which actual words are generated

(“emitted”)

So what are our hidden states here, and what are the

observations?

HMMs in POS Tagging

DET → ADJ → N → V

Hidden layer (constructed through training)

Models the sequence of POSs in the training corpus

HMMs in POS Tagging

Hidden states:   DET   ADJ    N      V
Observations:    the   tall   lady   is

Observations are words.

They are “emitted” by their corresponding hidden state.

The state depends on its previous state.

Why HMMs

There are efficient algorithms to train HMMs using

Expectation Maximisation

General idea:

training data is assumed to have been generated by some HMM

(parameters unknown)

try to learn the unknown parameters from the data

Similar idea is used in finding the parameters of some n-gram

models, especially those that use interpolation.

Formalisation of a Hidden Markov model

Crucial ingredients (familiar) Underlying states: S = {s1,…,sN}

Output alphabet (observations): K = {k1,…,kM}

State transition probabilities:

A = {aij}, i,j Є S

State sequence: X = (X1,…,XT+1)

+ a function mapping each Xt to a state s

Output sequence: O = (O1,…,OT) where each ot Є K

Crucial ingredients (additional) Initial state probabilities:

Π = {πi}, i Є S

(tell us the initial probability of each state)

Symbol emission probabilities:

B = {bijk}, i,j Є S, k Є K

(tell us the probability b of seeing observation Ot=k, given that Xt=si and Xt+1 = sj)
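Putting the ingredients together, a model μ = (A, B, Π) can be sketched as plain dicts, reusing the weather/umbrella numbers from earlier slides. Two simplifying assumptions are made here, neither fixed by the slides: emissions depend only on the current state (bik rather than the arc-based bijk above), and Π is uniform:

```python
# A minimal container for mu = (A, B, Pi), reusing the weather/umbrella
# numbers from earlier slides. Assumptions for illustration: emissions
# condition on the current state only, and Pi is uniform.
Pi = {"sunny": 1 / 3, "rainy": 1 / 3, "foggy": 1 / 3}
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5},
}
B = {  # P(observation | state), observations from K = {+umbrella, -umbrella}
    "sunny": {"+umbrella": 0.1, "-umbrella": 0.9},
    "rainy": {"+umbrella": 0.8, "-umbrella": 0.2},
    "foggy": {"+umbrella": 0.3, "-umbrella": 0.7},
}

def joint_probability(states, obs):
    """P(X, O | mu): the initial probability, then alternating
    transition and emission factors along the sequence."""
    p = Pi[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p
```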

Trellis diagram of an HMM

[Figure, built up over three slides: the states s1, s2, s3 are drawn as a column for each time step t1, t2, t3; arcs out of s1 carry the transition probabilities a1,1, a1,2, a1,3; the observation sequence o1, o2, o3 sits below the time axis, linked to the states by emission probabilities b1,1,k, b1,2,k, b1,3,k.]

The fundamental questions for HMMs

1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O| μ)?

2. Given an observation sequence O, and model μ, which is the state sequence (X1,…,XT+1) that best explains the observations?

This is the decoding problem

3. Given an observation sequence O, and a space of possible models μ = (A, B, Π), which model best explains the observed data?
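For question 1, a brute-force answer sums the joint probability over every possible hidden state sequence. The sketch below (same illustrative assumptions as before: state-conditioned emissions, uniform initial distribution) works, but makes clear why an efficient algorithm is needed:

```python
from itertools import product

# Weather/umbrella HMM from earlier slides (uniform initial distribution
# assumed; emissions conditioned on the current state only).
STATES = ["sunny", "rainy", "foggy"]
Pi = {s: 1 / 3 for s in STATES}
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5},
}
B = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}  # P(+umbrella | state)

def emit(s, o):
    return B[s] if o == "+umbrella" else 1 - B[s]

def likelihood(obs):
    """P(O | mu) by brute force: sum P(X, O) over all N^T hidden state
    sequences. The Forward algorithm (next lecture) needs only O(N^2 T)."""
    total = 0.0
    for xs in product(STATES, repeat=len(obs)):
        p = Pi[xs[0]] * emit(xs[0], obs[0])
        for t in range(1, len(obs)):
            p *= A[xs[t - 1]][xs[t]] * emit(xs[t], obs[t])
        total += p
    return total
```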

Application of question 1 (ASR)

Given a model μ = (A, B, Π), how do we compute the

likelihood of an observation P(O| μ)?

Input of an ASR system: a continuous stream of sound

waves, which is ambiguous

Need to decode it into a sequence of phones.

is the input the sequence [n iy d] or [n iy]?

which sequence is the most probable?

Application of question 2 (POS Tagging)

Given an observation sequence O, and model μ, which is the state sequence (X1,…,XT+1) that best explains the observations?

this is the decoding problem

Consider a POS Tagger

Input observation sequence:

I can read

need to find the most likely sequence of underlying POS tags:

e.g. is can a modal verb, or a noun?

how likely is it that can is a noun, given that the previous word is a pronoun?

Summary

HMMs are a way of representing:

sequences of observations arising from

sequences of states

states are the variables of interest, giving rise to the

observations

Next up:

algorithms for answering the fundamental questions about

HMMs