Sequence Models
With slides by me, Joshua Goodman, Fei Xia
Outline
• Language Modeling
• Ngram Models
• Hidden Markov Models
  – Supervised parameter estimation
  – Probability of a sequence
  – Viterbi (or decoding)
  – Baum-Welch
A bad language model

[Slides 3-6: examples shown as images, not reproduced in the transcript.]
What is a language model?
Language Model: A distribution that assigns a probability to language utterances.
e.g., PLM(“zxcv ./,mwea afsido”) is zero;
PLM(“mat cat on the sat”) is tiny;
PLM(“Colorless green ideas sleep furiously”) is bigger;
PLM(“A cat sat on the mat.”) is bigger still.
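A toy illustration of the idea, as a minimal sketch: a unigram model with a made-up corpus and an arbitrary unknown-word probability (none of these numbers come from the slides).

```python
from collections import Counter

# Toy "language model": unigram relative frequencies over a tiny corpus,
# with a small arbitrary probability for unseen words (all numbers illustrative).
corpus = "a cat sat on the mat . the cat sat .".split()
counts = Counter(corpus)
total = sum(counts.values())

def p_lm(utterance, unk=1e-6):
    p = 1.0
    for w in utterance.split():
        p *= counts.get(w, 0) / total or unk   # fall back to unk for unseen words
    return p

print(p_lm("zxcv ./,mwea afsido"))      # essentially zero
print(p_lm("a cat sat on the mat ."))   # much larger
```

Note that a unigram model like this assigns the same score to any reordering of the words, which is exactly the limitation the later slides on word order address.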
What’s a language model for?
• Information Retrieval
• Handwriting recognition
• Speech Recognition
• Spelling correction
• Optical character recognition
• Machine translation
• …
Example Language Model Application
Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).
Straightforward model: estimate P(text | sound) directly.

But this can be hard to train effectively (although see CRFs later).
Example Language Model Application
Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file).
Traditional solution: Bayes' Rule

P(text | sound) = P(sound | text) P(text) / P(sound)

• P(sound | text): the Acoustic Model (easier to train)
• P(text): the Language Model
• P(sound): can be ignored, since it doesn't matter for picking a good text
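Spelled out as a decision rule (a standard noisy-channel restatement, consistent with the slide):

text* = argmax_text P(text | sound)
      = argmax_text P(sound | text) P(text) / P(sound)
      = argmax_text P(sound | text) P(text)

since P(sound) is the same for every candidate text.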
Importance of Sequence
So far, we’ve been making the exchangeability, or bag-of-words, assumption:
The order of words is not important.
It turns out, that’s actually not true (duh!).
“cat mat on the sat” ≠ “the cat sat on the mat”
“Mary loves John” ≠ “John loves Mary”
Language Models with Sequence Information
Problem: How can we define a model that
• assigns a probability to sequences of words (a language model),
• where the probability depends on the order of the words,
• and can be trained and computed tractably?
Outline
• Language Modeling
• Ngram Models
• Hidden Markov Models
  – Supervised parameter estimation
  – Probability of a sequence (decoding)
  – Viterbi (best hidden state sequence)
  – Baum-Welch
• Conditional Random Fields
Smoothing: Kneser-Ney

P(Francisco | eggplant) vs. P(stew | eggplant)
• "Francisco" is common, so backoff and interpolated methods say it is likely
• But it only occurs in the context of "San"
• "Stew" is common, and occurs in many contexts
• Weight the backoff by the number of contexts the word occurs in
Kneser-Ney smoothing (cont.)
Interpolation:
Backoff:
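The formulas themselves were part of the slide image and are not in the transcript. For reference, a standard bigram statement of the two variants (a reconstruction from the literature, not copied from the slide; D is an absolute-discount constant):

Continuation probability:
P_cont(w) = |{w' : C(w' w) > 0}| / |{(w', w'') : C(w' w'') > 0}|

Interpolated Kneser-Ney:
P_KN(w_i | w_{i-1}) = max(C(w_{i-1} w_i) - D, 0) / C(w_{i-1}) + λ(w_{i-1}) P_cont(w_i),
  where λ(w_{i-1}) = (D / C(w_{i-1})) · |{w' : C(w_{i-1} w') > 0}|

Backoff Kneser-Ney:
P_BKN(w_i | w_{i-1}) = max(C(w_{i-1} w_i) - D, 0) / C(w_{i-1})   if C(w_{i-1} w_i) > 0
                     = α(w_{i-1}) P_cont(w_i)                    otherwise

The continuation probability P_cont is the "number of contexts the word occurs in" weighting described on the previous slide.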
Outline
• Language Modeling
• Ngram Models
• Hidden Markov Models
  – Supervised parameter estimation
  – Probability of a sequence (decoding)
  – Viterbi (best hidden state sequence)
  – Baum-Welch
• Conditional Random Fields
The Hidden Markov Model
A dynamic Bayes Net (dynamic because the size can change).
The O_i nodes are called observed nodes. The S_i nodes are called hidden nodes.

[Diagram: a chain of hidden nodes S_1 → S_2 → … → S_n, each emitting an observed node O_i.]
HMMs and Language Processing
• HMMs have been used in a variety of applications, but especially:
  – Speech recognition (hidden nodes are text words, observations are spoken words)
  – Part-of-speech tagging (hidden nodes are parts of speech, observations are words)
HMM Independence Assumptions
HMMs assume that:
• S_i is independent of S_1 through S_{i-2}, given S_{i-1} (the Markov assumption)
• O_i is independent of all other nodes, given S_i
• P(S_i | S_{i-1}) and P(O_i | S_i) do not depend on i

Not very realistic assumptions about language, but HMMs are often good enough, and very convenient.
HMM Formula
An HMM predicts that the probability of observing a sequence o = <o1, o2, …, oT> with a particular set of hidden states s = <s1, … sT> is:
P(o, s) = P(s_1) P(o_1 | s_1) ∏_{i=2}^{T} P(s_i | s_{i-1}) P(o_i | s_i)

To calculate this, we need:
- Prior: P(s_1) for all values of s_1
- Observation: P(o_i | s_i) for all values of o_i and s_i
- Transition: P(s_i | s_{i-1}) for all values of s_i and s_{i-1}
HMM: Pieces

1) A set of hidden states H = {h_1, …, h_N}: the values which hidden nodes may take.

2) A vocabulary, or set of observation symbols, V = {v_1, …, v_M}: the values which an observed node may take.

3) Initial probabilities P(s_1 = h_i) for all i, written as a vector π of N initial probabilities.

4) Transition probabilities P(s_t = h_i | s_{t-1} = h_j) for all i, j, written as an N×N 'transition matrix' A.

5) Observation probabilities P(o_t = v_j | s_t = h_i) for all j, i, written as an N×M 'observation matrix' B.
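A minimal sketch of these pieces in Python/NumPy, together with the joint probability from the HMM formula above (the 2-state, 3-symbol model and its numbers are made up purely for illustration):

```python
import numpy as np

# Toy HMM: N = 2 hidden states, M = 3 observation symbols.
pi = np.array([0.6, 0.4])             # pi[i]   = P(s_1 = h_i)
A  = np.array([[0.7, 0.3],            # A[i, j] = P(s_t = h_j | s_{t-1} = h_i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],       # B[i, k] = P(o_t = v_k | s_t = h_i)
               [0.1, 0.3, 0.6]])

def joint_prob(obs, states):
    """P(o, s) = P(s_1) P(o_1|s_1) * prod_{i>=2} P(s_i|s_{i-1}) P(o_i|s_i)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for i in range(1, len(obs)):
        p *= A[states[i - 1], states[i]] * B[states[i], obs[i]]
    return p

print(joint_prob(obs=[0, 2, 1], states=[0, 1, 1]))
```

Here A is laid out with rows indexing the "from" state, the usual NumPy-friendly convention; the slide's A_ij indexing may be the transpose of this.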
HMM for POS Tagging

1) H = {DT, NN, VB, IN, …}, the set of all POS tags.

2) V = the set of all words in English.

3) Initial probabilities π_i: the probability that a given POS tag can start a sentence.

4) Transition probabilities A_ij: the probability that one tag can follow another.

5) Observation probabilities B_ij: the probability that a tag will generate a particular word.
Outline
• Graphical Models
• Hidden Markov Models
  – Supervised parameter estimation
  – Probability of a sequence
  – Viterbi: what's the best hidden state sequence?
  – Baum-Welch: unsupervised parameter estimation
• Conditional Random Fields
Supervised Parameter Estimation
• Given an observation sequence and states, find the HMM model (π, A, and B) that is most likely to produce the sequence.
• For example, POS-tagged data from the Penn Treebank
Bayesian Parameter Estimation

π̂_i = (# sentences starting with state i) / (# sentences)

â_ij = (# times state i is followed by state j) / (# times state i occurs in the data)

b̂_ik = (# times state i produces observation k) / (# times state i occurs in the data)
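A small sketch of these relative-frequency estimates in Python, using a toy POS-tagged corpus (illustrative only; no smoothing):

```python
from collections import Counter, defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("a", "DT"), ("dog", "NN"), ("barked", "VB")],
]

start = Counter()                 # counts for pi
trans = defaultdict(Counter)      # counts for A: trans[prev_tag][tag]
emit  = defaultdict(Counter)      # counts for B: emit[tag][word]

for sent in corpus:
    tags = [tag for _, tag in sent]
    start[tags[0]] += 1
    for word, tag in sent:
        emit[tag][word] += 1
    for prev, cur in zip(tags, tags[1:]):
        trans[prev][cur] += 1

# Turn counts into maximum-likelihood probability estimates.
pi_hat = {t: c / len(corpus) for t, c in start.items()}
A_hat  = {p: {t: c / sum(cs.values()) for t, c in cs.items()} for p, cs in trans.items()}
B_hat  = {t: {w: c / sum(cs.values()) for w, c in cs.items()} for t, cs in emit.items()}

print(pi_hat["DT"], A_hat["DT"]["NN"], B_hat["NN"]["cat"])   # 1.0 1.0 0.5
```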
Outline
• Graphical Models
• Hidden Markov Models
  – Supervised parameter estimation
  – Probability of a sequence
  – Viterbi
  – Baum-Welch
• Conditional Random Fields
What’s the probability of a sentence?
Suppose I asked you, ‘What’s the probability of seeing a sentence w1, …, wT on the web?’
If we have an HMM model of English, we can use it to estimate the probability.
(In other words, HMMs can be used as language models.)
Conditional Probability of a Sentence
• If we knew the hidden states that generated each word in the sentence, it would be easy:
P(w_1, …, w_T | s_1, …, s_T)
  = P(w_1, …, w_T, s_1, …, s_T) / P(s_1, …, s_T)
  = [ P(s_1) P(w_1 | s_1) ∏_{i=2}^{T} P(s_i | s_{i-1}) P(w_i | s_i) ] / [ P(s_1) ∏_{i=2}^{T} P(s_i | s_{i-1}) ]
  = ∏_{i=1}^{T} P(w_i | s_i)
Probability of a Sentence
Via marginalization, we have:

P(w_1, …, w_T) = Σ_{a_1, …, a_T} P(w_1, …, w_T, a_1, …, a_T)
               = Σ_{a_1, …, a_T} P(a_1) P(w_1 | a_1) ∏_{i=2}^{T} P(a_i | a_{i-1}) P(w_i | a_i)

Unfortunately, if there are N possible values for each a_i (h_1 through h_N), then there are N^T values for a_1, …, a_T. Brute-force computation of this sum is intractable.
Forward Procedure

• Special structure gives us an efficient solution using dynamic programming.
• Intuition: the probability of the first t observations is the same for all possible t+1-length state sequences.
• Define the forward variable: α_i(t) = P(o_1 … o_t, x_t = i | μ), where μ = (π, A, B) denotes the model.
α_j(t+1) = P(o_1 … o_{t+1}, x_{t+1} = j)
         = P(o_1 … o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)
         = P(o_1 … o_t | x_{t+1} = j) P(o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)
         = P(o_1 … o_t, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1…N} P(o_1 … o_t, x_t = i, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1…N} P(o_1 … o_t, x_t = i) P(x_{t+1} = j | x_t = i) P(o_{t+1} | x_{t+1} = j)
         = Σ_{i=1…N} α_i(t) a_ij b_j(o_{t+1})
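A compact sketch of the resulting forward recursion in Python/NumPy (reusing the toy pi, A, B arrays from the earlier sketch; the base case α_i(1) = π_i b_i(o_1) follows directly from the definition):

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1 ... o_{t+1}, x_{t+1} = i), with 0-indexed t."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # base case: pi_i * b_i(o_1)
    for t in range(1, T):
        # alpha_j(t+1) = sum_i alpha_i(t) a_ij * b_j(o_{t+1})
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

obs = [0, 2, 1]
alpha = forward(pi, A, B, obs)
print(alpha[-1].sum())                             # P(O | mu) = sum_i alpha_i(T)
```

This runs in O(T·N²) time rather than the O(N^T) of the brute-force sum.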
Backward Procedure

Define the backward variable: β_i(t) = P(o_{t+1} … o_T | x_t = i), the probability of the rest of the observations given the state at time t.

Base case:  β_i(T) = 1

Recursion:  β_i(t) = Σ_{j=1…N} a_ij b_j(o_{t+1}) β_j(t+1)
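A matching sketch of the backward recursion (same toy model; illustrative only):

```python
import numpy as np

def backward(pi, A, B, obs):
    """beta[t, i] = P(o_{t+2} ... o_T | x_{t+1} = i), with 0-indexed t."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                          # base case: beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        # beta_i(t) = sum_j a_ij * b_j(o_{t+1}) * beta_j(t+1)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```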
Decoding Solution

Forward procedure:   P(O | μ) = Σ_{i=1…N} α_i(T)

Backward procedure:  P(O | μ) = Σ_{i=1…N} π_i b_i(o_1) β_i(1)

Combination:         P(O | μ) = Σ_{i=1…N} α_i(t) β_i(t), for any 1 ≤ t ≤ T
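With the forward and backward sketches above, the three identities can be checked numerically on the toy model:

```python
obs = [0, 2, 1]
alpha, beta = forward(pi, A, B, obs), backward(pi, A, B, obs)
p_fwd = alpha[-1].sum()                          # sum_i alpha_i(T)
p_bwd = (pi * B[:, obs[0]] * beta[0]).sum()      # sum_i pi_i b_i(o_1) beta_i(1)
p_any = (alpha[1] * beta[1]).sum()               # sum_i alpha_i(t) beta_i(t), any t
assert np.allclose([p_fwd, p_bwd], p_any)
```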
Outline
• Graphical Models
• Hidden Markov Models
  – Supervised parameter estimation
  – Probability of a sequence
  – Viterbi: what's the best hidden state sequence?
  – Baum-Welch
• Conditional Random Fields
Best State Sequence

• Find the hidden state sequence that best explains the observations:

  X̂ = argmax_X P(X | O)

• Viterbi algorithm
Viterbi Algorithm

δ_j(t) = max_{x_1 … x_{t-1}} P(x_1 … x_{t-1}, o_1 … o_{t-1}, x_t = j, o_t)

The state sequence which maximizes the probability of seeing the observations up to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm: recursive computation

δ_j(t+1) = max_{i=1…N} δ_i(t) a_ij b_j(o_{t+1})

ψ_j(t+1) = argmax_{i=1…N} δ_i(t) a_ij b_j(o_{t+1})
Viterbi Algorithm: termination and backtrace

X̂_T = argmax_i δ_i(T)

X̂_t = ψ_{X̂_{t+1}}(t+1)

P(X̂) = max_i δ_i(T)

Compute the most likely state sequence by working backwards from the final state.
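A short sketch of the full recursion plus backtrace in Python/NumPy (same toy pi, A, B; illustrative only):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely hidden state sequence for obs."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # delta[t, j]: best score of any path ending in j at t
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: best predecessor of state j at time t
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrace from the best final state.
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t, states[-1]]))
    return list(reversed(states))

print(viterbi(pi, A, B, obs=[0, 2, 1]))
```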
Outline
• Graphical Models
• Hidden Markov Models
  – Supervised parameter estimation
  – Probability of a sequence
  – Viterbi
  – Baum-Welch: unsupervised parameter estimation
• Conditional Random Fields
Unsupervised Parameter Estimation

• Given an observation sequence, find the model that is most likely to produce that sequence.
• There is no analytic method.
• Instead (Baum-Welch), given a model and an observation sequence, repeatedly update the model parameters to better fit the observations.
Parameter Estimation

Probability of traversing the arc from state i to state j at time t:

p_t(i, j) = α_i(t) a_ij b_j(o_{t+1}) β_j(t+1) / Σ_{m=1…N} α_m(t) β_m(t)

Probability of being in state i at time t:

γ_i(t) = Σ_{j=1…N} p_t(i, j)
Parameter Estimation

Now we can compute the new estimates of the model parameters:

π̂_i = γ_i(1)

â_ij = Σ_{t=1}^{T-1} p_t(i, j) / Σ_{t=1}^{T-1} γ_i(t)

b̂_ik = Σ_{t : o_t = k} γ_i(t) / Σ_{t=1}^{T} γ_i(t)
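A sketch of one Baum-Welch (EM) update built on the forward and backward sketches above (single observation sequence, no underflow handling; purely illustrative):

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM re-estimation of (pi, A, B) from a single observation sequence."""
    T, N = len(obs), len(pi)
    alpha, beta = forward(pi, A, B, obs), backward(pi, A, B, obs)
    prob = alpha[-1].sum()                          # P(O | mu)

    # E-step: arc posteriors p_t(i, j) and state posteriors gamma_i(t).
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
    xi /= prob
    gamma = alpha * beta / prob

    # M-step: new parameter estimates.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```

Each such update weakly increases how well the model fits the observations, which is the guarantee stated on the next slide.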
Parameter Estimation

• Guarantee: P(o_{1:T} | A, B, π) ≤ P(o_{1:T} | Â, B̂, π̂). In other words, by repeating this procedure, we can gradually improve how well the HMM fits the unlabeled data.
• There is no guarantee that this will converge to the best possible HMM, however; it is only guaranteed to find a local maximum.
The Most Important Thing

We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not tractable.