
Lecture 11: Viterbi and Forward Algorithms

Kai-Wei Chang, CS @ University of Virginia

kw@kwchang.net

Course webpage: http://kwchang.net/teaching/NLP16


Quiz 1

• Max: 24; Mean: 18.1; Median: 18; SD: 3.36


[Histogram of Quiz 1 scores, binned: [0-5], [6-10], [11-15], [16-20], [21-25].]

This lecture

• Two important algorithms for inference
  • Forward algorithm
  • Viterbi algorithm



Three basic problems for HMMs

• Likelihood of the input: Forward algorithm
  (How likely is it that the sentence "I love cat" occurs?)
• Decoding (tagging) the input: Viterbi algorithm
  (What are the POS tags of "I love cat"?)
• Estimation (learning): find the best model parameters
  (How do we learn the model?)
  • Case 1: supervised – tags are annotated → maximum likelihood estimation (MLE)
  • Case 2: unsupervised – only unannotated text → forward-backward algorithm

Likelihood of the input

• How likely is it that the sentence "I love cat" occurs?
  • Compute $P(\mathbf{w} \mid \lambda)$ for the input $\mathbf{w}$ and HMM $\lambda$
• Remember, we model $P(\mathbf{t}, \mathbf{w} \mid \lambda)$
• $P(\mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} P(\mathbf{t}, \mathbf{w} \mid \lambda)$
  (Marginal probability: sum over all possible tag sequences)

Likelihood of the input

• How likely is it that the sentence "I love cat" occurs?
• $P(\mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} P(\mathbf{t}, \mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
• Assume we have 2 tags, N and V:
  $P(\text{"I love cat"} \mid \lambda) = P(\text{"I love cat"}, \text{"N N N"} \mid \lambda) + P(\text{"I love cat"}, \text{"N N V"} \mid \lambda)$
  $\quad + P(\text{"I love cat"}, \text{"N V N"} \mid \lambda) + P(\text{"I love cat"}, \text{"N V V"} \mid \lambda)$
  $\quad + P(\text{"I love cat"}, \text{"V N N"} \mid \lambda) + P(\text{"I love cat"}, \text{"V N V"} \mid \lambda)$
  $\quad + P(\text{"I love cat"}, \text{"V V N"} \mid \lambda) + P(\text{"I love cat"}, \text{"V V V"} \mid \lambda)$
• Now, let's write down $P(\text{"I love cat"} \mid \lambda)$ with 45 tags…
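To make the cost of this brute-force sum concrete, here is a small Python sketch (my own illustration; the probability values and the helper names `trans`/`emit` are made up, not from the lecture) that enumerates every tag sequence:

```python
from itertools import product

# Toy brute-force computation of P(w | lambda) by enumerating every tag
# sequence. All probabilities below are illustrative values, not the
# lecture's; the point is that the sum has |tags|^n terms.
tags = ["N", "V"]
words = ["I", "love", "cat"]

# Hypothetical HMM parameters: transition P(t_i | t_{i-1}) and emission P(w_i | t_i)
trans = {("<S>", "N"): 0.6, ("<S>", "V"): 0.4,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("I", "N"): 0.2, ("I", "V"): 0.01,
        ("love", "N"): 0.1, ("love", "V"): 0.3,
        ("cat", "N"): 0.3, ("cat", "V"): 0.02}

total = 0.0
for t in product(tags, repeat=len(words)):           # |tags|^n tag sequences
    p, prev = 1.0, "<S>"
    for w, tag in zip(words, t):
        p *= trans[(prev, tag)] * emit[(w, tag)]      # P(t_i | t_{i-1}) P(w_i | t_i)
        prev = tag
    total += p
print(total)  # P(w | lambda)
```

With $|T|$ tags and $n$ words this loop runs $|T|^n$ times, which is exactly the blow-up the forward algorithm avoids.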


Trellis diagram

• Goal: $P(\mathbf{w} \mid \lambda) = \sum_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$


[Trellis diagram over positions $i = 1, \dots, 4$; edges are labeled with transition probabilities such as $P(t_2 = 2 \mid t_1 = 1)$ and transition-emission products such as $P(t_3 = 1 \mid t_2 = 1)\, P(w_3 \mid t_3 = 1)$.]

$\lambda$ is the parameter set of the HMM. Let's ignore it in some slides for simplicity's sake.

Trellis diagram

• $P(\text{"I eat a fish"}, \text{"N V V A"})$


[Trellis with states N, V, A at positions $i = 1, \dots, 4$. The highlighted path for the tag sequence "N V V A" multiplies the labeled edges:
$P(N \mid \langle S \rangle)\, P(I \mid N) \cdot P(V \mid N)\, P(\mathit{eat} \mid V) \cdot P(V \mid V)\, P(a \mid V) \cdot P(A \mid V)\, P(\mathit{fish} \mid A)$.]

Trellis diagram

• $\sum_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$: sum over all paths


[Trellis with states N, V, A at positions $i = 1, \dots, 4$; every path through the trellis contributes one term to the sum.]

Dynamic programming

• Recursively decompose a problem into smaller sub-problems
• Similar to mathematical induction
  • Base step: initial values for $i = 1$
  • Inductive step: assume we know the values for $i = k$; let's compute $i = k + 1$


Forward algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\mathbf{t}_k$: tag sequence of length $k$; $\mathbf{w}_k = w_1, w_2, \dots, w_k$
  • $\sum_{\mathbf{t}_k} P(\mathbf{t}_k, \mathbf{w}_k) = \sum_{q} \sum_{\mathbf{t}_{k-1}} P(\mathbf{t}_{k-1}, \mathbf{w}_k, t_k = q)$

[Trellis with states N, V, A at positions $i = 1, \dots, 4$. The outer sum runs over the tag $q$ at position $i = k$; the inner sum over tag sequences ending in $q$ gives $P(\mathbf{w}_k, t_k = q)$.]


Forward algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\sum_{\mathbf{t}_k} P(\mathbf{t}_k, \mathbf{w}_k) = \sum_{q} P(\mathbf{w}_k, t_k = q)$
• $P(\mathbf{w}_k, t_k = q) = \sum_{q'} P(\mathbf{w}_k, t_{k-1} = q', t_k = q)$
  $= \sum_{q'} P(\mathbf{w}_{k-1}, t_{k-1} = q')\, P(t_k = q \mid t_{k-1} = q')\, P(w_k \mid t_k = q)$

[Trellis columns $i = k - 1$ and $i = k$, each with states N, V, A.]

Let's call $P(\mathbf{w}_k, t_k = q)$ above $\alpha_k(q)$; the factor $P(\mathbf{w}_{k-1}, t_{k-1} = q')$ is $\alpha_{k-1}(q')$.

Forward algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\alpha_k(q) = \sum_{q'} \alpha_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')\, P(w_k \mid t_k = q)$


Forward algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\alpha_k(q) = \sum_{q'} \alpha_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')\, P(w_k \mid t_k = q)$
    $= P(w_k \mid t_k = q) \sum_{q'} \alpha_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')$
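As an aside (not from the slides), this induction step is a single matrix-vector product. A minimal NumPy sketch, where the array layout `A[q', q] = P(t_k = q | t_{k-1} = q')` and `b[q, w] = P(w | t = q)` is my own convention:

```python
import numpy as np

def forward_step(alpha_prev, A, b, w_k):
    """One induction step of the forward algorithm (illustrative matrix form).

    alpha_prev[q'] = alpha_{k-1}(q')
    A[q', q]       = P(t_k = q | t_{k-1} = q')   -- transition matrix
    b[q, w]        = P(w | t = q)                -- emission matrix
    """
    # alpha_k(q) = P(w_k | q) * sum_{q'} alpha_{k-1}(q') P(q | q')
    return b[:, w_k] * (A.T @ alpha_prev)
```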


Forward algorithm

• Base step ($i = 1$):
  • $\alpha_1(q) = P(w_1 \mid t_1 = q)\, P(t_1 = q \mid t_0)$
  • $P(t_1 = q \mid t_0)$ is the initial probability of tag $q$

Implementation using an array

• Use an $n \times T$ table to keep $\alpha_k(q)$ ($n$ = sentence length, $T$ = number of tags)


From Julia Hockenmaier, Intro to NLP

Implementation using an array


Initial: Trellis[1][q] = $P(w_1 \mid t_1 = q)\, P(t_1 = q \mid t_0)$

Implementation using an array


Induction: $\alpha_k(q) = P(w_k \mid t_k = q) \sum_{q'} \alpha_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')$

The forward algorithm (Pseudo Code)


[Pseudo-code figure for the forward algorithm.]
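The pseudo-code figure did not survive the transcript, so here is a minimal Python sketch of the forward algorithm described above; the dictionary-based parameter format (`init`, `trans`, `emit`) is my own choice for illustration, not the lecture's.

```python
def forward(words, tags, init, trans, emit):
    """Compute P(w | lambda) for a bigram HMM tagger.

    init[q]      : P(t_1 = q | t_0)            -- initial probability
    trans[qp][q] : P(t_k = q | t_{k-1} = qp)   -- transition probability
    emit[q][w]   : P(w | t = q)                -- emission probability
    """
    n = len(words)
    # Base step: alpha_1(q) = P(w_1 | t_1 = q) * P(t_1 = q | t_0)
    alpha = [{q: emit[q].get(words[0], 0.0) * init[q] for q in tags}]
    # Inductive step: alpha_k(q) = P(w_k | q) * sum_{q'} alpha_{k-1}(q') P(q | q')
    for k in range(1, n):
        alpha.append({
            q: emit[q].get(words[k], 0.0)
               * sum(alpha[k - 1][qp] * trans[qp][q] for qp in tags)
            for q in tags
        })
    # Likelihood: sum over all tags in the last trellis column
    return sum(alpha[n - 1][q] for q in tags)
```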

Jason’s ice cream

• P("1, 2, 1")?


          p(…|C)   p(…|H)   p(…|START)
p(1|…)     0.5      0.1
p(2|…)     0.4      0.2
p(3|…)     0.1      0.7
p(C|…)     0.8      0.2       0.5
p(H|…)     0.2      0.8       0.5


[Trellis for the observation sequence 1, 2, 1 (# cones per day) with hidden states C and H; edges carry the transition and emission probabilities from the table above.]
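Working the forward recursion by hand with the table above (my own arithmetic, not shown on the slide):

$\alpha_1(C) = P(C \mid \text{START})\, P(1 \mid C) = 0.5 \times 0.5 = 0.25$; $\alpha_1(H) = 0.5 \times 0.1 = 0.05$
$\alpha_2(C) = P(2 \mid C)\,[\alpha_1(C)\, P(C \mid C) + \alpha_1(H)\, P(C \mid H)] = 0.4 \times (0.25 \times 0.8 + 0.05 \times 0.2) = 0.084$
$\alpha_2(H) = 0.2 \times (0.25 \times 0.2 + 0.05 \times 0.8) = 0.018$
$\alpha_3(C) = 0.5 \times (0.084 \times 0.8 + 0.018 \times 0.2) = 0.0354$
$\alpha_3(H) = 0.1 \times (0.084 \times 0.2 + 0.018 \times 0.8) = 0.00312$
$P(1,2,1) = \alpha_3(C) + \alpha_3(H) = 0.03852$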

Three basic problems for HMMs

• Likelihood of the input: Forward algorithm
  (How likely is it that the sentence "I love cat" occurs?)
• Decoding (tagging) the input: Viterbi algorithm
  (What are the POS tags of "I love cat"?)
• Estimation (learning): find the best model parameters
  (How do we learn the model?)
  • Case 1: supervised – tags are annotated → maximum likelihood estimation (MLE)
  • Case 2: unsupervised – only unannotated text → forward-backward algorithm

Prediction in generative model

• Inference: what is the most likely sequence of tags for the given sequence of words w?
• What are the latent states that most likely generate the sequence of words w?



Tagging the input

• Find the best tag sequence for "I love cat"
• Remember, we model $P(\mathbf{t}, \mathbf{w} \mid \lambda)$
• $\mathbf{t}^* = \arg\max_{\mathbf{t}} P(\mathbf{t}, \mathbf{w} \mid \lambda)$
  (Find the best one among all possible tag sequences)

Tagging the input

• Assume we have 2 tags, N and V. Which one is the best?
  $P(\text{"I love cat"}, \text{"N N N"} \mid \lambda)$, $P(\text{"I love cat"}, \text{"N N V"} \mid \lambda)$,
  $P(\text{"I love cat"}, \text{"N V N"} \mid \lambda)$, $P(\text{"I love cat"}, \text{"N V V"} \mid \lambda)$,
  $P(\text{"I love cat"}, \text{"V N N"} \mid \lambda)$, $P(\text{"I love cat"}, \text{"V N V"} \mid \lambda)$,
  $P(\text{"I love cat"}, \text{"V V N"} \mid \lambda)$, $P(\text{"I love cat"}, \text{"V V V"} \mid \lambda)$
• Again, we need an efficient algorithm!


Trellis diagram

• Goal: $\arg\max_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$



Trellis diagram

• Goal: $\arg\max_{\mathbf{t}} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$
• Find the best path!



Dynamic programming again!

• Recursively decompose a problem into smaller sub-problems
• Similar to mathematical induction
  • Base step: initial values for $i = 1$
  • Inductive step: assume we know the values for $i = k$; let's compute $i = k + 1$


Viterbi algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\mathbf{t}_k$: tag sequence of length $k$; $\mathbf{w}_k = w_1, w_2, \dots, w_k$
  • $\max_{\mathbf{t}_k} P(\mathbf{t}_k, \mathbf{w}_k) = \max_{q} \max_{\mathbf{t}_{k-1}} P(\mathbf{t}_{k-1}, t_k = q, \mathbf{w}_k)$


Viterbi algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\max_{\mathbf{t}_{k-1}} P(\mathbf{t}_{k-1}, t_k = q, \mathbf{w}_k)$
    $= \max_{q'} \max_{\mathbf{t}_{k-2}} P(\mathbf{t}_{k-2}, t_k = q, t_{k-1} = q', \mathbf{w}_k)$
    $= \max_{q'} \max_{\mathbf{t}_{k-2}} P(\mathbf{t}_{k-2}, t_{k-1} = q', \mathbf{w}_{k-1})\, P(t_k = q \mid t_{k-1} = q')\, P(w_k \mid t_k = q)$

[Trellis columns $i = k - 1$ and $i = k$, each with states N, V, A.]

Let's call $\max_{\mathbf{t}_{k-1}} P(\mathbf{t}_{k-1}, t_k = q, \mathbf{w}_k)$ above $\delta_k(q)$; the factor $\max_{\mathbf{t}_{k-2}} P(\mathbf{t}_{k-2}, t_{k-1} = q', \mathbf{w}_{k-1})$ is $\delta_{k-1}(q')$.

Viterbi algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\delta_k(q) = \max_{q'} \delta_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')\, P(w_k \mid t_k = q)$


Viterbi algorithm

• Inductive step: from $i = k$ to $i = k + 1$
  • $\delta_k(q) = \max_{q'} \delta_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')\, P(w_k \mid t_k = q)$
    $= P(w_k \mid t_k = q) \max_{q'} \delta_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')$
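Mirroring the earlier forward-step sketch (again my own illustration, with the same assumed array layout), the Viterbi induction step only swaps the sum for a max and records the argmax as a backpointer:

```python
import numpy as np

def viterbi_step(delta_prev, A, b, w_k):
    """One Viterbi induction step (illustrative matrix form).

    delta_prev[q'] = delta_{k-1}(q')
    A[q', q]       = P(t_k = q | t_{k-1} = q')
    b[q, w]        = P(w | t = q)
    """
    # scores[q', q] = delta_{k-1}(q') * P(q | q')
    scores = delta_prev[:, None] * A
    best_prev = scores.argmax(axis=0)           # backpointer for each tag q
    delta_k = b[:, w_k] * scores.max(axis=0)    # delta_k(q)
    return delta_k, best_prev
```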


Viterbi algorithm

• Base step ($i = 1$):
  • $\delta_1(q) = P(w_1 \mid t_1 = q)\, P(t_1 = q \mid t_0)$
  • $P(t_1 = q \mid t_0)$ is the initial probability of tag $q$

Implementation using an array


Initial: Trellis[1][q] = $P(w_1 \mid t_1 = q)\, P(t_1 = q \mid t_0)$

Implementation using an array


Induction: $\delta_k(q) = P(w_k \mid t_k = q) \max_{q'} \delta_{k-1}(q')\, P(t_k = q \mid t_{k-1} = q')$

Retrieving the best sequence

• Keep one backpointer per trellis cell so the best tag sequence can be recovered by walking backwards (see the sketch after the pseudo-code slide below)


The Viterbi algorithm (Pseudo Code)


[Pseudo-code figure for the Viterbi algorithm: the same as the forward algorithm, but with max instead of sum.]
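As with the forward algorithm, the pseudo-code figure is missing here, so this is a minimal Python sketch (same illustrative `init`/`trans`/`emit` format as before) of Viterbi with backpointers:

```python
def viterbi(words, tags, init, trans, emit):
    """Return the most likely tag sequence and its joint probability."""
    n = len(words)
    # Base step: delta_1(q) = P(w_1 | t_1 = q) * P(t_1 = q | t_0)
    delta = [{q: emit[q].get(words[0], 0.0) * init[q] for q in tags}]
    back = [{}]  # back[k][q] = best previous tag for tag q at position k
    # Inductive step: max instead of sum, remembering the argmax
    for k in range(1, n):
        delta.append({})
        back.append({})
        for q in tags:
            best_qp = max(tags, key=lambda qp: delta[k - 1][qp] * trans[qp][q])
            delta[k][q] = (emit[q].get(words[k], 0.0)
                           * delta[k - 1][best_qp] * trans[best_qp][q])
            back[k][q] = best_qp
    # Follow backpointers from the best final tag
    last = max(tags, key=lambda q: delta[n - 1][q])
    path = [last]
    for k in range(n - 1, 0, -1):
        path.append(back[k][path[-1]])
    path.reverse()
    return path, delta[n - 1][last]
```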

Jason’s ice cream

• Best tag sequence for "1, 2, 1"?


          p(…|C)   p(…|H)   p(…|START)
p(1|…)     0.5      0.1
p(2|…)     0.4      0.2
p(3|…)     0.1      0.7
p(C|…)     0.8      0.2       0.5
p(H|…)     0.2      0.8       0.5
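Running the $\delta$ recursion by hand with this table (my own arithmetic, not shown on the slide):

$\delta_1(C) = 0.5 \times 0.5 = 0.25$; $\delta_1(H) = 0.5 \times 0.1 = 0.05$
$\delta_2(C) = 0.4 \times \max(0.25 \times 0.8,\ 0.05 \times 0.2) = 0.08$ (best previous state: C)
$\delta_2(H) = 0.2 \times \max(0.25 \times 0.2,\ 0.05 \times 0.8) = 0.01$ (best previous state: C)
$\delta_3(C) = 0.5 \times \max(0.08 \times 0.8,\ 0.01 \times 0.2) = 0.032$ (best previous state: C)
$\delta_3(H) = 0.1 \times \max(0.08 \times 0.2,\ 0.01 \times 0.8) = 0.0016$ (best previous state: C)

Following the backpointers from the best final state gives the sequence C C C with probability 0.032.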


Trick: computing everything in log space

• Homework: write the forward and Viterbi algorithms in log space

• Hint: you need a function that computes $\log(a + b)$ from $\log a$ and $\log b$
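One standard way to write that helper (a generic log-sum-exp trick, not course-specific code):

```python
import math

def log_add(log_a, log_b):
    """Compute log(a + b) given log a and log b, without leaving log space.

    Uses log(a + b) = log a + log(1 + exp(log b - log a)), with the larger
    argument factored out for numerical stability.
    """
    if log_a == float("-inf"):
        return log_b
    if log_b == float("-inf"):
        return log_a
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    return hi + math.log1p(math.exp(lo - hi))
```

In log space the forward recursion replaces each product with an addition and each sum with `log_add`; Viterbi only needs max, which works the same in log space.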


Three basic problems for HMMs

• Likelihood of the input: Forward algorithm
  (How likely is it that the sentence "I love cat" occurs?)
• Decoding (tagging) the input: Viterbi algorithm
  (What are the POS tags of "I love cat"?)
• Estimation (learning): find the best model parameters
  (How do we learn the model?)
  • Case 1: supervised – tags are annotated → maximum likelihood estimation (MLE)
  • Case 2: unsupervised – only unannotated text → forward-backward algorithm