
CS 6840: Natural Language Processing

Razvan C. Bunescu

School of Electrical Engineering and Computer Science

[email protected]

Sequence Tagging with HMMs: Part of Speech Tagging


Part of Speech (POS) Tagging

• Annotate each word in a sentence with its POS:
  – noun, verb, adjective, adverb, pronoun, preposition, interjection, …

              NN
              RB
     VBN      JJ            VB
PRP  VBD  TO  VB    DT      NN
She promised  to   back the bill


Parts of Speech

• Lexical categories that are defined based on:
  – Syntactic function:
    • nouns can occur with determiners: a goat.
    • nouns can take possessives: IBM's annual revenue.
    • most nouns can occur in the plural: goats.
  – Morphological function:
    • many verbs can be composed with the prefix "un".

• There are tendencies toward semantic coherence:
  – nouns often refer to "people, places, or things".
  – adjectives often refer to properties.


POS: Closed Class vs. Open Class

• Closed Class:
  – relatively fixed membership.
  – usually function words:
    • short, common words which have a structuring role in grammar.
  – Prepositions: of, in, by, on, under, over, …
  – Auxiliaries: may, can, will, had, been, should, …
  – Pronouns: I, you, she, mine, his, them, …
  – Determiners: a, an, the, which, that, …
  – Conjunctions: and, but, or (coord.), as, if, when (subord.), …
  – Particles: up, down, on, off, …
  – Numerals: one, two, three, third, …


POS: Open Class vs. Closed Class

• Open Class:
  – new members are continually added:
    • to fax, to google, futon, …
  – English has 4: Nouns, Verbs, Adjectives, Adverbs.
    • Many languages have these 4, but not all (e.g. Korean).
  – Nouns: people, places, or things.
  – Verbs: actions and processes.
  – Adjectives: properties or qualities.
  – Adverbs: a hodge-podge:
    • Unfortunately, John walked home extremely slowly yesterday.
    • directional, locative, temporal, degree, manner, …


POS: Open vs. Closed Classes

• Open Class: new members are continually added.

1. Annie: Do you love me?
   Alvy: Love is too weak a word for what I feel... I lurve you. Y'know, I loove you, I, I luff you. There are two f's. I have to invent... Of course I love you. (Annie Hall)

2. 'Twas brillig, and the slithy toves
   Did gyre and gimble in the wabe;
   All mimsy were the borogoves,
   And the mome raths outgrabe.

   "Beware the Jabberwock, my son!
   The jaws that bite, the claws that catch!
   Beware the Jubjub bird, and shun
   The frumious Bandersnatch!"

   (Jabberwocky, Lewis Carroll)


Parts of Speech: Granularity

• Grammatical sketch of Greek [Dionysius Thrax, c. 100 B.C.]:
  – 8 tags: noun, verb, pronoun, preposition, adjective, conjunction, participle, and article.

• Brown corpus [Francis, 1979]:– 87 tags.

• Penn Treebank [Marcus et al., 1993]:– 45 tags.

• British National Corpus (BNC) [Garside et al., 1997]:– C5 tagset: 61 tags.– C7 tagset: 146 tags.

We will focus on the Penn Treebank POS tags.

Penn Treebank POS Tagset

[table: the Penn Treebank POS tagset]

Penn Treebank POS tags

• Selected from the original 87 tags of the Brown corpus:
  ⇒ lost finer distinctions between lexical categories.

1) Prepositions and subordinating conjunctions:
  – after/CS spending/VBG a/AT day/NN at/IN the/AT palace/NN
  – after/IN a/AT wedding/NN trip/NN to/IN Hawaii/NNP ./.

2) Infinitive to and prepositional to:
  – to/TO give/VB priority/NN to/IN teachers/NNS

3) Adverbial nouns:
  – Brown: Monday/NR, home/NR, west/NR, tomorrow/NR
  – PTB: Monday/NNP, (home, tomorrow, west)/(NN, RB)


POS Tagging ≡ POS Disambiguation

• Words often have more than one POS tag, e.g. back:
  – the back/JJ door
  – on my back/NN
  – win the voters back/RB
  – promised to back/VB the bill

• Brown corpus statistics [DeRose, 1988]:
  – 11.5% of English word types are ambiguous.
  – 40% of all word occurrences are ambiguous.

• most are easy to disambiguate:
  – the tags are not equally likely, i.e. low tag entropy: e.g. table.


POS Tag Ambiguity

[table: POS tag ambiguity statistics]

POS Tagging ≡ POS Disambiguation

• Some distinctions are difficult even for humans:
  – Mrs./NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG
  – All/DT we/PRP gotta/VBN do/VB is/VBZ go/VB around/IN the/DT corner/NN
  – Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD

• Use heuristics [Santorini, 1990]:
  – She told off/RP her friends.      She told her friends off/RP.
  – She stepped off/IN the train.     *She stepped the train off/IN.


How Difficult is POS Tagging?

• Most current tagging algorithms: ~96%–97% accuracy on the Penn Treebank tagset.
  – Current state of the art: 97.55% tagging accuracy. How good is this?
    • Bidirectional LSTM-CRF Models for Sequence Tagging [Huang, Xu, Yu, 2015].
  – Human Ceiling: how well do humans do?
    • human annotators: about 96%–97% [Marcus et al., 1993].
    • when allowed to discuss tags, consensus is 100% [Voutilainen, 1995].
  – Most Frequent Class Baseline:
    • 90%–91% on the 87-tag Brown tagset [Charniak et al., 1993].
    • 93.69% on the 45-tag Penn Treebank, with an unknown word model [Toutanova et al., 2003].


POS Tagging Methods

• Rule Based:
  – Rules are designed by human experts based on linguistic knowledge.

• Machine Learning:
  – Trained on data that has been manually labeled by humans.
  – Rule learning:
    • Transformation Based Learning (TBL).
  – Sequence tagging:
    • Hidden Markov Models (HMMs).
    • Maximum Entropy (Logistic Regression).
    • Sequential Conditional Random Fields (CRFs).
    • Recurrent Neural Networks (RNNs):
      – bidirectional, with a CRF layer (BI-LSTM-CRF).


POS Tagging: Rule Based

1) Start with a dictionary.

2) Assign all possible tags to words from the dictionary.

3) Write rules by hand to selectively remove tags, leaving the correct tag for each word.


POS Tagging: Rule Based

1) Start with a dictionary:

she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB

… for the ~100,000 words of English.


POS Tagging: Rule Based

2) Assign every possible tag:

              NN
              RB
     VBN      JJ            VB
PRP  VBD  TO  VB    DT      NN
She promised  to   back the bill


POS Tagging: Rule Based

3) Write rules to eliminate incorrect tags:
  – Eliminate VBN if VBD is an option when VBN|VBD follows "<S> PRP".

              NN
              RB
     VBN      JJ            VB
PRP  VBD  TO  VB    DT      NN
She promised  to   back the bill

(the rule above eliminates VBN for promised; a small code sketch of steps 1–3 follows below)


POS Tagging as Sequence Labeling

• Sequence Labeling:
  – Tokenization and Sentence Segmentation.
  – Part of Speech Tagging.
  – Information Extraction:
    • Named Entity Recognition.
  – Shallow Parsing.
  – Semantic Role Labeling.
  – DNA Analysis.
  – Music Segmentation.

• Solved using ML models for classification:
  – Token-level vs. Sequence-level.


Sequence Labeling

• Sentence Segmentation:

  Mr. Burns is Homer Simpson's boss. He is very rich.

• Tokenization:

  Mr. Burns is Homer Simpson's boss. He is very rich.

  (each candidate boundary position is labeled I = inside vs. O = boundary)


Sequence Labeling

• Information Extraction:
  – Named Entity Recognition:

  Drug/O giant/O Pfizer/I Inc./I has/O reached/O an/O agreement/O to/O buy/O the/O
  private/O biotechnology/O firm/O Rinat/I Neuroscience/I Corp./I


Sequence Labeling

• Information Extraction:
  – Text segmentation into topical sections.

  Vine covered cottage , near Contra Costa Hills . 2 bedroom house ,
  modern kitchen and dishwasher . No pets allowed . $ 1050 / month

  [Haghighi & Klein, NAACL '06]


Sequence Labeling

• Information Extraction:
  – Segmenting classifieds into topical sections:
    • Features, Neighborhood, Size, Restrictions, Rent.

  Vine covered cottage , near Contra Costa Hills . 2 bedroom house ,
  modern kitchen and dishwasher . No pets allowed . $ 1050 / month

  [Haghighi & Klein, NAACL '06]


Sequence Labeling

• Semantic Role Labeling:
  – For each clause, determine the semantic role played by each noun phrase that is an argument to the verb:

  John drove Mary from Athens to Columbus in his Toyota Prius.
  The hammer broke the window.

  • agent
  • patient
  • source
  • destination
  • instrument


Sequence Labeling

• DNA Analysis:
  – transcription factor binding sites.
  – promoters.
  – introns, exons, …

  AATGCGCTAACGTTCGATACGAGATAGCCTAAGAGTCA


Sequence Labeling

• Music Analysis:
  – segmentation into "musical phrases".

[Romeo & Juliet, Nino Rota]


Sequence Labeling as Classification

1) Classify each token individually into one of a number of classes:
  – Token represented as a vector of features extracted from context.
  – To build the classification model, use general ML algorithms:
    • Maximum Entropy (i.e. Logistic Regression)
    • Support Vector Machines (SVMs)
    • Perceptrons
    • Winnow
    • Naïve Bayes, Bayesian Networks
    • Decision Trees
    • k-Nearest Neighbor, …


A Maximum Entropy Model for POS Tagging   [Ratnaparkhi, EMNLP'96]

• Represent each position i in the text as $\varphi(t, h_i) = \{\varphi_k(t, h_i)\}$:
  – t is the potential POS tag at position i.
  – $h_i$ is the history/context of position i:

    $h_i = \{w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}\}$

  – $\varphi(t, h_i)$ is a vector of features $\varphi_k(t, h_i)$, for k = 1..K, e.g.:

    $\varphi_k(t, h_i) = \begin{cases} 1 & \text{if } \mathrm{suffix}(w_i) = \text{"ing" and } t = \mathrm{VBG} \\ 0 & \text{otherwise} \end{cases}$

• Represent the "unnormalized" score of a tag t as:

  $\mathrm{score}(t, h_i) = \mathbf{w}^T \varphi(t, h_i) = \sum_{k=1}^{K} w_k \varphi_k(t, h_i)$

  (we want $w_k$ to be large for features like the "ing" → VBG one above)
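As an illustration, a sketch of how such binary features and the unnormalized score might look in Python; the feature templates here are toy stand-ins for Ratnaparkhi's, and the history h is a plain dict with the current word and previous tag.

    def features(t, h):
        # phi(t, h): names of the binary features active for tag t in history h.
        feats = []
        if h["w"].endswith("ing") and t == "VBG":
            feats.append("suffix=ing&t=VBG")
        feats.append("w=%s&t=%s" % (h["w"], t))
        feats.append("t-1=%s&t=%s" % (h["t1"], t))
        return feats

    def score(w, t, h):
        # score(t, h) = w^T phi(t, h); phi is sparse and binary, so the dot
        # product reduces to summing the weights of the active features.
        return sum(w.get(f, 0.0) for f in features(t, h))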


A Maximum Entropy Model for POS Tagging   [Ratnaparkhi, EMNLP'96]


A Maximum Entropy Model for POS Tagging   [Ratnaparkhi, EMNLP'96]

the non-zero features for position 3

feature templates

A Maximum Entropy Model for POS Tagging   [Ratnaparkhi, EMNLP'96]

the non-zero features for position 4

A Maximum Entropy Model for POS Tagging

• How do we learn the weights w?
  – Train on manually annotated data (supervised learning).

• What does it mean to "train w on an annotated corpus"?
  – Probabilistic Discriminative Models:
    • Maximum Entropy (Logistic Regression) [Ratnaparkhi, EMNLP'96].
  – Distribution Free Methods:
    • (Averaged) Perceptrons [Collins, ACL 2002].
    • Support Vector Machines (SVMs).


A Maximum Entropy Model for POS Tagging

• Probabilistic Discriminative Model:
  ⇒ need to transform score(t, h_i) into a probability p(t | h_i):

  $p(t \mid h_i) = \dfrac{\exp(\mathbf{w}^T \varphi(t, h_i))}{\sum_{t'} \exp(\mathbf{w}^T \varphi(t', h_i))}$

• Training using:
  – Maximum Likelihood (ML).
  – Maximum A Posteriori (MAP) with a Gaussian prior on w.

• Inference (i.e. Testing):

  $\hat{t}_i = \operatorname*{argmax}_{t \in T} p(t \mid h_i) = \operatorname*{argmax}_{t \in T} \exp(\mathbf{w}^T \varphi(t, h_i)) = \operatorname*{argmax}_{t \in T} \mathbf{w}^T \varphi(t, h_i)$

[Ratnaparkhi, EMNLP'96]
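A sketch of the normalization, reusing score from the previous snippet; subtracting the max score before exponentiating is a standard numerical-stability trick, and since the denominator does not depend on t, the argmax over tags can skip it entirely (the last equality on the slide). The toy tagset is illustrative.

    import math

    TAGS = ["NN", "VB", "VBG", "DT", "PRP"]   # a toy tagset for illustration

    def tag_probs(w, h):
        scores = {t: score(w, t, h) for t in TAGS}
        m = max(scores.values())                      # stability: shift by max
        exps = {t: math.exp(s - m) for t, s in scores.items()}
        z = sum(exps.values())                        # the partition function
        return {t: e / z for t, e in exps.items()}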


A Maximum Entropy Model for POS Tagging   [Ratnaparkhi, EMNLP'96]

• Inference: do a forward (left-to-right) traversal of the input sequence, applying the classifier at each position with the previously predicted tags as context (a greedy decoding sketch follows below):

  John saw the saw and decided to take it to the table.
  NNP  VBD DT  NN  CC  VBD    TO VB  PRP IN DT  NN

[Animation by Ray Mooney, UT Austin]
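A minimal sketch of this greedy left-to-right decoding, assuming the tag_probs function from the earlier snippet; each prediction is fed back in as the t-1 context of the next position.

    def greedy_tag(w, sentence):
        tags = []
        for word in sentence:
            h = {"w": word, "t1": tags[-1] if tags else "<S>"}
            probs = tag_probs(w, h)
            tags.append(max(probs, key=probs.get))   # commit to the best tag
        return tags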

A Maximum Entropy Model for POS Tagging   [Ratnaparkhi, EMNLP'96]

• Inference needs a forward traversal of the input sequence:

  John saw the saw and decided to take it to the table.

• Some POS tags would be easier to disambiguate backward; what can we do?
  – Use a backward traversal, with backward features … but then we lose the forward information.

[Animation by Ray Mooney, UT Austin]


Sequence Labeling as Classification

1) Classify each token individually into one of a number of classes.

2) Classify all tokens jointly into one of a number of classes:
  – Hidden Markov Models.
  – Conditional Random Fields.
  – Structural SVMs.
  – Discriminatively Trained HMMs [Collins, EMNLP'02].
  – Bi-directional RNNs / LSTM-CRFs.

  $\hat{t}_1 \ldots \hat{t}_n = \operatorname*{argmax}_{t_1, \ldots, t_n} \lambda^T \varphi(t_1, \ldots, t_n, w_1, \ldots, w_n)$


Hidden Markov Models

• Probabilistic Generative Models:

  $\hat{t}_1 \ldots \hat{t}_n = \operatorname*{argmax}_{t_1, \ldots, t_n} p(t_1, \ldots, t_n \mid w_1, \ldots, w_n)$

  $\phantom{\hat{t}_1 \ldots \hat{t}_n} = \operatorname*{argmax}_{t_1, \ldots, t_n} p(w_1, \ldots, w_n \mid t_1, \ldots, t_n) \; p(t_1, \ldots, t_n)$

  (the first factor uses the state emission probs, the second the state transition probs)


Hidden Markov Models: Assumptions

1) A word event depends only on its POS tag:

  $p(w_1, \ldots, w_n \mid t_1, \ldots, t_n) = \prod_{i=1}^{n} p(w_i \mid t_i)$

2) A tag event depends only on the previous tag:

  $p(t_1, \ldots, t_n) = \prod_{i=1}^{n} p(t_i \mid t_{i-1})$

⇒ POS tagging is:

  $\hat{t}_1 \ldots \hat{t}_n = \operatorname*{argmax}_{t_1, \ldots, t_n} \prod_{i=1}^{n} p(w_i \mid t_i) \; p(t_i \mid t_{i-1})$
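Under these two assumptions, scoring a candidate (word, tag) sequence is just a product of lookups; a sketch, where trans and emis are assumed to be dictionaries of estimated probabilities and "<S>" marks the sequence start:

    def joint_prob(words, tags, trans, emis, start="<S>"):
        # p(w_1..w_n, t_1..t_n) = prod_i p(w_i | t_i) * p(t_i | t_{i-1})
        p, prev = 1.0, start
        for w, t in zip(words, tags):
            p *= trans.get((prev, t), 0.0) * emis.get((t, w), 0.0)
            prev = t
        return p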


Interlude

Tales of HMMs


Structured Data

• For many applications, the i.i.d. assumption does not hold:
  – pixels in images of real objects.
  – hyperlinked web pages.
  – cross-citations in scientific papers.
  – entities in social networks.
  – sequences of words/letters in text.
  – successive time frames in speech.
  – sequences of base pairs in DNA.
  – musical notes in a tonal melody.
  – daily values of a particular stock.

  The last five are examples of sequential data.


Probabilistic Graphical Models

• PGMs use a graph for compactly:
  1. Encoding a complex distribution over a multi-dimensional space.
  2. Representing a set of independencies that hold in the distribution.
  – Properties 1 and 2 are, in a "deep sense", equivalent.

• Probabilistic Graphical Models:
  – Directed:
    • i.e. Bayesian Networks, a.k.a. Belief Networks.
  – Undirected:
    • i.e. Markov Random Fields.


Probabilistic Graphical Models

• Directed PGMs:
  – Bayesian Networks:
    • Dynamic Bayesian Networks:
      – State Observation Models:
        » Hidden Markov Models.
        » Linear Dynamical Systems (Kalman filters).

• Undirected PGMs:
  – Markov Random Fields (MRF):
    • Conditional Random Fields (CRF):
      – Sequential CRFs.


Bayesian Networks

• A Bayesian Network structure G is a directed acyclic graph whose nodes X1, X2, ..., Xn represent random variables and whose edges correspond to "direct influences" between nodes:
  – Let Pa(Xi) denote the parents of Xi in G;
  – Let NonDescend(Xi) denote the variables in the graph that are not descendants of Xi.
  – Then G encodes the following set of conditional independence assumptions, called the local independencies:

  For each Xi in G:  Xi ⊥ NonDescend(Xi) | Pa(Xi)


Bayesian Networks

1. Because Xi ⊥ NonDescend(Xi) | Pa(Xi), it follows that:

  $P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$

2. More generally, d-separation:
  – Two sets of nodes X and Y are conditionally independent given a set of nodes E (X ⊥ Y | E) if X and Y are d-separated by E.

Sequential Data

Q: How can we model sequential data?

1) Ignore the sequential aspects and treat the observations as i.i.d.:

   x_1   …   x_{t-1}   x_t   x_{t+1}   …   x_T   (no edges)

2) Relax the i.i.d. assumption by using a Markov model:

   x_1 → … → x_{t-1} → x_t → x_{t+1} → … → x_T

Markov Models

• X = x_1, …, x_T is a sequence of random variables.
• S = {s_1, …, s_N} is a state space, i.e. x_t takes values from S.

1) Limited Horizon:

   $P(x_{t+1} = s_k \mid x_1, \ldots, x_t) = P(x_{t+1} = s_k \mid x_t)$

2) Stationarity:

   $P(x_{t+1} = s_k \mid x_t) = P(x_2 = s_k \mid x_1)$

⇒ X is said to be a Markov chain.

Markov Models: Parameters

• S = {s_1, …, s_N} are the visible states.

• Π = {π_i} are the initial state probabilities:

   $\pi_i = P(x_1 = s_i)$

• A = {a_ij} are the state transition probabilities:

   $a_{ij} = P(x_{t+1} = s_j \mid x_t = s_i)$

[chain diagram: x_1 → … → x_T, with Π at x_1 and A on each edge]

Markov Models as DBNs

• A Markov Model is a Dynamic Bayesian Network:
  1. B_0 = Π is the initial distribution over states.
  2. B_→ = A is the 2-time-slice Bayesian Network (2-TBN).
  – The unrolled DBN (Markov model) over T time steps:

  x_1 → … → x_{t-1} → x_t → x_{t+1} → … → x_T   (Π at x_1, A on each edge)

Markov Models: Inference

$p(X) = p(x_1, \ldots, x_T) = p(x_1) \prod_{t=1}^{T-1} P(x_{t+1} \mid x_t) = \pi_{x_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}}$

[chain diagram: x_1 → … → x_T, with Π at x_1 and A on each edge]

• Exercise: compute p(t, a, p)  (a sketch follows below)
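For the exercise, the chain probability unrolls to π_t · a_{t,a} · a_{a,p}; a one-function sketch, where pi and a are assumed to be dictionaries holding the model parameters:

    def chain_prob(states, pi, a):
        # p(X) = pi[x_1] * prod_t a[x_t, x_{t+1}]
        p = pi[states[0]]
        for prev, cur in zip(states, states[1:]):
            p *= a[(prev, cur)]
        return p

    # e.g. chain_prob(["t", "a", "p"], pi, a) computes p(t, a, p).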


mth Order Markov Models

• First order Markov model:

  $p(X) = p(x_1) \prod_{t=1}^{T-1} P(x_{t+1} \mid x_t)$

• Second order Markov model:

  $p(X) = p(x_1) \, p(x_2 \mid x_1) \prod_{t=2}^{T-1} P(x_{t+1} \mid x_t, x_{t-1})$

• mth order Markov model:

  $p(X) = p(x_1) \, p(x_2 \mid x_1) \cdots p(x_m \mid x_1, \ldots, x_{m-1}) \prod_{t=m}^{T-1} P(x_{t+1} \mid x_{t-m+1}, \ldots, x_t)$

Markov Models

• (Visible) Markov Models:
  – Developed by Andrei A. Markov [Markov, 1913]:
    • modeling the letter sequences in Pushkin's "Eugene Onyegin".

• Hidden Markov Models:
  – The states are hidden (latent) variables.
  – The states probabilistically generate surface events, or observations.
  – Efficient training using Expectation Maximization (EM):
    • Maximum Likelihood (ML) when tagged data is available.
  – Efficient inference using the Viterbi algorithm.


Hidden Markov Models (HMMs)

• Probabilistic directed graphical models:
  – Hidden states (shown in brown in the slide figure).
  – Visible observations (shown in lavender).
  – Arrows model probabilistic (in)dependencies.

[HMM diagram: hidden chain x_1 → … → x_T, each x_t emitting observation o_t]

HMMs: Parameters

• S = {s_1, …, s_N} is the set of states.
• K = {k_1, …, k_M} = {1, …, M} is the observations alphabet.

• X = x_1, …, x_T is a sequence of states.
• O = o_1, …, o_T is a sequence of observations.

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

HMMs: Parameters

• Π = {π_i}, i ∈ S, are the initial state probabilities.

• A = {a_ij}, i, j ∈ S, are the state transition probabilities.

• B = {b_ik}, i ∈ S, k ∈ K, are the symbol emission probabilities:

  $b_{ik} = P(o_t = k \mid x_t = s_i)$

[HMM diagram: Π at x_1, A on the state transitions, B on the emissions]

Hidden Markov Models as DBNs

• A Hidden Markov Model is a Dynamic Bayesian Network:
  1. B_0 = Π is the initial distribution over states.
  2. B_→ = A is the 2-time-slice Bayesian Network (2-TBN).
  – The unrolled DBN over T time steps (previous slide).

[2-TBN diagram: x_1 with Π; x_t → x_{t+1} with A; x_{t+1} → o_{t+1} with B]

HMMs: Inference and Training

• Three fundamental questions:
  1) Given a model µ = (A, B, Π), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward-Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi):

     $\hat{X} = \operatorname*{argmax}_X P(X \mid O, \mu)$

  3) Given an observation sequence O, find the model µ = (A, B, Π) that best explains the observed data (EM).
     • Given observation and state sequences O, X, find µ (ML).

HMMs: Decoding

1) Given a model µ = (A, B, Π), compute the probability of a given observation sequence O = o_1, …, o_T, i.e. p(O|µ).

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

HMMs: Decoding

$P(O \mid X, \mu) = b_{x_1 o_1} b_{x_2 o_2} \cdots b_{x_T o_T}$

$P(X \mid \mu) = \pi_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$

$P(O, X \mid \mu) = P(O \mid X, \mu) \, P(X \mid \mu)$

$p(O \mid \mu) = \sum_X P(O \mid X, \mu) \, P(X \mid \mu)$

HMMs: Decoding

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

$p(O \mid \mu) = \sum_{x_1 \ldots x_T} \pi_{x_1} b_{x_1 o_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}$

Time complexity?

HMMs: Forward Procedure

• Define:  $\alpha_i(t) = P(o_1 \ldots o_t, x_t = i \mid \mu)$

• Then the solution is:  $p(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)$

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

HMMs: Decoding

• Compute α_j(t+1) recursively from the α_i(t):

$\alpha_j(t+1) = P(o_1 \ldots o_{t+1}, x_{t+1} = j)$
$\phantom{\alpha_j(t+1)} = P(o_1 \ldots o_t, x_{t+1} = j) \, P(o_{t+1} \mid x_{t+1} = j)$
$\phantom{\alpha_j(t+1)} = \sum_{i=1}^{N} P(o_1 \ldots o_t, x_t = i, x_{t+1} = j) \, P(o_{t+1} \mid x_{t+1} = j)$
$\phantom{\alpha_j(t+1)} = \sum_{i=1}^{N} P(o_1 \ldots o_t, x_t = i) \, P(x_{t+1} = j \mid x_t = i) \, P(o_{t+1} \mid x_{t+1} = j)$
$\phantom{\alpha_j(t+1)} = \sum_{i=1}^{N} \alpha_i(t) \, a_{ij} \, b_{j o_{t+1}}$

The Forward Procedure

1. Initialization:

   $\alpha_i(1) = \pi_i b_{i o_1}, \quad 1 \le i \le N$

2. Recursion:

   $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t) \, a_{ij} \, b_{j o_{t+1}}, \quad 1 \le j \le N, \; 1 \le t < T$

3. Termination:

   $p(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)$

   (a vectorized sketch follows below)
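A vectorized sketch of the procedure, assuming pi is an (N,) vector, A an (N, N) matrix with A[i, j] = a_ij, B an (N, M) matrix with B[i, k] = b_ik, and obs a list of observation indices; each recursion step is a matrix-vector product, so the total cost is O(N²T) rather than the naive O(Nᵀ·T) of the brute-force sum.

    import numpy as np

    def forward(obs, pi, A, B):
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                  # initialization
        for t in range(T - 1):
            # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{j, o_{t+1}}
            alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
        return alpha, alpha[-1].sum()                 # termination: p(O | mu)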


The Forward Procedure: Trellis Computation

[trellis diagram: each state s_1 … s_N holds α_i(t); the arcs into s_j carry weights $a_{ij} b_{j o_{t+1}}$ and are summed (Σ) to produce $\alpha_j(t+1)$]

HMMs: Backward Procedure

• Define:  $\beta_i(t) = P(o_{t+1} \ldots o_T \mid x_t = i, \mu)$

• Then the solution is:  $p(O \mid \mu) = \sum_{i=1}^{N} \pi_i \, b_{i o_1} \, \beta_i(1)$

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

The Backward Procedure

1. Initialization:

   $\beta_i(T) = 1, \quad 1 \le i \le N$

2. Recursion:

   $\beta_i(t) = \sum_{j=1}^{N} a_{ij} \, b_{j o_{t+1}} \, \beta_j(t+1), \quad 1 \le i \le N, \; 1 \le t < T$

3. Termination:

   $p(O \mid \mu) = \sum_{i=1}^{N} \pi_i \, b_{i o_1} \, \beta_i(1)$

   (a sketch follows below)


HMMs: Decoding

• Forward Procedure:  $p(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T)$

• Backward Procedure:  $p(O \mid \mu) = \sum_{i=1}^{N} \pi_i \, b_{i o_1} \, \beta_i(1)$

• Combination:  $p(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t) \, \beta_i(t)$, for any t

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

HMMs: Inference and Training

• Three fundamental questions:
  1) Given a model µ = (A, B, Π), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward-Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi):

     $\hat{X} = \operatorname*{argmax}_X P(X \mid O, \mu)$

  3) Given an observation sequence O, find the model µ = (A, B, Π) that best explains the observed data (EM).
     • Given observation and state sequences O, X, find µ (ML).

Best State Sequence with Viterbi Algorithm

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

$\hat{X} = \operatorname*{argmax}_X p(X \mid O, \mu) = \operatorname*{argmax}_X p(X, O \mid \mu) = \operatorname*{argmax}_{x_1, \ldots, x_T} p(x_1, \ldots, x_T, o_1, \ldots, o_T \mid \mu)$

Time complexity?

The Viterbi Algorithm

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

$\hat{X} = \operatorname*{argmax}_{x_1, \ldots, x_T} p(x_1, \ldots, x_T, o_1, \ldots, o_T \mid \mu)$

• The probability of the most probable path that leads to x_t = j:

  $\delta_j(t) = \max_{x_1 \ldots x_{t-1}} p(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$

$p(\hat{X}) = \max_{x_1, \ldots, x_T} p(x_1, \ldots, x_T, o_1, \ldots, o_T \mid \mu) = \max_{1 \le j \le N} \delta_j(T)$

The Viterbi Algorithm

• The probability of the most probable path that leads to x_t = j:

  $\delta_j(t) = \max_{x_1 \ldots x_{t-1}} p(x_1 \ldots x_{t-1}, o_1 \ldots o_{t-1}, x_t = j, o_t)$

• It can be shown that:

  $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}$

  Compare with:  $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t) \, a_{ij} \, b_{j o_{t+1}}$

[HMM diagram: hidden chain x_1 → … → x_T emitting o_1, …, o_T]

The Viterbi Algorithm: Trellis Computation

[trellis diagram: each state s_1 … s_N holds δ_i(t); the arcs into s_j carry weights $a_{ij} b_{j o_{t+1}}$ and the max is taken to produce $\delta_j(t+1)$]

The Viterbi Algorithm

1. Initialization:

   $\delta_j(1) = \pi_j b_{j o_1}; \quad \psi_j(1) = 0$

2. Recursion:

   $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}$
   $\psi_j(t+1) = \operatorname*{argmax}_{1 \le i \le N} \delta_i(t) \, a_{ij} \, b_{j o_{t+1}}$

3. Termination:

   $p(\hat{X}) = \max_{1 \le j \le N} \delta_j(T); \quad \hat{x}_T = \operatorname*{argmax}_{1 \le j \le N} \delta_j(T)$

4. State sequence backtracking:

   $\hat{x}_t = \psi_{\hat{x}_{t+1}}(t+1)$

Time complexity?  (a sketch follows below)
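A sketch with the same pi/A/B conventions as the forward snippet; replacing the forward procedure's sums with maxes, plus backpointers ψ, gives O(N²T) time, which answers the complexity question above.

    import numpy as np

    def viterbi(obs, pi, A, B):
        T, N = len(obs), len(pi)
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)
        delta[0] = pi * B[:, obs[0]]                   # initialization
        for t in range(T - 1):
            scores = delta[t][:, None] * A             # scores[i, j] = delta_i(t) * a_ij
            psi[t + 1] = scores.argmax(axis=0)         # best predecessor of state j
            delta[t + 1] = scores.max(axis=0) * B[:, obs[t + 1]]
        path = [int(delta[-1].argmax())]               # termination
        for t in range(T - 1, 0, -1):                  # backtracking
            path.append(int(psi[t][path[-1]]))
        return path[::-1], delta[-1].max()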


HMMs: Inference and Training

• Three fundamental questions:
  1) Given a model µ = (A, B, Π), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward-Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi).
  3) Given an observation sequence O, find the model µ = (A, B, Π) that best explains the observed data (EM).
     • Given observation and state sequences O, X, find µ (ML).

Parameter Estimation with Maximum Likelihood

• Given observation and state sequences O, X, find µ = (A, B, Π):

  $\hat{\mu} = \operatorname*{argmax}_{\mu} p(O, X \mid \mu)$

• Initial state probabilities:

  $\hat{\pi}_i = \dfrac{C(x_1 = s_i)}{|X|}$   (counts over the training sequences)

• Transition probabilities:  $a_{ij} = p(x_{t+1} = s_j \mid x_t = s_i)$

  $\hat{a}_{ij} = \dfrac{C(x_t = s_i, x_{t+1} = s_j)}{C(x_t = s_i)}$

• Emission probabilities:  $b_{ik} = p(o_t = k \mid x_t = s_i)$

  $\hat{b}_{ik} = \dfrac{C(o_t = k, x_t = s_i)}{C(x_t = s_i)}$

Exercise: rewrite the estimates to use Laplace smoothing.  (a counting sketch follows below)
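A counting sketch for the supervised case, assuming the training data is a list of sentences, each a list of (word, tag) pairs; adding a constant k to every count (and scaling the denominators accordingly) gives the Laplace-smoothed variant the exercise asks for.

    from collections import Counter

    def estimate(corpus):
        start, trans, emit, tag_count = Counter(), Counter(), Counter(), Counter()
        for sent in corpus:
            start[sent[0][1]] += 1                      # tag of the first word
            for (_, t), (_, t_next) in zip(sent, sent[1:]):
                trans[(t, t_next)] += 1                 # C(x_t = i, x_{t+1} = j)
            for w, t in sent:
                emit[(t, w)] += 1                       # C(o_t = k, x_t = i)
                tag_count[t] += 1                       # C(x_t = i)
        from_i = Counter()                              # transitions out of i
        for (i, _), c in trans.items():
            from_i[i] += c
        pi = {t: c / len(corpus) for t, c in start.items()}
        a = {(i, j): c / from_i[i] for (i, j), c in trans.items()}
        b = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}
        return pi, a, b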


Parameter Estimation with Expectation Maximization

• Given observation sequences O, find µ = (A, B, Π):

  $\hat{\mu} = \operatorname*{argmax}_{\mu} p(O \mid \mu)$

• There is no known analytic method to find the solution.

• Locally maximize p(O|µ) using iterative hill-climbing:
  ⇒ the Baum-Welch or Forward-Backward algorithm:
  – Given a model µ and an observation sequence, update the model parameters to $\hat{\mu}$ to better fit the observations.
  – A special case of the Expectation Maximization method.

The Baum-Welch Algorithm (EM)

[E] Assume µ is known; compute the "hidden" parameters ξ, γ:

1) ξ_t(i, j) = the probability of being in state s_i at time t and state s_j at time t+1:

   $\xi_t(i, j) = \dfrac{\alpha_i(t) \, a_{ij} \, b_{j o_{t+1}} \, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t) \, \beta_m(t)}$

2) γ_t(i) = the probability of being in state s_i at time t:

   $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$

• $\sum_{t=1}^{T-1} \xi_t(i, j)$ = expected number of transitions from s_i to s_j.

• $\sum_{t=1}^{T-1} \gamma_t(i)$ = expected number of transitions from s_i.

The Baum-Welch Algorithm

[M] Re-estimate µ̂ using the expectations of ξ, γ (an E-step sketch follows below):

$\hat{\pi}_i = \gamma_1(i)$

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

$\hat{b}_{ik} = \dfrac{\sum_{\{t \,:\, o_t = k\}} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$

• Baum has proven that $p(O \mid \hat{\mu}) \ge p(O \mid \mu)$.
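A sketch of the E-step quantities, computed from the alpha and beta tables returned by the forward/backward snippets above; the denominator is p(O | µ).

    import numpy as np

    def xi_t(t, alpha, beta, A, B, obs):
        # xi_t[i, j] = alpha_i(t) * a_ij * b_{j, o_{t+1}} * beta_j(t+1) / p(O | mu)
        num = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
        return num / (alpha[t] * beta[t]).sum()

    # gamma_t(i) is then the row sum: gamma = xi_t(t, ...).sum(axis=1)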


The Baum-Welch Algorithm

1. Start with some (random) model µ = (A, B, Π).

2. [E step] Compute ξ_t(i, j), γ_t(i) and their expectations.

3. [M step] Compute the ML estimate µ̂.

4. Set µ = µ̂ and repeat from 2. until convergence.


HMMs

• Three fundamental questions:
  1) Given a model µ = (A, B, Π), compute the probability of a given observation sequence, i.e. p(O|µ) (Forward/Backward).
  2) Given a model µ and an observation sequence O, compute the most likely hidden state sequence (Viterbi).
  3) Given an observation sequence O, find the model µ = (A, B, Π) that best explains the observed data (Baum-Welch, or EM).
     • Given observation and state sequences O, X, find µ (ML).

Supplemental Reading

• Sections 7.1, 7.2, 7.3, and 7.4 from Eisenstein.
• Chapter 8 in Jurafsky & Martin:
  – https://web.stanford.edu/~jurafsky/slp3/8.pdf
• Appendix A in Jurafsky & Martin:
  – https://web.stanford.edu/~jurafsky/slp3/A.pdf


POS Disambiguation: Context

“Here's a movie where you forgive the preposterous because it takes you to the perplexing.”

[Source Code, by Roger Ebert, March 31, 2011]

“The good, the bad, and the ugly”

“The young and the restless”

“The bold and the beautiful”
