Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf ·...

45
Sentence splitting, tokenization, language modeling, and part-of- speech tagging with hidden Markov models Yoshimasa Tsuruoka

Transcript of Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf ·...

Page 1: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Sentence splitting, tokenization, language modeling, and part-of-

speech tagging with hidden Markov models

Yoshimasa Tsuruoka

Page 2: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Topics

• Sentence splitting

• Tokenization

• Maximum likelihood estimation (MLE)

• Language models

– Unigram

– Bigram

– Smoothing

• Hidden Markov models (HMMs)

– Part-of-speech tagging

– Viterbi algorithm

Page 3: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

raw(unstructured)

text

part-of-speechtagging

named entityrecognition

deepsyntacticparsing

annotated(structured)

text

text processing

dictionary ontology

………………………....... Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells. ……………………

Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells .

NN IN NN VBZ VBN IN NN IN JJ NN NNS .

protein_molecule organic_compound cell_line

PP PP NP

PP

VP

VP

NP

NP

S

negative regulation

Page 4: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Basic Steps of Natural Language Processing

• Sentence splitting

• Tokenization

• Part-of-speech tagging

• Shallow parsing

• Named entity recognition

• Syntactic parsing

• (Semantic Role Labeling)

Page 5: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Sentence splitting

Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.

However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.

Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

Page 6: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

A heuristic rule for sentence splitting

sentence boundary

= period + space(s) + capital letter

Regular expression in Perl

s/\. +([A-Z])/\.\n$1/g;

Page 7: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Errors

• Two solutions:

– Add more rules to handle exceptions

– Machine learning

IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).

IL-33 is known to induce the production of Th2-associated cytokines (e.g.

IL-5 and IL-13).

Page 8: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Adding rules to handle exceptions

#!/usr/bin/perl

while(<STDIN>) {$_ =~ s/([\.\?]) +([\(0-9a-zA-Z])/$1\n$2/g;

$_ =~ s/(\WDr\.)\n/$1 /g;$_ =~ s/(\WMr\.)\n/$1 /g;$_ =~ s/(\WMs\.)\n/$1 /g;$_ =~ s/(\WMrs\.)\n/$1 /g;$_ =~ s/(\Wvs\.)\n/$1 /g;$_ =~ s/(\Wa\.m\.)\n/$1 /g;$_ =~ s/(\Wp\.m\.)\n/$1 /g;$_ =~ s/(\Wi\.e\.)\n/$1 /g;$_ =~ s/(\We\.g\.)\n/$1 /g;

print;}

Page 9: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Tokenization

• Tokenizing general English sentences is relatively straightforward.

• Use spaces as the boundaries

• Use some heuristics to handle exceptions

``We’re heading into a recession.’’

`` We ’re heading into a recession . ’’

Page 10: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Exceptions

• separate possessive endings or abbreviated forms from preceding words:

– Mary’s → Mary ’sMary’s → Mary isMary’s → Mary has

• separate punctuation marks and quotes from words :

– Mary. → Mary .

– “new” → “ new ’’

Page 11: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Tokenization

• Tokenizer.sed: a simple script in sed

• http://www.cis.upenn.edu/~treebank/tokenization.html

• Undesirable tokenization– original: “1,25(OH)2D3”

– tokenized: “1 , 25 ( OH ) 2D3”

• Tokenization for biomedical text– Not straight-forward

– Needs dictionary? Machine learning?

Page 12: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Maximum likelihood estimation

• Learning parameters from data

• Coin flipping example

– We want to know the probability that a biased coin comes up heads.

– Data: {H, T, H, H, T, H, T, T, H, H, T}

( )11

6Head =P Why?

Page 13: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Maximum likelihood estimation

• Likelihood

– Data: {H, T, H, H, T, H, T, T, H, H, T}

• Maximize the (log) likelihood

( ) ( ) ( ) ( ) ( ) ( )

( )56 1

11111Data

θθ

θθθθθθθθθθθ

−=

−⋅⋅⋅−⋅−⋅⋅−⋅⋅⋅−⋅=P

( ) ( )θθ −+= 1log5log6Datalog P

( )0

1

56Datalog=

−−=

θθθd

Pd 11

6=θ

Page 14: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Language models

• A language model assigns a probability to a word sequence

• Statistical machine translation

– Noisy channel models: machine translation, part-of-speech tagging, speech recognition, optical character recognition, etc.

( )( ) ( )

( )( ) ( )ee|f

f

ee|ff|e PP

P

PPP ∝=

( )nwwP ...1

English sentence

French sentence Language model for English

Page 15: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Language modeling

• How do you compute ?

Learn from data (a corpus)

He opened the window . 0.0000458

She opened the window . 0.0000723

John was hit . 0.0000244

John hit was . 0.0000002

: :

( )nwwP ...1nww ...1

( )nwwP ...1

Page 16: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Estimate probabilities from a corpus

Humpty Dumpty sat on a wall .Humpty Dumpty had a great fall .

( )2

1.wallaonsatDumptyHumpty =P

( )2

1.fallgreatahadDumptyHumpty =P

( ) 0.fallahadDumptyHumpty =P

Corpus

Page 17: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Sequence to words

• Let’s decompose the sequence probability into probabilities of individual words

( ) ( ) ( )( ) ( ) ( )

( ) ( ) ( ) ( )

( )∏=

−=

=

=

=

=

n

i

ii

n

n

nn

wwwP

wwwwwPwwwPwwPwP

wwwwPwwPwP

wwwPwPwwP

1

11

3214213121

213121

1211

...|

...

|...||

|...|

|...... ( ) ( ) ( )xyPxPyxP |, =

Chain rule

Page 18: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Unigram model

• Ignore the context completely

• Example

( ) ( )

( )∏

=

=

=

n

i

i

n

i

iin

wP

wwwPwwP

1

1

111 ...|...

( )( ) ( ) ( ) ( ) ( ) ( ) ( ).wallaonsatDumptyHumpty

.wallaonsatDumptyHumpty

PPPPPPP

P

Page 19: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

MLE for unigram models

• Count the frequencies

Humpty Dumpty sat on a wall .Humpty Dumpty had a great fall .

( )( )

( )∑=

w

k

kk

wC

wCwP

( )14

2Humpty =P

( )14

1on =P

Page 20: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Unigram model

• Problem

• Word order is not considered.

( )( ) ( ) ( ) ( ) ( ) ( ) ( ).wallaonsatDumptyHumpty

.wallaonsatDumptyHumpty

PPPPPPP

P

( )( ) ( ) ( ) ( ) ( ) ( ) ( ).wallaonsatDumptyHumpty

.wallaonsatHumptyDumpty

PPPPPPP

P

Page 21: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Bigram model

• 1st order Markov assumption

• Example

( ) ( )

( )∏

=

=

=

n

i

ii

n

i

iin

wwP

wwwPwwP

1

1

1

111

|

...|...

( )( ) ( ) ( )( ) ( ) ( ) ( )wall|.a|wallon|asat|on

Dumpty|satHumpty|Dumptys|Humpty

.wallaonsatDumptyHumpty

PPPP

PPP

P

><≈

Page 22: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

MLE for bigram models

• Count frequencies

Humpty Dumpty sat on a wall .Humpty Dumpty had a great fall .

( )( )

( )1

11|

−− =

k

kkkk

wC

wwCwwP

( )2

1Dumpty|sat =P

( )2

1Dumpty|had =P

Page 23: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

N-gram models

• n-1 order Markov assumption

• Modeling with a large n

– Pros: rich contextual information

– Cons: sparseness (unreliable probability estimation)

( ) ( )

( )∏

=

−+−

=

=

n

i

inii

n

i

iin

wwwP

wwwPwwP

1

11

1

111

...|

...|...

Page 24: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Web 1T 5-gram data

• Released from Google– http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?c

atalogId=LDC2006T13

– 1 trillion word tokens of text from Web pages

• Examples

serve as the incoming 92

serve as the incubator 99

serve as the independent 794

serve as the index 223

serve as the indication 72

serve as the indicator 120

: :

Page 25: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Smoothing

• Problems with the maximum likelihood estimation method

– Events that are not observed in the training corpus are given zero probabilities

– Unreliable estimates when the counts are small

Page 26: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Add-One Smoothing

• Bigram counts table

Humpty Dumpty sat on a wall .Humpty Dumpty had a great fall .

Humpty Dumpty sat on a wall had great fall .

Humpty 0 2 0 0 0 0 0 0 0 0

Dumpty 0 0 1 0 0 0 1 0 0 0

sat 0 0 0 1 0 0 0 0 0 0

on 0 0 0 0 1 0 0 0 0 0

a 0 0 0 0 0 1 0 1 0 0

wall 0 0 0 0 0 0 0 0 0 1

had 0 0 0 0 1 0 0 0 0 0

great 0 0 0 0 0 0 0 0 1 0

fall 0 0 0 0 0 0 0 0 0 1

. 0 0 0 0 0 0 0 0 0 0

Page 27: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Add-One Smoothing

• Add one to all counts

Humpty Dumpty sat on a wall .Humpty Dumpty had a great fall .

Humpty Dumpty sat on a wall had great fall .

Humpty 1 3 1 1 1 1 1 1 1 1

Dumpty 1 1 2 1 1 1 2 1 1 1

sat 1 1 1 2 1 1 1 1 1 1

on 1 1 1 1 2 1 1 1 1 1

a 1 1 1 1 1 2 1 2 1 1

wall 1 1 1 1 1 1 1 1 1 2

had 1 1 1 1 2 1 1 1 1 1

great 1 1 1 1 1 1 1 1 2 1

fall 1 1 1 1 1 1 1 1 1 2

. 1 1 1 1 1 1 1 1 1 1

Page 28: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Add-One Smoothing

• V: the size of the vocabulary

( )( )( ) VwC

wwCwwP

k

kkkk

+

+=

−−

1

11

1|

Humpty Dumpty sat on a wall .Humpty Dumpty had a great fall .

( )102

11Dumpty|sat

+

+=P

Page 29: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Part-of-speech tagging

• Assigns a part-of-speech tag to each word in the sentence.

• Part-of-speech tagsNN: Noun IN: Preposition

NNP: Proper noun VBD: Verb, past tense

DT: Determiner VBN: Verb, past participle

: :

Paul Krugman, a professor at Princeton University, wasNNP NNP , DT NN IN NNP NNP , VBDawarded the Nobel Prize in Economics on Monday.

VBN DT NNP NNP IN NNP IN NNP

Page 30: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Part-of-speech tagging is not easy

• Parts-of-speech are often ambiguous

• We need to look at the context

• But how?

I have to go there.

I had a go at it.

verb

noun

Page 31: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Writing rules for part-of-speech tagging

• If the previous word is “to”, then it’s a verb.

• If the previous word is “a”, then it’s a noun.

• If the next word is …

:

Writing rules manually is impossible

I have to go there.

verb

I had a go at it.

noun

Page 32: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Learning from examples

The involvement of ion channels in B and T lymphocyte activation isDT NN IN NN NNS IN NN CC NN NN NN VBZ

supported by many reports of changes in ion fluxes and membraneVBN IN JJ NNS IN NNS IN NN NNS CC NN

…………………………………………………………………………………….…………………………………………………………………………………….

Machine Learning Algorithm

training

We demonstrate that …

Unseen text

We demonstratePRP VBPthat …IN

Page 33: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Part-of-speech tagging

• Input:

– Word sequence

• Output (prediction):

– The most probable tag sequence

( )

( ) ( )( )

( ) ( )TWPTP

WP

TWPTP

WTPT

T

T

T

|argmax

|argmax

|argmax

τ

τ

τ

=

=

=)

nttT ...1=

nwwW ...1=

constant

Page 34: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Assumptions

• Assume the word emission probability depends only on its tag

( ) ( )

( ) ( )( ) ( ) ( )

( )∏=

−=

=

=

=

=

n

i

nii

nnnn

nnn

nn

ttwwwP

ttwwwwPttwwPttwP

ttwwwPttwP

ttwwPTWP

1

111

121311211

11211

11

......|

...

...|......|...|

...|......|

...|...|

( ) ( )∏=

=n

i

ii twPTWP1

||

Page 35: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Part-of-speech tagging with Hidden Markov Models

emission probabilitytransition probability

( ) ( )

( ) ( )∏=

−∈

=

=

n

i

iiiiT

T

twPttP

TWPTPT

1

1 ||argmax

|argmax

τ

τ

)

• Assume the first-order Markov assumption for tags

• Then, the most probably tag sequence can be expressed as

( ) ( )∏=

−≈n

i

ii ttPTP1

1|

Page 36: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Learning from training data

The involvement of ion channels in B and T lymphocyte activation isDT NN IN NN NNS IN NN CC NN NN NN VBZ

supported by many reports of changes in ion fluxes and membraneVBN IN JJ NNS IN NNS IN NN NNS CC NN

…………………………………………………………………………………….…………………………………………………………………………………….

( )( )( )1

11|

−− =

i

iiii

tC

ttCttP

( )( )( )i

iiii

tC

twCtwP =|

Maximum likelihood estimation

transition probability

emission probability

Page 37: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Examples from the Wall Street Journal corpus (1)

( )( )( )( )( ) 02.0DT|CD

07.0DT|NNS

11.0DT|NNP

22.0DT|JJ

47.0DT|NN

=

=

=

=

=

P

P

P

P

P

POS Tag Description Examples

DT determiner a, the

NN noun, singular or mass company

NNS noun plural companies

NNP proper nouns, singular Nasdaq

IN preposition/subordinating conjunction in, of, like

VBD verb, past tense took

: : :

( )( )( )( )( ) 05.0|VBD

08.0|.

11.0|,

12.0NN|NN

25.0NN|IN

=

=

=

=

=

NNP

NNP

NNP

P

P

Page 38: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Examples from the Wall Street Journal corpus (2)

( )( )( )( )( ) 0107.0NN|share

0155.0NN|market

0167.0NN|year

019.0NN|company

037.0NN|%

=

=

=

=

=

P

P

P

P

P

POS Tag Description Examples

DT determiner a, the

NN noun, singular or mass company

NNS noun plural companies

NNP proper nouns, singular Nasdaq

IN preposition/subordinating conjunction in, of, like

VBD verb, past tense took

: : :

( )( )( )( )( ) 022.0VBD|rose

056.0VBD|had

064.0VBD|were

131.0VBD|was

188.0VBD|said

=

=

=

=

=

P

P

P

P

P

Page 39: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Part-of-speech tagging

• Input:– Word sequence

• Output (prediction):– The most probable tag sequence nttT ...1=

nwwW ...1=

( ) ( )∏=

−∈

=n

i

iiiiT

twPttPT1

1 ||argmaxτ

)

w1

t1

w2

t2

w3

t3

…wn

tn…

Page 40: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Finding the best sequence

• Naive approach:– Enumerate all possible tag sequences with their probabilities and

select the one that gives the highest probability

w1

t1

w2

t2

w3

t3

…wn

tn…

DTNNNNSNNPVB:

DTNNNNSNNPVB:

DTNNNNSNNPVB:

DTNNNNSNNPVB:

exponential in the length of the sentence

Page 41: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Finding the best sequence

• if you write them down…

( ) ( ) ( ) ( ) ( ) ( )

( ) ( ) ( ) ( ) ( ) ( )( ) ( ) ( ) ( ) ( ) ( )

( ) ( ) ( ) ( ) ( ) ( )

( ) ( ) ( ) ( ) ( ) ( )

( ) ( ) ( ) ( ) ( ) ( )

:

DT|DT|DTNN|NNS|NNNNS|s|NNS

DT|DT|DTNN|NN|NNNN|s|NN

DT|DT|DTNN|DT|NNDT|s|DT

:

DT|DT|DTDT|DT|DTNNS|s|NNS

DT|DT|DTDT|DT|DTNN|s|NN

DT|DT|DTDT|DT|DTDT|s|DT

21

21

21

21

21

21

n

n

n

n

n

n

wPPwPPwPP

wPPwPPwPP

wPPwPPwPP

wPPwPPwPP

wPPwPPwPP

wPPwPPwPP

Λ

Λ

Λ

Λ

Λ

Λ

><

><

><

><

><

><

A lot of redundancy in computation

There should be a way to do it more efficiently

Page 42: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Viterbi algorithm (1)

• Let be the probability of the best sequence for ending with the tag

( ) ( ) ( )

( ) ( )

( ) ( ) ( ) ( )

( ) ( ) ( )111

1

1

1...,

1

1

1...,

1

1,...,

||max

||max||max

||maxmax

||max

1

2211

2211

121

−−−

=

−−

=

=

−−

−−

=

=

=

kkkkkkt

k

i

iiiittt

kkkkt

k

i

iiiitttt

k

i

iiiittt

kk

tVtwPttP

twPttPtwPttP

twPttP

twPttPtV

k

kk

kk

k

( )kk tV

ktkww ...1

We can compute them in a recursive manner!

Page 43: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Viterbi algorithm (2)

( ) ( ) ( ) ( )111 ||max1

−−−−

= kkkkkkt

kk tVtwPttPtVk

wk-1

tk-1

wk

tk

DT

NN

NNS

NNP

DT

NN

NNS

NNP

( ) 1s0 =><V

Beginning of the sentence

( ) ( ) ( )( ) ( ) ( )( ) ( ) ( )( ) ( ) ( )

:

NNPDT|NNP|DT

NNSDT|NNS|DT

NNDT|NN|DT

DTDT|DT|DT

1

1

1

1

kk

kk

kk

kk

VwPP

VwPP

VwPP

VwPP

Page 44: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Viterbi algorithm (3)

• Left to right

– Save backward pointers to recover the best tag sequence

Complexity: (number of tags)2 x (length)

Noun

Verb

Noun

Verb

Noun

Verb

Noun

Verb

He has opened it

Page 45: Sentence splitting, tokenization, language modeling, and ...nactem.ac.uk/dtc/dtc-Tsuruoka.pdf · Sentence splitting, tokenization, language modeling, and part-of-speech tagging with

Topics

• Sentence splitting

• Tokenization

• Maximum likelihood estimation (MLE)

• Language models

– Unigram

– Bigram

– Smoothing

• Hidden Markov models (HMMs)

– Part-of-speech tagging

– Viterbi algorithm