Sentence splitting, tokenization, language modeling, and part-of-speech tagging with hidden Markov models
Yoshimasa Tsuruoka
Topics
• Sentence splitting
• Tokenization
• Maximum likelihood estimation (MLE)
• Language models
– Unigram
– Bigram
– Smoothing
• Hidden Markov models (HMMs)
– Part-of-speech tagging
– Viterbi algorithm
(Figure: a text-processing pipeline turns raw (unstructured) text into annotated (structured) text through part-of-speech tagging, named entity recognition, and deep syntactic parsing, supported by a dictionary and an ontology. Example sentence: "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." is tokenized and tagged (NN IN NN VBZ VBN IN NN IN JJ NN NNS .), named entities are recognized (TNF: protein_molecule, BHA: organic_compound, U937 cells: cell_line), a phrase-structure parse (NP, PP, VP, S) is built, and a "negative regulation" event is annotated.)
Basic Steps of Natural Language Processing
• Sentence splitting
• Tokenization
• Part-of-speech tagging
• Shallow parsing
• Named entity recognition
• Syntactic parsing
• (Semantic Role Labeling)
Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
A heuristic rule for sentence splitting
sentence boundary
= period + space(s) + capital letter
Regular expression in Perl
s/\. +([A-Z])/\.\n$1/g;
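For example, the same rule can be applied to a whole file from the command line (a minimal sketch; input.txt and sentences.txt are placeholder file names):

perl -pe 's/\. +([A-Z])/.\n$1/g' input.txt > sentences.txt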
Errors
• The heuristic wrongly splits after abbreviations such as “e.g.”:

IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).

IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).

• Two solutions:
– Add more rules to handle exceptions
– Machine learning
Adding rules to handle exceptions
#!/usr/bin/perl
while (<STDIN>) {
    # split after "." or "?" followed by space(s) and an opening paren, digit, or letter
    $_ =~ s/([\.\?]) +([\(0-9a-zA-Z])/$1\n$2/g;
    # undo splits after common abbreviations
    $_ =~ s/(\WDr\.)\n/$1 /g;    $_ =~ s/(\WMr\.)\n/$1 /g;
    $_ =~ s/(\WMs\.)\n/$1 /g;    $_ =~ s/(\WMrs\.)\n/$1 /g;
    $_ =~ s/(\Wvs\.)\n/$1 /g;    $_ =~ s/(\Wa\.m\.)\n/$1 /g;
    $_ =~ s/(\Wp\.m\.)\n/$1 /g;  $_ =~ s/(\Wi\.e\.)\n/$1 /g;
    $_ =~ s/(\We\.g\.)\n/$1 /g;
    print;
}
Tokenization
• Tokenizing general English sentences is relatively straightforward.
• Use spaces as token boundaries
• Use some heuristics to handle exceptions
``We’re heading into a recession.’’
`` We ’re heading into a recession . ’’
Exceptions
• separate possessive endings or abbreviated forms from preceding words:
– Mary’s → Mary ’s
– Mary’s → Mary is
– Mary’s → Mary has
• separate punctuation marks and quotes from words (a rough sketch of both rules follows the examples):
– Mary. → Mary .
– “new” → “ new ’’
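A minimal Perl sketch of this space-plus-heuristics tokenization (illustrative only, not the Penn Treebank tokenizer; it assumes ASCII quotes and one sentence per line on STDIN):

#!/usr/bin/perl
# Naive tokenizer sketch: separate quotes, punctuation, and clitics from words.
use strict; use warnings;
while (<STDIN>) {
    chomp;
    s/``/ `` /g;                     # opening quotes
    s/''/ '' /g;                     # closing quotes
    s/([.,;:?!])/ $1 /g;             # punctuation (naive: also splits "e.g." and "1,25")
    s/'(s|re|ve|ll|m|d)\b/ '$1/g;    # possessives and abbreviated forms: Mary's -> Mary 's
    s/ +/ /g; s/^ //; s/ $//;        # normalize spaces
    print "$_\n";
}

On the sentence above, ``We're heading into a recession.'' comes out as `` We 're heading into a recession . '' — the same shape as the example, modulo straight vs. curly quotes.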
Tokenization
• Tokenizer.sed: a simple script in sed
• http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
– original: “1,25(OH)2D3”
– tokenized: “1 , 25 ( OH ) 2D3”
• Tokenization for biomedical text
– Not straightforward
– Needs a dictionary? Machine learning?
Maximum likelihood estimation
• Learning parameters from data
• Coin flipping example
– We want to know the probability that a biased coin comes up heads.
– Data: {H, T, H, H, T, H, T, T, H, H, T}
P(Head) = 6/11

Why?
Maximum likelihood estimation
• Likelihood
– Data: {H, T, H, H, T, H, T, T, H, H, T}
• Maximize the (log) likelihood
$$P(\mathrm{Data} \mid \theta) = \theta \cdot (1-\theta) \cdot \theta \cdot \theta \cdot (1-\theta) \cdot \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta \cdot (1-\theta) = \theta^{6}(1-\theta)^{5}$$

$$\log P(\mathrm{Data} \mid \theta) = 6 \log \theta + 5 \log (1-\theta)$$

$$\frac{d}{d\theta} \log P(\mathrm{Data} \mid \theta) = \frac{6}{\theta} - \frac{5}{1-\theta} = 0 \quad\Rightarrow\quad \theta = P(\mathrm{Head}) = \frac{6}{11}$$
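The same answer can be read straight off the counts; a tiny illustrative Perl check of the coin example above:

#!/usr/bin/perl
# MLE for the coin example: P(Head) = number of heads / number of flips
use strict; use warnings;
my @data  = qw(H T H H T H T T H H T);
my $heads = grep { $_ eq 'H' } @data;          # grep in scalar context counts matches
printf "P(Head) = %d/%d = %.3f\n", $heads, scalar @data, $heads / @data;
# prints: P(Head) = 6/11 = 0.545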
Language models
• A language model assigns a probability to a word sequence
• Statistical machine translation
– Noisy channel models: machine translation, part-of-speech tagging, speech recognition, optical character recognition, etc.
$$P(\mathbf{e} \mid \mathbf{f}) = \frac{P(\mathbf{f} \mid \mathbf{e})\,P(\mathbf{e})}{P(\mathbf{f})} \propto P(\mathbf{f} \mid \mathbf{e})\,P(\mathbf{e})$$

e: English sentence
f: French sentence
P(e): language model for English, i.e. the probability P(w_1 ... w_n) assigned to a word sequence
Language modeling
• How do you compute P(w_1 ... w_n) for a word sequence w_1 ... w_n?
Learn from data (a corpus)
He opened the window . 0.0000458
She opened the window . 0.0000723
John was hit . 0.0000244
John hit was . 0.0000002
: :
Estimate probabilities from a corpus
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

P(Humpty Dumpty sat on a wall .) = 1/2
P(Humpty Dumpty had a great fall .) = 1/2
P(Humpty Dumpty had a fall .) = 0
Sequence to words
• Let’s decompose the sequence probability into probabilities of individual words
$$\begin{aligned}
P(w_1 \ldots w_n) &= P(w_1)\,P(w_2 \ldots w_n \mid w_1)\\
&= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \ldots w_n \mid w_1 w_2)\\
&= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\cdots P(w_n \mid w_1 \ldots w_{n-1})\\
&= \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})
\end{aligned}$$

Chain rule: $P(x, y) = P(x)\,P(y \mid x)$
Unigram model
• Ignore the context completely:

$$P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i)$$

• Example:

P(Humpty Dumpty sat on a wall .)
  ≈ P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)
MLE for unigram models
• Count the frequencies
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

$$P(w_k) = \frac{C(w_k)}{\sum_{w} C(w)}$$

P(Humpty) = 2/14
P(on) = 1/14
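A small Perl sketch of this estimator (a hypothetical helper; it assumes one whitespace-tokenized sentence per line on STDIN):

#!/usr/bin/perl
# Unigram MLE: P(w) = C(w) / sum over all words w' of C(w')
use strict; use warnings;
my (%count, $total);
while (<STDIN>) {
    for my $w (split) {      # split on whitespace
        $count{$w}++;
        $total++;
    }
}
printf "P(%s) = %d/%d\n", $_, $count{$_}, $total for sort keys %count;

On the two-sentence Humpty Dumpty corpus this prints, among other entries, P(Humpty) = 2/14 and P(on) = 1/14.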
Unigram model
• Problem: word order is not considered.

P(Humpty Dumpty sat on a wall .)
  ≈ P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)
P(Dumpty Humpty sat on a wall .)
  ≈ P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)

Both orderings receive the same probability.
Bigram model
• 1st-order Markov assumption:

$$P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

• Example (<s> denotes the beginning of the sentence):

P(Humpty Dumpty sat on a wall .)
  ≈ P(Humpty | <s>) P(Dumpty | Humpty) P(sat | Dumpty) P(on | sat)
    P(a | on) P(wall | a) P(. | wall)
MLE for bigram models
• Count frequencies
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

$$P(w_k \mid w_{k-1}) = \frac{C(w_{k-1}\, w_k)}{C(w_{k-1})}$$

P(sat | Dumpty) = 1/2
P(had | Dumpty) = 1/2
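A corresponding sketch for the bigram estimator (again hypothetical; <s> marks the beginning of each sentence, as in the example above):

#!/usr/bin/perl
# Bigram MLE: P(w_k | w_{k-1}) = C(w_{k-1} w_k) / C(w_{k-1})
use strict; use warnings;
my (%bigram, %context);
while (<STDIN>) {
    my @w = ('<s>', split);                  # prepend the start-of-sentence marker
    for my $i (1 .. $#w) {
        $bigram{"$w[$i-1] $w[$i]"}++;        # C(w_{k-1} w_k)
        $context{$w[$i-1]}++;                # C(w_{k-1})
    }
}
for my $bg (sort keys %bigram) {
    my ($prev, $cur) = split ' ', $bg;
    printf "P(%s | %s) = %d/%d\n", $cur, $prev, $bigram{$bg}, $context{$prev};
}

On the toy corpus this gives P(sat | Dumpty) = 1/2 and P(had | Dumpty) = 1/2, as above.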
N-gram models
• (n−1)-th order Markov assumption
• Modeling with a large n
– Pros: rich contextual information
– Cons: sparseness (unreliable probability estimation)
$$P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-n+1} \ldots w_{i-1})$$
Web 1T 5-gram data
• Released from Google
– http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
– 1 trillion word tokens of text from Web pages
• Examples
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
: :
Smoothing
• Problems with the maximum likelihood estimation method
– Events that are not observed in the training corpus are given zero probabilities
– Unreliable estimates when the counts are small
Add-One Smoothing
• Bigram counts table
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

Bigram counts C(w_{k-1} w_k) — rows give the first word w_{k-1}, columns the second word w_k:
Humpty Dumpty sat on a wall had great fall .
Humpty 0 2 0 0 0 0 0 0 0 0
Dumpty 0 0 1 0 0 0 1 0 0 0
sat 0 0 0 1 0 0 0 0 0 0
on 0 0 0 0 1 0 0 0 0 0
a 0 0 0 0 0 1 0 1 0 0
wall 0 0 0 0 0 0 0 0 0 1
had 0 0 0 0 1 0 0 0 0 0
great 0 0 0 0 0 0 0 0 1 0
fall 0 0 0 0 0 0 0 0 0 1
. 0 0 0 0 0 0 0 0 0 0
Add-One Smoothing
• Add one to all counts
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .
Humpty Dumpty sat on a wall had great fall .
Humpty 1 3 1 1 1 1 1 1 1 1
Dumpty 1 1 2 1 1 1 2 1 1 1
sat 1 1 1 2 1 1 1 1 1 1
on 1 1 1 1 2 1 1 1 1 1
a 1 1 1 1 1 2 1 2 1 1
wall 1 1 1 1 1 1 1 1 1 2
had 1 1 1 1 2 1 1 1 1 1
great 1 1 1 1 1 1 1 1 2 1
fall 1 1 1 1 1 1 1 1 1 2
. 1 1 1 1 1 1 1 1 1 1
Add-One Smoothing
• V: the size of the vocabulary
$$P(w_k \mid w_{k-1}) = \frac{C(w_{k-1}\, w_k) + 1}{C(w_{k-1}) + V}$$

Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

$$P(\mathrm{sat} \mid \mathrm{Dumpty}) = \frac{1 + 1}{2 + 10}$$
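A sketch of the smoothed estimator on top of the same bigram counts (hypothetical; here V is taken to be the number of distinct word types seen in the corpus, which is 10 for the toy corpus, matching the example above):

#!/usr/bin/perl
# Add-one (Laplace) smoothing: P(w_k | w_{k-1}) = (C(w_{k-1} w_k) + 1) / (C(w_{k-1}) + V)
use strict; use warnings;
my (%bigram, %context, %vocab);
while (<STDIN>) {
    my @w = ('<s>', split);
    $vocab{$_}++ for @w[1 .. $#w];           # vocabulary = word types seen in the corpus
    for my $i (1 .. $#w) {
        $bigram{"$w[$i-1] $w[$i]"}++;
        $context{$w[$i-1]}++;
    }
}
my $V = keys %vocab;

sub p_add_one {
    my ($prev, $cur) = @_;
    return (($bigram{"$prev $cur"} // 0) + 1) / (($context{$prev} // 0) + $V);
}
printf "P(sat | Dumpty) = %.3f\n", p_add_one('Dumpty', 'sat');   # (1+1)/(2+10)
printf "P(had | wall)   = %.3f\n", p_add_one('wall', 'had');     # (0+1)/(1+10): no longer zero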
Part-of-speech tagging
• Assigns a part-of-speech tag to each word in the sentence.
• Part-of-speech tags
NN: Noun                      IN: Preposition
NNP: Proper noun              VBD: Verb, past tense
DT: Determiner                VBN: Verb, past participle
:                             :

Paul/NNP Krugman/NNP ,/, a/DT professor/NN at/IN Princeton/NNP University/NNP ,/, was/VBD
awarded/VBN the/DT Nobel/NNP Prize/NNP in/IN Economics/NNP on/IN Monday/NNP .
Part-of-speech tagging is not easy
• Parts-of-speech are often ambiguous
• We need to look at the context
• But how?
I have to go there.   (go: verb)
I had a go at it.     (go: noun)
Writing rules for part-of-speech tagging
• If the previous word is “to”, then it’s a verb.
• If the previous word is “a”, then it’s a noun.
• If the next word is …
:
Writing rules manually is impossible
I have to go there.   (go: verb)
I had a go at it.     (go: noun)
Learning from examples
The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ
supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN
…
(Diagram: the annotated corpus is used to train a machine learning algorithm, which then tags unseen text.)

Unseen text:    We demonstrate that …
Tagged output:  We/PRP demonstrate/VBP that/IN …
Part-of-speech tagging
• Input:
– Word sequence W = w_1 ... w_n
• Output (prediction):
– The most probable tag sequence T = t_1 ... t_n

$$\hat{T} = \arg\max_{T \in \tau} P(T \mid W) = \arg\max_{T \in \tau} \frac{P(T)\,P(W \mid T)}{P(W)} = \arg\max_{T \in \tau} P(T)\,P(W \mid T)$$

(P(W) is constant with respect to T)
Assumptions
• Assume the word emission probability depends only on its tag
$$\begin{aligned}
P(W \mid T) &= P(w_1 \ldots w_n \mid t_1 \ldots t_n)\\
&= P(w_1 \mid t_1 \ldots t_n)\,P(w_2 \mid w_1, t_1 \ldots t_n)\,P(w_3 \mid w_1 w_2, t_1 \ldots t_n)\cdots\\
&= \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}, t_1 \ldots t_n)
\end{aligned}$$

With this assumption:

$$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)$$
Part-of-speech tagging with Hidden Markov Models
• Assume the first-order Markov assumption for tags:

$$P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

• Then, the most probable tag sequence can be expressed as

$$\hat{T} = \arg\max_{T \in \tau} P(T)\,P(W \mid T) = \arg\max_{T \in \tau} \prod_{i=1}^{n} \underbrace{P(t_i \mid t_{i-1})}_{\text{transition probability}}\;\underbrace{P(w_i \mid t_i)}_{\text{emission probability}}$$
Learning from training data
Training data (tagged corpus):

The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ
supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN
…

Maximum likelihood estimation:

transition probability:  $$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}\, t_i)}{C(t_{i-1})}$$

emission probability:  $$P(w_i \mid t_i) = \frac{C(w_i,\, t_i)}{C(t_i)}$$
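A Perl sketch of this estimation step (hypothetical; it expects the training corpus as word/TAG tokens on STDIN, e.g. "The/DT involvement/NN of/IN ..."):

#!/usr/bin/perl
# MLE of HMM parameters from a tagged corpus given as word/TAG tokens.
use strict; use warnings;
my (%trans, %trans_ctx, %emit, %emit_ctx);
while (<STDIN>) {
    my $prev = '<s>';                              # tag context at the start of a sentence
    for my $pair (split) {
        my ($word, $tag) = $pair =~ m{^(.+)/([^/]+)$} or next;
        $trans{"$prev $tag"}++;                    # C(t_{i-1} t_i)
        $trans_ctx{$prev}++;                       # C(t_{i-1})
        $emit{"$tag $word"}++;                     # C(w_i, t_i)
        $emit_ctx{$tag}++;                         # C(t_i)
        $prev = $tag;
    }
}
# transition probability P(t_i | t_{i-1}) and emission probability P(w_i | t_i)
sub p_trans { my ($prev, $t) = @_; ($trans{"$prev $t"} // 0) / $trans_ctx{$prev} }
sub p_emit  { my ($t, $w)    = @_; ($emit{"$t $w"}    // 0) / $emit_ctx{$t} }
# example lookups (assuming these tags and words occur in the training data)
printf "P(NN | DT) = %.3f\n", p_trans('DT', 'NN');
printf "P(involvement | NN) = %.4f\n", p_emit('NN', 'involvement');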
Examples from the Wall Street Journal corpus (1)
P(NN | DT) = 0.47
P(JJ | DT) = 0.22
P(NNP | DT) = 0.11
P(NNS | DT) = 0.07
P(CD | DT) = 0.02
POS Tag Description Examples
DT determiner a, the
NN noun, singular or mass company
NNS noun plural companies
NNP proper nouns, singular Nasdaq
IN preposition/subordinating conjunction in, of, like
VBD verb, past tense took
: : :
P(IN | NN) = 0.25
P(NN | NN) = 0.12
P(, | NNP) = 0.11
P(. | NNP) = 0.08
P(VBD | NNP) = 0.05
Examples from the Wall Street Journal corpus (2)
P(% | NN) = 0.037
P(company | NN) = 0.019
P(year | NN) = 0.0167
P(market | NN) = 0.0155
P(share | NN) = 0.0107
P(said | VBD) = 0.188
P(was | VBD) = 0.131
P(were | VBD) = 0.064
P(had | VBD) = 0.056
P(rose | VBD) = 0.022
Part-of-speech tagging
• Input:
– Word sequence W = w_1 ... w_n
• Output (prediction):
– The most probable tag sequence T = t_1 ... t_n

$$\hat{T} = \arg\max_{T \in \tau} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$$

(HMM diagram: a chain of hidden tags t_1 → t_2 → t_3 → … → t_n, each tag t_i emitting the word w_i)
Finding the best sequence
• Naive approach:
– Enumerate all possible tag sequences with their probabilities and select the one that gives the highest probability

(Diagram: at each of the n positions w_1 … w_n, the tag t_i can be any of DT, NN, NNS, NNP, VB, …, so the number of candidate tag sequences is exponential in the length of the sentence.)
Finding the best sequence
• if you write them down…
P(DT | <s>) P(w_1 | DT)  · P(DT | DT) P(w_2 | DT)  · … · P(DT | DT) P(w_n | DT)
P(NN | <s>) P(w_1 | NN)  · P(DT | NN) P(w_2 | DT)  · … · P(DT | DT) P(w_n | DT)
P(NNS | <s>) P(w_1 | NNS) · P(DT | NNS) P(w_2 | DT) · … · P(DT | DT) P(w_n | DT)
:
P(DT | <s>) P(w_1 | DT)  · P(NN | DT) P(w_2 | NN)  · … · P(DT | DT) P(w_n | DT)
P(NN | <s>) P(w_1 | NN)  · P(NN | NN) P(w_2 | NN)  · … · P(DT | DT) P(w_n | DT)
P(NNS | <s>) P(w_1 | NNS) · P(NN | NNS) P(w_2 | NN) · … · P(DT | DT) P(w_n | DT)
:
A lot of redundancy in computation
There should be a way to do it more efficiently
Viterbi algorithm (1)
• Let V_k(t_k) be the probability of the best tag sequence for w_1 ... w_k ending with the tag t_k:

$$V_k(t_k) \equiv \max_{t_1, \ldots, t_{k-1}} \prod_{i=1}^{k} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$$

$$\begin{aligned}
V_k(t_k) &= \max_{t_{k-1}} \max_{t_1, \ldots, t_{k-2}} \prod_{i=1}^{k} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)\\
&= \max_{t_{k-1}} \Big[\, P(t_k \mid t_{k-1})\,P(w_k \mid t_k) \max_{t_1, \ldots, t_{k-2}} \prod_{i=1}^{k-1} P(t_i \mid t_{i-1})\,P(w_i \mid t_i) \Big]\\
&= \max_{t_{k-1}} P(t_k \mid t_{k-1})\,P(w_k \mid t_k)\,V_{k-1}(t_{k-1})
\end{aligned}$$
We can compute them in a recursive manner!
Viterbi algorithm (2)
$$V_k(t_k) = \max_{t_{k-1}} P(t_k \mid t_{k-1})\,P(w_k \mid t_k)\,V_{k-1}(t_{k-1})$$

(Trellis diagram: at positions k−1 and k, one node per tag — DT, NN, NNS, NNP, … — with every tag at position k−1 connected to every tag at position k.)

Beginning of the sentence: V_0(<s>) = 1

Candidates for V_k(DT):
P(DT | DT)  · P(w_k | DT) · V_{k-1}(DT)
P(DT | NN)  · P(w_k | DT) · V_{k-1}(NN)
P(DT | NNS) · P(w_k | DT) · V_{k-1}(NNS)
P(DT | NNP) · P(w_k | DT) · V_{k-1}(NNP)
:
Viterbi algorithm (3)
• Left to right
– Save backward pointers to recover the best tag sequence
Complexity: (number of tags)² × (sentence length)
(Trellis for the sentence "He has opened it": Noun and Verb nodes at each of the four word positions, with backward pointers from each node to its best predecessor.)
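Putting the pieces together, a compact Perl sketch of the Viterbi recursion with backpointers (purely illustrative: the transition and emission probabilities below are made-up toy values, not estimates from a real corpus, and at least one surviving path per position is assumed):

#!/usr/bin/perl
# Viterbi decoding: find the tag sequence maximizing prod_i P(t_i | t_{i-1}) P(w_i | t_i).
use strict; use warnings;

# Toy parameters (in practice these are the MLE estimates from a tagged corpus).
my %trans = ('<s> PRP' => 0.3, 'PRP VBZ' => 0.4, 'VBZ VBN' => 0.3, 'VBN PRP' => 0.2);
my %emit  = ('PRP He' => 0.1, 'VBZ has' => 0.2, 'VBN opened' => 0.05, 'PRP it' => 0.1);
my @tags  = qw(PRP VBZ VBN);

sub viterbi {
    my @words = @_;
    my @V   = ({ '<s>' => 1 });        # V[k]{t}: score of the best sequence for w_1..w_k ending in t
    my @ptr = ({});                    # backpointers to recover the best sequence
    for my $k (1 .. @words) {
        my $w = $words[$k - 1];
        for my $t (@tags) {                                    # current tag
            my ($best, $arg) = (0, undef);
            for my $prev (keys %{ $V[$k - 1] // {} }) {        # previous tag
                my $score = $V[$k - 1]{$prev}
                          * ($trans{"$prev $t"} // 0)
                          * ($emit{"$t $w"}     // 0);
                ($best, $arg) = ($score, $prev) if $score > $best;
            }
            if (defined $arg) { $V[$k]{$t} = $best; $ptr[$k]{$t} = $arg; }
        }
    }
    # Pick the best final tag, then follow the backpointers from right to left.
    my ($tag) = sort { $V[-1]{$b} <=> $V[-1]{$a} } keys %{ $V[-1] // {} };
    my @best_seq;
    for (my $k = $#V; $k >= 1; $k--) {
        unshift @best_seq, $tag;
        $tag = $ptr[$k]{$tag};
    }
    return @best_seq;
}

print join(' ', viterbi(qw(He has opened it))), "\n";   # prints: PRP VBZ VBN PRP

Each position considers every (previous tag, current tag) pair, which is where the (number of tags)² × (sentence length) complexity above comes from.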
Topics
• Sentence splitting
• Tokenization
• Maximum likelihood estimation (MLE)
• Language models
– Unigram
– Bigram
– Smoothing
• Hidden Markov models (HMMs)
– Part-of-speech tagging
– Viterbi algorithm