Sentence splitting, tokenization, language modeling, and part-of-speech tagging with hidden Markov models
Yoshimasa Tsuruoka
Topics
• Sentence splitting
• Tokenization
• Maximum likelihood estimation (MLE)
• Language models
– Unigram
– Bigram
– Smoothing
• Hidden Markov models (HMMs)
– Part-of-speech tagging
– Viterbi algorithm
(Figure: a text-processing pipeline turns raw (unstructured) text into annotated (structured) text through part-of-speech tagging, named entity recognition, and deep syntactic parsing, supported by a dictionary and an ontology. Example sentence: "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." is tokenized and tagged (NN IN NN VBZ VBN IN NN IN JJ NN NNS .), named entities are recognized (TNF: protein_molecule, BHA: organic_compound, U937 cells: cell_line), a phrase-structure parse (NP, PP, VP, S) is built, and a "negative regulation" event is annotated.)
Basic Steps of Natural Language Processing
• Sentence splitting
• Tokenization
• Part-of-speech tagging
• Shallow parsing
• Named entity recognition
• Syntactic parsing
• (Semantic Role Labeling)
Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
A heuristic rule for sentence splitting
sentence boundary
= period + space(s) + capital letter
Regular expression in Perl
s/\. +([A-Z])/\.\n$1/g;
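For example, the same rule can be applied to a whole file from the command line (a minimal sketch; input.txt and sentences.txt are placeholder file names):

perl -pe 's/\. +([A-Z])/.\n$1/g' input.txt > sentences.txt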
Errors
• The heuristic wrongly splits after abbreviations such as “e.g.”:

IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).

IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).

• Two solutions:
– Add more rules to handle exceptions
– Machine learning
Adding rules to handle exceptions
#!/usr/bin/perl
while (<STDIN>) {
    # split after "." or "?" followed by space(s) and an opening paren, digit, or letter
    $_ =~ s/([\.\?]) +([\(0-9a-zA-Z])/$1\n$2/g;
    # undo splits after common abbreviations
    $_ =~ s/(\WDr\.)\n/$1 /g;    $_ =~ s/(\WMr\.)\n/$1 /g;
    $_ =~ s/(\WMs\.)\n/$1 /g;    $_ =~ s/(\WMrs\.)\n/$1 /g;
    $_ =~ s/(\Wvs\.)\n/$1 /g;    $_ =~ s/(\Wa\.m\.)\n/$1 /g;
    $_ =~ s/(\Wp\.m\.)\n/$1 /g;  $_ =~ s/(\Wi\.e\.)\n/$1 /g;
    $_ =~ s/(\We\.g\.)\n/$1 /g;
    print;
}
Tokenization
• Tokenizing general English sentences is relatively straightforward.
• Use spaces as token boundaries
• Use some heuristics to handle exceptions
``We’re heading into a recession.’’
`` We ’re heading into a recession . ’’
Exceptions
• separate possessive endings or abbreviated forms from preceding words:
– Mary’s → Mary ’s
– Mary’s → Mary is
– Mary’s → Mary has
• separate punctuation marks and quotes from words (a rough sketch of both rules follows the examples):
– Mary. → Mary .
– “new” → “ new ’’
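A minimal Perl sketch of this space-plus-heuristics tokenization (illustrative only, not the Penn Treebank tokenizer; it assumes ASCII quotes and one sentence per line on STDIN):

#!/usr/bin/perl
# Naive tokenizer sketch: separate quotes, punctuation, and clitics from words.
use strict; use warnings;
while (<STDIN>) {
    chomp;
    s/``/ `` /g;                     # opening quotes
    s/''/ '' /g;                     # closing quotes
    s/([.,;:?!])/ $1 /g;             # punctuation (naive: also splits "e.g." and "1,25")
    s/'(s|re|ve|ll|m|d)\b/ '$1/g;    # possessives and abbreviated forms: Mary's -> Mary 's
    s/ +/ /g; s/^ //; s/ $//;        # normalize spaces
    print "$_\n";
}

On the sentence above, ``We're heading into a recession.'' comes out as `` We 're heading into a recession . '' — the same shape as the example, modulo straight vs. curly quotes.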
Tokenization
• Tokenizer.sed: a simple script in sed
• http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
– original: “1,25(OH)2D3”
– tokenized: “1 , 25 ( OH ) 2D3”
• Tokenization for biomedical text
– Not straightforward
– Needs a dictionary? Machine learning?
Maximum likelihood estimation
• Learning parameters from data
• Coin flipping example
– We want to know the probability that a biased coin comes up heads.
– Data: {H, T, H, H, T, H, T, T, H, H, T}
P(Head) = 6/11

Why?
Maximum likelihood estimation
• Likelihood
– Data: {H, T, H, H, T, H, T, T, H, H, T}
• Maximize the (log) likelihood
$$P(\mathrm{Data} \mid \theta) = \theta \cdot (1-\theta) \cdot \theta \cdot \theta \cdot (1-\theta) \cdot \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta \cdot (1-\theta) = \theta^{6}(1-\theta)^{5}$$

$$\log P(\mathrm{Data} \mid \theta) = 6 \log \theta + 5 \log (1-\theta)$$

$$\frac{d}{d\theta} \log P(\mathrm{Data} \mid \theta) = \frac{6}{\theta} - \frac{5}{1-\theta} = 0 \quad\Rightarrow\quad \theta = P(\mathrm{Head}) = \frac{6}{11}$$
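The same answer can be read straight off the counts; a tiny illustrative Perl check of the coin example above:

#!/usr/bin/perl
# MLE for the coin example: P(Head) = number of heads / number of flips
use strict; use warnings;
my @data  = qw(H T H H T H T T H H T);
my $heads = grep { $_ eq 'H' } @data;          # grep in scalar context counts matches
printf "P(Head) = %d/%d = %.3f\n", $heads, scalar @data, $heads / @data;
# prints: P(Head) = 6/11 = 0.545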
Language models
• A language model assigns a probability to a word sequence
• Statistical machine translation
– Noisy channel models: machine translation, part-of-speech tagging, speech recognition, optical character recognition, etc.
$$P(\mathbf{e} \mid \mathbf{f}) = \frac{P(\mathbf{f} \mid \mathbf{e})\,P(\mathbf{e})}{P(\mathbf{f})} \propto P(\mathbf{f} \mid \mathbf{e})\,P(\mathbf{e})$$

e: English sentence
f: French sentence
P(e): language model for English, i.e. the probability P(w_1 ... w_n) assigned to a word sequence
Language modeling
• How do you compute P(w_1 ... w_n) for a word sequence w_1 ... w_n?
Learn from data (a corpus)
He opened the window . 0.0000458
She opened the window . 0.0000723
John was hit . 0.0000244
John hit was . 0.0000002
: :
Estimate probabilities from a corpus
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

P(Humpty Dumpty sat on a wall .) = 1/2
P(Humpty Dumpty had a great fall .) = 1/2
P(Humpty Dumpty had a fall .) = 0
Sequence to words
• Let’s decompose the sequence probability into probabilities of individual words
$$\begin{aligned}
P(w_1 \ldots w_n) &= P(w_1)\,P(w_2 \ldots w_n \mid w_1)\\
&= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \ldots w_n \mid w_1 w_2)\\
&= P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2)\cdots P(w_n \mid w_1 \ldots w_{n-1})\\
&= \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1})
\end{aligned}$$

Chain rule: $P(x, y) = P(x)\,P(y \mid x)$
Unigram model
• Ignore the context completely:

$$P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i)$$

• Example:

P(Humpty Dumpty sat on a wall .)
  ≈ P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)
MLE for unigram models
• Count the frequencies
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

$$P(w_k) = \frac{C(w_k)}{\sum_{w} C(w)}$$

P(Humpty) = 2/14
P(on) = 1/14
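A small Perl sketch of this estimator (a hypothetical helper; it assumes one whitespace-tokenized sentence per line on STDIN):

#!/usr/bin/perl
# Unigram MLE: P(w) = C(w) / sum over all words w' of C(w')
use strict; use warnings;
my (%count, $total);
while (<STDIN>) {
    for my $w (split) {      # split on whitespace
        $count{$w}++;
        $total++;
    }
}
printf "P(%s) = %d/%d\n", $_, $count{$_}, $total for sort keys %count;

On the two-sentence Humpty Dumpty corpus this prints, among other entries, P(Humpty) = 2/14 and P(on) = 1/14.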
Unigram model
• Problem: word order is not considered.

P(Humpty Dumpty sat on a wall .)
  ≈ P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)
P(Dumpty Humpty sat on a wall .)
  ≈ P(Humpty) P(Dumpty) P(sat) P(on) P(a) P(wall) P(.)

Both orderings receive the same probability.
Bigram model
• 1st-order Markov assumption:

$$P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

• Example (<s> denotes the beginning of the sentence):

P(Humpty Dumpty sat on a wall .)
  ≈ P(Humpty | <s>) P(Dumpty | Humpty) P(sat | Dumpty) P(on | sat)
    P(a | on) P(wall | a) P(. | wall)
MLE for bigram models
• Count frequencies
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

$$P(w_k \mid w_{k-1}) = \frac{C(w_{k-1}\, w_k)}{C(w_{k-1})}$$

P(sat | Dumpty) = 1/2
P(had | Dumpty) = 1/2
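A corresponding sketch for the bigram estimator (again hypothetical; <s> marks the beginning of each sentence, as in the example above):

#!/usr/bin/perl
# Bigram MLE: P(w_k | w_{k-1}) = C(w_{k-1} w_k) / C(w_{k-1})
use strict; use warnings;
my (%bigram, %context);
while (<STDIN>) {
    my @w = ('<s>', split);                  # prepend the start-of-sentence marker
    for my $i (1 .. $#w) {
        $bigram{"$w[$i-1] $w[$i]"}++;        # C(w_{k-1} w_k)
        $context{$w[$i-1]}++;                # C(w_{k-1})
    }
}
for my $bg (sort keys %bigram) {
    my ($prev, $cur) = split ' ', $bg;
    printf "P(%s | %s) = %d/%d\n", $cur, $prev, $bigram{$bg}, $context{$prev};
}

On the toy corpus this gives P(sat | Dumpty) = 1/2 and P(had | Dumpty) = 1/2, as above.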
N-gram models
• (n−1)-th order Markov assumption
• Modeling with a large n
– Pros: rich contextual information
– Cons: sparseness (unreliable probability estimation)
$$P(w_1 \ldots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-n+1} \ldots w_{i-1})$$
Web 1T 5-gram data
• Released from Google
– http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
– 1 trillion word tokens of text from Web pages
• Examples
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
: :
Smoothing
• Problems with the maximum likelihood estimation method
– Events that are not observed in the training corpus are given zero probabilities
– Unreliable estimates when the counts are small
Add-One Smoothing
• Bigram counts table
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

Bigram counts C(w_{k-1} w_k) — rows give the first word w_{k-1}, columns the second word w_k:
Humpty Dumpty sat on a wall had great fall .
Humpty 0 2 0 0 0 0 0 0 0 0
Dumpty 0 0 1 0 0 0 1 0 0 0
sat 0 0 0 1 0 0 0 0 0 0
on 0 0 0 0 1 0 0 0 0 0
a 0 0 0 0 0 1 0 1 0 0
wall 0 0 0 0 0 0 0 0 0 1
had 0 0 0 0 1 0 0 0 0 0
great 0 0 0 0 0 0 0 0 1 0
fall 0 0 0 0 0 0 0 0 0 1
. 0 0 0 0 0 0 0 0 0 0
Add-One Smoothing
• Add one to all counts
Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .
Humpty Dumpty sat on a wall had great fall .
Humpty 1 3 1 1 1 1 1 1 1 1
Dumpty 1 1 2 1 1 1 2 1 1 1
sat 1 1 1 2 1 1 1 1 1 1
on 1 1 1 1 2 1 1 1 1 1
a 1 1 1 1 1 2 1 2 1 1
wall 1 1 1 1 1 1 1 1 1 2
had 1 1 1 1 2 1 1 1 1 1
great 1 1 1 1 1 1 1 1 2 1
fall 1 1 1 1 1 1 1 1 1 2
. 1 1 1 1 1 1 1 1 1 1
Add-One Smoothing
• V: the size of the vocabulary
$$P(w_k \mid w_{k-1}) = \frac{C(w_{k-1}\, w_k) + 1}{C(w_{k-1}) + V}$$

Corpus:
Humpty Dumpty sat on a wall .
Humpty Dumpty had a great fall .

$$P(\mathrm{sat} \mid \mathrm{Dumpty}) = \frac{1 + 1}{2 + 10}$$
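A sketch of the smoothed estimator on top of the same bigram counts (hypothetical; here V is taken to be the number of distinct word types seen in the corpus, which is 10 for the toy corpus, matching the example above):

#!/usr/bin/perl
# Add-one (Laplace) smoothing: P(w_k | w_{k-1}) = (C(w_{k-1} w_k) + 1) / (C(w_{k-1}) + V)
use strict; use warnings;
my (%bigram, %context, %vocab);
while (<STDIN>) {
    my @w = ('<s>', split);
    $vocab{$_}++ for @w[1 .. $#w];           # vocabulary = word types seen in the corpus
    for my $i (1 .. $#w) {
        $bigram{"$w[$i-1] $w[$i]"}++;
        $context{$w[$i-1]}++;
    }
}
my $V = keys %vocab;

sub p_add_one {
    my ($prev, $cur) = @_;
    return (($bigram{"$prev $cur"} // 0) + 1) / (($context{$prev} // 0) + $V);
}
printf "P(sat | Dumpty) = %.3f\n", p_add_one('Dumpty', 'sat');   # (1+1)/(2+10)
printf "P(had | wall)   = %.3f\n", p_add_one('wall', 'had');     # (0+1)/(1+10): no longer zero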
Part-of-speech tagging
• Assigns a part-of-speech tag to each word in the sentence.
• Part-of-speech tags
NN: Noun                      IN: Preposition
NNP: Proper noun              VBD: Verb, past tense
DT: Determiner                VBN: Verb, past participle
:                             :

Paul/NNP Krugman/NNP ,/, a/DT professor/NN at/IN Princeton/NNP University/NNP ,/, was/VBD
awarded/VBN the/DT Nobel/NNP Prize/NNP in/IN Economics/NNP on/IN Monday/NNP .
Part-of-speech tagging is not easy
• Parts-of-speech are often ambiguous
• We need to look at the context
• But how?
I have to go there.   (go: verb)
I had a go at it.     (go: noun)
Writing rules for part-of-speech tagging
• If the previous word is “to”, then it’s a verb.
• If the previous word is “a”, then it’s a noun.
• If the next word is …
:
Writing rules manually is impossible
I have to go there.   (go: verb)
I had a go at it.     (go: noun)
Learning from examples
The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ
supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN
…
(Diagram: the annotated corpus is used to train a machine learning algorithm, which then tags unseen text.)

Unseen text:    We demonstrate that …
Tagged output:  We/PRP demonstrate/VBP that/IN …
Part-of-speech tagging
• Input:
– Word sequence W = w_1 ... w_n
• Output (prediction):
– The most probable tag sequence T = t_1 ... t_n

$$\hat{T} = \arg\max_{T \in \tau} P(T \mid W) = \arg\max_{T \in \tau} \frac{P(T)\,P(W \mid T)}{P(W)} = \arg\max_{T \in \tau} P(T)\,P(W \mid T)$$

(P(W) is constant with respect to T)
Assumptions
• Assume the word emission probability depends only on its tag
$$\begin{aligned}
P(W \mid T) &= P(w_1 \ldots w_n \mid t_1 \ldots t_n)\\
&= P(w_1 \mid t_1 \ldots t_n)\,P(w_2 \mid w_1, t_1 \ldots t_n)\,P(w_3 \mid w_1 w_2, t_1 \ldots t_n)\cdots\\
&= \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}, t_1 \ldots t_n)
\end{aligned}$$

With this assumption:

$$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)$$
Part-of-speech tagging with Hidden Markov Models
• Assume the first-order Markov assumption for tags:

$$P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

• Then, the most probable tag sequence can be expressed as

$$\hat{T} = \arg\max_{T \in \tau} P(T)\,P(W \mid T) = \arg\max_{T \in \tau} \prod_{i=1}^{n} \underbrace{P(t_i \mid t_{i-1})}_{\text{transition probability}}\;\underbrace{P(w_i \mid t_i)}_{\text{emission probability}}$$
Learning from training data
Training data (tagged corpus):

The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ
supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN
…

Maximum likelihood estimation:

transition probability:  $$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}\, t_i)}{C(t_{i-1})}$$

emission probability:  $$P(w_i \mid t_i) = \frac{C(w_i,\, t_i)}{C(t_i)}$$
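A Perl sketch of this estimation step (hypothetical; it expects the training corpus as word/TAG tokens on STDIN, e.g. "The/DT involvement/NN of/IN ..."):

#!/usr/bin/perl
# MLE of HMM parameters from a tagged corpus given as word/TAG tokens.
use strict; use warnings;
my (%trans, %trans_ctx, %emit, %emit_ctx);
while (<STDIN>) {
    my $prev = '<s>';                              # tag context at the start of a sentence
    for my $pair (split) {
        my ($word, $tag) = $pair =~ m{^(.+)/([^/]+)$} or next;
        $trans{"$prev $tag"}++;                    # C(t_{i-1} t_i)
        $trans_ctx{$prev}++;                       # C(t_{i-1})
        $emit{"$tag $word"}++;                     # C(w_i, t_i)
        $emit_ctx{$tag}++;                         # C(t_i)
        $prev = $tag;
    }
}
# transition probability P(t_i | t_{i-1}) and emission probability P(w_i | t_i)
sub p_trans { my ($prev, $t) = @_; ($trans{"$prev $t"} // 0) / $trans_ctx{$prev} }
sub p_emit  { my ($t, $w)    = @_; ($emit{"$t $w"}    // 0) / $emit_ctx{$t} }
# example lookups (assuming these tags and words occur in the training data)
printf "P(NN | DT) = %.3f\n", p_trans('DT', 'NN');
printf "P(involvement | NN) = %.4f\n", p_emit('NN', 'involvement');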
Examples from the Wall Street Journal corpus (1)
P(NN | DT) = 0.47
P(JJ | DT) = 0.22
P(NNP | DT) = 0.11
P(NNS | DT) = 0.07
P(CD | DT) = 0.02
POS Tag Description Examples
DT determiner a, the
NN noun, singular or mass company
NNS noun plural companies
NNP proper nouns, singular Nasdaq
IN preposition/subordinating conjunction in, of, like
VBD verb, past tense took
: : :
P(IN | NN) = 0.25
P(NN | NN) = 0.12
P(, | NNP) = 0.11
P(. | NNP) = 0.08
P(VBD | NNP) = 0.05
Examples from the Wall Street Journal corpus (2)
P(% | NN) = 0.037
P(company | NN) = 0.019
P(year | NN) = 0.0167
P(market | NN) = 0.0155
P(share | NN) = 0.0107
P(said | VBD) = 0.188
P(was | VBD) = 0.131
P(were | VBD) = 0.064
P(had | VBD) = 0.056
P(rose | VBD) = 0.022
Part-of-speech tagging
• Input:
– Word sequence W = w_1 ... w_n
• Output (prediction):
– The most probable tag sequence T = t_1 ... t_n

$$\hat{T} = \arg\max_{T \in \tau} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$$

(HMM diagram: a chain of hidden tags t_1 → t_2 → t_3 → … → t_n, each tag t_i emitting the word w_i)
Finding the best sequence
• Naive approach:
– Enumerate all possible tag sequences with their probabilities and select the one that gives the highest probability

(Diagram: at each of the n positions w_1 … w_n, the tag t_i can be any of DT, NN, NNS, NNP, VB, …, so the number of candidate tag sequences is exponential in the length of the sentence.)
Finding the best sequence
• if you write them down…
P(DT | <s>) P(w_1 | DT)  · P(DT | DT) P(w_2 | DT)  · … · P(DT | DT) P(w_n | DT)
P(NN | <s>) P(w_1 | NN)  · P(DT | NN) P(w_2 | DT)  · … · P(DT | DT) P(w_n | DT)
P(NNS | <s>) P(w_1 | NNS) · P(DT | NNS) P(w_2 | DT) · … · P(DT | DT) P(w_n | DT)
:
P(DT | <s>) P(w_1 | DT)  · P(NN | DT) P(w_2 | NN)  · … · P(DT | DT) P(w_n | DT)
P(NN | <s>) P(w_1 | NN)  · P(NN | NN) P(w_2 | NN)  · … · P(DT | DT) P(w_n | DT)
P(NNS | <s>) P(w_1 | NNS) · P(NN | NNS) P(w_2 | NN) · … · P(DT | DT) P(w_n | DT)
:
A lot of redundancy in computation
There should be a way to do it more efficiently
Viterbi algorithm (1)
• Let V_k(t_k) be the probability of the best tag sequence for w_1 ... w_k ending with the tag t_k:

$$V_k(t_k) \equiv \max_{t_1, \ldots, t_{k-1}} \prod_{i=1}^{k} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$$

$$\begin{aligned}
V_k(t_k) &= \max_{t_{k-1}} \max_{t_1, \ldots, t_{k-2}} \prod_{i=1}^{k} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)\\
&= \max_{t_{k-1}} \Big[\, P(t_k \mid t_{k-1})\,P(w_k \mid t_k) \max_{t_1, \ldots, t_{k-2}} \prod_{i=1}^{k-1} P(t_i \mid t_{i-1})\,P(w_i \mid t_i) \Big]\\
&= \max_{t_{k-1}} P(t_k \mid t_{k-1})\,P(w_k \mid t_k)\,V_{k-1}(t_{k-1})
\end{aligned}$$
We can compute them in a recursive manner!
Viterbi algorithm (2)
$$V_k(t_k) = \max_{t_{k-1}} P(t_k \mid t_{k-1})\,P(w_k \mid t_k)\,V_{k-1}(t_{k-1})$$

(Trellis diagram: at positions k−1 and k, one node per tag — DT, NN, NNS, NNP, … — with every tag at position k−1 connected to every tag at position k.)

Beginning of the sentence: V_0(<s>) = 1

Candidates for V_k(DT):
P(DT | DT)  · P(w_k | DT) · V_{k-1}(DT)
P(DT | NN)  · P(w_k | DT) · V_{k-1}(NN)
P(DT | NNS) · P(w_k | DT) · V_{k-1}(NNS)
P(DT | NNP) · P(w_k | DT) · V_{k-1}(NNP)
:
Viterbi algorithm (3)
• Left to right
– Save backward pointers to recover the best tag sequence
Complexity: (number of tags)² × (sentence length)
(Trellis for the sentence "He has opened it": Noun and Verb nodes at each of the four word positions, with backward pointers from each node to its best predecessor.)
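Putting the pieces together, a compact Perl sketch of the Viterbi recursion with backpointers (purely illustrative: the transition and emission probabilities below are made-up toy values, not estimates from a real corpus, and at least one surviving path per position is assumed):

#!/usr/bin/perl
# Viterbi decoding: find the tag sequence maximizing prod_i P(t_i | t_{i-1}) P(w_i | t_i).
use strict; use warnings;

# Toy parameters (in practice these are the MLE estimates from a tagged corpus).
my %trans = ('<s> PRP' => 0.3, 'PRP VBZ' => 0.4, 'VBZ VBN' => 0.3, 'VBN PRP' => 0.2);
my %emit  = ('PRP He' => 0.1, 'VBZ has' => 0.2, 'VBN opened' => 0.05, 'PRP it' => 0.1);
my @tags  = qw(PRP VBZ VBN);

sub viterbi {
    my @words = @_;
    my @V   = ({ '<s>' => 1 });        # V[k]{t}: score of the best sequence for w_1..w_k ending in t
    my @ptr = ({});                    # backpointers to recover the best sequence
    for my $k (1 .. @words) {
        my $w = $words[$k - 1];
        for my $t (@tags) {                                    # current tag
            my ($best, $arg) = (0, undef);
            for my $prev (keys %{ $V[$k - 1] // {} }) {        # previous tag
                my $score = $V[$k - 1]{$prev}
                          * ($trans{"$prev $t"} // 0)
                          * ($emit{"$t $w"}     // 0);
                ($best, $arg) = ($score, $prev) if $score > $best;
            }
            if (defined $arg) { $V[$k]{$t} = $best; $ptr[$k]{$t} = $arg; }
        }
    }
    # Pick the best final tag, then follow the backpointers from right to left.
    my ($tag) = sort { $V[-1]{$b} <=> $V[-1]{$a} } keys %{ $V[-1] // {} };
    my @best_seq;
    for (my $k = $#V; $k >= 1; $k--) {
        unshift @best_seq, $tag;
        $tag = $ptr[$k]{$tag};
    }
    return @best_seq;
}

print join(' ', viterbi(qw(He has opened it))), "\n";   # prints: PRP VBZ VBN PRP

Each position considers every (previous tag, current tag) pair, which is where the (number of tags)² × (sentence length) complexity above comes from.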
Topics
• Sentence splitting
• Tokenization
• Maximum likelihood estimation (MLE)
• Language models
– Unigram
– Bigram
– Smoothing
• Hidden Markov models (HMMs)
– Part-of-speech tagging
– Viterbi algorithm