NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics
Natural Language Processing
www.vkedco.blogspot.com
N-Grams, Markov & Hidden Markov Models, FOPC Basics
Vladimir Kulyukin
Outline
● N-Grams
● Markov & Hidden Markov Models (HMMs)
● First-Order Predicate Calculus Basics (FOPC)
N-Grams
Introduction
● Word prediction is a fundamental task of spelling error
correction, speech recognition, augmentative
communication, and many other areas of NLP
● Word prediction can be trained on various text corpora
● N-Gram is a word prediction model that uses the
previous N-1 words to predict the next word
● In statistical NLP, N-Gram is called a language model
(LM) or grammar
Word Prediction Examples
● It happened a long time …
● She wants to make a collect phone …
● I need to open a bank …
● Nutrition labels include serving …
● Nutrition labels include amounts of total …
Word Prediction Examples
● It happened a long time ago.
● She wants to make a collect phone call.
● I need to open a bank account.
● Nutrition labels include serving sizes.
● Nutrition labels include amounts of total fat / carbohydrate.
Augmentative Communication
● Many people with physical disabilities experience
problems communicating with other people: many of
them cannot speak or type
● Word prediction models can productively augment their
communication efforts by automatically suggesting the
next word to speak or type
● For example, people with disabilities can use simple
hand movements to choose next words to speak or type
Real-Word Spelling Errors
● Real-word spelling errors are real words incorrectly used
● Examples:
– They are leaving in about fifteen minuets to go to her house.
– The study was conducted mainly be John Black.
– The design an construction of the system will take more than a year.
– Hopefully, all with continue smoothly in my absence.
– I need to notified the bank of this problem.
– He is trying to fine out.
K. Kukich, "Techniques for Automatically Correcting Words in Text." ACM
Computing Surveys, Vol. 24, No. 4, Dec. 1992.
Word Sequence Probabilities
● Word prediction is based on evaluating probabilities of specific
word sequences
● To estimate those probabilities we need a corpus (speech or text)
● We also need to determine what is counted and how:
the most important decision is how to handle punctuation
marks and capitalization (text) or pauses like uh and um
(speech)
● What is counted and how depends on the task at hand (e.g.,
punctuation is more important to grammar checking than
spelling correction)
Wordforms, Lemmas, Types, Tokens
● Wordform is an alphanumerical sequence actually used
in the corpus (e.g., begin, began, begun)
● Lemma is a set of word forms (e.g., {begin, began,
begun})
● Token is a synonym of wordform
● Type is a dictionary entry: for example, a dictionary
lists only begin as the main entry for the lemma
{begin, began, begun}
Unsmoothed N-Grams
Notation: A Sequence of N Words
● A sequence of n words is written w_1^n = w_1 w_2 ... w_n
● Example: ‘I understand this algorithm.’
– W1 = ‘I’
– W2 = ‘understand’
– W3 = ‘this’
– W4 = ‘algorithm’
– W5 = ‘.’
Probabilities of Word Sequences
P(w_1^n) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1^2) * ... * P(w_n | w_1^{n-1})
         = Π_{k=1}^{n} P(w_k | w_1^{k-1})
Example:
P(‘I understand this algorithm.’) =
P(‘I’) *
P(‘understand’|‘I’) *
P(‘this’|‘I understand’) *
P(‘algorithm’|‘I understand this’) *
P(‘.’|‘I understand this algorithm’)
Probabilities of Word Sequences
● How difficult is it to compute the required probabilities?
– P(‘I’) - this is easy to compute
– P(‘understand’|‘I’) – harder but quite feasible
– P(‘this’|‘I understand’) – harder but feasible
– P(‘algorithm’|‘I understand this’) – really hard
– P(‘.’|‘I understand this algorithm’) – possible but
impractical
Probability Approximation
● Markov assumption: we can estimate the probability of a word
from only the previous N-1 words
● If N = 1, we have the unigram model (aka 0th-order Markov
model)
● If N = 2, we have the bigram model (aka 1st-order Markov
model)
● If N = 3, we have the trigram model (aka 2nd-order Markov
model)
● N can be greater, but higher values are rare because they are
hard to compute reliably
Bigram Probability Approximation
● <S> is the start of sentence mark
● P(‘I understand this algorithm.’) =
P(‘I’|<S>) *
P(‘understand’|‘I’) *
P(‘this’|‘understand’) *
P(‘algorithm’|‘this’) *
P(‘.’ |‘algorithm’)
Trigram Probability Approximation
● <S> is the start of sentence mark
● P(‘I understand this algorithm.’) =
P(‘I’|<S><S>) *
P(‘understand’|‘<S>I’) *
P(‘this’|‘I understand’) *
P(‘algorithm’|‘understand this’) *
P(‘.’ |‘this algorithm’)
N-Gram Approximation
P(w_n | w_1^{n-1}) ≈ P(w_n),                               N = 1 (unigram)
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1}),                     N = 2 (bigram)
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-2} w_{n-1}),             N = 3 (trigram)
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-3} w_{n-2} w_{n-1}),     N = 4
Bigram Approximation
P(w_1^n) ≈ Π_{k=1}^{n} P(w_k | w_{k-1})
Bigram Approximation
P('I'|<S>) = 0.25
P('understand'|'I') = 0.3
P('this'|'understand') = 0.05
P('algorithm'|'this') = 0.7
P('.'|'algorithm') = 0.45
P(‘I understand this algorithm.’) =
P(‘I’|<S>) * P(‘understand’|‘I’) * P(‘this’|‘understand’) * P(‘algorithm’|‘this’) * P(‘.’ |‘algorithm’) =
0.25 * 0.3 * 0.05 * 0.7 * 0.45 =
0.00118125
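As a quick illustration (not part of the original slides), here is a minimal Python sketch of the computation above; the dictionary of bigram probabilities simply hard-codes the values from this slide:

# Bigram probabilities from the slide, keyed by (previous word, word)
bigram_probs = {
    ('<S>', 'I'): 0.25,
    ('I', 'understand'): 0.3,
    ('understand', 'this'): 0.05,
    ('this', 'algorithm'): 0.7,
    ('algorithm', '.'): 0.45,
}

def bigram_sentence_prob(tokens, probs):
    """Multiply P(w_k | w_{k-1}) over the sentence, starting from <S>."""
    prob = 1.0
    prev = '<S>'
    for token in tokens:
        prob *= probs.get((prev, token), 0.0)  # unseen bigrams get 0 (unsmoothed)
        prev = token
    return prob

print(bigram_sentence_prob(['I', 'understand', 'this', 'algorithm', '.'], bigram_probs))
# about 0.00118125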
Logprobs
● If we compute raw probability products, we risk the problem of
numerical underflow: at some point all probability products
become zero, especially on long word sequences
● To address this problem, the probabilities are computed in the
logarithmic space: instead of computing the product of
probabilities, the sum of logarithms of those probabilities is
computed
● log(P(A)P(B)) = log(P(A)) + log(P(B))
● The original product can be recovered by inverting the logarithm:
P(A)P(B) = log⁻¹(log(P(A)) + log(P(B)))
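A small sketch of the same idea in log space, reusing the five probabilities from the bigram example above:

import math

# Sum the logs of the probabilities instead of multiplying the probabilities
probs = [0.25, 0.3, 0.05, 0.7, 0.45]       # the bigram probabilities from the example
logprob = sum(math.log(p) for p in probs)  # log(P(A)P(B)...) = log P(A) + log P(B) + ...
print(logprob)            # about -6.741
print(math.exp(logprob))  # the original product recovered: about 0.00118125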
Bigram Computation
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_{i=1}^{V} C(w_{n-1} w_i) = C(w_{n-1} w_n) / C(w_{n-1})

where C(x) is the count of x in the corpus, V is the dictionary size, and
Σ_{i=1}^{V} C(w_{n-1} w_i) = C(w_{n-1})
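A minimal sketch of this MLE bigram estimate in Python; the toy corpus and its tokenization are illustrative assumptions, not taken from the slides:

from collections import Counter

def bigram_mle(tokens):
    """Estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) from a token list."""
    unigram_counts = Counter(tokens[:-1])             # counts of the contexts w_{n-1}
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # counts of the pairs w_{n-1} w_n
    return {(prev, word): count / unigram_counts[prev]
            for (prev, word), count in bigram_counts.items()}

corpus = '<S> I understand this algorithm . <S> I understand this proof .'.split()
probs = bigram_mle(corpus)
print(probs[('understand', 'this')])  # 1.0: 'this' always follows 'understand' here
print(probs[('this', 'algorithm')])   # 0.5: 'algorithm' follows 'this' in 1 of 2 cases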
N-Gram Generalization
P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1}),   N ≥ 1
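The same counting idea extends to arbitrary N; a sketch under the same illustrative-corpus assumption as the bigram example:

from collections import Counter

def ngram_mle(tokens, n):
    """Estimate P(w_n | previous N-1 words) = C(history w_n) / C(history)."""
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))
    history_counts = Counter()
    for gram, count in ngrams.items():
        history_counts[gram[:-1]] += count
    return {gram: count / history_counts[gram[:-1]] for gram, count in ngrams.items()}

corpus = '<S> I understand this algorithm . <S> I understand this proof .'.split()
print(ngram_mle(corpus, 3)[('I', 'understand', 'this')])  # 1.0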
Maximum Likelihood Estimation
● This N-Gram probability estimation is known as the Maximum
Likelihood Estimation (MLE)
● It is the MLE because it always maximizes the probability of the
training set (the statistics of the training set)
● If a word W occurs 5 times in a training corpus of 100 words, its
probability of occurrence is P(W) = 5/100
● This is not necessarily a good estimate of P(W) for other corpora, but
it is the estimate that maximizes the probability of the training corpus
Smoothed N-Grams
Unsmoothed N-Gram Problem
● Since any corpus is finite, some valid N-Grams will be missing from
any corpus used for computing N-Gram counts
● To put it differently, an N-Gram matrix for any corpus is likely
to be sparse: it will have a large number of possible N-Grams
with zero counts
● The MLE methods produce unreliable estimates when counts are
greater than 0 but still small (small is relative)
● Smoothing is a set of techniques used to overcome zero or low
counts
Add-One Smoothing
One way to smooth is to add one to each N-Gram count and add the
dictionary size (V) to the normalizing denominator
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})                    // unsmoothed
P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)       // add-one smoothed
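A sketch of add-one smoothing applied to bigram counts; the toy corpus, and the choice to count <S> as a dictionary entry, are illustrative assumptions:

from collections import Counter

def add_one_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    """P*(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

tokens = '<S> I understand this algorithm .'.split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens[:-1])
V = len(set(tokens))  # <S> is counted as a dictionary entry here, for simplicity

# A bigram never seen in the corpus now gets a small non-zero probability
print(add_one_bigram_prob('understand', 'algorithm', bigram_counts, unigram_counts, V))
# (0 + 1) / (1 + 6) ≈ 0.143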
A Problem with Add-One Smoothing
● Much of the total probability mass moves to the N-
Grams with zero counts
● Researchers attribute it to the arbitrary choice of the
value of 1
● Add-One smoothing appears to be worse than other
methods at predicting N-Grams with zero counts
● Some research indicates that add-one smoothing is no
better than no smoothing
Good-Turing Discounting
● The probability mass assigned to N-Grams with zero or low counts is
re-estimated using the counts of N-Grams that occur more often
● Let Nc be the number of N-Grams that occur c times in a
corpus
● N0 is the number of N-Grams that occur 0 times
● N1 is the number of N-Grams that occur once
● N2 is the number of N-Grams that occur twice
Good-Turing Discounting
Let C(w_1 … w_n) = c be the count of some N-Gram w_1 … w_n. Then the new
count smoothed by the GTD, C*(w_1 … w_n), is:

C*(w_1 … w_n) = (c + 1) * N_{c+1} / N_c
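A sketch of the Good-Turing count adjustment; the tiny table of raw counts is made up for illustration, and keeping the raw count when N_{c+1} = 0 is one common fallback rather than the only option:

from collections import Counter

def good_turing_counts(ngram_counts):
    """Return GTD-smoothed counts c* = (c + 1) * N_{c+1} / N_c for observed N-Grams."""
    # N_c: how many distinct N-Grams occur exactly c times
    freq_of_freq = Counter(ngram_counts.values())
    smoothed = {}
    for gram, c in ngram_counts.items():
        if freq_of_freq.get(c + 1, 0) > 0:
            smoothed[gram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            smoothed[gram] = c  # keep the raw count when N_{c+1} is 0
    return smoothed

counts = {('a', 'b'): 1, ('b', 'c'): 1, ('c', 'd'): 2}
print(good_turing_counts(counts))
# counts of 1 become c* = (1 + 1) * N_2 / N_1 = 2 * 1 / 2 = 1.0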
N-Gram Vectors
● N-Grams can be computed over any finite symbolic set
● Such symbolic sets are called alphabets and can consist of
wordforms, waveforms, individual letters, etc.
● The choice of the symbols in the alphabet depends on
the application
● Regardless of the application, the objective is to take
an input sequence over a specific alphabet and
compute its N-Gram frequency vector
Dimensions of N-Gram Vectors
● Let A be an alphabet and n > 0 be the size of the N-Gram
● The number of N-Gram dimensions is |A|^n
● Suppose that the alphabet has 26 characters and we
compute trigrams over that alphabet; then the number
of possible trigrams, i.e., the dimension of the N-Gram
frequency vectors, is 26^3 = 17,576
● A practical implication is that N-Gram frequency vectors
even for low values of n are sparse
Example
● Suppose the alphabet A = {a, <space>, <start>}
● The number of possible bigrams (n = 2) is |A|^2 = 9:
– 1) aa; 2) a<start>; 3) a<space>; 4) <start><start>; 5) <start>a;
6) <start><space>; 7) <space><space> ; 8) <space>a;
9) <space><start>
● Suppose the input is ‘a a’
● The input’s N-Grams are: <start>a, a<space>, <space>a
● So the input’s N-Gram vector is: (0, 0, 1, 0, 1, 0, 0, 1, 0)
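A sketch of this bigram-vector computation, hard-coding the nine bigram dimensions in the order listed on the slide:

# The nine bigram dimensions, in the order listed on the slide
dims = [
    ('a', 'a'), ('a', '<start>'), ('a', '<space>'),
    ('<start>', '<start>'), ('<start>', 'a'), ('<start>', '<space>'),
    ('<space>', '<space>'), ('<space>', 'a'), ('<space>', '<start>'),
]

def bigram_vector(symbols, dims):
    """Count each adjacent symbol pair of the input in the fixed dimension order."""
    pairs = list(zip(symbols, symbols[1:]))
    return tuple(pairs.count(dim) for dim in dims)

# The input 'a a' becomes the symbol sequence <start>, a, <space>, a
print(bigram_vector(['<start>', 'a', '<space>', 'a'], dims))
# (0, 0, 1, 0, 1, 0, 0, 1, 0)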
Markov & Hidden Markov Models
Markov Models
Markov Models are closely related to N-Grams: the basic
idea is to estimate the conditional probability of the n-th
observation given a sequence of n-1 observations:

P(w_n | w_1 ... w_{n-1})
Markov Assumption
P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-1})                   // 1st order
P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-2} w_{n-1})           // 2nd order
P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-3} w_{n-2} w_{n-1})   // 3rd order
● If n = 5 and the size of the observation alphabet is 3, we
need to collect statistics over 3^5 = 243 sequence types
● If n = 2 and the size of the observation alphabet is 3, we
need to collect statistics over 3^2 = 9 sequence types
● So the number of observations we condition on matters
Weather Example 01
Weather Today vs. Weather Tomorrow (rows: today, columns: tomorrow)

Today \ Tomorrow   Sunny   Rainy   Foggy
Sunny              0.8     0.05    0.15
Rainy              0.2     0.6     0.2
Foggy              0.2     0.3     0.5
P(w_2 = Sunny, w_3 = Rainy | w_1 = Sunny)
= P(w_3 = Rainy | w_2 = Sunny, w_1 = Sunny) * P(w_2 = Sunny | w_1 = Sunny)
= P(w_3 = Rainy | w_2 = Sunny) * P(w_2 = Sunny | w_1 = Sunny)    // Markov assumption
= 0.05 * 0.8 = 0.04
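A sketch of the computation above, storing the transition table as a nested dictionary (rows index today's weather, columns tomorrow's):

# Transition probabilities P(tomorrow | today) from the table above
transition = {
    'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
    'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
    'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
}

# P(w2 = Sunny, w3 = Rainy | w1 = Sunny) under the 1st-order Markov assumption
p = transition['Sunny']['Sunny'] * transition['Sunny']['Rainy']
print(p)  # 0.8 * 0.05 = 0.04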
Weather Example 02
Weather Today vs. Weather Tomorrow (rows: today, columns: tomorrow)

Today \ Tomorrow   Sunny   Rainy   Foggy
Sunny              0.8     0.05    0.15
Rainy              0.2     0.6     0.2
Foggy              0.2     0.3     0.5
P(w_3 = Rainy | w_1 = Foggy)
= P(w_3 = Rainy, w_2 = Sunny | w_1 = Foggy)
  + P(w_3 = Rainy, w_2 = Rainy | w_1 = Foggy)
  + P(w_3 = Rainy, w_2 = Foggy | w_1 = Foggy)
= P(w_3 = Rainy | w_2 = Sunny) * P(w_2 = Sunny | w_1 = Foggy)
  + P(w_3 = Rainy | w_2 = Rainy) * P(w_2 = Rainy | w_1 = Foggy)
  + P(w_3 = Rainy | w_2 = Foggy) * P(w_2 = Foggy | w_1 = Foggy)
= 0.05 * 0.2 + 0.6 * 0.3 + 0.3 * 0.5 = 0.34
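A sketch of the same marginalization; the transition dictionary is repeated so the snippet stands alone:

# Transition probabilities P(tomorrow | today), as in the previous sketch
transition = {
    'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
    'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
    'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
}

# P(w3 = Rainy | w1 = Foggy): sum over every possible weather w2 on day 2
p = sum(transition['Foggy'][w2] * transition[w2]['Rainy']
        for w2 in ('Sunny', 'Rainy', 'Foggy'))
print(p)  # 0.2*0.05 + 0.3*0.6 + 0.5*0.3 = 0.34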
Weather Example 03
Weather vs. Umbrella: P(Umbrella | Weather)

         Umbrella
Sunny    0.8
Rainy    0.2
Foggy    0.2
P(w_1 ... w_n | u_1 ... u_n) = P(u_1 ... u_n | w_1 ... w_n) * P(w_1 ... w_n) / P(u_1 ... u_n)
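A rough sketch of the Bayes-rule formulation above for a two-day observation sequence. The uniform prior over the first day's weather and the two-day umbrella observations are assumptions made only for illustration; P(u_1 ... u_n) is dropped because it does not change which weather sequence scores highest:

from itertools import product

# Transition probabilities P(tomorrow | today) and observation probabilities
# P(umbrella | weather) from the slides; the uniform prior is an assumption.
transition = {
    'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
    'Rainy': {'Sunny': 0.2, 'Rainy': 0.6, 'Foggy': 0.2},
    'Foggy': {'Sunny': 0.2, 'Rainy': 0.3, 'Foggy': 0.5},
}
p_umbrella = {'Sunny': 0.8, 'Rainy': 0.2, 'Foggy': 0.2}
prior = {'Sunny': 1 / 3, 'Rainy': 1 / 3, 'Foggy': 1 / 3}

observations = [True, True]  # an umbrella was seen on both days (illustrative)

def score(weather_seq):
    """P(u_1..u_n | w_1..w_n) * P(w_1..w_n), proportional to P(w_1..w_n | u_1..u_n)."""
    p = prior[weather_seq[0]]
    for today, tomorrow in zip(weather_seq, weather_seq[1:]):
        p *= transition[today][tomorrow]
    for weather, saw_umbrella in zip(weather_seq, observations):
        p *= p_umbrella[weather] if saw_umbrella else 1 - p_umbrella[weather]
    return p

# Brute-force the most probable hidden weather sequence for the observations
best = max(product(transition, repeat=len(observations)), key=score)
print(best, score(best))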
Speech Recognition
w is a sequence of tokens
L is a language
y is an acoustic signal
ŵ = argmax_{w ∈ L} P(w | y)
  = argmax_{w ∈ L} P(y | w) * P(w) / P(y)
  = argmax_{w ∈ L} P(y | w) * P(w)
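A toy sketch of this noisy-channel decision rule; the candidate transcriptions and their acoustic and language-model scores are made-up numbers, chosen only to show the argmax:

# Made-up candidate sentences with made-up acoustic scores P(y|w) and
# language-model scores P(w), only to illustrate the argmax
candidates = {
    'I understand this algorithm .':   {'acoustic': 0.002, 'lm': 0.0012},
    'eye understand this algorithm .': {'acoustic': 0.002, 'lm': 0.000004},
}

# argmax over w in L of P(y | w) * P(w); P(y) is the same for all w and is dropped
best = max(candidates, key=lambda w: candidates[w]['acoustic'] * candidates[w]['lm'])
print(best)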
FOPC Basics
Basic Notions
● Conceptualization = Objects + Relations
● Universe of Discourse is the set of objects in a
conceptualization
● Functions & Relations
Example
[Figure: a blocks-world scene with the objects a, b, c, d, e]
<{a, b, c, d, e}, {hat}, {on, above, clear, table}>
hat is a function; on, above, clear, and table are relations
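A sketch of this conceptualization as plain data structures; since the figure is not reproduced here, the particular on/above/clear/table facts and the hat pairs are illustrative assumptions:

# Universe of discourse: the objects in the conceptualization
universe = {'a', 'b', 'c', 'd', 'e'}

# 'hat' as a function: maps a block to the block directly on top of it
# (the particular pairs are assumed for illustration)
hat = {'b': 'a', 'd': 'c'}

# Relations as sets of tuples (also assumed for illustration)
on = {('a', 'b'), ('c', 'd')}
above = {('a', 'b'), ('c', 'd')}
clear = {('a',), ('c',), ('e',)}
table = {('b',), ('d',), ('e',)}

print(('a', 'b') in on)  # True: on(a, b) holds in this conceptualization
print(hat['b'])          # 'a': hat(b) = a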
References
● Ch. 6, D. Jurafsky & J. Martin. Speech and Language Processing.
Prentice Hall. ISBN 0-13-095069-6
● E. Fosler-Lussier. 1998. Markov Models and Hidden Markov
Models: A Brief Tutorial. ICSI, UC Berkeley