NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Natural Language Processing www.vkedco.blogspot.com N-Grams, Markov & Hidden Markov Models, FOPC Basics Vladimir Kulyukin


Transcript of NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Page 2: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Outline

● N-Grams

● Markov & Hidden Markov Models (HMMs)

● First-Order Predicate Calculus Basics (FOPC)

Page 3: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

N-Grams

Page 4: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Introduction

● Word prediction is a fundamental task in spelling error correction, speech recognition, augmentative communication, and many other areas of NLP

● Word prediction models can be trained on various text corpora

● An N-Gram is a word prediction model that uses the previous N-1 words to predict the next word

● In statistical NLP, an N-Gram model is called a language model (LM) or grammar

Page 5: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Word Prediction Examples

● It happened a long time …

● She wants to make a collect phone …

● I need to open a bank …

● Nutrition labels include serving …

● Nutrition labels include amounts of total …

Page 6: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Word Prediction Examples

● It happened a long time ago.

● She wants to make a collect phone call.

● I need to open a bank account.

● Nutrition labels include serving sizes.

● Nutrition labels include amounts of total fat | carbohydrate.

Page 7: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Augmentative Communication

● Many people with physical disabilities experience

problems communicating with other people: many of

them cannot speak or type

● Word prediction models can productively augment their

communication efforts by automatically suggesting the

next word to speak or type

● For example, people with disabilities can use simple

hand movements to choose next words to speak or type

Page 8: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Real-Word Spelling Errors

● Real-word spelling errors are real words incorrectly used

● Examples:

– They are leaving in about fifteen minuets to go to her house.

– The study was conducted mainly be John Black.

– The design an construction of the system will take more than a year.

– Hopefully, all with continue smoothly in my absence.

– I need to notified the bank of this problem.

– He is trying to fine out.

K. Kukich, "Techniques for Automatically Correcting Words in Text." ACM

Computing Surveys, Vol. 24, No. 4, Dec. 1992.

Page 9: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Word Sequence Probabilities

● Word prediction is based on evaluating probabilities of specific word sequences

● To estimate those probabilities we need a corpus (speech or text)

● We also need to determine what is counted and how: the most important decisions are how to handle punctuation marks and capitalization (text) or filled pauses like uh and um (speech)

● What is counted and how depends on the task at hand (e.g., punctuation is more important to grammar checking than to spelling correction)

Page 10: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Wordforms, Lemmas, Types, Tokens

● A wordform is an alphanumerical sequence actually used in the corpus (e.g., begin, began, begun)

● A lemma is a set of wordforms (e.g., {begin, began, begun})

● Token is a synonym of wordform

● A type is a dictionary entry: for example, a dictionary lists only begin as the main entry for the lemma {begin, began, begun}

Page 11: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Unsmoothed N-Grams

Page 12: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Notation: A Sequence of N Words

$w_1^n = w_1 w_2 \ldots w_n$ denotes a sequence of $n$ words

● Example: ‘I understand this algorithm.’

– $w_1$ = ‘I’

– $w_2$ = ‘understand’

– $w_3$ = ‘this’

– $w_4$ = ‘algorithm’

– $w_5$ = ‘.’

Page 13: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Probabilities of Word Sequences

$P(w_1^n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$

Example:

P(‘I understand this algorithm.’) =

P(‘I’) *

P(‘understand’|‘I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘I understand this’) *

P(‘.’|‘I understand this algorithm’)

Page 14: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Probabilities of Word Sequences

● How difficult is it to compute the required probabilities?

– P(‘I’) - this is easy to compute

– P(‘understand’|‘I’) – harder but quite feasible

– P(‘this’|‘I understand’) – harder but feasible

– P(‘algorithm’|‘I understand this’) – really hard

– P(‘.’|‘I understand this algorithm’) – possible but

impractical

Page 15: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Probability Approximation

● Markov assumption: we can estimate the probability of a word from only the previous N-1 words

● If N = 1, we have the unigram model (aka 0th-order Markov model)

● If N = 2, we have the bigram model (aka 1st-order Markov model)

● If N = 3, we have the trigram model (aka 2nd-order Markov model)

● N can be greater, but higher values are rare because they are hard to compute reliably

Page 16: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Bigram Probability Approximation

● <S> is the start of sentence mark

● P(‘I understand this algorithm.’) =

P(‘I’|<S>) *

P(‘understand’|‘I’) *

P(‘this’|‘understand’) *

P(‘algorithm’|‘this’) *

P(‘.’ |‘algorithm’)

Page 17: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Trigram Probability Approximation

● <S> is the start of sentence mark

● P(‘I understand this algorithm.’) =

P(‘I’|<S><S>) *

P(‘understand’|‘<S>I’) *

P(‘this’|‘I understand’) *

P(‘algorithm’|‘understand this’) *

P(‘.’ |‘this algorithm’)

Page 18: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

N-Gram Approximation

$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$, that is:

● N = 1 (unigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n)$

● N = 2 (bigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$

● N = 3 (trigram): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2} w_{n-1})$

● N = 4: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-3} w_{n-2} w_{n-1})$

Page 19: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Bigram Approximation

$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$

Page 20: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Bigram Approximation

P(‘I’ | <S>) = 0.25

P(‘understand’ | ‘I’) = 0.3

P(‘this’ | ‘understand’) = 0.05

P(‘algorithm’ | ‘this’) = 0.7

P(‘.’ | ‘algorithm’) = 0.45

P(‘I understand this algorithm.’) =

P(‘I’|<S>) * P(‘understand’|‘I’) * P(‘this’|‘understand’) * P(‘algorithm’|‘this’) * P(‘.’ |‘algorithm’) =

0.25 * 0.3 * 0.05 * 0.7 * 0.45 =

0.00118125
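
The following is a minimal Python sketch (an illustration added here, not from the slides) that reproduces the product above; the bigram probability table, the <S> mark, and the tokenized sentence are taken directly from this slide.

    # Bigram probabilities taken from the table above.
    bigram_p = {
        ('<S>', 'I'): 0.25,
        ('I', 'understand'): 0.3,
        ('understand', 'this'): 0.05,
        ('this', 'algorithm'): 0.7,
        ('algorithm', '.'): 0.45,
    }

    def sentence_probability(tokens, bigram_p):
        """Approximate P(tokens) as a product of bigram probabilities."""
        prob = 1.0
        for prev, cur in zip(['<S>'] + tokens, tokens):
            prob *= bigram_p[(prev, cur)]  # P(cur | prev)
        return prob

    print(sentence_probability(['I', 'understand', 'this', 'algorithm', '.'], bigram_p))
    # prints approximately 0.00118125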

Page 21: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Logprobs

● If we compute raw probability products, we risk numerical underflow: on long word sequences the product eventually becomes too small to represent and rounds to zero

● To address this problem, the probabilities are computed in logarithmic space: instead of computing the product of probabilities, we compute the sum of the logarithms of those probabilities

● log(P(A)P(B)) = log(P(A)) + log(P(B))

● The original product can be recovered by exponentiating the sum: P(A)P(B) = exp(log(P(A)) + log(P(B)))
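
A small Python sketch (an illustration, not part of the slides) of the same computation done in log space; it reuses the five bigram probabilities from the previous slide.

    import math

    # The five bigram probabilities from the previous slide.
    probs = [0.25, 0.3, 0.05, 0.7, 0.45]

    product = 1.0
    log_sum = 0.0
    for p in probs:
        product *= p            # direct product: risks underflow on long sequences
        log_sum += math.log(p)  # log-space sum: stays in a safe numerical range

    print(product)            # ~0.00118125
    print(math.exp(log_sum))  # exponentiating the log sum recovers the product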

Page 22: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Bigram Computation

$P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{\sum_{i=1}^{V} C(w_{n-1} w_i)} = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$

where $C(x)$ is the count of $x$ in the corpus, $V$ is the dictionary size, and $\sum_{i=1}^{V} C(w_{n-1} w_i) = C(w_{n-1})$
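
As an illustration of the count-based estimate above, here is a minimal Python sketch; the two-sentence corpus is invented for the example and is not from the slides.

    from collections import Counter

    # A tiny invented corpus; each sentence starts with the <S> mark.
    corpus = [
        ['<S>', 'I', 'understand', 'this', 'algorithm', '.'],
        ['<S>', 'I', 'understand', 'this', 'proof', '.'],
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    def p_bigram(prev, cur):
        """MLE estimate P(cur | prev) = C(prev cur) / C(prev)."""
        return bigram_counts[(prev, cur)] / unigram_counts[prev]

    print(p_bigram('understand', 'this'))  # 2 / 2 = 1.0
    print(p_bigram('this', 'algorithm'))   # 1 / 2 = 0.5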

Page 23: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

N-Gram Generalization

$P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}, \quad N \geq 1$

Page 24: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Maximum Likelihood Estimation

● This N-Gram probability estimation is known as Maximum Likelihood Estimation (MLE)

● It is the MLE because it maximizes the probability of the training set (it matches the statistics of the training set exactly)

● If a word W occurs 5 times in a training corpus of 100 words, its estimated probability of occurrence is P(W) = 5/100

● This is not necessarily a good estimate of P(W) in other corpora, but it is the estimate that maximizes P(W) on the training corpus

Page 25: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Smoothed N-Grams

Page 26: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Unsmoothed N-Gram Problem

● Since any corpus is finite, in any corpus used for computing N-

Grams, some valid N-Grams will not be found

● To put it differently, an N-Gram matrix for any corpus is likely

to be sparse: it will have a large number of possible N-Grams

with zero counts

● The MLE methods produce unreliable estimates when counts are

greater than 0 but still small (small is relative)

● Smoothing is a set of techniques used to overcome zero or low

counts

Page 27: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Add-One Smoothing

One way to smooth is to add one to all N-Gram counts and add the dictionary size (V) to the denominator:

$P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$  // unsmoothed

$P^{*}(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$  // add-one smoothed
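
A hedged Python sketch of the add-one estimate; the toy corpus is an invented illustration (its vocabulary has V = 7 wordforms, counting <S> and the period).

    from collections import Counter

    # The same invented two-sentence corpus as in the earlier sketch.
    corpus = [
        ['<S>', 'I', 'understand', 'this', 'algorithm', '.'],
        ['<S>', 'I', 'understand', 'this', 'proof', '.'],
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        unigram_counts.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    V = len(unigram_counts)  # dictionary size: 7 distinct wordforms

    def p_add_one(prev, cur):
        """Add-one smoothed estimate: (C(prev cur) + 1) / (C(prev) + V)."""
        return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + V)

    print(p_add_one('this', 'algorithm'))  # seen once:   (1 + 1) / (2 + 7)
    print(p_add_one('this', 'I'))          # never seen:  (0 + 1) / (2 + 7)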

Page 28: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

A Problem with Add-One Smoothing

● Much of the total probability mass moves to the N-

Grams with zero counts

● Researchers attribute it to the arbitrary choice of the

value of 1

● Add-One smoothing appears to be worse than other

methods at predicting N-Grams with zero counts

● Some research indicates that add-one smoothing is no

better than no smoothing

Page 29: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Good-Turing Discounting

● The probability mass assigned to N-Grams with zero or low counts is re-estimated using the counts of N-Grams with higher counts

● Let $N_c$ be the number of N-Grams that occur c times in a corpus:

● $N_0$ is the number of N-Grams that occur 0 times

● $N_1$ is the number of N-Grams that occur once

● $N_2$ is the number of N-Grams that occur twice

Page 30: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Good-Turing Discounting

Let $C(w_1 \ldots w_n) = c$ be the count of some N-Gram $w_1 \ldots w_n$. Then the new count smoothed by the GTD, i.e., $C^{*}(w_1 \ldots w_n)$, is:

$C^{*}(w_1 \ldots w_n) = (c + 1) \dfrac{N_{c+1}}{N_c}$
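
A minimal Python sketch of the adjusted count; the bigram counts below are invented for illustration. In practice the formula is applied only for small c, since N_{c+1} can be zero for large counts.

    from collections import Counter

    # Invented bigram counts, for illustration only.
    bigram_counts = {
        ('I', 'understand'): 3,
        ('understand', 'this'): 1,
        ('this', 'algorithm'): 1,
        ('this', 'proof'): 2,
        ('algorithm', '.'): 1,
    }

    # N_c = the number of distinct bigrams that occur exactly c times.
    freq_of_freq = Counter(bigram_counts.values())  # here N_1 = 3, N_2 = 1, N_3 = 1

    def good_turing_count(c):
        """Adjusted count c* = (c + 1) * N_{c+1} / N_c."""
        return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

    print(good_turing_count(1))  # 2 * N_2 / N_1 = 2 * 1 / 3, approximately 0.67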

Page 31: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

N-Gram Vectors

● N-Grams can be computed over any finite symbolic sets

● Those symbolic sets are called alphabets and can consist of wordforms, waveforms, individual letters, etc.

● The choice of the symbols in the alphabet depends on

the application

● Regardless of the application, the objective is to take

an input sequence over a specific alphabet and

compute its N-Gram frequency vector

Page 32: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Dimensions of N-Gram Vectors

● Let A be an alphabet and n > 0 be the size of the N-Gram

● The number of N-Gram dimensions is |A|^n

● Suppose that the alphabet has 26 characters and we compute trigrams over that alphabet; then the number of possible trigrams, i.e., the dimension of the N-Gram frequency vectors, is 26^3 = 17,576

● A practical implication is that N-Gram frequency vectors, even for low values of n, are sparse

Page 33: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Example

● Suppose the alphabet A = {a, <space>, <start>}

● The number of possible bigrams (n=2) is |A|2 = 9:

– 1) aa; 2) a<start>; 3) a<space>; 4) <start><start>; 5) <start>a;

6) <start><space>; 7) <space><space> ; 8) <space>a;

9) <space><start>

● Suppose the input is ‘a a’

● The input’s N-Grams are: <start>a, a<space>, <space>a

● Then the input’s N-Gram vector is: (0, 0, 1, 0, 1, 0, 0, 1, 0)
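
The following Python sketch (an illustration added here, not from the slides) reproduces the vector above; the bigram ordering follows the enumeration on this slide.

    # Bigrams enumerated in the order used on the slide.
    bigrams = [('a', 'a'), ('a', '<start>'), ('a', '<space>'),
               ('<start>', '<start>'), ('<start>', 'a'), ('<start>', '<space>'),
               ('<space>', '<space>'), ('<space>', 'a'), ('<space>', '<start>')]

    # The input 'a a' as a symbol sequence, prefixed with the <start> mark.
    seq = ['<start>', 'a', '<space>', 'a']

    counts = {}
    for pair in zip(seq, seq[1:]):
        counts[pair] = counts.get(pair, 0) + 1

    vector = [counts.get(b, 0) for b in bigrams]
    print(vector)  # [0, 0, 1, 0, 1, 0, 0, 1, 0]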


Page 35: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Markov & Hidden Markov Models

Page 36: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Markov Models

$P(w_n \mid w_1 \ldots w_{n-1})$

Markov Models are closely related to N-Grams:

the basic idea is to estimate the conditional

probability of the n-th observation given a

sequence of n-1 observations

Page 37: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Markov Assumption

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-1})$  // 1st order

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-2} w_{n-1})$  // 2nd order

$P(w_n \mid w_1 \ldots w_{n-1}) \approx P(w_n \mid w_{n-3} w_{n-2} w_{n-1})$  // 3rd order

● If n = 5 and the size of the observation alphabet is 3, we need to collect statistics over 3^5 = 243 sequence types

● If n = 2 and the size of the observation alphabet is 3, we need to collect statistics over 3^2 = 9 sequence types

● So the number of observations we condition on matters

Page 38: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Weather Example 01

Weather Today vs. Weather Tomorrow (rows: today's weather, columns: tomorrow's weather)

          Sunny   Rainy   Foggy
Sunny     0.8     0.05    0.15
Rainy     0.2     0.6     0.2
Foggy     0.2     0.3     0.5

$P(w_2 = \text{Sunny}, w_3 = \text{Rainy} \mid w_1 = \text{Sunny})$
$= P(w_3 = \text{Rainy} \mid w_2 = \text{Sunny}, w_1 = \text{Sunny}) \cdot P(w_2 = \text{Sunny} \mid w_1 = \text{Sunny})$
$\approx P(w_3 = \text{Rainy} \mid w_2 = \text{Sunny}) \cdot P(w_2 = \text{Sunny} \mid w_1 = \text{Sunny})$
$= 0.05 \cdot 0.8 = 0.04$
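
A small Python sketch of this two-step computation (an added illustration); the transition table is the one on this slide.

    # P(tomorrow | today) from the slide's transition table.
    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6,  'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3,  'Foggy': 0.5},
    }

    # P(w2 = Sunny, w3 = Rainy | w1 = Sunny)
    #   ~= P(w3 = Rainy | w2 = Sunny) * P(w2 = Sunny | w1 = Sunny)
    p = transition['Sunny']['Rainy'] * transition['Sunny']['Sunny']
    print(p)  # 0.05 * 0.8 = 0.04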

Page 39: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Weather Example 02

Weather Today vs. Weather Tomorrow (rows: today's weather, columns: tomorrow's weather)

          Sunny   Rainy   Foggy
Sunny     0.8     0.05    0.15
Rainy     0.2     0.6     0.2
Foggy     0.2     0.3     0.5

$P(w_3 = \text{Rainy} \mid w_1 = \text{Foggy})$
$= P(w_3 = \text{Rainy}, w_2 = \text{Foggy} \mid w_1 = \text{Foggy}) + P(w_3 = \text{Rainy}, w_2 = \text{Rainy} \mid w_1 = \text{Foggy}) + P(w_3 = \text{Rainy}, w_2 = \text{Sunny} \mid w_1 = \text{Foggy})$
$= P(w_3 = \text{Rainy} \mid w_2 = \text{Foggy}) P(w_2 = \text{Foggy} \mid w_1 = \text{Foggy}) + P(w_3 = \text{Rainy} \mid w_2 = \text{Rainy}) P(w_2 = \text{Rainy} \mid w_1 = \text{Foggy}) + P(w_3 = \text{Rainy} \mid w_2 = \text{Sunny}) P(w_2 = \text{Sunny} \mid w_1 = \text{Foggy})$
$= 0.3 \cdot 0.5 + 0.6 \cdot 0.3 + 0.05 \cdot 0.2 = 0.34$
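
A Python sketch of the marginalization over the unknown middle day (an added illustration using the same transition table).

    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6,  'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3,  'Foggy': 0.5},
    }

    # P(w3 = Rainy | w1 = Foggy) = sum over w2 of P(Rainy | w2) * P(w2 | Foggy)
    p = sum(transition[w2]['Rainy'] * transition['Foggy'][w2]
            for w2 in ('Sunny', 'Rainy', 'Foggy'))
    print(round(p, 2))  # 0.01 + 0.18 + 0.15 = 0.34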

Page 40: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Weather Example 03

Umbrella

Sunny 0.8

Rainy 0.2

Foggy 0.2

Weather vs. Umbrella

$P(w_1 \ldots w_n \mid u_1 \ldots u_n) = \dfrac{P(u_1 \ldots u_n \mid w_1 \ldots w_n) \cdot P(w_1 \ldots w_n)}{P(u_1 \ldots u_n)}$
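
A hedged Python sketch of how the HMM factorization behind this formula can be used to score a weather sequence against umbrella observations. It assumes the table above gives P(umbrella | weather), that observations are conditionally independent given the weather, and that the first day's weather has a uniform prior; all three are assumptions made for this illustration, not statements from the slides.

    transition = {
        'Sunny': {'Sunny': 0.8, 'Rainy': 0.05, 'Foggy': 0.15},
        'Rainy': {'Sunny': 0.2, 'Rainy': 0.6,  'Foggy': 0.2},
        'Foggy': {'Sunny': 0.2, 'Rainy': 0.3,  'Foggy': 0.5},
    }

    # Assumed reading of the slide's table: P(umbrella | weather);
    # P(no umbrella | weather) = 1 - P(umbrella | weather).
    p_umbrella = {'Sunny': 0.8, 'Rainy': 0.2, 'Foggy': 0.2}

    def joint_score(weather, umbrella, first_day_prior=1.0 / 3):
        """Unnormalized P(w1..wn, u1..un): a product of emission terms
        P(ui | wi) and transition terms P(wi | w_{i-1}); the uniform prior
        over the first day's weather is an assumption of this sketch."""
        score = first_day_prior
        prev = None
        for w, u in zip(weather, umbrella):
            if prev is not None:
                score *= transition[prev][w]
            score *= p_umbrella[w] if u else 1.0 - p_umbrella[w]
            prev = w
        return score

    # Compare two candidate weather sequences for the observations
    # (umbrella on day 1, no umbrella on day 2); dividing either score by
    # P(u1..un) would give the conditional probability in the formula above.
    print(joint_score(['Sunny', 'Sunny'], [True, False]))
    print(joint_score(['Rainy', 'Rainy'], [True, False]))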

Page 41: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Speech Recognition

w is a sequence of tokens

L is a language

y is an acoustic signal

$\operatorname{argmax}_{w \in L} P(w \mid y) = \operatorname{argmax}_{w \in L} \dfrac{P(y \mid w) P(w)}{P(y)} = \operatorname{argmax}_{w \in L} P(y \mid w) P(w)$
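
A toy Python sketch of the argmax (an added illustration); the candidate word sequences and their acoustic and language-model scores are entirely made up.

    # Made-up candidate word sequences with made-up model scores.
    candidates = {
        'I understand this algorithm': {'acoustic': 0.010, 'lm': 1.2e-3},
        'eye under stand this algorithm': {'acoustic': 0.012, 'lm': 4.0e-7},
    }

    # argmax over w in L of P(y | w) * P(w); P(y) is the same for every
    # candidate, so it can be dropped from the maximization.
    best = max(candidates,
               key=lambda w: candidates[w]['acoustic'] * candidates[w]['lm'])
    print(best)  # 'I understand this algorithm'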

Page 42: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

FOPC Basics

Page 43: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Basic Notions

● Conceptualization = Objects + Relations

● Universe of Discourse = the set of objects in a conceptualization

● Functions & Relations

Page 44: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

Example

[Figure: a blocks-world scene with blocks a, b, c, d, and e stacked on a table]

<{a, b, c, d, e}, {hat}, {on, above, clear, table}>

hat is a function; on, above, clear, and table are relations
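
One way to make the conceptualization concrete is to write the relations out as explicit sets, as in the Python sketch below; the particular arrangement of the blocks (a on b on c, d on e, both stacks on the table) is assumed for illustration, since the original arrangement is given only in the figure.

    # Universe of discourse.
    universe = {'a', 'b', 'c', 'd', 'e'}

    # Assumed arrangement: a on b, b on c, c on the table; d on e, e on the table.
    hat = {'b': 'a', 'c': 'b', 'e': 'd'}       # hat(x) = the block sitting directly on x
    on = {('a', 'b'), ('b', 'c'), ('d', 'e')}  # on(x, y): x sits directly on y
    above = {('a', 'b'), ('a', 'c'), ('b', 'c'), ('d', 'e')}  # transitive closure of on
    clear = {'a', 'd'}                         # blocks with nothing on top of them
    table = {'c', 'e'}                         # blocks resting on the table

    # A relation holds exactly when its tuple is a member of the set.
    print(('a', 'c') in above)  # True
    print('b' in clear)         # False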

Page 45: NLP (Fall 2013): N-Grams, Markov & Hidden Markov Models, FOPC Basics

References

● Ch. 6, D. Jurafsky & J. Martin. Speech and Language Processing. Prentice Hall. ISBN 0-13-095069-6

● E. Fosler-Lussier. 1998. Markov Models and Hidden Markov Models: A Brief Tutorial. ICSI, UC Berkeley