1
COMP 791A: Statistical Language Processing
n-gram Models over Sparse Data
Chap. 6
2
“Shannon Game” (Shannon, 1951)
“I am going to make a collect …”
Predict the next word given the n-1 previous words.
Past behavior is a good guide to what will happen in the future as there is regularity in language.
Determine the probability of different sequences from a training corpus.
3
Language Modeling a statistical model of word/character sequences used to predict the next character/word given
the previous ones
applications: Speech recognition Spelling correction
ex: “He is trying to fine out.”  “Hopefully, all with continue smoothly in my absence.”
Optical character recognition / Handwriting recognition
Statistical Machine Translation …
4
1st approximation
each word has an equal probability to follow any other
with 100,000 words, the probability of each of them at any given point is 0.00001
but some words are more frequent than others… in the Brown corpus:
“the” appears 69,971 times; “rabbit” appears 11 times
5
Remember Zipf’s Law: f × r = k
[figure omitted: word frequency vs. rank]
6
Frequency of frequencies
most words are rare (hapax legomena)
but common words are very common
7
n-grams take into account the frequency of the word
in some training corpus at any given point, “the” is more probable than
“rabbit”
but this is a bag-of-words approach… “Just then, the white …”
so the probability of a word also depends on the previous words (the history)
P(wn |w1w2…wn-1)
8
Problems with n-grams
“the large green ______ .” “mountain”? “tree”?
“Sue swallowed the large green ______ .” “pill”? “broccoli”?
Knowing that Sue “swallowed” helps narrow down possibilities
But, how far back do we look?
9
Reliability vs. Discrimination
larger n:
more information about the context of the specific instance
greater discrimination
But:
too costly, ex: for a vocabulary of 20,000 words:
number of bigrams = 400 million (20,000^2)
number of trigrams = 8 trillion (20,000^3)
number of four-grams = 1.6 x 10^17 (20,000^4)
too many chances that the history has never been seen before (data sparseness)
smaller n: less precision
BUT: more instances in training data, better statistical estimates, more reliability
--> Markov approximation: take only the most recent history
10
Markov assumption Markov Assumption:
we can predict the probability of some future item on the basis of a short history
if (history = last n-1 words) --> (n-1)th order Markov model or n-gram model
Most widely used: unigram (n=1) bigram (n=2) trigram (n=3)
11
Text generation with n-grams
n-gram model trained on 40 million words from WSJ
Unigram: Months the my and issue of year foreign new exchange’s
September were recession exchange new endorsed a acquire to six executives.
Bigram: Last December through the way to preserve the Hudson
corporation N.B.E.C. Taylor would seem to complete the major central planner one point five percent of U.S.E. has already old M. X. corporation of living on information such as more frequently fishing to keep her.
Trigram: They also point to ninety point six billion dollars from two
hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.
12
Bigrams first-order Markov models
N-by-N matrix of probabilities/frequencies N = size of the vocabulary we are modeling
P(wn|wn-1)
a aardvark aardwolf aback … zoophyte zucchini
a 0 0 0 0 … 8 5
aardvark 0 0 0 0 … 0 0
aardwolf 0 0 0 0 … 0 0
aback 26 1 6 0 … 12 2
… … … … … … … …
zoophyte 0 0 0 1 … 0 0
zucchini 0 0 0 3 … 0 0
(rows: 1st word; columns: 2nd word)
13
Why use only bi- or tri-grams?
Markov approximation is still costly
with a 20,000 word vocabulary:
a bigram model needs to store 400 million parameters
a trigram model needs to store 8 trillion parameters
using a language model beyond trigrams is impractical
to reduce the number of parameters, we can:
do stemming (use stems instead of word types)
group words into semantic classes
seen once --> same as unseen
...
14
Building n-gram Models
Data preparation: Decide training corpus Clean and tokenize How do we deal with sentence boundaries?
I eat. I sleep. (I eat) (eat I) (I sleep)
<s> I eat <s> I sleep <s>  -->  (<s> I) (I eat) (eat <s>) (<s> I) (I sleep) (sleep <s>)  (see the sketch after this slide)
Use statistical estimators: to derive good probability estimates based on training data.
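A minimal Python sketch (not part of the original slides) of the sentence-boundary tokenization just described; the function name and the toy sentences are illustrative only.

from itertools import chain

# Add sentence-boundary markers and extract bigrams, as in "I eat. I sleep."
def sentence_bigrams(sentences):
    """sentences: list of token lists, e.g. [["I", "eat"], ["I", "sleep"]]."""
    bigrams = []
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["<s>"]          # boundary marker on both sides
        bigrams.extend(zip(padded[:-1], padded[1:])) # consecutive pairs
    return bigrams

print(sentence_bigrams([["I", "eat"], ["I", "sleep"]]))
# [('<s>', 'I'), ('I', 'eat'), ('eat', '<s>'), ('<s>', 'I'), ('I', 'sleep'), ('sleep', '<s>')]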
15
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
( Validation: Held Out Estimation Cross Validation )
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
16
Statistical Estimators --> Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
( Validation: Held Out Estimation Cross Validation )
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
17
Maximum Likelihood Estimation
Choose the parameter values which give the highest probability on the training corpus
Let C(w1,..,wn) be the frequency of n-gram w1,..,wn
P_MLE(wn | w1,..,wn-1) = C(w1,..,wn) / C(w1,..,wn-1)
18
Example 1: P(event)
in a training corpus, we have 10 instances of “come across”: 8 times followed by “as”, 1 time followed by “more”, 1 time followed by “a”
with MLE, we have:
P(as | come across) = 0.8
P(more | come across) = 0.1
P(a | come across) = 0.1
P(X | come across) = 0 where X ≠ “as”, “more”, “a”
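A small Python sketch of the MLE estimate above, using the toy “come across” counts from this slide; the data structure and names are illustrative assumptions, not from the slides.

from collections import Counter

# Toy trigram counts from the example: 10 instances of "come across"
trigram_counts = Counter({
    ("come", "across", "as"): 8,
    ("come", "across", "more"): 1,
    ("come", "across", "a"): 1,
})
history_count = sum(trigram_counts.values())   # C(come across) = 10

def p_mle(word):
    # P(word | come across) = C(come across word) / C(come across)
    return trigram_counts[("come", "across", word)] / history_count

print(p_mle("as"))     # 0.8
print(p_mle("more"))   # 0.1
print(p_mle("zebra"))  # 0.0 -- any unseen continuation gets probability zero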
19
Example 2: P(sequence of events)
eat on .16 eat some .06 eat British .001 … <s> I .25 <s> I ’d .06 …
I want .32 I would .29 I don’t .08 … want to .65 want a .5 …
to eat .26 to have .14 to spend .09 … British food .6 British restaurant .15 …
P(I want to eat British food) = P(I|<s>) x P(want|I) x P(to|want) x P(eat|to) x P(British|eat) x P(food|British) = .25 x .32 x .65 x .26 x .001 x .6 = .000008
20
Some adjustments
product of probabilities… numerical underflow for long sentences
so instead of multiplying the probs, we add the log of the probs
log P(I want to eat British food) = log(P(I|<s>)) + log(P(want|I)) + log(P(to|want)) + log(P(eat|to)) + log(P(British|eat)) + log(P(food|British))
= log(.25) + log(.32) + log(.65) + log(.26) + log(.001) + log(.6) = -11.722
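A short Python sketch of the log-space trick above, with the same six bigram probabilities; it only shows that summing (natural) logs matches the direct product while avoiding underflow.

import math

bigram_probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.6]   # P(I|<s>), P(want|I), ...

log_prob = sum(math.log(p) for p in bigram_probs)     # add logs instead of multiplying
print(log_prob)             # about -11.72
print(math.exp(log_prob))   # about 8e-06, same as the direct product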
21
Problem with MLE: data sparseness What if a sequence never appears in training corpus?
P(X)=0 “come across the men” --> prob = 0 “come across some men” --> prob = 0 “come across 3 men” --> prob = 0
MLE assigns a probability of zero to unseen events … probability of an n-gram involving unseen words will be
zero!
but… most words are rare (Zipf’s Law). so n-grams involving rare words are even more rare… data
sparseness
22
in (Bahl et al., 83): training with 1.5 million words, 23% of the trigrams from another part of the same corpus were previously unseen.
in Shakespeare’s work:
out of 844 million possible bigrams, 99.96% were never used
So MLE alone is not a good enough estimator
Solution: smoothing decrease the probability of previously seen events so that there is a little bit of probability mass left over
for previously unseen events also called discounting
Problem with MLE: data sparseness (con’t)
23
Discounting or Smoothing
MLE is usually unsuitable for NLP because of the sparseness of the data
We need to allow for possibility of seeing events not seen in training
Must use a Discounting or Smoothing technique
Decrease the probability of previously seen events to leave a little bit of probability for previously unseen events
24
Statistical Estimators Maximum Likelihood Estimation (MLE) --> Smoothing
--> Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
( Validation: Held Out Estimation Cross Validation )
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
25
Many smoothing techniques
Add-one Add-delta Witten-Bell smoothing Good-Turing smoothing Church-Gale smoothing Absolute-discounting Kneser-Ney smoothing ...
26
Add-one Smoothing (Laplace’s law)
Pretend we have seen every n-gram at least once
Intuitively: new_count(n-gram) = old_count(n-gram) + 1
The idea is to give a little bit of the probability space to unseen events
27
Add-one: Example
I want to eat Chinese food lunch … Total (N)
I 8 1087 0 13 0 0 0 3437
want 3 0 786 0 6 8 6 1215
to 3 0 10 860 3 0 12 3256
eat 0 0 2 0 19 2 52 938
Chinese 2 0 0 0 0 120 1 213
food 19 0 17 0 0 0 0 1506
lunch 4 0 0 0 0 1 0 459
…
unsmoothed bigram counts:
I want to eat Chinese food lunch … Total
I .0023 (8/3437) .32 0 .0038 (13/3437) 0 0 0 1
want .0025 0 .65 0 .0049 .0066 .0049 1
to .00092 0 .0031 .26 .00092 0 .0037 1
eat 0 0 .0021 0 .020 .0021 .055 1
Chinese .0094 0 0 0 0 .56 .0047 1
food .013 0 .011 0 0 0 0 1
lunch .0087 0 0 0 0 .0022 0 1
…
unsmoothed normalized bigram probabilities:
(rows: 1st word; columns: 2nd word)
28
Add-one: Example (con’t)
I want to eat Chinese food lunch … Total (N+V)
I 9 1088 1 14 1 1 1 5053
want 4 1 787 1 7 9 7 2831
to 4 1 11 861 4 1 13 4872
eat 1 1 23 1 20 3 53 2554
Chinese 3 1 1 1 1 121 2 1829
food 20 1 18 1 1 1 1 3122
lunch 5 1 1 1 1 2 1 2075
add-one smoothed bigram counts:
I want to eat Chinese food lunch … Total
I .0018 (9/5053) .22 .0002 .0028 (14/5053) .0002 .0002 .0002 1
want .0014 .00035 .28 .00035 .0025 .0032 .0025 1
to .00082 .00021 .0023 .18 .00082 .00021 .0027 1
eat .00039 .00039 .0012 .00039 .0078 .0012 .021 1
Chinese .0016 .00055 .00055 .00055 .00055 .066 .0011 1
food .0064 .00032 .0058 .00032 .00032 .00032 .00032 1
lunch .0024 .00048 .00048 .00048 .00048 .0022 .00048 1
add-one normalized bigram probabilities:
29
Add-one, more formally
N: nb of n-grams in training corpus starting with w1…wn-1
V: size of vocabulary
i.e. nb of possible different n-grams starting with w1…wn-1
i.e. nb of word types
P_Add1(w1 w2 … wn) = (C(w1 w2 … wn) + 1) / (N + V)
30
The example again
I want to eat Chinese food lunch … Total (N)
I 8 1087 0 13 0 0 0 3437
want 3 0 786 0 6 8 6 1215
to 3 0 10 860 3 0 12 3256
eat 0 0 2 0 19 2 52 938
Chinese 2 0 0 0 0 120 1 213
food 19 0 17 0 0 0 0 1506
lunch 4 0 0 0 0 1 0 459
unsmoothed bigram counts: V = 1616 word types
P(I eat) = (C(I eat) + 1) / (nb of bigrams starting with “I” + nb of possible bigrams starting with “I”) = (13 + 1) / (3437 + 1616) = 0.0028
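A one-function Python sketch of the add-one estimate above, checked against the P(eat | I) example; the function name is illustrative.

def p_add_one(count_bigram, count_history, vocab_size):
    # Add-one (Laplace) smoothed bigram probability
    return (count_bigram + 1) / (count_history + vocab_size)

# P(eat | I): C(I eat) = 13, 3437 bigram tokens starting with "I", V = 1616
print(p_add_one(13, 3437, 1616))   # about 0.0028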
31
Problem with add-one smoothing every previously unseen n-gram is given a low
probability but there are so many of them that too much
probability mass is given to unseen events
adding 1 to a frequent bigram does not change it much
but adding 1 to low-count bigrams (including unseen ones) boosts them too much!
In NLP applications that are very sparse, Laplace’s Law actually gives far too much of the probability space to unseen events.
32
Problem with add-one smoothing bigrams starting with Chinese are boosted by a factor of
8 ! (1829 / 213)
I want to eat Chinese food lunch … Total (N)
I 8 1087 0 13 0 0 0 3437
want 3 0 786 0 6 8 6 1215
to 3 0 10 860 3 0 12 3256
eat 0 0 2 0 19 2 52 938
Chinese 2 0 0 0 0 120 1 213
food 19 0 17 0 0 0 0 1506
lunch 4 0 0 0 0 1 0 459
I want to eat Chinese food lunch … Total (N+V)
I 9 1088 1 14 1 1 1 5053
want 4 1 787 1 7 9 7 2831
to 4 1 11 861 4 1 13 4872
eat 1 1 23 1 20 3 53 2554
Chinese 3 1 1 1 1 121 2 1829
food 20 1 18 1 1 1 1 3122
lunch 5 1 1 1 1 2 1 2075
unsmoothed bigram counts:
add-one smoothed bigram counts:
(rows: 1st word; columns: 2nd word)
33
Problem with add-one smoothing (con’t) Data from the AP from (Church and Gale, 1991)
Corpus of 22,000,000 bigrams Vocabulary of 273,266 words (i.e. 74,674,306,760 possible bigrams - or
bins) 74,671,100,000 bigrams were unseen And each unseen bigram was given a frequency of 0.000295
fMLE (freq. from training data)   fempirical (freq. from held-out data)   fadd-one (add-one smoothed freq.)
0 0.000027 0.000295
1 0.448 0.000274
2 1.25 0.000411
3 2.24 0.000548
4 3.23 0.000685
5 4.21 0.000822
(the add-one frequency is too high for unseen bigrams and far too low for seen ones)
Total probability mass given to unseen bigrams = (74,671,100,000 x 0.000295) / 22,000,000 ≈ 99.96% !!!!
34
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace --> Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws
(ELE) Validation:
Held Out Estimation Cross Validation
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
35
Add-delta smoothing (Lidstone’s law)
instead of adding 1, add some other (smaller) positive value
most widely used value for δ is 0.5
if δ = 0.5, Lidstone’s Law is called:
the Expected Likelihood Estimation (ELE)
or the Jeffreys-Perks Law
better than add-one, but still…
P_AddD(w1 w2 … wn) = (C(w1 w2 … wn) + δ) / (N + δV)
P_ELE(w1 w2 … wn) = (C(w1 w2 … wn) + 0.5) / (N + 0.5V)
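A small Python sketch of Lidstone (add-delta) smoothing as defined above; delta = 0.5 gives the ELE estimate and delta = 1 falls back to add-one. The function name is illustrative.

def p_add_delta(count_ngram, count_history, vocab_size, delta=0.5):
    # Lidstone smoothing; delta = 0.5 is the Expected Likelihood Estimate (ELE)
    return (count_ngram + delta) / (count_history + delta * vocab_size)

print(p_add_delta(13, 3437, 1616))        # ELE estimate of P(eat | I)
print(p_add_delta(13, 3437, 1616, 1.0))   # delta = 1 recovers add-one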
36
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
--> ( Validation: Held Out Estimation Cross Validation )
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
37
Validation / Held-out Estimation
How do we know how much of the probability space to “hold out” for unseen events?
i.e. we need a good way to guess in advance
Held-out data:
We can divide the training data into two parts: the training set: used to build initial estimates by
counting the held out data: used to refine the initial
estimates (i.e. see how often the bigrams that appeared r times in the training text occur in the held-out text)
38
Held Out Estimation
For each n-gram w1...wn we compute: Ctr(w1...wn) the frequency of w1...wn in the training data Cho(w1...wn) the frequency of w1...wn in the held out data
Let: r = the frequency of an n-gram in the training data Nr = the number of different n-grams with frequency r
in the training data Tr = the sum of the counts of all n-grams in the held-out
data that appeared r times in the training data
T = total number of n-gram tokens in the held-out data
So:
Tr = sum of Cho(w1…wn) over all n-grams w1…wn with Ctr(w1…wn) = r
Pho(w1…wn) = (Tr / T) x (1 / Nr)   where r = Ctr(w1…wn)
39
Some explanation…
Pho(w1…wn) = (Tr / T) x (1 / Nr)   where r = Ctr(w1…wn)
probability in held-out data for all n-grams appearing r times in the training data
since we have Nr different n-grams in the training data that occurred r times, let's share this probability mass equally among them
ex: assume r = 5 and 10 different n-grams (types) occur 5 times in training --> N5 = 10
if all the n-grams (types) that occurred 5 times in training occurred in total (n-gram tokens) 20 times in the held-out data --> T5 = 20
assume the held-out data contains 2000 n-grams (tokens)
Pho(an n-gram with r = 5) = (20 / 2000) x (1 / 10) = 0.001
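A possible Python sketch of held-out estimation as defined above; the Counter-based representation of the training and held-out counts is an assumption, not something prescribed by the slides.

from collections import Counter

def held_out_probs(train_counts, heldout_counts):
    """Both arguments: Counter mapping n-gram -> count. Returns P_ho for one
    n-gram whose training count is r, i.e. T_r / (N_r * T)."""
    T = sum(heldout_counts.values())          # total n-gram tokens in held-out data
    N_r = Counter(train_counts.values())      # N_r: nb of types with training count r
    T_r = Counter()                           # T_r: held-out tokens of those types
    for ngram, r in train_counts.items():
        T_r[r] += heldout_counts[ngram]
    return {r: T_r[r] / (N_r[r] * T) for r in N_r}

# With the slide's numbers (N_5 = 10, T_5 = 20, T = 2000) the entry for r = 5
# would come out as 20 / (10 * 2000) = 0.001.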
40
Cross-Validation
Held Out estimation is useful if there is a lot of data available
If not, we can use each part of the data both as training data and as held out data.
Main methods: Deleted Estimation (two-way cross validation)
Divide data into part 0 and part 1 In one model use 0 as the training data and 1 as the held
out data In another model use 1 as training and 0 as held out data. Do a weighted average of the two models
Leave-One-Out Divide data into N parts (N = nb of tokens) Leave 1 token out each time Train N language models
41
Dividing the corpus Training:
Training data (80% of total data) To build initial estimates (frequency counts)
Held out data (10% of total data) To refine initial estimates (smoothed estimates)
Testing: Development test data (5% of total data)
To test while developing Final test data (5% of total data)
To test at the end
But how do we divide? Randomly select data (ex. sentences, n-grams)
Advantage: Test data is very similar to training data Cut large chunks of consecutive data
Advantage: Results are lower, but more realistic
42
Developing and Testing Models
1. Write an algorithm
2. Train it with training set & held-out data
3. Test it with development set
4. Note things it does wrong & revise it
5. Repeat 1-5 until satisfied
6. Only then, evaluate and publish results with final test set
Better to give final results by testing on n smaller samples of the test data and averaging
43
Factors of training corpus Size:
the more, the better but after a while, not much improvement…
bigrams (characters): little improvement after 100s of millions of words (IBM)
trigrams (characters) after some billions of words (IBM)
Nature (adaptation): training on WSJ and testing on AP??
44
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
( Validation: Held Out Estimation Cross Validation )
--> Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
45
Witten-Bell smoothing
intuition: An unseen n-gram is one that just did not
occur yet When it does happen, it will be its first
occurrence So give to unseen n-grams the probability
of seeing a new n-gram
46
Some intuition Assume these counts:
Observations: a seems more promiscuous than b…
b has always been followed by c, but a seems to be followed by a wider range of words
c seems more stubborn than b… c and b have same distribution but we have seen 300 instances of bigrams starting with c, so there
seems to be less chances that a new bigram starting with c will be new, compared to b
a b c d … Total
a 10 10 10 0 30
b 0 0 30 0 30
c 0 0 300 0 300
d …
(rows: 1st word; columns: 2nd word)
47
a b c d … Total
a 10 10 10 0 30
b 0 0 30 0 30
c 0 0 300 0 300
d …
intuitively, ad should be more probable than bd, and bd should be more probable than cd: P(d|a) > P(d|b) > P(d|c)
Some intuition (con’t)
P(w2 | w1) ∝ promiscuity of w1 / stubbornness of w1
48
to compute the probability of a bigram w1w2 we have never seen, we use: promiscuity T(w1)
= the probability of seeing a new bigram starting with w1
= number of different n-grams (types) starting with w1
stubbornness N(w1) = number of n-gram tokens starting with w1
the following total probability mass will be given to all (not each) unseen bigrams starting with w1:
P(all unseen bigrams | w1) = T(w1) / (N(w1) + T(w1))
this probability mass must be distributed in equal parts over all unseen bigrams
Z(w1): number of unseen n-grams starting with w1
for each unseen event:
P(w2 | w1) = (1 / Z(w1)) x T(w1) / (N(w1) + T(w1))
Witten-Bell smoothing
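A minimal Python sketch of these Witten-Bell probabilities (seen and unseen bigrams), using the N / T / Z notation above; the function names are illustrative, and the numbers in the comment come from the small example that follows.

def witten_bell_unseen(N, T, Z):
    """Probability of each individual unseen bigram starting with w1."""
    return T / (Z * (N + T))

def witten_bell_seen(count, N, T):
    """Discounted probability of a seen bigram w1 w2 with count C(w1 w2)."""
    return count / (N + T)

# Small example below: N(a) = 30, T(a) = 3, Z(a) = 1  -->  P(d|a) ~ 0.091
print(witten_bell_unseen(30, 3, 1))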
49
Small example
all unseen bigrams starting with a will share a probability mass of
T(a) / (N(a) + T(a)) = 3 / (30 + 3) ≈ 0.091
each unseen bigram starting with a will have an equal part of this:
P(d | a) = (1 / Z(a)) x T(a) / (N(a) + T(a)) = (1/1) x 0.091 ≈ 0.091
a b c d … Total = N(w1) (nb seen tokens), T(w1) (nb seen types), Z(w1) (nb unseen types)
a 10 10 10 0 30 3 1
b 0 0 30 0 30 1 3
c 0 0 300 0 300 1 3
d …
50
Small example (con’t)
all unseen bigrams starting with b will share a probability mass of
T(b) / (N(b) + T(b)) = 1 / (30 + 1) ≈ 0.032
each unseen bigram starting with b will have an equal part of this:
P(a | b) = P(b | b) = P(d | b) = (1 / Z(b)) x T(b) / (N(b) + T(b)) = (1/3) x 0.032 ≈ 0.011
a b c d … Total = N(w1) (nb seen tokens), T(w1) (nb seen types), Z(w1) (nb unseen types)
a 10 10 10 0 30 3 1
b 0 0 30 0 30 1 3
c 0 0 300 0 300 1 3
d …
51
Small example (con’t)
all unseen bigrams starting with c will share a probability mass of
T(c) / (N(c) + T(c)) = 1 / (300 + 1) ≈ 0.0033
each unseen bigram starting with c will have an equal part of this:
P(a | c) = P(b | c) = P(d | c) = (1 / Z(c)) x T(c) / (N(c) + T(c)) = (1/3) x 0.0033 ≈ 0.0011
a b c d … Total = N(w1) (nb seen tokens), T(w1) (nb seen types), Z(w1) (nb unseen types)
a 10 10 10 0 30 3 1
b 0 0 30 0 30 1 3
c 0 0 300 0 300 1 3
d …
52
Unseen bigrams:
To get from the probabilities back to the counts, we know that:
P(w2 | w1) = C(w2 | w1) / N(w1)    // N(w1) = nb of tokens starting with w1
so we get:
C(w2 | w1) = P(w2 | w1) x N(w1) = (1 / Z(w1)) x T(w1) / (N(w1) + T(w1)) x N(w1) = (T(w1) x N(w1)) / (Z(w1) x (N(w1) + T(w1)))
More formally
53
Seen bigrams:
since we added probability mass to unseen bigrams, we must decrease (discount) the probability mass of seen events (so that the total = 1)
we gave a probability mass of T(w1) / (N(w1) + T(w1)) to unseen events, so we must discount the seen events by the same factor (i.e. multiply by N(w1) / (N(w1) + T(w1)))
so we get:
newProb(w2 | w1) = originalProb(w2 | w1) x N(w1) / (N(w1) + T(w1))
C(w2 | w1) = newProb(w2 | w1) x N(w1) = originalCount(w2 | w1) x N(w1) / (N(w1) + T(w1))
More formally (con’t)
54
The restaurant example The original counts were:
T(w) = number of different seen bigram types starting with w
we have a vocabulary of 1616 words, so we can compute
Z(w) = number of unseen bigram types starting with w = 1616 - T(w)
N(w) = number of bigram tokens starting with w
I want to eat Chinese food lunch … N(w) (seen bigram tokens) T(w) (seen bigram types) Z(w) (unseen bigram types)
I 8 1087 0 13 0 0 0 3437 95 1521
want 3 0 786 0 6 8 6 1215 76 1540
to 3 0 10 860 3 0 12 3256 130 1486
eat 0 0 2 0 19 2 52 938 124 1492
Chinese 2 0 0 0 0 120 1 213 20 1592
food 19 0 17 0 0 0 0 1506 82 1534
lunch 4 0 0 0 0 1 0 459 45 1571
55
Witten-Bell smoothed count
I want to eat Chinese food lunch … Total
I 7.78 1057.76 .061 12.65 .06 .06 .06 3437
want 2.82 .05 739.73 .05 5.65 7.53 5.65 1215
to 2.88 .08 9.62 826.98 2.88 .08 12.50 3256
eat .07 .07 19.43 .07 16.78 1.77 45.93 938
Chinese 1.74 .01 .01 .01 .01 109.70 .91 213
food 18.02 .05 16.12 .05 .05 .05 .05 1506
lunch 3.64 .03 .03 .03 .03 0.91 .03 459
Witten-Bell smoothed bigram counts:
• the count of the unseen bigram “I lunch”:
T(I) / Z(I) x N(I) / (N(I) + T(I)) = (95 / 1521) x 3437 / (3437 + 95) ≈ 0.06
• the count of the seen bigram “want to”:
count(want to) x N(want) / (N(want) + T(want)) = 786 x 1215 / (1215 + 76) ≈ 739.73
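A small Python sketch reproducing the two smoothed-count calculations above (restaurant corpus numbers); the function names are illustrative.

def wb_count_unseen(N, T, Z):
    """Witten-Bell smoothed count of an unseen bigram starting with w1."""
    return (T / Z) * (N / (N + T))

def wb_count_seen(count, N, T):
    """Witten-Bell smoothed (discounted) count of a seen bigram starting with w1."""
    return count * N / (N + T)

print(wb_count_unseen(3437, 95, 1521))   # "I lunch": about 0.06
print(wb_count_seen(786, 1215, 76))      # "want to": about 739.7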
56
Witten-Bell smoothed probabilities
I want to eat Chinese food lunch … Total
I .0022 (7.78/3437) .3078 .000002 .0037 .000002 .000002 .000002 1
want .00230 .00004 .6088 .00004 .0047 .0062 .0047 1
to .00009 .00003 .0030 .2540 .00009 .00003 .0038 1
eat .00008 .00008 .0021 .00008 .0179 .0019 .0490 1
Chinese .00812 .00005 .00005 .00005 .00005 .5150 .0042 1
food .0120 .00004 .0107 .00004 .00004 .00004 .00004 1
lunch .0079 .00006 .00006 .00006 .00006 .0020 .00006 1
Witten-Bell normalized bigram probabilities:
57
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
Validation: Held Out Estimation Cross Validation
Witten-Bell smoothing --> Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
58
Good-Turing Estimator
Based on the assumption that words have a binomial distribution
Works well in practice (with large corpora)
Idea: Re-estimate the probability mass of n-grams with zero
(or low) counts by looking at the number of n-grams with higher counts
Ex:
c* = (c + 1) x N(c+1) / N(c)
where N(c) = nb of n-grams that occur c times and N(c+1) = nb of n-grams that occur c+1 times
new count for bigrams that never occurred:
c*(0) = (0 + 1) x N(1) / N(0) = (nb of bigrams that occurred once) / (nb of bigrams that never occurred)
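A Python sketch of the basic Good-Turing re-estimate above; representing the counts-of-counts with a Counter is an assumption, not something prescribed by the slides.

from collections import Counter

def good_turing_counts(ngram_counts):
    # N[c] = number of n-gram types observed exactly c times
    N = Counter(ngram_counts.values())
    # c* = (c + 1) * N_{c+1} / N_c, computed here only where N_{c+1} > 0
    return {c: (c + 1) * N[c + 1] / N[c] for c in N if N[c + 1] > 0}

# For unseen n-grams the re-estimated count is c*(0) = N_1 / N_0,
# i.e. (nb of types seen once) / (nb of types never seen).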
59
Good-Turing Estimator (con’t)
In practice c* is not used for all counts c
large counts (> a threshold k) are assumed to be reliable
If c > k (usually k = 5): c* = c
If c <= k:
c* = [ (c+1) N(c+1)/N(c) - c (k+1) N(k+1)/N(1) ] / [ 1 - (k+1) N(k+1)/N(1) ]   for 1 <= c <= k
60
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
( Validation: Held Out Estimation Cross Validation )
Witten-Bell smoothing Good-Turing smoothing
--> Combining Estimators Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
61
Combining Estimators
so far, we gave the same probability to all unseen n-grams
we have never seen the bigrams:
journal of      Punsmoothed(of | journal) = 0
journal from    Punsmoothed(from | journal) = 0
journal never   Punsmoothed(never | journal) = 0
all models so far will give the same probability to all 3 bigrams
but intuitively, “journal of” is more probable because... “of” is more frequent than “from” & “never” unigram probability P(of) > P(from) > P(never)
62
observation: unigram model suffers less from data sparseness than
bigram model bigram model suffers less from data sparseness than
trigram model …
so use a lower model estimate, to estimate probability of unseen n-grams
if we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model
Combining Estimators (con’t)
63
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
Validation: Held Out Estimation Cross Validation
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators --> Simple Linear Interpolation General Linear Interpolation Katz’s Backoff
64
Simple Linear Interpolation Solve the sparseness in a trigram model by
mixing with bigram and unigram models Also called:
linear interpolation, finite mixture models deleted interpolation
Combine linearly:
Pli(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1)
where 0 <= λi <= 1 and Σi λi = 1
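A tiny Python sketch of simple linear interpolation; the lambda weights used here are arbitrary placeholders (in practice they are tuned, e.g. on held-out data).

def p_interpolated(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    # weights must lie in [0, 1] and sum to 1
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(p_interpolated(0.001, 0.01, 0.2))   # 0.1231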
65
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
Validation: Held Out Estimation Cross Validation
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation --> General Linear Interpolation Katz’s Backoff
66
General Linear Interpolation
In simple linear interpolation, the weights λi are constant
So the unigram estimate is always combined with the same weight, regardless of whether the trigram is accurate (because there is lots of data) or poor
We can have a more general and powerful model where the λi are a function of the history h
Pgli(wn | wn-2, wn-1) = λ1(h) P(wn) + λ2(h) P(wn | wn-1) + λ3(h) P(wn | wn-2, wn-1)
where 0 <= λi(h) <= 1 and Σi λi(h) = 1
Having a specific λ(h) per n-gram is not a good idea, but we can set λ(h) according to the frequency of the n-gram
67
Statistical Estimators Maximum Likelihood Estimation (MLE) Smoothing
Add-one -- Laplace Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
Validation: Held Out Estimation Cross Validation
Witten-Bell smoothing Good-Turing smoothing
Combining Estimators Simple Linear Interpolation General Linear Interpolation --> Katz’s Backoff
68
Katz’s Backing Off Model
higher-order models are more reliable, so use a lower-order model only if necessary
Pbo(wn|wn-2, wn-1) = Pdisc(wn|wn-2, wn-1 ) if c(wn-2wn-1 wn ) > k // if trigram was seen enough
α1 Pdisc(wn |wn-1) if c(wn-1 wn) > k // if bigram was seen enough
α2 Pdisc(wn) otherwise
α1 and α2 make sure the probability mass is 1 when backing-off to lower-order model
discounted probabilities (with Good-Turing, add-one, …)
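A rough Python sketch of this backoff scheme; the discounted-probability functions, the alpha weights, and the threshold k are placeholders supplied by the caller, since the slides leave their computation to the chosen smoothing method.

from collections import Counter

def p_backoff(w1, w2, w3, tri_counts, bi_counts,
              p_disc_tri, p_disc_bi, p_disc_uni, alpha1, alpha2, k=5):
    """tri_counts / bi_counts: Counter of n-gram tuples; p_disc_*: discounted
    probability functions from some smoothing method (Good-Turing, add-one, ...)."""
    if tri_counts[(w1, w2, w3)] > k:      # trigram seen often enough: use it
        return p_disc_tri(w1, w2, w3)
    if bi_counts[(w2, w3)] > k:           # otherwise back off to the bigram
        return alpha1 * p_disc_bi(w2, w3)
    return alpha2 * p_disc_uni(w3)        # last resort: the unigram estimate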
69
Other applications of LM Author / Language identification
hypothesis: texts that resemble each other (same author, same language) share similar characteristics
In English, the character sequence “ing” is more probable than in French
Training phase: construction of the language model with pre-classified documents (known language/author)
Testing phase: evaluation of unknown text (comparison with language
model)
70
Example: Language identification bigram of characters
characters = 26 letters (case insensitive) possible variations: case sensitivity,
punctuation, beginning/end of sentence marker, …
71
A B C D … Y Z
A 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
B 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
C 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
D 0.0042 0.0014 0.0014 0.0014 … 0.0014 0.0014
E 0.0097 0.0014 0.0014 0.0014 … 0.0014 0.0014
… … … … … … … 0.0014
Y 0.0014 0.0014 0.0014 0.0014 … 0.0014 0.0014
Z 0.0014 0.0014 0.0014 0.0014 0.0014 0.0014 0.0014
1. Train a language model for English:
2. Train a language model for French
3. Evaluate probability of a sentence with LM-English & LM-French
4. Highest probability --> language of the sentence
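A self-contained Python sketch of the character-bigram language-identification procedure above, with add-one smoothing; the two training strings are tiny placeholders rather than real corpora.

import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"   # 26 letters, case insensitive

def char_bigram_model(text):
    # Train a character-bigram model with add-one smoothing
    chars = [c for c in text.lower() if c in ALPHABET]
    bigrams = Counter(zip(chars, chars[1:]))
    unigrams = Counter(chars)
    V = len(ALPHABET)
    return lambda a, b: (bigrams[(a, b)] + 1) / (unigrams[a] + V)   # smoothed P(b | a)

def log_prob(model, sentence):
    chars = [c for c in sentence.lower() if c in ALPHABET]
    return sum(math.log(model(a, b)) for a, b in zip(chars, chars[1:]))

lm_english = char_bigram_model("the thing is going along nicely this evening")
lm_french = char_bigram_model("le chat mange lentement le fromage du chevalier")
sentence = "nothing"
print("English" if log_prob(lm_english, sentence) > log_prob(lm_french, sentence) else "French")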