Language Modelling
COM4513/6513 Natural Language Processing

Nikos Aletras
[email protected]
@nikaletras

Computer Science Department

Week 3, Spring 2020

In previous lecture...

Our first NLP problem: Text classification

But we ignored word order (apart from short sequences, e.g. n-grams)!

In this lecture...

Our second NLP problem: Language Modelling

What is the probability of a given sequence of words in a particular language (e.g. English)?

Odd problem. Applications?

Applications of LMs

Word likelihood for query completion in information retrieval (“Is Sheffield” → try it on your search engine)

Language detection (“Ciao Sheffield”: is it Italian or English?)

Grammatical error detection (“You’re welcome” or “Your welcome”?)

Speech recognition (“I was tired too.” or “I was tired two.”?)

Problem setup

Training data is an (often large) set of sentences x^m, each a sequence of words x_n:

D_train = {x^1, ..., x^M},    x = [x_1, ..., x_N]

for example:

x = [<s>, The, water, is, clear, ., </s>]

<s>: denotes the start of the sentence
</s>: denotes the end of the sentence

Calculate sentence probabilities

We want to learn a model that returns the probability of an unseen sentence x:

P(x) = P(x_1, ..., x_N), for all x ∈ V^maxN

V is the vocabulary and V^maxN the set of all possible sentences (up to some maximum length maxN).

How do we compute this probability?

Unigram language model

Multiply the probability of each word appearing in the sentence x, computed over the entire corpus:

P(x) = ∏_{n=1}^{N} P(x_n) = ∏_{n=1}^{N} c(x_n) / ∑_{x'∈V} c(x')

<s> i love playing basketball </s>
<s> arctic monkeys are from sheffield </s>
<s> i study in sheffield uni </s>

P(i love) = P(i)P(love) = (2/20) · (1/20) = 0.005
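A minimal sketch of this estimator in Python, using the three-sentence toy corpus above (token counts include <s> and </s>, as in the example):

```python
from collections import Counter

corpus = [
    "<s> i love playing basketball </s>",
    "<s> arctic monkeys are from sheffield </s>",
    "<s> i study in sheffield uni </s>",
]

# Unigram counts over all tokens in the corpus (20 tokens in total)
counts = Counter(token for sent in corpus for token in sent.split())
total = sum(counts.values())

def unigram_prob(word):
    """MLE unigram probability: c(word) / sum of all counts."""
    return counts[word] / total

def sentence_prob(tokens):
    """Unigram LM: multiply the probability of each word independently."""
    p = 1.0
    for token in tokens:
        p *= unigram_prob(token)
    return p

print(sentence_prob(["i", "love"]))  # (2/20) * (1/20) = 0.005
```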

What could go wrong?

<s> i love playing basketball </s>
<s> arctic monkeys are from sheffield </s>
<s> i study in sheffield uni </s>

The most probable word is <s> (3/20)

The most probable single-word sentence is “<s>”

The most probable two-word sentence is “<s> <s>”

The most probable N-word sentence is “<s>” repeated N times

Maximum Likelihood Estimation

Instead of assuming independence:

P(x) = ∏_{n=1}^{N} P(x_n)

We assume that each word is dependent on all previous ones:

P(x) = P(x_1, ..., x_N)
     = P(x_1) P(x_2, ..., x_N | x_1)
     = P(x_1) P(x_2|x_1) ... P(x_N | x_1, ..., x_{N−1})
     = ∏_{n=1}^{N} P(x_n | x_1, ..., x_{n−1})    (chain rule)

What could go wrong?

Problems with MLE

Let’s analyse this:

P(x) = P(x_1) P(x_2|x_1) P(x_3|x_1, x_2) ... P(x_N | x_1, ..., x_{N−1})

P(x_n | x_1, ..., x_{n−1}) = c(x_1, ..., x_{n−1}, x_n) / c(x_1, ..., x_{n−1})

As we condition on more words, the counts become sparser

Bigram Language Models

Assume that the choice of a word depends only on the one before it:

P(x) = ∏_{n=1}^{N} P(x_n | x_{n−1}) = ∏_{n=1}^{N} c(x_{n−1}, x_n) / c(x_{n−1})

k-th order Markov assumption:

P(x_n | x_{n−1}, ..., x_1) ≈ P(x_n | x_{n−1}, ..., x_{n−k})

with k = 1
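A sketch of a bigram MLE estimator under this first-order Markov assumption; the same toy corpus is reused here for illustration:

```python
from collections import Counter

corpus = [
    "<s> i love playing basketball </s>",
    "<s> arctic monkeys are from sheffield </s>",
    "<s> i study in sheffield uni </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    tokens = sent.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """MLE bigram probability: c(prev, word) / c(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(tokens):
    """P(x) under a bigram LM: product of P(x_n | x_{n-1})."""
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(word, prev)
    return p

print(bigram_prob("love", "i"))   # c(i, love)/c(i) = 1/2
print(sentence_prob("<s> i love playing basketball </s>".split()))  # (2/3)*(1/2)*1*1*1 ≈ 0.333
```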

Bigram LM: From counts to probabilities

Unigram counts:

            arctic  monkeys  are   my    favourite  band
            100     600      4000  3000  500        200

Bigram counts (rows: x_{i−1}, cols: x_i):

            arctic  monkeys  are   my    favourite  band
arctic      0       10       2     0     0          0
monkeys     0       0        250   1     5          0
are         3       45       0     600   25         1
my          0       2        0     1     300        5
favourite   0       1        0     0     0          50
band        0       0        3     10    0          0

Bigram LM: From counts to probabilities

From the bigram count matrix, compute probabilities by dividing each cell by the unigram count of its row.

Bigram probabilities (rows: x_{i−1}, cols: x_i):

            arctic  monkeys  are    my      favourite  band
arctic      0       0.1      0.02   0       0          0
monkeys     0       0        0.417  0.0017  0.008      0
are         0.0008  0.0113   0      0.15    0.0063     0.00003
my          0       0.0007   0      0.0003  0.1        0.0017
favourite   0       0.002    0      0       0          0.1
band        0       0        0.015  0.05    0          0
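The row normalisation can be written directly; a sketch using the counts copied from the tables above (zero cells omitted):

```python
unigram = {"arctic": 100, "monkeys": 600, "are": 4000,
           "my": 3000, "favourite": 500, "band": 200}

# Bigram counts: outer key is the previous word, inner key the next word
bigram = {
    "arctic":    {"monkeys": 10, "are": 2},
    "monkeys":   {"are": 250, "my": 1, "favourite": 5},
    "are":       {"arctic": 3, "monkeys": 45, "my": 600, "favourite": 25, "band": 1},
    "my":        {"monkeys": 2, "my": 1, "favourite": 300, "band": 5},
    "favourite": {"monkeys": 1, "band": 50},
    "band":      {"are": 3, "my": 10},
}

# Divide each cell by the unigram count of its row
probs = {prev: {w: c / unigram[prev] for w, c in row.items()}
         for prev, row in bigram.items()}

print(probs["arctic"]["monkeys"])   # 10/100 = 0.1
print(probs["monkeys"]["are"])      # 250/600 ≈ 0.417
```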

Example: Bigram language model

x = [arctic, monkeys, are, my, favourite, band]

P(x) = P(monkeys|arctic) P(are|monkeys) P(my|are) P(favourite|my) P(band|favourite)
     = c(arctic, monkeys)/c(arctic) · ... · c(favourite, band)/c(favourite)
     = 0.1 · 0.417 · 0.15 · 0.1 · 0.1
     = 0.00006255
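The arithmetic can be checked directly:

```python
import math

# P(monkeys|arctic) · P(are|monkeys) · P(my|are) · P(favourite|my) · P(band|favourite)
print(math.prod([0.1, 0.417, 0.15, 0.1, 0.1]))  # ≈ 6.255e-05
```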

Longer contexts (N-gram LMs)

P(x | context) = P(context, x) / P(context) = c(context, x) / c(context)

In a bigram LM the context is x_{n−1}; in a trigram LM it is x_{n−2}, x_{n−1}; and so on.

The longer the context:

the more likely it is to capture long-range dependencies:
“I saw a tiger that was really very...” fierce or talkative?

the sparser the counts become (more zero probabilities)

5-grams and training sets with billions of tokens are common.
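A sketch of generalising the counting to an arbitrary n-gram order (the helper names are illustrative, not from the slides):

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams (as tuples) over tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

def ngram_prob(word, context, counts_n, counts_context):
    """P(word | context) = c(context, word) / c(context)."""
    if counts_context[context] == 0:
        return 0.0
    return counts_n[context + (word,)] / counts_context[context]

sents = [s.split() for s in ["<s> the water is clear . </s>"]]
trigrams = ngram_counts(sents, 3)
bigrams = ngram_counts(sents, 2)
print(ngram_prob("is", ("the", "water"), trigrams, bigrams))  # 1.0
```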

Unknown Words

If a word was never encountered in training, any sentence containing it will have probability 0

It happens:

all corpora are finite
new words emerging

Common solutions:

Generate unknown words in the training data by replacing low-frequency words with a special UNKNOWN token
Use classes of unknown words, e.g. names and numbers
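A sketch of the first solution: replace low-frequency training words with a special UNK token (the threshold and token name are assumptions):

```python
from collections import Counter

UNK = "<UNK>"

def apply_unk(sentences, min_count=2):
    """Replace words seen fewer than min_count times with the UNK token."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_count else UNK for tok in sent]
            for sent in sentences]

train = [s.split() for s in [
    "<s> i love playing basketball </s>",
    "<s> arctic monkeys are from sheffield </s>",
    "<s> i study in sheffield uni </s>",
]]
print(apply_unk(train)[1])
# rare words such as 'arctic' and 'monkeys' become '<UNK>';
# at test time, unseen words are mapped to '<UNK>' as well
```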

Implementation details

Dealing with large datasets requires efficiency:

use log probabilities to avoid underflow (very small numbers)
use efficient data structures for sparse counts, e.g. lossy data structures such as Bloom filters

How do we train and evaluate our language models?

We need train/dev/test data

Evaluation approaches
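A sketch of scoring in log space, so that products of many small probabilities do not underflow (bigram_prob is assumed to be any function returning P(word | prev), such as the earlier sketch):

```python
import math

def log_sentence_prob(tokens, bigram_prob):
    """Sum of log P(x_n | x_{n-1}); returns -inf if any bigram has zero probability."""
    logp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(word, prev)
        if p == 0.0:
            return float("-inf")
        logp += math.log(p)
    return logp
```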

Intrinsic Evaluation: Accuracy

How well does our LM predict the next word?

I always order pizza with cheese and...
mushrooms? bread? and?

Accuracy: how often the LM predicts the correct word

The higher the better
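A sketch of measuring this, assuming a predict_next(context) function that returns the LM’s most probable next word (hypothetical helper, not defined on the slides):

```python
def next_word_accuracy(sentences, predict_next):
    """Fraction of positions where the LM's top prediction matches the actual next word."""
    correct, total = 0, 0
    for tokens in sentences:
        for i in range(1, len(tokens)):
            if predict_next(tokens[:i]) == tokens[i]:
                correct += 1
            total += 1
    return correct / total if total else 0.0
```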

Intrinsic Evaluation: Perplexity

Perplexity: the inverse probability of the test set x = [x_1, ..., x_N], normalised by the number of words N:

PP(x) = P(x_1, ..., x_N)^{−1/N}
      = (1 / P(x_1, ..., x_N))^{1/N}
      = (1 / ∏_{i=1}^{N} P(x_i | x_1, ..., x_{i−1}))^{1/N}

Measures how well a probability distribution predicts a sample.

The lower the better.

Why is a bigram language model likely to have lower perplexity than a unigram one? There is more context to predict the next word!
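A sketch computing perplexity in log space, equivalent to the N-th root formula above (bigram_prob is assumed as before):

```python
import math

def perplexity(tokens, bigram_prob):
    """PP(x) = P(x_1..x_N)^(-1/N), computed via logs for numerical stability."""
    N = len(tokens) - 1                 # number of predicted positions under a bigram LM
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(word, prev)
        if p == 0.0:
            return float("inf")         # any zero probability gives infinite perplexity
        log_prob += math.log(p)
    return math.exp(-log_prob / N)
```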

The problem with perplexity

Doesn’t always correlate with application performance

Can’t evaluate non-probabilistic LMs

Extrinsic Evaluation

Sentence completion

Grammatical error correction: detecting “odd” sentences and proposing alternatives

Natural language generation: preferring more “natural” sentences

Speech recognition

Machine translation

Smoothing

What happens when words that are in our vocabulary appear in an unseen context in the test data?

They will be assigned zero probability

Smoothing (or discounting) to the rescue: steal from the rich and give to the poor!

Smoothing intuition

Taking from the frequent and giving to the rare (discounting)

Add-1 Smoothing

Add-1 (or Laplace) smoothing adds one to all bigram counts:

P_add-1(x_n | x_{n−1}) = (c(x_{n−1}, x_n) + 1) / (c(x_{n−1}) + |V|)

Pretend we have seen all bigrams at least once!
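A sketch of the Laplace-smoothed estimate on top of the bigram and unigram counts from earlier (vocab_size stands for |V|):

```python
def add_one_prob(word, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-1 (Laplace) smoothed bigram probability: every bigram gets a pseudo-count of 1."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

# e.g. vocab_size = len(unigram_counts); unseen bigrams now get a small non-zero probability
```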

Add-k Smoothing

Add-1 puts too much probability mass on unseen bigrams; it is better to add k, with k < 1:

P_add-k(x_n | x_{n−1}) = (c(x_{n−1}, x_n) + k) / (c(x_{n−1}) + k|V|)

k is a hyperparameter: choose its optimal value on the dev set!
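Add-k only changes the constant; a sketch of choosing k by minimising average dev-set perplexity (the candidate grid and the reuse of the perplexity helper above are assumptions):

```python
def add_k_prob(word, prev, bigram_counts, unigram_counts, vocab_size, k):
    """Add-k smoothed bigram probability."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

def tune_k(dev_sentences, bigram_counts, unigram_counts, vocab_size,
           candidates=(0.001, 0.01, 0.05, 0.1, 0.5, 1.0)):
    """Return the k from `candidates` with the lowest average dev-set perplexity."""
    def avg_perplexity(k):
        prob = lambda w, prev: add_k_prob(w, prev, bigram_counts,
                                          unigram_counts, vocab_size, k)
        return sum(perplexity(sent, prob) for sent in dev_sentences) / len(dev_sentences)
    return min(candidates, key=avg_perplexity)
```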

Interpolation

Longer contexts are more informative:

“dog bites ...” better than “bites ...”

but only if they are frequent enough:

“canid bites ...” better than “bites ...”?

Can we combine evidence from unigram, bigram and trigram probabilities?

Simple Linear Interpolation

For a trigram LM:

P_SLI(x_n | x_{n−1}, x_{n−2}) = λ_3 P(x_n | x_{n−1}, x_{n−2}) + λ_2 P(x_n | x_{n−1}) + λ_1 P(x_n)

with λ_i > 0 and ∑_i λ_i = 1

Weighted average of unigram, bigram and trigram probabilities

How do we choose the values of the λs? Parameter tuning on the dev set!
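A sketch of the interpolated estimate; the λ values below are placeholders and would be tuned on the dev set:

```python
def interpolated_prob(word, prev2, prev1, unigram_p, bigram_p, trigram_p,
                      lambdas=(0.2, 0.3, 0.5)):
    """P_SLI(word | prev2 prev1) = l1*P(word) + l2*P(word|prev1) + l3*P(word|prev2, prev1).

    The lambdas must be positive and sum to 1; the component estimators are
    passed in as functions (e.g. the MLE estimates from the earlier sketches).
    """
    l1, l2, l3 = lambdas
    return (l1 * unigram_p(word)
            + l2 * bigram_p(word, prev1)
            + l3 * trigram_p(word, prev2, prev1))
```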

Backoff

Start with an n-gram of order k, but if the counts are 0, back off to order k − 1:

BO(x_n | x_{n−1} ... x_{n−k}) =
  P(x_n | x_{n−1} ... x_{n−k}),     if c(x_{n−k} ... x_n) > 0
  BO(x_n | x_{n−1} ... x_{n−k+1}),  otherwise

Is this a probability distribution?

Backoff

No! We must discount the probabilities P* of the contexts with non-zero counts and distribute the left-over mass to the shorter-context estimates:

P_BO(x_n | x_{n−1} ... x_{n−k}) =
  P*(x_n | x_{n−1} ... x_{n−k}),                          if c(x_{n−k} ... x_n) > 0
  α_{x_{n−1}...x_{n−k}} P_BO(x_n | x_{n−1} ... x_{n−k+1}),  otherwise

α_{x_{n−1}...x_{n−k}} = β_{x_{n−1}...x_{n−k}} / ∑_{x_n} P_BO(x_n | x_{n−1} ... x_{n−k+1})

β is the left-over probability mass for the (n−k)-gram context

Absolute Discounting

Using 22M words for the training and held-out sets

Can you predict the held-out (test) set average count, given the training count?

Held-out counts ≈ training counts − 0.75 (the absolute discount)

Absolute discounting

P_AbsDiscount(x_n | x_{n−1}) = (c(x_{n−1}, x_n) − d) / c(x_{n−1}) + λ_{x_{n−1}} P(x_n)

d = 0.75; the λs are tuned to ensure we have a valid probability distribution.

This is a component of Kneser-Ney discounting:

Intuition: a word can be very frequent but only follow very few contexts,
e.g. Francisco is frequent but almost always follows San.
The unigram probability used in the bigram context should capture how likely x_n is to be a novel continuation.
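A sketch of the discounted estimate with d = 0.75; the choice of λ_{x_{n−1}} below (proportional to the number of distinct continuations of the context) is the standard way to make the distribution sum to 1, and is an assumption rather than something spelled out on the slide:

```python
def abs_discount_prob(word, prev, bigram_counts, unigram_counts, unigram_prob, d=0.75):
    """P(word|prev) = max(c(prev, word) - d, 0) / c(prev) + lambda(prev) * P(word)."""
    c_prev = unigram_counts[prev]
    if c_prev == 0:
        return unigram_prob(word)
    discounted = max(bigram_counts[(prev, word)] - d, 0) / c_prev
    # Redistribute the discounted mass: d times the number of distinct words seen after `prev`
    n_continuations = sum(1 for (p, _w) in bigram_counts if p == prev)
    lam = d * n_continuations / c_prev
    return discounted + lam * unigram_prob(word)
```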

Stupid Backoff

Do we really need probabilities? Estimating the additional parameters takes time for large corpora.

If scoring is enough, stupid backoff works adequately:

S_BO(x_n | x_{n−1} ... x_{n−k}) =
  P(x_n | x_{n−1} ... x_{n−k}),        if c(x_{n−k} ... x_n) > 0
  λ S_BO(x_n | x_{n−1} ... x_{n−k+1}),  otherwise

Empirically, λ = 0.4 was found to work well

Its authors called it stupid because they didn’t expect it to work so well!
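A sketch of stupid backoff scoring for an arbitrary order; note it returns a score S, not a probability. It assumes a single Counter mapping token tuples of any length to their counts, with the empty tuple holding the total number of tokens:

```python
def stupid_backoff(word, context, ngram_counts, lam=0.4):
    """S(word | context): relative frequency if the full n-gram was seen,
    otherwise back off to a shorter context with weight lam (0.4 works well)."""
    if not context:
        return ngram_counts[(word,)] / ngram_counts[()]   # unigram relative frequency
    full = context + (word,)
    if ngram_counts[full] > 0:
        return ngram_counts[full] / ngram_counts[context]
    return lam * stupid_backoff(word, context[1:], ngram_counts, lam)
```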

Syntax-based language models

P(binoculars|saw)

more informative than:

P(binoculars | strong, very, with, ship, the, saw)

Last words: More data defeats smarter models!

From “Large Language Models in Machine Translation”

Coming up next...

We have learned how to model word sequences using Markov models

In the following lecture we will look at how to perform part-of-speech tagging using:

the Hidden Markov Model (HMM)
Conditional Random Fields (CRFs), an extension of logistic regression for sequence modelling