Language Modeling

Michael Collins, Columbia University

Overview

I The language modeling problem

I Trigram models

I Evaluating language models: perplexity

I Estimation techniques:

I Linear interpolation
I Discounting methods

The Language Modeling Problem

I We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, . . .}

I We have an (infinite) set of strings, V†

the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP

The Language Modeling Problem (Continued)

I We have a training sample of example sentences in English

I We need to “learn” a probability distribution p, i.e., p is a function that satisfies

∑_{x∈V†} p(x) = 1,    p(x) ≥ 0 for all x ∈ V†

p(the STOP) = 10^−12

p(the fan STOP) = 10^−8

p(the fan saw Beckham STOP) = 2 × 10^−8

p(the fan saw saw STOP) = 10^−15

. . .

p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^−9

. . .

Why on earth would we want to do this?!

I Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)

I The estimation techniques developed for this problem will be VERY useful for other problems in NLP

A Naive Method

I We have N training sentences

I For any sentence x1 . . . xn, c(x1 . . . xn) is the number of times the sentence is seen in our training data

I A naive estimate:

p(x1 . . . xn) = c(x1 . . . xn) / N
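As a concrete illustration, here is a minimal Python sketch of this naive estimator (the toy corpus below is made up for the example, not taken from the lecture):

```python
from collections import Counter

# A toy training corpus: each sentence is a tuple of tokens ending in STOP.
training_sentences = [
    ("the", "STOP"),
    ("the", "fan", "STOP"),
    ("the", "fan", "STOP"),
    ("the", "fan", "saw", "Beckham", "STOP"),
]

N = len(training_sentences)           # number of training sentences
counts = Counter(training_sentences)  # c(x1 ... xn) for each whole sentence

def naive_p(sentence):
    """Naive estimate: p(x1 ... xn) = c(x1 ... xn) / N."""
    return counts[tuple(sentence)] / N

print(naive_p(("the", "fan", "STOP")))         # 0.5
print(naive_p(("the", "fan", "saw", "STOP")))  # 0.0 -- unseen sentences get probability zero
```

The last line shows the obvious weakness: any sentence not seen verbatim in training receives probability zero, which is what motivates the Markov/trigram models that follow.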

Overview

I The language modeling problem

I Trigram models

I Evaluating language models: perplexity

I Estimation techniques:

I Linear interpolation
I Discounting methods

Markov Processes

I Consider a sequence of random variables X1, X2, . . . , Xn. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100).

I Our goal: model

P (X1 = x1, X2 = x2, . . . , Xn = xn)

First-Order Markov Processes

P(X1 = x1, X2 = x2, . . . , Xn = xn)

= P(X1 = x1) ∏_{i=2}^{n} P(Xi = xi | X1 = x1, . . . , Xi−1 = xi−1)

= P(X1 = x1) ∏_{i=2}^{n} P(Xi = xi | Xi−1 = xi−1)

The first-order Markov assumption: For any i ∈ {2 . . . n}, for any x1 . . . xi,

P(Xi = xi | X1 = x1, . . . , Xi−1 = xi−1) = P(Xi = xi | Xi−1 = xi−1)

Second-Order Markov Processes

P(X1 = x1, X2 = x2, . . . , Xn = xn)

= P(X1 = x1) × P(X2 = x2 | X1 = x1) × ∏_{i=3}^{n} P(Xi = xi | Xi−2 = xi−2, Xi−1 = xi−1)

= ∏_{i=1}^{n} P(Xi = xi | Xi−2 = xi−2, Xi−1 = xi−1)

(For convenience we assume x0 = x−1 = *, where * is a special “start” symbol.)

Modeling Variable Length Sequences

I We would like the length of the sequence, n, to also be a random variable

I A simple solution: always define Xn = STOP, where STOP is a special symbol

I Then use a Markov process as before:

P(X1 = x1, X2 = x2, . . . , Xn = xn) = ∏_{i=1}^{n} P(Xi = xi | Xi−2 = xi−2, Xi−1 = xi−1)

(For convenience we assume x0 = x−1 = *, where * is a special “start” symbol.)

Trigram Language Models

I A trigram language model consists of:

1. A finite set V
2. A parameter q(w|u, v) for each trigram u, v, w such that

w ∈ V ∪ {STOP}, and u, v ∈ V ∪ {*}.

I For any sentence x1 . . . xn where xi ∈ V for i = 1 . . . (n − 1), and xn = STOP, the probability of the sentence under the trigram language model is

p(x1 . . . xn) = ∏_{i=1}^{n} q(xi | xi−2, xi−1)

where we define x0 = x−1 = *.

An Example

For the sentence

the dog barks STOP

we would have

p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
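The same computation in code: a minimal sketch, assuming a dictionary q that maps each trigram (u, v, w) to the parameter q(w | u, v) (how q is estimated is the topic of the next slides):

```python
def trigram_sentence_prob(sentence, q):
    """p(x1 ... xn) = product over i of q(xi | xi-2, xi-1), with x0 = x-1 = *.

    `sentence` is a list of tokens ending in "STOP"; `q` maps (u, v, w)
    to the parameter q(w | u, v).
    """
    padded = ["*", "*"] + list(sentence)
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= q[(u, v, w)]
    return prob

# p(the dog barks STOP) = q(the|*,*) * q(dog|*,the) * q(barks|the,dog) * q(STOP|dog,barks)
# prob = trigram_sentence_prob(["the", "dog", "barks", "STOP"], q)
```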

The Trigram Estimation Problem

Remaining estimation problem:

q(wi | wi−2, wi−1)

For example: q(laughs | the, dog)

A natural estimate (the “maximum likelihood estimate”):

q(wi | wi−2, wi−1) = Count(wi−2, wi−1, wi) / Count(wi−2, wi−1)

q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
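A minimal sketch of the maximum-likelihood estimate in Python, counting trigrams and their bigram histories over a list of tokenized sentences (each already ending in STOP). The handling of an unseen history is an arbitrary choice here, since the estimate is undefined when Count(wi−2, wi−1) = 0:

```python
from collections import Counter

def train_trigram_mle(sentences):
    """Return q_ml with q_ml(w, u, v) = Count(u, v, w) / Count(u, v)."""
    trigram_counts = Counter()
    history_counts = Counter()
    for sentence in sentences:
        padded = ["*", "*"] + list(sentence)   # x0 = x-1 = *
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            trigram_counts[(u, v, w)] += 1
            history_counts[(u, v)] += 1

    def q_ml(w, u, v):
        if history_counts[(u, v)] == 0:
            return 0.0   # undefined in the slides; returning 0.0 is just one convention
        return trigram_counts[(u, v, w)] / history_counts[(u, v)]

    return q_ml
```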

Sparse Data Problems

A natural estimate (the “maximum likelihood estimate”):

q(wi | wi−2, wi−1) = Count(wi−2, wi−1, wi) / Count(wi−2, wi−1)

q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)

Say our vocabulary size is N = |V|; then there are N^3 parameters in the model.

e.g., N = 20,000 ⇒ 20,000^3 = 8 × 10^12 parameters

Overview

I The language modeling problem

I Trigram models

I Evaluating language models: perplexity

I Estimation techniques:

I Linear interpolation
I Discounting methods

Evaluating a Language Model: Perplexity

I We have some test data, m sentences

s1, s2, s3, . . . , sm

I We could look at the probability under our model, ∏_{i=1}^{m} p(si). Or more conveniently, the log probability

log ∏_{i=1}^{m} p(si) = ∑_{i=1}^{m} log p(si)

I In fact the usual evaluation measure is perplexity

Perplexity = 2^−l where l = (1/M) ∑_{i=1}^{m} log₂ p(si)

and M is the total number of words in the test data.
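A short sketch of the perplexity computation. The log is taken base 2 to match the 2^−l definition, and M here counts every token in the test data including the STOP symbols (conventions vary on whether STOP is counted):

```python
import math

def perplexity(test_sentences, sentence_prob):
    """Perplexity = 2^-l, where l = (1/M) * sum_i log2 p(s_i).

    `test_sentences` is a list of token lists (each ending in STOP);
    `sentence_prob` is any model that returns p(s) for a sentence s.
    """
    M = sum(len(s) for s in test_sentences)
    l = sum(math.log2(sentence_prob(s)) for s in test_sentences) / M
    return 2.0 ** (-l)
```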

Some Intuition about Perplexity

I Say we have a vocabulary V, and N = |V| + 1, and a model that predicts

q(w | u, v) = 1/N

for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {*}.

I Easy to calculate the perplexity in this case:

Perplexity = 2^−l where l = log₂ (1/N) = −log₂ N

⇒ Perplexity = 2^(log₂ N) = N

Perplexity is a measure of effective “branching factor”

Typical Values of Perplexity

I Results from Goodman (“A bit of progress in language modeling”), where |V| = 50,000

I A trigram model: p(x1 . . . xn) = ∏_{i=1}^{n} q(xi | xi−2, xi−1). Perplexity = 74

I A bigram model: p(x1 . . . xn) = ∏_{i=1}^{n} q(xi | xi−1). Perplexity = 137

I A unigram model: p(x1 . . . xn) = ∏_{i=1}^{n} q(xi). Perplexity = 955

Some History

I Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game?

C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50–64, 1951.

Some History

Chomsky (in Syntactic Structures (1957)):

Second, the notion “grammatical” cannot be identified with “meaningful” or “significant” in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.

(1) Colorless green ideas sleep furiously.

(2) Furiously sleep ideas green colorless.

. . .

. . . Third, the notion “grammatical in English” cannot be identified in any way with the notion “high order of statistical approximation to English”. It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally ‘remote’ from English. Yet (1), though nonsensical, is grammatical, while (2) is not. . . .

Overview

I The language modeling problem

I Trigram models

I Evaluating language models: perplexity

I Estimation techniques:

I Linear interpolation
I Discounting methods

Sparse Data Problems

A natural estimate (the “maximum likelihood estimate”):

q(wi | wi−2, wi−1) = Count(wi−2, wi−1, wi) / Count(wi−2, wi−1)

q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)

Say our vocabulary size is N = |V|; then there are N^3 parameters in the model.

e.g., N = 20,000 ⇒ 20,000^3 = 8 × 10^12 parameters

The Bias-Variance Trade-Off

I Trigram maximum-likelihood estimate

qML(wi | wi−2, wi−1) = Count(wi−2, wi−1, wi) / Count(wi−2, wi−1)

I Bigram maximum-likelihood estimate

qML(wi | wi−1) = Count(wi−1, wi) / Count(wi−1)

I Unigram maximum-likelihood estimate

qML(wi) = Count(wi) / Count()

Linear Interpolation

I Take our estimate q(wi | wi−2, wi−1) to be

q(wi | wi−2, wi−1) = λ1 × qML(wi | wi−2, wi−1) + λ2 × qML(wi | wi−1) + λ3 × qML(wi)

where λ1 + λ2 + λ3 = 1, and λi ≥ 0 for all i.
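In code, the interpolated estimate is just a weighted sum of the three maximum-likelihood estimates. This sketch assumes three estimator functions q_ml_tri, q_ml_bi, q_ml_uni (hypothetical names) built as on the previous slide; the λ values shown are placeholders, to be estimated on held-out data as described next:

```python
def q_interp(w, u, v, q_ml_tri, q_ml_bi, q_ml_uni, lambdas=(0.5, 0.3, 0.2)):
    """q(w | u, v) = l1*q_ML(w | u, v) + l2*q_ML(w | v) + l3*q_ML(w)."""
    l1, l2, l3 = lambdas
    assert min(lambdas) >= 0 and abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * q_ml_tri(w, u, v) + l2 * q_ml_bi(w, v) + l3 * q_ml_uni(w)
```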

Linear Interpolation (continued)

Our estimate correctly defines a distribution (define V′ = V ∪ {STOP}):

∑_{w∈V′} q(w | u, v)

= ∑_{w∈V′} [λ1 × qML(w | u, v) + λ2 × qML(w | v) + λ3 × qML(w)]

= λ1 ∑_w qML(w | u, v) + λ2 ∑_w qML(w | v) + λ3 ∑_w qML(w)

= λ1 + λ2 + λ3

= 1

(Can show also that q(w | u, v) ≥ 0 for all w ∈ V ′)

How to estimate the λ values?

I Hold out part of training set as “validation” data

I Define c′(w1, w2, w3) to be the number of times the trigram (w1, w2, w3) is seen in the validation set

I Choose λ1, λ2, λ3 to maximize:

L(λ1, λ2, λ3) = ∑_{w1,w2,w3} c′(w1, w2, w3) log q(w3 | w1, w2)

such that λ1 + λ2 + λ3 = 1, and λi ≥ 0 for all i, and where

q(wi | wi−2, wi−1) = λ1 × qML(wi | wi−2, wi−1) + λ2 × qML(wi | wi−1) + λ3 × qML(wi)
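The slides leave the optimization method open. One standard way to maximize this objective over the simplex is EM, treating the three estimates as mixture components; the sketch below is that approach, not necessarily the one used in the lecture. Here `validation_counts` maps each trigram (w1, w2, w3) to c′(w1, w2, w3), and the three q_ml_* functions are as before:

```python
def estimate_lambdas(validation_counts, q_ml_tri, q_ml_bi, q_ml_uni, iterations=50):
    """Find (l1, l2, l3) that (locally) maximize
    sum over trigrams of c'(w1,w2,w3) * log q(w3 | w1, w2)."""
    lambdas = [1.0 / 3.0] * 3
    for _ in range(iterations):
        expected = [0.0, 0.0, 0.0]
        for (w1, w2, w3), c in validation_counts.items():
            parts = [lambdas[0] * q_ml_tri(w3, w1, w2),
                     lambdas[1] * q_ml_bi(w3, w2),
                     lambdas[2] * q_ml_uni(w3)]
            total = sum(parts)
            if total == 0.0:
                continue
            for j in range(3):
                expected[j] += c * parts[j] / total      # E-step: fractional counts
        total_expected = sum(expected)
        if total_expected == 0.0:
            break
        lambdas = [e / total_expected for e in expected]  # M-step: renormalize
    return lambdas
```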

Allowing the λ’s to vary

I Take a function Π that partitions histories, e.g.,

Π(wi−2, wi−1) =
  1  If Count(wi−1, wi−2) = 0
  2  If 1 ≤ Count(wi−1, wi−2) ≤ 2
  3  If 3 ≤ Count(wi−1, wi−2) ≤ 5
  4  Otherwise

I Introduce a dependence of the λ’s on the partition:

q(wi | wi−2, wi−1) = λ1^Π(wi−2,wi−1) × qML(wi | wi−2, wi−1)
                   + λ2^Π(wi−2,wi−1) × qML(wi | wi−1)
                   + λ3^Π(wi−2,wi−1) × qML(wi)

where λ1^Π(wi−2,wi−1) + λ2^Π(wi−2,wi−1) + λ3^Π(wi−2,wi−1) = 1, and λi^Π(wi−2,wi−1) ≥ 0 for all i.
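A small sketch of the bucketing idea: Π maps a history to one of four buckets based on how often that history was seen in training, and each bucket gets its own λ vector. The per-bucket weights below are illustrative placeholders; in practice each bucket's weights are estimated on validation data exactly as before:

```python
def pi_bucket(history_count):
    """Pi(w_{i-2}, w_{i-1}): bucket the history by its training count."""
    if history_count == 0:
        return 1
    if history_count <= 2:
        return 2
    if history_count <= 5:
        return 3
    return 4

# Illustrative per-bucket weights (lambda_1, lambda_2, lambda_3) -- placeholders only.
BUCKET_LAMBDAS = {
    1: (0.0, 0.4, 0.6),   # unseen history: rely on bigram/unigram estimates
    2: (0.2, 0.4, 0.4),
    3: (0.4, 0.4, 0.2),
    4: (0.7, 0.2, 0.1),   # frequent history: trust the trigram estimate more
}

def q_bucketed(w, u, v, q_ml_tri, q_ml_bi, q_ml_uni, history_count):
    """Interpolated estimate whose lambdas depend on the bucket of (u, v)."""
    l1, l2, l3 = BUCKET_LAMBDAS[pi_bucket(history_count)]
    return l1 * q_ml_tri(w, u, v) + l2 * q_ml_bi(w, v) + l3 * q_ml_uni(w)
```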

Overview

I The language modeling problem

I Trigram models

I Evaluating language models: perplexity

I Estimation techniques:

I Linear interpolation
I Discounting methods

Discounting Methods

I Say we’ve seen the following counts:

x                 Count(x)    qML(wi | wi−1)
the               48
the, dog          15          15/48
the, woman        11          11/48
the, man          10          10/48
the, park         5           5/48
the, job          2           2/48
the, telescope    1           1/48
the, manual       1           1/48
the, afternoon    1           1/48
the, country      1           1/48
the, street       1           1/48

I The maximum-likelihood estimates are high (particularly for low-count items)

Discounting Methods

I Now define “discounted” counts, Count*(x) = Count(x) − 0.5

I New estimates:

x                 Count(x)    Count*(x)    Count*(x)/Count(the)
the               48
the, dog          15          14.5         14.5/48
the, woman        11          10.5         10.5/48
the, man          10          9.5          9.5/48
the, park         5           4.5          4.5/48
the, job          2           1.5          1.5/48
the, telescope    1           0.5          0.5/48
the, manual       1           0.5          0.5/48
the, afternoon    1           0.5          0.5/48
the, country      1           0.5          0.5/48
the, street       1           0.5          0.5/48

Discounting Methods (Continued)

I We now have some “missing probability mass”:

α(wi−1) = 1 − ∑_w Count*(wi−1, w) / Count(wi−1)

e.g., in our example, α(the) = 10 × 0.5/48 = 5/48
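A sketch of the discounting computation for the bigram case above. Here `bigram_counts` maps (wi−1, wi) to Count(wi−1, wi) for the seen bigrams, `history_counts` maps wi−1 to Count(wi−1), and the discount value 0.5 follows the example:

```python
def discounted_estimates(bigram_counts, history_counts, discount=0.5):
    """Return the discounted estimates Count*(x)/Count(w) and the
    missing probability mass alpha(w) for each history word w."""
    q_disc = {}
    for (prev, w), c in bigram_counts.items():
        q_disc[(prev, w)] = (c - discount) / history_counts[prev]

    alpha = {}
    for prev in history_counts:
        used = sum(p for (h, _), p in q_disc.items() if h == prev)
        alpha[prev] = 1.0 - used   # alpha(w_{i-1}) = mass freed up by discounting
    return q_disc, alpha

# With the counts from the table, alpha["the"] == 10 * 0.5 / 48 == 5/48.
```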

Katz Back-Off Models (Bigrams)

I For a bigram model, define two sets

A(wi−1) = {w : Count(wi−1, w) > 0}
B(wi−1) = {w : Count(wi−1, w) = 0}

I A bigram model

qBO(wi | wi−1) =

  Count*(wi−1, wi) / Count(wi−1)                 If wi ∈ A(wi−1)

  α(wi−1) × qML(wi) / ∑_{w∈B(wi−1)} qML(w)       If wi ∈ B(wi−1)

where

α(wi−1) = 1 − ∑_{w∈A(wi−1)} Count*(wi−1, w) / Count(wi−1)
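A sketch of the bigram back-off estimate. It assumes `bigram_counts` only stores bigrams with positive counts, takes `q_ml_uni` to be the unigram maximum-likelihood estimator, and passes the vocabulary V ∪ {STOP} explicitly as `vocab`; for realistic vocabularies the set B(wi−1) and its unigram mass would be precomputed rather than rebuilt per query:

```python
def q_bo_bigram(w, prev, bigram_counts, history_counts, q_ml_uni, vocab, discount=0.5):
    """Katz back-off bigram estimate q_BO(w | prev).

    `history_counts[prev]` is Count(prev); `vocab` is V plus the STOP symbol.
    """
    c_prev = history_counts[prev]
    if bigram_counts.get((prev, w), 0) > 0:                  # w in A(prev)
        return (bigram_counts[(prev, w)] - discount) / c_prev

    # w in B(prev): share the missing mass alpha(prev) in proportion to q_ML(w).
    seen = {x for (h, x) in bigram_counts if h == prev}      # A(prev)
    alpha = 1.0 - sum((bigram_counts[(prev, x)] - discount) / c_prev for x in seen)
    unseen_mass = sum(q_ml_uni(x) for x in vocab if x not in seen)
    return alpha * q_ml_uni(w) / unseen_mass
```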

Katz Back-Off Models (Trigrams)

I For a trigram model, first define two sets

A(wi−2, wi−1) = {w : Count(wi−2, wi−1, w) > 0}
B(wi−2, wi−1) = {w : Count(wi−2, wi−1, w) = 0}

I A trigram model is defined in terms of the bigram model:

qBO(wi | wi−2, wi−1) =

  Count*(wi−2, wi−1, wi) / Count(wi−2, wi−1)                           If wi ∈ A(wi−2, wi−1)

  α(wi−2, wi−1) × qBO(wi | wi−1) / ∑_{w∈B(wi−2,wi−1)} qBO(w | wi−1)    If wi ∈ B(wi−2, wi−1)

where

α(wi−2, wi−1) = 1 − ∑_{w∈A(wi−2,wi−1)} Count*(wi−2, wi−1, w) / Count(wi−2, wi−1)
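The trigram case in the same style, recursing into the q_bo_bigram sketch above for the unseen trigrams. Here `trigram_history_counts` maps (wi−2, wi−1) to Count(wi−2, wi−1), including the * padding, and as before the per-query recomputation of B(wi−2, wi−1) is only for clarity:

```python
def q_bo_trigram(w, u, v, trigram_counts, trigram_history_counts,
                 bigram_counts, history_counts, q_ml_uni, vocab, discount=0.5):
    """Katz back-off trigram estimate q_BO(w | u, v)."""
    c_uv = trigram_history_counts.get((u, v), 0)
    if trigram_counts.get((u, v, w), 0) > 0:                 # w in A(u, v)
        return (trigram_counts[(u, v, w)] - discount) / c_uv

    # w in B(u, v): back off to the bigram model, scaled by the missing mass.
    seen = {x for (a, b, x) in trigram_counts if (a, b) == (u, v)}
    alpha = 1.0 - sum((trigram_counts[(u, v, x)] - discount) / c_uv for x in seen)
    backoff = {x: q_bo_bigram(x, v, bigram_counts, history_counts,
                              q_ml_uni, vocab, discount)
               for x in vocab if x not in seen}
    return alpha * backoff[w] / sum(backoff.values())
```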

Summary

I Three steps in deriving the language model probabilities:

1. Expand p(w1, w2 . . . wn) using the chain rule.
2. Make Markov independence assumptions:
   p(wi | w1, w2, . . . , wi−2, wi−1) = p(wi | wi−2, wi−1)
3. Smooth the estimates using low-order counts.

I Other methods used to improve language models:

I “Topic” or “long-range” features.
I Syntactic models.

It’s generally hard to improve on trigram models though!!