
Introduction to Language Modeling

Alex Acero


Acknowledgments

• Joshua Goodman, Scott MacKenzie for many slides


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework



Probability – definition

P(X) means the probability that X is true
P(baby is a boy) ≈ 0.5 (% of total that are boys)

P(baby is named John) ≈ 0.001 (% of total named John)

[Venn diagram: "John" inside "Baby boys" inside "Babies"]


Joint probabilities

• P(X, Y) means probability that X and Y are both true, e.g. P(brown eyes, boy)

[Venn diagram: "Babies", "Baby boys", "Brown eyes", and "John" as overlapping sets]


Conditional probabilities

• P(X|Y) means probability that X is true when we already know Y is true

– P(baby is named John | baby is a boy) ≈ 0.002
– P(baby is a boy | baby is named John) ≈ 1


Bayes rule

P(X|Y) = P(X, Y) / P(Y)

P(baby is named John | baby is a boy)

=P(baby is named John, baby is a boy) / P(baby is a boy)

= 0.001 / 0.5 = 0.002



Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


90% Removed

k r ce r oz y t a e t h c o , d o a a o b a a e g v t a l a ss m n t i s n f d w i t l h e n s - w le e a r i w f e n t e h r w e , h v e r d l Wi


80% Removed

D r s r s n h s d f w e e b e i e in th w i r o a t e t ar e t e k i t i c i ver e . as d on, ho ve n o n an o h t h sp r f a e of s a h o a u o e e n a n s - au h m r s s o h l e ld a e s n in i f li w h mas u i e i m g in i i o i t e o t w t a g z a N n d


70% Removed

D uc ore frown d on i h r id n ter a h t ee a ent o e whi e in ffro y seeme t l a s ac c s e ad A l ei d h h lf wa s a s, th mo t, a ha th p o v at sa s e a a i i g t aught m r t h y ne s a g t at wa m es e e o i laug s e fr s p k f h gr nes i f t e fu an i c u l s m of r i u l f a he ef rt o il , v , ro ar t land il


60% Removed

k ruc res f w n th r id t e fr z n t r y. T r d stri d b a rece t n t w i cove n o r s a d t e e e to e n h ot er, l o no s e ad ng l g . A st n eig d v el d. h n t e w a d s ti , i es , wi mo e t, so a d co t a pi it i as ev hat dn ss he a n it a ht r b t f l u e te bl t n ess a u e t r e sthe sm e t p i ugh o d a he fr d p ta in h g ne s llib l It as e a ful d mm c le w e nity lau h g a t e l t o and ff t . I a he W d s ge fr z a e N th a W ld


50% Removed

D k r ce fore t o on i h r si the fr en wa erw tr s ha ee i ed y a ec nt wi d f the r whit erin o f an e e med t lean t war ch o er, ck a d mi ou in g i t. t le r ig d r hel d la d t el as d so i if , w t o m v ent, s lone d d hat h spi t f t was t ven th sa e s h as a h nt n i f aug t , t f l u t mo e ter ble than an a ss - lau ter hat as m h ast s ile f he ph n a ghte ol as he os a d r a ing e grimn ss o n a il y. a h m s er u in om a i d e it laug n a t fu i y f l e he ff t of if . It he W l , t e s ag , f e - ed N r n ld


40% Removed

Dark s r c res ow e n eit er side the froze w erwa . T trees h d b n tr p d b c t w n o heir white c v ring off s and h y s med e n towa ds e c o her la k and i o s, i e f l gh . A as ilence reig ed o he and. The lan e f a a de l tion, life e s, i mo ment, so n nd cold h the pi t o it s n t ev n hato sa nes . h e wa hin n it of laughter, bu a lau h r e t rib e a dn - laught r a was mi hle s asthe i e of t e s a la hte ol as he ro t a d ar ak n f th g mne of inf llib i . It wa he mas erful n incom un b e w s om of t rnity l ugh n at e lity f fe nd e effort f ife. I as t e , savage, fro n-h a r hland Wi d.


30% Removed

Da s ruce fores r on ith r i e the froz n ater ay Thet s ha een st ippe b re e wi d f heir white c vering of ost, an they e d o ean towards eac othe , lac a dom nou , n t e f in ight. A vast il nc re ned o e hela d. The l d s l wa d s a ion, if s, withoutm men o e and old that t e pi t of it was not e en tha o sadnes T ere s a hint in t of laug r, ut o a laugh e o er ble t n a y sadn s a la ghte at as irthless sthe mil o h phi x, a ghter cold as the fro t a d par ak ng f the grim es of n al b l ty was th m s r ul and co mu i able wi m of ter la ghi t t e futil ty o if an the ef rt of li e. It as the Wild, h avage, froze -hearte Northland Wil .


20% Removed

Dark spruce forest frowned n either side the roze ate wa . Thetrees had ee stripp d by ecent wi d f thei white coverin offrost d they eemed to lean towa ds e ch o h r, bl ck and mino s, in the ad ng l ght. A vas sil n e reigned over theland. The land i sel was e olatio , ifeless, wit outmovem n , s lon n cold ha e spi it of s n t eve hat adn s. The e was hi i it of a ght , but f a laughter ore ible t an any s ne s - a laughter t a was mi hless as he mile of he sphinx, a aug ter col as the f ost nd arta in of th grimn ss of i fall bility It as the asterful andinc mmunicabl wisdo of e ernity ugh ng at the futilit of li and t e for of ife. It w the Wild, the avage, rozen-hea t d N rthla d W ld.


10% Removed

Dark s ru e forest frowned on either side the frozen waterw y. Thetrees had bee stripped by a recent ind of t eir w ite covering of rost, and they seemed o lean towards each ot er, black andominous, in the fading li h . A vast silence reigned ver theland The land itself w s a deso ation, lifel ss, without ovement, s l n and cold hat the spiri of it wa not even thatof sa ness. here was a hint in it of laughte , but of a l ug termore terrible than any sadness - a ughte that as mirthless s he smile of the sphinx a laug ter cold as the rost a d p rt k ngof the grimness of infal i ility. It as the masterful andincommunica le wisdom of eternity laughing at th futility of lifean the effort f life. It was e Wild, the sa ag , froz n- earte Northland Wild.


0% Removed

Dark spruce forest frowned on either side the frozen waterway. The trees had been stripped by a recent wind of their white covering of frost, and they seemed to lean towards each other, black and ominous, in the fading light. A vast silence reigned over the land. The land itself was a desolation, lifeless, without movement, so lone and cold that the spirit of it was not even that of sadness. There was a hint in it of laughter, but of a laughter more terrible than any sadness - a laughter that was mirthless as the smile of the sphinx, a laughter cold as the frost and partaking of the grimness of infallibility. It was the masterful and incommunicable wisdom of eternity laughing at the futility of life and the effort of life. It was the Wild, the savage, frozen-hearted Northland Wild.

From Jack London’s “White Fang”


Language as Information

• Information is commonly measured in "bits"
• Since language is highly redundant, perhaps it can be viewed somewhat like information
• Can language be measured or coded in "bits"?
• Sure. Examples include…
  – ASCII (7 bits per "symbol")
  – Unicode (16 bits per symbol)
  – But coding schemes, such as ASCII or Unicode, do not account for the redundancy in the language
• Two questions:
  1. Can a coding system be developed for a language (e.g., English) that accounts for the redundancy in the language?
  2. If so, how many bits per symbol are required?


How many bits?

• ASCII codes have seven bits
• 2^7 = 128 codes
• Codes include…
  – 33 control codes
  – 95 symbols, including 26 uppercase letters, 26 lowercase letters, space, 42 "other" symbols
• In general, if we have n symbols, the number of bits to encode them is log2(n) (Note: log2(128) = 7)
• What about bare bones English – 26 letters plus space?
• How many bits?


How many bits? (2)

• It takes log2(27) ≈ 4.75 bits/character to encode bare bones English
• But, what about redundancy in English?
• Since English is highly redundant, is there a way to encode letters in fewer bits?
  – Yes
• How many bits?
  – The answer (drum roll please)…


How many bits? (3)

• The minimum number of bits to encode English is (approximately)…
  – 1 bit/character
• How is this possible?
  – E.g., Huffman coding
  – n-grams
• More importantly, how is this answer computed?
• Want to learn how? Read…

Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30, 51-64.


Disambiguation

• A special case of prediction is disambiguation
• Consider the telephone keypad…
• Is it possible to enter text using this keypad?
• Yes. But the keys are ambiguous.


Ambiguity Continuum

[Continuum from less ambiguity to more ambiguity: 53 keys → 27 keys → 8 keys → 1 key]


Coping With Ambiguity

• There are two approaches to disambiguating the telephone keypad
  – Explicit
    • Use additional keys or keystrokes to select the desired letter
    • E.g., multitap
  – Implicit
    • Add "intelligence" (i.e., a language model) to the interface to guess the intended letter
    • E.g., T9, LetterWise


Multitap

• Press a key once for the 1st letter, twice for the 2nd letter, and so on

• Example...

84433.778844422255.22777666966.33366699. → "the quick brown fox"

58867N7777.66688833777.84433.55529999N999.36664. → "jumps over the lazy dog"

But, there is a problem. When consecutive letters are on the same key, additional disambiguation is needed. Two techniques: (i) timeout, (ii) a special “next letter” (N) key1 to explicitly segment the letters. (See above: “ps” in “jumps” and “zy” in “lazy”).

1 Nokia phones: timeout = 1.5 seconds, “next letter” = down-arrow
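To make the multitap scheme concrete, here is a minimal Python sketch (not from the slides; it assumes the standard 12-key phone layout) that converts a word into multitap key presses, inserting the "next letter" marker N when consecutive letters share a key:

    # Minimal multitap encoder sketch (illustrative, not from the slides).
    KEYS = {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
            "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}
    LETTER_TO_KEY = {c: k for k, letters in KEYS.items() for c in letters}

    def multitap(word):
        """Encode a lowercase word as multitap key presses.

        Each letter is its key repeated once per position on that key; 'N' marks
        the explicit "next letter" key needed when two consecutive letters share
        a key (e.g. the 'p'/'s' in "jumps").
        """
        out, prev_key = [], None
        for c in word:
            key = LETTER_TO_KEY[c]
            if key == prev_key:
                out.append("N")
            out.append(key * (KEYS[key].index(c) + 1))
            prev_key = key
        return "".join(out)

    print(multitap("the"))    # 84433
    print(multitap("jumps"))  # 58867N7777
    print(multitap("lazy"))   # 55529999N999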


T9

• Product of Tegic Communications (www.tegic.com), a subsidiary of Nuance Communications

• Licensed to many mobile phone companies
• The idea is simple:
  – one key = one character

• A language model works “behind the scenes” to disambiguate

• Example (next slide)...


Guess the Word

Number of word stems to consider…

3 x 3 x 3 x 4 x 3 x 3 x 3 x 4 = 11,664

C O M P U T E R


“Quick Brown Fox” Using T9

843.78425.27696.369.58677.6837.843.5299.364. → "the quick brown fox jumps over the lazy dog"

But, there is a problem. The key sequences are ambiguous and other words may exist for some sequences. See below.

Candidate words for each key sequence, in decreasing probability:
843 → the, tie, vie
78425 → quick, stick
27696 → brown, crown
369 → fox
58677 → jumps, lumps
6837 → over, muds
843 → the, tie, vie
5299 → jazz, lazy
364 → dog, fog
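A minimal sketch of T9-style implicit disambiguation (illustrative only; the tiny word list and frequency counts below are invented and merely stand in for a real language model): each dictionary word is indexed by its key sequence, and a typed sequence returns the matching words ranked by frequency.

    # T9-style disambiguation sketch with a tiny made-up lexicon (illustrative only).
    LETTER_TO_KEY = {c: k for k, letters in
                     {"2": "abc", "3": "def", "4": "ghi", "5": "jkl",
                      "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz"}.items()
                     for c in letters}

    # Hypothetical unigram counts standing in for a real language model.
    LEXICON = {"the": 500, "tie": 40, "vie": 2,
               "lazy": 30, "jazz": 25, "dog": 60, "fog": 15}

    def key_sequence(word):
        return "".join(LETTER_TO_KEY[c] for c in word)

    INDEX = {}
    for word, count in LEXICON.items():
        INDEX.setdefault(key_sequence(word), []).append((count, word))

    def candidates(seq):
        """Words matching a key sequence, most frequent first."""
        return [w for _, w in sorted(INDEX.get(seq, []), reverse=True)]

    print(candidates("843"))   # ['the', 'tie', 'vie']
    print(candidates("5299"))  # ['lazy', 'jazz'] -- whichever count is larger wins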


Keystrokes Per Character (KSPC)

• Earlier examples used the “quick brown fox” phrase, which has 44 characters (including one character after each word)

• Multitap and T9 require very different keystroke sequences
• Compare…

Method     Keystrokes   Characters   Keystrokes per Character (KSPC)
Qwerty     44           44           1.000
Multitap   88           44           2.000
T9         45           44           1.023

Phrase: the quick brown fox jumps over the lazy dog.


Formulas


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


Speech Recognition

[Block diagram of a speech recognition system: input speech ("Hello World") goes through feature extraction into pattern classification (decoding, search), which draws on an acoustic model, a word lexicon, and a language model, followed by confidence scoring of the output words (0.9, 0.8). Surrounding spoken-dialog components: ASR, SLU, DM, SLG, TTS.]


Language Modeling in ASR

• Some sequences of words sound alike, but not all of them are good English sentences.
  – I went to a party
  – Eye went two a bar tea

Rudolph the Red Nose reigned here.

Rudolph the Red knows rain, dear.

Rudolph the red nose reindeer.

$\hat{W} = \arg\max_{Words} P(Words \mid Acoustics) = \arg\max_{Words} P(Acoustics \mid Words)\,P(Words)$


Language Modeling in ASR

• This lets the recognizer make the right guess when two different sentences sound the same.

For example:• It’s fun to recognize speech?• It’s fun to wreck a nice beach?


Humans have a Language Model

The ultimate goal is that a speech recognizer performs as well as a human being.

A lot of research has been done in psychology:
• The *eel was on the shoe
• The *eel was on the car

People are capable of adjusting to the right context, which
• removes ambiguities
• limits possible words

Very good language models already exist for dedicated applications (e.g. medical, where there is a lot of standardization)


A bad language model




[Herman cartoon, reprinted with permission from Laughing Stock Licensing Inc., Ottawa, Canada. All rights reserved.]




What’s a Language Model

• A language model is a probability distribution over word sequences

• P(“And nothing but the truth”) ≈ 0.001

• P(“And nuts sing on the roof”) ≈ 0


What’s a language model for?

• Speech recognition
• Machine translation
• Handwriting recognition
• Spelling correction
• Optical character recognition
• Typing in Chinese or Japanese

• (and anyone doing statistical modeling)


How Language Models work

Hard to compute P(“And nothing but the truth”)

Step 1: Decompose the probability (chain rule)
P(“And nothing but the truth”)
= P(“And”) x P(“nothing” | “And”) x P(“but” | “And nothing”) x P(“the” | “And nothing but”) x P(“truth” | “And nothing but the”)

Step 2: Approximate with trigrams
P(“And nothing but the truth”)
≈ P(“And”) x P(“nothing” | “And”) x P(“but” | “And nothing”) x P(“the” | “nothing but”) x P(“truth” | “but the”)
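A minimal sketch of how the decomposition and the trigram approximation turn into code, using maximum-likelihood estimates from a tiny invented corpus (the corpus and the <s> padding convention are illustrative assumptions):

    # Sketch: score a sentence with the trigram approximation (toy corpus, illustrative).
    from collections import Counter

    corpus = [["and", "nothing", "but", "the", "truth"],
              ["nothing", "but", "the", "best"]]

    trigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>", "<s>"] + sent
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            bigrams[tuple(padded[i - 2:i])] += 1

    def p(w3, w1, w2):
        """Maximum-likelihood estimate P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)."""
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

    def sentence_prob(words):
        """P(w1..wn) approximated by the product of P(wi | wi-2 wi-1)."""
        padded = ["<s>", "<s>"] + words
        prob = 1.0
        for i in range(2, len(padded)):
            prob *= p(padded[i], padded[i - 2], padded[i - 1])
        return prob

    print(sentence_prob(["and", "nothing", "but", "the", "truth"]))  # 0.25 on this toy corpus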


example

How do we find probabilities?

Get real text, and start counting!

P(“the” | “nothing but”) = C(“nothing but the”) / C(“nothing but”)

Training set:

“John read her book”

“I read a different book”

“John read a book by Mulan”

P(John | <s>) = C(<s>, John) / C(<s>) = 2/3
P(read | John) = C(John, read) / C(John) = 2/2
P(a | read) = C(read, a) / C(read) = 2/3
P(book | a) = C(a, book) / C(a) = 1/2
P(</s> | book) = C(book, </s>) / C(book) = 2/3


example

These bigram probabilities help us estimate the probability for the sentence as:

P(John read a book)

= P(John|<s>) P(read|John) P(a|read) P(book|a) P(</s>|book)

= 0.148

Then the cross-entropy is: -(1/4) log2(0.148) = 0.689

So the perplexity is 2^0.689 = 1.61

Comparison: Wall Street Journal text (5000 words) has a bigram perplexity of 128
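A short sketch that reproduces this worked example: bigram maximum-likelihood estimates from the three training sentences, then the sentence probability, cross-entropy, and perplexity (requires Python 3.8+ for math.prod):

    # Reproduce the bigram example: P(John read a book) = 0.148, perplexity = 1.61.
    import math
    from collections import Counter

    training = ["John read her book",
                "I read a different book",
                "John read a book by Mulan"]

    bigrams, unigrams = Counter(), Counter()
    for line in training:
        words = ["<s>"] + line.split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            bigrams[(w1, w2)] += 1
            unigrams[w1] += 1          # counts of w1 as a bigram history

    def p(w2, w1):
        """Maximum-likelihood bigram estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
        return bigrams[(w1, w2)] / unigrams[w1]

    sentence = ["<s>", "John", "read", "a", "book", "</s>"]
    prob = math.prod(p(w2, w1) for w1, w2 in zip(sentence, sentence[1:]))
    cross_entropy = -math.log2(prob) / 4       # 4 words, as on the slide
    print(round(prob, 3), round(cross_entropy, 3), round(2 ** cross_entropy, 2))
    # -> 0.148 0.689 1.61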


N-gram example

To calculate this probability, we need to compute both the number of times "am" is preceded by "I" and the number of times "here" is preceded by "I am."

All four sound the same; the right decision can only be made by the language model.


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


Evaluation

• How can you tell a good language model from a bad one?
• Run a machine translation system, a speech recognizer (or your application of choice), and calculate the word error rate
  – Slow
  – Specific to your system


Evaluation: Perplexity Intuition

• Ask a speech recognizer to recognize digits: “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10

• Ask a speech recognizer to recognize names at Microsoft – hard – 30,000 names – perplexity 30,000

• Ask a speech recognizer to recognize “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000) each – perplexity 54

• Perplexity is weighted equivalent branching factor.


Evaluation: perplexity

• "A, B, C, D, E, F, G…Z": perplexity is 26
• "Alpha, bravo, charlie, delta…yankee, zulu": perplexity is 26
• Perplexity measures language model difficulty, not acoustic difficulty
• High perplexity means that the number of words branching from a previous word is larger on average
• Low perplexity does not guarantee good performance
• For example, B, C, D, E, G, P, T has perplexity 7 but does not take into account acoustic confusability


Perplexity: Math

• Perplexity is the geometric average inverse probability
• Imagine "Operator" (1 in 4), "Technical support" (1 in 4), "sales" (1 in 4), 30,000 names (1 in 120,000)
• Model thinks all probabilities are equal (1/30,003)
• Average inverse probability is 30,003

$Perplexity = \left[ \prod_{i=1}^{n} P(w_i \mid w_{1:i-1}) \right]^{-1/n}$


Perplexity: Math

• Imagine “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000)

• The correct model gives these probabilities
• ¾ of the time it assigns probability ¼; ¼ of the time it assigns probability 1/120,000

• Perplexity is 54 (compare to 30,003 for simple model)

• Remarkable fact: the true model for data has the lowest possible perplexity
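A small sketch checking both numbers: the uniform model over all 30,003 outcomes, and the correct model that assigns ¼ to each of the three phrases and 1/120,000 to each name (the exact geometric average comes out near 53, i.e. the roughly 54 quoted above):

    # Perplexity = 2**H, where H is the average -log2(model probability) under the data.
    import math

    def perplexity(true_probs, model_probs):
        """Perplexity of model_probs on data distributed according to true_probs."""
        h = -sum(p * math.log2(q) for p, q in zip(true_probs, model_probs))
        return 2 ** h

    # "Operator", "Technical support", "sales" (1/4 each) and 30,000 names (1/120,000 each).
    true = [1 / 4] * 3 + [1 / 120000] * 30000

    uniform = [1 / 30003] * 30003              # model that thinks everything is equally likely
    print(round(perplexity(true, uniform)))    # 30003
    print(round(perplexity(true, true)))       # 53, i.e. the ~54 quoted on the slide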


Perplexity: Is lower better?

• Remarkable fact: the true model for data has the lowest possible perplexity

• The lower the perplexity, the closer we are to the true model.

• Typically, perplexity correlates well with speech recognition word error rate

– Correlates better when both models are trained on same data

– Doesn’t correlate well when training data changes


Perplexity: The Shannon Game

• Ask people to guess the next letter, given context. Compute perplexity.

– (when we get to entropy, the "100" row corresponds to the "1 bit per character" estimate)

Char n-gram   Low char   Upper char   Low word   Upper word
1             9.1        16.3         191,237    4,702,511
5             3.2        6.5          653        29,532
10            2.0        4.3          45         2,998
15            2.3        4.3          97         2,998
100           1.5        2.5          10         142


Homework

• Write a program to estimate the Entropy of written text (Shannon, 1950)

• Input: a text document (you pick it; the larger the better)
• Write a program that predicts the next letter given the past letters, on a different text (make it interactive?)
  – Hint: use character n-grams
• Check that it does a perfect job on the training text
• Due 5/21
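For orientation only (the homework itself is yours), a minimal character-trigram sketch of the kind of predictor involved; the training file name train.txt is an assumption:

    # Minimal character-trigram next-letter predictor (a sketch, not a full solution).
    from collections import Counter, defaultdict

    N = 3  # order of the character n-gram

    with open("train.txt", encoding="utf-8") as f:   # any large text file you pick
        text = f.read().lower()

    counts = defaultdict(Counter)
    for i in range(len(text) - N + 1):
        context, nxt = text[i:i + N - 1], text[i + N - 1]
        counts[context][nxt] += 1

    def predict(context):
        """Most likely next character given the last N-1 characters seen in training."""
        history = context[-(N - 1):]
        if history in counts:
            return counts[history].most_common(1)[0][0]
        return " "  # back off to a space when the context was never seen

    print(predict("th"))  # usually 'e' on English text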


Evaluation: entropy

• Entropy = log2(perplexity)

$= -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_{1:i-1})$

• Should really be called the "cross-entropy of the model on test data"
• Remarkable fact: entropy is the average number of bits per word required to encode the test data using this probability model and an optimal coder.


perplexity

Encode text W using -log2 P(W) bits.

Then the cross-entropy H(W) is:

$H(W) = -\frac{1}{N} \log_2 P(W)$

where N is the length of the text.

The perplexity is then defined as:

$PP(W) = 2^{H(W)}$


Word Rank vs. Probability

[Figure: word probability vs. word rank, both on log scales (probabilities from about 0.000001 to 1, ranks from 1 to 10,000)]

Hmm… There appears to be a relationship between word rank and word probability. Plotting both on log scales, as above, reveals a linear, or straight line, relationship. How strong is this relation? (next slide)


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


Smoothing: None

• Called the Maximum Likelihood estimate
• Lowest-perplexity trigram on training data
• Terrible on test data: if C(xyz) = 0, the probability is 0

$P(z \mid xy) = \frac{C(xyz)}{C(xy)} = \frac{C(xyz)}{\sum_w C(xyw)}$


Smoothing: Add One

• What is P(sing|nuts)? Zero? Leads to infinite perplexity!
• Add-one smoothing:

$P(z \mid xy) = \frac{C(xyz) + 1}{C(xy) + V}$

Works very badly. DO NOT DO THIS

• Add-delta smoothing:

$P(z \mid xy) = \frac{C(xyz) + \delta}{C(xy) + \delta V}$

Still very bad. DO NOT DO THIS


Smoothing: Simple Interpolation

• Trigram is very context specific, very noisy
• Unigram is context-independent, smooth
• Interpolate trigram, bigram, and unigram for the best combination
• Find 0 < λ, μ < 1 by optimizing on "held-out" data
• Almost good enough

$P_{interp}(z \mid xy) = \lambda \frac{C(xyz)}{C(xy)} + \mu \frac{C(yz)}{C(y)} + (1 - \lambda - \mu) \frac{C(z)}{\sum_w C(w)}$
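A sketch of the interpolation above, assuming you already have trigram/bigram/unigram count dictionaries; the λ and μ values here are placeholders that would be tuned on held-out data:

    # Simple linear interpolation of trigram, bigram, and unigram ML estimates (sketch).
    def interpolated(z, x, y, tri, bi, uni, total, lam=0.6, mu=0.3):
        """P(z|xy) = lam*C(xyz)/C(xy) + mu*C(yz)/C(y) + (1-lam-mu)*C(z)/total.

        tri, bi, uni are count dictionaries keyed by tuples/words; total is the
        corpus size.  The .get(..., 1) fallbacks just avoid division by zero.
        """
        p_tri = tri.get((x, y, z), 0) / bi.get((x, y), 1)
        p_bi = bi.get((y, z), 0) / uni.get(y, 1)
        p_uni = uni.get(z, 0) / total
        return lam * p_tri + mu * p_bi + (1 - lam - mu) * p_uni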


Smoothing: Simple Interpolation

• Split data into training, "heldout", and test sets
• Try lots of different values for λ on the heldout data, pick the best
• Test on test data
• Sometimes, can use tricks like "EM" (expectation maximization) to find the values
• I prefer to use a generalized search algorithm, "Powell search" – see Numerical Recipes in C


Smoothing: Simple Interpolation

• How much data for training, heldout, test?• Some people say things like “1/3, 1/3, 1/3” or “80%, 10%,

10%” They are WRONG• Heldout should have (at least) 100-1000 words per

parameter.• Answer: enough test data to be statistically significant. (1000s

of words perhaps)


Smoothing: Simple Interpolation

• Be careful: WSJ data is divided into stories. Some are easy, with lots of numbers, financial; others are much harder. Use enough to cover many stories.
• Be careful: some stories are repeated in the data sets.
• Can take data from the end – better – or randomly from within training. Watch for temporal effects like "Swine flu".


Smoothing: Jelinek-Mercer

• Simple interpolation:

$P_{smooth}(z \mid xy) = \lambda \frac{C(xyz)}{C(xy)} + (1 - \lambda)\, P_{smooth}(z \mid y)$

• Better: smooth a little after "The Dow", lots after "Adobe acquired"

$P_{smooth}(z \mid xy) = \lambda(C(xy)) \frac{C(xyz)}{C(xy)} + \big(1 - \lambda(C(xy))\big)\, P_{smooth}(z \mid y)$


Smoothing: Jelinek-Mercer

• Put λs into buckets by count
• Find λs by cross-validation on held-out data
• Also called "deleted interpolation"

$P_{smooth}(z \mid xy) = \lambda(C(xy)) \frac{C(xyz)}{C(xy)} + \big(1 - \lambda(C(xy))\big)\, P_{smooth}(z \mid y)$


Smoothing: Katz

• Compute the discount using the "Good-Turing" estimate
• Only use the bigram if the trigram is missing
• Works pretty well, except not good for 1 counts
• α is calculated so probabilities sum to 1

$P_{Katz}(z \mid xy) = \begin{cases} \dfrac{C(xyz) - D(C(xyz))}{C(xy)} & \text{if } C(xyz) > 0 \\ \alpha(xy)\, P_{Katz}(z \mid y) & \text{otherwise} \end{cases}$


Smoothing: Interpolated Absolute Discount

• JM and Simple Interpolation overdiscount large counts, underdiscount small counts
• "San Francisco" occurs 100 times, "San Alex" once; should we use a big discount or a small one?
  – Absolute discounting takes the same amount from everyone

Jelinek-Mercer: $\lambda(C(xy)) \frac{C(xyz)}{C(xy)} + \big(1 - \lambda(C(xy))\big)\, P_{smooth}(z \mid y)$

Interpolated absolute discounting: $P_{absinterp}(z \mid xy) = \frac{C(xyz) - D}{C(xy)} + \alpha(xy)\, P_{absinterp}(z \mid y)$


Smoothing: Interpolated Multiple Absolute Discounts

• One discount is good
• Different discounts for different counts are better
• Multiple discounts: one for 1 counts, one for 2 counts, one for >2 counts

Single discount: $\frac{C(xyz) - D}{C(xy)} + \alpha(xy)\, P_{absinterp}(z \mid y)$

Count-dependent discount: $\frac{C(xyz) - D(C(xyz))}{C(xy)} + \alpha(xy)\, P_{absinterp}(z \mid y)$


Smoothing: Kneser-Ney

P(Francisco | eggplant) vs. P(stew | eggplant)
• "Francisco" is common, so backoff and interpolated methods say it is likely
• But it only occurs in the context of "San"
• "Stew" is common, and occurs in many contexts
• Weight the backoff distribution by the number of contexts the word occurs in


Smoothing: Kneser-Ney

• Interpolated absolute discount
• Modified backoff distribution
• Consistently the best technique

$P_{KN}(z \mid xy) = \frac{C(xyz) - D}{C(xy)} + \lambda(xy) \frac{\left|\{w : C(wyz) > 0\}\right|}{\sum_v \left|\{w : C(wyv) > 0\}\right|}$
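A sketch of interpolated Kneser-Ney for bigrams (the trigram formula above is analogous); the fixed discount D and the toy corpus are simplifications for illustration:

    # Interpolated Kneser-Ney bigram sketch (fixed discount, illustrative only).
    from collections import Counter, defaultdict

    def train_kn(corpus, D=0.75):
        bigrams = Counter()
        history_totals = Counter()        # C(y): how often y occurs as a history
        continuations = defaultdict(set)  # histories w with C(w z) > 0, keyed by z
        followers = defaultdict(set)      # words z with C(y z) > 0, keyed by y
        for sent in corpus:
            words = ["<s>"] + sent
            for y, z in zip(words, words[1:]):
                bigrams[(y, z)] += 1
                history_totals[y] += 1
                continuations[z].add(y)
                followers[y].add(z)
        bigram_types = sum(len(s) for s in continuations.values())

        def p_kn(z, y):
            """P_KN(z|y) = max(C(yz)-D,0)/C(y) + lambda(y) * P_continuation(z)."""
            p_cont = len(continuations[z]) / bigram_types
            c_y = history_totals[y]
            if c_y == 0:                  # unseen history: fall back to continuation prob.
                return p_cont
            discounted = max(bigrams[(y, z)] - D, 0) / c_y
            lam = D * len(followers[y]) / c_y
            return discounted + lam * p_cont

        return p_kn

    p = train_kn([["san", "francisco"], ["san", "francisco"],
                  ["beef", "stew"], ["chicken", "stew"]])
    print(p("stew", "eggplant") > p("francisco", "eggplant"))  # True: "stew" has more contexts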


Smoothing: Chart


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


Caching

• If you say something, you are likely to say it again later
• Interpolate the trigram with a cache model:

$P(z \mid history) = \lambda\, P_{smooth}(z \mid xy) + (1 - \lambda)\, P_{cache}(z \mid history)$

$P_{cache}(z \mid history) = \frac{C(z \in history)}{length(history)}$
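A sketch of the cache interpolation; p_smooth is assumed to be whatever smoothed trigram model you already have:

    # Cache-interpolated trigram sketch; p_smooth(z, x, y) is assumed to exist elsewhere.
    def p_with_cache(z, x, y, history, p_smooth, lam=0.9):
        """P(z|history) = lam * P_smooth(z|xy) + (1-lam) * C(z in history)/len(history)."""
        p_cache = history.count(z) / len(history) if history else 0.0
        return lam * p_smooth(z, x, y) + (1 - lam) * p_cache

    # Example with a dummy uniform "smoothed" model over a 10,000-word vocabulary.
    uniform = lambda z, x, y: 1 / 10000
    print(p_with_cache("truth", "tell", "the", "i swear to tell the truth".split(), uniform))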


Caching: Real Life

• Someone says "I swear to tell the truth"
• System hears "I swerve to smell the soup"
• Cache remembers!
• Person says "The whole truth", and, with the cache, the system hears "The whole soup." – errors are locked in.

• Caching works well when users correct as they go, poorly or even hurts without correction.


Cache Results

[Figure: perplexity reduction (0-40%) vs. training data size (100,000; 1,000,000; 10,000,000; all words) for several cache types: unigram, bigram, trigram, unigram + conditional trigram, and unigram + conditional bigram + conditional trigram caches]


5-grams

• Why stop at 3-grams?
• If P(z | …rstuvwxy) ≈ P(z | xy) is good, then P(z | …rstuvwxy) ≈ P(z | vwxy) is better!
• Very important to smooth well
• Interpolated Kneser-Ney works much better than Katz on 5-grams, more so than on 3-grams


N-gram versus smoothing algorithm

[Figure: entropy vs. n-gram order (1 to 20) for Katz and Kneser-Ney smoothing, at training sizes of 100,000; 1,000,000; 10,000,000; and all words]


Speech recognizer mechanics

• Keep many hypotheses alive

• Find acoustic and language model scores
  – P(acoustics | truth) = .3, P(truth | tell the) = .1
  – P(acoustics | soup) = .2, P(soup | smell the) = .01

“…tell the” (.01)    →  “…tell the truth” (.01 x .3 x .1)
“…smell the” (.01)   →  “…smell the soup” (.01 x .2 x .01)


Speech recognizer slowdowns

• Speech recognizer uses tricks (dynamic programming) to merge hypotheses

Trigram: hypotheses that share the same short history merge, e.g.
  “…tell the”
  “…smell the”

Fivegram: far fewer hypotheses merge, e.g.
  “…swear to tell the”
  “…swerve to smell the”
  “…swear too tell the”
  “…swerve too smell the”
  “…swerve to tell the”
  “…swerve too tell the”
  …


Speech recognizer vs. n-gram

• Recognizer can threshold out bad hypotheses
• Trigram works so much better than bigram: better thresholding, no slow-down
• 4-grams, 5-grams start to become expensive


Speech recognizer with language model

• In theory,

$\arg\max_{wordsequence} P(acoustics \mid wordsequence)\, P(wordsequence)$

• In practice, the language model is a better predictor -- acoustic probabilities aren't "real" probabilities

• In practice, penalize insertions

$\arg\max_{wordsequence} P(acoustics \mid wordsequence)\, P(wordsequence)^{8}\, .1^{\,length(wordsequence)}$


Skipping

• P(z | …rstuvwxy) ≈ P(z | vwxy)
• Why not P(z | v_xy) – a "skipping" n-gram – which skips the value of the 3-back word?
• Example: P(time | show John a good) → P(time | show ____ a good)
• P(z | …rstuvwxy) ≈ λ P(z | vwxy) + μ P(z | vw_y) + (1 - λ - μ) P(z | v_xy)
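A purely schematic sketch of the skipping interpolation, where None marks the skipped slot; the count-table layout and the weights are assumptions:

    # Skipping 5-gram sketch: interpolate the contexts vwxy, vw_y, and v_xy (schematic).
    def p_skip(z, v, w, x, y, counts, lam=0.5, mu=0.3):
        """P(z|...vwxy) ~ lam*P(z|vwxy) + mu*P(z|vw_y) + (1-lam-mu)*P(z|v_xy).

        counts maps a context tuple (None = skipped slot) to its total, and the
        same tuple extended by z to the joint count.
        """
        def ml(context):
            denom = counts.get(context, 0)
            return counts.get(context + (z,), 0) / denom if denom else 0.0
        return (lam * ml((v, w, x, y))
                + mu * ml((v, w, None, y))
                + (1 - lam - mu) * ml((v, None, x, y)))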


5-gram Skipping Results

[Figure: perplexity reduction (0-7%) vs. training size (10,000 to 1,000,000,000 words) for 5-gram skipping variants (vw_y, v_xy, vwx_) and rearranging variants (vwyx, vxyw, wxyv, vywx, yvwx, xvwy, wvxy), singly and combined]

(Best trigram skipping result: 11% reduction)


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


Clustering

• CLUSTERING = CLASSES (same thing)
• What is P(Tuesday | party on)?
• Similar to P(Monday | party on)
• Similar to P(Tuesday | celebration on)
• Put words in clusters:
  – WEEKDAY = Sunday, Monday, Tuesday, …
  – EVENT = party, celebration, birthday, …


Clustering overview

• Major topic, useful in many fields
• Kinds of clustering
  – Predictive clustering
  – Conditional clustering
  – IBM-style clustering
• How to get clusters
  – Be clever or it takes forever!


Predictive clustering

• Let "z" be a word, "Z" be its cluster
• One cluster per word: hard clustering
  – WEEKDAY = Sunday, Monday, Tuesday, …
  – MONTH = January, February, April, May, June, …
• P(z|xy) = P(Z|xy) P(z|xyZ)
• P(Tuesday | party on) = P(WEEKDAY | party on) P(Tuesday | party on WEEKDAY)
• Psmooth(z|xy) ≈ Psmooth(Z|xy) Psmooth(z|xyZ)


Predictive clustering example

Find P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party on) Psmooth(Tuesday | party on WEEKDAY)

C(party on Tuesday) = 0
C(party on Wednesday) = 10
C(arriving on Tuesday) = 10
C(on Tuesday) = 100

Psmooth(WEEKDAY | party on) is high

Psmooth(Tuesday | party on WEEKDAY) backs off to Psmooth(Tuesday | on WEEKDAY)
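A sketch of the predictive-clustering decomposition P(z|xy) ≈ P(Z|xy) · P(z|xyZ); the hand-made cluster map and the two smoothed component models are assumed:

    # Predictive clustering sketch: P(z|xy) = P(cluster(z)|xy) * P(z|xy, cluster(z)).
    CLUSTER = {"monday": "WEEKDAY", "tuesday": "WEEKDAY",
               "party": "EVENT", "celebration": "EVENT"}   # toy hard clustering

    def p_predictive(z, x, y, p_cluster_given_ctx, p_word_given_ctx_cluster):
        """Both component models are assumed to be smoothed elsewhere."""
        Z = CLUSTER.get(z, "OTHER")
        return p_cluster_given_ctx(Z, x, y) * p_word_given_ctx_cluster(z, x, y, Z)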


Cluster Results

[Figure: perplexity reduction (about -20% to +20%) vs. training size (100,000; 1,000,000; 10,000,000 words) for clustering schemes – Predict, IBM, Full IBM Predict, All Combine – relative to a Kneser-Ney trigram]


Clustering: how to get them

• Build them by hand
  – Works OK when there is almost no data
• Part of Speech (POS) tags
  – Tends not to work as well as automatic clustering
• Automatic clustering
  – Swap words between clusters to minimize perplexity


Clustering: automatic

• Minimize the perplexity of P(z|Y)
• Mathematical tricks speed it up
• Use top-down splitting, not bottom-up merging!


Two actual WSJ classes

• MONDAYS • FRIDAYS • THURSDAY • MONDAY • EURODOLLARS • SATURDAY• WEDNESDAY• FRIDAY• TENTERHOOKS • TUESDAY • SUNDAY• CONDITION

• PARTY• FESCO• CULT• NILSON • PETA • CAMPAIGN • WESTPAC • FORCE • CONRAN • DEPARTMENT • PENH• GUILD


Sentence Mixture Models

• Lots of different sentence types:
  – Numbers (The Dow rose one hundred seventy three points)
  – Quotations (Officials said "quote we deny all wrong doing "quote)
  – Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge)

• Model each sentence type separately


Sentence Mixture Models

• Roll a die to pick the sentence type sk, with probability λk
• Probability of the sentence, given sk:

$\prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1}, s_k)$

• Probability of the sentence across types:

$\sum_{k=1}^{m} \lambda_k \prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1}, s_k)$


Sentence Model Smoothing

• Each topic model is smoothed with the overall model
• The sentence mixture model is smoothed with the overall model (sentence type 0)

$\sum_{k=0}^{m} \lambda_k \prod_{i=1}^{n} \Big[ \mu_k P(w_i \mid w_{i-2} w_{i-1}, s_k) + (1 - \mu_k) P(w_i \mid w_{i-2} w_{i-1}) \Big]$


Sentence Mixture Results

[Figure: perplexity reduction (0-20%) vs. number of sentence types (1 to 128), for 3-gram and 5-gram models at training sizes of 100,000; 1,000,000; 10,000,000; and all words]


Sentence Clustering

• Same algorithm as word clustering

• Assign each sentence to a type, sk

• Minimize perplexity of P(z|sk ) instead of P(z|Y)


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


Structured Language Model

“The contract ended with a loss of 7 cents after”

Thanks to Ciprian Chelba for this figure


How to get structured data?

• Use a Treebank (a collection of sentences with structure hand-annotated), like the Wall Street Journal Penn Treebank
• Problem: you need a treebank
• Or – use a treebank (WSJ) to train a parser, then parse new training data (e.g. Broadcast News)
• Re-estimate parameters to get lower-perplexity models


Parsing vs. Trigram: Eugene Charniak's Experiments

All experiments are trained on one million words of Penn treebank data, and tested on 80,000 words.

Model                            Perplexity
Trigram, poor smoothing          167
Trigram, deleted interpolation   155
Trigram, Kneser-Ney              145
Parsing                          119   (an 18% reduction over Kneser-Ney)

Thanks to Eugene Charniak for this slide


Structured Language Models

• Promising results
• But: time consuming; language is right branching; 5-grams and skipping capture similar information
• Interesting applications to parsing
  – Combines nicely with parsing MT systems


N-best lists

• Make a list of the 100 best translation hypotheses using a simple bigram or trigram
• Rescore using any model you want
  – Cheaply apply complex models
  – Perform source (language model) research separately from the channel model
• For long, complex sentences, you need exponentially many more hypotheses


Lattices for MT

• Compact version of an n-best list

From Ueffing, Och and Ney, EMNLP '02


Tools: CMU Language Modeling Toolkit

• Can handle bigrams, trigrams, and more
• Can handle different smoothing schemes
• Many separate tools – the output of one tool is the input to the next: easy to use
• Free for research purposes
• http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html


Tools: SRI Language Modeling Toolkit

• More powerful than the CMU toolkit
• Can handle clusters, lattices, n-best lists, hidden tags
• Free for research use
• http://www.speech.sri.com/projects/srilm


Small enough

• Real language models are often huge
• 5-gram models are typically larger than the training data
• Use count cutoffs (eliminate parameters with fewer counts) or, better,
• Use Stolcke pruning – finds counts that contribute least to the perplexity reduction
  – P(City | New York) ≈ P(City | York), so this trigram can be pruned
  – P(Friday | God it's) is not well approximated by P(Friday | it's), so keep it
• Remember, Kneser-Ney helped most when there were lots of 1 counts


Combining Data

• Often, you have some "in domain" data and some "out of domain" data
• Example: Microsoft is working on translating computer manuals
  – Only about 3 million words of Brazilian computer manuals
• Can combine the computer manual data with hundreds of millions of words of other data
  – Newspapers, web, encyclopedias, usenet…


How to combine

• Just concatenate – add them all together
  – Bad idea – you need to weight the "in domain" data more heavily
• Take the out-of-domain data plus multiple copies of the in-domain data (weight the counts)
  – Bad idea – doesn't work well, and messes up most smoothing techniques


How to combine

• A good way: take a weighted average, e.g.

λ Pmanuals(z|xy) + μ Pweb(z|xy) + (1 - λ - μ) Pnewspaper(z|xy)

• Can apply to channel models too (e.g. combine Hansard with computer manuals for French translation)
• Lots of research in other techniques
  – Maxent-inspired models, non-linear interpolation (log domain), cluster models, etc. Minimal improvement (but see work by Rukmini Iyer)
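A sketch of the weighted average; the three component models are assumed to exist, and λ and μ would be tuned on in-domain held-out data (the defaults below are placeholders):

    # Linear interpolation of domain-specific language models (weights are placeholders).
    def p_combined(z, x, y, p_manuals, p_web, p_newspaper, lam=0.5, mu=0.3):
        """P(z|xy) = lam*P_manuals(z|xy) + mu*P_web(z|xy) + (1-lam-mu)*P_newspaper(z|xy)."""
        return (lam * p_manuals(z, x, y)
                + mu * p_web(z, x, y)
                + (1 - lam - mu) * p_newspaper(z, x, y))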


Other Language Model Uses

• Handwriting Recognition
  – P(observed ink | words) P(words)
• Telephone Keypad input
  – P(numbers | words) P(words)
• Spelling Correction
  – P(observed keys | words) P(words)
• Chinese/Japanese text entry
  – P(phonetic representation | characters) P(characters)

In each case, the language model supplies the prior P(words) (or P(characters)).


Some Experiments

• Joshua Goodman re-implemented almost all of these techniques
• Trained on 260,000,000 words of WSJ
• Optimized parameters on heldout data
• Tested on a separate test section
• Some combinations are extremely time-consuming (days of CPU time)
  – Don't try this at home, or in anything you want to ship
• Rescored N-best lists to get results
  – Maximum possible improvement from 10% word error rate absolute to 5%


Overall Results: Perplexity

[Figure: scatter plot of perplexity (roughly 70-115) vs. word error rate (roughly 8.8-10%) for Katz, Kneser-Ney (KN), and their variants: Katz/KN 5-gram, skip, sentence, cluster, plus all-cache, all-cache-5gram, all-cache-cluster, all-cache-skip, all-cache-sentence, and all-cache-KN]


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


LM types

Language models used in speech recognition can be classified into the following categories:

• Uniform models: the chance that a word occurs is 1/V, where V is the size of the vocabulary
• Finite state machines
• Grammar models: they use context-free grammars
• Stochastic models: they determine the chance of a word based on its preceding words (e.g. n-grams)


CFG

A grammar is defined by G = (V, T, P, S), where:

V contains the set of all non-terminal symbols.
T contains the set of all terminal symbols.
P is a set of production rules.
S is a special symbol called the start symbol.

Example rules:
S -> NP VP
VP -> VERB NP
NP -> NOUN
NP -> NAME
NOUN -> speech
NAME -> Julie | Ethan
VERB -> loves | chases


CFG

Parsing

• Bottom up: you start with the input sentence and try to reach the start symbol
• Top down: you start with the start symbol and try to reach the input sentence by applying the appropriate rules. Left recursion is a problem (A -> Aa)

Advantage of bottom up: "What is the weather forecast for this afternoon?"

A lot of parsing algorithms are available from computer science.

Problem: people don't follow the rules of grammar strictly, especially in spoken language. Creating a grammar that covers all these constructions is unfeasible.


Probabilistic CFG

A mixture between formal language and probabilistic models is the PCFG.

If there are m rules with the non-terminal A on their left-hand side,

$A \to \alpha_1,\ A \to \alpha_2,\ \ldots,\ A \to \alpha_m$

then the probability of rule j is

$P(A \to \alpha_j \mid G) = C(A \to \alpha_j) \Big/ \sum_{i=1}^{m} C(A \to \alpha_i)$

where C denotes the number of times each rule is used.
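A sketch of this maximum-likelihood rule-probability estimate, from counts of how often each rule is used in a treebank (the counts below are invented):

    # PCFG rule probabilities: P(A -> alpha_j) = C(A -> alpha_j) / sum_i C(A -> alpha_i).
    from collections import defaultdict

    rule_counts = {("NP", ("NOUN",)): 30,
                   ("NP", ("NAME",)): 10,
                   ("NP", ("DET", "NOUN")): 60}     # invented treebank counts

    def rule_probs(counts):
        totals = defaultdict(int)
        for (lhs, _), c in counts.items():
            totals[lhs] += c
        return {rule: c / totals[rule[0]] for rule, c in counts.items()}

    print(rule_probs(rule_counts)[("NP", ("NOUN",))])  # 0.3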


Outline

• Prob theory intro • Text prediction • Intro to LM • Perplexity • Smoothing • Caching • Clustering • Parsing • CFG • Homework


Homework

• Write a program to estimate the Entropy of written text (Shannon, 1950)

• Input: a text document (you pick it; the larger the better)
• Write a program that predicts the next letter given the past letters, on a different text (make it interactive?)
  – Hint: use character n-grams
• Check that it does a perfect job on the training text
• Due 5/21