ECE 598: The Speech Chain Lecture 12: Information Theory.


Transcript of ECE 598: The Speech Chain Lecture 12: Information Theory.

Page 1: ECE 598: The Speech Chain Lecture 12: Information Theory.

ECE 598: The Speech Chain

Lecture 12: Information Theory

Page 2: ECE 598: The Speech Chain Lecture 12: Information Theory.

Today

Information
  Speech as Communication
  Shannon’s Measurement of Information

Entropy
  Entropy = Average Information
  “Complexity” or “sophistication” of a text

Conditional Entropy
  Conditional Entropy = Average Conditional Information
  Example: Confusion Matrix, Articulation Testing
  Conditional Entropy vs. SNR

Channel Capacity
  Mutual Information = Entropy – Conditional Entropy
  Channel Capacity = max{Mutual Information}

Finite State Language Models
  Grammars
  Regular Grammar = Finite State Grammar = Markov Grammar
  Entropy of a Finite State Grammar

N-Gram Language Models
  Maximum-Likelihood Estimation
  Cross-Entropy of Text given its N-Gram

Page 3: ECE 598: The Speech Chain Lecture 12: Information Theory.

Information

Page 4: ECE 598: The Speech Chain Lecture 12: Information Theory.

Speech as Communication

w_n, ŵ_n = words selected from vocabulary V. The size of the vocabulary, |V|, is assumed to be finite.

No human language has a truly finite vocabulary! For a more accurate analysis, we should do a phoneme-by-phoneme analysis, i.e., w_n, ŵ_n = phonemes in language V.

If |V| is finite, then we can define p(w_n|h_n):

h_n = all relevant history, including
  previous words of the same utterance, w_1,…,w_{n-1}
  dialog history (what did the other talker just say?)
  shared knowledge, e.g., physical knowledge, cultural knowledge

0 ≤ p(w_n|h_n) ≤ 1
Σ_{w_n in V} p(w_n|h_n) = 1

[Diagram: “I said these words: […,w_n,…]” → Speech (Acoustic Signal) + Acoustic Noise, Babble, Reverberation, … → Noisy Speech → “I heard these words: […,ŵ_n,…]”]

Page 5: ECE 598: The Speech Chain Lecture 12: Information Theory.

Shannon’s Criteria for a Measure of “Information”

Information should be…

  Non-negative: I(w_n|h_n) ≥ 0

  Zero if a word is perfectly predictable from its history: I(w_n|h_n) = 0 if and only if p(w_n|h_n) = 1

  Large for unexpected words: I(w_n|h_n) is large if p(w_n|h_n) is small

  Additive: I(w_{n-1},w_n) = I(w_{n-1}) + I(w_n)

Page 6: ECE 598: The Speech Chain Lecture 12: Information Theory.

Shannon’s Measure of Information

All of the criteria are satisfied by the following definition of information:

I(w_n) = log_a(1/p(w_n|h_n)) = -log_a p(w_n|h_n)

Page 7: ECE 598: The Speech Chain Lecture 12: Information Theory.

Information

Information is…

Non-negative:
  p(w_n|h_n) < 1, so log_a p(w_n|h_n) < 0, so I(w_n|h_n) > 0

Zero if a word is perfectly predictable:
  p(w_n|h_n) = 1, so log_a p(w_n|h_n) = 0, so I(w_n|h_n) = 0

Large if a word is unpredictable:
  I(w_n) = -log_a p(w_n|h_n) is large if p(w_n|h_n) is small

Additive:
  p(w_{n-1},w_n) = p(w_{n-1}) p(w_n)
  log_a p(w_{n-1},w_n) = log_a p(w_{n-1}) + log_a p(w_n)
  I(w_{n-1},w_n) = I(w_{n-1}) + I(w_n)

Page 8: ECE 598: The Speech Chain Lecture 12: Information Theory.

Bits, Nats, and Digits

Consider a string of random coin tosses, “HTHHHTTHTHHTTT”:
  p(w_n|h_n) = ½
  -log2 p(w_n|h_n) = 1 bit of information/symbol
  -ln p(w_n|h_n) = 0.69 nats/bit
  -log10 p(w_n|h_n) = 0.3 digits/bit

Consider a random string of digits, “49873417”:
  p(w_n|h_n) = 1/10
  -log10 p(w_n|h_n) = 1 digit/symbol
  -log2 p(w_n|h_n) = 3.32 bits/digit
  -ln p(w_n|h_n) = 2.3 nats/digit

Unless otherwise specified, information is usually measured in bits:

I(w|h) = -log2 p(w|h)
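As a quick illustration (not part of the original slides), here is a minimal Python sketch of self-information in different logarithm bases; the helper name `information` is just an illustrative choice.

```python
import math

def information(p, base=2.0):
    """Self-information -log_base(p) of an outcome with probability p."""
    return -math.log(p, base)

# Fair coin toss: 1 bit, ~0.69 nats, ~0.30 digits per symbol
print(information(0.5, 2), information(0.5, math.e), information(0.5, 10))

# Random decimal digit: 1 digit, ~3.32 bits, ~2.3 nats per symbol
print(information(0.1, 10), information(0.1, 2), information(0.1, math.e))
```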

Page 9: ECE 598: The Speech Chain Lecture 12: Information Theory.

Entropy

Page 10: ECE 598: The Speech Chain Lecture 12: Information Theory.

Entropy = Average Information

How unpredictable is the next word? Entropy = Average Unpredictability

H(p) = Σ_w p(w) I(w)

H(p) = -Σ_w p(w) log2 p(w)

Notice that entropy is not a function of the word, w. It is a function of the probability distribution, p(w).

Page 11: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Uniform Source

Entropy of a coin toss:

H(p) = -p(“heads”) log2 p(“heads”) - p(“tails”) log2 p(“tails”)
     = -0.5 log2(0.5) - 0.5 log2(0.5)
     = 1 bit/symbol

Entropy of a uniform source with N different words: p(w) = 1/N

H(p) = -Σ_{w=1}^{N} p(w) log2 p(w) = log2 N

In general: if all words are equally likely, then the average unpredictability (“entropy”) is equal to the unpredictability of any particular word (the “information” conveyed by that word): log2 N bits.

Page 12: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Non-Uniform Source

Consider the toss of a weighted coin, with the following probabilities:
  p(“heads”) = 0.6, p(“tails”) = 0.4

Information communicated by each word:
  I(“heads”) = -log2 0.6 = 0.74 bits
  I(“tails”) = -log2 0.4 = 1.3 bits

Entropy = average information:
  H(p) = -0.6 log2(0.6) - 0.4 log2(0.4) = 0.97 bits/symbol on average

The entropy of a non-uniform source is always less than the entropy of a uniform source with the same vocabulary.
  Entropy of a uniform source, N-word vocabulary, is log2 N bits.
  Information conveyed by a likely word is less than log2 N bits.
  Information conveyed by an unlikely word is more than log2 N bits --- but that word is unlikely to occur!
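A small sketch (my own illustration, not from the slides) that computes H(p) = -Σ_w p(w) log2 p(w) for the coins discussed here; the helper name `entropy` is hypothetical.

```python
import math

def entropy(probs):
    """Entropy in bits: H(p) = -sum_w p(w) log2 p(w); terms with p(w)=0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit/symbol
print(entropy([0.6, 0.4]))   # weighted coin: ~0.97 bits/symbol
print(entropy([1.0, 0.0]))   # two-headed coin: 0 bits/symbol
```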

Page 13: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Deterministic Source

Consider the toss of a two-headed coin, with the following probabilities:
  p(“heads”) = 1.0, p(“tails”) = 0.0

Information communicated by each word:
  I(“heads”) = -log2 1.0 = 0 bits
  I(“tails”) = -log2 0.0 = infinite bits!!

Entropy = average information:
  H(p) = -1.0 log2(1.0) - 0.0 log2(0.0) = 0 bits/symbol on average
  (the term 0 · log2(0) is taken to be 0, since the impossible word never occurs)

If you know in advance what each word will be, then you gain no information by listening to the message. The “entropy” (average information per symbol) is zero.

Page 14: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Textual Complexity

“Twas brillig, and the slithy toves
did gyre and gimble in the wabe…”

p(w) = 2/13 for w = “and”, w = “the”
p(w) = 1/13 for the other 9 words

H(p) = -Σ_w p(w) log2 p(w)
     = -2(2/13) log2(2/13) - 9(1/13) log2(1/13)
     = 3.4 bits/word
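The same computation can be run directly on word counts; this sketch (mine, not the lecture’s) reproduces the ~3.4 bits/word figure for the Jabberwocky lines.

```python
import math
from collections import Counter

def text_entropy(words):
    """Unigram entropy of a text, in bits/word."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "twas brillig and the slithy toves did gyre and gimble in the wabe".split()
print(text_entropy(text))   # ~3.4 bits/word
```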

Page 15: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Textual Complexity

“How much wood would a woodchuck chuck if a woodchuck could chuck wood?”

p(w) = 2/13 for “wood, a, woodchuck, chuck”
p(w) = 1/13 for the other 5 words

H(p) = -Σ_w p(w) log2 p(w)
     = -4(2/13) log2(2/13) - 5(1/13) log2(1/13)
     = 3.0 bits/word

Page 16: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: The Speech Channel

“How much wood wood a wood chuck chuck if a wood chuck could chuck wood?”

p(w) = 5/15 for w = “wood”
p(w) = 4/15 for w = “chuck”
p(w) = 2/15 for w = “a”
p(w) = 1/15 for “how, much, if, could”

H(p) = -Σ_w p(w) log2 p(w) = 2.5 bits/word

Page 17: ECE 598: The Speech Chain Lecture 12: Information Theory.

Conditional Entropy

(Equivocation)

Page 18: ECE 598: The Speech Chain Lecture 12: Information Theory.

Conditional Entropy = Average Conditional Information

Suppose that p(w|h) is “conditional” upon some history variable h. Then the information provided by w is

  I(w|h) = -log2 p(w|h)

Suppose that we also know the probability distribution of the history variable, p(h). The joint probability of w and h is

  p(w,h) = p(w|h) p(h)

The average information provided by any word, w, averaged across all possible histories, is the “conditional entropy” H(p(w|h)):

  H(p(w|h)) = Σ_w Σ_h p(w,h) I(w|h)
            = -Σ_h p(h) Σ_w p(w|h) log2 p(w|h)
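A sketch of the conditional-entropy formula H(p(w|h)) = -Σ_h p(h) Σ_w p(w|h) log2 p(w|h), using made-up numbers purely for illustration (the helper name and the two-history example are my own).

```python
import math

def conditional_entropy(p_h, p_w_given_h):
    """H(p(w|h)) = -sum_h p(h) sum_w p(w|h) log2 p(w|h), in bits."""
    H = 0.0
    for h, ph in p_h.items():
        for w, pwh in p_w_given_h[h].items():
            if pwh > 0:
                H -= ph * pwh * math.log2(pwh)
    return H

# Hypothetical two-history example
p_h = {"h1": 0.5, "h2": 0.5}
p_w_given_h = {"h1": {"yes": 0.9, "no": 0.1},
               "h2": {"yes": 0.5, "no": 0.5}}
print(conditional_entropy(p_h, p_w_given_h))  # average of ~0.47 and 1.0 bits
```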

Page 19: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Communication

Suppose w, ŵ are not always the same… but we can estimate the probability p(ŵ|w).

Entropy of the source is
  H(p(w)) = -Σ_w p(w) log2 p(w)

Conditional entropy of the received message is
  H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log2 p(ŵ|w)

The conditional entropy of the received message given the transmitted message is called the Equivocation.

[Diagram: “I said these words: […,w_n,…]” → Speech (Acoustic Signal) + Acoustic Noise, Babble, Reverberation, … → Noisy Speech → “I heard these words: […,ŵ_n,…]”]

Page 20: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Articulation Testing

The “caller” reads a list of nonsense syllables.
  Miller and Nicely, 1955: only one consonant per utterance is randomized.
  Fletcher: CVC syllables, all three phonemes are randomly selected.
The “listener” writes down what she hears.
The lists are compared to compute p(ŵ|w).

[Diagram: caller says “a tug,” “a sug,” “a fug”; speech plus acoustic noise, babble, reverberation, …; listener hears “a tug,” “a thug,” “a fug.”]

Page 21: ECE 598: The Speech Chain Lecture 12: Information Theory.

Confusion Matrix: Consonants at -6dB SNR
(Miller and Nicely, 1955)

[Table: counts of listener responses. Rows: consonant called (w); columns: consonant perceived (ŵ). The largest counts fall on the diagonal (correct responses).]

Page 22: ECE 598: The Speech Chain Lecture 12: Information Theory.

Conditional Probabilities p(ŵ|w), 1 Significant Digit
(Miller and Nicely, 1955)

[Table: the confusion counts of the previous slide converted to conditional probabilities p(ŵ|w), rounded to one significant digit. Rows: consonant called (w); columns: consonant perceived (ŵ). Most rows have only a few (roughly four) nonzero entries.]

Page 23: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Articulation Testing, -6dB SNR

At -6dB SNR, p(ŵ|w) is nonzero for about 4 different possible responses.

The equivocation is roughly

  H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log2 p(ŵ|w)
            = -Σ_w (1/18) Σ_ŵ (1/4) log2(1/4)
            = -18 (1/18) · 4 (1/4) log2(1/4)
            = 2 bits
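A sketch of the rough estimate above, assuming (as the slide does) 18 equally likely consonants and roughly four equally likely responses per consonant.

```python
import math

n_consonants = 18            # consonants in the articulation test (slide's count)
responses_per_consonant = 4  # roughly 4 nonzero p(w_hat|w) per row at -6 dB SNR

# H(p(w_hat|w)) = -sum_w (1/18) sum_w_hat (1/4) log2(1/4)
equivocation = -sum(
    (1 / n_consonants)
    * sum((1 / responses_per_consonant) * math.log2(1 / responses_per_consonant)
          for _ in range(responses_per_consonant))
    for _ in range(n_consonants))
print(equivocation)  # 2.0 bits
```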

Page 24: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: Perfect Transmission

At very high signal-to-noise ratio (for humans, SNR > 18dB)…

The listener understands exactly what the talker said:

  p(ŵ|w) = 1 (for ŵ = w) or p(ŵ|w) = 0 (for ŵ ≠ w)

So H(p(ŵ|w)) = 0 --- zero equivocation.

Meaning: If you know exactly what the talker said, then that’s what you’ll write; there is no more randomness left.

Page 25: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: No Transmission

At very low signal-to-noise ratio (for humans, SNR < minus 18dB)…

The listener doesn’t understand anything the talker said:

  p(ŵ|w) = p(ŵ)

The listener has no idea what the talker said, so she has to guess. Her guesses follow the natural pattern of the language: she is more likely to write /s/ or /t/ instead of /θ/ or /ð/.

So H(p(ŵ|w)) = H(p(ŵ)) = H(p(w)) --- the conditional entropy is exactly the same as the original source entropy.

Meaning: If you have no idea what the talker said, then you haven’t learned anything by listening.

Page 26: ECE 598: The Speech Chain Lecture 12: Information Theory.

Equivocation as a Function of SNR

[Figure: equivocation (bits) versus SNR, with marked points at -18dB and 18dB.
Region 1 (below about -18dB): no transmission; random guessing; equivocation = entropy of the language.
Region 2 (between about -18dB and 18dB): equivocation depends on SNR.
Region 3 (above about 18dB): error-free transmission; equivocation = 0.
Dashed line: entropy of the language (e.g., for 18-consonant articulation testing, source entropy H(p(w)) = log2 18 bits).]

Page 27: ECE 598: The Speech Chain Lecture 12: Information Theory.

Mutual Information,

Channel Capacity

Page 28: ECE 598: The Speech Chain Lecture 12: Information Theory.

Definition: Mutual Information

On average, how much information gets correctly transmitted from caller to listener?

“Mutual Information” = Average Information in the Caller’s Message
  …minus…
Conditional Randomness of the Listener’s Perception

I(p(ŵ,w)) = H(p(w)) – H(p(ŵ|w))
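A self-contained sketch of I = H(p(w)) – H(p(ŵ|w)) on a hypothetical two-word noisy channel (the channel numbers are illustrative, not from the slides).

```python
import math

def entropy(probs):
    """H(p) = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def equivocation(p_w, p_what_given_w):
    """H(p(w_hat|w)) = -sum_w p(w) sum_what p(w_hat|w) log2 p(w_hat|w)."""
    return -sum(p_w[w] * p * math.log2(p)
                for w, row in p_what_given_w.items()
                for p in row.values() if p > 0)

# Hypothetical noisy channel over two words
p_w = {"yes": 0.5, "no": 0.5}
p_what_given_w = {"yes": {"yes": 0.9, "no": 0.1},
                  "no":  {"yes": 0.2, "no": 0.8}}

H_source = entropy(p_w.values())
H_cond = equivocation(p_w, p_what_given_w)
mutual_information = H_source - H_cond   # I(p(w_hat, w)) = H(p(w)) - H(p(w_hat|w))
print(H_source, H_cond, mutual_information)
```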

Page 29: ECE 598: The Speech Chain Lecture 12: Information Theory.

Listeners use Context to “Guess” the Message

Consider gradually increasing the complexity of the message in a noisy environment:

1. Caller says “yes, yes, no, yes, no.” Entropy of the message is 1 bit/word; the listener gets enough from lip-reading to correctly guess every word.

2. Caller says “429986734.” Entropy of the message is 3.2 bits/word; the listener can still understand just by lip reading.

3. Caller says “Let’s go scuba diving in Puerto Vallarta this January.”
   The listener effortlessly understands the low-entropy parts of the message (“let’s go”), but
   the high-entropy parts (“scuba diving,” “Puerto Vallarta”) are completely lost in the noise.

Page 30: ECE 598: The Speech Chain Lecture 12: Information Theory.

The Mutual Information Ceiling Effect

[Figure: mutual information transmitted (bits) and equivocation (bits) versus source message entropy (bits).
Region 1: perfect transmission; equivocation = 0; mutual information = source entropy.
Region 2: mutual information is clipped at an SNR-dependent maximum bit rate called the “channel capacity”; equivocation = source entropy – channel capacity.]

Page 31: ECE 598: The Speech Chain Lecture 12: Information Theory.

Definition: Channel Capacity

Capacity of a channel = the maximum number of bits per second that may be transmitted, error-free, through that channel:

  C = max_p I(p(ŵ,w))

The maximum is taken over all possible source distributions, i.e., over all possible H(p(w)).

Page 32: ECE 598: The Speech Chain Lecture 12: Information Theory.

Information Theory Jargon Review

Information = Unpredictability of a word
  I(w) = -log2 p(w)
Entropy = Average information of the words in a message
  H(p(w)) = -Σ_w p(w) log2 p(w)
Conditional Entropy = Average conditional information
  H(p(w|h)) = -Σ_h p(h) Σ_w p(w|h) log2 p(w|h)
Equivocation = Conditional entropy of the received message given the transmitted message
  H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log2 p(ŵ|w)
Mutual Information = Entropy minus Equivocation
  I(p(ŵ,w)) = H(p(w)) – H(p(ŵ|w))
Channel Capacity = Maximum mutual information
  C(SNR) = max_p I(p(ŵ,w))

Page 33: ECE 598: The Speech Chain Lecture 12: Information Theory.

Finite State Language Models

Page 34: ECE 598: The Speech Chain Lecture 12: Information Theory.

Grammar

The discussion so far has ignored the “history,” h_n. How does h_n affect p(w_n|h_n)?

Topics that we won’t discuss today, but that computational linguists are working on:
  Dialog context
  Shared knowledge

Topics that we will discuss:
  Previous words in the same utterance: p(w_n|w_1,…,w_{n-1}) ≠ p(w_n)

A grammar = something that decides whether or not (w_1,…,w_N) is a possible sentence.

A probabilistic grammar = something that calculates p(w_1,…,w_N).

Page 35: ECE 598: The Speech Chain Lecture 12: Information Theory.

Grammar

Definition: A Grammar, G, has four parts: G = { S, N, V, P }

N = a set of “non-terminal” nodes
  Example: N = { Sentence, NP, VP, NOU, VER, DET, ADJ, ADV }
V = a set of “terminal” nodes, a.k.a. a “vocabulary”
  Example: V = { how, much, wood, would, a }
S = the non-terminal node that sits at the top of every parse tree
  Example: S = { Sentence }
P = a set of production rules
  Example (CFG in “Chomsky normal form”):
    Sentence = NP VP
    NP = DET NP
    NP = ADJ NP
    NP = NOU
    NOU = wood
    NOU = woodchuck

Page 36: ECE 598: The Speech Chain Lecture 12: Information Theory.

Types of Grammar

A type 0 (“unrestricted”) grammar can have anything on either side of a production rule.

A type 1 (“context sensitive grammar,” CSG) has rules of the following form:
  <context1> N <context2> = <context1> STUFF <context2>
  …where…
  <context1> and <context2> are arbitrary unchanged contexts,
  N is an arbitrary non-terminal, e.g., “NP,”
  STUFF is an arbitrary sequence of terminals and non-terminals, e.g., “the big ADJ NP” would be an acceptable STUFF.

A type 2 grammar (“context free grammar,” CFG) has rules of the following form:
  N = STUFF

A type 3 grammar (“regular grammar,” RG) has rules of the following form:
  N1 = T1 N2
  N1, N2 are non-terminals
  T1 is a terminal node – a word!
  Acceptable example: NP = the NP
  Unacceptable example: Sentence = NP VP

Page 37: ECE 598: The Speech Chain Lecture 12: Information Theory.

Regular Grammar = Finite State Grammar (Markov Grammar)

Let every non-terminal be a “state.”
Let every production rule be a “transition.”

Example:
  S = a S
  S = woodchuck VP
  VP = could VP
  VP = chuck NP
  NP = how QP
  QP = much NP
  NP = wood

[Diagram: finite-state network with states S, VP, NP, QP, END and transitions labeled a, woodchuck, could, chuck, how, much, wood.]

Page 38: ECE 598: The Speech Chain Lecture 12: Information Theory.

Probabilistic Finite State Grammar

Every production rule has an associated conditional probability: p(production rule | LHS nonterminal).

Example:
  S = a S            0.5
  S = woodchuck VP   0.5
  VP = could VP      0.7
  VP = chuck NP      0.3
  NP = how QP        0.4
  QP = much NP       1.0
  NP = wood          0.6

[Diagram: the same finite-state network, with transition probabilities a (0.5), woodchuck (0.5), could (0.7), chuck (0.3), how (0.4), much (1.0), wood (0.6).]

Page 39: ECE 598: The Speech Chain Lecture 12: Information Theory.

Calculating the Probability of Text

p(“a woodchuck could chuck how much wood”)
  = (0.5)(0.5)(0.7)(0.3)(0.4)(1.0)(0.6) = 0.0126

p(“woodchuck chuck wood”)
  = (0.5)(0.3)(0.6) = 0.09

p(“A woodchuck could chuck how much wood. Woodchuck chuck wood.”)
  = (0.0126)(0.09) = 0.001134

p(some very long text corpus) = p(1st sentence) p(2nd sentence) …

[Diagram: the same probabilistic finite-state network as the previous slide.]
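A sketch of the word-by-word probability calculation for this grammar; the rule table mirrors the probabilities on the slide, and the function name is just an illustrative choice.

```python
# Transitions of the probabilistic finite state grammar on the slide:
# state -> {word: (next_state, probability)}
rules = {
    "S":  {"a": ("S", 0.5), "woodchuck": ("VP", 0.5)},
    "VP": {"could": ("VP", 0.7), "chuck": ("NP", 0.3)},
    "NP": {"how": ("QP", 0.4), "wood": ("END", 0.6)},
    "QP": {"much": ("NP", 1.0)},
}

def sentence_probability(words, start="S"):
    """Multiply the transition probabilities along the path through the PFSG."""
    state, prob = start, 1.0
    for w in words:
        next_state, p = rules[state][w]
        prob *= p
        state = next_state
    return prob

print(sentence_probability("a woodchuck could chuck how much wood".split()))  # 0.0126
print(sentence_probability("woodchuck chuck wood".split()))                   # 0.09
```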

Page 40: ECE 598: The Speech Chain Lecture 12: Information Theory.

Cross-Entropy

Cross-entropy of an N-word text, T, given a language model, G:

H(T|G) = -Σ_{n=1}^{N} p(w_n|T) log2 p(w_n|G,h_n)
       = -(1/N) Σ_{n=1}^{N} log2 p(w_n|G,h_n)

N = # words in the text
p(w_n|T) = (# times w_n occurs)/N
p(w_n|G,h_n) = probability of word w_n given its history h_n, according to language model G

Page 41: ECE 598: The Speech Chain Lecture 12: Information Theory.

Cross-Entropy: Example

T = “A woodchuck could chuck wood.”

H(T|G) = -Σ_{n=1}^{N} p(w_n|T) log2 p(w_n|G,h_n)
       = -(1/N) Σ_{n=1}^{N} log2 p(w_n|G,h_n)
       = -(1/5) { log2(0.5) + log2(0.5) + log2(0.7) + log2(0.3) + log2(0.6) }
       = 4.989/5 = 0.998 bits

Interpretation: language model G assigns an entropy of 0.998 bits to the words in text T. This is a very low cross-entropy: G predicts T very well.

[Diagram: the same probabilistic finite-state network as the previous slides.]
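A sketch of the cross-entropy calculation H(T|G) = -(1/N) Σ log2 p(w_n|G,h_n), fed directly with the per-word probabilities read off the grammar for this text.

```python
import math

# Per-word probabilities p(w_n|G, h_n) for T = "A woodchuck could chuck wood."
# (read off the PFSG: a 0.5, woodchuck 0.5, could 0.7, chuck 0.3, wood 0.6)
word_probs = [0.5, 0.5, 0.7, 0.3, 0.6]

def cross_entropy(probs):
    """H(T|G) = -(1/N) sum_n log2 p(w_n | G, h_n), in bits/word."""
    return -sum(math.log2(p) for p in probs) / len(probs)

print(cross_entropy(word_probs))  # ~0.998 bits/word
```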

Page 42: ECE 598: The Speech Chain Lecture 12: Information Theory.

N-Gram Language Models

Page 43: ECE 598: The Speech Chain Lecture 12: Information Theory.

N-Gram Language Models

An N-gram is just a PFSG (probabilistic finite state grammar) in which each nonterminal is specified by the N-1 most recent terminals.

Definition of an N-gram:

  p(w_n|h_n) = p(w_n|w_{n-N+1},…,w_{n-1})

The most common choices for N are 0, 1, 2, 3:
  Trigram (3-gram): p(w_n|h_n) = p(w_n|w_{n-2},w_{n-1})
  Bigram (2-gram): p(w_n|h_n) = p(w_n|w_{n-1})
  Unigram (1-gram): p(w_n|h_n) = p(w_n)
  0-gram: p(w_n|h_n) = 1/|V|

Page 44: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: A Woodchuck Bigram

T = “How much wood would a woodchuck chuck if a woodchuck could chuck wood”

Nonterminals are labeled by w_{n-1}. Edges are labeled with w_n, and with p(w_n|w_{n-1}).

[Diagram: bigram finite-state network with states 0, how, much, wood, would, a, woodchuck, chuck, if, could, and transitions how (1.0), much (1.0), wood (1.0), would (1.0), a (1.0), woodchuck (1.0), chuck (1.0), wood (0.5), if (0.5), could (0.5), chuck (0.5).]

Page 45: ECE 598: The Speech Chain Lecture 12: Information Theory.

N-Gram: Maximum Likelihood Estimation

An N-gram is defined by its vocabulary V and by the set of conditional probabilities { p(w_n|w_{n-N+1},…,w_{n-1}) }: G consists of V together with these probabilities.

A text is a (long) string of words, T = { w_1,…,w_N }.

The probability of a text given an N-gram is

  p(T|G) = Π_{n=1}^{N} p(w_n|G, w_{n-N+1},…,w_{n-1})

so that

  (1/N) log2 p(T|G) = (1/N) Σ_{n=1}^{N} log2 p(w_n|G, w_{n-N+1},…,w_{n-1}) = -H(T|G)

The “Maximum Likelihood” N-gram model is the set of probabilities that maximizes p(T|G).

Page 46: ECE 598: The Speech Chain Lecture 12: Information Theory.

N-Gram: Maximum Likelihood Estimation

The “Maximum Likelihood” N-gram model is the set of probabilities that maximizes p(T|G). These probabilities are given by:

  p(w_n|w_{n-N+1},…,w_{n-1}) = N(w_{n-N+1},…,w_n) / N(w_{n-N+1},…,w_{n-1})

where
  N(w_{n-N+1},…,w_n) is the “frequency” of the N-gram w_{n-N+1},…,w_n (i.e., the number of times that the N-gram occurs in the data), and
  N(w_{n-N+1},…,w_{n-1}) is the frequency of the (N-1)-gram w_{n-N+1},…,w_{n-1}.

This is the set of probabilities you would have guessed anyway! For example, the woodchuck bigram assigned

  p(w_n|w_{n-1}) = N(w_{n-1},w_n) / N(w_{n-1}).
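A sketch of maximum-likelihood bigram estimation, p(w_n|w_{n-1}) = N(w_{n-1},w_n)/N(w_{n-1}), run on the woodchuck training text; the helper name is illustrative.

```python
from collections import Counter

text = ("how much wood would a woodchuck chuck "
        "if a woodchuck could chuck wood").split()

bigram_counts = Counter(zip(text[:-1], text[1:]))   # N(w_{n-1}, w_n)
history_counts = Counter(text[:-1])                 # N(w_{n-1})

def bigram_prob(prev, word):
    """Maximum-likelihood estimate p(word | prev) = N(prev, word) / N(prev)."""
    return bigram_counts[(prev, word)] / history_counts[prev]

print(bigram_prob("woodchuck", "chuck"))  # 0.5
print(bigram_prob("woodchuck", "could"))  # 0.5
print(bigram_prob("a", "woodchuck"))      # 1.0
```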

Page 47: ECE 598: The Speech Chain Lecture 12: Information Theory.

Cross-Entropy of an N-Gram

Cross-entropy of an N-word text, T, given an N-gram G:

H(T|G) = -Σ_{n=1}^{N} p(w_n|T) log2 p(w_n|G,w_{n-N+1},…,w_{n-1})
       = -(1/N) Σ_{n=1}^{N} log2 p(w_n|G,w_{n-N+1},…,w_{n-1})

N = # words in the text
p(w_n|T) = (# times w_n occurs)/N
p(w_n|G, w_{n-N+1},…,w_{n-1}) = probability of word w_n given its history w_{n-N+1},…,w_{n-1}, according to N-gram language model G

Page 48: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: A Woodchuck Bigram

T = “a woodchuck could chuck wood.”

H(T|G) = -(1/5) { log2(p(“a”)) + log2(1.0) + log2(0.5) + log2(1.0) + log2(0.5) }
       = (2 - log2 p(“a”))/5
       = 0.4 bits/word plus one fifth of the information of the first word

[Diagram: the same bigram finite-state network as the previous slide.]

Page 49: ECE 598: The Speech Chain Lecture 12: Information Theory.

Example: A Woodchuck Bigram

The information of the first word must be set by assumption. Common assumptions include:

  Assume that the first word gives zero information (p(“a”) = 1, -log2(p(“a”)) = 0) --- this focuses attention on the bigram.
  First-word information given by its unigram probability (p(“a”) = 2/13, -log2(p(“a”)) = 2.7 bits) --- this gives a well-normalized entropy.
  First-word information given by its 0-gram probability (p(“a”) = 1/9, -log2(p(“a”)) = 3.2 bits) --- a different well-normalized entropy.

[Diagram: the same bigram finite-state network as the previous slides.]
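A sketch combining the last two slides: the bigram cross-entropy of T = “a woodchuck could chuck wood” under each of the three first-word assumptions (the probability values follow the slides; the variable names are my own).

```python
import math

# Bigram probabilities p(w_n|w_{n-1}) for "a woodchuck could chuck wood",
# read off the woodchuck bigram: woodchuck|a = 1.0, could|woodchuck = 0.5,
# chuck|could = 1.0, wood|chuck = 0.5
bigram_terms = [1.0, 0.5, 1.0, 0.5]

first_word_assumptions = {
    "zero information": 1.0,     # p("a") = 1
    "unigram":          2 / 13,  # p("a") = 2/13
    "0-gram":           1 / 9,   # p("a") = 1/9
}

for name, p_first in first_word_assumptions.items():
    probs = [p_first] + bigram_terms
    H = -sum(math.log2(p) for p in probs) / len(probs)
    print(name, round(H, 2), "bits/word")
# zero information: 0.4; unigram: ~0.94; 0-gram: ~1.03
```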

Page 50: ECE 598: The Speech Chain Lecture 12: Information Theory.

N-Gram: Review

The “Maximum Likelihood” N-gram model is the set of probabilities that maximizes p(T|G). These probabilities are given by:

  p(w_n|w_{n-N+1},…,w_{n-1}) = N(w_{n-N+1},…,w_n) / N(w_{n-N+1},…,w_{n-1})

where
  N(w_{n-N+1},…,w_n) is the “frequency” of the N-gram w_{n-N+1},…,w_n (i.e., the number of times that the N-gram occurs in the data), and
  N(w_{n-N+1},…,w_{n-1}) is the frequency of the (N-1)-gram w_{n-N+1},…,w_{n-1}.

This is the set of probabilities you would have guessed anyway! For example, the woodchuck bigram assigned

  p(w_n|w_{n-1}) = N(w_{n-1},w_n) / N(w_{n-1}).

Page 51: ECE 598: The Speech Chain Lecture 12: Information Theory.

Review

Information = Unpredictability of a word
  I(w) = -log2 p(w)
Entropy = Average information of the words in a message
  H(p(w)) = -Σ_w p(w) log2 p(w)
Conditional Entropy = Average conditional information
  H(p(w|h)) = -Σ_h p(h) Σ_w p(w|h) log2 p(w|h)
Equivocation = Conditional entropy of the received message given the transmitted message
  H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log2 p(ŵ|w)
Mutual Information = Entropy minus Equivocation
  I(p(ŵ,w)) = H(p(w)) – H(p(ŵ|w))
Channel Capacity = Maximum mutual information
  C(SNR) = max_p I(p(ŵ,w))

Cross-Entropy of text T = {w_1,…,w_N} given language model G:
  H(T|G) = -(1/N) Σ_{n=1}^{N} log2 p(w_n|G,h_n)

Maximum Likelihood estimate of an N-gram language model:
  p(w_n|w_{n-N+1},…,w_{n-1}) = N(w_{n-N+1},…,w_n) / N(w_{n-N+1},…,w_{n-1})